Re: [PATCH 2/2] qla2xxx: Fix missed DMA unmap for aborted cmds

From: Chesnokov Gleb <Chesnokov.G@raidix.com>
To: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: [PATCH 2/2] qla2xxx: Fix missed DMA unmap for aborted cmds
Date: Mon, 25 Apr 2022 18:49:47 +0000
Message-ID: <AS8PR10MB495233A032E7425D1DAE9D549DF89@AS8PR10MB4952.EURPRD10.PROD.OUTLOOK.COM>
In-Reply-To: <F1592489-7C94-454B-8EF3-BF5C56F48A10@oracle.com>

>> On Apr 20, 2022, at 7:42 AM, Chesnokov Gleb <Chesnokov.G@raidix.com> wrote:
>> 
>>> Do you have a log showing this error sequence?
>> 
>> Yes, I have, but the problem is that I have a different target stack, not LIO. So the Call Trace basically contains code sequence from this target stack only,
>> except for the call of the qlt_free_cmd() that trigger BUG: BUG_ON(cmd->sg_mapped).
>> Regardless, I think the problem lies on the qlogic driver side, because it is responsible for management to map/unmap sgl list.
>
> Agree. Am curious to understand the test case/steps that would trigger this issue in your env. If you can share your test scenario would be a bit more helpful. 
>>
>> 
>>> Can you share more details?
>> 
>> What I am observing:
>> 
>> 1) Command processing calls qlt_rdy_to_xfer(), maps sgl and sends a command to the firmware
>> 2) Qlogic adapter reset occurs
>> 
>> qla2xxx [0000:82:00.1]-5003:13: ISP System Error - mbx1=110eh mbx2=10h mbx3=dh mbx4=0h mbx5=8a1h mbx6=0h mbx7=0h.
>
> This message indicates there was a firmware crash. Qlogic/Marvell folks should be able to help you capture/save dump. That firmware dump might give you clues on what is the cause of the firmware crash. 
>
>> qla2xxx [0000:82:00.1]-d01e:13: -> fwdump no buffer
>
>> qla2xxx [0000:82:00.1]-00af:13: Performing ISP error recovery - ha=ffff9dd7d6058000.
>> 
>
>> 3) Somehow the command is being aborted, so that means the command's abort flag has already been set.
>> I think it may happens something like this:
>> qla2x00_abort_isp_cleanup() --> qla2x00_abort_all_cmds()
>> 
>
> I think this is the aftereffect of a firmware crash and the driver is just recovering from that. A good firmware analysis will shed more light on this issue. 
>
>> 4) The target stack calls qlt_abort_cmd(), and since aborted flag has already been set, this call ended as multiple abort.
>> 
>> 5) The target stack calls xmit_response, and since command has already been aborted, this call starts the code sequence to release the command that ended with qlt_free_cmd()
>> 
>> I think I could try to reproduce the problem with LIO target stack, but I have special case with my target stack that lead to reset of qlogic adapter (ISP error recovery) and this is one important part of the error sequence. So, I think I will not be able to reproduce the problem with the LIO until I find out how to similarly reset qlogic adapter during processing active commands that have already been sent to the firmware.
>
>
> Himanshu Madhani        Oracle Linux Engineering

I think I know the cause of the firmware crash: an abnormal sg list that my backend driver generates and passes to the Qlogic driver via the target stack. "Abnormal" here means that the sg list contains more than a thousand nents, and apparently the Qlogic adapter cannot handle such buffers.
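
To illustrate what I mean by an oversized sg list, here is a rough user-space sketch of the kind of check a backend could do before handing a list to the driver. The names model_sg_fits() and model_sg_chunks() are hypothetical, and max_nents is an assumed parameter: I do not know the actual limit the adapter can handle, only that it crashed with more than a thousand entries.

```c
/*
 * Hypothetical helpers, not part of qla2xxx or any target stack.
 * max_nents is an assumed limit supplied by the caller; the real
 * adapter limit is not known from this thread.
 */

/* Returns 1 if a list with nents entries is within the assumed limit. */
static int model_sg_fits(unsigned int nents, unsigned int max_nents)
{
    return nents <= max_nents;
}

/*
 * Returns how many pieces a list with nents entries would have to be
 * split into so that no piece exceeds max_nents entries.
 * Returns 0 if max_nents is 0, since no split is possible then.
 */
static unsigned int model_sg_chunks(unsigned int nents, unsigned int max_nents)
{
    if (max_nents == 0)
        return 0;
    return nents / max_nents + (nents % max_nents != 0);
}
```

This only shows the arithmetic. It does not say anything about where such a limit should be enforced.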

In any case, I think the main point is not to find or fix the cause of the firmware crash (it actually comes from my side), but to fix the crash that occurs while the Qlogic driver is recovering from a firmware crash.
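
The sequence from steps 1) to 5) quoted above can be modelled in a few lines of user-space C. This is only a simplified sketch of the state transitions as I understand them, not the driver code and not the patch itself. All model_* names are hypothetical stand-ins, and the -1 return stands in for the place where the real code hits BUG_ON(cmd->sg_mapped).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-command state in the driver. */
struct model_cmd {
    bool sg_mapped;  /* the sg list is currently DMA-mapped */
    bool aborted;    /* the abort flag has been set */
};

/* Step 1: map the sg list before sending the command to the firmware. */
static void model_rdy_to_xfer(struct model_cmd *cmd)
{
    cmd->sg_mapped = true;
}

/* Steps 2-3: adapter reset marks the outstanding command as aborted. */
static void model_abort_all_cmds(struct model_cmd *cmd)
{
    cmd->aborted = true;
}

static void model_unmap_sg(struct model_cmd *cmd)
{
    cmd->sg_mapped = false;
}

/* Returns 0 on a clean free, -1 where BUG_ON(cmd->sg_mapped) would fire. */
static int model_free_cmd(const struct model_cmd *cmd)
{
    return cmd->sg_mapped ? -1 : 0;
}

/*
 * Step 5: response path. With unmap_on_abort == false an aborted
 * command goes straight to free with its mapping still live, which is
 * the behaviour I observe. With unmap_on_abort == true the mapping is
 * dropped first.
 */
static int model_xmit_response(struct model_cmd *cmd, bool unmap_on_abort)
{
    if (cmd->aborted && !unmap_on_abort)
        return model_free_cmd(cmd);

    if (cmd->sg_mapped)
        model_unmap_sg(cmd);
    return model_free_cmd(cmd);
}

/* Runs the whole sequence once and returns the result of the free. */
static int model_run(bool unmap_on_abort)
{
    struct model_cmd cmd = { .sg_mapped = false, .aborted = false };

    model_rdy_to_xfer(&cmd);
    model_abort_all_cmds(&cmd);
    return model_xmit_response(&cmd, unmap_on_abort);
}
```

In this model, model_run(false) returns -1 and model_run(true) returns 0. The point is that the aborted path has to release the mapping before the command is freed, regardless of what caused the abort.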

I have a special case that lets me reproduce the problem, but it can probably be reproduced in other cases that cause a firmware crash. Is there a way to trigger a firmware crash manually, so that the problem can be reproduced artificially?