Re: iSCSI Abort Task and WRITE PENDING

From: Mike Christie <michael.christie@oracle.com>
To: Konstantin Shelekhin <k.shelekhin@yadro.com>
Cc: target-devel@vger.kernel.org, linux@yadro.com,
	Maurizio Lombardi <mlombard@redhat.com>
Subject: Re: iSCSI Abort Task and WRITE PENDING
Date: Mon, 18 Oct 2021 11:29:23 -0500
Message-ID: <2318e7d3-84c1-e5b0-62ce-dd25a21d3476@oracle.com>
In-Reply-To: <YW1g/OFXMHq44CYo@yadro.com>

On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
>>> If I understand this approach correctly, it fixes the deadlock, but the
>>> connection reinstatement will still happen, because WRITE_10 won't be
>>> aborted and the connection will go down after the timeout.
>>>
>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>> have the connection (meaning SCSI session) killed on arbitrary ABORT
>>
>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>> the big command short timeout issue? Driver/fw bug?
> 
> It was triggered by ESXi. During some heavy IOPS intervals the backend
> device cannot handle the load and some IOs get stuck for more than 30
> seconds. I suspect that ABORT TASKs are issued by the virtual machines.
> So a series of ABORT TASK will come, and the unlucky one will hit the
> issue.

I didn't get this. If only the backend is backed up then we should
still be transmitting the data out/R2Ts quickly and we shouldn't be
hitting the issue where we got stuck waiting on them.

>  
>>> TASK. While I'm sure most initiators will be able to recover from this
>>> event, such drastic measures will certainly cause a lot of confusion for
>>> people who are not familiar with TCM internals.
>>
>> How will this cause confusion vs the case where the cmd reaches the target
>> and we are waiting for it on the backend? In both cases, the initiator sends
>> an abort, it times out, the initiator or target drops the connection, we
>> relogin. Every initiator handles this.
> 
> Because usually (when a WRITE request is past the WRITE PENDING state)

Ah I think we were talking about different things here. I thought you meant
users and I was just saying they wouldn't see a difference. But for ESXi
it's going to work differently than I was thinking. I thought the initiator
was going to escalate to LUN RESET then we hit the issue I mention
below in the FastAbort part of the mail where we end up dropping the
connection waiting on the data outs.

> the ABORT TASK does not trigger relogin. In my experience the initiator
> just waits for the TMR completion and goes on.
> 
> And from a blackbox perspective it looks suspicious:
> 
>   1. ABORT TASK sent to WRITE_10 tag 0x1; waits for its completion
>   2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the connection is dropped

I didn't get this part where the connection is dropped almost immediately.
If only the backend is backed up, what is dropping the connection right
away? The data out timers shouldn't be firing, right? It sounds like above
the network between the initiator and target was ok, so data outs and R2Ts
should be executing quickly like normal, right?

> 
> The only difference between #1 and #2 is that the command 0x1 is past
> the WRITE PENDING state.
> 
>> With that said I am in favor of you fixing the code so we can cleanup
>> a partially sent cmd if it can be done sanely.
>>
>> I personally would just leave the current behavior and fix the deadlock
>> because:
>>
>> 1. When I see this happening it's normally the network so we have to blow
>> away the group of commands and we end up dropping the connection one way
>> or another. I don't see the big command short timeout case often anymore.
>>
>> 2. Initiators just did not implement this right. I know this for sure
>> for open-iscsi at least. I started to fix my screw ups the other day but it
>> ends up breaking the targets.
>>
>> For example,
>>
>> - If we've sent a R2T and the initiator has sent a LUN RESET, what are
>> you going to have the target do? Send the response right away?
> 
> AFAIR the spec says "nuke it, there will be no data after this".
> 
>> - If we've sent a R2T and the initiator has sent some of the data
>> PDUs to fulfill it but has not sent the final PDU, then it sends the
>> LUN RESET, what do you do?
> 
> The same. However, I understand the interoperability concerns. I'll
> check what other targets do.

I think maybe you are replying about aborts, but I was asking about
LUN RESET which is opposite but will also hit the same hang if the
connection is dropped after one is sent.

For aborts it works like you wrote above. For LUN RESET it's opposite.
In 3720, it doesn't say how to handle aborts, but on the pdl lists it
came up and they said the equivalent of your "nuke it". However, for TMFs
that affect multiple tasks they clarified it in later versions of the
specs.

In the original it only says how to handle abort/clear task set, but in

https://datatracker.ietf.org/doc/html/rfc5048

the behavior was clarified and in 7143 we have the original/default
way:

https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3

which says to wait for the data outs.

And then we have FastAbort which is nuke it:

https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.4
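
To make the difference between those two sections concrete, here is a
minimal sketch. It is not LIO code: the names (Mode, Task,
multi_task_tmf_action) are made up for illustration, and it only models
the one question from this thread, which is what a target does with a
write that still has an outstanding R2T when a multi-task TMF such as
LUN RESET arrives.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    # Hypothetical labels for the two behaviors discussed above.
    STANDARD = "RFC3720"      # original/default multi-task abort semantics
    FAST_ABORT = "FastAbort"  # only if TaskReporting=FastAbort was negotiated


@dataclass
class Task:
    tag: int
    outstanding_r2t: bool  # R2T sent, final Data-Out PDU not yet received


def multi_task_tmf_action(task: Task, mode: Mode) -> str:
    """What the target does with one affected task before answering the TMF."""
    if not task.outstanding_r2t:
        # Nothing is pending on the wire for this task.
        return "abort"
    if mode is Mode.FAST_ABORT:
        # "Nuke it": do not wait for the remaining Data-Out PDUs.
        return "abort"
    # Default behavior: wait for the data outs that fulfill the R2T.
    return "wait_for_data_out"
```

With this model, multi_task_tmf_action(Task(0x2, True), Mode.STANDARD)
is the case that leaves the target waiting on the initiator, which is
where we get stuck if the data outs never come.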

>  
>> - You also have the immediate data case and the InitialR2T case.
> 
> True.
>  
>> The updated specs clarify how to handle this, and even have a FastAbort
>> key to specify which behavior we are going to do. But we don't support
>> it and I don't think many people implemented it.

So here I was saying I don't think anyone implemented the ability to
negotiate for TaskReporting=FastAbort, so most might do the original behavior.
If that's right then we just end up dropping the connection for Windows and
Linux and newer versions of ESXi if that session drop setting is set
(I can't remember the setting), when they end up timing out their TMFs.
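
As a rough illustration of why most sessions would end up with the
original behavior, here is a simplified sketch of how I understand the
TaskReporting list negotiation in RFC 7143 to work. The function name
and the None return for a failed negotiation are my own simplifications,
not anything from LIO or open-iscsi, and error handling is left out.

```python
DEFAULT_TASK_REPORTING = "RFC3720"  # used when the key is never offered


def negotiate_task_reporting(offered, supported):
    """Pick the TaskReporting value for a session (simplified sketch).

    offered:   values the offering side sent, in preference order
               (empty if the key was not sent at all).
    supported: set of values the responding side implements.
    """
    if not offered:
        # Key not negotiated: both sides stay on the default behavior.
        return DEFAULT_TASK_REPORTING
    for value in offered:
        # The responder answers with the first offered value it supports.
        if value in supported:
            return value
    # No common value; how the failure is handled is outside this sketch.
    return None
```

So an initiator that offers ["FastAbort", "RFC3720"] to a target that
only implements {"RFC3720"} still gets the wait-for-data-outs behavior,
and an initiator that never sends the key gets it too.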