Re: iSCSI Abort Task and WRITE PENDING

From: Mike Christie <michael.christie@oracle.com>
To: Konstantin Shelekhin <k.shelekhin@yadro.com>,
	target-devel@vger.kernel.org
Cc: linux@yadro.com, Maurizio Lombardi <mlombard@redhat.com>
Subject: Re: iSCSI Abort Task and WRITE PENDING
Date: Wed, 13 Oct 2021 13:05:26 -0500	[thread overview]
Message-ID: <b49d2f3b-4762-832f-55ff-7543e08f387b@oracle.com> (raw)
In-Reply-To: <17647b68-f039-3fc3-808e-0feb652ddf8c@oracle.com>

On 10/13/21 12:51 PM, Mike Christie wrote:
> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
>> Hi,
>>
>> I really need the collective wisdom.
>>
>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
>> handling. Currently it's not possible to abort a WRITE_10 command in
>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
>> the process:
>>
>>   # dmesg | tail -2
>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
>>
>>   # ps aux | awk '$8 ~/D/'
>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
>>
>>   # cat /proc/32/stack
>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>>   [<0>] target_tmr_work+0x9e/0xe0
>>   [<0>] process_one_work+0x1d4/0x370
>>   [<0>] worker_thread+0x48/0x3d0
>>   [<0>] kthread+0x122/0x140
>>   [<0>] ret_from_fork+0x22/0x30
>>
>>   # cat /proc/1187/stack
>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>>   [<0>] transport_generic_free_cmd+0xe9/0x180
>>   [<0>] iscsit_free_cmd+0x50/0xb0
>>   [<0>] iscsit_close_connection+0x47d/0x8c0
>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>>   [<0>] iscsi_target_tx_thread+0x184/0x200
>>   [<0>] kthread+0x122/0x140
>>   [<0>] ret_from_fork+0x22/0x30
>>
>> What happens:
>>
>>   1. Initiator sends WRITE_10 CDB
>>   2. Target parses the CDB and sends R2T
>>   3. Target starts the Data-Out timer
>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>>      and starts waiting for the request completion
>>   6. Nothing happens
>>   7. The Data-Out timers expires, connection teardown starts and gets
>>      stuck waiting for ABORT TASK that waits for WRITE_10
>>
>> The ABORT TASK processing looks roughly like this:
>>
>>   iscsi_rx_opcode
>>     iscsi_handle_task_mgt_cmd
>>       iscsi_tmr_abort_task
>>       transport_generic_handle_tmr
>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>           target_handle_abort
>>         else
>>           target_tmr_work
>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>               target_handle_abort
>>             else
>>               core_tmr_abort_task
>>                 ret = __target_check_io_state
>>                   if (write_cmd->transport_state & CMD_T_STOP)
>>                     return -1
>>                   write_cmd->transport_state |= CMD_T_ABORTED
>>                   return 0
>>                 if (!ret)
>>                   list_move_tail(&write_cmd->state_list, &aborted)
>>                   target_put_cmd_and_wait(&write_cmd)
>>
>> As I see it, the main problem is that the abort path can't initiate the
>> command termination, it simply waits for the request to handle this on
>> the execution path like in target_execute_cmd():
>>
>>   target_execute_cmd
>>     target_cmd_interrupted
>>       INIT_WORK(&cmd->work, target_abort_work)
>>
>> However, in this case the request is not going to be executed because
>> Initiator will not send the Data-Out buffer.
>>
>> I have a couple of ideas on how to fix this, but they all look kinda
>> ugly. The one that currently works around this for me:
>>
>>   core_tmr_abort_task():
>>
>>     [...]
>>
>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>     
>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>>     
>>     target_put_cmd_and_wait(se_cmd);
>>
>>     [...]
>>
>> The new method abort_write_pending() is defined only for iSCSI and calls
>> target_handle_abort(). However, this opens up another can of worms
>> because this code heavily races with R2T sending and requires a couple
>> of checks to "work most of the time". Not ideal, by far.
>>
>> I can make this one better by introducing R2T list draining that ensures
>> the proper order during cleanup, but maybe there is a much easier way
>> that I'm not seeing here.
> 
> Ccing Maurizio to make sure I don't add his original bug back.
> 
> If I understand you, I think I added this bug in:
> 
> commit f36199355c64a39fe82cfddc7623d827c7e050da
> Author: Mike Christie <michael.christie@oracle.com>
> Date:   Fri Nov 13 19:46:18 2020 -0600
> 
>     scsi: target: iscsi: Fix cmd abort fabric stop race
> 
> With that patch if the abort or a lun reset has got to lio core then we
> are going to be stuck waiting for the data which won't come because we
> killed the iscsi threads.
> 
> Can go back to always having the iscsi target clean up the cmd, but if
> LIO has started to abort the cmd we take an extra ref so we don't free
> the cmd from under each other.
> 
> This patch is completely untested:
> 
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 2c54c5d8412d..d221e9be7468 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>  			spin_lock_irq(&se_cmd->t_state_lock);
>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>  				/*
> -				 * LIO's abort path owns the cleanup for this,
> -				 * so put it back on the list and let
> -				 * aborted_task handle it.
> +				 * The LIO TMR handler owns the cmd but if
> +				 * we were waiting for data from the initiator
> +				 * then we need to internally cleanup to be
> +				 * able to complete it. Get an extra ref so
> +				 * we don't free the cmd from under LIO core.
>  				 */
> -				list_move_tail(&cmd->i_conn_node,
> -					       &conn->conn_cmd_list);
> +				target_get_sess_cmd(se_cmd, false);
>  			} else {
>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>  			}
> 

Another alternative would be to have iscsi check if it was waiting on
data before sending a TMR to LIO core. If it is, then it would just cleanup
internally and complete the TMR with success/failed depending on the TMR or
drop it and let the initiator escalate.