* iSCSI Abort Task and WRITE PENDING
@ 2021-10-13 13:21 Konstantin Shelekhin
  2021-10-13 14:22 ` Hannes Reinecke
  2021-10-13 17:51 ` Mike Christie
  0 siblings, 2 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 13:21 UTC (permalink / raw)
  To: target-devel; +Cc: linux

Hi,

I really need the collective wisdom.

Not long ago we uncovered a problem with iSCSI and ABORT TASK
handling. Currently it's not possible to abort a WRITE_10 command in
the TRANSPORT_WRITE_PENDING state, because ABORT TASK will hang itself
in the process:

  # dmesg | tail -2
  [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
  [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>

  # ps aux | awk '$8 ~/D/'
  root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
  root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]

  # cat /proc/32/stack
  [<0>] target_put_cmd_and_wait+0x68/0xa0
  [<0>] core_tmr_abort_task.cold+0x16b/0x192
  [<0>] target_tmr_work+0x9e/0xe0
  [<0>] process_one_work+0x1d4/0x370
  [<0>] worker_thread+0x48/0x3d0
  [<0>] kthread+0x122/0x140
  [<0>] ret_from_fork+0x22/0x30

  # cat /proc/1187/stack
  [<0>] __transport_wait_for_tasks+0xaf/0x100
  [<0>] transport_generic_free_cmd+0xe9/0x180
  [<0>] iscsit_free_cmd+0x50/0xb0
  [<0>] iscsit_close_connection+0x47d/0x8c0
  [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
  [<0>] iscsi_target_tx_thread+0x184/0x200
  [<0>] kthread+0x122/0x140
  [<0>] ret_from_fork+0x22/0x30

What happens:

  1. Initiator sends WRITE_10 CDB
  2. Target parses the CDB and sends R2T
  3. Target starts the Data-Out timer
  4. Initiator sends ABORT TASK; no new data from Initiator after this
  5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
     and starts waiting for the request completion
  6. Nothing happens
  7. The Data-Out timer expires, connection teardown starts and gets
     stuck waiting for ABORT TASK, which in turn waits for WRITE_10

The ABORT TASK processing looks roughly like this:

  iscsi_rx_opcode
    iscsi_handle_task_mgt_cmd
      iscsi_tmr_abort_task
      transport_generic_handle_tmr
        if (tmr_cmd->transport_state & CMD_T_ABORTED)
          target_handle_abort
        else
          target_tmr_work
            if (tmr_cmd->transport_state & CMD_T_ABORTED)
              target_handle_abort
            else
              core_tmr_abort_task
                ret = __target_check_io_state
                  if (write_cmd->transport_state & CMD_T_STOP)
                    return -1
                  write_cmd->transport_state |= CMD_T_ABORTED
                  return 0
                if (!ret)
                  list_move_tail(&write_cmd->state_list, &aborted)
                  target_put_cmd_and_wait(&write_cmd)

As I see it, the main problem is that the abort path can't initiate the
command termination, it simply waits for the request to handle this on
the execution path like in target_execute_cmd():

  target_execute_cmd
    target_cmd_interrupted
      INIT_WORK(&cmd->work, target_abort_work)

However, in this case the request is not going to be executed because
Initiator will not send the Data-Out buffer.

I have a couple of ideas on how to fix this, but they all look kinda
ugly. The one that currently works around this for me:

  core_tmr_abort_task():

    [...]

    spin_lock_irqsave(&se_cmd->t_state_lock, flags);
    write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
    spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
    
    if (write_pending && se_cmd->se_tfo->abort_write_pending)
            se_cmd->se_tfo->abort_write_pending(se_cmd);
    
    target_put_cmd_and_wait(se_cmd);

    [...]

The new method abort_write_pending() is defined only for iSCSI and calls
target_handle_abort(). However, this opens up another can of worms,
because this code heavily races with R2T sending and requires a couple
of checks to "work most of the time". Far from ideal.
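
To make that concrete, the iSCSI-side hook is roughly the following (a
simplified sketch, not the actual patch; the function name is only
illustrative and the R2T races mentioned above are not handled here):

  /*
   * Sketch only: stop the Data-Out wait and push the command onto the
   * abort path, which it would otherwise reach only through execution.
   */
  static void lio_abort_write_pending(struct se_cmd *se_cmd)
  {
          struct iscsi_cmd *cmd = container_of(se_cmd, struct iscsi_cmd, se_cmd);

          /* No Data-Out is coming for this command, so stop waiting for it. */
          iscsit_stop_dataout_timer(cmd);

          /* Complete the command on the abort path. */
          target_handle_abort(se_cmd);
  }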

I can make this one better by introducing R2T list draining that ensures
the proper order during cleanup, but maybe there is a much easier way
that I'm not seeing here.

The tempting idea to wait for the Data-Out timer to time out and then
abort the request is not going to work either, because if the WRITE_10
processing lags and WRITE_10 is marked as aborted before R2Ts are sent,
ABORT TASK will still be stuck, but the Data-Out timer won't even be
started. Also the timeout upper limit is 60 seconds, so...

I really need some advice on how to deal with this properly.

P.S. This case is easily reproduced with libiscsi after some minor
modifications:

  $ git clone https://github.com/sahlberg/libiscsi.git
  $ cd libiscsi
  $ cat <<-EOF | patch -p 1
	diff --git a/test-tool/test_async_abort_simple.c b/test-tool/test_async_abort_simple.c
	index 82abb9f..70fd5ea 100644
	--- a/test-tool/test_async_abort_simple.c
	+++ b/test-tool/test_async_abort_simple.c
	@@ -92,7 +92,7 @@ test_async_abort_simple(void)
	 {
	 	int ret;
	 	struct tests_async_abort_state state = { NULL, 0, 0, 0, 0 };
	-	int blocks_per_io = 8;
	+	int blocks_per_io = 32678;
	 	unsigned char *buf;
	 	uint64_t timeout_sec;
	 
	@@ -146,6 +146,8 @@ test_async_abort_simple(void)
	 	}
	 	logging(LOG_VERBOSE, "dispatched");
	 
	+	sleep(2);
	+
	 	/*
	 	 * queue abort - shouldn't cancel the dispatched task. TMF req should
	 	 * be sent to the target.
  EOF
  $ ./autogen.sh
  $ ./configure --enable-test-tool
  $ ./test-tool/iscsi-test-cu -V --dataloss --test iSCSI.iSCSITMF.AbortTaskSimpleAsync \
  	-i <initiator-iqn> iscsi://<portal>/<target-iqn>/<lun>


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 13:21 iSCSI Abort Task and WRITE PENDING Konstantin Shelekhin
@ 2021-10-13 14:22 ` Hannes Reinecke
  2021-10-13 14:53   ` Konstantin Shelekhin
  2021-10-13 17:51 ` Mike Christie
  1 sibling, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2021-10-13 14:22 UTC (permalink / raw)
  To: Konstantin Shelekhin, target-devel; +Cc: linux

On 10/13/21 3:21 PM, Konstantin Shelekhin wrote:
> Hi,
> 
> I really need the collective wisdom.
> 
> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
> handling. Currently it's not possible to abort a WRITE_10 command in
> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
> the process:
> 
>   # dmesg | tail -2
>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
> 
>   # ps aux | awk '$8 ~/D/'
>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
> 
>   # cat /proc/32/stack
>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>   [<0>] target_tmr_work+0x9e/0xe0
>   [<0>] process_one_work+0x1d4/0x370
>   [<0>] worker_thread+0x48/0x3d0
>   [<0>] kthread+0x122/0x140
>   [<0>] ret_from_fork+0x22/0x30
> 
>   # cat /proc/1187/stack
>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>   [<0>] transport_generic_free_cmd+0xe9/0x180
>   [<0>] iscsit_free_cmd+0x50/0xb0
>   [<0>] iscsit_close_connection+0x47d/0x8c0
>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>   [<0>] iscsi_target_tx_thread+0x184/0x200
>   [<0>] kthread+0x122/0x140
>   [<0>] ret_from_fork+0x22/0x30
> 
> What happens:
> 
>   1. Initiator sends WRITE_10 CDB
>   2. Target parses the CDB and sends R2T
>   3. Target starts the Data-Out timer
>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>      and starts waiting for the request completion
>   6. Nothing happens
>   7. The Data-Out timers expires, connection teardown starts and gets
>      stuck waiting for ABORT TASK that waits for WRITE_10
> 
> The ABORT TASK processing looks roughly like this:
> 
>   iscsi_rx_opcode
>     iscsi_handle_task_mgt_cmd
>       iscsi_tmr_abort_task
>       transport_generic_handle_tmr
>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>           target_handle_abort
>         else
>           target_tmr_work
>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>               target_handle_abort
>             else
>               core_tmr_abort_task
>                 ret = __target_check_io_state
>                   if (write_cmd->transport_state & CMD_T_STOP)
>                     return -1
>                   write_cmd->transport_state |= CMD_T_ABORTED
>                   return 0
>                 if (!ret)
>                   list_move_tail(&write_cmd->state_list, &aborted)
>                   target_put_cmd_and_wait(&write_cmd)
> 
> As I see it, the main problem is that the abort path can't initiate the
> command termination, it simply waits for the request to handle this on
> the execution path like in target_execute_cmd():
> 
>   target_execute_cmd
>     target_cmd_interrupted
>       INIT_WORK(&cmd->work, target_abort_work)
> 
> However, in this case the request is not going to be executed because
> Initiator will not send the Data-Out buffer.
> 
> I have a couple of ideas on how to fix this, but they all look kinda
> ugly. The one that currently works around this for me:
> 
>   core_tmr_abort_task():
> 
>     [...]
> 
>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>     
>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>     
>     target_put_cmd_and_wait(se_cmd);
> 
>     [...]
> 
> The new method abort_write_pending() is defined only for iSCSI and calls
> target_handle_abort(). However, this opens up another can of worms
> because this code heavily races with R2T sending and requires a couple
> of checks to "work most of the time". Not ideal, by far.
> 
> I can make this one better by introducing R2T list draining that ensures
> the proper order during cleanup, but maybe there is a much easier way
> that I'm not seeing here.
> 
> The tempting idea to wait for the Data-Out timer to time out and then
> abort the request is not going to work either, because if the WRITE_10
> processing lags and WRITE_10 is marked as aborted before R2Ts are sent,
> ABORT TASK will still be stuck, but the Data-Out timer won't even be
> started. Also the timeout upper limit is 60 seconds, so...
> 
> I really need some advice on how to deal with this properly.
> 
Short answer: you can't.

There is no feasible path in the I/O stack to abort running I/O; the
only chance you have here is to wait for it to time out.

We have run into similar issues, and found that the only sane solution
was to wait for the I/O to come back and then retry.
As this would take some time (30 seconds if you are unlucky) most
initiators will get unhappy and try to reset.
Which won't work, either, as the I/O is still stuck.
So we finally delayed relogin until all I/O was cleared.

Not the best solution, but the only thing we can do in the absence of a
proper I/O abort mechanism.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 14:22 ` Hannes Reinecke
@ 2021-10-13 14:53   ` Konstantin Shelekhin
  2021-10-13 14:56     ` Konstantin Shelekhin
  2021-10-14  7:09     ` Hannes Reinecke
  0 siblings, 2 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 14:53 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: target-devel, linux

On Wed, Oct 13, 2021 at 04:22:41PM +0200, Hannes Reinecke wrote:
> On 10/13/21 3:21 PM, Konstantin Shelekhin wrote:
> Short answer: you can't.
> 
> There is no feasible path in the I/O stack to abort running I/O; the
> only chance you have here is to wait for it to time-out.
>
> We have run into similar issues, and found that the only sane solution
> was to wait for the I/O to come back and then retry.
> As this would take some time (30 seconds if you are unlucky) most
> initiators will get unhappy and try to reset.
> Which won't work, either, as the I/O is still stuck.
> So we finally delayed relogin until all I/O was cleared.
> 
> Not the best solution, but the only thing we can do in the absense of a
> proper I/O abort mechanism.

I'm not sure we are talking about the same bug. In this case relogin is
not possible: new connections are rejected by the target, and the
existing one is not going anywhere because it's deadlocked on ABORT
TASK. The only solution is to reset the server.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 14:53   ` Konstantin Shelekhin
@ 2021-10-13 14:56     ` Konstantin Shelekhin
  2021-10-14  7:09     ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 14:56 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: target-devel, linux

On Wed, Oct 13, 2021 at 05:53:08PM +0300, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 04:22:41PM +0200, Hannes Reinecke wrote:
> > On 10/13/21 3:21 PM, Konstantin Shelekhin wrote:
> > Short answer: you can't.
> > 
> > There is no feasible path in the I/O stack to abort running I/O; the
> > only chance you have here is to wait for it to time-out.
> >
> > We have run into similar issues, and found that the only sane solution
> > was to wait for the I/O to come back and then retry.
> > As this would take some time (30 seconds if you are unlucky) most
> > initiators will get unhappy and try to reset.
> > Which won't work, either, as the I/O is still stuck.
> > So we finally delayed relogin until all I/O was cleared.
> > 
> > Not the best solution, but the only thing we can do in the absense of a
> > proper I/O abort mechanism.
> 
> I'm not sure we are talking about the same bug. In this case the relogin
> is not possible, because new connections are rejected by the target and
> the existing one is not going anywhere, because it's deadlocked on ABORT
> TASK. The only solution is to reset the server.

* to reset target server


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 13:21 iSCSI Abort Task and WRITE PENDING Konstantin Shelekhin
  2021-10-13 14:22 ` Hannes Reinecke
@ 2021-10-13 17:51 ` Mike Christie
  2021-10-13 18:05   ` Mike Christie
  2021-10-13 18:08   ` Konstantin Shelekhin
  1 sibling, 2 replies; 26+ messages in thread
From: Mike Christie @ 2021-10-13 17:51 UTC (permalink / raw)
  To: Konstantin Shelekhin, target-devel; +Cc: linux, Maurizio Lombardi

On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
> Hi,
> 
> I really need the collective wisdom.
> 
> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
> handling. Currently it's not possible to abort a WRITE_10 command in
> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
> the process:
> 
>   # dmesg | tail -2
>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
> 
>   # ps aux | awk '$8 ~/D/'
>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
> 
>   # cat /proc/32/stack
>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>   [<0>] target_tmr_work+0x9e/0xe0
>   [<0>] process_one_work+0x1d4/0x370
>   [<0>] worker_thread+0x48/0x3d0
>   [<0>] kthread+0x122/0x140
>   [<0>] ret_from_fork+0x22/0x30
> 
>   # cat /proc/1187/stack
>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>   [<0>] transport_generic_free_cmd+0xe9/0x180
>   [<0>] iscsit_free_cmd+0x50/0xb0
>   [<0>] iscsit_close_connection+0x47d/0x8c0
>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>   [<0>] iscsi_target_tx_thread+0x184/0x200
>   [<0>] kthread+0x122/0x140
>   [<0>] ret_from_fork+0x22/0x30
> 
> What happens:
> 
>   1. Initiator sends WRITE_10 CDB
>   2. Target parses the CDB and sends R2T
>   3. Target starts the Data-Out timer
>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>      and starts waiting for the request completion
>   6. Nothing happens
>   7. The Data-Out timers expires, connection teardown starts and gets
>      stuck waiting for ABORT TASK that waits for WRITE_10
> 
> The ABORT TASK processing looks roughly like this:
> 
>   iscsi_rx_opcode
>     iscsi_handle_task_mgt_cmd
>       iscsi_tmr_abort_task
>       transport_generic_handle_tmr
>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>           target_handle_abort
>         else
>           target_tmr_work
>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>               target_handle_abort
>             else
>               core_tmr_abort_task
>                 ret = __target_check_io_state
>                   if (write_cmd->transport_state & CMD_T_STOP)
>                     return -1
>                   write_cmd->transport_state |= CMD_T_ABORTED
>                   return 0
>                 if (!ret)
>                   list_move_tail(&write_cmd->state_list, &aborted)
>                   target_put_cmd_and_wait(&write_cmd)
> 
> As I see it, the main problem is that the abort path can't initiate the
> command termination, it simply waits for the request to handle this on
> the execution path like in target_execute_cmd():
> 
>   target_execute_cmd
>     target_cmd_interrupted
>       INIT_WORK(&cmd->work, target_abort_work)
> 
> However, in this case the request is not going to be executed because
> Initiator will not send the Data-Out buffer.
> 
> I have a couple of ideas on how to fix this, but they all look kinda
> ugly. The one that currently works around this for me:
> 
>   core_tmr_abort_task():
> 
>     [...]
> 
>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>     
>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>     
>     target_put_cmd_and_wait(se_cmd);
> 
>     [...]
> 
> The new method abort_write_pending() is defined only for iSCSI and calls
> target_handle_abort(). However, this opens up another can of worms
> because this code heavily races with R2T sending and requires a couple
> of checks to "work most of the time". Not ideal, by far.
> 
> I can make this one better by introducing R2T list draining that ensures
> the proper order during cleanup, but maybe there is a much easier way
> that I'm not seeing here.

Ccing Maurizio to make sure I don't add his original bug back.

If I understand you, I think I added this bug in:

commit f36199355c64a39fe82cfddc7623d827c7e050da
Author: Mike Christie <michael.christie@oracle.com>
Date:   Fri Nov 13 19:46:18 2020 -0600

    scsi: target: iscsi: Fix cmd abort fabric stop race

With that patch, if the abort or a LUN reset has got to LIO core, then
we are going to be stuck waiting for the data, which won't come because
we killed the iscsi threads.

We can go back to always having the iscsi target clean up the cmd, but
if LIO has started to abort the cmd, we take an extra ref so we don't
free the cmd from under each other.

This patch is completely untested:


diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
index 2c54c5d8412d..d221e9be7468 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 			spin_lock_irq(&se_cmd->t_state_lock);
 			if (se_cmd->transport_state & CMD_T_ABORTED) {
 				/*
-				 * LIO's abort path owns the cleanup for this,
-				 * so put it back on the list and let
-				 * aborted_task handle it.
+				 * The LIO TMR handler owns the cmd but if
+				 * we were waiting for data from the initiator
+				 * then we need to internally cleanup to be
+				 * able to complete it. Get an extra ref so
+				 * we don't free the cmd from under LIO core.
 				 */
-				list_move_tail(&cmd->i_conn_node,
-					       &conn->conn_cmd_list);
+				target_get_sess_cmd(se_cmd, false);
 			} else {
 				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
 			}


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 17:51 ` Mike Christie
@ 2021-10-13 18:05   ` Mike Christie
  2021-10-13 18:11     ` Konstantin Shelekhin
  2021-10-13 18:08   ` Konstantin Shelekhin
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Christie @ 2021-10-13 18:05 UTC (permalink / raw)
  To: Konstantin Shelekhin, target-devel; +Cc: linux, Maurizio Lombardi

On 10/13/21 12:51 PM, Mike Christie wrote:
> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
>> Hi,
>>
>> I really need the collective wisdom.
>>
>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
>> handling. Currently it's not possible to abort a WRITE_10 command in
>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
>> the process:
>>
>>   # dmesg | tail -2
>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
>>
>>   # ps aux | awk '$8 ~/D/'
>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
>>
>>   # cat /proc/32/stack
>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>>   [<0>] target_tmr_work+0x9e/0xe0
>>   [<0>] process_one_work+0x1d4/0x370
>>   [<0>] worker_thread+0x48/0x3d0
>>   [<0>] kthread+0x122/0x140
>>   [<0>] ret_from_fork+0x22/0x30
>>
>>   # cat /proc/1187/stack
>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>>   [<0>] transport_generic_free_cmd+0xe9/0x180
>>   [<0>] iscsit_free_cmd+0x50/0xb0
>>   [<0>] iscsit_close_connection+0x47d/0x8c0
>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>>   [<0>] iscsi_target_tx_thread+0x184/0x200
>>   [<0>] kthread+0x122/0x140
>>   [<0>] ret_from_fork+0x22/0x30
>>
>> What happens:
>>
>>   1. Initiator sends WRITE_10 CDB
>>   2. Target parses the CDB and sends R2T
>>   3. Target starts the Data-Out timer
>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>>      and starts waiting for the request completion
>>   6. Nothing happens
>>   7. The Data-Out timers expires, connection teardown starts and gets
>>      stuck waiting for ABORT TASK that waits for WRITE_10
>>
>> The ABORT TASK processing looks roughly like this:
>>
>>   iscsi_rx_opcode
>>     iscsi_handle_task_mgt_cmd
>>       iscsi_tmr_abort_task
>>       transport_generic_handle_tmr
>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>           target_handle_abort
>>         else
>>           target_tmr_work
>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>               target_handle_abort
>>             else
>>               core_tmr_abort_task
>>                 ret = __target_check_io_state
>>                   if (write_cmd->transport_state & CMD_T_STOP)
>>                     return -1
>>                   write_cmd->transport_state |= CMD_T_ABORTED
>>                   return 0
>>                 if (!ret)
>>                   list_move_tail(&write_cmd->state_list, &aborted)
>>                   target_put_cmd_and_wait(&write_cmd)
>>
>> As I see it, the main problem is that the abort path can't initiate the
>> command termination, it simply waits for the request to handle this on
>> the execution path like in target_execute_cmd():
>>
>>   target_execute_cmd
>>     target_cmd_interrupted
>>       INIT_WORK(&cmd->work, target_abort_work)
>>
>> However, in this case the request is not going to be executed because
>> Initiator will not send the Data-Out buffer.
>>
>> I have a couple of ideas on how to fix this, but they all look kinda
>> ugly. The one that currently works around this for me:
>>
>>   core_tmr_abort_task():
>>
>>     [...]
>>
>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>     
>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>>     
>>     target_put_cmd_and_wait(se_cmd);
>>
>>     [...]
>>
>> The new method abort_write_pending() is defined only for iSCSI and calls
>> target_handle_abort(). However, this opens up another can of worms
>> because this code heavily races with R2T sending and requires a couple
>> of checks to "work most of the time". Not ideal, by far.
>>
>> I can make this one better by introducing R2T list draining that ensures
>> the proper order during cleanup, but maybe there is a much easier way
>> that I'm not seeing here.
> 
> Ccing Maurizio to make sure I don't add his original bug back.
> 
> If I understand you, I think I added this bug in:
> 
> commit f36199355c64a39fe82cfddc7623d827c7e050da
> Author: Mike Christie <michael.christie@oracle.com>
> Date:   Fri Nov 13 19:46:18 2020 -0600
> 
>     scsi: target: iscsi: Fix cmd abort fabric stop race
> 
> With that patch if the abort or a lun reset has got to lio core then we
> are going to be stuck waiting for the data which won't come because we
> killed the iscsi threads.
> 
> Can go back to always having the iscsi target clean up the cmd, but if
> LIO has started to abort the cmd we take an extra ref so we don't free
> the cmd from under each other.
> 
> This patch is completely untested:
> 
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 2c54c5d8412d..d221e9be7468 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>  			spin_lock_irq(&se_cmd->t_state_lock);
>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>  				/*
> -				 * LIO's abort path owns the cleanup for this,
> -				 * so put it back on the list and let
> -				 * aborted_task handle it.
> +				 * The LIO TMR handler owns the cmd but if
> +				 * we were waiting for data from the initiator
> +				 * then we need to internally cleanup to be
> +				 * able to complete it. Get an extra ref so
> +				 * we don't free the cmd from under LIO core.
>  				 */
> -				list_move_tail(&cmd->i_conn_node,
> -					       &conn->conn_cmd_list);
> +				target_get_sess_cmd(se_cmd, false);
>  			} else {
>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>  			}
> 

Another alternative would be to have iscsi check whether it is still
waiting on data before sending a TMR to LIO core. If it is, it would
just clean up internally and complete the TMR with success/failure
depending on the TMR type, or drop it and let the initiator escalate.
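
Something like this, completely untested and with a made-up helper, just
to sketch the flow (locking around t_state is elided too):

/*
 * Untested sketch of the above. iscsit_cleanup_data_out_cmd() is a
 * placeholder for the internal cleanup, not an existing function.
 */
static int iscsit_try_local_abort(struct iscsi_cmd *tmr_cmd,
                                  struct iscsi_cmd *ref_cmd)
{
        struct se_cmd *se_cmd = &ref_cmd->se_cmd;

        /* Only short-circuit when we are still waiting for Data-Out. */
        if (se_cmd->t_state != TRANSPORT_WRITE_PENDING)
                return -EAGAIN; /* let transport_generic_handle_tmr() handle it */

        /* The initiator is not going to send the data it owes us,
         * so clean the cmd up inside iscsi... */
        iscsit_cleanup_data_out_cmd(ref_cmd);

        /* ...and answer the TMF ourselves instead of handing it to LIO core. */
        tmr_cmd->se_cmd.se_tmr_req->response = TMR_FUNCTION_COMPLETE;
        return 0;
}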


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 17:51 ` Mike Christie
  2021-10-13 18:05   ` Mike Christie
@ 2021-10-13 18:08   ` Konstantin Shelekhin
  2021-10-13 18:24     ` Mike Christie
  1 sibling, 1 reply; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 18:08 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
> > Hi,
> > 
> > I really need the collective wisdom.
> > 
> > Not long ago we've uncovered the problem with iSCSI and ABORT TASK
> > handling. Currently it's not possible to abort a WRITE_10 command in
> > TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
> > the process:
> > 
> >   # dmesg | tail -2
> >   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
> >   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
> > 
> >   # ps aux | awk '$8 ~/D/'
> >   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
> >   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
> > 
> >   # cat /proc/32/stack
> >   [<0>] target_put_cmd_and_wait+0x68/0xa0
> >   [<0>] core_tmr_abort_task.cold+0x16b/0x192
> >   [<0>] target_tmr_work+0x9e/0xe0
> >   [<0>] process_one_work+0x1d4/0x370
> >   [<0>] worker_thread+0x48/0x3d0
> >   [<0>] kthread+0x122/0x140
> >   [<0>] ret_from_fork+0x22/0x30
> > 
> >   # cat /proc/1187/stack
> >   [<0>] __transport_wait_for_tasks+0xaf/0x100
> >   [<0>] transport_generic_free_cmd+0xe9/0x180
> >   [<0>] iscsit_free_cmd+0x50/0xb0
> >   [<0>] iscsit_close_connection+0x47d/0x8c0
> >   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
> >   [<0>] iscsi_target_tx_thread+0x184/0x200
> >   [<0>] kthread+0x122/0x140
> >   [<0>] ret_from_fork+0x22/0x30
> > 
> > What happens:
> > 
> >   1. Initiator sends WRITE_10 CDB
> >   2. Target parses the CDB and sends R2T
> >   3. Target starts the Data-Out timer
> >   4. Initiator sends ABORT TASK; no new data from Initiator after this
> >   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
> >      and starts waiting for the request completion
> >   6. Nothing happens
> >   7. The Data-Out timers expires, connection teardown starts and gets
> >      stuck waiting for ABORT TASK that waits for WRITE_10
> > 
> > The ABORT TASK processing looks roughly like this:
> > 
> >   iscsi_rx_opcode
> >     iscsi_handle_task_mgt_cmd
> >       iscsi_tmr_abort_task
> >       transport_generic_handle_tmr
> >         if (tmr_cmd->transport_state & CMD_T_ABORTED)
> >           target_handle_abort
> >         else
> >           target_tmr_work
> >             if (tmr_cmd->transport_state & CMD_T_ABORTED)
> >               target_handle_abort
> >             else
> >               core_tmr_abort_task
> >                 ret = __target_check_io_state
> >                   if (write_cmd->transport_state & CMD_T_STOP)
> >                     return -1
> >                   write_cmd->transport_state |= CMD_T_ABORTED
> >                   return 0
> >                 if (!ret)
> >                   list_move_tail(&write_cmd->state_list, &aborted)
> >                   target_put_cmd_and_wait(&write_cmd)
> > 
> > As I see it, the main problem is that the abort path can't initiate the
> > command termination, it simply waits for the request to handle this on
> > the execution path like in target_execute_cmd():
> > 
> >   target_execute_cmd
> >     target_cmd_interrupted
> >       INIT_WORK(&cmd->work, target_abort_work)
> > 
> > However, in this case the request is not going to be executed because
> > Initiator will not send the Data-Out buffer.
> > 
> > I have a couple of ideas on how to fix this, but they all look kinda
> > ugly. The one that currently works around this for me:
> > 
> >   core_tmr_abort_task():
> > 
> >     [...]
> > 
> >     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
> >     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
> >     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> >     
> >     if (write_pending && se_cmd->se_tfo->abort_write_pending)
> >             se_cmd->se_tfo->abort_write_pending(se_cmd);
> >     
> >     target_put_cmd_and_wait(se_cmd);
> > 
> >     [...]
> > 
> > The new method abort_write_pending() is defined only for iSCSI and calls
> > target_handle_abort(). However, this opens up another can of worms
> > because this code heavily races with R2T sending and requires a couple
> > of checks to "work most of the time". Not ideal, by far.
> > 
> > I can make this one better by introducing R2T list draining that ensures
> > the proper order during cleanup, but maybe there is a much easier way
> > that I'm not seeing here.
> 
> Ccing Maurizio to make sure I don't add his original bug back.
> 
> If I understand you, I think I added this bug in:
> 
> commit f36199355c64a39fe82cfddc7623d827c7e050da
> Author: Mike Christie <michael.christie@oracle.com>
> Date:   Fri Nov 13 19:46:18 2020 -0600
> 
>     scsi: target: iscsi: Fix cmd abort fabric stop race
> 
> With that patch if the abort or a lun reset has got to lio core then we
> are going to be stuck waiting for the data which won't come because we
> killed the iscsi threads.
> 
> Can go back to always having the iscsi target clean up the cmd, but if
> LIO has started to abort the cmd we take an extra ref so we don't free
> the cmd from under each other.
> 
> This patch is completely untested:
> 
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 2c54c5d8412d..d221e9be7468 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>  			spin_lock_irq(&se_cmd->t_state_lock);
>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>  				/*
> -				 * LIO's abort path owns the cleanup for this,
> -				 * so put it back on the list and let
> -				 * aborted_task handle it.
> +				 * The LIO TMR handler owns the cmd but if
> +				 * we were waiting for data from the initiator
> +				 * then we need to internally cleanup to be
> +				 * able to complete it. Get an extra ref so
> +				 * we don't free the cmd from under LIO core.
>  				 */
> -				list_move_tail(&cmd->i_conn_node,
> -					       &conn->conn_cmd_list);
> +				target_get_sess_cmd(se_cmd, false);
>  			} else {
>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>  			}

The bug was there before. I had to backport this patch in order to
introduce my fix. I can revert it and check what is different, but it's
there in some form.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 18:05   ` Mike Christie
@ 2021-10-13 18:11     ` Konstantin Shelekhin
  0 siblings, 0 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 18:11 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Wed, Oct 13, 2021 at 01:05:26PM -0500, Mike Christie wrote:
> On 10/13/21 12:51 PM, Mike Christie wrote:
> Another alternative would be to have iscsi check if it was waiting on
> data before sending a TMR to LIO core. If it is, then it would just cleanup
> internally and complete the TMR with success/failed depending on the TMR or
> drop it and let the initiator escalate.

I was thinking about that, but there is some TAS code that requires the
proper response from the aborted command. Also, the ABORT TASK execution
path kinda allows ABORT TASK to be aborted itself. I'm not sure if that
code has any meaning now, but if it does, then clearing WRITE_10 early
may not be a valid action from the SCSI protocol point of view.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 18:08   ` Konstantin Shelekhin
@ 2021-10-13 18:24     ` Mike Christie
  2021-10-13 18:30       ` Mike Christie
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Christie @ 2021-10-13 18:24 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/13/21 1:08 PM, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
>> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
>>> Hi,
>>>
>>> I really need the collective wisdom.
>>>
>>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
>>> handling. Currently it's not possible to abort a WRITE_10 command in
>>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
>>> the process:
>>>
>>>   # dmesg | tail -2
>>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
>>>
>>>   # ps aux | awk '$8 ~/D/'
>>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
>>>
>>>   # cat /proc/32/stack
>>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>>>   [<0>] target_tmr_work+0x9e/0xe0
>>>   [<0>] process_one_work+0x1d4/0x370
>>>   [<0>] worker_thread+0x48/0x3d0
>>>   [<0>] kthread+0x122/0x140
>>>   [<0>] ret_from_fork+0x22/0x30
>>>
>>>   # cat /proc/1187/stack
>>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>>>   [<0>] transport_generic_free_cmd+0xe9/0x180
>>>   [<0>] iscsit_free_cmd+0x50/0xb0
>>>   [<0>] iscsit_close_connection+0x47d/0x8c0
>>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>>>   [<0>] iscsi_target_tx_thread+0x184/0x200
>>>   [<0>] kthread+0x122/0x140
>>>   [<0>] ret_from_fork+0x22/0x30
>>>
>>> What happens:
>>>
>>>   1. Initiator sends WRITE_10 CDB
>>>   2. Target parses the CDB and sends R2T
>>>   3. Target starts the Data-Out timer
>>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>>>      and starts waiting for the request completion
>>>   6. Nothing happens
>>>   7. The Data-Out timers expires, connection teardown starts and gets
>>>      stuck waiting for ABORT TASK that waits for WRITE_10
>>>
>>> The ABORT TASK processing looks roughly like this:
>>>
>>>   iscsi_rx_opcode
>>>     iscsi_handle_task_mgt_cmd
>>>       iscsi_tmr_abort_task
>>>       transport_generic_handle_tmr
>>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>           target_handle_abort
>>>         else
>>>           target_tmr_work
>>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>               target_handle_abort
>>>             else
>>>               core_tmr_abort_task
>>>                 ret = __target_check_io_state
>>>                   if (write_cmd->transport_state & CMD_T_STOP)
>>>                     return -1
>>>                   write_cmd->transport_state |= CMD_T_ABORTED
>>>                   return 0
>>>                 if (!ret)
>>>                   list_move_tail(&write_cmd->state_list, &aborted)
>>>                   target_put_cmd_and_wait(&write_cmd)
>>>
>>> As I see it, the main problem is that the abort path can't initiate the
>>> command termination, it simply waits for the request to handle this on
>>> the execution path like in target_execute_cmd():
>>>
>>>   target_execute_cmd
>>>     target_cmd_interrupted
>>>       INIT_WORK(&cmd->work, target_abort_work)
>>>
>>> However, in this case the request is not going to be executed because
>>> Initiator will not send the Data-Out buffer.
>>>
>>> I have a couple of ideas on how to fix this, but they all look kinda
>>> ugly. The one that currently works around this for me:
>>>
>>>   core_tmr_abort_task():
>>>
>>>     [...]
>>>
>>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>>     
>>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>>>     
>>>     target_put_cmd_and_wait(se_cmd);
>>>
>>>     [...]
>>>
>>> The new method abort_write_pending() is defined only for iSCSI and calls
>>> target_handle_abort(). However, this opens up another can of worms
>>> because this code heavily races with R2T sending and requires a couple
>>> of checks to "work most of the time". Not ideal, by far.
>>>
>>> I can make this one better by introducing R2T list draining that ensures
>>> the proper order during cleanup, but maybe there is a much easier way
>>> that I'm not seeing here.
>>
>> Ccing Maurizio to make sure I don't add his original bug back.
>>
>> If I understand you, I think I added this bug in:
>>
>> commit f36199355c64a39fe82cfddc7623d827c7e050da
>> Author: Mike Christie <michael.christie@oracle.com>
>> Date:   Fri Nov 13 19:46:18 2020 -0600
>>
>>     scsi: target: iscsi: Fix cmd abort fabric stop race
>>
>> With that patch if the abort or a lun reset has got to lio core then we
>> are going to be stuck waiting for the data which won't come because we
>> killed the iscsi threads.
>>
>> Can go back to always having the iscsi target clean up the cmd, but if
>> LIO has started to abort the cmd we take an extra ref so we don't free
>> the cmd from under each other.
>>
>> This patch is completely untested:
>>
>>
>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>> index 2c54c5d8412d..d221e9be7468 100644
>> --- a/drivers/target/iscsi/iscsi_target.c
>> +++ b/drivers/target/iscsi/iscsi_target.c
>> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>  			spin_lock_irq(&se_cmd->t_state_lock);
>>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>>  				/*
>> -				 * LIO's abort path owns the cleanup for this,
>> -				 * so put it back on the list and let
>> -				 * aborted_task handle it.
>> +				 * The LIO TMR handler owns the cmd but if
>> +				 * we were waiting for data from the initiator
>> +				 * then we need to internally cleanup to be
>> +				 * able to complete it. Get an extra ref so
>> +				 * we don't free the cmd from under LIO core.
>>  				 */
>> -				list_move_tail(&cmd->i_conn_node,
>> -					       &conn->conn_cmd_list);
>> +				target_get_sess_cmd(se_cmd, false);
>>  			} else {
>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>>  			}
> 
> The bug was there before. I had to backport this patch in order to
> introduce my fix. I can revert it and check what is different, but it's
> there in some form.
> 

Don't waste your time. It's because iscsit_free_cmd's call to
transport_generic_free_cmd has wait_for_tasks=true.

We then do transport_generic_free_cmd -> target_wait_free_cmd ->
__transport_wait_for_tasks like you posted above.

We need to be able to do the wake-up instead of the wait in
this case.




* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 18:24     ` Mike Christie
@ 2021-10-13 18:30       ` Mike Christie
  2021-10-13 18:58         ` Konstantin Shelekhin
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Christie @ 2021-10-13 18:30 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/13/21 1:24 PM, Mike Christie wrote:
> On 10/13/21 1:08 PM, Konstantin Shelekhin wrote:
>> On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
>>> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
>>>> Hi,
>>>>
>>>> I really need the collective wisdom.
>>>>
>>>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
>>>> handling. Currently it's not possible to abort a WRITE_10 command in
>>>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
>>>> the process:
>>>>
>>>>   # dmesg | tail -2
>>>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>>>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
>>>>
>>>>   # ps aux | awk '$8 ~/D/'
>>>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>>>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
>>>>
>>>>   # cat /proc/32/stack
>>>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>>>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>>>>   [<0>] target_tmr_work+0x9e/0xe0
>>>>   [<0>] process_one_work+0x1d4/0x370
>>>>   [<0>] worker_thread+0x48/0x3d0
>>>>   [<0>] kthread+0x122/0x140
>>>>   [<0>] ret_from_fork+0x22/0x30
>>>>
>>>>   # cat /proc/1187/stack
>>>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>>>>   [<0>] transport_generic_free_cmd+0xe9/0x180
>>>>   [<0>] iscsit_free_cmd+0x50/0xb0
>>>>   [<0>] iscsit_close_connection+0x47d/0x8c0
>>>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>>>>   [<0>] iscsi_target_tx_thread+0x184/0x200
>>>>   [<0>] kthread+0x122/0x140
>>>>   [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> What happens:
>>>>
>>>>   1. Initiator sends WRITE_10 CDB
>>>>   2. Target parses the CDB and sends R2T
>>>>   3. Target starts the Data-Out timer
>>>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>>>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>>>>      and starts waiting for the request completion
>>>>   6. Nothing happens
>>>>   7. The Data-Out timers expires, connection teardown starts and gets
>>>>      stuck waiting for ABORT TASK that waits for WRITE_10
>>>>
>>>> The ABORT TASK processing looks roughly like this:
>>>>
>>>>   iscsi_rx_opcode
>>>>     iscsi_handle_task_mgt_cmd
>>>>       iscsi_tmr_abort_task
>>>>       transport_generic_handle_tmr
>>>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>>           target_handle_abort
>>>>         else
>>>>           target_tmr_work
>>>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>>               target_handle_abort
>>>>             else
>>>>               core_tmr_abort_task
>>>>                 ret = __target_check_io_state
>>>>                   if (write_cmd->transport_state & CMD_T_STOP)
>>>>                     return -1
>>>>                   write_cmd->transport_state |= CMD_T_ABORTED
>>>>                   return 0
>>>>                 if (!ret)
>>>>                   list_move_tail(&write_cmd->state_list, &aborted)
>>>>                   target_put_cmd_and_wait(&write_cmd)
>>>>
>>>> As I see it, the main problem is that the abort path can't initiate the
>>>> command termination, it simply waits for the request to handle this on
>>>> the execution path like in target_execute_cmd():
>>>>
>>>>   target_execute_cmd
>>>>     target_cmd_interrupted
>>>>       INIT_WORK(&cmd->work, target_abort_work)
>>>>
>>>> However, in this case the request is not going to be executed because
>>>> Initiator will not send the Data-Out buffer.
>>>>
>>>> I have a couple of ideas on how to fix this, but they all look kinda
>>>> ugly. The one that currently works around this for me:
>>>>
>>>>   core_tmr_abort_task():
>>>>
>>>>     [...]
>>>>
>>>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>>>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>>>     
>>>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>>>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>>>>     
>>>>     target_put_cmd_and_wait(se_cmd);
>>>>
>>>>     [...]
>>>>
>>>> The new method abort_write_pending() is defined only for iSCSI and calls
>>>> target_handle_abort(). However, this opens up another can of worms
>>>> because this code heavily races with R2T sending and requires a couple
>>>> of checks to "work most of the time". Not ideal, by far.
>>>>
>>>> I can make this one better by introducing R2T list draining that ensures
>>>> the proper order during cleanup, but maybe there is a much easier way
>>>> that I'm not seeing here.
>>>
>>> Ccing Maurizio to make sure I don't add his original bug back.
>>>
>>> If I understand you, I think I added this bug in:
>>>
>>> commit f36199355c64a39fe82cfddc7623d827c7e050da
>>> Author: Mike Christie <michael.christie@oracle.com>
>>> Date:   Fri Nov 13 19:46:18 2020 -0600
>>>
>>>     scsi: target: iscsi: Fix cmd abort fabric stop race
>>>
>>> With that patch if the abort or a lun reset has got to lio core then we
>>> are going to be stuck waiting for the data which won't come because we
>>> killed the iscsi threads.
>>>
>>> Can go back to always having the iscsi target clean up the cmd, but if
>>> LIO has started to abort the cmd we take an extra ref so we don't free
>>> the cmd from under each other.
>>>
>>> This patch is completely untested:
>>>
>>>
>>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>>> index 2c54c5d8412d..d221e9be7468 100644
>>> --- a/drivers/target/iscsi/iscsi_target.c
>>> +++ b/drivers/target/iscsi/iscsi_target.c
>>> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>>  			spin_lock_irq(&se_cmd->t_state_lock);
>>>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>>>  				/*
>>> -				 * LIO's abort path owns the cleanup for this,
>>> -				 * so put it back on the list and let
>>> -				 * aborted_task handle it.
>>> +				 * The LIO TMR handler owns the cmd but if
>>> +				 * we were waiting for data from the initiator
>>> +				 * then we need to internally cleanup to be
>>> +				 * able to complete it. Get an extra ref so
>>> +				 * we don't free the cmd from under LIO core.
>>>  				 */
>>> -				list_move_tail(&cmd->i_conn_node,
>>> -					       &conn->conn_cmd_list);
>>> +				target_get_sess_cmd(se_cmd, false);
>>>  			} else {
>>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>>>  			}
>>
>> The bug was there before. I had to backport this patch in order to
>> introduce my fix. I can revert it and check what is different, but it's
>> there in some form.
>>
> 
> Don't waste your time. It's because iscsit_free_cmd's call to
> transport_generic_free_cmd has wait_for_tasks=true.
> 
> We then do transport_generic_free_cmd -> target_wait_free_cmd ->
> __transport_wait_for_tasks like you posted above.

That's wrong; it's the transport_generic_free_cmd wait:

        if (aborted) {
                pr_debug("Detected CMD_T_ABORTED for ITT: %llu\n", cmd->tag);
                wait_for_completion(&compl);
                ret = 1;



* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 18:30       ` Mike Christie
@ 2021-10-13 18:58         ` Konstantin Shelekhin
  2021-10-13 19:01           ` Konstantin Shelekhin
  0 siblings, 1 reply; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 18:58 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Wed, Oct 13, 2021 at 01:30:32PM -0500, Mike Christie wrote:
> On 10/13/21 1:24 PM, Mike Christie wrote:
> > On 10/13/21 1:08 PM, Konstantin Shelekhin wrote:
> >> On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
> >>> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
> >>>> Hi,
> >>>>
> >>>> I really need the collective wisdom.
> >>>>
> >>>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
> >>>> handling. Currently it's not possible to abort a WRITE_10 command in
> >>>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
> >>>> the process:
> >>>>
> >>>>   # dmesg | tail -2
> >>>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
> >>>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
> >>>>
> >>>>   # ps aux | awk '$8 ~/D/'
> >>>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
> >>>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
> >>>>
> >>>>   # cat /proc/32/stack
> >>>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
> >>>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
> >>>>   [<0>] target_tmr_work+0x9e/0xe0
> >>>>   [<0>] process_one_work+0x1d4/0x370
> >>>>   [<0>] worker_thread+0x48/0x3d0
> >>>>   [<0>] kthread+0x122/0x140
> >>>>   [<0>] ret_from_fork+0x22/0x30
> >>>>
> >>>>   # cat /proc/1187/stack
> >>>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
> >>>>   [<0>] transport_generic_free_cmd+0xe9/0x180
> >>>>   [<0>] iscsit_free_cmd+0x50/0xb0
> >>>>   [<0>] iscsit_close_connection+0x47d/0x8c0
> >>>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
> >>>>   [<0>] iscsi_target_tx_thread+0x184/0x200
> >>>>   [<0>] kthread+0x122/0x140
> >>>>   [<0>] ret_from_fork+0x22/0x30
> >>>>
> >>>> What happens:
> >>>>
> >>>>   1. Initiator sends WRITE_10 CDB
> >>>>   2. Target parses the CDB and sends R2T
> >>>>   3. Target starts the Data-Out timer
> >>>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
> >>>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
> >>>>      and starts waiting for the request completion
> >>>>   6. Nothing happens
> >>>>   7. The Data-Out timers expires, connection teardown starts and gets
> >>>>      stuck waiting for ABORT TASK that waits for WRITE_10
> >>>>
> >>>> The ABORT TASK processing looks roughly like this:
> >>>>
> >>>>   iscsi_rx_opcode
> >>>>     iscsi_handle_task_mgt_cmd
> >>>>       iscsi_tmr_abort_task
> >>>>       transport_generic_handle_tmr
> >>>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
> >>>>           target_handle_abort
> >>>>         else
> >>>>           target_tmr_work
> >>>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
> >>>>               target_handle_abort
> >>>>             else
> >>>>               core_tmr_abort_task
> >>>>                 ret = __target_check_io_state
> >>>>                   if (write_cmd->transport_state & CMD_T_STOP)
> >>>>                     return -1
> >>>>                   write_cmd->transport_state |= CMD_T_ABORTED
> >>>>                   return 0
> >>>>                 if (!ret)
> >>>>                   list_move_tail(&write_cmd->state_list, &aborted)
> >>>>                   target_put_cmd_and_wait(&write_cmd)
> >>>>
> >>>> As I see it, the main problem is that the abort path can't initiate the
> >>>> command termination, it simply waits for the request to handle this on
> >>>> the execution path like in target_execute_cmd():
> >>>>
> >>>>   target_execute_cmd
> >>>>     target_cmd_interrupted
> >>>>       INIT_WORK(&cmd->work, target_abort_work)
> >>>>
> >>>> However, in this case the request is not going to be executed because
> >>>> Initiator will not send the Data-Out buffer.
> >>>>
> >>>> I have a couple of ideas on how to fix this, but they all look kinda
> >>>> ugly. The one that currently works around this for me:
> >>>>
> >>>>   core_tmr_abort_task():
> >>>>
> >>>>     [...]
> >>>>
> >>>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
> >>>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
> >>>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> >>>>     
> >>>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
> >>>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
> >>>>     
> >>>>     target_put_cmd_and_wait(se_cmd);
> >>>>
> >>>>     [...]
> >>>>
> >>>> The new method abort_write_pending() is defined only for iSCSI and calls
> >>>> target_handle_abort(). However, this opens up another can of worms
> >>>> because this code heavily races with R2T sending and requires a couple
> >>>> of checks to "work most of the time". Not ideal, by far.
> >>>>
> >>>> I can make this one better by introducing R2T list draining that ensures
> >>>> the proper order during cleanup, but maybe there is a much easier way
> >>>> that I'm not seeing here.
> >>>
> >>> Ccing Maurizio to make sure I don't add his original bug back.
> >>>
> >>> If I understand you, I think I added this bug in:
> >>>
> >>> commit f36199355c64a39fe82cfddc7623d827c7e050da
> >>> Author: Mike Christie <michael.christie@oracle.com>
> >>> Date:   Fri Nov 13 19:46:18 2020 -0600
> >>>
> >>>     scsi: target: iscsi: Fix cmd abort fabric stop race
> >>>
> >>> With that patch if the abort or a lun reset has got to lio core then we
> >>> are going to be stuck waiting for the data which won't come because we
> >>> killed the iscsi threads.
> >>>
> >>> Can go back to always having the iscsi target clean up the cmd, but if
> >>> LIO has started to abort the cmd we take an extra ref so we don't free
> >>> the cmd from under each other.
> >>>
> >>> This patch is completely untested:
> >>>
> >>>
> >>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> >>> index 2c54c5d8412d..d221e9be7468 100644
> >>> --- a/drivers/target/iscsi/iscsi_target.c
> >>> +++ b/drivers/target/iscsi/iscsi_target.c
> >>> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
> >>>  			spin_lock_irq(&se_cmd->t_state_lock);
> >>>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
> >>>  				/*
> >>> -				 * LIO's abort path owns the cleanup for this,
> >>> -				 * so put it back on the list and let
> >>> -				 * aborted_task handle it.
> >>> +				 * The LIO TMR handler owns the cmd but if
> >>> +				 * we were waiting for data from the initiator
> >>> +				 * then we need to internally cleanup to be
> >>> +				 * able to complete it. Get an extra ref so
> >>> +				 * we don't free the cmd from under LIO core.
> >>>  				 */
> >>> -				list_move_tail(&cmd->i_conn_node,
> >>> -					       &conn->conn_cmd_list);
> >>> +				target_get_sess_cmd(se_cmd, false);
> >>>  			} else {
> >>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
> >>>  			}
> >>
> >> The bug was there before. I had to backport this patch in order to
> >> introduce my fix. I can revert it and check what is different, but it's
> >> there in some form.
> >>
> > 
> > Don't waste your time. It's because iscsit_free_cmd's call to
> > transport_generic_free_cmd has wait_for_tasks=true.
> > 
> > We then do transport_generic_free_cmd -> target_wait_free_cmd ->
> > __transport_wait_for_tasks like you posted above.
> 
> That's wrong, it's the transport_generic_free_cmd wait:
> 
>         if (aborted) {
>                 pr_debug("Detected CMD_T_ABORTED for ITT: %llu\n", cmd->tag);
>                 wait_for_completion(&compl);
>                 ret = 1;

I'm not sure I'm following you. The wait on master comes from this:

 core_tmr_abort_task():
  [...]
  target_put_cmd_and_wait(se_cmd);
  [...]

IMO it's a kinda legit wait. And we can't just drop the references and
call it a day, because the request has to go through
target_handle_abort(), which (at least) does some TAS magic. And
currently there is no code that calls target_handle_abort() for
WRITE_PENDING requests.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 18:58         ` Konstantin Shelekhin
@ 2021-10-13 19:01           ` Konstantin Shelekhin
  2021-10-13 20:21             ` Mike Christie
  0 siblings, 1 reply; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-13 19:01 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Wed, Oct 13, 2021 at 09:58:32PM +0300, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 01:30:32PM -0500, Mike Christie wrote:
> > On 10/13/21 1:24 PM, Mike Christie wrote:
> > > On 10/13/21 1:08 PM, Konstantin Shelekhin wrote:
> > >> On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
> > >>> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
> > >>>> Hi,
> > >>>>
> > >>>> I really need the collective wisdom.
> > >>>>
> > >>>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
> > >>>> handling. Currently it's not possible to abort a WRITE_10 command in
> > >>>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
> > >>>> the process:
> > >>>>
> > >>>>   # dmesg | tail -2
> > >>>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
> > >>>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
> > >>>>
> > >>>>   # ps aux | awk '$8 ~/D/'
> > >>>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
> > >>>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
> > >>>>
> > >>>>   # cat /proc/32/stack
> > >>>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
> > >>>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
> > >>>>   [<0>] target_tmr_work+0x9e/0xe0
> > >>>>   [<0>] process_one_work+0x1d4/0x370
> > >>>>   [<0>] worker_thread+0x48/0x3d0
> > >>>>   [<0>] kthread+0x122/0x140
> > >>>>   [<0>] ret_from_fork+0x22/0x30
> > >>>>
> > >>>>   # cat /proc/1187/stack
> > >>>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
> > >>>>   [<0>] transport_generic_free_cmd+0xe9/0x180
> > >>>>   [<0>] iscsit_free_cmd+0x50/0xb0
> > >>>>   [<0>] iscsit_close_connection+0x47d/0x8c0
> > >>>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
> > >>>>   [<0>] iscsi_target_tx_thread+0x184/0x200
> > >>>>   [<0>] kthread+0x122/0x140
> > >>>>   [<0>] ret_from_fork+0x22/0x30
> > >>>>
> > >>>> What happens:
> > >>>>
> > >>>>   1. Initiator sends WRITE_10 CDB
> > >>>>   2. Target parses the CDB and sends R2T
> > >>>>   3. Target starts the Data-Out timer
> > >>>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
> > >>>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
> > >>>>      and starts waiting for the request completion
> > >>>>   6. Nothing happens
> > >>>>   7. The Data-Out timers expires, connection teardown starts and gets
> > >>>>      stuck waiting for ABORT TASK that waits for WRITE_10
> > >>>>
> > >>>> The ABORT TASK processing looks roughly like this:
> > >>>>
> > >>>>   iscsi_rx_opcode
> > >>>>     iscsi_handle_task_mgt_cmd
> > >>>>       iscsi_tmr_abort_task
> > >>>>       transport_generic_handle_tmr
> > >>>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
> > >>>>           target_handle_abort
> > >>>>         else
> > >>>>           target_tmr_work
> > >>>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
> > >>>>               target_handle_abort
> > >>>>             else
> > >>>>               core_tmr_abort_task
> > >>>>                 ret = __target_check_io_state
> > >>>>                   if (write_cmd->transport_state & CMD_T_STOP)
> > >>>>                     return -1
> > >>>>                   write_cmd->transport_state |= CMD_T_ABORTED
> > >>>>                   return 0
> > >>>>                 if (!ret)
> > >>>>                   list_move_tail(&write_cmd->state_list, &aborted)
> > >>>>                   target_put_cmd_and_wait(&write_cmd)
> > >>>>
> > >>>> As I see it, the main problem is that the abort path can't initiate the
> > >>>> command termination, it simply waits for the request to handle this on
> > >>>> the execution path like in target_execute_cmd():
> > >>>>
> > >>>>   target_execute_cmd
> > >>>>     target_cmd_interrupted
> > >>>>       INIT_WORK(&cmd->work, target_abort_work)
> > >>>>
> > >>>> However, in this case the request is not going to be executed because
> > >>>> Initiator will not send the Data-Out buffer.
> > >>>>
> > >>>> I have a couple of ideas on how to fix this, but they all look kinda
> > >>>> ugly. The one that currently works around this for me:
> > >>>>
> > >>>>   core_tmr_abort_task():
> > >>>>
> > >>>>     [...]
> > >>>>
> > >>>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
> > >>>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
> > >>>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> > >>>>     
> > >>>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
> > >>>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
> > >>>>     
> > >>>>     target_put_cmd_and_wait(se_cmd);
> > >>>>
> > >>>>     [...]
> > >>>>
> > >>>> The new method abort_write_pending() is defined only for iSCSI and calls
> > >>>> target_handle_abort(). However, this opens up another can of worms
> > >>>> because this code heavily races with R2T sending and requires a couple
> > >>>> of checks to "work most of the time". Not ideal, by far.
> > >>>>
> > >>>> I can make this one better by introducing R2T list draining that ensures
> > >>>> the proper order during cleanup, but maybe there is a much easier way
> > >>>> that I'm not seeing here.
> > >>>
> > >>> Ccing Maurizio to make sure I don't add his original bug back.
> > >>>
> > >>> If I understand you, I think I added this bug in:
> > >>>
> > >>> commit f36199355c64a39fe82cfddc7623d827c7e050da
> > >>> Author: Mike Christie <michael.christie@oracle.com>
> > >>> Date:   Fri Nov 13 19:46:18 2020 -0600
> > >>>
> > >>>     scsi: target: iscsi: Fix cmd abort fabric stop race
> > >>>
> > >>> With that patch if the abort or a lun reset has got to lio core then we
> > >>> are going to be stuck waiting for the data which won't come because we
> > >>> killed the iscsi threads.
> > >>>
> > >>> Can go back to always having the iscsi target clean up the cmd, but if
> > >>> LIO has started to abort the cmd we take an extra ref so we don't free
> > >>> the cmd from under each other.
> > >>>
> > >>> This patch is completely untested:
> > >>>
> > >>>
> > >>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> > >>> index 2c54c5d8412d..d221e9be7468 100644
> > >>> --- a/drivers/target/iscsi/iscsi_target.c
> > >>> +++ b/drivers/target/iscsi/iscsi_target.c
> > >>> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
> > >>>  			spin_lock_irq(&se_cmd->t_state_lock);
> > >>>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
> > >>>  				/*
> > >>> -				 * LIO's abort path owns the cleanup for this,
> > >>> -				 * so put it back on the list and let
> > >>> -				 * aborted_task handle it.
> > >>> +				 * The LIO TMR handler owns the cmd but if
> > >>> +				 * we were waiting for data from the initiator
> > >>> +				 * then we need to internally cleanup to be
> > >>> +				 * able to complete it. Get an extra ref so
> > >>> +				 * we don't free the cmd from under LIO core.
> > >>>  				 */
> > >>> -				list_move_tail(&cmd->i_conn_node,
> > >>> -					       &conn->conn_cmd_list);
> > >>> +				target_get_sess_cmd(se_cmd, false);
> > >>>  			} else {
> > >>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
> > >>>  			}
> > >>
> > >> The bug was there before. I had to backport this patch in order to
> > >> introduce my fix. I can revert it and check what is different, but it's
> > >> there in some form.
> > >>
> > > 
> > > Don't waste your time. It's because iscsit_free_cmd's call to
> > > transport_generic_free_cmd has wait_for_tasks=true.
> > > 
> > > We then do transport_generic_free_cmd -> target_wait_free_cmd ->
> > > __transport_wait_for_tasks like you posted above.
> > 
> > That's wrong, it's the transport_generic_free_cmd wait:
> > 
> >         if (aborted) {
> >                 pr_debug("Detected CMD_T_ABORTED for ITT: %llu\n", cmd->tag);
> >                 wait_for_completion(&compl);
> >                 ret = 1;
> 
> I'm not sure I'm following you. The wait on master comes from this:
> 
>  core_tmr_abort_task():
>   [...]
>   target_put_cmd_and_wait(se_cmd);
>   [...]
> 
> IMO it's kinda legit wait. And we can't just drop the references and
> call it a day, because a request has to go through the
> target_handle_abort() because it (at least) does some TAS magic. And
> currently there is no code that calls target_handle_abort() for
> WRITE_PENDING requests.

Ah, sorry, I should've mentioned it, I guess. The
transport_generic_free_cmd() bit is the connection teardown waiting on the ABORT TASK
that in turn waits for the WRITE_10 task. That part is completely fine IMO.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 19:01           ` Konstantin Shelekhin
@ 2021-10-13 20:21             ` Mike Christie
  2021-10-14 23:12               ` Konstantin Shelekhin
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Christie @ 2021-10-13 20:21 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/13/21 2:01 PM, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 09:58:32PM +0300, Konstantin Shelekhin wrote:
>> On Wed, Oct 13, 2021 at 01:30:32PM -0500, Mike Christie wrote:
>>> On 10/13/21 1:24 PM, Mike Christie wrote:
>>>> On 10/13/21 1:08 PM, Konstantin Shelekhin wrote:
>>>>> On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
>>>>>> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I really need the collective wisdom.
>>>>>>>
>>>>>>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
>>>>>>> handling. Currently it's not possible to abort a WRITE_10 command in
>>>>>>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
>>>>>>> the process:
>>>>>>>
>>>>>>>   # dmesg | tail -2
>>>>>>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>>>>>>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
>>>>>>>
>>>>>>>   # ps aux | awk '$8 ~/D/'
>>>>>>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>>>>>>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
>>>>>>>
>>>>>>>   # cat /proc/32/stack
>>>>>>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>>>>>>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>>>>>>>   [<0>] target_tmr_work+0x9e/0xe0
>>>>>>>   [<0>] process_one_work+0x1d4/0x370
>>>>>>>   [<0>] worker_thread+0x48/0x3d0
>>>>>>>   [<0>] kthread+0x122/0x140
>>>>>>>   [<0>] ret_from_fork+0x22/0x30
>>>>>>>
>>>>>>>   # cat /proc/1187/stack
>>>>>>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>>>>>>>   [<0>] transport_generic_free_cmd+0xe9/0x180
>>>>>>>   [<0>] iscsit_free_cmd+0x50/0xb0
>>>>>>>   [<0>] iscsit_close_connection+0x47d/0x8c0
>>>>>>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>>>>>>>   [<0>] iscsi_target_tx_thread+0x184/0x200
>>>>>>>   [<0>] kthread+0x122/0x140
>>>>>>>   [<0>] ret_from_fork+0x22/0x30
>>>>>>>
>>>>>>> What happens:
>>>>>>>
>>>>>>>   1. Initiator sends WRITE_10 CDB
>>>>>>>   2. Target parses the CDB and sends R2T
>>>>>>>   3. Target starts the Data-Out timer
>>>>>>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>>>>>>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>>>>>>>      and starts waiting for the request completion
>>>>>>>   6. Nothing happens
>>>>>>>   7. The Data-Out timers expires, connection teardown starts and gets
>>>>>>>      stuck waiting for ABORT TASK that waits for WRITE_10
>>>>>>>
>>>>>>> The ABORT TASK processing looks roughly like this:
>>>>>>>
>>>>>>>   iscsi_rx_opcode
>>>>>>>     iscsi_handle_task_mgt_cmd
>>>>>>>       iscsi_tmr_abort_task
>>>>>>>       transport_generic_handle_tmr
>>>>>>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>>>>>           target_handle_abort
>>>>>>>         else
>>>>>>>           target_tmr_work
>>>>>>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>>>>>               target_handle_abort
>>>>>>>             else
>>>>>>>               core_tmr_abort_task
>>>>>>>                 ret = __target_check_io_state
>>>>>>>                   if (write_cmd->transport_state & CMD_T_STOP)
>>>>>>>                     return -1
>>>>>>>                   write_cmd->transport_state |= CMD_T_ABORTED
>>>>>>>                   return 0
>>>>>>>                 if (!ret)
>>>>>>>                   list_move_tail(&write_cmd->state_list, &aborted)
>>>>>>>                   target_put_cmd_and_wait(&write_cmd)
>>>>>>>
>>>>>>> As I see it, the main problem is that the abort path can't initiate the
>>>>>>> command termination, it simply waits for the request to handle this on
>>>>>>> the execution path like in target_execute_cmd():
>>>>>>>
>>>>>>>   target_execute_cmd
>>>>>>>     target_cmd_interrupted
>>>>>>>       INIT_WORK(&cmd->work, target_abort_work)
>>>>>>>
>>>>>>> However, in this case the request is not going to be executed because
>>>>>>> Initiator will not send the Data-Out buffer.
>>>>>>>
>>>>>>> I have a couple of ideas on how to fix this, but they all look kinda
>>>>>>> ugly. The one that currently works around this for me:
>>>>>>>
>>>>>>>   core_tmr_abort_task():
>>>>>>>
>>>>>>>     [...]
>>>>>>>
>>>>>>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>>>>>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>>>>>>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>>>>>>     
>>>>>>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>>>>>>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>>>>>>>     
>>>>>>>     target_put_cmd_and_wait(se_cmd);
>>>>>>>
>>>>>>>     [...]
>>>>>>>
>>>>>>> The new method abort_write_pending() is defined only for iSCSI and calls
>>>>>>> target_handle_abort(). However, this opens up another can of worms
>>>>>>> because this code heavily races with R2T sending and requires a couple
>>>>>>> of checks to "work most of the time". Not ideal, by far.
>>>>>>>
>>>>>>> I can make this one better by introducing R2T list draining that ensures
>>>>>>> the proper order during cleanup, but maybe there is a much easier way
>>>>>>> that I'm not seeing here.
>>>>>>
>>>>>> Ccing Maurizio to make sure I don't add his original bug back.
>>>>>>
>>>>>> If I understand you, I think I added this bug in:
>>>>>>
>>>>>> commit f36199355c64a39fe82cfddc7623d827c7e050da
>>>>>> Author: Mike Christie <michael.christie@oracle.com>
>>>>>> Date:   Fri Nov 13 19:46:18 2020 -0600
>>>>>>
>>>>>>     scsi: target: iscsi: Fix cmd abort fabric stop race
>>>>>>
>>>>>> With that patch if the abort or a lun reset has got to lio core then we
>>>>>> are going to be stuck waiting for the data which won't come because we
>>>>>> killed the iscsi threads.
>>>>>>
>>>>>> Can go back to always having the iscsi target clean up the cmd, but if
>>>>>> LIO has started to abort the cmd we take an extra ref so we don't free
>>>>>> the cmd from under each other.
>>>>>>
>>>>>> This patch is completely untested:
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>>>>>> index 2c54c5d8412d..d221e9be7468 100644
>>>>>> --- a/drivers/target/iscsi/iscsi_target.c
>>>>>> +++ b/drivers/target/iscsi/iscsi_target.c
>>>>>> @@ -4090,12 +4090,13 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>>>>>  			spin_lock_irq(&se_cmd->t_state_lock);
>>>>>>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>>>>>>  				/*
>>>>>> -				 * LIO's abort path owns the cleanup for this,
>>>>>> -				 * so put it back on the list and let
>>>>>> -				 * aborted_task handle it.
>>>>>> +				 * The LIO TMR handler owns the cmd but if
>>>>>> +				 * we were waiting for data from the initiator
>>>>>> +				 * then we need to internally cleanup to be
>>>>>> +				 * able to complete it. Get an extra ref so
>>>>>> +				 * we don't free the cmd from under LIO core.
>>>>>>  				 */
>>>>>> -				list_move_tail(&cmd->i_conn_node,
>>>>>> -					       &conn->conn_cmd_list);
>>>>>> +				target_get_sess_cmd(se_cmd, false);
>>>>>>  			} else {
>>>>>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>>>>>>  			}
>>>>>
>>>>> The bug was there before. I had to backport this patch in order to
>>>>> introduce my fix. I can revert it and check what is different, but it's
>>>>> there in some form.
>>>>>
>>>>
>>>> Don't waste your time. It's because iscsit_free_cmd's call to
>>>> transport_generic_free_cmd has wait_for_tasks=true.
>>>>
>>>> We then do transport_generic_free_cmd -> target_wait_free_cmd ->
>>>> __transport_wait_for_tasks like you posted above.
>>>
>>> That's wrong, it's the transport_generic_free_cmd wait:
>>>
>>>         if (aborted) {
>>>                 pr_debug("Detected CMD_T_ABORTED for ITT: %llu\n", cmd->tag);
>>>                 wait_for_completion(&compl);
>>>                 ret = 1;
>>
>> I'm not sure I'm following you. The wait on master comes from this:
>>
>>  core_tmr_abort_task():
>>   [...]
>>   target_put_cmd_and_wait(se_cmd);
>>   [...]
>>
>> IMO it's kinda legit wait. And we can't just drop the references and
>> call it a day, because a request has to go through the
>> target_handle_abort() because it (at least) does some TAS magic. And
>> currently there is no code that calls target_handle_abort() for
>> WRITE_PENDING requests.
> 
> Ah, sorry, I should've mentioned it I guess. The
> transport_generic_free_cmd() bit is connection teardown waiting on ABORT TASK
> that waits for WRITE_10 task. That's part is completely fine IMO.
>


For lun resets, let's take the abort case.

1. core_tmr_abort_task is waiting in target_put_cmd_and_wait for the WRITE.
2. For the normal case, where the cmd has been passed to LIO core, the cmd
eventually completes and we do:

target_complete_cmd -> target_complete_cmd_with_sense -> target_cmd_interrupted
-> target_handle_abort. Ignoring TAS, that function then does the last puts on the
cmd which wakes up the TMR code in #1.

For the case where we never submit the cmd to LIO core, the puts will never
be done by that target_complete_cmd path. And in this case, where the iscsi connection
is closed, we know the cmd will never be submitted to LIO core, because the iscsi
threads have been killed. iscsi needs to do the last puts on the cmd so #1 wakes up
on that WRITE.

So we need a cmd bit to indicate that iscsi has called one of the submission/
execution functions in LIO. If it's not set, then iscsit_release_commands_from_conn/
iscsit_free_cmd needs to do those last puts.

Here in this example, to better show what I mean, we detect if an abort has been sent
to LIO core for a cmd that was never submitted to LIO core. If that happens, then
iscsit_free_cmd gets force_cleanup=true, so transport_generic_free_cmd will do a put
and not wait, and then iscsit_free_cmd does the last target_put_sess_cmd, which
will wake #1 above. The abort will then complete too.

Note:
I don't think we have to worry about TAS in this case, because we never got a complete
SCSI command. The iscsi layer only got part of it and never submitted it to the SCSI
layer. We then escalated to session level recovery so we are not sending any responses
for any of the cmds or TMFs. The transport is killed at this point, so no responses
can even be sent.


diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
index 2c54c5d8412d..d0e80a2b653b 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -4088,7 +4088,8 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 
 		if (se_cmd->se_tfo != NULL) {
 			spin_lock_irq(&se_cmd->t_state_lock);
-			if (se_cmd->transport_state & CMD_T_ABORTED) {
+			if (se_cmd->transport_state & CMD_T_ABORTED &&
+			    se_cmd->transport_state & CMD_T_SUBMITTED) {
 				/*
 				 * LIO's abort path owns the cleanup for this,
 				 * so put it back on the list and let
@@ -4096,7 +4097,7 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 				 */
 				list_move_tail(&cmd->i_conn_node,
 					       &conn->conn_cmd_list);
-			} else {
+			} else if (!(se_cmd->transport_state & CMD_T_ABORTED)) {
 				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
 			}
 			spin_unlock_irq(&se_cmd->t_state_lock);
@@ -4108,8 +4109,12 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 		list_del_init(&cmd->i_conn_node);
 
 		iscsit_increment_maxcmdsn(cmd, sess);
-		iscsit_free_cmd(cmd, true);
 
+		if (se_cmd->transport_state & CMD_T_ABORTED &&
+		    !(se_cmd->transport_state & CMD_T_SUBMITTED))
+			iscsit_free_cmd(cmd, false, true);
+		else
+			iscsit_free_cmd(cmd, true, false);
 	}
 }
 
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 6dd5810e2af1..931586595044 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -742,7 +742,7 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool check_queues)
 		conn->conn_transport->iscsit_unmap_cmd(conn, cmd);
 }
 
-void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
+void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown, bool force_cleanup)
 {
 	struct se_cmd *se_cmd = cmd->se_cmd.se_tfo ? &cmd->se_cmd : NULL;
 	int rc;
@@ -751,10 +751,14 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
 
 	__iscsit_free_cmd(cmd, shutdown);
 	if (se_cmd) {
-		rc = transport_generic_free_cmd(se_cmd, shutdown);
+		rc = transport_generic_free_cmd(se_cmd,
+					force_cleanup ? false : shutdown);
 		if (!rc && shutdown && se_cmd->se_sess) {
 			__iscsit_free_cmd(cmd, shutdown);
 			target_put_sess_cmd(se_cmd);
+		} else if (se_cmd->se_sess && force_cleanup) {
+			__iscsit_free_cmd(cmd, true);
+			target_put_sess_cmd(se_cmd);
 		}
 	} else {
 		iscsit_release_cmd(cmd);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 14c6f2bb1b01..eb233ea8db65 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -1554,7 +1554,7 @@ int transport_handle_cdb_direct(
 	 * this to be called for initial descriptor submission.
 	 */
 	cmd->t_state = TRANSPORT_NEW_CMD;
-	cmd->transport_state |= CMD_T_ACTIVE;
+	cmd->transport_state |= (CMD_T_ACTIVE | CMD_T_SUBMITTED);
 
 	/*
 	 * transport_generic_new_cmd() is already handling QUEUE_FULL,
@@ -2221,7 +2221,7 @@ void target_execute_cmd(struct se_cmd *cmd)
 		return;
 
 	spin_lock_irq(&cmd->t_state_lock);
 	cmd->t_state = TRANSPORT_PROCESSING;
-	cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT;
+	cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT | CMD_T_SUBMITTED;
 	spin_unlock_irq(&cmd->t_state_lock);
 
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index fb11c7693b25..b759ec810fa9 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -511,6 +511,7 @@ struct se_cmd {
 #define CMD_T_COMPLETE		(1 << 2)
 #define CMD_T_SENT		(1 << 4)
 #define CMD_T_STOP		(1 << 5)
+#define CMD_T_SUBMITTED		(1 << 6)
 #define CMD_T_TAS		(1 << 10)
 #define CMD_T_FABRIC_STOP	(1 << 11)
 	spinlock_t		t_state_lock;







* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 14:53   ` Konstantin Shelekhin
  2021-10-13 14:56     ` Konstantin Shelekhin
@ 2021-10-14  7:09     ` Hannes Reinecke
  2021-10-14  7:52       ` Konstantin Shelekhin
  1 sibling, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2021-10-14  7:09 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux

On 10/13/21 4:53 PM, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 04:22:41PM +0200, Hannes Reinecke wrote:
>> On 10/13/21 3:21 PM, Konstantin Shelekhin wrote:
>> Short answer: you can't.
>>
>> There is no feasible path in the I/O stack to abort running I/O; the
>> only chance you have here is to wait for it to time-out.
>>
>> We have run into similar issues, and found that the only sane solution
>> was to wait for the I/O to come back and then retry.
>> As this would take some time (30 seconds if you are unlucky) most
>> initiators will get unhappy and try to reset.
>> Which won't work, either, as the I/O is still stuck.
>> So we finally delayed relogin until all I/O was cleared.
>>
>> Not the best solution, but the only thing we can do in the absense of a
>> proper I/O abort mechanism.
> 
> I'm not sure we are talking about the same bug. In this case the relogin
> is not possible, because new connections are rejected by the target and
> the existing one is not going anywhere, because it's deadlocked on ABORT
> TASK. The only solution is to reset the server.
> 
Precisely.

Relogin fails as there is I/O outstanding on the original session, and
you try to relogin into the same session, which is still busy, hence you
cannot log in.

And I/O is outstanding as it can't be aborted, as the only transport 
implementing abort is target_core_user.c; for all the others you are 
screwed.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-14  7:09     ` Hannes Reinecke
@ 2021-10-14  7:52       ` Konstantin Shelekhin
  0 siblings, 0 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-14  7:52 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: target-devel, linux

On Thu, Oct 14, 2021 at 09:09:07AM +0200, Hannes Reinecke wrote:
> On 10/13/21 4:53 PM, Konstantin Shelekhin wrote:
> > On Wed, Oct 13, 2021 at 04:22:41PM +0200, Hannes Reinecke wrote:
> > > On 10/13/21 3:21 PM, Konstantin Shelekhin wrote:
> > > Short answer: you can't.
> > > 
> > > There is no feasible path in the I/O stack to abort running I/O; the
> > > only chance you have here is to wait for it to time-out.
> > > 
> > > We have run into similar issues, and found that the only sane solution
> > > was to wait for the I/O to come back and then retry.
> > > As this would take some time (30 seconds if you are unlucky) most
> > > initiators will get unhappy and try to reset.
> > > Which won't work, either, as the I/O is still stuck.
> > > So we finally delayed relogin until all I/O was cleared.
> > > 
> > > Not the best solution, but the only thing we can do in the absense of a
> > > proper I/O abort mechanism.
> > 
> > I'm not sure we are talking about the same bug. In this case the relogin
> > is not possible, because new connections are rejected by the target and
> > the existing one is not going anywhere, because it's deadlocked on ABORT
> > TASK. The only solution is to reset the server.
> > 
> Precisely.
> 
> Relogin fails as there is I/O outstanding on the original session, and you
> try to relogin into the same session. Which is still busy, hence you cannot
> login.
> 
> And I/O is outstanding as it can't be aborted, as the only transport
> implementing abort is target_core_user.c; for all the others you are
> screwed.

If I understand you correctly, you're talking about the very different
case where bios sent to a backend's block device get stuck. True, we can
do little in that case. In this case, however, there are no bios yet;
TCM is still waiting for the data from the initiator. We can do anything
we want here, because at this point TCM has complete control over the
request execution.
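
To make this concrete, here is a rough sketch of the distinction I mean
(not a proposed patch; the helper and its caller are hypothetical, only
the state names and the se_cmd fields are real):

  /*
   * Hypothetical helper, for illustration only. The caller is assumed
   * to hold se_cmd->t_state_lock.
   */
  static bool target_owns_cmd_completion(struct se_cmd *se_cmd)
  {
          switch (se_cmd->t_state) {
          case TRANSPORT_WRITE_PENDING:
                  /*
                   * No bios were issued yet; we are only waiting for
                   * Data-Out PDUs from the initiator, so the target can
                   * terminate the command on its own.
                   */
                  return true;
          case TRANSPORT_PROCESSING:
                  /*
                   * The command was handed to the backend; all we can
                   * do is wait for the I/O to come back.
                   */
                  return false;
          default:
                  return false;
          }
  }

The TRANSPORT_WRITE_PENDING branch is exactly the case this thread is
about.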


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-13 20:21             ` Mike Christie
@ 2021-10-14 23:12               ` Konstantin Shelekhin
  2021-10-15  3:18                 ` michael.christie
  0 siblings, 1 reply; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-14 23:12 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Wed, Oct 13, 2021 at 03:21:19PM -0500, Mike Christie wrote:
> For lun resets, let's take the abort case.
> 
> 1. core_tmr_abort_task is waiting in target_put_cmd_and_wait for the WRITE.
> 2. For the normal case where a cmd has been passed to LIO core then eventually
> the cmd completes and we do:
> 
> target_complete_cmd -> target_complete_cmd_with_sense -> target_cmd_interrupted
> -> target_handle_abort. Ignoring TAS, that function then does the last puts on the
> cmd which wakes up the TMR code in #1.
> 
> For the case where we never submit the cmd to LIO core, then the puts will never
> be done by that target_complete_cmd path. And, in this case where the iscsi connection
> is closed we know the cmd will never be submitted to LIO core because the iscsi
> threads have been killed. iscsi needs to do the last puts on the cmd so #1 wakes up
> on that WRITE.
> 
> So we need a cmd bit to set to indicate iscsi has called one of the submission
> /execution functions in LIO. If it's not set then iscsit_release_commands_from_conn/
> iscsit_free_cmd needs to do those last puts.
> 
> Here in this example to better show what I mean, we detect if an abort has been sent
> to LIO core for a cmd that has not been sent to LIO core. If that happens then
> iscsit_free_cmd gets force_cleanup=true so transport_generic_free_cmd will do a put
> and not wait, and then iscsit_free_cmd does the last target_put_sess_cmd which
> will wake #1 above. The abort will then complete too.
> 
> Note:
> I don't think we have to worry about TAS in this case, because we never got a complete
> SCSI command. The iscsi layer only got part of it and never submitted it to the SCSI
> layer. We then escalated to session level recovery so we are not sending any responses
> for any of the cmds or TMFs. The transport is killed at this point, so no responses
> can even be sent.
> 
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 2c54c5d8412d..d0e80a2b653b 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -4088,7 +4088,8 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>  
>  		if (se_cmd->se_tfo != NULL) {
>  			spin_lock_irq(&se_cmd->t_state_lock);
> -			if (se_cmd->transport_state & CMD_T_ABORTED) {
> +			if (se_cmd->transport_state & CMD_T_ABORTED &&
> +			    se_cmd->transport_state & CMD_T_SUBMITTED) {
>  				/*
>  				 * LIO's abort path owns the cleanup for this,
>  				 * so put it back on the list and let
> @@ -4096,7 +4097,7 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>  				 */
>  				list_move_tail(&cmd->i_conn_node,
>  					       &conn->conn_cmd_list);
> -			} else {
> +			} else if (!(se_cmd->transport_state & CMD_T_ABORTED)) {
>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>  			}
>  			spin_unlock_irq(&se_cmd->t_state_lock);
> @@ -4108,8 +4109,12 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>  		list_del_init(&cmd->i_conn_node);
>  
>  		iscsit_increment_maxcmdsn(cmd, sess);
> -		iscsit_free_cmd(cmd, true);
>  
> +		if (se_cmd->transport_state & CMD_T_ABORTED &&
> +		    !(se_cmd->transport_state & CMD_T_SUBMITTED))
> +			iscsit_free_cmd(cmd, false, true);
> +		else
> +			iscsit_free_cmd(cmd, true, false);
>  	}
>  }
>  
> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
> index 6dd5810e2af1..931586595044 100644
> --- a/drivers/target/iscsi/iscsi_target_util.c
> +++ b/drivers/target/iscsi/iscsi_target_util.c
> @@ -742,7 +742,7 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool check_queues)
>  		conn->conn_transport->iscsit_unmap_cmd(conn, cmd);
>  }
>  
> -void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
> +void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown, bool force_cleanup)
>  {
>  	struct se_cmd *se_cmd = cmd->se_cmd.se_tfo ? &cmd->se_cmd : NULL;
>  	int rc;
> @@ -751,10 +751,14 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
>  
>  	__iscsit_free_cmd(cmd, shutdown);
>  	if (se_cmd) {
> -		rc = transport_generic_free_cmd(se_cmd, shutdown);
> +		rc = transport_generic_free_cmd(se_cmd,
> +					force_cleanup ? false : shutdown);
>  		if (!rc && shutdown && se_cmd->se_sess) {
>  			__iscsit_free_cmd(cmd, shutdown);
>  			target_put_sess_cmd(se_cmd);
> +		} else if (se_cmd->sess && force_cleanup) {
> +			__iscsit_free_cmd(cmd, true);
> +			target_put_sess_cmd(se_cmd);
>  		}
>  	} else {
>  		iscsit_release_cmd(cmd);
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 14c6f2bb1b01..eb233ea8db65 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -1554,7 +1554,7 @@ int transport_handle_cdb_direct(
>  	 * this to be called for initial descriptor submission.
>  	 */
>  	cmd->t_state = TRANSPORT_NEW_CMD;
> -	cmd->transport_state |= CMD_T_ACTIVE;
> +	cmd->transport_state |= (CMD_T_ACTIVE | CMD_T_SUBMITTED);
>  
>  	/*
>  	 * transport_generic_new_cmd() is already handling QUEUE_FULL,
> @@ -2221,7 +2221,7 @@ void target_execute_cmd(struct se_cmd *cmd)
>  		return;
>  
>  	spin_lock_irq(&cmd->t_state_lock);
> -	cmd->t_state = TRANSPORT_PROCESSING;
> +	cmd->t_state = (TRANSPORT_PROCESSING | CMD_T_SUBMITTED);
>  	cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT;
>  	spin_unlock_irq(&cmd->t_state_lock);
>  
> diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
> index fb11c7693b25..b759ec810fa9 100644
> --- a/include/target/target_core_base.h
> +++ b/include/target/target_core_base.h
> @@ -511,6 +511,7 @@ struct se_cmd {
>  #define CMD_T_COMPLETE		(1 << 2)
>  #define CMD_T_SENT		(1 << 4)
>  #define CMD_T_STOP		(1 << 5)
> +#define CMD_T_SUBMITTED		(1 << 6)
>  #define CMD_T_TAS		(1 << 10)
>  #define CMD_T_FABRIC_STOP	(1 << 11)
>  	spinlock_t		t_state_lock;
> 
> 
> 
> 
> 

If I understand this approach correctly, it fixes the deadlock, but the
connection reinstatement will still happen, because WRITE_10 won't be
aborted and the connection will go down after the timeout.

IMO it's not ideal either, since now iSCSI will have a 50% chance to
have the connection (meaning SCSI session) killed on an arbitrary ABORT
TASK. While I'm sure most initiators will be able to recover from this
event, such drastic measures will certainly cause a lot of confusion for
people who are not familiar with TCM internals.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-14 23:12               ` Konstantin Shelekhin
@ 2021-10-15  3:18                 ` michael.christie
  2021-10-18 11:56                   ` Konstantin Shelekhin
  0 siblings, 1 reply; 26+ messages in thread
From: michael.christie @ 2021-10-15  3:18 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/14/21 6:12 PM, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 03:21:19PM -0500, Mike Christie wrote:
>> For lun resets, let's take the abort case.
>>
>> 1. core_tmr_abort_task is waiting in target_put_cmd_and_wait for the WRITE.
>> 2. For the normal case where a cmd has been passed to LIO core then eventually
>> the cmd completes and we do:
>>
>> target_complete_cmd -> target_complete_cmd_with_sense -> target_cmd_interrupted
>> -> target_handle_abort. Ignoring TAS, that function then does the last puts on the
>> cmd which wakes up the TMR code in #1.
>>
>> For the case where we never submit the cmd to LIO core, then the puts will never
>> be done by that target_complete_cmd path. And, in this case where the iscsi connection
>> is closed we know the cmd will never be submitted to LIO core because the iscsi
>> threads have been killed. iscsi needs to do the last puts on the cmd so #1 wakes up
>> on that WRITE.
>>
>> So we need a cmd bit to set to indicate iscsi has called one of the submission
>> /execution functions in LIO. If it's not set then iscsit_release_commands_from_conn/
>> iscsit_free_cmd needs to do those last puts.
>>
>> Here in this example to better show what I mean, we detect if an abort has been sent
>> to LIO core for a cmd that has not been sent to LIO core. If that happens then
>> iscsit_free_cmd gets force_cleanup=true so transport_generic_free_cmd will do a put
>> and not wait, and then iscsit_free_cmd does the last target_put_sess_cmd which
>> will wake #1 above. The abort will then complete too.
>>
>> Note:
>> I don't think we have to worry about TAS in this case, because we never got a complete
>> SCSI command. The iscsi layer only got part of it and never submitted it to the SCSI
>> layer. We then escalated to session level recovery so we are not sending any responses
>> for any of the cmds or TMFs. The transport is killed at this point, so no responses
>> can even be sent.
>>
>>
>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>> index 2c54c5d8412d..d0e80a2b653b 100644
>> --- a/drivers/target/iscsi/iscsi_target.c
>> +++ b/drivers/target/iscsi/iscsi_target.c
>> @@ -4088,7 +4088,8 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>  
>>  		if (se_cmd->se_tfo != NULL) {
>>  			spin_lock_irq(&se_cmd->t_state_lock);
>> -			if (se_cmd->transport_state & CMD_T_ABORTED) {
>> +			if (se_cmd->transport_state & CMD_T_ABORTED &&
>> +			    se_cmd->transport_state & CMD_T_SUBMITTED) {
>>  				/*
>>  				 * LIO's abort path owns the cleanup for this,
>>  				 * so put it back on the list and let
>> @@ -4096,7 +4097,7 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>  				 */
>>  				list_move_tail(&cmd->i_conn_node,
>>  					       &conn->conn_cmd_list);
>> -			} else {
>> +			} else if (!(se_cmd->transport_state & CMD_T_ABORTED)) {
>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>>  			}
>>  			spin_unlock_irq(&se_cmd->t_state_lock);
>> @@ -4108,8 +4109,12 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>  		list_del_init(&cmd->i_conn_node);
>>  
>>  		iscsit_increment_maxcmdsn(cmd, sess);
>> -		iscsit_free_cmd(cmd, true);
>>  
>> +		if (se_cmd->transport_state & CMD_T_ABORTED &&
>> +		    !(se_cmd->transport_state & CMD_T_SUBMITTED))
>> +			iscsit_free_cmd(cmd, false, true);
>> +		else
>> +			iscsit_free_cmd(cmd, true, false);
>>  	}
>>  }
>>  
>> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
>> index 6dd5810e2af1..931586595044 100644
>> --- a/drivers/target/iscsi/iscsi_target_util.c
>> +++ b/drivers/target/iscsi/iscsi_target_util.c
>> @@ -742,7 +742,7 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool check_queues)
>>  		conn->conn_transport->iscsit_unmap_cmd(conn, cmd);
>>  }
>>  
>> -void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
>> +void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown, bool force_cleanup)
>>  {
>>  	struct se_cmd *se_cmd = cmd->se_cmd.se_tfo ? &cmd->se_cmd : NULL;
>>  	int rc;
>> @@ -751,10 +751,14 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
>>  
>>  	__iscsit_free_cmd(cmd, shutdown);
>>  	if (se_cmd) {
>> -		rc = transport_generic_free_cmd(se_cmd, shutdown);
>> +		rc = transport_generic_free_cmd(se_cmd,
>> +					force_cleanup ? false : shutdown);
>>  		if (!rc && shutdown && se_cmd->se_sess) {
>>  			__iscsit_free_cmd(cmd, shutdown);
>>  			target_put_sess_cmd(se_cmd);
>> +		} else if (se_cmd->sess && force_cleanup) {
>> +			__iscsit_free_cmd(cmd, true);
>> +			target_put_sess_cmd(se_cmd);
>>  		}
>>  	} else {
>>  		iscsit_release_cmd(cmd);
>> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
>> index 14c6f2bb1b01..eb233ea8db65 100644
>> --- a/drivers/target/target_core_transport.c
>> +++ b/drivers/target/target_core_transport.c
>> @@ -1554,7 +1554,7 @@ int transport_handle_cdb_direct(
>>  	 * this to be called for initial descriptor submission.
>>  	 */
>>  	cmd->t_state = TRANSPORT_NEW_CMD;
>> -	cmd->transport_state |= CMD_T_ACTIVE;
>> +	cmd->transport_state |= (CMD_T_ACTIVE | CMD_T_SUBMITTED);
>>  
>>  	/*
>>  	 * transport_generic_new_cmd() is already handling QUEUE_FULL,
>> @@ -2221,7 +2221,7 @@ void target_execute_cmd(struct se_cmd *cmd)
>>  		return;
>>  
>>  	spin_lock_irq(&cmd->t_state_lock);
>> -	cmd->t_state = TRANSPORT_PROCESSING;
>> +	cmd->t_state = (TRANSPORT_PROCESSING | CMD_T_SUBMITTED);
>>  	cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT;
>>  	spin_unlock_irq(&cmd->t_state_lock);
>>  
>> diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
>> index fb11c7693b25..b759ec810fa9 100644
>> --- a/include/target/target_core_base.h
>> +++ b/include/target/target_core_base.h
>> @@ -511,6 +511,7 @@ struct se_cmd {
>>  #define CMD_T_COMPLETE		(1 << 2)
>>  #define CMD_T_SENT		(1 << 4)
>>  #define CMD_T_STOP		(1 << 5)
>> +#define CMD_T_SUBMITTED		(1 << 6)
>>  #define CMD_T_TAS		(1 << 10)
>>  #define CMD_T_FABRIC_STOP	(1 << 11)
>>  	spinlock_t		t_state_lock;
>>
>>
>>
>>
>>
> 
> If I understand this aproach correctly, it fixes the deadlock, but the
> connection reinstatement will still happen, because WRITE_10 won't be
> aborted and the connection will go down after the timeout.> 
> IMO it's not ideal either, since now iSCSI will have a 50% chance to
> have the connection (meaning SCSI session) killed on arbitrary ABOR

I wouldn't call this an arbitrary abort. It's indicating a problem.
When do you see this? Why do we need to fix it per cmd? Are you hitting
the big command short timeout issue? Driver/fw bug?

> TASK. While I'm sure most initiators will be able to recover from this
> event, such drastic measures will certanly cause a lot of confusion for
> people who are not familiar with TCM internals
How will this cause confusion vs the case where the cmd reaches the target
and we are waiting for it on the backend? In both cases, the initiator sends
an abort, it times out, the initiator or target drop the connection, we
relogin. Every initiator handles this.

With that said, I am in favor of you fixing the code so we can clean up
a partially sent cmd if it can be done sanely.

I personally would just leave the current behavior and fix the deadlock
because:

1. When I see this happening it's normally the network so we have to blow
away the group of commands and we end up dropping the connection one way
or another. I don't see the big command short timeout case often anymore.

2. Initiators just did not implement this right. I know this for sure
for open-iscsi at least. I started to fix my screw-ups the other day, but it
ended up breaking the targets.

For example,

- If we've sent an R2T and the initiator has sent a LUN RESET, what are
you going to have the target do? Send the response right away?

- If we've sent an R2T and the initiator has sent some of the data
PDUs to fulfill it but has not sent the final PDU, then it sends the
LUN RESET, what do you do?

- You also have the immediate data case and the InitialR2T case.

The updated specs clarify how to handle this, and even have a FastAbort
key to specify which behavior we are going to do. But we don't support
it and I don't think many people implemented it.

So you are going to get a mix of behavior. Some initiators will send the
RESET and still send the data out PDUs and some will just stop sending
data outs after the RESET. To be safe, do you wait for the initiator to
complete the sequence of data out PDUs? If so, then you probably just hit
the same issue where we don't get the needed PDUs and one side drops
the connection. If we send the ABORT response while the initiator is
still sending data outs, then we risk breaking them.

If you want to do it then go for it, but to answer your original email's
question, the only easy way out is to just let it time out :)


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-15  3:18                 ` michael.christie
@ 2021-10-18 11:56                   ` Konstantin Shelekhin
  2021-10-18 16:29                     ` Mike Christie
  0 siblings, 1 reply; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-18 11:56 UTC (permalink / raw)
  To: michael.christie; +Cc: target-devel, linux, Maurizio Lombardi

On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
> > If I understand this aproach correctly, it fixes the deadlock, but the
> > connection reinstatement will still happen, because WRITE_10 won't be
> > aborted and the connection will go down after the timeout.> 
> > IMO it's not ideal either, since now iSCSI will have a 50% chance to
> > have the connection (meaning SCSI session) killed on arbitrary ABOR
> 
> I wouldn't call this an arbitrary abort. It's indicating a problem.
> When do you see this? Why do we need to fix it per cmd? Are you hitting
> the big command short timeout issue? Driver/fw bug?

It was triggered by ESXi. During some heavy IOPS intervals the backend
device cannot handle the load and some IOs get stuck for more than 30
seconds. I suspect that the ABORT TASKs are issued by the virtual machines.
So a series of ABORT TASKs will come, and the unlucky one will hit the
issue.
 
> > TASK. While I'm sure most initiators will be able to recover from this
> > event, such drastic measures will certanly cause a lot of confusion for
> > people who are not familiar with TCM internals
> How will this cause confusion vs the case where the cmd reaches the target
> and we are waiting for it on the backend? In both cases, the initiator sends
> an abort, it times out, the initiator or target drop the connection, we
> relogin. Every initiator handles this.

Because usually (when a WRITE request is past the WRITE PENDING state)
the ABORT TASK does not trigger relogin. In my experience the initiator
just waits for the TMR completion and goes on.

And from a blackbox perspective it looks suspicious:

  1. ABORT TASK sent to WRITE_10 tag 0x1; waits for its completion
  2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the connection is dropped

The only difference between #1 and #2 is that the command 0x1 is past
the WRITE PENDING state.

> With that said I am in favor of you fixing the code so we can cleanup
> a partially sent cmd if it can be done sanely.
> 
> I personally would just leave the current behavior and fix the deadlock
> because:
> 
> 1. When I see this happening it's normally the network so we have to blow
> away the group of commands and we end up dropping the connection one way
> or another. I don't see the big command short timeout case often anymore.
> 
> 2. Initiators just did not implement this right. I know this for sure
> for open-iscsi at least. I started to fix my screw ups the other day but it
> ends up breaking the targets.
> 
> For example,
> 
> - If we've sent a R2T and the initiator has sent a LUN RESET, what are
> you going to have the target do? Send the response right away?

AFAIR the spec says "nuke it, there will be no data after this".

> - If we've sent a R2T and the initiator has sent some of the data
> PDUs to full fill it but has not sent the final PDU, then it sends the
> LUN RESET, what do you do?

The same. However, I understand the interoperability concerns. I'll
check what other targets do.
 
> - You also have the immediate data case and the InitialR2T case.

True.
 
> The updated specs clarify how to handle this, and even have a FastAbort
> key to specify which behavior we are going to do. But we don't support
> it and I don't think many people implemented it.
> 
> So you are going to get a mix of behavior. Some initiators will send the
> RESET and still send the data out PDUs and some will just stop sending
> data outs after the RESET. To be safe do you wait for the initiator to
> complete the sequence of data out PDUs? If so then you probably just hit
> the same issue where we don't get the needed PDUs and the one side drops
> the connection. If we send the ABORT response while the initiator is
> still sending data outs, then we risk breaking them.
> 
> If you want to do it then go for it, but to answer you original email's
> question the only easy way out is to just let it time out :)

Sounds reasonable. I'll test your solution.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 11:56                   ` Konstantin Shelekhin
@ 2021-10-18 16:29                     ` Mike Christie
  2021-10-18 17:08                       ` Mike Christie
  2021-10-18 17:32                       ` Konstantin Shelekhin
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Christie @ 2021-10-18 16:29 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
>>> If I understand this aproach correctly, it fixes the deadlock, but the
>>> connection reinstatement will still happen, because WRITE_10 won't be
>>> aborted and the connection will go down after the timeout.> 
>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>> have the connection (meaning SCSI session) killed on arbitrary ABOR
>>
>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>> the big command short timeout issue? Driver/fw bug?
> 
> It was triggered by ESXi. During some heavy IOPS intervals the backend
> device cannot handle the load and some IOs get stuck for more than 30
> seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
> So a series of ABORT TASK will come, and the unlucky one will hit the
> issue.

I didn't get this. If only the backend is backed up, then we should
still be transmitting the data outs/R2Ts quickly and we shouldn't be
hitting the issue where we get stuck waiting on them.

>  
>>> TASK. While I'm sure most initiators will be able to recover from this
>>> event, such drastic measures will certanly cause a lot of confusion for
>>> people who are not familiar with TCM internals
>> How will this cause confusion vs the case where the cmd reaches the target
>> and we are waiting for it on the backend? In both cases, the initiator sends
>> an abort, it times out, the initiator or target drop the connection, we
>> relogin. Every initiator handles this.
> 
> Because usually (when a WRITE request is past the WRITE PENDING state)

Ah, I think we were talking about different things here. I thought you meant
users, and I was just saying they wouldn't see a difference. But for ESXi
it's going to work differently than I was thinking. I thought the initiator
was going to escalate to LUN RESET, and then we would hit the issue I mention
below in the FastAbort part of the mail, where we end up dropping the
connection waiting on the data outs.


> the ABORT TASK does not trigger relogin. In my experience the initiator
> just waits for the TMR completion and goes on.
> 
> And from a blackbox perspective it looks suspicious:
> 
>   1. ABORT TASK sent to WRITE_10 tag 0x1; waits for it's completion
>   2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the connection is dropped


I didn't get the part where the connection is dropped almost immediately.
If only the backend is backed up, what is dropping the connection right
away? The data out timers shouldn't be firing, right? It sounds like, from
the above, the network between the initiator and target was OK, so data outs
and R2Ts should be executing quickly like normal, right?

> 
> The only difference between #1 and #2 is that the command 0x1 is past
> the WRITE PENDING state.
> 
>> With that said I am in favor of you fixing the code so we can cleanup
>> a partially sent cmd if it can be done sanely.
>>
>> I personally would just leave the current behavior and fix the deadlock
>> because:
>>
>> 1. When I see this happening it's normally the network so we have to blow
>> away the group of commands and we end up dropping the connection one way
>> or another. I don't see the big command short timeout case often anymore.
>>
>> 2. Initiators just did not implement this right. I know this for sure
>> for open-iscsi at least. I started to fix my screw ups the other day but it
>> ends up breaking the targets.
>>
>> For example,
>>
>> - If we've sent a R2T and the initiator has sent a LUN RESET, what are
>> you going to have the target do? Send the response right away?
> 
> AFAIR the spec says "nuke it, there will be no data after this".> 
>> - If we've sent a R2T and the initiator has sent some of the data
>> PDUs to full fill it but has not sent the final PDU, then it sends the
>> LUN RESET, what do you do?
> 
> The same. However, I understand the interoperability concerns. I'll
> check what other targets do
I think maybe you are replying about aborts, but I was asking about
LUN RESET, which is the opposite but will also hit the same hang if the
connection is dropped after one is sent.

For aborts it works like you wrote above. For LUN RESET it's the opposite.
In RFC 3720 it doesn't say how to handle aborts, but it came up on the pdl
lists and they said the equivalent of your "nuke it". However, for TMFs
that affect multiple tasks they clarified it in later versions of the
specs.

In the original it only says how to handle abort/clear task set, but in

https://datatracker.ietf.org/doc/html/rfc5048

the behavior was clarified, and in RFC 7143 we have the original/default
way:

https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3

which says to wait for the data outs.

And then we have FastAbort which is nuke it:

https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.4
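
To make the difference concrete, here is a self-contained sketch of the
two behaviors (plain userspace C, not LIO code; all of the names below
are made up):

  #include <stdbool.h>
  #include <stdio.h>

  enum task_reporting {
          TASK_REPORTING_RFC3720,         /* default: wait for open Data-Out sequences */
          TASK_REPORTING_FASTABORT,       /* FastAbort: terminate right away */
  };

  static void send_tmf_response(void)
  {
          printf("send TMF response\n");
  }

  /*
   * A TMF affecting multiple tasks (e.g. LUN RESET) arrives while some
   * of the affected WRITEs still have open Data-Out sequences.
   */
  static void handle_multi_task_tmf(enum task_reporting mode, bool open_data_outs)
  {
          if (!open_data_outs || mode == TASK_REPORTING_FASTABORT) {
                  /* Nuke the affected tasks and answer immediately. */
                  send_tmf_response();
                  return;
          }
          /*
           * Original RFC 3720 behavior: hold the TMF response until the
           * initiator finishes the Data-Out sequences it already
           * started (or the Data-Out timer fires and we escalate).
           */
          printf("wait for outstanding Data-Out PDUs\n");
          send_tmf_response();
  }

  int main(void)
  {
          handle_multi_task_tmf(TASK_REPORTING_RFC3720, true);
          handle_multi_task_tmf(TASK_REPORTING_FASTABORT, true);
          return 0;
  }

Which branch the initiator actually expects depends on that negotiated
key, which is what the TaskReporting=FastAbort point below is about.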

>  
>> - You also have the immediate data case and the InitialR2T case.
> 
> True.
>  
>> The updated specs clarify how to handle this, and even have a FastAbort
>> key to specify which behavior we are going to do. But we don't support
>> it and I don't think many people implemented it.


So here I was saying I don't think anyone implemented the ability to
negotiate TaskReporting=FastAbort, so most might do the original behavior.
If that's right, then we just end up dropping the connection for Windows,
Linux and newer versions of ESXi (if that session drop setting is set; I
can't remember the setting) when they end up timing out their TMFs.


* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 16:29                     ` Mike Christie
@ 2021-10-18 17:08                       ` Mike Christie
  2021-10-26 10:59                         ` Konstantin Shelekhin
  2021-10-18 17:32                       ` Konstantin Shelekhin
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Christie @ 2021-10-18 17:08 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/18/21 11:29 AM, Mike Christie wrote:
> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
>>>> If I understand this aproach correctly, it fixes the deadlock, but the
>>>> connection reinstatement will still happen, because WRITE_10 won't be
>>>> aborted and the connection will go down after the timeout.> 
>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>> have the connection (meaning SCSI session) killed on arbitrary ABOR
>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>>> the big command short timeout issue? Driver/fw bug?
>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>> device cannot handle the load and some IOs get stuck for more than 30
>> seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
>> So a series of ABORT TASK will come, and the unlucky one will hit the
>> issue.
> I didn't get this. If only the backend is backed up then we should
> still be transmitting the data out/R2Ts quickly and we shouldn't be
> hitting the issue where we got stuck waiting on them.
> 

Oh wait, I just remembered the bug you might be hitting. If you
are using iblock for example, then when the iscsi target calls
into LIO core to submit the cmd we can end up calling into the block
layer and blocking on a full queue (hitting the nr_requests limit).
The iscsi layer is then not able to do its normal R2T/DataOut handling
because one of the iscsi threads is stuck.

I'll send a patch to fix this issue.

We should still fix your TMF hang issue.
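
In case it helps reproduce the backed-up-queue case: the limit in question is
the per-device request queue depth exposed in sysfs, and shrinking it makes
the full-queue condition much easier to hit. The device name below is just an
example, and the default value varies by device and I/O scheduler:

  # cat /sys/block/sdb/queue/nr_requests
  # echo 4 > /sys/block/sdb/queue/nr_requests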

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 16:29                     ` Mike Christie
  2021-10-18 17:08                       ` Mike Christie
@ 2021-10-18 17:32                       ` Konstantin Shelekhin
  2021-10-18 20:20                         ` Mike Christie
  1 sibling, 1 reply; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-18 17:32 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> > On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
> >>> If I understand this aproach correctly, it fixes the deadlock, but the
> >>> connection reinstatement will still happen, because WRITE_10 won't be
> >>> aborted and the connection will go down after the timeout.> 
> >>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
> >>> have the connection (meaning SCSI session) killed on arbitrary ABOR
> >>
> >> I wouldn't call this an arbitrary abort. It's indicating a problem.
> >> When do you see this? Why do we need to fix it per cmd? Are you hitting
> >> the big command short timeout issue? Driver/fw bug?
> > 
> > It was triggered by ESXi. During some heavy IOPS intervals the backend
> > device cannot handle the load and some IOs get stuck for more than 30
> > seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
> > So a series of ABORT TASK will come, and the unlucky one will hit the
> > issue.
> 
> I didn't get this. If only the backend is backed up then we should
> still be transmitting the data out/R2Ts quickly and we shouldn't be
> hitting the issue where we got stuck waiting on them.

We get stuck waiting on them because the initiator will not send Data-Out
PDUs after sending ABORT TASK:

  1. Initiator sends WRITE CDB
  2. Target sends R2T
  3. Almost immediately Initiator decides to abort the request and sends
     ABORT TASK without sending any further Data-Out PDUs (except maybe
     the first one); I believe it happens because the initiator tries to
     abort a larger batch of requests, and this unlucky request is just
     the last in the series
  4. Target still waits for Data-Out PDUs and times out on the Data-Out timer

The problem is that between #3 and #4 there is no code that will
actually abort the task, meaning stop the Data-Out timer, send the
response if TAS is required, and so on.
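
To make that missing step concrete, here is a minimal userspace model of the
decision such an abort path would need to make (plain C, not TCM code; every
name in it is made up for illustration):

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical command states: still waiting for Data-Out PDUs vs.
   * already handed to the backend for execution. */
  enum cmd_state { CMD_WRITE_PENDING, CMD_EXECUTING };

  struct cmd {
      enum cmd_state state;
      bool tas_requested;   /* initiator asked for TASK ABORTED status */
  };

  static void abort_task(struct cmd *cmd)
  {
      if (cmd->state == CMD_WRITE_PENDING) {
          /* No Data-Out will ever arrive, so the abort path has to
           * terminate the command itself. */
          printf("stop the Data-Out timer, invalidate the outstanding R2T\n");
          if (cmd->tas_requested)
              printf("queue TASK ABORTED status for the WRITE\n");
          printf("complete the cmd locally, then complete the ABORT TASK TMF\n");
      } else {
          /* The backend owns the command: mark it aborted and wait for
           * the backend to complete it, as the code does today. */
          printf("mark the cmd aborted and wait for backend completion\n");
      }
  }

  int main(void)
  {
      struct cmd stuck = { .state = CMD_WRITE_PENDING, .tas_requested = true };

      abort_task(&stuck);
      return 0;
  }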

> >>> TASK. While I'm sure most initiators will be able to recover from this
> >>> event, such drastic measures will certanly cause a lot of confusion for
> >>> people who are not familiar with TCM internals
> >> How will this cause confusion vs the case where the cmd reaches the target
> >> and we are waiting for it on the backend? In both cases, the initiator sends
> >> an abort, it times out, the initiator or target drop the connection, we
> >> relogin. Every initiator handles this.
> > 
> > Because usually (when a WRITE request is past the WRITE PENDING state)
> 
> Ah I think we were talking about different things here. I thought you meant
> users and I was just saying they wouldn't see a difference. But for ESXi
> it's going to work differently than I was thinking. I thought the initiator
> was going to escalate to LUN RESET then we hit the issue I mention
> below in the FastAbort part of the mail where we end up dropping the
> connection waiting on the data outs.

Oh, I see.

> > the ABORT TASK does not trigger relogin. In my experience the initiator
> > just waits for the TMR completion and goes on.
> > 
> > And from a blackbox perspective it looks suspicious:
> > 
> >   1. ABORT TASK sent to WRITE_10 tag 0x1; waits for its completion
> >   2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the connection is dropped
> 
> 
> I didn't get this part where the connection is dropped almost immediately.
> If only the backend is backed up, what is dropping the connection right
> away? The data out timers shouldn't be firing right? It sounds like above
> the network between the initiator and target were ok so data outs and R2Ts
> should be executing quickly like normal right?

I was talking about the patch you proposed. Waiting for the Data-Out
timeout means that reconnection will be triggered, and this creates a
duality of sorts. If ABORT TASK was issued after we received all the
Data-Out PDUs, the target will wait for the WRITE request to complete.
But if we didn't receive them, the target will just wait until the
Data-Out timer expires and then close the session.
 
> > The only difference between #1 and #2 is that the command 0x1 is past
> > the WRITE PENDING state.
> > 
> >> With that said I am in favor of you fixing the code so we can cleanup
> >> a partially sent cmd if it can be done sanely.
> >>
> >> I personally would just leave the current behavior and fix the deadlock
> >> because:
> >>
> >> 1. When I see this happening it's normally the network so we have to blow
> >> away the group of commands and we end up dropping the connection one way
> >> or another. I don't see the big command short timeout case often anymore.
> >>
> >> 2. Initiators just did not implement this right. I know this for sure
> >> for open-iscsi at least. I started to fix my screw ups the other day but it
> >> ends up breaking the targets.
> >>
> >> For example,
> >>
> >> - If we've sent a R2T and the initiator has sent a LUN RESET, what are
> >> you going to have the target do? Send the response right away?
> > 
> > AFAIR the spec says "nuke it, there will be no data after this".
> >> - If we've sent a R2T and the initiator has sent some of the data
> >> PDUs to full fill it but has not sent the final PDU, then it sends the
> >> LUN RESET, what do you do?
> > 
> > The same. However, I understand the interoperability concerns. I'll
> > check what other targets do
> I think maybe you are replying about aborts, but I was asking about
> LUN RESET which is opposite but will also hit the same hang if the
> connection is dropped after one is sent.
> 
> For aborts it works like you wrote above. For LUN RESET it's opposite.
> In 3270, it doesn't say how to handle aborts, but on the pdl lists it
> came up and they said equivalent of your nuke it. However, for TMFs
> that affect multiple tasks they clarified it in later versions of the
> specs.
> 
> In the original it only says how to handle abort/clear task set, but in
> 
> https://datatracker.ietf.org/doc/html/rfc5048
> 
> the behavior was clarified and in 7143 we have the original/default
> way:
> 
> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3
> 
> which says to wait for the data outs.
> 
> And then we have FastAbort which is nuke it:
> 
> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.4

For Target it says the following even for ABORT TASK:

  a) MUST wait for responses on currently valid Target Transfer Tags
     of the affected tasks from the issuing initiator. MAY wait for
     responses on currently valid Target Transfer Tags of the affected
     tasks from third-party initiators.

So either ESXi violates the RFC or is just not RFC 7143 compliant. However,
I'm getting hit with this even on Linux. I'll try to get some TCP dumps.
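
A capture like the one below is usually enough to see the R2T/Data-Out/TMF
ordering on the wire; the interface name is just an example, and 3260 is the
default iSCSI port:

  # tcpdump -i eth0 -s 0 -w iscsi-abort.pcap 'tcp port 3260'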

> >> - You also have the immediate data case and the InitialR2T case.
> > 
> > True.
> >  
> >> The updated specs clarify how to handle this, and even have a FastAbort
> >> key to specify which behavior we are going to do. But we don't support
> >> it and I don't think many people implemented it.
> 
> So here I was saying I don't think anyone implemented the ability to
> negotiate for TaskReporting=FastAbort, so most might do the original behavior.
> If that's right then we just end up dropping the connection for windows and
> linux and newer versions of ESXi if that session drop setting is set
> (I can't remember the setting), when they end up timing out their TMFs.

Hmm... I'll check how long the initiator actually waits before killing
the connection.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 17:32                       ` Konstantin Shelekhin
@ 2021-10-18 20:20                         ` Mike Christie
  2021-10-18 20:34                           ` Mike Christie
  2021-10-18 21:48                           ` Konstantin Shelekhin
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Christie @ 2021-10-18 20:20 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
> On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
>> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
>>>>> If I understand this aproach correctly, it fixes the deadlock, but the
>>>>> connection reinstatement will still happen, because WRITE_10 won't be
>>>>> aborted and the connection will go down after the timeout.> 
>>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>>> have the connection (meaning SCSI session) killed on arbitrary ABOR
>>>>
>>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>>>> the big command short timeout issue? Driver/fw bug?
>>>
>>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>>> device cannot handle the load and some IOs get stuck for more than 30
>>> seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
>>> So a series of ABORT TASK will come, and the unlucky one will hit the
>>> issue.
>>
>> I didn't get this. If only the backend is backed up then we should
>> still be transmitting the data out/R2Ts quickly and we shouldn't be
>> hitting the issue where we got stuck waiting on them.
> 
> We get stuck waiting on them because the initiator will not send Data-Out

We are talking about different things here. Above I'm just asking about what
leads to the cmd timeout.

You wrote that before the abort is sent the backend gets backed up, and the
backup causes IO to take long enough for the initiator cmd timeout to fire.
I'm asking: before the initiator-side cmd timeout and before the abort is
sent, why aren't R2Ts/Data-Outs executing quickly if only the backend is backed up?

Is it the bug I mentioned where one of the iscsi threads is stuck on the
submission to the block layer, so that thread can't handle iscsi IO?
If so I have a patch for that.

I get that once the abort is sent we hit these other issues.


> PDUs after sending ABORT TASK:
> 
>   1. Initiator sends WRITE CDB
>   2. Target sends R2T
>   3. Almost immediately Initiator decides to abort the request and sends

Are you using iscsi in the VM or in the hypervisor? For the latter, is the
timeout 15 seconds for normal READs/WRITEs? What version of ESXi?

>      ABORT TASK without sending any further Data-Out PDUs (maybe except for
>      the first one); I believe it happens because the initiator tries to
>      abort a larger batch of requests, and this unlucky request is just
>      the last in series
>   4. Target still waits for Data-Out PDUs and times out on Data-Out timer
> 
> The problem is that between #3 and #4 there is no code that will
> actually abort the task, meaning stopping the Data-Out timer, sending
> the responses if TAS is required and so on.
> 
>>>>> TASK. While I'm sure most initiators will be able to recover from this
>>>>> event, such drastic measures will certanly cause a lot of confusion for
>>>>> people who are not familiar with TCM internals
>>>> How will this cause confusion vs the case where the cmd reaches the target
>>>> and we are waiting for it on the backend? In both cases, the initiator sends
>>>> an abort, it times out, the initiator or target drop the connection, we
>>>> relogin. Every initiator handles this.
>>>
>>> Because usually (when a WRITE request is past the WRITE PENDING state)
>>
>> Ah I think we were talking about different things here. I thought you meant
>> users and I was just saying they wouldn't see a difference. But for ESXi
>> it's going to work differently than I was thinking. I thought the initiator
>> was going to escalate to LUN RESET then we hit the issue I mention
>> below in the FastAbort part of the mail where we end up dropping the
>> connection waiting on the data outs.
> 
> Oh, I see.
> 
>>> the ABORT TASK does not trigger relogin. In my experience the initiator
>>> just waits for the TMR completion and goes on.
>>>
>>> And from a blackbox perspective it looks suspicious:
>>>
>>>   1. ABORT TASK sent to WRITE_10 tag 0x1; waits for it's completion
>>>   2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the connection is dropped
>>
>>
>> I didn't get this part where the connection is dropped almost immediately.
>> If only the backend is backed up, what is dropping the connection right
>> away? The data out timers shouldn't be firing right? It sounds like above
>> the network between the initiator and target were ok so data outs and R2Ts
>> should be executing quickly like normal right?
> 
> I was talking about the patch you proposed. Waiting for the Data-Out
> timeout means that the reconnection will be triggered. And this creates
> duality of a sort. If ABORT TASK was issued after we received all the
> Data-Out PDUs, the target will wait for the WRITE request to complete.
> But if we didn't receive them, the target will just wait unless the
> Data-Out timer expires and close the session.
>>> The only difference between #1 and #2 is that the command 0x1 is past
>>> the WRITE PENDING state.
>>>
>>>> With that said I am in favor of you fixing the code so we can cleanup
>>>> a partially sent cmd if it can be done sanely.
>>>>
>>>> I personally would just leave the current behavior and fix the deadlock
>>>> because:
>>>>
>>>> 1. When I see this happening it's normally the network so we have to blow
>>>> away the group of commands and we end up dropping the connection one way
>>>> or another. I don't see the big command short timeout case often anymore.
>>>>
>>>> 2. Initiators just did not implement this right. I know this for sure
>>>> for open-iscsi at least. I started to fix my screw ups the other day but it
>>>> ends up breaking the targets.
>>>>
>>>> For example,
>>>>
>>>> - If we've sent a R2T and the initiator has sent a LUN RESET, what are
>>>> you going to have the target do? Send the response right away?
>>>
>>> AFAIR the spec says "nuke it, there will be no data after this".> 
>>>> - If we've sent a R2T and the initiator has sent some of the data
>>>> PDUs to full fill it but has not sent the final PDU, then it sends the
>>>> LUN RESET, what do you do?
>>>
>>> The same. However, I understand the interoperability concerns. I'll
>>> check what other targets do
>> I think maybe you are replying about aborts, but I was asking about
>> LUN RESET which is opposite but will also hit the same hang if the
>> connection is dropped after one is sent.
>>
>> For aborts it works like you wrote above. For LUN RESET it's opposite.
>> In 3270, it doesn't say how to handle aborts, but on the pdl lists it
>> came up and they said equivalent of your nuke it. However, for TMFs
>> that affect multiple tasks they clarified it in later versions of the
>> specs.
>>
>> In the original it only says how to handle abort/clear task set, but in
>>
>> https://datatracker.ietf.org/doc/html/rfc5048
>>
>> the behavior was clarified and in 7143 we have the original/default
>> way:
>>
>> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3
>>
>> which says to wait for the data outs.
>>
>> And then we have FastAbort which is nuke it:
>>
>> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.4
> 
> For Target it says the following even for ABORT TASK:
> 
>   a) MUST wait for responses on currently valid Target Transfer Tags
>      of the affected tasks from the issuing initiator. MAY wait for
>      responses on currently valid Target Transfer Tags of the affected
>      tasks from third-party initiators.

Where do you see "ABORT TASK" in there? That RFC chunk from 4.2.3.3:

https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3

lists the TMFs it covers:

The execution of ABORT TASK SET, CLEAR TASK SET, LOGICAL UNIT RESET, TARGET
WARM RESET, and TARGET COLD RESET TMF  Requests

But in the link, the "SET" part of "ABORT TASK SET" is on the next line, so
you might have just misread it.


> 
> So either ESXi violates the RFC or just not RFC7143 compliant. However
> I'm getting hit with this even on Linux. I'll try to get some TCP dumps.

Linux is wrong for different reasons, which is why I was saying initiators
just did not do things right and you can get a crazy mix of behavior.
I basically programmed it for how targets were working rather than being
strict RFC-wise.

1. For aborts, it will send the abort then not send anything else.

2. For LUN/target resets, it "sort of" does FastAbort. I wrote the Linux
code before FastAbort was a thing, so it's all messed up. Basically, we
send the LUN RESET, and then the default is to just stop sending Data-Outs
for WRITEs.

However, we implemented that code before RFC 7143, so we don't negotiate
TaskReporting. It's just an iscsid.conf setting, node.session.iscsi.FastAbort.
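
For reference, that knob looks roughly like this in /etc/iscsi/iscsid.conf
(a sketch assuming a stock open-iscsi config, where as far as I recall the
option defaults to Yes):

  # Stop sending Data-Outs for outstanding WRITEs once a LUN/target reset
  # has been issued (the "sort of FastAbort" behavior described above).
  node.session.iscsi.FastAbort = Yes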

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 20:20                         ` Mike Christie
@ 2021-10-18 20:34                           ` Mike Christie
  2021-10-18 21:50                             ` Konstantin Shelekhin
  2021-10-18 21:48                           ` Konstantin Shelekhin
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Christie @ 2021-10-18 20:34 UTC (permalink / raw)
  To: Konstantin Shelekhin; +Cc: target-devel, linux, Maurizio Lombardi

On 10/18/21 3:20 PM, Mike Christie wrote:
> On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
>> On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
>>> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>>>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
>>>>>> If I understand this aproach correctly, it fixes the deadlock, but the
>>>>>> connection reinstatement will still happen, because WRITE_10 won't be
>>>>>> aborted and the connection will go down after the timeout.> 
>>>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>>>> have the connection (meaning SCSI session) killed on arbitrary ABOR
>>>>>
>>>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>>>>> the big command short timeout issue? Driver/fw bug?
>>>>
>>>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>>>> device cannot handle the load and some IOs get stuck for more than 30
>>>> seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
>>>> So a series of ABORT TASK will come, and the unlucky one will hit the
>>>> issue.
>>>
>>> I didn't get this. If only the backend is backed up then we should
>>> still be transmitting the data out/R2Ts quickly and we shouldn't be
>>> hitting the issue where we got stuck waiting on them.
>>
>> We stuck waiting on them because the initiator will not send Data-Out
> 
> We are talking about different things here. Above I'm just asking about what
> leads to the cmd timeout.

Oh wait, I misunderstood the "almost immediately" part in your #3.

Just tell me if you are running iscsi in the guest or in the hypervisor and,
if the latter, what version of ESXi.

> 
> You wrote before the abort is sent the backend gets backed up, and the back
> up causes IO to take long enough for the initiator cmd timeout to fire.
> I'm asking why before the initiator side cmd timeout and before the abort is sent,
> why aren't R2T/data_outs executing quickly if only the backend is backed up?
> 
> Is it the bug I mentioned where one of the iscsi threads is stuck on the
> submission to the block layer, so that thread can't handle iscsi IO?
> If so I have a patch for that.
> 
> I get that once the abort is sent we hit these other issues.
> 
> 
>> PDUs after sending ABORT TASK:
>>
>>   1. Initiator sends WRITE CDB
>>   2. Target sends R2T
>>   3. Almost immediately Initiator decides to abort the request and sends
> 
> Are you using iscsi in the VM or in the hypervisor? For the latter is
> timeout 15 seconds for normal READs/WRITEs? What version of ESXi?
> 
>>      ABORT TASK without sending any further Data-Out PDUs (maybe except for
>>      the first one); I believe it happens because the initiator tries to
>>      abort a larger batch of requests, and this unlucky request is just
>>      the last in series
>>   4. Target still waits for Data-Out PDUs and times out on Data-Out timer
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 20:20                         ` Mike Christie
  2021-10-18 20:34                           ` Mike Christie
@ 2021-10-18 21:48                           ` Konstantin Shelekhin
  1 sibling, 0 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-18 21:48 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Mon, Oct 18, 2021 at 03:20:40PM -0500, Mike Christie wrote:
> On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
> > On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
> >> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> >>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
> >>>>> If I understand this aproach correctly, it fixes the deadlock, but the
> >>>>> connection reinstatement will still happen, because WRITE_10 won't be
> >>>>> aborted and the connection will go down after the timeout.> 
> >>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
> >>>>> have the connection (meaning SCSI session) killed on arbitrary ABOR
> >>>>
> >>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
> >>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
> >>>> the big command short timeout issue? Driver/fw bug?
> >>>
> >>> It was triggered by ESXi. During some heavy IOPS intervals the backend
> >>> device cannot handle the load and some IOs get stuck for more than 30
> >>> seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
> >>> So a series of ABORT TASK will come, and the unlucky one will hit the
> >>> issue.
> >>
> >> I didn't get this. If only the backend is backed up then we should
> >> still be transmitting the data out/R2Ts quickly and we shouldn't be
> >> hitting the issue where we got stuck waiting on them.
> > 
> > We stuck waiting on them because the initiator will not send Data-Out
> 
> We are talking about different things here. Above I'm just asking about what
> leads to the cmd timeout.
> 
> You wrote before the abort is sent the backend gets backed up, and the back
> up causes IO to take long enough for the initiator cmd timeout to fire.
> I'm asking why before the initiator side cmd timeout and before the abort is sent,
> why aren't R2T/data_outs executing quickly if only the backend is backed up?
>
> Is it the bug I mentioned where one of the iscsi threads is stuck on the
> submission to the block layer, so that thread can't handle iscsi IO?
> If so I have a patch for that.

On that I'm not sure; I haven't checked the socket contents.
Let's try the patch! I can test it tomorrow morning.
 
> I get that once the abort is sent we hit these other issues.
> 
> 
> > PDUs after sending ABORT TASK:
> > 
> >   1. Initiator sends WRITE CDB
> >   2. Target sends R2T
> >   3. Almost immediately Initiator decides to abort the request and sends
> 
> Are you using iscsi in the VM or in the hypervisor? For the latter is
> timeout 15 seconds for normal READs/WRITEs? What version of ESXi?

The Linux server is attached to the ESXi 6.7 host, both physical. The ESXi
connects to the datastore over iSCSI and hosts a bunch of different VMs,
mostly Linux VMs AFAIR. The timeouts are all defaults, but I am not sure
whether the ABORT TASKs are generated by the ESXi itself or by some of the
guests.

> >      ABORT TASK without sending any further Data-Out PDUs (maybe except for
> >      the first one); I believe it happens because the initiator tries to
> >      abort a larger batch of requests, and this unlucky request is just
> >      the last in series
> >   4. Target still waits for Data-Out PDUs and times out on Data-Out timer
> > 
> > The problem is that between #3 and #4 there is no code that will
> > actually abort the task, meaning stopping the Data-Out timer, sending
> > the responses if TAS is requires and so on.
> > 
> >>>>> TASK. While I'm sure most initiators will be able to recover from this
> >>>>> event, such drastic measures will certanly cause a lot of confusion for
> >>>>> people who are not familiar with TCM internals
> >>>> How will this cause confusion vs the case where the cmd reaches the target
> >>>> and we are waiting for it on the backend? In both cases, the initiator sends
> >>>> an abort, it times out, the initiator or target drop the connection, we
> >>>> relogin. Every initiator handles this.
> >>>
> >>> Because usually (when a WRITE request is past the WRITE PENDING state)
> >>
> >> Ah I think we were talking about different things here. I thought you meant
> >> users and I was just saying they wouldn't see a difference. But for ESXi
> >> it's going to work differently than I was thinking. I thought the initiator
> >> was going to escalate to LUN RESET then we hit the issue I mention
> >> below in the FastAbort part of the mail where we end up dropping the
> >> connection waiting on the data outs.
> > 
> > Oh, I see.
> > 
> >>> the ABORT TASK does not trigger relogin. In my experience the initiator
> >>> just waits for the TMR completion and goes on.
> >>>
> >>> And from a blackbox perspective it looks suspicious:
> >>>
> >>>   1. ABORT TASK sent to WRITE_10 tag 0x1; waits for it's completion
> >>>   2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the connection is dropped
> >>
> >>
> >> I didn't get this part where the connection is dropped almost immediately.
> >> If only the backend is backed up, what is dropping the connection right
> >> away? The data out timers shouldn't be firing right? It sounds like above
> >> the network between the initiator and target were ok so data outs and R2Ts
> >> should be executing quickly like normal right?
> > 
> > I was talking about the patch you proposed. Waiting for the Data-Out
> > timeout means that the reconnection will be triggered. And this creates
> > duality of a sort. If ABORT TASK was issued after we received all the
> > Data-Out PDUs, the target will wait for the WRITE request to complete.
> > But if we didn't receive them, the target will just wait unless the
> > Data-Out timer expires and close the session>  
> >>> The only difference between #1 and #2 is that the command 0x1 is past
> >>> the WRITE PENDING state.
> >>>
> >>>> With that said I am in favor of you fixing the code so we can cleanup
> >>>> a partially sent cmd if it can be done sanely.
> >>>>
> >>>> I personally would just leave the current behavior and fix the deadlock
> >>>> because:
> >>>>
> >>>> 1. When I see this happening it's normally the network so we have to blow
> >>>> away the group of commands and we end up dropping the connection one way
> >>>> or another. I don't see the big command short timeout case often anymore.
> >>>>
> >>>> 2. Initiators just did not implement this right. I know this for sure
> >>>> for open-iscsi at least. I started to fix my screw ups the other day but it
> >>>> ends up breaking the targets.
> >>>>
> >>>> For example,
> >>>>
> >>>> - If we've sent a R2T and the initiator has sent a LUN RESET, what are
> >>>> you going to have the target do? Send the response right away?
> >>>
> >>> AFAIR the spec says "nuke it, there will be no data after this".> 
> >>>> - If we've sent a R2T and the initiator has sent some of the data
> >>>> PDUs to full fill it but has not sent the final PDU, then it sends the
> >>>> LUN RESET, what do you do?
> >>>
> >>> The same. However, I understand the interoperability concerns. I'll
> >>> check what other targets do
> >> I think maybe you are replying about aborts, but I was asking about
> >> LUN RESET which is opposite but will also hit the same hang if the
> >> connection is dropped after one is sent.
> >>
> >> For aborts it works like you wrote above. For LUN RESET it's opposite.
> >> In 3270, it doesn't say how to handle aborts, but on the pdl lists it
> >> came up and they said equivalent of your nuke it. However, for TMFs
> >> that affect multiple tasks they clarified it in later versions of the
> >> specs.
> >>
> >> In the original it only says how to handle abort/clear task set, but in
> >>
> >> https://datatracker.ietf.org/doc/html/rfc5048
> >>
> >> the behavior was clarified and in 7143 we have the original/default
> >> way:
> >>
> >> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3
> >>
> >> which says to wait for the data outs.
> >>
> >> And then we have FastAbort which is nuke it:
> >>
> >> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.4
> > 
> > For Target it says the following even for ABORT TASK:
> > 
> >   a) MUST wait for responses on currently valid Target Transfer Tags
> >      of the affected tasks from the issuing initiator. MAY wait for
> >      responses on currently valid Target Transfer Tags of the affected
> >      tasks from third-party initiators.
> 
> Where do you see "ABORT TASK" in there? That RFC chunk from 4.2.3.3;
> 
> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3
> 
> lists the TMFs it covers:
> 
> The execution of ABORT TASK SET, CLEAR TASK SET, LOGICAL UNIT RESET, TARGET
> WARM RESET, and TARGET COLD RESET TMF  Requests
> 
> But in the link, the "SET" part of "ABORT TASK SET" is on the next line so
> you might have just scanned over it wrong.

I'm sorry, you are right, ABORT TASK SET.
 
> > 
> > So either ESXi violates the RFC or just not RFC7143 compliant. However
> > I'm getting hit with this even on Linux. I'll try to get some TCP dumps.
> 
> Linux is wrong for different reasons and was why I was saying initiators
> just did not do things right and you can get a crazy mix of behavior.
> I basically programed it for how targets were working and not being strict
> RFC wise.
> 
> 1. For aborts, it will send the abort then not send anything else.
> 
> 2. For LUN/target resets, it "sort of" does FastAbort. I wrote the linux
> code before FastAbort was a thing so it's all messed up. Basically, we
> send the LUN RESET, then the default is to just stop sending DataOuts
> for WRITEs.
> 
> However, we implemented that code before RFC 7143 so we don't negotiate
> for TaskReporting. It's just a iscsid.conf setting node.session.iscsi.FastAbort.

Well, that makes things clearer, thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 20:34                           ` Mike Christie
@ 2021-10-18 21:50                             ` Konstantin Shelekhin
  0 siblings, 0 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-18 21:50 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Mon, Oct 18, 2021 at 03:34:44PM -0500, Mike Christie wrote:
> On 10/18/21 3:20 PM, Mike Christie wrote:
> > On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
> >> On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
> >>> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> >>>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@oracle.com wrote:
> >>>>>> If I understand this aproach correctly, it fixes the deadlock, but the
> >>>>>> connection reinstatement will still happen, because WRITE_10 won't be
> >>>>>> aborted and the connection will go down after the timeout.> 
> >>>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
> >>>>>> have the connection (meaning SCSI session) killed on arbitrary ABOR
> >>>>>
> >>>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
> >>>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
> >>>>> the big command short timeout issue? Driver/fw bug?
> >>>>
> >>>> It was triggered by ESXi. During some heavy IOPS intervals the backend
> >>>> device cannot handle the load and some IOs get stuck for more than 30
> >>>> seconds. I suspect that ABORT TASKSs are issued by the virtual machines.
> >>>> So a series of ABORT TASK will come, and the unlucky one will hit the
> >>>> issue.
> >>>
> >>> I didn't get this. If only the backend is backed up then we should
> >>> still be transmitting the data out/R2Ts quickly and we shouldn't be
> >>> hitting the issue where we got stuck waiting on them.
> >>
> >> We stuck waiting on them because the initiator will not send Data-Out
> > 
> > We are talking about different things here. Above I'm just asking about what
> > leads to the cmd timeout.
> 
> Oh wait, I miss understood the "almost immediately" part in your #3.
> 
> Just tell me if you are running iscsi in the guest or hypervisor and if
> the latter what version of ESXi,

ESXi 6.7 is connected over iSCSI. It uses the block device for the
datastore.
 
> > 
> > You wrote before the abort is sent the backend gets backed up, and the back
> > up causes IO to take long enough for the initiator cmd timeout to fire.
> > I'm asking why before the initiator side cmd timeout and before the abort is sent,
> > why aren't R2T/data_outs executing quickly if only the backend is backed up?
> > 
> > Is it the bug I mentioned where one of the iscsi threads is stuck on the
> > submission to the block layer, so that thread can't handle iscsi IO?
> > If so I have a patch for that.
> > 
> > I get that once the abort is sent we hit these other issues.
> > 
> > 
> >> PDUs after sending ABORT TASK:
> >>
> >>   1. Initiator sends WRITE CDB
> >>   2. Target sends R2T
> >>   3. Almost immediately Initiator decides to abort the request and sends
> > 
> > Are you using iscsi in the VM or in the hypervisor? For the latter is
> > timeout 15 seconds for normal READs/WRITEs? What version of ESXi?
> > 
> >>      ABORT TASK without sending any further Data-Out PDUs (maybe except for
> >>      the first one); I believe it happens because the initiator tries to
> >>      abort a larger batch of requests, and this unlucky request is just
> >>      the last in series
> >>   4. Target still waits for Data-Out PDUs and times out on Data-Out timer
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: iSCSI Abort Task and WRITE PENDING
  2021-10-18 17:08                       ` Mike Christie
@ 2021-10-26 10:59                         ` Konstantin Shelekhin
  0 siblings, 0 replies; 26+ messages in thread
From: Konstantin Shelekhin @ 2021-10-26 10:59 UTC (permalink / raw)
  To: Mike Christie; +Cc: target-devel, linux, Maurizio Lombardi

On Mon, Oct 18, 2021 at 12:08:26PM -0500, Mike Christie wrote:
> Oh wait, I just remembered the bug you might be hitting. If you
> are using iblock for example, then when the iscsi target calls
> into LIO core to submit the cmd we can end up calling into the block
> layer and blocking on a full queue (hitting the nr_requests limit).
> The iscsi layer is then not able to do its normal R2T/DataOut handling
> because one of the iscsi threads is stuck.
> 
> I'll send a patch to fix this issue.

Can you please send me the patch so I can test it?

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-10-26 10:59 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-13 13:21 iSCSI Abort Task and WRITE PENDING Konstantin Shelekhin
2021-10-13 14:22 ` Hannes Reinecke
2021-10-13 14:53   ` Konstantin Shelekhin
2021-10-13 14:56     ` Konstantin Shelekhin
2021-10-14  7:09     ` Hannes Reinecke
2021-10-14  7:52       ` Konstantin Shelekhin
2021-10-13 17:51 ` Mike Christie
2021-10-13 18:05   ` Mike Christie
2021-10-13 18:11     ` Konstantin Shelekhin
2021-10-13 18:08   ` Konstantin Shelekhin
2021-10-13 18:24     ` Mike Christie
2021-10-13 18:30       ` Mike Christie
2021-10-13 18:58         ` Konstantin Shelekhin
2021-10-13 19:01           ` Konstantin Shelekhin
2021-10-13 20:21             ` Mike Christie
2021-10-14 23:12               ` Konstantin Shelekhin
2021-10-15  3:18                 ` michael.christie
2021-10-18 11:56                   ` Konstantin Shelekhin
2021-10-18 16:29                     ` Mike Christie
2021-10-18 17:08                       ` Mike Christie
2021-10-26 10:59                         ` Konstantin Shelekhin
2021-10-18 17:32                       ` Konstantin Shelekhin
2021-10-18 20:20                         ` Mike Christie
2021-10-18 20:34                           ` Mike Christie
2021-10-18 21:50                             ` Konstantin Shelekhin
2021-10-18 21:48                           ` Konstantin Shelekhin
