linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] scsi: target: loop: Fix handling of aborted TMRs
@ 2020-07-15 16:04 Bodo Stroesser
  2020-07-26  5:16 ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: Bodo Stroesser @ 2020-07-15 16:04 UTC (permalink / raw)
  To: Martin K. Petersen, Mike Christie, linux-scsi, target-devel
  Cc: Bodo Stroesser

If an ABORT_TASK TMR is aborted by a LUN_RESET, core calls
tcm_loop's aborted_task fabric callback for the ABORT_TASK.

The aborted_task callback is not prepared to handle aborted TMRs,
but is an empty function which is ok for aborted SCSI cmds only.
So it does not wake up tcm_loop_issue_tmr() sleeping in
wait_for_completion(). Therefore scmd_eh_abort_handler
hangs forever and we get the following messages:

INFO: task kworker/u48:1:31216 blocked for more than 3932 seconds.
      Tainted: G           OE     5.8.0-rc1+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u48:1   D    0 31216      2 0x00004000
Workqueue: scsi_tmf_14 scmd_eh_abort_handler [scsi_mod]
Call Trace:
 __schedule+0x2aa/0x6d0
 schedule+0x42/0xb0
 schedule_timeout+0x1ba/0x280
 ? __queue_work+0x13b/0x3d0
 ? kmem_cache_alloc_trace+0x1e6/0x200
 wait_for_completion+0x7f/0xd0
 tcm_loop_issue_tmr.isra.10+0xc1/0x110 [tcm_loop]
 tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
 scmd_eh_abort_handler+0x91/0x230 [scsi_mod]
 process_one_work+0x166/0x370
 worker_thread+0x49/0x3e0
 ? rescuer_thread+0x320/0x320
 kthread+0xfc/0x130
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x22/0x30

Fix:
After calling the aborted_task callback the core immediately
releases the se_cmd that represents the ABORT_TASK. The woken
up thread (tcm_loop_issue_tmr) therefore must not access se_cmd
and tl_cmd in case of aborted TMRs.

So I changed aborted_task and queue_tm_rsp to transfer result
code from se_cmd to tcm_loop_issue_tmr's stack and added the
missing wake_up() to aborted_task.
Now tcm_loop_issue_tmr after waking up no longer accesses se_cmd
and tl_cmd. Therefore tcm_loop_issue_tmr no longer needs to call
target_put_sess_cmd and flag TARGET_SCF_ACK_KREF is no longer
needed in se_cmd.

Signed-off-by: Bodo Stroesser <bstroesser@ts.fujitsu.com>
---
 drivers/target/loopback/tcm_loop.c | 39 ++++++++++++++++++++++----------------
 drivers/target/loopback/tcm_loop.h |  4 +++-
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c
index 16d5a4e117a2..0968bc8b6640 100644
--- a/drivers/target/loopback/tcm_loop.c
+++ b/drivers/target/loopback/tcm_loop.c
@@ -199,6 +199,7 @@ static int tcm_loop_issue_tmr(struct tcm_loop_tpg *tl_tpg,
 	struct tcm_loop_nexus *tl_nexus;
 	struct tcm_loop_cmd *tl_cmd;
 	int ret = TMR_FUNCTION_FAILED, rc;
+	DECLARE_COMPLETION_ONSTACK(compl);
 
 	/*
 	 * Locate the tl_nexus and se_sess pointers
@@ -213,26 +214,23 @@ static int tcm_loop_issue_tmr(struct tcm_loop_tpg *tl_tpg,
 	if (!tl_cmd)
 		return ret;
 
-	init_completion(&tl_cmd->tmr_done);
+	tl_cmd->is_tmr = true;
+	tl_cmd->tmr_done = &compl;
+	tl_cmd->tmr_result = &ret;
 
 	se_cmd = &tl_cmd->tl_se_cmd;
 	se_sess = tl_tpg->tl_nexus->se_sess;
 
 	rc = target_submit_tmr(se_cmd, se_sess, tl_cmd->tl_sense_buf, lun,
-			       NULL, tmr, GFP_KERNEL, task,
-			       TARGET_SCF_ACK_KREF);
-	if (rc < 0)
-		goto release;
-	wait_for_completion(&tl_cmd->tmr_done);
-	ret = se_cmd->se_tmr_req->response;
-	target_put_sess_cmd(se_cmd);
+			       NULL, tmr, GFP_KERNEL, task, 0);
+	if (rc < 0) {
+		kmem_cache_free(tcm_loop_cmd_cache, tl_cmd);
+		return ret;
+	}
 
-out:
-	return ret;
+	wait_for_completion(tl_cmd->tmr_done);
 
-release:
-	kmem_cache_free(tcm_loop_cmd_cache, tl_cmd);
-	goto out;
+	return ret;
 }
 
 static int tcm_loop_abort_task(struct scsi_cmnd *sc)
@@ -590,13 +588,22 @@ static void tcm_loop_queue_tm_rsp(struct se_cmd *se_cmd)
 	struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
 				struct tcm_loop_cmd, tl_se_cmd);
 
-	/* Wake up tcm_loop_issue_tmr(). */
-	complete(&tl_cmd->tmr_done);
+	/* Set tmr result and wake up tcm_loop_issue_tmr(). */
+	*tl_cmd->tmr_result = se_cmd->se_tmr_req->response;
+	complete(tl_cmd->tmr_done);
 }
 
 static void tcm_loop_aborted_task(struct se_cmd *se_cmd)
 {
-	return;
+	struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
+				struct tcm_loop_cmd, tl_se_cmd);
+
+	if (!tl_cmd->is_tmr)
+		return;
+
+	/* Set tmr result and wake up tcm_loop_issue_tmr(). */
+	*tl_cmd->tmr_result = TMR_FUNCTION_REJECTED;
+	complete(tl_cmd->tmr_done);
 }
 
 static char *tcm_loop_dump_proto_id(struct tcm_loop_hba *tl_hba)
diff --git a/drivers/target/loopback/tcm_loop.h b/drivers/target/loopback/tcm_loop.h
index d3110909a213..e7615b9f5ed1 100644
--- a/drivers/target/loopback/tcm_loop.h
+++ b/drivers/target/loopback/tcm_loop.h
@@ -17,7 +17,9 @@ struct tcm_loop_cmd {
 	/* The TCM I/O descriptor that is accessed via container_of() */
 	struct se_cmd tl_se_cmd;
 	struct work_struct work;
-	struct completion tmr_done;
+	struct completion *tmr_done;
+	bool is_tmr;
+	int *tmr_result;
 	/* Sense buffer that will be mapped into outgoing status */
 	unsigned char tl_sense_buf[TRANSPORT_SENSE_BUFFER];
 };
-- 
2.12.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: target: loop: Fix handling of aborted TMRs
  2020-07-15 16:04 [PATCH] scsi: target: loop: Fix handling of aborted TMRs Bodo Stroesser
@ 2020-07-26  5:16 ` Mike Christie
  2020-07-26 11:02   ` Bodo Stroesser
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-07-26  5:16 UTC (permalink / raw)
  To: Bodo Stroesser, Martin K. Petersen, linux-scsi, target-devel

On 7/15/20 11:04 AM, Bodo Stroesser wrote:
> Fix:
> After calling the aborted_task callback the core immediately
> releases the se_cmd that represents the ABORT_TASK. The woken
> up thread (tcm_loop_issue_tmr) therefore must not access se_cmd
> and tl_cmd in case of aborted TMRs.

The code and fix description below look ok. I didn't get the above part though. If we have TARGET_SCF_ACK_KREF set then doesn't the se_cmd and tl_cmd stay around until we do the target_put_sess_cmd in tcm_loop_issue_tmr?

The way you coded it below so we don't need TARGET_SCF_ACK_KREF seems ok. I was just thinking that the above chunk of patch description could be dropped, or we need to fix some other drivers because they are doing something similar to loop before your patch.


> 
> So I changed aborted_task and queue_tm_rsp to transfer result
> code from se_cmd to tcm_loop_issue_tmr's stack and added the
> missing wake_up() to aborted_task.
> Now tcm_loop_issue_tmr after waking up no longer accesses se_cmd
> and tl_cmd. Therefore tcm_loop_issue_tmr no longer needs to call
> target_put_sess_cmd and flag TARGET_SCF_ACK_KREF is no longer
> needed in se_cmd.
> 
> Signed-off-by: Bodo Stroesser <bstroesser@ts.fujitsu.com>
> ---
>  drivers/target/loopback/tcm_loop.c | 39 ++++++++++++++++++++++----------------
>  drivers/target/loopback/tcm_loop.h |  4 +++-
>  2 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c
> index 16d5a4e117a2..0968bc8b6640 100644
> --- a/drivers/target/loopback/tcm_loop.c
> +++ b/drivers/target/loopback/tcm_loop.c
> @@ -199,6 +199,7 @@ static int tcm_loop_issue_tmr(struct tcm_loop_tpg *tl_tpg,
>  	struct tcm_loop_nexus *tl_nexus;
>  	struct tcm_loop_cmd *tl_cmd;
>  	int ret = TMR_FUNCTION_FAILED, rc;
> +	DECLARE_COMPLETION_ONSTACK(compl);
>  
>  	/*
>  	 * Locate the tl_nexus and se_sess pointers
> @@ -213,26 +214,23 @@ static int tcm_loop_issue_tmr(struct tcm_loop_tpg *tl_tpg,
>  	if (!tl_cmd)
>  		return ret;
>  
> -	init_completion(&tl_cmd->tmr_done);
> +	tl_cmd->is_tmr = true;
> +	tl_cmd->tmr_done = &compl;
> +	tl_cmd->tmr_result = &ret;
>  
>  	se_cmd = &tl_cmd->tl_se_cmd;
>  	se_sess = tl_tpg->tl_nexus->se_sess;
>  
>  	rc = target_submit_tmr(se_cmd, se_sess, tl_cmd->tl_sense_buf, lun,
> -			       NULL, tmr, GFP_KERNEL, task,
> -			       TARGET_SCF_ACK_KREF);
> -	if (rc < 0)
> -		goto release;
> -	wait_for_completion(&tl_cmd->tmr_done);
> -	ret = se_cmd->se_tmr_req->response;
> -	target_put_sess_cmd(se_cmd);
> +			       NULL, tmr, GFP_KERNEL, task, 0);
> +	if (rc < 0) {
> +		kmem_cache_free(tcm_loop_cmd_cache, tl_cmd);
> +		return ret;
> +	}
>  
> -out:
> -	return ret;
> +	wait_for_completion(tl_cmd->tmr_done);
>  
> -release:
> -	kmem_cache_free(tcm_loop_cmd_cache, tl_cmd);
> -	goto out;
> +	return ret;
>  }
>  
>  static int tcm_loop_abort_task(struct scsi_cmnd *sc)
> @@ -590,13 +588,22 @@ static void tcm_loop_queue_tm_rsp(struct se_cmd *se_cmd)
>  	struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
>  				struct tcm_loop_cmd, tl_se_cmd);
>  
> -	/* Wake up tcm_loop_issue_tmr(). */
> -	complete(&tl_cmd->tmr_done);
> +	/* Set tmr result and wake up tcm_loop_issue_tmr(). */
> +	*tl_cmd->tmr_result = se_cmd->se_tmr_req->response;
> +	complete(tl_cmd->tmr_done);
>  }
>  
>  static void tcm_loop_aborted_task(struct se_cmd *se_cmd)
>  {
> -	return;
> +	struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
> +				struct tcm_loop_cmd, tl_se_cmd);
> +
> +	if (!tl_cmd->is_tmr)
> +		return;
> +
> +	/* Set tmr result and wake up tcm_loop_issue_tmr(). */
> +	*tl_cmd->tmr_result = TMR_FUNCTION_REJECTED;
> +	complete(tl_cmd->tmr_done);
>  }
>  
>  static char *tcm_loop_dump_proto_id(struct tcm_loop_hba *tl_hba)
> diff --git a/drivers/target/loopback/tcm_loop.h b/drivers/target/loopback/tcm_loop.h
> index d3110909a213..e7615b9f5ed1 100644
> --- a/drivers/target/loopback/tcm_loop.h
> +++ b/drivers/target/loopback/tcm_loop.h
> @@ -17,7 +17,9 @@ struct tcm_loop_cmd {
>  	/* The TCM I/O descriptor that is accessed via container_of() */
>  	struct se_cmd tl_se_cmd;
>  	struct work_struct work;
> -	struct completion tmr_done;
> +	struct completion *tmr_done;
> +	bool is_tmr;
> +	int *tmr_result;
>  	/* Sense buffer that will be mapped into outgoing status */
>  	unsigned char tl_sense_buf[TRANSPORT_SENSE_BUFFER];
>  };
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: target: loop: Fix handling of aborted TMRs
  2020-07-26  5:16 ` Mike Christie
@ 2020-07-26 11:02   ` Bodo Stroesser
  2020-07-26 18:37     ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: Bodo Stroesser @ 2020-07-26 11:02 UTC (permalink / raw)
  To: Mike Christie, Martin K. Petersen, linux-scsi, target-devel

On 2020-07-26 07:16, Mike Christie wrote:
> On 7/15/20 11:04 AM, Bodo Stroesser wrote:
>> Fix:
>> After calling the aborted_task callback the core immediately
>> releases the se_cmd that represents the ABORT_TASK. The woken
>> up thread (tcm_loop_issue_tmr) therefore must not access se_cmd
>> and tl_cmd in case of aborted TMRs.
> 
> The code and fix description below look ok. I didn't get the above part though. If we have TARGET_SCF_ACK_KREF set then doesn't the se_cmd and tl_cmd stay around until we do the target_put_sess_cmd in tcm_loop_issue_tmr?

No. For an aborted ABORT_TASK, target_handle_abort is called.
If tas is not set, it executes this code:

        } else {
                /*
                 * Allow the fabric driver to unmap any resources before
                 * releasing the descriptor via TFO->release_cmd().
                 */
                cmd->se_tfo->aborted_task(cmd);
                if (ack_kref)
                        WARN_ON_ONCE(target_put_sess_cmd(cmd) != 0);
                /*
                 * To do: establish a unit attention condition on the I_T
                 * nexus associated with cmd. See also the paragraph "Aborting
                 * commands" in SAM.
                 */
        }

        WARN_ON_ONCE(kref_read(&cmd->cmd_kref) == 0);

        transport_lun_remove_cmd(cmd);

        transport_cmd_check_stop_to_fabric(cmd);

That means: no matter whether SCF_ACK_REF is set in the cmd or not:
1) fabric's aborted_task handler and a waiter woken up by aborted_task must not call target_put_sess_cmd.
2) a waiter woken up by aborted_task() must not access se_cmd (or tl_cmd) since target_handle_abort
   might have released it completely meanwhile.

OTOH, if TMR is not aborted, fabric's queue_tm_rsp() handler is called. If SCF_ACK_REF is set fabric has to release cmd_kref when it no longer needs the se_/tl_cmd. 
Due to 2) I had to change the waiting tcm_loop_issue_tmr such, that it no longer needs access to tl_/se_cmd.
After that, there was no more need for TARGET_SCF_ACK_KREF. Removing it avoids to add different handling of normal and aborted case in tcm_loop_issue_tmr.

> 
> The way you coded it below so we don't need TARGET_SCF_ACK_KREF seems ok. I was just thinking that the above chunk of patch description could be dropped, or we need to fix some other drivers because they are doing something similar to loop before your patch.
> 
> 
>>
>> So I changed aborted_task and queue_tm_rsp to transfer result
>> code from se_cmd to tcm_loop_issue_tmr's stack and added the
>> missing wake_up() to aborted_task.
>> Now tcm_loop_issue_tmr after waking up no longer accesses se_cmd
>> and tl_cmd. Therefore tcm_loop_issue_tmr no longer needs to call
>> target_put_sess_cmd and flag TARGET_SCF_ACK_KREF is no longer
>> needed in se_cmd.
>>
>> Signed-off-by: Bodo Stroesser <bstroesser@ts.fujitsu.com>
>> ---
>>   drivers/target/loopback/tcm_loop.c | 39 ++++++++++++++++++++++----------------
>>   drivers/target/loopback/tcm_loop.h |  4 +++-
>>   2 files changed, 26 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c
>> index 16d5a4e117a2..0968bc8b6640 100644
>> --- a/drivers/target/loopback/tcm_loop.c
>> +++ b/drivers/target/loopback/tcm_loop.c
>> @@ -199,6 +199,7 @@ static int tcm_loop_issue_tmr(struct tcm_loop_tpg *tl_tpg,
>>   	struct tcm_loop_nexus *tl_nexus;
>>   	struct tcm_loop_cmd *tl_cmd;
>>   	int ret = TMR_FUNCTION_FAILED, rc;
>> +	DECLARE_COMPLETION_ONSTACK(compl);
>>   
>>   	/*
>>   	 * Locate the tl_nexus and se_sess pointers
>> @@ -213,26 +214,23 @@ static int tcm_loop_issue_tmr(struct tcm_loop_tpg *tl_tpg,
>>   	if (!tl_cmd)
>>   		return ret;
>>   
>> -	init_completion(&tl_cmd->tmr_done);
>> +	tl_cmd->is_tmr = true;
>> +	tl_cmd->tmr_done = &compl;
>> +	tl_cmd->tmr_result = &ret;
>>   
>>   	se_cmd = &tl_cmd->tl_se_cmd;
>>   	se_sess = tl_tpg->tl_nexus->se_sess;
>>   
>>   	rc = target_submit_tmr(se_cmd, se_sess, tl_cmd->tl_sense_buf, lun,
>> -			       NULL, tmr, GFP_KERNEL, task,
>> -			       TARGET_SCF_ACK_KREF);
>> -	if (rc < 0)
>> -		goto release;
>> -	wait_for_completion(&tl_cmd->tmr_done);
>> -	ret = se_cmd->se_tmr_req->response;
>> -	target_put_sess_cmd(se_cmd);
>> +			       NULL, tmr, GFP_KERNEL, task, 0);
>> +	if (rc < 0) {
>> +		kmem_cache_free(tcm_loop_cmd_cache, tl_cmd);
>> +		return ret;
>> +	}
>>   
>> -out:
>> -	return ret;
>> +	wait_for_completion(tl_cmd->tmr_done);
>>   
>> -release:
>> -	kmem_cache_free(tcm_loop_cmd_cache, tl_cmd);
>> -	goto out;
>> +	return ret;
>>   }
>>   
>>   static int tcm_loop_abort_task(struct scsi_cmnd *sc)
>> @@ -590,13 +588,22 @@ static void tcm_loop_queue_tm_rsp(struct se_cmd *se_cmd)
>>   	struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
>>   				struct tcm_loop_cmd, tl_se_cmd);
>>   
>> -	/* Wake up tcm_loop_issue_tmr(). */
>> -	complete(&tl_cmd->tmr_done);
>> +	/* Set tmr result and wake up tcm_loop_issue_tmr(). */
>> +	*tl_cmd->tmr_result = se_cmd->se_tmr_req->response;
>> +	complete(tl_cmd->tmr_done);
>>   }
>>   
>>   static void tcm_loop_aborted_task(struct se_cmd *se_cmd)
>>   {
>> -	return;
>> +	struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
>> +				struct tcm_loop_cmd, tl_se_cmd);
>> +
>> +	if (!tl_cmd->is_tmr)
>> +		return;
>> +
>> +	/* Set tmr result and wake up tcm_loop_issue_tmr(). */
>> +	*tl_cmd->tmr_result = TMR_FUNCTION_REJECTED;
>> +	complete(tl_cmd->tmr_done);
>>   }
>>   
>>   static char *tcm_loop_dump_proto_id(struct tcm_loop_hba *tl_hba)
>> diff --git a/drivers/target/loopback/tcm_loop.h b/drivers/target/loopback/tcm_loop.h
>> index d3110909a213..e7615b9f5ed1 100644
>> --- a/drivers/target/loopback/tcm_loop.h
>> +++ b/drivers/target/loopback/tcm_loop.h
>> @@ -17,7 +17,9 @@ struct tcm_loop_cmd {
>>   	/* The TCM I/O descriptor that is accessed via container_of() */
>>   	struct se_cmd tl_se_cmd;
>>   	struct work_struct work;
>> -	struct completion tmr_done;
>> +	struct completion *tmr_done;
>> +	bool is_tmr;
>> +	int *tmr_result;
>>   	/* Sense buffer that will be mapped into outgoing status */
>>   	unsigned char tl_sense_buf[TRANSPORT_SENSE_BUFFER];
>>   };
>>
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: target: loop: Fix handling of aborted TMRs
  2020-07-26 11:02   ` Bodo Stroesser
@ 2020-07-26 18:37     ` Mike Christie
  2020-07-27 14:26       ` Bodo Stroesser
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-07-26 18:37 UTC (permalink / raw)
  To: Bodo Stroesser, Martin K. Petersen, linux-scsi, target-devel

On 7/26/20 6:02 AM, Bodo Stroesser wrote:
> On 2020-07-26 07:16, Mike Christie wrote:
>> On 7/15/20 11:04 AM, Bodo Stroesser wrote:
>>> Fix:
>>> After calling the aborted_task callback the core immediately
>>> releases the se_cmd that represents the ABORT_TASK. The woken
>>> up thread (tcm_loop_issue_tmr) therefore must not access se_cmd
>>> and tl_cmd in case of aborted TMRs.
>>
>> The code and fix description below look ok. I didn't get the above part though. If we have TARGET_SCF_ACK_KREF set then doesn't the se_cmd and tl_cmd stay around until we do the target_put_sess_cmd in tcm_loop_issue_tmr?
> 
> No. For an aborted ABORT_TASK, target_handle_abort is called.
> If tas is not set, it executes this code:
> 
>         } else {
>                 /*
>                  * Allow the fabric driver to unmap any resources before
>                  * releasing the descriptor via TFO->release_cmd().
>                  */
>                 cmd->se_tfo->aborted_task(cmd);
>                 if (ack_kref)
>                         WARN_ON_ONCE(target_put_sess_cmd(cmd) != 0);
>                 /*
>                  * To do: establish a unit attention condition on the I_T
>                  * nexus associated with cmd. See also the paragraph "Aborting
>                  * commands" in SAM.
>                  */
>         }
> 
>         WARN_ON_ONCE(kref_read(&cmd->cmd_kref) == 0);
> 
>         transport_lun_remove_cmd(cmd);
> 
>         transport_cmd_check_stop_to_fabric(cmd);
> 
> That means: no matter whether SCF_ACK_REF is set in the cmd or not:
> 1) fabric's aborted_task handler and a waiter woken up by aborted_task must not call target_put_sess_cmd.
> 2) a waiter woken up by aborted_task() must not access se_cmd (or tl_cmd) since target_handle_abort
>    might have released it completely meanwhile.
> 

Oh no, so xen has the same cmd lifetime issue as loop right?

And, it looks like iscsi has an issue too. I can hit both of those WARNs.

I'm ok with your patch, but is there a way to fix this in core for everyone?

It seems like something that must have worked at some point for everyone, but we broke. I'll try to get some time today and git bisect this to see if it's a regression.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: target: loop: Fix handling of aborted TMRs
  2020-07-26 18:37     ` Mike Christie
@ 2020-07-27 14:26       ` Bodo Stroesser
  2020-07-27 18:13         ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: Bodo Stroesser @ 2020-07-27 14:26 UTC (permalink / raw)
  To: Mike Christie, Martin K. Petersen, linux-scsi, target-devel

On 2020-07-26 20:37, Mike Christie wrote:
> On 7/26/20 6:02 AM, Bodo Stroesser wrote:
>> On 2020-07-26 07:16, Mike Christie wrote:
>>> On 7/15/20 11:04 AM, Bodo Stroesser wrote:
>>>> Fix:
>>>> After calling the aborted_task callback the core immediately
>>>> releases the se_cmd that represents the ABORT_TASK. The woken
>>>> up thread (tcm_loop_issue_tmr) therefore must not access se_cmd
>>>> and tl_cmd in case of aborted TMRs.
>>>
>>> The code and fix description below look ok. I didn't get the above part though. If we have TARGET_SCF_ACK_KREF set then doesn't the se_cmd and tl_cmd stay around until we do the target_put_sess_cmd in tcm_loop_issue_tmr?
>>
>> No. For an aborted ABORT_TASK, target_handle_abort is called.
>> If tas is not set, it executes this code:
>>
>>          } else {
>>                  /*
>>                   * Allow the fabric driver to unmap any resources before
>>                   * releasing the descriptor via TFO->release_cmd().
>>                   */
>>                  cmd->se_tfo->aborted_task(cmd);
>>                  if (ack_kref)
>>                          WARN_ON_ONCE(target_put_sess_cmd(cmd) != 0);
>>                  /*
>>                   * To do: establish a unit attention condition on the I_T
>>                   * nexus associated with cmd. See also the paragraph "Aborting
>>                   * commands" in SAM.
>>                   */
>>          }
>>
>>          WARN_ON_ONCE(kref_read(&cmd->cmd_kref) == 0);
>>
>>          transport_lun_remove_cmd(cmd);
>>
>>          transport_cmd_check_stop_to_fabric(cmd);
>>
>> That means: no matter whether SCF_ACK_REF is set in the cmd or not:
>> 1) fabric's aborted_task handler and a waiter woken up by aborted_task must not call target_put_sess_cmd.
>> 2) a waiter woken up by aborted_task() must not access se_cmd (or tl_cmd) since target_handle_abort
>>     might have released it completely meanwhile.
>>
> 
> Oh no, so xen has the same cmd lifetime issue as loop right?

To me it looks like xen uses nearly the same code like tcm_loop did before my patch.
There is nothing wrong with that code regarding the cmd lifetime.
The problem instead is, that the thread which started a TMR (ABORT_TASK) will sleep forever if that TMR itself is aborted by a further TMR (LUN_RESET).
This is because tcm_loop_aborted_task() misses the complete() call.
 
But if we just add the complete() call to XXXX_aborted_task(), we run into trouble because what core expects from fabric handlers is different:
1) If core calls XXXX_queue_tm_rsp(), then fabric has to release one ref count only if SCF_ACK_REF is set. Otherwise it must not.
2) If core calls XXXX_aborted_task(), then fabric must not release ref count, no matter whether SCF_ACK_REF is set.

So I decided for my patch to no longer use TARGET_SCF_ACK_KREF, since then we can handle both situation the same way.
After that it was a short step to move the data fields used by the thread after wakeup() from tl_cmd to stack, because then the woken up theqard has no need to access tl_cmd, which can be freed meanwhile.

I think, the same way to fix the problem would be fine for xen also, but I'm still wondering why the thread there does not call target_put_sess_cmd, but calls transport_generic_free_cmd.

> 
> And, it looks like iscsi has an issue too. I can hit both of those WARNs.
> 
> I'm ok with your patch, but is there a way to fix this in core for everyone?
> 
> It seems like something that must have worked at some point for everyone, but we broke. I'll try to get some time today and git bisect this to see if it's a regression.
> 

I'm wondering whether aborting an Abort ever was tested at least for these drivers.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: target: loop: Fix handling of aborted TMRs
  2020-07-27 14:26       ` Bodo Stroesser
@ 2020-07-27 18:13         ` Mike Christie
  2020-08-07  1:42           ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-07-27 18:13 UTC (permalink / raw)
  To: Bodo Stroesser, Martin K. Petersen, linux-scsi, target-devel

On 7/27/20 9:26 AM, Bodo Stroesser wrote:
> On 2020-07-26 20:37, Mike Christie wrote:
>> On 7/26/20 6:02 AM, Bodo Stroesser wrote:
>>> On 2020-07-26 07:16, Mike Christie wrote:
>>>> On 7/15/20 11:04 AM, Bodo Stroesser wrote:
>>>>> Fix:
>>>>> After calling the aborted_task callback the core immediately
>>>>> releases the se_cmd that represents the ABORT_TASK. The woken
>>>>> up thread (tcm_loop_issue_tmr) therefore must not access se_cmd
>>>>> and tl_cmd in case of aborted TMRs.
>>>>
>>>> The code and fix description below look ok. I didn't get the above part though. If we have TARGET_SCF_ACK_KREF set then doesn't the se_cmd and tl_cmd stay around until we do the target_put_sess_cmd in tcm_loop_issue_tmr?
>>>
>>> No. For an aborted ABORT_TASK, target_handle_abort is called.
>>> If tas is not set, it executes this code:
>>>
>>>          } else {
>>>                  /*
>>>                   * Allow the fabric driver to unmap any resources before
>>>                   * releasing the descriptor via TFO->release_cmd().
>>>                   */
>>>                  cmd->se_tfo->aborted_task(cmd);
>>>                  if (ack_kref)
>>>                          WARN_ON_ONCE(target_put_sess_cmd(cmd) != 0);
>>>                  /*
>>>                   * To do: establish a unit attention condition on the I_T
>>>                   * nexus associated with cmd. See also the paragraph "Aborting
>>>                   * commands" in SAM.
>>>                   */
>>>          }
>>>
>>>          WARN_ON_ONCE(kref_read(&cmd->cmd_kref) == 0);
>>>
>>>          transport_lun_remove_cmd(cmd);
>>>
>>>          transport_cmd_check_stop_to_fabric(cmd);
>>>
>>> That means: no matter whether SCF_ACK_REF is set in the cmd or not:
>>> 1) fabric's aborted_task handler and a waiter woken up by aborted_task must not call target_put_sess_cmd.
>>> 2) a waiter woken up by aborted_task() must not access se_cmd (or tl_cmd) since target_handle_abort
>>>     might have released it completely meanwhile.
>>>
>>
>> Oh no, so xen has the same cmd lifetime issue as loop right?
> 
> To me it looks like xen uses nearly the same code like tcm_loop did before my patch.
> There is nothing wrong with that code regarding the cmd lifetim> The problem instead is, that the thread which started a TMR (ABORT_TASK) will sleep forever if that TMR itself is aborted by a further TMR (LUN_RESET).
> This is because tcm_loop_aborted_task() misses the complete() call.
>  
> But if we just add the complete() call to XXXX_aborted_task(), we run into trouble because what core expects from fabric handlers is different:
> 1) If core calls XXXX_queue_tm_rsp(), then fabric has to release one ref count only if SCF_ACK_REF is set. Otherwise it must not.
> 2) If core calls XXXX_aborted_task(), then fabric must not release ref count, no matter whether SCF_ACK_REF is set.

Maybe we agree.

I was calling #2 a cmd lifetime issue. At least loop, xen and iscsi think they can access the cmd after aborted_cmd. For loop/xen we just don't hit it, because we hit #1 first. For iscsi we are.

> 
> So I decided for my patch to no longer use TARGET_SCF_ACK_KREF, since then we can handle both situation the same way.
> After that it was a short step to move the data fields used by the thread after wakeup() from tl_cmd to stack, because then the woken up theqard has no need to access tl_cmd, which can be freed meanwhile.
> 
> I think, the same way to fix the problem would be fine for xen also, but I'm still wondering why the thread there does not call target_put_sess_cmd, but calls transport_generic_free_cmd.

For xen and #1 and #2 I agree we can do a similar fix as you did for xen.

I would really like to figure out if #2 is a regression and understand what happened so we don't make the same mistake again and also fix iscsi. The problem with iscsi though is every couple kernels has a bug in this code path so git bisect is being a pain.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: target: loop: Fix handling of aborted TMRs
  2020-07-27 18:13         ` Mike Christie
@ 2020-08-07  1:42           ` Mike Christie
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Christie @ 2020-08-07  1:42 UTC (permalink / raw)
  To: Bodo Stroesser, Martin K. Petersen, linux-scsi, target-devel

On 7/27/20 1:13 PM, Mike Christie wrote:
> I would really like to figure out if #2 is a regression and understand what happened so we don't make the same mistake again and also fix iscsi. The problem with iscsi though is every couple kernels has a bug in this code path so git bisect is being a pain.

Sorry for the delay. I was able to track down the hang part of this. It turns out loop did work at some point. It worked a couple times over the years, but every X kernels we would break it again.

The last time we didn't have the hang was because we executed TMFs one at a time. We would end up always calling queue_tm_rsp since the LUN reset was always going to execute after the abort. So to fix the wake up / hang part of the bug for everyone we could revert:

commit db5b21a24e01d35495014076700efa02d6dcbb68
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Tue Nov 27 15:51:59 2018 -0800

    scsi: target/core: Use system workqueues for TMF

It's not clear how much processing TMFs for the same device in parallel helps. If it doesn't help, this would be an easy fix for everyone.

One other thing that is not clear to me is if aborted_task was always meant to be used for task management requests. I think when it was first added, it was only called for normal cmds. I couldn't tell exactly when we meant (vs it snuck in by accident) to start calling it for all commands.

If the async TMF feature helps, then I'm ok with with your fix for loop where we say the aborted_task callout was meant for all commands and go from there. We then have to do something similar in xen, and check that the other drivers didn't expect the old behavior. I think for that qla2xxx bug you mentioned, tcm_qla2xxx_aborted_task didn't expect a qla_tgt_mgmt_cmd to be passed into aborted_task so it crashed trying to access it as a qla_tgt_cmd.

What do you think?

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-08-07  1:44 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15 16:04 [PATCH] scsi: target: loop: Fix handling of aborted TMRs Bodo Stroesser
2020-07-26  5:16 ` Mike Christie
2020-07-26 11:02   ` Bodo Stroesser
2020-07-26 18:37     ` Mike Christie
2020-07-27 14:26       ` Bodo Stroesser
2020-07-27 18:13         ` Mike Christie
2020-08-07  1:42           ` Mike Christie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).