All of lore.kernel.org
 help / color / mirror / Atom feed
From: Maurizio Lombardi <mlombard@redhat.com>
To: Michael Christie <michael.christie@oracle.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org, target-devel@vger.kernel.org,
	bvanassche@acm.org
Subject: Re: [PATCH 2/2] target: iscsi: fix a race condition when aborting a task
Date: Wed, 28 Oct 2020 17:09:48 +0000	[thread overview]
Message-ID: <a936cc4e-1610-5201-5960-107689b81820@redhat.com> (raw)
In-Reply-To: <71691FED-C164-482C-B629-A8B89B81E566@oracle.com>



Dne 27. 10. 20 v 21:03 Michael Christie napsal(a):
> 
> 
>> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@oracle.com> wrote:
>>
>> On 10/27/20 8:49 AM, Maurizio Lombardi wrote:
>>> Hello Mike,
>>>
>>> Dne 22. 10. 20 v 4:42 Mike Christie napsal(a):
>>>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do:
>>>>
>>>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP.
>>>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do:
>>>>
>>>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd
>>>>
>>>> The cmd is now freed.
>>>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released.
>>>>
>>>>
>>>
>>> Thanks for the review!
>>>
>>> There are definitely some problems with task aborts and commands' refcounting *
>>> but this is a different bug than the one this patch is trying to solve (a race to list_del_init());
>>> unless you are saying that abort tasks should never be executed when the connection 
>>> is going down and we have to prevent such cases from happening at all.
>>
>> Yeah, I think if we prevent the race then we fix the refcount issue and your issue.
>> Here is a patch that is only compile tested:
>>
>> From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001
>> From: Mike Christie <michael.christie@oracle.com>
>> Date: Tue, 27 Oct 2020 12:30:53 -0500
>> Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race
>>
>> The abort and cmd stop paths can race where:
>>
>> 1. thread1 runs iscsit_release_commands_from_conn and sets
>> CMD_T_FABRIC_STOP.
>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It
>> then returns from the aborted_task callout and we finish
>> target_handle_abort and do:
>>
>> target_handle_abort -> transport_cmd_check_stop_to_fabric ->
>> lio_check_stop_free -> target_put_sess_cmd
>>
>> The cmd is now freed.
>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs
>> iscsit_free_cmd while accessing a command we just released.
>>
>> In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the
>> CMD_T_ABORTED if the driver is not cleaning up the cmd because of
>> a session shutdown. However, iscsit_release_commands_from_conn only
>> sets the CMD_T_FABRIC_STOP and does not check to see if the abort path
>> has claimed completion ownership of the command.
>>
>> This adds a check in iscsit_release_commands_from_conn so only the
>> abort or fabric stop path cleanup the command.
>> ---
>> drivers/target/iscsi/iscsi_target.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>> index f77e5ee..85027d3 100644
>> --- a/drivers/target/iscsi/iscsi_target.c
>> +++ b/drivers/target/iscsi/iscsi_target.c
>> @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>> void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>> {
>> 	spin_lock_bh(&conn->cmd_lock);
>> -	if (!list_empty(&cmd->i_conn_node) &&
>> -	    !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP))
>> +	if (!list_empty(&cmd->i_conn_node))
>> 		list_del_init(&cmd->i_conn_node);
>> 	spin_unlock_bh(&conn->cmd_lock);
>>
>> @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>
>> 		if (se_cmd->se_tfo != NULL) {
>> 			spin_lock_irq(&se_cmd->t_state_lock);
>> +			if (se_cmd->transport_state & CMD_T_ABORTED) {
>> +				/*
>> +				 * LIO's abort path owns the cleanup for this,
>> +				 * so put it back on the list and let
>> +				 * aborted_task handle it.
>> +				 */
>> +				list_add_tail(&cmd->i_conn_node,
>> +					      &conn->conn_cmd_list);
> 
> 
> That should have been a move from the tmp list back to the conn_cmd_list.

Nice, it looks simple and I will test it.
I am a bit worried there could be other possible race conditions.

Example: 

thread1: connection is going to be closed,
iscsit_release_commands_from_conn() finds a command that is about
to be aborted, re-adds it to conn_cmd_list and proceeds.
iscsit_close_connection() decreases the conn usage count and finally calls iscsit_free_conn(conn)
that destroys the conn structure.

thread2: iscsit_aborted_task() gets called and tries to lock the conn->cmd_lock spinlock, dereferencing
an invalid pointer.

Possible solutions that I can think of:

- Make iscsit_release_commands_from_conn() wait for the abort task to finish
or
- abort handler could hold a reference to the conn structure so that iscsit_close_connection()
will sleep when calling iscsit_check_conn_usage_count(conn) until abort finishes.


Maurizio

WARNING: multiple messages have this Message-ID (diff)
From: Maurizio Lombardi <mlombard@redhat.com>
To: Michael Christie <michael.christie@oracle.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org, target-devel@vger.kernel.org,
	bvanassche@acm.org
Subject: Re: [PATCH 2/2] target: iscsi: fix a race condition when aborting a task
Date: Wed, 28 Oct 2020 18:09:48 +0100	[thread overview]
Message-ID: <a936cc4e-1610-5201-5960-107689b81820@redhat.com> (raw)
In-Reply-To: <71691FED-C164-482C-B629-A8B89B81E566@oracle.com>



Dne 27. 10. 20 v 21:03 Michael Christie napsal(a):
> 
> 
>> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@oracle.com> wrote:
>>
>> On 10/27/20 8:49 AM, Maurizio Lombardi wrote:
>>> Hello Mike,
>>>
>>> Dne 22. 10. 20 v 4:42 Mike Christie napsal(a):
>>>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do:
>>>>
>>>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP.
>>>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do:
>>>>
>>>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd
>>>>
>>>> The cmd is now freed.
>>>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released.
>>>>
>>>>
>>>
>>> Thanks for the review!
>>>
>>> There are definitely some problems with task aborts and commands' refcounting *
>>> but this is a different bug than the one this patch is trying to solve (a race to list_del_init());
>>> unless you are saying that abort tasks should never be executed when the connection 
>>> is going down and we have to prevent such cases from happening at all.
>>
>> Yeah, I think if we prevent the race then we fix the refcount issue and your issue.
>> Here is a patch that is only compile tested:
>>
>> From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001
>> From: Mike Christie <michael.christie@oracle.com>
>> Date: Tue, 27 Oct 2020 12:30:53 -0500
>> Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race
>>
>> The abort and cmd stop paths can race where:
>>
>> 1. thread1 runs iscsit_release_commands_from_conn and sets
>> CMD_T_FABRIC_STOP.
>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It
>> then returns from the aborted_task callout and we finish
>> target_handle_abort and do:
>>
>> target_handle_abort -> transport_cmd_check_stop_to_fabric ->
>> lio_check_stop_free -> target_put_sess_cmd
>>
>> The cmd is now freed.
>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs
>> iscsit_free_cmd while accessing a command we just released.
>>
>> In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the
>> CMD_T_ABORTED if the driver is not cleaning up the cmd because of
>> a session shutdown. However, iscsit_release_commands_from_conn only
>> sets the CMD_T_FABRIC_STOP and does not check to see if the abort path
>> has claimed completion ownership of the command.
>>
>> This adds a check in iscsit_release_commands_from_conn so only the
>> abort or fabric stop path cleanup the command.
>> ---
>> drivers/target/iscsi/iscsi_target.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>> index f77e5ee..85027d3 100644
>> --- a/drivers/target/iscsi/iscsi_target.c
>> +++ b/drivers/target/iscsi/iscsi_target.c
>> @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>> void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>> {
>> 	spin_lock_bh(&conn->cmd_lock);
>> -	if (!list_empty(&cmd->i_conn_node) &&
>> -	    !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP))
>> +	if (!list_empty(&cmd->i_conn_node))
>> 		list_del_init(&cmd->i_conn_node);
>> 	spin_unlock_bh(&conn->cmd_lock);
>>
>> @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>
>> 		if (se_cmd->se_tfo != NULL) {
>> 			spin_lock_irq(&se_cmd->t_state_lock);
>> +			if (se_cmd->transport_state & CMD_T_ABORTED) {
>> +				/*
>> +				 * LIO's abort path owns the cleanup for this,
>> +				 * so put it back on the list and let
>> +				 * aborted_task handle it.
>> +				 */
>> +				list_add_tail(&cmd->i_conn_node,
>> +					      &conn->conn_cmd_list);
> 
> 
> That should have been a move from the tmp list back to the conn_cmd_list.

Nice, it looks simple and I will test it.
I am a bit worried there could be other possible race conditions.

Example: 

thread1: connection is going to be closed,
iscsit_release_commands_from_conn() finds a command that is about
to be aborted, re-adds it to conn_cmd_list and proceeds.
iscsit_close_connection() decreases the conn usage count and finally calls iscsit_free_conn(conn)
that destroys the conn structure.

thread2: iscsit_aborted_task() gets called and tries to lock the conn->cmd_lock spinlock, dereferencing
an invalid pointer.

Possible solutions that I can think of:

- Make iscsit_release_commands_from_conn() wait for the abort task to finish
or
- abort handler could hold a reference to the conn structure so that iscsit_close_connection()
will sleep when calling iscsit_check_conn_usage_count(conn) until abort finishes.


Maurizio


  reply	other threads:[~2020-10-28 17:09 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-07 14:53 [PATCH 0/2] fix race conditions with task aborts Maurizio Lombardi
2020-10-07 14:53 ` Maurizio Lombardi
2020-10-07 14:53 ` [PATCH 1/2] target: iscsi: prevent a race condition in iscsit_unmap_cmd() Maurizio Lombardi
2020-10-07 14:53   ` Maurizio Lombardi
2020-10-08  2:15   ` Bart Van Assche
2020-10-08  2:15     ` Bart Van Assche
2020-10-08  9:42     ` Maurizio Lombardi
2020-10-08  9:42       ` Maurizio Lombardi
2020-10-07 14:53 ` [PATCH 2/2] target: iscsi: fix a race condition when aborting a task Maurizio Lombardi
2020-10-07 14:53   ` Maurizio Lombardi
2020-10-22  2:42   ` Mike Christie
2020-10-22  2:42     ` Mike Christie
2020-10-27 13:49     ` Maurizio Lombardi
2020-10-27 13:49       ` Maurizio Lombardi
2020-10-27 17:54       ` Mike Christie
2020-10-27 17:54         ` Mike Christie
2020-10-27 20:03         ` Michael Christie
2020-10-27 20:03           ` Michael Christie
2020-10-28 17:09           ` Maurizio Lombardi [this message]
2020-10-28 17:09             ` Maurizio Lombardi
2020-10-28 20:37             ` Mike Christie
2020-10-28 20:37               ` Mike Christie
2020-11-10 21:29               ` Maurizio Lombardi
2020-11-10 21:29                 ` Maurizio Lombardi
2020-11-10 23:08                 ` Mike Christie
2020-11-10 23:08                   ` Mike Christie
2020-11-11  2:16             ` Mike Christie
2020-11-11  2:16               ` Mike Christie
2020-11-11 14:58               ` Maurizio Lombardi
2020-11-11 14:58                 ` Maurizio Lombardi
2020-11-11 15:37                 ` Michael Christie
2020-11-11 15:37                   ` Michael Christie
2020-11-11 15:48                   ` Maurizio Lombardi
2020-11-11 15:48                     ` Maurizio Lombardi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a936cc4e-1610-5201-5960-107689b81820@redhat.com \
    --to=mlombard@redhat.com \
    --cc=bvanassche@acm.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=michael.christie@oracle.com \
    --cc=target-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.