[PATCH] scsi_debug: deadlock between completions and surprise module removal

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] scsi_debug: deadlock between completions and surprise module removal
@ 2014-08-31 23:09 Douglas Gilbert
  2014-09-01 15:36 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Douglas Gilbert @ 2014-08-31 23:09 UTC (permalink / raw)
  To: SCSI development list, linux-kernel, James Bottomley, Christoph Hellwig
  Cc: Milan Broz

[-- Attachment #1: Type: text/plain, Size: 547 bytes --]

A deadlock has been reported when the completion
of SCSI commands (simulated by a timer) was surprised
by a module removal. This patch removes one half of
the offending locks around timer deletions. This fix
is applied both to stop_all_queued() which is were
the deadlock was discovered and stop_queued_cmnd()
which has very similar logic.

This patch should be applied both to the lk 3.17 tree
and Christoph's drivers-for-3.18 tree.

Tested-and-reported-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>

[-- Attachment #2: sdebug317rc2_dlock1.patch --]
[-- Type: text/x-patch, Size: 2029 bytes --]

--- a/drivers/scsi/scsi_debug.c	2014-08-26 13:24:51.646948507 -0400
+++ b/drivers/scsi/scsi_debug.c	2014-08-30 18:04:54.589226679 -0400
@@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
 		if (test_bit(k, queued_in_use_bm)) {
 			sqcp = &queued_arr[k];
 			if (cmnd == sqcp->a_cmnd) {
+				devip = (struct sdebug_dev_info *)
+					cmnd->device->hostdata;
+				if (devip)
+					atomic_dec(&devip->num_in_q);
+				sqcp->a_cmnd = NULL;
+				spin_unlock_irqrestore(&queued_arr_lock,
+						       iflags);
 				if (scsi_debug_ndelay > 0) {
 					if (sqcp->sd_hrtp)
 						hrtimer_cancel(
@@ -2755,18 +2762,13 @@ static int stop_queued_cmnd(struct scsi_
 					if (sqcp->tletp)
 						tasklet_kill(sqcp->tletp);
 				}
-				__clear_bit(k, queued_in_use_bm);
-				devip = (struct sdebug_dev_info *)
-					cmnd->device->hostdata;
-				if (devip)
-					atomic_dec(&devip->num_in_q);
-				sqcp->a_cmnd = NULL;
-				break;
+				clear_bit(k, queued_in_use_bm);
+				return 1;
 			}
 		}
 	}
 	spin_unlock_irqrestore(&queued_arr_lock, iflags);
-	return (k < qmax) ? 1 : 0;
+	return 0;
 }
 
 /* Deletes (stops) timers or tasklets of all queued commands */
@@ -2782,6 +2784,13 @@ static void stop_all_queued(void)
 		if (test_bit(k, queued_in_use_bm)) {
 			sqcp = &queued_arr[k];
 			if (sqcp->a_cmnd) {
+				devip = (struct sdebug_dev_info *)
+					sqcp->a_cmnd->device->hostdata;
+				if (devip)
+					atomic_dec(&devip->num_in_q);
+				sqcp->a_cmnd = NULL;
+				spin_unlock_irqrestore(&queued_arr_lock,
+						       iflags);
 				if (scsi_debug_ndelay > 0) {
 					if (sqcp->sd_hrtp)
 						hrtimer_cancel(
@@ -2794,12 +2803,8 @@ static void stop_all_queued(void)
 					if (sqcp->tletp)
 						tasklet_kill(sqcp->tletp);
 				}
-				__clear_bit(k, queued_in_use_bm);
-				devip = (struct sdebug_dev_info *)
-					sqcp->a_cmnd->device->hostdata;
-				if (devip)
-					atomic_dec(&devip->num_in_q);
-				sqcp->a_cmnd = NULL;
+				clear_bit(k, queued_in_use_bm);
+				spin_lock_irqsave(&queued_arr_lock, iflags);
 			}
 		}
 	}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-08-31 23:09 [PATCH] scsi_debug: deadlock between completions and surprise module removal Douglas Gilbert
@ 2014-09-01 15:36 ` Christoph Hellwig
  2014-09-01 19:52   ` Douglas Gilbert
  2014-09-05  5:24 ` Christoph Hellwig
  2014-09-25 12:13 ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-01 15:36 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: SCSI development list, linux-kernel, James Bottomley,
	Christoph Hellwig, Milan Broz

> --- a/drivers/scsi/scsi_debug.c	2014-08-26 13:24:51.646948507 -0400
> +++ b/drivers/scsi/scsi_debug.c	2014-08-30 18:04:54.589226679 -0400
> @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
>  		if (test_bit(k, queued_in_use_bm)) {
>  			sqcp = &queued_arr[k];
>  			if (cmnd == sqcp->a_cmnd) {
> +				devip = (struct sdebug_dev_info *)
> +					cmnd->device->hostdata;
> +				if (devip)
> +					atomic_dec(&devip->num_in_q);
> +				sqcp->a_cmnd = NULL;

Why would the hostdata every be NULL here?  We should never
call ->slave_destroy on a device that has outstanding commands.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-01 15:36 ` Christoph Hellwig
@ 2014-09-01 19:52   ` Douglas Gilbert
  0 siblings, 0 replies; 14+ messages in thread
From: Douglas Gilbert @ 2014-09-01 19:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz

On 14-09-01 11:36 AM, Christoph Hellwig wrote:
>> --- a/drivers/scsi/scsi_debug.c	2014-08-26 13:24:51.646948507 -0400
>> +++ b/drivers/scsi/scsi_debug.c	2014-08-30 18:04:54.589226679 -0400
>> @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
>>   		if (test_bit(k, queued_in_use_bm)) {
>>   			sqcp = &queued_arr[k];
>>   			if (cmnd == sqcp->a_cmnd) {
>> +				devip = (struct sdebug_dev_info *)
>> +					cmnd->device->hostdata;
>> +				if (devip)
>> +					atomic_dec(&devip->num_in_q);
>> +				sqcp->a_cmnd = NULL;
>
> Why would the hostdata every be NULL here?  We should never
> call ->slave_destroy on a device that has outstanding commands.

To your first question, perhaps it could not be. In
the scsi_debug driver almost all uses of 'devip'
check for NULL, so it may not always have been true.

'rmmod scsi_debug' would lead to scsi_debug_exit()
being called and that called stop_all_queued() which
deadlocked with a command completion, or worse command
commencement. scsi_debug_exit() looks a bit racy: the
first thing it does is stop_all_queued() but has
anything been done to stop new commands being issued?
Later scsi_debug_exit() brings down the hosts.

This patch is primarily a bug fix for the lk 3.17 series and
the code you highlight was simply moved from being under a
lock to outside that lock. I didn't want to be too creative,
it's too easy to slip up and upset the management.

Enhancements could be sent to your drivers-for-3.18 tree. Rob
Elliott already has a few in mind, to improve performance.
Removing all 'devip' NULL checks would also improve performance,
and readability.

Doug Gilbert

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-08-31 23:09 [PATCH] scsi_debug: deadlock between completions and surprise module removal Douglas Gilbert
  2014-09-01 15:36 ` Christoph Hellwig
@ 2014-09-05  5:24 ` Christoph Hellwig
  2014-09-05 13:56   ` Douglas Gilbert
  2014-09-25 12:13 ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-05  5:24 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: SCSI development list, linux-kernel, James Bottomley,
	Christoph Hellwig, Milan Broz

Can I get another review for this one?

On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
> A deadlock has been reported when the completion
> of SCSI commands (simulated by a timer) was surprised
> by a module removal. This patch removes one half of
> the offending locks around timer deletions. This fix
> is applied both to stop_all_queued() which is were
> the deadlock was discovered and stop_queued_cmnd()
> which has very similar logic.
> 
> This patch should be applied both to the lk 3.17 tree
> and Christoph's drivers-for-3.18 tree.
> 
> Tested-and-reported-by: Milan Broz <gmazyland@gmail.com>
> Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>

> --- a/drivers/scsi/scsi_debug.c	2014-08-26 13:24:51.646948507 -0400
> +++ b/drivers/scsi/scsi_debug.c	2014-08-30 18:04:54.589226679 -0400
> @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
>  		if (test_bit(k, queued_in_use_bm)) {
>  			sqcp = &queued_arr[k];
>  			if (cmnd == sqcp->a_cmnd) {
> +				devip = (struct sdebug_dev_info *)
> +					cmnd->device->hostdata;
> +				if (devip)
> +					atomic_dec(&devip->num_in_q);
> +				sqcp->a_cmnd = NULL;
> +				spin_unlock_irqrestore(&queued_arr_lock,
> +						       iflags);
>  				if (scsi_debug_ndelay > 0) {
>  					if (sqcp->sd_hrtp)
>  						hrtimer_cancel(
> @@ -2755,18 +2762,13 @@ static int stop_queued_cmnd(struct scsi_
>  					if (sqcp->tletp)
>  						tasklet_kill(sqcp->tletp);
>  				}
> -				__clear_bit(k, queued_in_use_bm);
> -				devip = (struct sdebug_dev_info *)
> -					cmnd->device->hostdata;
> -				if (devip)
> -					atomic_dec(&devip->num_in_q);
> -				sqcp->a_cmnd = NULL;
> -				break;
> +				clear_bit(k, queued_in_use_bm);
> +				return 1;
>  			}
>  		}
>  	}
>  	spin_unlock_irqrestore(&queued_arr_lock, iflags);
> -	return (k < qmax) ? 1 : 0;
> +	return 0;
>  }
>  
>  /* Deletes (stops) timers or tasklets of all queued commands */
> @@ -2782,6 +2784,13 @@ static void stop_all_queued(void)
>  		if (test_bit(k, queued_in_use_bm)) {
>  			sqcp = &queued_arr[k];
>  			if (sqcp->a_cmnd) {
> +				devip = (struct sdebug_dev_info *)
> +					sqcp->a_cmnd->device->hostdata;
> +				if (devip)
> +					atomic_dec(&devip->num_in_q);
> +				sqcp->a_cmnd = NULL;
> +				spin_unlock_irqrestore(&queued_arr_lock,
> +						       iflags);
>  				if (scsi_debug_ndelay > 0) {
>  					if (sqcp->sd_hrtp)
>  						hrtimer_cancel(
> @@ -2794,12 +2803,8 @@ static void stop_all_queued(void)
>  					if (sqcp->tletp)
>  						tasklet_kill(sqcp->tletp);
>  				}
> -				__clear_bit(k, queued_in_use_bm);
> -				devip = (struct sdebug_dev_info *)
> -					sqcp->a_cmnd->device->hostdata;
> -				if (devip)
> -					atomic_dec(&devip->num_in_q);
> -				sqcp->a_cmnd = NULL;
> +				clear_bit(k, queued_in_use_bm);
> +				spin_lock_irqsave(&queued_arr_lock, iflags);
>  			}
>  		}
>  	}

---end quoted text---

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-05  5:24 ` Christoph Hellwig
@ 2014-09-05 13:56   ` Douglas Gilbert
  2014-09-05 15:25     ` Bart Van Assche
  0 siblings, 1 reply; 14+ messages in thread
From: Douglas Gilbert @ 2014-09-05 13:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz

With scsi-mq I think many LLDs probably have a new
race possibility between a surprise rmmod of the LLD
and another thread presenting a new command at about
the same time (or another thread's command completing
around that time). Does anything above the LLD stop
this happening?

Looking at mpt3sas and hpsa module exit calls, they don't
seem to guard against this possibility.

The test is pretty easy: build the LLD as a module, load
it and fire up a multi-thread, libaio fio test on one or
more devices (SSDs would probably be good) on that LLD.
While the test is running, do 'rmmod LLD'.

Doug Gilbert


On 14-09-05 01:24 AM, Christoph Hellwig wrote:
> Can I get another review for this one?
>
> On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
>> A deadlock has been reported when the completion
>> of SCSI commands (simulated by a timer) was surprised
>> by a module removal. This patch removes one half of
>> the offending locks around timer deletions. This fix
>> is applied both to stop_all_queued() which is were
>> the deadlock was discovered and stop_queued_cmnd()
>> which has very similar logic.
>>
>> This patch should be applied both to the lk 3.17 tree
>> and Christoph's drivers-for-3.18 tree.
>>
>> Tested-and-reported-by: Milan Broz <gmazyland@gmail.com>
>> Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
>
>> --- a/drivers/scsi/scsi_debug.c	2014-08-26 13:24:51.646948507 -0400
>> +++ b/drivers/scsi/scsi_debug.c	2014-08-30 18:04:54.589226679 -0400
>> @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
>>   		if (test_bit(k, queued_in_use_bm)) {
>>   			sqcp = &queued_arr[k];
>>   			if (cmnd == sqcp->a_cmnd) {
>> +				devip = (struct sdebug_dev_info *)
>> +					cmnd->device->hostdata;
>> +				if (devip)
>> +					atomic_dec(&devip->num_in_q);
>> +				sqcp->a_cmnd = NULL;
>> +				spin_unlock_irqrestore(&queued_arr_lock,
>> +						       iflags);
>>   				if (scsi_debug_ndelay > 0) {
>>   					if (sqcp->sd_hrtp)
>>   						hrtimer_cancel(
>> @@ -2755,18 +2762,13 @@ static int stop_queued_cmnd(struct scsi_
>>   					if (sqcp->tletp)
>>   						tasklet_kill(sqcp->tletp);
>>   				}
>> -				__clear_bit(k, queued_in_use_bm);
>> -				devip = (struct sdebug_dev_info *)
>> -					cmnd->device->hostdata;
>> -				if (devip)
>> -					atomic_dec(&devip->num_in_q);
>> -				sqcp->a_cmnd = NULL;
>> -				break;
>> +				clear_bit(k, queued_in_use_bm);
>> +				return 1;
>>   			}
>>   		}
>>   	}
>>   	spin_unlock_irqrestore(&queued_arr_lock, iflags);
>> -	return (k < qmax) ? 1 : 0;
>> +	return 0;
>>   }
>>
>>   /* Deletes (stops) timers or tasklets of all queued commands */
>> @@ -2782,6 +2784,13 @@ static void stop_all_queued(void)
>>   		if (test_bit(k, queued_in_use_bm)) {
>>   			sqcp = &queued_arr[k];
>>   			if (sqcp->a_cmnd) {
>> +				devip = (struct sdebug_dev_info *)
>> +					sqcp->a_cmnd->device->hostdata;
>> +				if (devip)
>> +					atomic_dec(&devip->num_in_q);
>> +				sqcp->a_cmnd = NULL;
>> +				spin_unlock_irqrestore(&queued_arr_lock,
>> +						       iflags);
>>   				if (scsi_debug_ndelay > 0) {
>>   					if (sqcp->sd_hrtp)
>>   						hrtimer_cancel(
>> @@ -2794,12 +2803,8 @@ static void stop_all_queued(void)
>>   					if (sqcp->tletp)
>>   						tasklet_kill(sqcp->tletp);
>>   				}
>> -				__clear_bit(k, queued_in_use_bm);
>> -				devip = (struct sdebug_dev_info *)
>> -					sqcp->a_cmnd->device->hostdata;
>> -				if (devip)
>> -					atomic_dec(&devip->num_in_q);
>> -				sqcp->a_cmnd = NULL;
>> +				clear_bit(k, queued_in_use_bm);
>> +				spin_lock_irqsave(&queued_arr_lock, iflags);
>>   			}
>>   		}
>>   	}
>
> ---end quoted text---
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-05 13:56   ` Douglas Gilbert
@ 2014-09-05 15:25     ` Bart Van Assche
  2014-09-06 14:40       ` Douglas Gilbert
  0 siblings, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2014-09-05 15:25 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz

On 09/05/14 15:56, Douglas Gilbert wrote:
> With scsi-mq I think many LLDs probably have a new
> race possibility between a surprise rmmod of the LLD
> and another thread presenting a new command at about
> the same time (or another thread's command completing
> around that time). Does anything above the LLD stop
> this happening?
>
> Looking at mpt3sas and hpsa module exit calls, they don't
> seem to guard against this possibility.
>
> The test is pretty easy: build the LLD as a module, load
> it and fire up a multi-thread, libaio fio test on one or
> more devices (SSDs would probably be good) on that LLD.
> While the test is running, do 'rmmod LLD'.

An LLD must call scsi_remove_host() directly or indirectly from the 
module cleanup path. scsi_remove_host() triggers a call to 
blk_cleanup_queue(). That last function sets the flag QUEUE_FLAG_DYING 
which prevents that new I/O is queued and waits until previously queued 
requests have finished before returning.

Bart.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-05 15:25     ` Bart Van Assche
@ 2014-09-06 14:40       ` Douglas Gilbert
  2014-09-06 14:44         ` Christoph Hellwig
  2014-09-08  9:11         ` Bart Van Assche
  0 siblings, 2 replies; 14+ messages in thread
From: Douglas Gilbert @ 2014-09-06 14:40 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz

On 14-09-05 11:25 AM, Bart Van Assche wrote:
> On 09/05/14 15:56, Douglas Gilbert wrote:
>> With scsi-mq I think many LLDs probably have a new
>> race possibility between a surprise rmmod of the LLD
>> and another thread presenting a new command at about
>> the same time (or another thread's command completing
>> around that time). Does anything above the LLD stop
>> this happening?
>>
>> Looking at mpt3sas and hpsa module exit calls, they don't
>> seem to guard against this possibility.
>>
>> The test is pretty easy: build the LLD as a module, load
>> it and fire up a multi-thread, libaio fio test on one or
>> more devices (SSDs would probably be good) on that LLD.
>> While the test is running, do 'rmmod LLD'.
>
> An LLD must call scsi_remove_host() directly or indirectly from the module
> cleanup path. scsi_remove_host() triggers a call to blk_cleanup_queue(). That
> last function sets the flag QUEUE_FLAG_DYING which prevents that new I/O is
> queued and waits until previously queued requests have finished before returning.

And they do call scsi_remove_host(). But they do that toward
the end of their clean-up. The problem that I observed has
already happened before that.

IOW I think the QUEUE_FLAG_DYING state needs to be set and
acknowledged as the first order of business by the code
that implements 'rmmod LLD'.

Doug Gilbert

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-06 14:40       ` Douglas Gilbert
@ 2014-09-06 14:44         ` Christoph Hellwig
  2014-09-08  9:11         ` Bart Van Assche
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-06 14:44 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Bart Van Assche, Christoph Hellwig, SCSI development list,
	linux-kernel, James Bottomley, Milan Broz

On Sat, Sep 06, 2014 at 10:40:06AM -0400, Douglas Gilbert wrote:
> And they do call scsi_remove_host(). But they do that toward
> the end of their clean-up. The problem that I observed has
> already happened before that.
> 
> IOW I think the QUEUE_FLAG_DYING state needs to be set and
> acknowledged as the first order of business by the code
> that implements 'rmmod LLD'.

That's how driver should implement their ->remove driver callback:

foo_remove()
{
	scsi_remove_host()
	< actual cleanup here>

	scsi_host_put();
}

if a driver doesn't do that, thats a bug in the driver which needs
fixing.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-06 14:40       ` Douglas Gilbert
  2014-09-06 14:44         ` Christoph Hellwig
@ 2014-09-08  9:11         ` Bart Van Assche
  2014-09-08 15:07           ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2014-09-08  9:11 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz

On 09/06/14 16:40, Douglas Gilbert wrote:
> On 14-09-05 11:25 AM, Bart Van Assche wrote:
>> An LLD must call scsi_remove_host() directly or indirectly from the
>> module
>> cleanup path. scsi_remove_host() triggers a call to
>> blk_cleanup_queue(). That
>> last function sets the flag QUEUE_FLAG_DYING which prevents that new
>> I/O is
>> queued and waits until previously queued requests have finished before
>> returning.
>
> And they do call scsi_remove_host(). But they do that toward
> the end of their clean-up. The problem that I observed has
> already happened before that.
>
> IOW I think the QUEUE_FLAG_DYING state needs to be set and
> acknowledged as the first order of business by the code
> that implements 'rmmod LLD'.

Hello Doug,

In the scsi_debug driver scsi_remove_host() is called from inside the 
sdebug_driver_remove() callback function. Unless I have missed something 
it is not guaranteed that that callback function is invoked before 
unloading of the scsi_debug driver has finished. I think most of the 
code in sdebug_driver_remove() should be moved to sdebug_remove_adapter().

Bart.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-08  9:11         ` Bart Van Assche
@ 2014-09-08 15:07           ` Christoph Hellwig
  2014-09-08 20:31             ` Douglas Gilbert
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-08 15:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dgilbert, Christoph Hellwig, SCSI development list, linux-kernel,
	James Bottomley, Milan Broz

On Mon, Sep 08, 2014 at 11:11:23AM +0200, Bart Van Assche wrote:
> Hello Doug,
> 
> In the scsi_debug driver scsi_remove_host() is called from inside the
> sdebug_driver_remove() callback function. Unless I have missed something it
> is not guaranteed that that callback function is invoked before unloading of
> the scsi_debug driver has finished. I think most of the code in
> sdebug_driver_remove() should be moved to sdebug_remove_adapter().

I'm not sure that's right.  scsi_debug uses the driver mode in a
slightly unusual way, and includes both the bus driver, device and
device driver.

sdebug_driver_remove is a bus method, but as we don't have driver methods
should act very much like all other _remove callbacks.

sdebug_remove_adapter is more a "bus-level" function that calls into
the driver model to unbind devices from the driver.

But we defintively shouldn't stop and free queued command before we
fully remove the hosts.  As far as I can tell the stop_all_queued
call can be entirely removed from the remove path, as the midlayer
will take care of waiting for all commands to return, and the
free_all_queued should be after all hosts are unregistered.

Something like the (untested) patch below would do the trick.
We'd still need Dougs patch for the EH case, though.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index d19c0e3..d022c2f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3983,14 +3983,13 @@ static void __exit scsi_debug_exit(void)
 {
 	int k = scsi_debug_add_host;

-	stop_all_queued();
-	free_all_queued();
 	for (; k; k--)
 		sdebug_remove_adapter();
 	driver_unregister(&sdebug_driverfs_driver);
 	bus_unregister(&pseudo_lld_bus);
 	root_device_unregister(pseudo_primary);

+	free_all_queued();
 	if (dif_storep)
 		vfree(dif_storep);

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-08 15:07           ` Christoph Hellwig
@ 2014-09-08 20:31             ` Douglas Gilbert
  2014-09-09 15:30               ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Douglas Gilbert @ 2014-09-08 20:31 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz

On 14-09-08 11:07 AM, Christoph Hellwig wrote:
> On Mon, Sep 08, 2014 at 11:11:23AM +0200, Bart Van Assche wrote:
>> Hello Doug,
>>
>> In the scsi_debug driver scsi_remove_host() is called from inside the
>> sdebug_driver_remove() callback function. Unless I have missed something it
>> is not guaranteed that that callback function is invoked before unloading of
>> the scsi_debug driver has finished. I think most of the code in
>> sdebug_driver_remove() should be moved to sdebug_remove_adapter().
>
> I'm not sure that's right.  scsi_debug uses the driver mode in a
> slightly unusual way, and includes both the bus driver, device and
> device driver.
>
> sdebug_driver_remove is a bus method, but as we don't have driver methods
> should act very much like all other _remove callbacks.
>
> sdebug_remove_adapter is more a "bus-level" function that calls into
> the driver model to unbind devices from the driver.
>
> But we defintively shouldn't stop and free queued command before we
> fully remove the hosts.  As far as I can tell the stop_all_queued
> call can be entirely removed from the remove path, as the midlayer
> will take care of waiting for all commands to return, and the
> free_all_queued should be after all hosts are unregistered.

stop_all_queued() is doing hrtimer_cancel(), del_timer_sync()
or tasklet_kill() on all the scsi_cmnd objects that are
"in play". Unless another mechanism calls the .eh_abort_handler
entry point reliably on each "in play" command then the module
cannot be removed. That is because some timer expiry callbacks
are pending.

> Something like the (untested) patch below would do the trick.
> We'd still need Dougs patch for the EH case, though.
>
> diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
> index d19c0e3..d022c2f 100644
> --- a/drivers/scsi/scsi_debug.c
> +++ b/drivers/scsi/scsi_debug.c
> @@ -3983,14 +3983,13 @@ static void __exit scsi_debug_exit(void)
>   {
>   	int k = scsi_debug_add_host;
>
> -	stop_all_queued();
> -	free_all_queued();
>   	for (; k; k--)
>   		sdebug_remove_adapter();
>   	driver_unregister(&sdebug_driverfs_driver);
>   	bus_unregister(&pseudo_lld_bus);
>   	root_device_unregister(pseudo_primary);
>
> +	free_all_queued();
>   	if (dif_storep)
>   		vfree(dif_storep);
>
> --

The only other call to stop_all_queued() is from the
.eh_host_reset_handler entry point.

Doug Gilbert



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-08 20:31             ` Douglas Gilbert
@ 2014-09-09 15:30               ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-09 15:30 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Christoph Hellwig, Bart Van Assche, SCSI development list,
	linux-kernel, James Bottomley, Milan Broz

On Mon, Sep 08, 2014 at 04:31:01PM -0400, Douglas Gilbert wrote:
> stop_all_queued() is doing hrtimer_cancel(), del_timer_sync()
> or tasklet_kill() on all the scsi_cmnd objects that are
> "in play". Unless another mechanism calls the .eh_abort_handler
> entry point reliably on each "in play" command then the module
> cannot be removed. That is because some timer expiry callbacks
> are pending.

scsi_remove_host disabled all queueing of new commands, so all these
timers and tasklets will eventually expire or run and allow the
removal to complete.  Of course this could be sped up by cancelling
them, but you don't need the sync version

> >Something like the (untested) patch below would do the trick.
> >We'd still need Dougs patch for the EH case, though.

> 
> The only other call to stop_all_queued() is from the
> .eh_host_reset_handler entry point.

True, but you also have stop_queued_cmnd for a abort case which
also needs that treatment.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-08-31 23:09 [PATCH] scsi_debug: deadlock between completions and surprise module removal Douglas Gilbert
  2014-09-01 15:36 ` Christoph Hellwig
  2014-09-05  5:24 ` Christoph Hellwig
@ 2014-09-25 12:13 ` Christoph Hellwig
  2014-10-03 18:16   ` Elliott, Robert (Server Storage)
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-25 12:13 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: SCSI development list, linux-kernel, James Bottomley,
	Christoph Hellwig, Milan Broz

Review ping again?

While I think the shutdown code in scsi_debug needs a bit more of an
overhault I'd really like to include the fix at least for 3.18 and
3.17-stable now that we have missed the 3.17 window.

On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
> A deadlock has been reported when the completion
> of SCSI commands (simulated by a timer) was surprised
> by a module removal. This patch removes one half of
> the offending locks around timer deletions. This fix
> is applied both to stop_all_queued() which is were
> the deadlock was discovered and stop_queued_cmnd()
> which has very similar logic.
> 
> This patch should be applied both to the lk 3.17 tree
> and Christoph's drivers-for-3.18 tree.
> 
> Tested-and-reported-by: Milan Broz <gmazyland@gmail.com>
> Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>

> --- a/drivers/scsi/scsi_debug.c	2014-08-26 13:24:51.646948507 -0400
> +++ b/drivers/scsi/scsi_debug.c	2014-08-30 18:04:54.589226679 -0400
> @@ -2743,6 +2743,13 @@ static int stop_queued_cmnd(struct scsi_
>  		if (test_bit(k, queued_in_use_bm)) {
>  			sqcp = &queued_arr[k];
>  			if (cmnd == sqcp->a_cmnd) {
> +				devip = (struct sdebug_dev_info *)
> +					cmnd->device->hostdata;
> +				if (devip)
> +					atomic_dec(&devip->num_in_q);
> +				sqcp->a_cmnd = NULL;
> +				spin_unlock_irqrestore(&queued_arr_lock,
> +						       iflags);
>  				if (scsi_debug_ndelay > 0) {
>  					if (sqcp->sd_hrtp)
>  						hrtimer_cancel(
> @@ -2755,18 +2762,13 @@ static int stop_queued_cmnd(struct scsi_
>  					if (sqcp->tletp)
>  						tasklet_kill(sqcp->tletp);
>  				}
> -				__clear_bit(k, queued_in_use_bm);
> -				devip = (struct sdebug_dev_info *)
> -					cmnd->device->hostdata;
> -				if (devip)
> -					atomic_dec(&devip->num_in_q);
> -				sqcp->a_cmnd = NULL;
> -				break;
> +				clear_bit(k, queued_in_use_bm);
> +				return 1;
>  			}
>  		}
>  	}
>  	spin_unlock_irqrestore(&queued_arr_lock, iflags);
> -	return (k < qmax) ? 1 : 0;
> +	return 0;
>  }
>  
>  /* Deletes (stops) timers or tasklets of all queued commands */
> @@ -2782,6 +2784,13 @@ static void stop_all_queued(void)
>  		if (test_bit(k, queued_in_use_bm)) {
>  			sqcp = &queued_arr[k];
>  			if (sqcp->a_cmnd) {
> +				devip = (struct sdebug_dev_info *)
> +					sqcp->a_cmnd->device->hostdata;
> +				if (devip)
> +					atomic_dec(&devip->num_in_q);
> +				sqcp->a_cmnd = NULL;
> +				spin_unlock_irqrestore(&queued_arr_lock,
> +						       iflags);
>  				if (scsi_debug_ndelay > 0) {
>  					if (sqcp->sd_hrtp)
>  						hrtimer_cancel(
> @@ -2794,12 +2803,8 @@ static void stop_all_queued(void)
>  					if (sqcp->tletp)
>  						tasklet_kill(sqcp->tletp);
>  				}
> -				__clear_bit(k, queued_in_use_bm);
> -				devip = (struct sdebug_dev_info *)
> -					sqcp->a_cmnd->device->hostdata;
> -				if (devip)
> -					atomic_dec(&devip->num_in_q);
> -				sqcp->a_cmnd = NULL;
> +				clear_bit(k, queued_in_use_bm);
> +				spin_lock_irqsave(&queued_arr_lock, iflags);
>  			}
>  		}
>  	}

---end quoted text---

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH] scsi_debug: deadlock between completions and surprise module removal
  2014-09-25 12:13 ` Christoph Hellwig
@ 2014-10-03 18:16   ` Elliott, Robert (Server Storage)
  0 siblings, 0 replies; 14+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-10-03 18:16 UTC (permalink / raw)
  To: Christoph Hellwig, Douglas Gilbert
  Cc: SCSI development list, linux-kernel, James Bottomley, Milan Broz



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, 25 September, 2014 7:13 AM
> To: Douglas Gilbert
> Cc: SCSI development list; linux-kernel; James Bottomley; Christoph Hellwig;
> Milan Broz
> Subject: Re: [PATCH] scsi_debug: deadlock between completions and surprise
> module removal
> 
> Review ping again?
> 
> While I think the shutdown code in scsi_debug needs a bit more of an
> overhault I'd really like to include the fix at least for 3.18 and
> 3.17-stable now that we have missed the 3.17 window.
> 
> On Sun, Aug 31, 2014 at 07:09:59PM -0400, Douglas Gilbert wrote:
> > A deadlock has been reported when the completion
> > of SCSI commands (simulated by a timer) was surprised
> > by a module removal. This patch removes one half of
> > the offending locks around timer deletions. This fix
> > is applied both to stop_all_queued() which is were
> > the deadlock was discovered and stop_queued_cmnd()
> > which has very similar logic.
> >
> > This patch should be applied both to the lk 3.17 tree
> > and Christoph's drivers-for-3.18 tree.
> >
> > Tested-and-reported-by: Milan Broz <gmazyland@gmail.com>
> > Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>

Reviewed-by: Robert Elliott <elliott@hp.com>




^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-10-03 18:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-31 23:09 [PATCH] scsi_debug: deadlock between completions and surprise module removal Douglas Gilbert
2014-09-01 15:36 ` Christoph Hellwig
2014-09-01 19:52   ` Douglas Gilbert
2014-09-05  5:24 ` Christoph Hellwig
2014-09-05 13:56   ` Douglas Gilbert
2014-09-05 15:25     ` Bart Van Assche
2014-09-06 14:40       ` Douglas Gilbert
2014-09-06 14:44         ` Christoph Hellwig
2014-09-08  9:11         ` Bart Van Assche
2014-09-08 15:07           ` Christoph Hellwig
2014-09-08 20:31             ` Douglas Gilbert
2014-09-09 15:30               ` Christoph Hellwig
2014-09-25 12:13 ` Christoph Hellwig
2014-10-03 18:16   ` Elliott, Robert (Server Storage)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.