* Performance drop on SCSI hard disk
@ 2011-05-10  6:40 Alex,Shi
  2011-05-10  6:52 ` Shaohua Li
  2011-05-12 20:29 ` Jens Axboe
  0 siblings, 2 replies; 17+ messages in thread
From: Alex,Shi @ 2011-05-10  6:40 UTC (permalink / raw)
  To: jaxboe, James.Bottomley; +Cc: Li, Shaohua, linux-kernel

commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
scsi_run_queue() to punt all requests on starved_list devices to
kblockd. Yes, like Jens mentioned, performance on slow SCSI disks is
hurt here.  :) (Intel SSDs aren't affected here.)

In our testing on a 12 SAS disk JBOD, fio write with the sync ioengine
drops about 30~40% in throughput, fio randread/randwrite with the aio
ioengine drops about 20%/50%, and the fio mmap test is hurt as well. 

With the following debug patch, the performance is fully recovered in
our testing. But without the REENTER flag, in some corner cases (for
example, a device that keeps getting blocked and then unblocked
repeatedly), __blk_run_queue() may recursively call scsi_run_queue()
and cause a kernel stack overflow (a toy illustration of the guard idea
is sketched below).
I don't know the details of the block layer code, so I'm just wondering
why SCSI needs the REENTER flag here. :) 
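As a rough illustration of what I mean (just a user-space toy, not kernel
code; the real REENTER flag logic surely differs in detail), a re-entry
guard bounds this kind of recursion by deferring nested runs instead of
recursing into them:

#include <stdio.h>

static int use_guard;		/* 1: model a REENTER-style re-entry guard */
static int queue_running;
static int deferred;		/* nested runs punted (to kblockd, in the kernel) */
static int max_depth;

static void run_queue(int depth);

/* Models scsi_request_fn() on a device that keeps getting blocked and
 * unblocked, so it immediately wants the queue run again. */
static void request_fn(int depth)
{
	if (depth > max_depth)
		max_depth = depth;
	if (depth < 100)	/* capped here; the real kernel stack is not */
		run_queue(depth + 1);
}

static void run_queue(int depth)
{
	if (use_guard && queue_running) {
		deferred++;	/* already running: defer instead of recursing */
		return;
	}
	queue_running = 1;
	request_fn(depth);
	queue_running = 0;
}

int main(void)
{
	for (use_guard = 0; use_guard <= 1; use_guard++) {
		queue_running = deferred = max_depth = 0;
		run_queue(1);
		printf("guard=%d: max depth %d, deferred runs %d\n",
		       use_guard, max_depth, deferred);
	}
	return 0;
}

Without the guard the depth grows with every blocked/unblocked cycle; with
it, the nested run is simply deferred.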

James, do you have any thoughts on this?

Regards
Alex
======
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9901b8..24e8589 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -432,8 +432,11 @@ static void scsi_run_queue(struct request_queue *q)
 				       &shost->starved_list);
 			continue;
 		}
-
-		blk_run_queue_async(sdev->request_queue);
+		spin_unlock(shost->host_lock);
+		spin_lock(sdev->request_queue->queue_lock);
+		__blk_run_queue(sdev->request_queue);
+		spin_unlock(sdev->request_queue->queue_lock);
+		spin_lock(shost->host_lock);
 	}
 	/* put any unprocessed entries back */
 	list_splice(&starved_list, &shost->starved_list);








^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-10  6:40 Performance drop on SCSI hard disk Alex,Shi
@ 2011-05-10  6:52 ` Shaohua Li
  2011-05-12  0:36   ` Shaohua Li
  2011-05-12 20:29 ` Jens Axboe
  1 sibling, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-05-10  6:52 UTC (permalink / raw)
  To: Shi, Alex, jaxboe; +Cc: jaxboe, James.Bottomley, linux-kernel, tim.c.chen

On Tue, May 10, 2011 at 02:40:00PM +0800, Shi, Alex wrote:
> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> scsi_run_queue() to punt all requests on starved_list devices to
> kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> hurt here.  :) (Intel SSD isn't effected here)
> 
> In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> about 20%/50% throughput. and fio mmap testing was hurt also. 
> 
> With the following debug patch, the performance can be totally recovered
> in our testing. But without REENTER flag here, in some corner case, like
> a device is keeping blocked and then unblocked repeatedly,
> __blk_run_queue() may recursively call scsi_run_queue() and then cause
> kernel stack overflow. 
> I don't know details of block device driver, just wondering why on scsi
> need the REENTER flag here. :) 
Hi Jens,
I want to add some more analysis of the problem to help understand the issue.
This is a severe problem; hopefully we can solve it before the 2.6.39 release.

Basically the offending patch has some problems:
a. more context switches
b. __blk_run_queue() loses the recursion detection. In some cases it can be
  called recursively, for example blk_run_queue() in scsi_run_queue()
c. a fairness issue. Say we have sdev1 and sdev2 in starved_list, then run
scsi_run_queue():
  1. both sdev1 and sdev2 are removed from starved_list
  2. the async queue run dispatches sdev1's request; the host becomes busy again
  3. sdev1 is added back into starved_list. Since starved_list is empty,
     sdev1 ends up at the head
  4. the async queue run checks sdev2 and adds it at the tail of starved_list
  In this scenario, sdev1 is serviced first next time, so sdev2 is starved
  (a toy model of this ordering is sketched below).
  In our test, 12 disks are connected to one HBA card. Each disk's queue depth
  is 64, while the HBA queue depth is 128. Our test does sync writes, so the
  block size is big and just a few requests can occupy one disk's bandwidth.
  Saturating one disk while starving the others hurts total throughput.
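Just to illustrate the ordering (a user-space toy model, not the SCSI code;
the single-slot host and the plain FIFO are simplifications of
shost->starved_list and the kblockd run):

#include <stdio.h>
#include <string.h>

/* A trivial FIFO standing in for shost->starved_list. */
static const char *starved[4];
static int nstarved;

static void add_tail(const char *dev)
{
	starved[nstarved++] = dev;
}

int main(void)
{
	add_tail("sdev1");
	add_tail("sdev2");

	for (int round = 1; round <= 3; round++) {
		/* scsi_run_queue(): splice everything onto a local list */
		const char *local[4];
		int n = nstarved;

		memcpy(local, starved, sizeof(starved));
		nstarved = 0;

		/*
		 * The host only has capacity for the first device's requests.
		 * Its dispatch makes the host busy again, so it goes straight
		 * back onto the (currently empty) starved list ...
		 */
		printf("round %d: dispatching %s\n", round, local[0]);
		add_tail(local[0]);

		/*
		 * ... and only afterwards are the remaining devices found
		 * busy and appended behind it.
		 */
		for (int i = 1; i < n; i++)
			add_tail(local[i]);
	}
	/* Prints "dispatching sdev1" every round; sdev2 never runs. */
	return 0;
}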

Problem a isn't a big problem in our test (we did observe higher context
switching, about 4x more context switches), but I guess it will hurt high-end
systems.

Problem b is easy to fix for SCSI: just replace blk_run_queue() with
blk_run_queue_async() in scsi_run_queue() (roughly the one-line sketch below).
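Something like this at the tail of scsi_run_queue(); an untested sketch
against the code quoted elsewhere in this thread, not a posted patch:

 	/* put any unprocessed entries back */
 	list_splice(&starved_list, &shost->starved_list);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
-	blk_run_queue(q);
+	blk_run_queue_async(q);
 }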

Problem c is the root cause of the regression. I have a patch for it.
Basically, with my patch we don't remove the sdev from starved_list in
scsi_run_queue(); we delay the removal until scsi_request_fn() really
dispatches a request for the starved device. My patch can fully fix
the regression.

But given problem a, we should revert the patch (or apply Alex's patch if
stack overflow isn't a big deal here), so I didn't post my patch here.
Problem c actually exists even if we revert the patch (async execution can
still happen, with a small chance), but it is not that severe. I can post a
fix after the offending patch is reverted.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-10  6:52 ` Shaohua Li
@ 2011-05-12  0:36   ` Shaohua Li
  0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2011-05-12  0:36 UTC (permalink / raw)
  To: Shi, Alex, jaxboe; +Cc: James.Bottomley, linux-kernel, tim.c.chen, akpm

2011/5/10 Shaohua Li <shaohua.li@intel.com>:
> On Tue, May 10, 2011 at 02:40:00PM +0800, Shi, Alex wrote:
>> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
>> scsi_run_queue() to punt all requests on starved_list devices to
>> kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
>> hurt here.  :) (Intel SSD isn't effected here)
>>
>> In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
>> about 30~40% throughput, fio randread/randwrite with aio ioengine drop
>> about 20%/50% throughput. and fio mmap testing was hurt also.
>>
>> With the following debug patch, the performance can be totally recovered
>> in our testing. But without REENTER flag here, in some corner case, like
>> a device is keeping blocked and then unblocked repeatedly,
>> __blk_run_queue() may recursively call scsi_run_queue() and then cause
>> kernel stack overflow.
>> I don't know details of block device driver, just wondering why on scsi
>> need the REENTER flag here. :)
> Hi Jens,
> I want to add more analysis about the problem to help understand the issue.
> This is a severe problem, hopefully we can solve it before 2.6.39 release.
>
> Basically the offending patch has some problems:
> a. more context switches
> b. __blk_run_queue losts the recursive detection. In some cases, it could be
>  called recursively. For example, blk_run_queue in scsi_run_queue()
> c. fairness issue. Say we have sdev1 and sdev2 in starved_list. Then run
> scsi_run_queue():
>  1. remove both sdev1 and sdev2 from starved_list
>  2. async queue dispatches sdev1's request. host becames busy again.
>  3. add sdev1 into starved_list again. Since starved_list is empty,
>     sdev1 is added at the head
>  4. async queue checks sdev2, and add sdev2 into starved_list tail.
>  In this scenario, sdev1 is serviced first next time, so sdev2 is starved.
>  In our test, 12 disks connect to one HBA card. disk's queue depth is 64,
>  while HBA card queue depth is 128. Our test does sync write, so block size
>  is big, so just several requests can occurpy one disk's bandwidth. Saturate
>  one disk but starve others will hurt total throughput.
>
> problem a isn't a big problem in our test (we did observe higher context
> switch, which is about 4x more CS), but guess it will hurt high end system.
>
> problem b is easy to fix for scsi. just replace blk_run_queue with
> blk_run_queue_async in scsi_run_queue
>
> problem c is the root cause for the regression. I had a patch for it.
> Basically with my patch, we don't remove sdev from starved_list in
> scsi_run_queue, but we delay the removal in scsi_request_fn() till a
> starved device really dispatches a request. My patch can fully fix
> the regression.
>
> But given problem a, we should revert the patch (or Alex's patch if stack
> overflow isn't a big deal here), so I didn't post my patch here. Problem
> c actually exsists even we revert the patch (we could do async execution
> with small chance), but not that severe. I can post a fix after the
> patch is reverted.
Hi,
Pinging again; I hope this issue hasn't been missed. The regression is big,
from 20% to 50% of IO throughput.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-10  6:40 Performance drop on SCSI hard disk Alex,Shi
  2011-05-10  6:52 ` Shaohua Li
@ 2011-05-12 20:29 ` Jens Axboe
  2011-05-13  0:11   ` Alex,Shi
  2011-05-13  0:48   ` Shaohua Li
  1 sibling, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2011-05-12 20:29 UTC (permalink / raw)
  To: Alex,Shi; +Cc: James.Bottomley, Li, Shaohua, linux-kernel

On 2011-05-10 08:40, Alex,Shi wrote:
> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> scsi_run_queue() to punt all requests on starved_list devices to
> kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> hurt here.  :) (Intel SSD isn't effected here)
> 
> In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> about 20%/50% throughput. and fio mmap testing was hurt also. 
> 
> With the following debug patch, the performance can be totally recovered
> in our testing. But without REENTER flag here, in some corner case, like
> a device is keeping blocked and then unblocked repeatedly,
> __blk_run_queue() may recursively call scsi_run_queue() and then cause
> kernel stack overflow. 
> I don't know details of block device driver, just wondering why on scsi
> need the REENTER flag here. :) 

This is a problem and we should do something about it for 2.6.39. I knew
that there would be cases where the async offload would cause a
performance degradation, but not to the extent that you are reporting.
Must be hitting the pathological case.

I can think of two scenarios where it could potentially recurse:

- request_fn is entered and ends up requeueing IO, then runs the queue at
  the end. Rinse, repeat.
- Running starved list from request_fn, two (or more) devices could
  alternately recurse.

The first case should be fairly easy to handle. The second one is
already handled by the local list splice.

Looking at the code, is this a real scenario? The only potential recursion I
see is:

scsi_request_fn()
        scsi_dispatch_cmd()
                scsi_queue_insert()
                        __scsi_queue_insert()
                                scsi_run_queue()

Why are we even re-running the queue immediately on a BUSY condition?
Should only be needed if we have zero pending commands from this
particular queue, and for that particular case async run is just fine
since it's a rare condition (or performance would suck already).

And it should only really be needed for the 'q' being passed in, not the
others. Something like the below.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0bac91e..0b01c1f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
  */
 #define SCSI_QUEUE_DELAY	3
 
-static void scsi_run_queue(struct request_queue *q);
+static void scsi_run_queue_async(struct request_queue *q);
 
 /*
  * Function:	scsi_unprep_request()
@@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	blk_requeue_request(q, cmd->request);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	scsi_run_queue(q);
+	scsi_run_queue_async(q);
 
 	return 0;
 }
@@ -391,13 +391,14 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
  * Purpose:	Select a proper request queue to serve next
  *
  * Arguments:	q	- last request's queue
+ * 		async	- prevent potential request_fn recurse by running async
  *
  * Returns:     Nothing
  *
  * Notes:	The previous command was completely finished, start
  *		a new one if possible.
  */
-static void scsi_run_queue(struct request_queue *q)
+static void __scsi_run_queue(struct request_queue *q, bool async)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
@@ -438,13 +439,30 @@ static void scsi_run_queue(struct request_queue *q)
 			continue;
 		}
 
-		blk_run_queue_async(sdev->request_queue);
+		spin_unlock(shost->host_lock);
+		spin_lock(sdev->request_queue->queue_lock);
+		__blk_run_queue(sdev->request_queue);
+		spin_unlock(sdev->request_queue->queue_lock);
+		spin_lock(shost->host_lock);
 	}
 	/* put any unprocessed entries back */
 	list_splice(&starved_list, &shost->starved_list);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
-	blk_run_queue(q);
+	if (async)
+		blk_run_queue_async(q);
+	else
+		blk_run_queue(q);
+}
+
+static void scsi_run_queue(struct request_queue *q)
+{
+	__scsi_run_queue(q, false);
+}
+
+static void scsi_run_queue_async(struct request_queue *q)
+{
+	__scsi_run_queue(q, true);
 }
 
 /*

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-12 20:29 ` Jens Axboe
@ 2011-05-13  0:11   ` Alex,Shi
  2011-05-13  0:48   ` Shaohua Li
  1 sibling, 0 replies; 17+ messages in thread
From: Alex,Shi @ 2011-05-13  0:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: James.Bottomley, Li, Shaohua, linux-kernel

On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> On 2011-05-10 08:40, Alex,Shi wrote:
> > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > scsi_run_queue() to punt all requests on starved_list devices to
> > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > hurt here.  :) (Intel SSD isn't effected here)
> > 
> > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > about 20%/50% throughput. and fio mmap testing was hurt also. 
> > 
> > With the following debug patch, the performance can be totally recovered
> > in our testing. But without REENTER flag here, in some corner case, like
> > a device is keeping blocked and then unblocked repeatedly,
> > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > kernel stack overflow. 
> > I don't know details of block device driver, just wondering why on scsi
> > need the REENTER flag here. :) 
> 
> This is a problem and we should do something about it for 2.6.39. I knew
> that there would be cases where the async offload would cause a
> performance degredation, but not to the extent that you are reporting.
> Must be hitting the pathological case.
> 
> I can think of two scenarios where it could potentially recurse:
> 
> - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
>   repeat.
> - Running starved list from request_fn, two (or more) devices could
>   alternately recurse.
> 
> The first case should be fairly easy to handle. The second one is
> already handled by the local list splice.
> 
> Looking at the code, is this a real scenario? Only potential recurse I
> see is:
> 
> scsi_request_fn()
>         scsi_dispatch_cmd()
>                 scsi_queue_insert()
>                         __scsi_queue_insert()
>                                 scsi_run_queue()
> 
> Why are we even re-running the queue immediately on a BUSY condition?
> Should only be needed if we have zero pending commands from this
> particular queue, and for that particular case async run is just fine
> since it's a rare condition (or performance would suck already).

Yeah, this is the correct way to fix it. Let me try the patch on our
machine. 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-12 20:29 ` Jens Axboe
  2011-05-13  0:11   ` Alex,Shi
@ 2011-05-13  0:48   ` Shaohua Li
  2011-05-13  3:01     ` Shaohua Li
  1 sibling, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-05-13  0:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Shi, Alex, James.Bottomley, linux-kernel

On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> On 2011-05-10 08:40, Alex,Shi wrote:
> > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > scsi_run_queue() to punt all requests on starved_list devices to
> > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > hurt here.  :) (Intel SSD isn't effected here)
> > 
> > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > about 20%/50% throughput. and fio mmap testing was hurt also. 
> > 
> > With the following debug patch, the performance can be totally recovered
> > in our testing. But without REENTER flag here, in some corner case, like
> > a device is keeping blocked and then unblocked repeatedly,
> > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > kernel stack overflow. 
> > I don't know details of block device driver, just wondering why on scsi
> > need the REENTER flag here. :) 
> 
> This is a problem and we should do something about it for 2.6.39. I knew
> that there would be cases where the async offload would cause a
> performance degredation, but not to the extent that you are reporting.
> Must be hitting the pathological case.
Async offload is expected to increase context switches. But the real root
cause of the issue is the fairness problem; please see my previous email.

> I can think of two scenarios where it could potentially recurse:
> 
> - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
>   repeat.
> - Running starved list from request_fn, two (or more) devices could
>   alternately recurse.
> 
> The first case should be fairly easy to handle. The second one is
> already handled by the local list splice.
This isn't true as far as I can see. If you unlock host_lock in
scsi_run_queue(), other CPUs can add sdevs to the starved device list again.
In a recursive call of scsi_run_queue(), the starved device list might not be
empty, so the local list_splice() doesn't help (see the annotated excerpt
below).
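To make the window concrete, here is the __scsi_run_queue() loop body from
the patch earlier in this thread, with my comments marking where host_lock is
dropped (the comments are mine; the code is as posted):

		spin_unlock(shost->host_lock);
		/*
		 * host_lock dropped: other CPUs can put sdevs (including this
		 * one) back onto shost->starved_list from here on ...
		 */
		spin_lock(sdev->request_queue->queue_lock);
		__blk_run_queue(sdev->request_queue);
		/*
		 * ... and __blk_run_queue() -> scsi_request_fn() can reach
		 * scsi_run_queue() again, which then sees a non-empty
		 * starved_list despite the earlier local list_splice().
		 */
		spin_unlock(sdev->request_queue->queue_lock);
		spin_lock(shost->host_lock);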

> 
> Looking at the code, is this a real scenario? Only potential recurse I
> see is:
> 
> scsi_request_fn()
>         scsi_dispatch_cmd()
>                 scsi_queue_insert()
>                         __scsi_queue_insert()
>                                 scsi_run_queue()
> 
> Why are we even re-running the queue immediately on a BUSY condition?
> Should only be needed if we have zero pending commands from this
> particular queue, and for that particular case async run is just fine
> since it's a rare condition (or performance would suck already).
> 
> And it should only really be needed for the 'q' being passed in, not the
> others. Something like the below.
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0bac91e..0b01c1f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
>   */
>  #define SCSI_QUEUE_DELAY	3
>  
> -static void scsi_run_queue(struct request_queue *q);
> +static void scsi_run_queue_async(struct request_queue *q);
>  
>  /*
>   * Function:	scsi_unprep_request()
> @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
>  	blk_requeue_request(q, cmd->request);
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  
> -	scsi_run_queue(q);
> +	scsi_run_queue_async(q);
So you could still recursively run into the starved list. Do you want to put
the whole __scsi_run_queue() into a workqueue?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-13  0:48   ` Shaohua Li
@ 2011-05-13  3:01     ` Shaohua Li
  2011-05-16  8:04       ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-05-13  3:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Shi, Alex, James.Bottomley, linux-kernel

On Fri, 2011-05-13 at 08:48 +0800, Shaohua Li wrote:
> On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> > On 2011-05-10 08:40, Alex,Shi wrote:
> > > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > > scsi_run_queue() to punt all requests on starved_list devices to
> > > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > > hurt here.  :) (Intel SSD isn't effected here)
> > > 
> > > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > > about 20%/50% throughput. and fio mmap testing was hurt also. 
> > > 
> > > With the following debug patch, the performance can be totally recovered
> > > in our testing. But without REENTER flag here, in some corner case, like
> > > a device is keeping blocked and then unblocked repeatedly,
> > > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > > kernel stack overflow. 
> > > I don't know details of block device driver, just wondering why on scsi
> > > need the REENTER flag here. :) 
> > 
> > This is a problem and we should do something about it for 2.6.39. I knew
> > that there would be cases where the async offload would cause a
> > performance degredation, but not to the extent that you are reporting.
> > Must be hitting the pathological case.
> async offload is expected to increase context switch. But the real root
> cause of the issue is fairness issue. Please see my previous email.
> 
> > I can think of two scenarios where it could potentially recurse:
> > 
> > - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> >   repeat.
> > - Running starved list from request_fn, two (or more) devices could
> >   alternately recurse.
> > 
> > The first case should be fairly easy to handle. The second one is
> > already handled by the local list splice.
> this isn't true to me. if you unlock host_lock in scsi_run_queue, other
> cpus can add sdev to the starved device list again. In the recursive
> call of scsi_run_queue, the starved device list might not be empty. So
> the local list_splice doesn't help.
> 
> > 
> > Looking at the code, is this a real scenario? Only potential recurse I
> > see is:
> > 
> > scsi_request_fn()
> >         scsi_dispatch_cmd()
> >                 scsi_queue_insert()
> >                         __scsi_queue_insert()
> >                                 scsi_run_queue()
> > 
> > Why are we even re-running the queue immediately on a BUSY condition?
> > Should only be needed if we have zero pending commands from this
> > particular queue, and for that particular case async run is just fine
> > since it's a rare condition (or performance would suck already).
> > 
> > And it should only really be needed for the 'q' being passed in, not the
> > others. Something like the below.
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 0bac91e..0b01c1f 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
> >   */
> >  #define SCSI_QUEUE_DELAY	3
> >  
> > -static void scsi_run_queue(struct request_queue *q);
> > +static void scsi_run_queue_async(struct request_queue *q);
> >  
> >  /*
> >   * Function:	scsi_unprep_request()
> > @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> >  	blk_requeue_request(q, cmd->request);
> >  	spin_unlock_irqrestore(q->queue_lock, flags);
> >  
> > -	scsi_run_queue(q);
> > +	scsi_run_queue_async(q);
> so you could still recursivly run into starved list. Do you want to put
> the whole __scsi_run_queue into workqueue?
What I mean is that the current sdev (and other devices too) can still be
added to the starved list, so async execution only for the current q isn't
enough; we'd better put the whole __scsi_run_queue() into a workqueue.
Something like the below on top of yours, untested. Not sure if there are
other recursive cases.

Index: linux/drivers/scsi/scsi_lib.c
===================================================================
--- linux.orig/drivers/scsi/scsi_lib.c	2011-05-13 10:32:28.000000000 +0800
+++ linux/drivers/scsi/scsi_lib.c	2011-05-13 10:52:51.000000000 +0800
@@ -74,8 +74,6 @@ struct kmem_cache *scsi_sdb_cache;
  */
 #define SCSI_QUEUE_DELAY	3
 
-static void scsi_run_queue_async(struct request_queue *q);
-
 /*
  * Function:	scsi_unprep_request()
  *
@@ -161,7 +159,7 @@ static int __scsi_queue_insert(struct sc
 	blk_requeue_request(q, cmd->request);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	scsi_run_queue_async(q);
+	kblockd_schedule_work(q, &device->requeue_work);
 
 	return 0;
 }
@@ -391,14 +389,13 @@ static inline int scsi_host_is_busy(stru
  * Purpose:	Select a proper request queue to serve next
  *
  * Arguments:	q	- last request's queue
- * 		async	- prevent potential request_fn recurse by running async
  *
  * Returns:     Nothing
  *
  * Notes:	The previous command was completely finished, start
  *		a new one if possible.
  */
-static void __scsi_run_queue(struct request_queue *q, bool async)
+static void scsi_run_queue(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
@@ -449,20 +446,17 @@ static void __scsi_run_queue(struct requ
 	list_splice(&starved_list, &shost->starved_list);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
-	if (async)
-		blk_run_queue_async(q);
-	else
-		blk_run_queue(q);
+	blk_run_queue(q);
 }
 
-static void scsi_run_queue(struct request_queue *q)
+void scsi_requeue_run_queue(struct work_struct *work)
 {
-	__scsi_run_queue(q, false);
-}
+	struct scsi_device *sdev;
+	struct request_queue *q;
 
-static void scsi_run_queue_async(struct request_queue *q)
-{
-	__scsi_run_queue(q, true);
+	sdev = container_of(work, struct scsi_device, requeue_work);
+	q = sdev->request_queue;
+	scsi_run_queue(q);
 }
 
 /*
Index: linux/drivers/scsi/scsi_scan.c
===================================================================
--- linux.orig/drivers/scsi/scsi_scan.c	2011-05-13 10:44:09.000000000 +0800
+++ linux/drivers/scsi/scsi_scan.c	2011-05-13 10:45:41.000000000 +0800
@@ -242,6 +242,7 @@ static struct scsi_device *scsi_alloc_sd
 	int display_failure_msg = 1, ret;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	extern void scsi_evt_thread(struct work_struct *work);
+	extern void scsi_requeue_run_queue(struct work_struct *work);
 
 	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
 		       GFP_ATOMIC);
@@ -264,6 +265,7 @@ static struct scsi_device *scsi_alloc_sd
 	INIT_LIST_HEAD(&sdev->event_list);
 	spin_lock_init(&sdev->list_lock);
 	INIT_WORK(&sdev->event_work, scsi_evt_thread);
+	INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);
 
 	sdev->sdev_gendev.parent = get_device(&starget->dev);
 	sdev->sdev_target = starget;
Index: linux/include/scsi/scsi_device.h
===================================================================
--- linux.orig/include/scsi/scsi_device.h	2011-05-13 10:36:31.000000000 +0800
+++ linux/include/scsi/scsi_device.h	2011-05-13 10:40:46.000000000 +0800
@@ -169,6 +169,7 @@ struct scsi_device {
 				sdev_dev;
 
 	struct execute_work	ew; /* used to get process context on put */
+	struct work_struct	requeue_work;
 
 	struct scsi_dh_data	*scsi_dh_data;
 	enum scsi_device_state sdev_state;




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-13  3:01     ` Shaohua Li
@ 2011-05-16  8:04       ` Shaohua Li
  2011-05-16  8:37         ` Alex,Shi
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-05-16  8:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Shi, Alex, James.Bottomley, linux-kernel

On Fri, 2011-05-13 at 11:01 +0800, Shaohua Li wrote:
> On Fri, 2011-05-13 at 08:48 +0800, Shaohua Li wrote:
> > On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> > > On 2011-05-10 08:40, Alex,Shi wrote:
> > > > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > > > scsi_run_queue() to punt all requests on starved_list devices to
> > > > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > > > hurt here.  :) (Intel SSD isn't effected here)
> > > > 
> > > > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > > > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > > > about 20%/50% throughput. and fio mmap testing was hurt also. 
> > > > 
> > > > With the following debug patch, the performance can be totally recovered
> > > > in our testing. But without REENTER flag here, in some corner case, like
> > > > a device is keeping blocked and then unblocked repeatedly,
> > > > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > > > kernel stack overflow. 
> > > > I don't know details of block device driver, just wondering why on scsi
> > > > need the REENTER flag here. :) 
> > > 
> > > This is a problem and we should do something about it for 2.6.39. I knew
> > > that there would be cases where the async offload would cause a
> > > performance degredation, but not to the extent that you are reporting.
> > > Must be hitting the pathological case.
> > async offload is expected to increase context switch. But the real root
> > cause of the issue is fairness issue. Please see my previous email.
> > 
> > > I can think of two scenarios where it could potentially recurse:
> > > 
> > > - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> > >   repeat.
> > > - Running starved list from request_fn, two (or more) devices could
> > >   alternately recurse.
> > > 
> > > The first case should be fairly easy to handle. The second one is
> > > already handled by the local list splice.
> > this isn't true to me. if you unlock host_lock in scsi_run_queue, other
> > cpus can add sdev to the starved device list again. In the recursive
> > call of scsi_run_queue, the starved device list might not be empty. So
> > the local list_splice doesn't help.
> > 
> > > 
> > > Looking at the code, is this a real scenario? Only potential recurse I
> > > see is:
> > > 
> > > scsi_request_fn()
> > >         scsi_dispatch_cmd()
> > >                 scsi_queue_insert()
> > >                         __scsi_queue_insert()
> > >                                 scsi_run_queue()
> > > 
> > > Why are we even re-running the queue immediately on a BUSY condition?
> > > Should only be needed if we have zero pending commands from this
> > > particular queue, and for that particular case async run is just fine
> > > since it's a rare condition (or performance would suck already).
> > > 
> > > And it should only really be needed for the 'q' being passed in, not the
> > > others. Something like the below.
> > > 
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index 0bac91e..0b01c1f 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
> > >   */
> > >  #define SCSI_QUEUE_DELAY	3
> > >  
> > > -static void scsi_run_queue(struct request_queue *q);
> > > +static void scsi_run_queue_async(struct request_queue *q);
> > >  
> > >  /*
> > >   * Function:	scsi_unprep_request()
> > > @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> > >  	blk_requeue_request(q, cmd->request);
> > >  	spin_unlock_irqrestore(q->queue_lock, flags);
> > >  
> > > -	scsi_run_queue(q);
> > > +	scsi_run_queue_async(q);
> > so you could still recursivly run into starved list. Do you want to put
> > the whole __scsi_run_queue into workqueue?
> what I mean is current sdev (other devices too) can still be added into
> starved list, so only does async execute for current q isn't enough,
> we'd better put whole __scsi_run_queue into workqueue. something like
> below on top of yours, untested. Not sure if there are other recursive
> cases.
I verified that the regression can be fully fixed by your patch (with my
suggested fix to avoid the race). Can we put a formal patch upstream?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-16  8:04       ` Shaohua Li
@ 2011-05-16  8:37         ` Alex,Shi
  2011-05-17  6:09           ` Alex,Shi
  0 siblings, 1 reply; 17+ messages in thread
From: Alex,Shi @ 2011-05-16  8:37 UTC (permalink / raw)
  To: Li, Shaohua; +Cc: Jens Axboe, James.Bottomley, linux-kernel


> > what I mean is current sdev (other devices too) can still be added into
> > starved list, so only does async execute for current q isn't enough,
> > we'd better put whole __scsi_run_queue into workqueue. something like
> > below on top of yours, untested. Not sure if there are other recursive
> > cases.
> verified the regression can be fully fixed by your patch (with my
> suggested fix to avoid race). Can we put a formal patch upstream?

Yes, we tested Jens's patch alone and with Shaohua's patch on top. Both of
them recover SAS disk performance. Now I am testing them on SSD disks with
kbuild and fio cases; in theory, they will both work. 

> 
> Thanks,
> Shaohua
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-16  8:37         ` Alex,Shi
@ 2011-05-17  6:09           ` Alex,Shi
  2011-05-17  7:20             ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Alex,Shi @ 2011-05-17  6:09 UTC (permalink / raw)
  To: Li, Shaohua; +Cc: Jens Axboe, James.Bottomley, linux-kernel

On Mon, 2011-05-16 at 16:37 +0800, Alex,Shi wrote:
> > > what I mean is current sdev (other devices too) can still be added into
> > > starved list, so only does async execute for current q isn't enough,
> > > we'd better put whole __scsi_run_queue into workqueue. something like
> > > below on top of yours, untested. Not sure if there are other recursive
> > > cases.
> > verified the regression can be fully fixed by your patch (with my
> > suggested fix to avoid race). Can we put a formal patch upstream?
> 
> Yes, we tested Jens patch alone and plus Shaohua's patch too. Both of
> them recovered SAS disk performance too. Now I am testing them on SSD
> disk with kbuild and fio cases, In theory, they will both work. 
> 

As expected, both patches have no effect on the SSD disks in the JBOD. 

> > 
> > Thanks,
> > Shaohua
> > 
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-17  6:09           ` Alex,Shi
@ 2011-05-17  7:20             ` Jens Axboe
  2011-05-19  8:26               ` Alex,Shi
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2011-05-17  7:20 UTC (permalink / raw)
  To: Alex,Shi; +Cc: Li, Shaohua, James.Bottomley, linux-kernel

On 2011-05-17 08:09, Alex,Shi wrote:
> On Mon, 2011-05-16 at 16:37 +0800, Alex,Shi wrote:
>>>> what I mean is current sdev (other devices too) can still be added into
>>>> starved list, so only does async execute for current q isn't enough,
>>>> we'd better put whole __scsi_run_queue into workqueue. something like
>>>> below on top of yours, untested. Not sure if there are other recursive
>>>> cases.
>>> verified the regression can be fully fixed by your patch (with my
>>> suggested fix to avoid race). Can we put a formal patch upstream?
>>
>> Yes, we tested Jens patch alone and plus Shaohua's patch too. Both of
>> them recovered SAS disk performance too. Now I am testing them on SSD
>> disk with kbuild and fio cases, In theory, they will both work. 
>>
> 
> As expected, both patches has no effect on SSD disks in JBD. 

I will queue up the combined patch, it looks fine from here as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-17  7:20             ` Jens Axboe
@ 2011-05-19  8:26               ` Alex,Shi
  2011-05-19  8:47                 ` Alex,Shi
  2011-05-19 18:27                 ` Jens Axboe
  0 siblings, 2 replies; 17+ messages in thread
From: Alex,Shi @ 2011-05-19  8:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Li, Shaohua, James.Bottomley, linux-kernel


> I will queue up the combined patch, it looks fine from here as well.
> 

When I had some time to study Jens's and Shaohua's patches today, I found a
simpler way to resolve the re-enter issue on starved_list. Following Jens's
idea, we can just punt the starved_list devices to kblockd if we come from
__scsi_queue_insert(). 
It resolves the re-enter issue and fully recovers performance, and it does
not need a work_struct in every scsi_device. The logic/code also looks a bit
simpler. 
What's your opinion on this? 


Signed-off-by: Alex Shi <alex.shi@intel.com>
---
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ec1803a..de7c569 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -74,6 +74,8 @@ struct kmem_cache *scsi_sdb_cache;
  */
 #define SCSI_QUEUE_DELAY	3
 
+static void scsi_run_queue_async(struct request_queue *q);
+
 /*
  * Function:	scsi_unprep_request()
  *
@@ -159,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	blk_requeue_request(q, cmd->request);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	kblockd_schedule_work(q, &device->requeue_work);
+	scsi_run_queue_async(q);
 
 	return 0;
 }
@@ -395,7 +397,7 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
  * Notes:	The previous command was completely finished, start
  *		a new one if possible.
  */
-static void scsi_run_queue(struct request_queue *q)
+static void __scsi_run_queue(struct request_queue *q, bool async)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
@@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue *q)
 				       &shost->starved_list);
 			continue;
 		}
-
-		spin_unlock(shost->host_lock);
-		spin_lock(sdev->request_queue->queue_lock);
-		__blk_run_queue(sdev->request_queue);
-		spin_unlock(sdev->request_queue->queue_lock);
-		spin_lock(shost->host_lock);
+		if (async)
+			blk_run_queue_async(sdev->request_queue);
+		else {
+			spin_unlock(shost->host_lock);
+			spin_lock(sdev->request_queue->queue_lock);
+			__blk_run_queue(sdev->request_queue);
+			spin_unlock(sdev->request_queue->queue_lock);
+			spin_lock(shost->host_lock);
+		}
 	}
 	/* put any unprocessed entries back */
 	list_splice(&starved_list, &shost->starved_list);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
-	blk_run_queue(q);
+	if (async)
+		blk_run_queue_async(q);
+	else
+		blk_run_queue(q);
 }
 
-void scsi_requeue_run_queue(struct work_struct *work)
+static void scsi_run_queue(struct request_queue *q)
 {
-	struct scsi_device *sdev;
-	struct request_queue *q;
-
-	sdev = container_of(work, struct scsi_device, requeue_work);
-	q = sdev->request_queue;
-	scsi_run_queue(q);
+	__scsi_run_queue(q, false);
 }
 
+static void scsi_run_queue_async(struct request_queue *q)
+{
+	__scsi_run_queue(q, true);
+}
 /*
  * Function:	scsi_requeue_command()
  *
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 58584dc..087821f 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -242,7 +242,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	int display_failure_msg = 1, ret;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	extern void scsi_evt_thread(struct work_struct *work);
-	extern void scsi_requeue_run_queue(struct work_struct *work);
 
 	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
 		       GFP_ATOMIC);
@@ -265,7 +264,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	INIT_LIST_HEAD(&sdev->event_list);
 	spin_lock_init(&sdev->list_lock);
 	INIT_WORK(&sdev->event_work, scsi_evt_thread);
-	INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);
 
 	sdev->sdev_gendev.parent = get_device(&starget->dev);
 	sdev->sdev_target = starget;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index dd82e02..2d3ec50 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -169,7 +169,6 @@ struct scsi_device {
 				sdev_dev;
 
 	struct execute_work	ew; /* used to get process context on put */
-	struct work_struct	requeue_work;
 
 	struct scsi_dh_data	*scsi_dh_data;
 	enum scsi_device_state sdev_state;



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-19  8:26               ` Alex,Shi
@ 2011-05-19  8:47                 ` Alex,Shi
  2011-05-19 18:27                 ` Jens Axboe
  1 sibling, 0 replies; 17+ messages in thread
From: Alex,Shi @ 2011-05-19  8:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Li, Shaohua, James.Bottomley, linux-kernel, Chen, Tim C

On Thu, 2011-05-19 at 16:26 +0800, Alex,Shi wrote:
> > I will queue up the combined patch, it looks fine from here as well.
> > 
> 
> When I have some time to study Jens and shaohua's patch today. I find a
> simpler way to resolved the re-enter issue on starved_list. Following
> Jens' idea, we can just put the starved_list device into kblockd if it
> come from __scsi_queue_insert(). 
> It can resolve the re-enter issue and recover performance totally, and
> need not a work_struct in every scsi_device. The logic/code also looks a
> bit simpler. 
> What's your opinion of this? 

BTW, Jens, this patch is against the latest kernel, 2.6.39. 
> 
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index ec1803a..de7c569 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -74,6 +74,8 @@ struct kmem_cache *scsi_sdb_cache;
>   */
>  #define SCSI_QUEUE_DELAY	3
>  
> +static void scsi_run_queue_async(struct request_queue *q);
> +
>  /*
>   * Function:	scsi_unprep_request()
>   *
> @@ -159,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
>  	blk_requeue_request(q, cmd->request);
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  
> -	kblockd_schedule_work(q, &device->requeue_work);
> +	scsi_run_queue_async(q);
>  
>  	return 0;
>  }
> @@ -395,7 +397,7 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>   * Notes:	The previous command was completely finished, start
>   *		a new one if possible.
>   */
> -static void scsi_run_queue(struct request_queue *q)
> +static void __scsi_run_queue(struct request_queue *q, bool async)
>  {
>  	struct scsi_device *sdev = q->queuedata;
>  	struct Scsi_Host *shost;
> @@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue *q)
>  				       &shost->starved_list);
>  			continue;
>  		}
> -
> -		spin_unlock(shost->host_lock);
> -		spin_lock(sdev->request_queue->queue_lock);
> -		__blk_run_queue(sdev->request_queue);
> -		spin_unlock(sdev->request_queue->queue_lock);
> -		spin_lock(shost->host_lock);
> +		if (async)
> +			blk_run_queue_async(sdev->request_queue);
> +		else {
> +			spin_unlock(shost->host_lock);
> +			spin_lock(sdev->request_queue->queue_lock);
> +			__blk_run_queue(sdev->request_queue);
> +			spin_unlock(sdev->request_queue->queue_lock);
> +			spin_lock(shost->host_lock);
> +		}
>  	}
>  	/* put any unprocessed entries back */
>  	list_splice(&starved_list, &shost->starved_list);
>  	spin_unlock_irqrestore(shost->host_lock, flags);
>  
> -	blk_run_queue(q);
> +	if (async)
> +		blk_run_queue_async(q);
> +	else
> +		blk_run_queue(q);
>  }
>  
> -void scsi_requeue_run_queue(struct work_struct *work)
> +static void scsi_run_queue(struct request_queue *q)
>  {
> -	struct scsi_device *sdev;
> -	struct request_queue *q;
> -
> -	sdev = container_of(work, struct scsi_device, requeue_work);
> -	q = sdev->request_queue;
> -	scsi_run_queue(q);
> +	__scsi_run_queue(q, false);
>  }
>  
> +static void scsi_run_queue_async(struct request_queue *q)
> +{
> +	__scsi_run_queue(q, true);
> +}
>  /*
>   * Function:	scsi_requeue_command()
>   *
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 58584dc..087821f 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -242,7 +242,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
>  	int display_failure_msg = 1, ret;
>  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
>  	extern void scsi_evt_thread(struct work_struct *work);
> -	extern void scsi_requeue_run_queue(struct work_struct *work);
>  
>  	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
>  		       GFP_ATOMIC);
> @@ -265,7 +264,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
>  	INIT_LIST_HEAD(&sdev->event_list);
>  	spin_lock_init(&sdev->list_lock);
>  	INIT_WORK(&sdev->event_work, scsi_evt_thread);
> -	INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);
>  
>  	sdev->sdev_gendev.parent = get_device(&starget->dev);
>  	sdev->sdev_target = starget;
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index dd82e02..2d3ec50 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -169,7 +169,6 @@ struct scsi_device {
>  				sdev_dev;
>  
>  	struct execute_work	ew; /* used to get process context on put */
> -	struct work_struct	requeue_work;
>  
>  	struct scsi_dh_data	*scsi_dh_data;
>  	enum scsi_device_state sdev_state;
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-19  8:26               ` Alex,Shi
  2011-05-19  8:47                 ` Alex,Shi
@ 2011-05-19 18:27                 ` Jens Axboe
  2011-05-20  0:22                   ` Alex,Shi
  1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2011-05-19 18:27 UTC (permalink / raw)
  To: Alex,Shi; +Cc: Li, Shaohua, James.Bottomley, linux-kernel

On 2011-05-19 10:26, Alex,Shi wrote:
> 
>> I will queue up the combined patch, it looks fine from here as well.
>>
> 
> When I have some time to study Jens and shaohua's patch today. I find a
> simpler way to resolved the re-enter issue on starved_list. Following
> Jens' idea, we can just put the starved_list device into kblockd if it
> come from __scsi_queue_insert(). 
> It can resolve the re-enter issue and recover performance totally, and
> need not a work_struct in every scsi_device. The logic/code also looks a
> bit simpler. 
> What's your opinion of this? 

Isn't this _identical_ to my original patch, with the added async run of
the queue passed in (which is important, an oversight)?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-19 18:27                 ` Jens Axboe
@ 2011-05-20  0:22                   ` Alex,Shi
  2011-05-20  0:40                     ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: Alex,Shi @ 2011-05-20  0:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Li, Shaohua, James.Bottomley, linux-kernel

On Fri, 2011-05-20 at 02:27 +0800, Jens Axboe wrote:
> On 2011-05-19 10:26, Alex,Shi wrote:
> > 
> >> I will queue up the combined patch, it looks fine from here as well.
> >>
> > 
> > When I have some time to study Jens and shaohua's patch today. I find a
> > simpler way to resolved the re-enter issue on starved_list. Following
> > Jens' idea, we can just put the starved_list device into kblockd if it
> > come from __scsi_queue_insert(). 
> > It can resolve the re-enter issue and recover performance totally, and
> > need not a work_struct in every scsi_device. The logic/code also looks a
> > bit simpler. 
> > What's your opinion of this? 
> 
> Isn't this _identical_ to my original patch, with the added async run of
> the queue passed in (which is important, an oversight)?

Not exactly the same. It is based on your patch, but adds a bypass path for
starved_list devices: if a starved_list run comes from __scsi_queue_insert(),
which may be caused by the recursion we were talking about, kblockd takes
over the processing. Maybe you overlooked this point in the original
patch. :) 

The part that differs from yours is below:
---
static void __scsi_run_queue(struct request_queue *q, bool async)
 {
        struct scsi_device *sdev = q->queuedata;
        struct Scsi_Host *shost;
@@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue
*q)
                                       &shost->starved_list);
                        continue;
                }
-
-               spin_unlock(shost->host_lock);
-               spin_lock(sdev->request_queue->queue_lock);
-               __blk_run_queue(sdev->request_queue);
-               spin_unlock(sdev->request_queue->queue_lock);
-               spin_lock(shost->host_lock);
+               if (async)
+                       blk_run_queue_async(sdev->request_queue);
+               else {
+                       spin_unlock(shost->host_lock);
+                       spin_lock(sdev->request_queue->queue_lock);
+                       __blk_run_queue(sdev->request_queue);
+                       spin_unlock(sdev->request_queue->queue_lock);
+                       spin_lock(shost->host_lock);
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-20  0:22                   ` Alex,Shi
@ 2011-05-20  0:40                     ` Shaohua Li
  2011-05-20  5:17                       ` Alex,Shi
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-05-20  0:40 UTC (permalink / raw)
  To: Alex,Shi; +Cc: Jens Axboe, James.Bottomley, linux-kernel

2011/5/20 Alex,Shi <alex.shi@intel.com>:
> On Fri, 2011-05-20 at 02:27 +0800, Jens Axboe wrote:
>> On 2011-05-19 10:26, Alex,Shi wrote:
>> >
>> >> I will queue up the combined patch, it looks fine from here as well.
>> >>
>> >
>> > When I have some time to study Jens and shaohua's patch today. I find a
>> > simpler way to resolved the re-enter issue on starved_list. Following
>> > Jens' idea, we can just put the starved_list device into kblockd if it
>> > come from __scsi_queue_insert().
>> > It can resolve the re-enter issue and recover performance totally, and
>> > need not a work_struct in every scsi_device. The logic/code also looks a
>> > bit simpler.
>> > What's your opinion of this?
>>
>> Isn't this _identical_ to my original patch, with the added async run of
>> the queue passed in (which is important, an oversight)?
>
> Not exactly same. It bases on your patch, but added a bypass way for
> starved_list device. If a starved_list device come from
> __scsi_queue_insert(), that may caused by our talking recursion, kblockd
> with take over the process.  Maybe you oversight this point in original
> patch. :)
>
> The different part from yours is below:
> ---
> static void __scsi_run_queue(struct request_queue *q, bool async)
>  {
>        struct scsi_device *sdev = q->queuedata;
>        struct Scsi_Host *shost;
> @@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue
> *q)
>                                       &shost->starved_list);
>                        continue;
>                }
> -
> -               spin_unlock(shost->host_lock);
> -               spin_lock(sdev->request_queue->queue_lock);
> -               __blk_run_queue(sdev->request_queue);
> -               spin_unlock(sdev->request_queue->queue_lock);
> -               spin_lock(shost->host_lock);
> +               if (async)
> +                       blk_run_queue_async(sdev->request_queue);
> +               else {
> +                       spin_unlock(shost->host_lock);
> +                       spin_lock(sdev->request_queue->queue_lock);
> +                       __blk_run_queue(sdev->request_queue);
> +                       spin_unlock(sdev->request_queue->queue_lock);
> +                       spin_lock(shost->host_lock);
>>
I don't quite like this approach. blk_run_queue_async() could introduce a
fairness issue, as I said in a previous mail, because we drop the sdev from
the starved list but don't run its queue immediately. The issue existed
before, but it's a bug to me.
Alex, is there any real advantage to your patch?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Performance drop on SCSI hard disk
  2011-05-20  0:40                     ` Shaohua Li
@ 2011-05-20  5:17                       ` Alex,Shi
  0 siblings, 0 replies; 17+ messages in thread
From: Alex,Shi @ 2011-05-20  5:17 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jens Axboe, James.Bottomley, linux-kernel

On Fri, 2011-05-20 at 08:40 +0800, Shaohua Li wrote:
> 2011/5/20 Alex,Shi <alex.shi@intel.com>:
> > On Fri, 2011-05-20 at 02:27 +0800, Jens Axboe wrote:
> >> On 2011-05-19 10:26, Alex,Shi wrote:
> >> >
> >> >> I will queue up the combined patch, it looks fine from here as well.
> >> >>
> >> >
> >> > When I have some time to study Jens and shaohua's patch today. I find a
> >> > simpler way to resolved the re-enter issue on starved_list. Following
> >> > Jens' idea, we can just put the starved_list device into kblockd if it
> >> > come from __scsi_queue_insert().
> >> > It can resolve the re-enter issue and recover performance totally, and
> >> > need not a work_struct in every scsi_device. The logic/code also looks a
> >> > bit simpler.
> >> > What's your opinion of this?
> >>
> >> Isn't this _identical_ to my original patch, with the added async run of
> >> the queue passed in (which is important, an oversight)?
> >
> > Not exactly same. It bases on your patch, but added a bypass way for
> > starved_list device. If a starved_list device come from
> > __scsi_queue_insert(), that may caused by our talking recursion, kblockd
> > with take over the process.  Maybe you oversight this point in original
> > patch. :)
> >
> > The different part from yours is below:
> > ---
> > static void __scsi_run_queue(struct request_queue *q, bool async)
> >  {
> >        struct scsi_device *sdev = q->queuedata;
> >        struct Scsi_Host *shost;
> > @@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue
> > *q)
> >                                       &shost->starved_list);
> >                        continue;
> >                }
> > -
> > -               spin_unlock(shost->host_lock);
> > -               spin_lock(sdev->request_queue->queue_lock);
> > -               __blk_run_queue(sdev->request_queue);
> > -               spin_unlock(sdev->request_queue->queue_lock);
> > -               spin_lock(shost->host_lock);
> > +               if (async)
> > +                       blk_run_queue_async(sdev->request_queue);
> > +               else {
> > +                       spin_unlock(shost->host_lock);
> > +                       spin_lock(sdev->request_queue->queue_lock);
> > +                       __blk_run_queue(sdev->request_queue);
> > +                       spin_unlock(sdev->request_queue->queue_lock);
> > +                       spin_lock(shost->host_lock);
> >>
> I don't quite like this approach. blk_run_queue_async() could
> introduce fairness issue as I said in previous mail, because we drop
> the sdev from starved list but didn't run its queue immediately. The
> issue exists before, but it's a bug to me.

I understand what you're worried about, but I'm not quite clear on the
trigger scenario. Anyway, a potential fairness issue still exists, so forget
my patch. 



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-05-20  5:19 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
2011-05-10  6:40 Performance drop on SCSI hard disk Alex,Shi
2011-05-10  6:52 ` Shaohua Li
2011-05-12  0:36   ` Shaohua Li
2011-05-12 20:29 ` Jens Axboe
2011-05-13  0:11   ` Alex,Shi
2011-05-13  0:48   ` Shaohua Li
2011-05-13  3:01     ` Shaohua Li
2011-05-16  8:04       ` Shaohua Li
2011-05-16  8:37         ` Alex,Shi
2011-05-17  6:09           ` Alex,Shi
2011-05-17  7:20             ` Jens Axboe
2011-05-19  8:26               ` Alex,Shi
2011-05-19  8:47                 ` Alex,Shi
2011-05-19 18:27                 ` Jens Axboe
2011-05-20  0:22                   ` Alex,Shi
2011-05-20  0:40                     ` Shaohua Li
2011-05-20  5:17                       ` Alex,Shi
