From: Shaohua Li <shaohua.li@intel.com>
To: Jens Axboe <jaxboe@fusionio.com>
Cc: "Shi, Alex" <alex.shi@intel.com>,
	"James.Bottomley@hansenpartnership.com" 
	<James.Bottomley@hansenpartnership.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Performance drop on SCSI hard disk
Date: Fri, 13 May 2011 11:01:57 +0800
Message-ID: <1305255717.2373.38.camel@sli10-conroe>
In-Reply-To: <1305247704.2373.32.camel@sli10-conroe>

On Fri, 2011-05-13 at 08:48 +0800, Shaohua Li wrote:
> On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> > On 2011-05-10 08:40, Alex,Shi wrote:
> > > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > > scsi_run_queue() to punt all requests on starved_list devices to
> > > kblockd. Yes, as Jens mentioned, the performance on slow SCSI disks was
> > > hurt here.  :) (Intel SSDs aren't affected.)
> > > 
> > > In our testing on a JBOD of 12 SAS disks, fio write with the sync
> > > ioengine dropped about 30~40% in throughput, and fio randread/randwrite
> > > with the aio ioengine dropped about 20%/50%. The fio mmap test was hurt
> > > as well.
> > > 
> > > With the following debug patch, the performance is fully recovered in
> > > our testing. But without the REENTER flag here, in some corner cases,
> > > such as a device being blocked and then unblocked repeatedly,
> > > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > > a kernel stack overflow.
> > > I don't know the details of the block device drivers; I'm just
> > > wondering why SCSI needs the REENTER flag here. :)
> > 
> > This is a problem and we should do something about it for 2.6.39. I knew
> > that there would be cases where the async offload would cause a
> > performance degradation, but not to the extent that you are reporting.
> > You must be hitting the pathological case.
> Async offload is expected to increase context switches, but the real
> root cause of the issue is a fairness problem. Please see my previous
> email.
> 
> > I can think of two scenarios where it could potentially recurse:
> > 
> > - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> >   repeat.
> > - Running starved list from request_fn, two (or more) devices could
> >   alternately recurse.
> > 
> > The first case should be fairly easy to handle. The second one is
> > already handled by the local list splice.
> This isn't true as far as I can see: if you unlock host_lock in
> scsi_run_queue(), other CPUs can add sdevs to the starved device list
> again. In the recursive call of scsi_run_queue(), the starved device
> list might not be empty, so the local list_splice() doesn't help.
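
(For reference, the starved-list handling in scsi_run_queue() at this
point looks roughly like the sketch below; this is simplified, with the
single-LUN and host-busy checks omitted. The key detail is the window
where host_lock is dropped around running each starved queue: another
CPU can repopulate shost->starved_list there, so the splice alone does
not bound the recursion.)

static void scsi_run_queue(struct request_queue *q)
{
	struct scsi_device *sdev = q->queuedata;
	struct Scsi_Host *shost = sdev->host;
	LIST_HEAD(starved_list);
	unsigned long flags;

	spin_lock_irqsave(shost->host_lock, flags);
	list_splice_init(&shost->starved_list, &starved_list);

	while (!list_empty(&starved_list)) {
		sdev = list_entry(starved_list.next,
				  struct scsi_device, starved_entry);
		list_del_init(&sdev->starved_entry);

		spin_unlock_irqrestore(shost->host_lock, flags);
		/*
		 * Window: host_lock is not held here, so other CPUs can
		 * add sdevs back to shost->starved_list.  Running the
		 * queue can invoke the request_fn, which can land back
		 * in scsi_run_queue() and see a non-empty starved list.
		 */
		blk_run_queue(sdev->request_queue);
		spin_lock_irqsave(shost->host_lock, flags);
	}
	/* put any unprocessed entries back */
	list_splice(&starved_list, &shost->starved_list);
	spin_unlock_irqrestore(shost->host_lock, flags);

	blk_run_queue(q);
}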
> 
> > 
> > Looking at the code, is this a real scenario? The only potential
> > recursion I see is:
> > 
> > scsi_request_fn()
> >         scsi_dispatch_cmd()
> >                 scsi_queue_insert()
> >                         __scsi_queue_insert()
> >                                 scsi_run_queue()
> > 
> > Why are we even re-running the queue immediately on a BUSY condition?
> > It should only be needed if we have zero pending commands from this
> > particular queue, and for that particular case an async run is just
> > fine, since it's a rare condition (or performance would already suck).
> > 
> > And it should only really be needed for the 'q' being passed in, not the
> > others. Something like the below.
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 0bac91e..0b01c1f 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
> >   */
> >  #define SCSI_QUEUE_DELAY	3
> >  
> > -static void scsi_run_queue(struct request_queue *q);
> > +static void scsi_run_queue_async(struct request_queue *q);
> >  
> >  /*
> >   * Function:	scsi_unprep_request()
> > @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> >  	blk_requeue_request(q, cmd->request);
> >  	spin_unlock_irqrestore(q->queue_lock, flags);
> >  
> > -	scsi_run_queue(q);
> > +	scsi_run_queue_async(q);
> So you could still recursively run into the starved list. Do you want
> to put the whole __scsi_run_queue() into a workqueue?
What I mean is that the current sdev (and other devices too) can still
be added to the starved list, so doing an async run only for the current
q isn't enough; we'd better put the whole __scsi_run_queue() into a
workqueue. Something like the below, on top of yours, untested. Not sure
if there are other recursive cases.

Index: linux/drivers/scsi/scsi_lib.c
===================================================================
--- linux.orig/drivers/scsi/scsi_lib.c	2011-05-13 10:32:28.000000000 +0800
+++ linux/drivers/scsi/scsi_lib.c	2011-05-13 10:52:51.000000000 +0800
@@ -74,8 +74,6 @@ struct kmem_cache *scsi_sdb_cache;
  */
 #define SCSI_QUEUE_DELAY	3
 
-static void scsi_run_queue_async(struct request_queue *q);
-
 /*
  * Function:	scsi_unprep_request()
  *
@@ -161,7 +159,7 @@ static int __scsi_queue_insert(struct sc
 	blk_requeue_request(q, cmd->request);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	scsi_run_queue_async(q);
+	kblockd_schedule_work(q, &device->requeue_work);
 
 	return 0;
 }
@@ -391,14 +389,13 @@ static inline int scsi_host_is_busy(stru
  * Purpose:	Select a proper request queue to serve next
  *
  * Arguments:	q	- last request's queue
- * 		async	- prevent potential request_fn recurse by running async
  *
  * Returns:     Nothing
  *
  * Notes:	The previous command was completely finished, start
  *		a new one if possible.
  */
-static void __scsi_run_queue(struct request_queue *q, bool async)
+static void scsi_run_queue(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
@@ -449,20 +446,17 @@ static void __scsi_run_queue(struct requ
 	list_splice(&starved_list, &shost->starved_list);
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
-	if (async)
-		blk_run_queue_async(q);
-	else
-		blk_run_queue(q);
+	blk_run_queue(q);
 }
 
-static void scsi_run_queue(struct request_queue *q)
+void scsi_requeue_run_queue(struct work_struct *work)
 {
-	__scsi_run_queue(q, false);
-}
+	struct scsi_device *sdev;
+	struct request_queue *q;
 
-static void scsi_run_queue_async(struct request_queue *q)
-{
-	__scsi_run_queue(q, true);
+	sdev = container_of(work, struct scsi_device, requeue_work);
+	q = sdev->request_queue;
+	scsi_run_queue(q);
 }
 
 /*
Index: linux/drivers/scsi/scsi_scan.c
===================================================================
--- linux.orig/drivers/scsi/scsi_scan.c	2011-05-13 10:44:09.000000000 +0800
+++ linux/drivers/scsi/scsi_scan.c	2011-05-13 10:45:41.000000000 +0800
@@ -242,6 +242,7 @@ static struct scsi_device *scsi_alloc_sd
 	int display_failure_msg = 1, ret;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	extern void scsi_evt_thread(struct work_struct *work);
+	extern void scsi_requeue_run_queue(struct work_struct *work);
 
 	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
 		       GFP_ATOMIC);
@@ -264,6 +265,7 @@ static struct scsi_device *scsi_alloc_sd
 	INIT_LIST_HEAD(&sdev->event_list);
 	spin_lock_init(&sdev->list_lock);
 	INIT_WORK(&sdev->event_work, scsi_evt_thread);
+	INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);
 
 	sdev->sdev_gendev.parent = get_device(&starget->dev);
 	sdev->sdev_target = starget;
Index: linux/include/scsi/scsi_device.h
===================================================================
--- linux.orig/include/scsi/scsi_device.h	2011-05-13 10:36:31.000000000 +0800
+++ linux/include/scsi/scsi_device.h	2011-05-13 10:40:46.000000000 +0800
@@ -169,6 +169,7 @@ struct scsi_device {
 				sdev_dev;
 
 	struct execute_work	ew; /* used to get process context on put */
+	struct work_struct	requeue_work;
 
 	struct scsi_dh_data	*scsi_dh_data;
 	enum scsi_device_state sdev_state;
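
To spell out the resulting flow: __scsi_queue_insert() now just queues
sdev->requeue_work on kblockd, and scsi_requeue_run_queue() later calls
scsi_run_queue() from worker context, on a fresh stack, so the
request_fn path can no longer recurse. The same pattern in isolation
looks like the sketch below, using hypothetical names (my_queue,
my_queue_run) and the generic system workqueue standing in for kblockd:

#include <linux/workqueue.h>

struct my_queue {
	struct work_struct rerun_work;	/* analogous to sdev->requeue_work */
	/* ... queue state ... */
};

/* May not be re-entered safely on the same call stack. */
static void my_queue_run(struct my_queue *q)
{
	/* drain/dispatch requests */
}

/* Executes in worker context, so my_queue_run() starts on a fresh stack. */
static void my_queue_rerun_fn(struct work_struct *work)
{
	struct my_queue *q = container_of(work, struct my_queue, rerun_work);

	my_queue_run(q);
}

static void my_queue_init(struct my_queue *q)
{
	INIT_WORK(&q->rerun_work, my_queue_rerun_fn);
}

/* Call this where the old code would have called my_queue_run() directly. */
static void my_queue_defer_run(struct my_queue *q)
{
	schedule_work(&q->rerun_work);	/* the patch uses kblockd's queue */
}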