linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] cfq-iosched: rework seeky detection
       [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
@ 2010-02-27 18:45 ` Corrado Zoccolo
  2010-02-27 18:45   ` [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs Corrado Zoccolo
  2010-02-28 18:41 ` [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 Jens Axboe
  2010-03-01 16:35 ` Vivek Goyal
  2 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-02-27 18:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng,
	Corrado Zoccolo

Current seeky detection is based on average seek length.
This is suboptimal, since the average will not distinguish between:
* a process doing medium-sized seeks
* a process doing some sequential requests interleaved with larger seeks
Moreover, even a medium seek can take a lot of time, if the requested sector
happens to be behind the disk head in the rotation (50% probability).

Therefore, we change the seeky queue detection to work as follows:
* each request is classified as sequential if it is very close to
  the current head position, i.e. it is likely in the disk cache (disks
  usually read more data than requested, and put it in cache for
  subsequent reads); otherwise, the request is classified as seeky.
* a history window of the last 32 requests is kept, storing the
  classification results.
* a queue is marked as seeky if more than 1/8 of the last 32 requests
  were seeky.

This patch fixes a regression reported by Yanmin, on mmap 64k random
reads.
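
As an illustrative sketch only (not part of the patch): the same
classification, written as standalone C, with __builtin_popcount standing
in for the kernel's hweight32() and the struct/helper names being
assumptions of the sketch:

#include <stdint.h>
#include <stdbool.h>

/* 800 sectors (~400KiB), mirroring CFQQ_SEEK_THR in the patch below */
#define SEEK_THR_SECTORS (8 * 100)

struct queue_state {
	uint32_t seek_history;     /* one bit per request, newest in bit 0 */
	uint64_t last_request_pos; /* sector of the previous request */
};

/* Classify one request and shift the result into the 32-request window. */
static void record_request(struct queue_state *q, uint64_t pos)
{
	uint64_t sdist = 0;

	if (q->last_request_pos)
		sdist = pos > q->last_request_pos ?
			pos - q->last_request_pos : q->last_request_pos - pos;

	q->seek_history <<= 1;
	q->seek_history |= (sdist > SEEK_THR_SECTORS);
	q->last_request_pos = pos;
}

/* Seeky if more than 1/8 of the last 32 requests were classified seeky. */
static bool queue_is_seeky(const struct queue_state *q)
{
	return __builtin_popcount(q->seek_history) > 32 / 8;
}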

Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
---
 block/cfq-iosched.c |   54 +++++++++++++-------------------------------------
 1 files changed, 14 insertions(+), 40 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index eed649c..806d30b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -46,8 +46,8 @@ static const int cfq_hist_divisor = 4;
 #define CFQ_HW_QUEUE_MIN	(5)
 #define CFQ_SERVICE_SHIFT       12
 
-#define CFQQ_SEEK_THR		8 * 1024
-#define CFQQ_SEEKY(cfqq)	((cfqq)->seek_mean > CFQQ_SEEK_THR)
+#define CFQQ_SEEK_THR		(sector_t)(8 * 100)
+#define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
@@ -132,9 +132,7 @@ struct cfq_queue {
 
 	pid_t pid;
 
-	unsigned int seek_samples;
-	u64 seek_total;
-	sector_t seek_mean;
+	u32 seek_history;
 	sector_t last_request_pos;
 
 	struct cfq_rb_root *service_tree;
@@ -1662,16 +1660,7 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			       struct request *rq, bool for_preempt)
 {
-	sector_t sdist = cfqq->seek_mean;
-
-	if (!sample_valid(cfqq->seek_samples))
-		sdist = CFQQ_SEEK_THR;
-
-	/* if seek_mean is big, using it as close criteria is meaningless */
-	if (sdist > CFQQ_SEEK_THR && !for_preempt)
-		sdist = CFQQ_SEEK_THR;
-
-	return cfq_dist_from_last(cfqd, rq) <= sdist;
+	return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR;
 }
 
 static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
@@ -2968,30 +2957,16 @@ static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct request *rq)
 {
-	sector_t sdist;
-	u64 total;
-
-	if (!cfqq->last_request_pos)
-		sdist = 0;
-	else if (cfqq->last_request_pos < blk_rq_pos(rq))
-		sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
-	else
-		sdist = cfqq->last_request_pos - blk_rq_pos(rq);
-
-	/*
-	 * Don't allow the seek distance to get too large from the
-	 * odd fragment, pagein, etc
-	 */
-	if (cfqq->seek_samples <= 60) /* second&third seek */
-		sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024);
-	else
-		sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64);
+	sector_t sdist = 0;
+	if (cfqq->last_request_pos) {
+		if (cfqq->last_request_pos < blk_rq_pos(rq))
+			sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
+		else
+			sdist = cfqq->last_request_pos - blk_rq_pos(rq);
+	}
 
-	cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8;
-	cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8;
-	total = cfqq->seek_total + (cfqq->seek_samples/2);
-	do_div(total, cfqq->seek_samples);
-	cfqq->seek_mean = (sector_t)total;
+	cfqq->seek_history <<= 1;
+	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
 }
 
 /*
@@ -3016,8 +2991,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfq_mark_cfqq_deep(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
-	     && CFQQ_SEEKY(cfqq)))
+	    (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-- 
1.6.4.4



* [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
@ 2010-02-27 18:45   ` Corrado Zoccolo
  2010-03-01 14:25     ` Vivek Goyal
  0 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-02-27 18:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng,
	Corrado Zoccolo

CFQ currently applies the same logic for detecting seeky queues and
grouping them together to rotational disks and SSDs alike.
For SSDs, the time to complete a request doesn't depend on the
request location, but only on its size.
This patch therefore changes the criterion used to group queues on SSDs
to request size, in order to achieve better fairness.
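
For concreteness (a worked example, not additional code in the patch): the
new threshold is CFQQ_SECT_THR_NONROT = 2 * 32 = 64 sectors, i.e.
64 * 512B = 32KiB, so on an SSD a request is recorded as seeky whenever
blk_rq_sectors(rq) < 64, regardless of where it lands on the device.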

Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
---
 block/cfq-iosched.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 806d30b..f27e535 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
 #define CFQ_SERVICE_SHIFT       12
 
 #define CFQQ_SEEK_THR		(sector_t)(8 * 100)
+#define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
 #define RQ_CIC(rq)		\
@@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct request *rq)
 {
 	sector_t sdist = 0;
+	sector_t n_sec = blk_rq_sectors(rq);
 	if (cfqq->last_request_pos) {
 		if (cfqq->last_request_pos < blk_rq_pos(rq))
 			sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
@@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	}
 
 	cfqq->seek_history <<= 1;
-	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
+	if (blk_queue_nonrot(cfqd->queue))
+		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
+	else
+		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
 }
 
 /*
-- 
1.6.4.4



* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
       [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
  2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
@ 2010-02-28 18:41 ` Jens Axboe
  2010-03-01 16:35 ` Vivek Goyal
  2 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2010-02-28 18:41 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng

On Sat, Feb 27 2010, Corrado Zoccolo wrote:
> 
> Hi, I'm resending the rework seeky detection patch, together with 
> the companion patch for SSDs, in order to get some testing on more
> hardware.
> 
> The first patch in the series fixes a regression introduced in 2.6.33
> for random mmap reads of more than one page, when multiple processes
> are competing for the disk.
> There is at least one HW RAID controller where it reduces performance,
> though (but this controller generally performs worse with CFQ than
> with NOOP, probably because it is performing non-work-conserving 
> I/O scheduling inside), so more testing on RAIDs is appreciated.
> 
> The second patch changes the seeky detection logic to be meaningful
> also for SSDs. A seeky request is one that doesn't utilize the full
> bandwidth for the device. For SSDs, this happens for small requests,
> regardless of their location.
> With this change, the grouping of "seeky" requests done by CFQ can
> result in a fairer distribution of disk service time among processes.

Thanks, I have applied this.

-- 
Jens Axboe



* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-02-27 18:45   ` [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs Corrado Zoccolo
@ 2010-03-01 14:25     ` Vivek Goyal
  2010-03-03 19:47       ` Corrado Zoccolo
  0 siblings, 1 reply; 16+ messages in thread
From: Vivek Goyal @ 2010-03-01 14:25 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
> CFQ currently applies the same logic of detecting seeky queues and
> grouping them together for rotational disks as well as SSDs.
> For SSDs, the time to complete a request doesn't depend on the
> request location, but only on the size.
> This patch therefore changes the criterion to group queues by
> request size in case of SSDs, in order to achieve better fairness.

Hi Corrado,

Can you give some numbers regarding how are you measuring fairness and
how did you decide that we achieve better fairness?

In case of SSDs with NCQ, we will not idle on any of the queues (either
sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
if we mark a queue as seeky/non-seeky on SSD?

IOW, looking at this patch, now any queue doing IO in smaller chunks than
32K on SSD will be marked as seeky. How does that change the behavior in 
terms of fairness for the queue?
 
Thanks
Vivek

> 
> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
> ---
>  block/cfq-iosched.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 806d30b..f27e535 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
>  #define CFQ_SERVICE_SHIFT       12
>  
>  #define CFQQ_SEEK_THR		(sector_t)(8 * 100)
> +#define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
>  #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
>  
>  #define RQ_CIC(rq)		\
> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  		       struct request *rq)
>  {
>  	sector_t sdist = 0;
> +	sector_t n_sec = blk_rq_sectors(rq);
>  	if (cfqq->last_request_pos) {
>  		if (cfqq->last_request_pos < blk_rq_pos(rq))
>  			sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  	}
>  
>  	cfqq->seek_history <<= 1;
> -	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
> +	if (blk_queue_nonrot(cfqd->queue))
> +		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
> +	else
> +		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>  }
>  
>  /*
> -- 
> 1.6.4.4


* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
       [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
  2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
  2010-02-28 18:41 ` [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 Jens Axboe
@ 2010-03-01 16:35 ` Vivek Goyal
  2010-03-01 19:45   ` Vivek Goyal
  2010-03-01 23:01   ` Corrado Zoccolo
  2 siblings, 2 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-01 16:35 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
> 
> Hi, I'm resending the rework seeky detection patch, together with 
> the companion patch for SSDs, in order to get some testing on more
> hardware.
> 
> The first patch in the series fixes a regression introduced in 2.6.33
> for random mmap reads of more than one page, when multiple processes
> are competing for the disk.
> There is at least one HW RAID controller where it reduces performance,
> though (but this controller generally performs worse with CFQ than
> with NOOP, probably because it is performing non-work-conserving 
> I/O scheduling inside), so more testing on RAIDs is appreciated.
> 

Hi Corrado,

This time I don't have the machine where I had previously reported
regressions, but somebody has exported two LUNs to me from a storage box
over SAN and I have done my testing on those. With this seek patch applied,
I still see the regressions.

iosched=cfq     Filesz=1G   bs=64K

                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
brrmmap   3   1   7113        0           7044        0              0% 0%
brrmmap   3   2   6977        0           6774        0             -2% 0%
brrmmap   3   4   7410        0           6181        0            -16% 0%
brrmmap   3   8   9405        0           6020        0            -35% 0%
brrmmap   3   16  11445       0           5792        0            -49% 0%

                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
drrmmap   3   1   7195        0           7337        0              1% 0%
drrmmap   3   2   7016        0           6855        0             -2% 0%
drrmmap   3   4   7438        0           6103        0            -17% 0%
drrmmap   3   8   9298        0           6020        0            -35% 0%
drrmmap   3   16  11576       0           5827        0            -49% 0%


I have run buffered random reads on mmaped files (brrmmap) and direct
random reads on mmaped files (drrmmap) using fio. I ran these for an
increasing number of threads, repeated each run 3 times, and took the
average of the three sets for reporting.

I used a file size of 1G and bs=64K, and ran each test sample for 30
seconds.

Because with the new seek logic we will mark the above type of cfqq as
non-seeky and will idle on these, I take a significant hit in performance
on storage boxes which have more than one spindle.

So basically, the regression is not only on that particular RAID card but
also on other kinds of devices which can support more than one spindle.

I will run some tests on a single SATA disk also, where this patch should
be a benefit.

Based on the testing results so far, I am not a big fan of marking these
mmap queues as sync-idle. I guess if this patch really is a benefit, then
we first need to put in place some kind of logic to detect whether the
device is a single-spindle SATA disk, and only on those disks mark mmap
queues as sync.

Apart from synthetic workloads, where is this patch helping you in practice?

Thanks
Vivek


> The second patch changes the seeky detection logic to be meaningful
> also for SSDs. A seeky request is one that doesn't utilize the full
> bandwidth for the device. For SSDs, this happens for small requests,
> regardless of their location.
> With this change, the grouping of "seeky" requests done by CFQ can
> result in a fairer distribution of disk service time among processes.


* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-01 16:35 ` Vivek Goyal
@ 2010-03-01 19:45   ` Vivek Goyal
  2010-03-01 23:01   ` Corrado Zoccolo
  1 sibling, 0 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-01 19:45 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Mon, Mar 01, 2010 at 11:35:52AM -0500, Vivek Goyal wrote:
> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
> > 
> > Hi, I'm resending the rework seeky detection patch, together with 
> > the companion patch for SSDs, in order to get some testing on more
> > hardware.
> > 
> > The first patch in the series fixes a regression introduced in 2.6.33
> > for random mmap reads of more than one page, when multiple processes
> > are competing for the disk.
> > There is at least one HW RAID controller where it reduces performance,
> > though (but this controller generally performs worse with CFQ than
> > with NOOP, probably because it is performing non-work-conserving 
> > I/O scheduling inside), so more testing on RAIDs is appreciated.
> > 
> 
> Hi Corrado,
> 
> This time I don't have the machine where I had previously reported
> regressions. But somebody has exported me two Lun from an storage box
> over SAN and I have done my testing on that. With this seek patch applied, 
> I still see the regressions.
> 
> iosched=cfq     Filesz=1G   bs=64K
> 
>                         2.6.33              2.6.33-seek
> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
> brrmmap   3   1   7113        0           7044        0              0% 0%
> brrmmap   3   2   6977        0           6774        0             -2% 0%
> brrmmap   3   4   7410        0           6181        0            -16% 0%
> brrmmap   3   8   9405        0           6020        0            -35% 0%
> brrmmap   3   16  11445       0           5792        0            -49% 0%
> 
>                         2.6.33              2.6.33-seek
> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
> drrmmap   3   1   7195        0           7337        0              1% 0%
> drrmmap   3   2   7016        0           6855        0             -2% 0%
> drrmmap   3   4   7438        0           6103        0            -17% 0%
> drrmmap   3   8   9298        0           6020        0            -35% 0%
> drrmmap   3   16  11576       0           5827        0            -49% 0%
> 
> 
> I have run buffered random reads on mmaped files (brrmmap) and direct
> random reads on mmaped files (drrmmap) using fio. I have run these for
> increasing number of threads and did this for 3 times and took average of
> three sets for reporting.
> 
> I have used filesize 1G and bz=64K and ran each test sample for 30
> seconds.
> 
> Because with new seek logic, we will mark above type of cfqq as non seeky
> and will idle on these, I take a significant hit in performance on storage
> boxes which have more than 1 spindle.
> 
> So basically, the regression is not only on that particular RAID card but
> on other kind of devices which can support more than one spindle.
> 
> I will run some test on single SATA disk also where this patch should
> benefit.
> 

Ok, some more results on a single SATA disk.

iosched=cfq     Filesz=1G   bs=64K  

                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
brrmmap   3   1   4200        0           4200        0              0% 0%
brrmmap   3   2   4214        0           4246        0              0% 0%
brrmmap   3   4   3296        0           3868        0             17% 0%
brrmmap   3   8   2442        0           3117        0             27% 0%
brrmmap   3   16  1895        0           2510        0             32% 0%

                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
drrmmap   3   1   5476        0           5494        0              0% 0%
drrmmap   3   2   5065        0           5070        0              0% 0%
drrmmap   3   4   3607        0           4213        0             16% 0%
drrmmap   3   8   2474        0           3198        0             29% 0%
drrmmap   3   16  1912        0           2418        0             26% 0%

So we see improvements on a single SATA disk, as expected, but we lose more
on higher-end storage/hardware RAID setups.

I also ran the same test with bs=32K on the SATA disk.

iosched=cfq     Filesz=1G   bs=32K  

                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
brrmmap   3   1   2408        0           2374        0             -1% 0%
brrmmap   3   2   2045        0           2304        0             12% 0%
brrmmap   3   4   1687        0           1753        0              3% 0%
brrmmap   3   8   1697        0           1562        0             -7% 0%
brrmmap   3   16  1604        0           1573        0             -1% 0%

                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
drrmmap   3   1   3171        0           3145        0              0% 0%
drrmmap   3   2   2634        0           2838        0              7% 0%
drrmmap   3   4   1844        0           1935        0              4% 0%
drrmmap   3   8   1761        0           1609        0             -8% 0%
drrmmap   3   16  1602        0           1573        0             -1% 0%

I think in this case cfqq is not being marked as sync-idle and continues
to be sync-noidle.

So in summary, yes, we gain on single SATA disks for this test case but
lose on multi-spindle setups. IMHO, we should enhance this patch with
some kind of single-spindle detection and enable this functionality only
on those disks, so that higher-end storage does not incur the penalty.

Thanks
Vivek


 


> Based on testing results so far, I am not a big fan of marking these mmap
> queues as sync-idle. I guess if this patch really benefits, then we need
> to first put in place some kind of logic to detect whether if it is single
> spindle SATA disk and then on these disks, mark mmap queues as sync.
> 
> Apart from synthetic workloads, in practice, where this patch is helping you?
> 
> Thanks
> Vivek
> 
> 
> > The second patch changes the seeky detection logic to be meaningful
> > also for SSDs. A seeky request is one that doesn't utilize the full
> > bandwidth for the device. For SSDs, this happens for small requests,
> > regardless of their location.
> > With this change, the grouping of "seeky" requests done by CFQ can
> > result in a fairer distribution of disk service time among processes.


* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-01 16:35 ` Vivek Goyal
  2010-03-01 19:45   ` Vivek Goyal
@ 2010-03-01 23:01   ` Corrado Zoccolo
  2010-03-03 22:39     ` Corrado Zoccolo
  1 sibling, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-01 23:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

Hi Vivek,
On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
>>
>> Hi, I'm resending the rework seeky detection patch, together with
>> the companion patch for SSDs, in order to get some testing on more
>> hardware.
>>
>> The first patch in the series fixes a regression introduced in 2.6.33
>> for random mmap reads of more than one page, when multiple processes
>> are competing for the disk.
>> There is at least one HW RAID controller where it reduces performance,
>> though (but this controller generally performs worse with CFQ than
>> with NOOP, probably because it is performing non-work-conserving
>> I/O scheduling inside), so more testing on RAIDs is appreciated.
>>
>
> Hi Corrado,
>
> This time I don't have the machine where I had previously reported
> regressions. But somebody has exported me two Lun from an storage box
> over SAN and I have done my testing on that. With this seek patch applied,
> I still see the regressions.
>
> iosched=cfq     Filesz=1G   bs=64K
>
>                        2.6.33              2.6.33-seek
> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
> brrmmap   3   1   7113        0           7044        0              0% 0%
> brrmmap   3   2   6977        0           6774        0             -2% 0%
> brrmmap   3   4   7410        0           6181        0            -16% 0%
> brrmmap   3   8   9405        0           6020        0            -35% 0%
> brrmmap   3   16  11445       0           5792        0            -49% 0%
>
>                        2.6.33              2.6.33-seek
> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
> drrmmap   3   1   7195        0           7337        0              1% 0%
> drrmmap   3   2   7016        0           6855        0             -2% 0%
> drrmmap   3   4   7438        0           6103        0            -17% 0%
> drrmmap   3   8   9298        0           6020        0            -35% 0%
> drrmmap   3   16  11576       0           5827        0            -49% 0%
>
>
> I have run buffered random reads on mmaped files (brrmmap) and direct
> random reads on mmaped files (drrmmap) using fio. I have run these for
> increasing number of threads and did this for 3 times and took average of
> three sets for reporting.
>
> I have used filesize 1G and bz=64K and ran each test sample for 30
> seconds.
>
> Because with new seek logic, we will mark above type of cfqq as non seeky
> and will idle on these, I take a significant hit in performance on storage
> boxes which have more than 1 spindle.

Thanks for testing on a different setup.
I wonder if the part that is wrong for multi-spindle storage is the 64KB threshold.
Can you run with a larger bs, and see if there is a value for which
idling is better?
For example, on a 2-disk RAID 0 I would expect that a bs larger than
the stripe would still benefit from idling.

>
> So basically, the regression is not only on that particular RAID card but
> on other kind of devices which can support more than one spindle.
>
> I will run some test on single SATA disk also where this patch should
> benefit.
>
> Based on testing results so far, I am not a big fan of marking these mmap
> queues as sync-idle. I guess if this patch really benefits, then we need
> to first put in place some kind of logic to detect whether if it is single
> spindle SATA disk and then on these disks, mark mmap queues as sync.
>
> Apart from synthetic workloads, in practice, where this patch is helping you?

The synthetic workload mimics the page fault patterns that can be seen
on program startup, and that is the target of my optimization. In
2.6.32, we went in the direction of enabling idling also for seeky
queues, while 2.6.33 tried to be friendlier to parallel storage
by usually allowing more parallel requests. Unfortunately, this
impacted this peculiar access pattern, so we need to fix it somehow.

Thanks,
Corrado

>
> Thanks
> Vivek
>
>
>> The second patch changes the seeky detection logic to be meaningful
>> also for SSDs. A seeky request is one that doesn't utilize the full
>> bandwidth for the device. For SSDs, this happens for small requests,
>> regardless of their location.
>> With this change, the grouping of "seeky" requests done by CFQ can
>> result in a fairer distribution of disk service time among processes.
>


* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-01 14:25     ` Vivek Goyal
@ 2010-03-03 19:47       ` Corrado Zoccolo
  2010-03-03 21:21         ` Vivek Goyal
  2010-03-03 23:28         ` Vivek Goyal
  0 siblings, 2 replies; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-03 19:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

[-- Attachment #1: Type: text/plain, Size: 5579 bytes --]

Hi Vivek,
On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
>> CFQ currently applies the same logic of detecting seeky queues and
>> grouping them together for rotational disks as well as SSDs.
>> For SSDs, the time to complete a request doesn't depend on the
>> request location, but only on the size.
>> This patch therefore changes the criterion to group queues by
>> request size in case of SSDs, in order to achieve better fairness.
>
> Hi Corrado,
>
> Can you give some numbers regarding how are you measuring fairness and
> how did you decide that we achieve better fairness?
>
Please see the attached fio script. It benchmarks pairs of processes
performing direct random I/O.
One is always fixed at bs=4K, while I vary the other from 8K to 64K:
test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1

With unpatched cfq (2.6.33), on a flash card (non-NCQ), after running
a fio script with a high number of parallel readers to make sure NCQ
detection has stabilized, I get the following:
Run status group 0 (all jobs):
   READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
mint=5001msec, maxt=5003msec

Run status group 1 (all jobs):
   READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
mint=5002msec, maxt=5003msec

Run status group 2 (all jobs):
   READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
mint=5001msec, maxt=5004msec

Run status group 3 (all jobs):
   READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
maxb=12486KiB/s, mint=5002msec, maxt=5004msec

As you can see from minb, the process with the smallest I/O size is
penalized: since both queues are marked as noidle, they both end
up in the noidle tree, where they are serviced round robin, so they
get fairness in terms of IOPS, but bandwidth varies a lot.
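
(Back-of-the-envelope check of the above: with pure round-robin IOPS
fairness, a 4KiB reader paired with a 64KiB reader gets roughly
4 / (4 + 64) ~= 6% of the aggregate bandwidth, which matches the
minb=761KiB/s vs maxb=12486KiB/s split in group 3.)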

With my patches in place, I get:
Run status group 0 (all jobs):
   READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
mint=5002msec, maxt=5003msec

Run status group 1 (all jobs):
   READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
mint=5001msec, maxt=5003msec

Run status group 2 (all jobs):
   READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
mint=5002msec, maxt=5003msec

Run status group 3 (all jobs):
   READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
maxb=8548KiB/s, mint=5001msec, maxt=5006msec

The process doing smaller requests is now not penalized by the fact
that it runs concurrently with the other one, and the other still
benefits from larger requests because it makes better use of its time slice.

> In case of SSDs with NCQ, we will not idle on any of the queues (either
> sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
> if we mark a queue as seeky/non-seeky on SSD?
>

I've not tested on an NCQ SSD, but I think at worst it will do no harm, and
at best it will provide similar fairness improvements when the queue
of processes submitting requests grows beyond the available NCQ slots.

> IOW, looking at this patch, now any queue doing IO in smaller chunks than
> 32K on SSD will be marked as seeky. How does that change the behavior in
> terms of fairness for the queue?
>
Basically, we will have IOPS-based fairness for small requests, and
time-based fairness for larger requests.

Thanks,
Corrado

> Thanks
> Vivek
>
>>
>> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
>> ---
>>  block/cfq-iosched.c |    7 ++++++-
>>  1 files changed, 6 insertions(+), 1 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 806d30b..f27e535 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
>>  #define CFQ_SERVICE_SHIFT       12
>>
>>  #define CFQQ_SEEK_THR                (sector_t)(8 * 100)
>> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
>>  #define CFQQ_SEEKY(cfqq)     (hweight32(cfqq->seek_history) > 32/8)
>>
>>  #define RQ_CIC(rq)           \
>> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>                      struct request *rq)
>>  {
>>       sector_t sdist = 0;
>> +     sector_t n_sec = blk_rq_sectors(rq);
>>       if (cfqq->last_request_pos) {
>>               if (cfqq->last_request_pos < blk_rq_pos(rq))
>>                       sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
>> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>       }
>>
>>       cfqq->seek_history <<= 1;
>> -     cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>> +     if (blk_queue_nonrot(cfqd->queue))
>> +             cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
>> +     else
>> +             cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>>  }
>>
>>  /*
>> --
>> 1.6.4.4
>

[-- Attachment #2: fair.fio --]
[-- Type: application/octet-stream, Size: 483 bytes --]

[global]
size=1G
ioengine=sync
invalidate=1
runtime=5
time_based
direct=1

[test00]
rw=randread
bs=8k
filename=testfile1

[test01]
rw=randread
bs=4k
filename=testfile2

[test10]
rw=randread
bs=16k
filename=testfile1
stonewall

[test11]
rw=randread
bs=4k
filename=testfile2

[test20]
rw=randread
bs=32k
filename=testfile1
stonewall

[test21]
rw=randread
bs=4k
filename=testfile2

[test30]
rw=randread
bs=64k
filename=testfile1
stonewall

[test31]
rw=randread
bs=4k
filename=testfile2


* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-03 19:47       ` Corrado Zoccolo
@ 2010-03-03 21:21         ` Vivek Goyal
  2010-03-03 23:28         ` Vivek Goyal
  1 sibling, 0 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-03 21:21 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
> >> CFQ currently applies the same logic of detecting seeky queues and
> >> grouping them together for rotational disks as well as SSDs.
> >> For SSDs, the time to complete a request doesn't depend on the
> >> request location, but only on the size.
> >> This patch therefore changes the criterion to group queues by
> >> request size in case of SSDs, in order to achieve better fairness.
> >
> > Hi Corrado,
> >
> > Can you give some numbers regarding how are you measuring fairness and
> > how did you decide that we achieve better fairness?
> >
> Please, see the attached fio script. It benchmarks pairs of processes
> performing direct random I/O.
> One is always fixed at bs=4k , while I vary the other from 8K to 64K
> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> 
> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running
> a fio script with high number of parallel readers to make sure ncq
> detection is stabilized, I get the following:
> Run status group 0 (all jobs):
>    READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
> mint=5001msec, maxt=5003msec
> 
> Run status group 1 (all jobs):
>    READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 2 (all jobs):
>    READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
> mint=5001msec, maxt=5004msec
> 
> Run status group 3 (all jobs):
>    READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
> maxb=12486KiB/s, mint=5002msec, maxt=5004msec
> 
> As you can see from minb, the process with smallest I/O size is
> penalized (the fact is that being both marked as noidle, they both end
> up in the noidle tree, where they are serviced round robin, so they
> get fairness in term of IOPS, but bandwidth varies a lot.
> 
> With my patches in place, I get:
> Run status group 0 (all jobs):
>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 1 (all jobs):
>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
> mint=5001msec, maxt=5003msec
> 
> Run status group 2 (all jobs):
>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 3 (all jobs):
>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
> 
> The process doing smaller requests is now not penalized by the fact
> that it is run concurrently with the other one, and the other still
> benefits from larger requests because it uses better its time slice.

Ok, so with this patch, larger requests will be marked as sync-idle,
so that now the 4K-size process and the 64K-size process will be on
separate service trees.

But this will work only if we idle on the service tree (on SSD). I
thought that on SSDs we would not idle even on the service tree, but it
looks like we have left a bug somewhere. Otherwise, on an NCQ SSD we will
suffer in terms of performance in this kind of setup, especially if you
increase the number of readers. I will run your fio script on my NCQ SSD.

Or is it intentional to idle on the service tree with hw_tag=1 but not
idle with an NCQ hard disk? That would make sense though.

But looking at cfq_should_idle(), it looks like we will always idle on a
service tree even on an NCQ SSD, even if cfq_cfqq_idle_window=0. I think
if I run the same test on an NCQ SSD, the bigger-size process should now
lose, because we will idle on the sync-noidle service tree but not on the
sync-idle service tree.

> 
> > In case of SSDs with NCQ, we will not idle on any of the queues (either
> > sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
> > if we mark a queue as seeky/non-seeky on SSD?
> >
> 
> I've not tested on NCQ SSD, but I think at worst it will not harm, and
> at best, it will provide similar fairness improvements when the queue
> of processes submitting requests grows above the available NCQ slots.
> 
> > IOW, looking at this patch, now any queue doing IO in smaller chunks than
> > 32K on SSD will be marked as seeky. How does that change the behavior in
> > terms of fairness for the queue?
> >
> Basically, we will have IOPS based fairness for small requests, and
> time based fairness for larger requests.
> 
> Thanks,
> Corrado
> 
> > Thanks
> > Vivek
> >
> >>
> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
> >> ---
> >>  block/cfq-iosched.c |    7 ++++++-
> >>  1 files changed, 6 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index 806d30b..f27e535 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
> >>  #define CFQ_SERVICE_SHIFT       12
> >>
> >>  #define CFQQ_SEEK_THR                (sector_t)(8 * 100)
> >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
> >>  #define CFQQ_SEEKY(cfqq)     (hweight32(cfqq->seek_history) > 32/8)
> >>
> >>  #define RQ_CIC(rq)           \
> >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >>                      struct request *rq)
> >>  {
> >>       sector_t sdist = 0;
> >> +     sector_t n_sec = blk_rq_sectors(rq);
> >>       if (cfqq->last_request_pos) {
> >>               if (cfqq->last_request_pos < blk_rq_pos(rq))
> >>                       sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
> >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >>       }
> >>
> >>       cfqq->seek_history <<= 1;
> >> -     cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
> >> +     if (blk_queue_nonrot(cfqd->queue))
> >> +             cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
> >> +     else
> >> +             cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
> >>  }
> >>
> >>  /*
> >> --
> >> 1.6.4.4
> >




* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-01 23:01   ` Corrado Zoccolo
@ 2010-03-03 22:39     ` Corrado Zoccolo
  2010-03-03 23:11       ` Vivek Goyal
  0 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-03 22:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Tue, Mar 2, 2010 at 12:01 AM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
> Hi Vivek,
> On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
>>>
>>> Hi, I'm resending the rework seeky detection patch, together with
>>> the companion patch for SSDs, in order to get some testing on more
>>> hardware.
>>>
>>> The first patch in the series fixes a regression introduced in 2.6.33
>>> for random mmap reads of more than one page, when multiple processes
>>> are competing for the disk.
>>> There is at least one HW RAID controller where it reduces performance,
>>> though (but this controller generally performs worse with CFQ than
>>> with NOOP, probably because it is performing non-work-conserving
>>> I/O scheduling inside), so more testing on RAIDs is appreciated.
>>>
>>
>> Hi Corrado,
>>
>> This time I don't have the machine where I had previously reported
>> regressions. But somebody has exported me two Lun from an storage box
>> over SAN and I have done my testing on that. With this seek patch applied,
>> I still see the regressions.
>>
>> iosched=cfq     Filesz=1G   bs=64K
>>
>>                        2.6.33              2.6.33-seek
>> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
>> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
>> brrmmap   3   1   7113        0           7044        0              0% 0%
>> brrmmap   3   2   6977        0           6774        0             -2% 0%
>> brrmmap   3   4   7410        0           6181        0            -16% 0%
>> brrmmap   3   8   9405        0           6020        0            -35% 0%
>> brrmmap   3   16  11445       0           5792        0            -49% 0%
>>
>>                        2.6.33              2.6.33-seek
>> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
>> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
>> drrmmap   3   1   7195        0           7337        0              1% 0%
>> drrmmap   3   2   7016        0           6855        0             -2% 0%
>> drrmmap   3   4   7438        0           6103        0            -17% 0%
>> drrmmap   3   8   9298        0           6020        0            -35% 0%
>> drrmmap   3   16  11576       0           5827        0            -49% 0%
>>
>>
>> I have run buffered random reads on mmaped files (brrmmap) and direct
>> random reads on mmaped files (drrmmap) using fio. I have run these for
>> increasing number of threads and did this for 3 times and took average of
>> three sets for reporting.

BTW, I think O_DIRECT doesn't affect mmap operation.

>>
>> I have used filesize 1G and bz=64K and ran each test sample for 30
>> seconds.
>>
>> Because with new seek logic, we will mark above type of cfqq as non seeky
>> and will idle on these, I take a significant hit in performance on storage
>> boxes which have more than 1 spindle.
Thinking about this, can you check if your disks have a non-zero
/sys/block/sda/queue/optimal_io_size?
From the comment in blk-settings.c, I see this should be non-zero for
RAIDs, so it may help discriminate the cases we want to optimize
for.
It could also help in identifying the correct threshold.
>
> Thanks for testing on a different setup.
> I wonder if the wrong part for multi-spindle is the 64kb threshold.
> Can you run with larger bs, and see if there is a value for which
> idling is better?
> For example on a 2 disk raid 0 I would expect  that a bs larger than
> the stripe will still benefit by idling.
>
>>
>> So basically, the regression is not only on that particular RAID card but
>> on other kind of devices which can support more than one spindle.
Ok, makes sense. If the number of sequential pages read before jumping
to a random address is smaller than the RAID stripe, we are wasting
potential parallelism.
>>
>> I will run some test on single SATA disk also where this patch should
>> benefit.
>>
>> Based on testing results so far, I am not a big fan of marking these mmap
>> queues as sync-idle. I guess if this patch really benefits, then we need
>> to first put in place some kind of logic to detect whether if it is single
>> spindle SATA disk and then on these disks, mark mmap queues as sync.
>>
>> Apart from synthetic workloads, in practice, where this patch is helping you?
>
> The synthetic workload mimics the page fault patterns that can be seen
> on program startup, and that is the target of my optimization. In
> 2.6.32, we went the direction of enabling idling also for seeky
> queues, while 2.6.33 tried to be more friendly with parallel storage
> by usually allowing more parallel requests. Unfortunately, this
> impacted this peculiar access pattern, so we need to fix it somehow.
>
> Thanks,
> Corrado
>
>>
>> Thanks
>> Vivek
>>
>>
>>> The second patch changes the seeky detection logic to be meaningful
>>> also for SSDs. A seeky request is one that doesn't utilize the full
>>> bandwidth for the device. For SSDs, this happens for small requests,
>>> regardless of their location.
>>> With this change, the grouping of "seeky" requests done by CFQ can
>>> result in a fairer distribution of disk service time among processes.
>>
>


* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-03 22:39     ` Corrado Zoccolo
@ 2010-03-03 23:11       ` Vivek Goyal
  0 siblings, 0 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-03 23:11 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Wed, Mar 03, 2010 at 11:39:05PM +0100, Corrado Zoccolo wrote:
> On Tue, Mar 2, 2010 at 12:01 AM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
> > Hi Vivek,
> > On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
> >>>
> >>> Hi, I'm resending the rework seeky detection patch, together with
> >>> the companion patch for SSDs, in order to get some testing on more
> >>> hardware.
> >>>
> >>> The first patch in the series fixes a regression introduced in 2.6.33
> >>> for random mmap reads of more than one page, when multiple processes
> >>> are competing for the disk.
> >>> There is at least one HW RAID controller where it reduces performance,
> >>> though (but this controller generally performs worse with CFQ than
> >>> with NOOP, probably because it is performing non-work-conserving
> >>> I/O scheduling inside), so more testing on RAIDs is appreciated.
> >>>
> >>
> >> Hi Corrado,
> >>
> >> This time I don't have the machine where I had previously reported
> >> regressions. But somebody has exported me two Lun from an storage box
> >> over SAN and I have done my testing on that. With this seek patch applied,
> >> I still see the regressions.
> >>
> >> iosched=cfq     Filesz=1G   bs=64K
> >>
> >>                        2.6.33              2.6.33-seek
> >> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
> >> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
> >> brrmmap   3   1   7113        0           7044        0              0% 0%
> >> brrmmap   3   2   6977        0           6774        0             -2% 0%
> >> brrmmap   3   4   7410        0           6181        0            -16% 0%
> >> brrmmap   3   8   9405        0           6020        0            -35% 0%
> >> brrmmap   3   16  11445       0           5792        0            -49% 0%
> >>
> >>                        2.6.33              2.6.33-seek
> >> workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
> >> --------  --- --  ----------  ----------  ----------  ----------   ---- ----
> >> drrmmap   3   1   7195        0           7337        0              1% 0%
> >> drrmmap   3   2   7016        0           6855        0             -2% 0%
> >> drrmmap   3   4   7438        0           6103        0            -17% 0%
> >> drrmmap   3   8   9298        0           6020        0            -35% 0%
> >> drrmmap   3   16  11576       0           5827        0            -49% 0%
> >>
> >>
> >> I have run buffered random reads on mmaped files (brrmmap) and direct
> >> random reads on mmaped files (drrmmap) using fio. I have run these for
> >> increasing number of threads and did this for 3 times and took average of
> >> three sets for reporting.
> 
> BTW, I think O_DIRECT doesn't affect mmap operation.

Yes, just for the sake of curiosity I tested O_DIRECT case also.

> 
> >>
> >> I have used filesize 1G and bz=64K and ran each test sample for 30
> >> seconds.
> >>
> >> Because with new seek logic, we will mark above type of cfqq as non seeky
> >> and will idle on these, I take a significant hit in performance on storage
> >> boxes which have more than 1 spindle.
> Thinking about this, can you check if your disks have a non-zero
> /sys/block/sda/queue/optimal_io_size ?
> >From the comment in blk-settings.c, I see this should be non-zero for
> RAIDs, so it may help discriminating the cases we want to optimize
> for.
> It could also help in identifying the correct threshold.

I have got a multipath device setup, but I see optimal_io_size=0 both on
the higher-level multipath device and on the underlying component devices.

> >
> > Thanks for testing on a different setup.
> > I wonder if the wrong part for multi-spindle is the 64kb threshold.
> > Can you run with larger bs, and see if there is a value for which
> > idling is better?
> > For example on a 2 disk raid 0 I would expect  that a bs larger than
> > the stripe will still benefit by idling.
> >
> >>
> >> So basically, the regression is not only on that particular RAID card but
> >> on other kind of devices which can support more than one spindle.
> Ok makes sense. If the number of sequential pages read before jumping
> to a random address is smaller than the raid stripe, we are wasting
> potential parallelism.

Actually, even if we are doing an IO size bigger than the stripe size, one
request will probably keep only request_size/stripe_size spindles busy.
We are still not exploiting the parallelism of the rest of the spindles.

Secondly, in this particular case, because you are issuing 4K page reads
at a time, you are for sure going to keep just one spindle busy.

Increasing the block size to 128K or 256K does bring down the % of regression,
but I think that primarily comes from the fact that we have now made the
workload less random and more sequential (one seek after 256K/4K=64
sequential reads as opposed to one seek after 64K/4K=16 sequential reads).

With bs=128K
===========
                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
brrmmap   3   1   8338        0           8532        0              2% 0%
brrmmap   3   2   8724        0           8553        0             -1% 0%
brrmmap   3   4   9577        0           8002        0            -16% 0%
brrmmap   3   8   11806       0           7990        0            -32% 0%
brrmmap   3   16  13329       0           8101        0            -39% 0%


With bs=256K
===========
                        2.6.33              2.6.33-seek
workload  Set NR  RDBW(KB/s)  WRBW(KB/s)  RDBW(KB/s)  WRBW(KB/s)    %Rd %Wr
--------  --- --  ----------  ----------  ----------  ----------   ---- ----
brrmmap   3   1   9778        0           9572        0             -2% 0%
brrmmap   3   2   10321       0           10029       0             -2% 0%
brrmmap   3   4   11132       0           9675        0            -13% 0%
brrmmap   3   8   13111       0           10057       0            -23% 0%
brrmmap   3   16  13910       0           10366       0            -25% 0%

So if we can detect that there are multiple spindles underneath, we can
probably make the non-seeky definition stricter; that is, instead of
allowing 4 seeky requests per 32 samples, we could allow 2 seeky requests
per 64 samples, etc. That could help a bit on storage with multiple
spindles behind a single LUN.
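
A hypothetical sketch of that idea (the multi_spindle flag, the 64-sample
window, and the helper name are assumptions, not part of the posted
patches; __builtin_popcountll stands in for the kernel's hweight64()):

#include <stdint.h>
#include <stdbool.h>

static bool cfqq_seeky(uint64_t seek_history, bool multi_spindle)
{
	if (multi_spindle)
		/* stricter: more than 2 seeky requests in the last 64 */
		return __builtin_popcountll(seek_history) > 2;

	/* current patch: more than 1/8 of the last 32 requests were seeky */
	return __builtin_popcountll(seek_history & 0xffffffffULL) > 32 / 8;
}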

Thanks
Vivek
 

> >>
> >> I will run some test on single SATA disk also where this patch should
> >> benefit.
> >>
> >> Based on testing results so far, I am not a big fan of marking these mmap
> >> queues as sync-idle. I guess if this patch really benefits, then we need
> >> to first put in place some kind of logic to detect whether if it is single
> >> spindle SATA disk and then on these disks, mark mmap queues as sync.
> >>
> >> Apart from synthetic workloads, in practice, where this patch is helping you?
> >
> > The synthetic workload mimics the page fault patterns that can be seen
> > on program startup, and that is the target of my optimization. In
> > 2.6.32, we went the direction of enabling idling also for seeky
> > queues, while 2.6.33 tried to be more friendly with parallel storage
> > by usually allowing more parallel requests. Unfortunately, this
> > impacted this peculiar access pattern, so we need to fix it somehow.
> >
> > Thanks,
> > Corrado
> >
> >>
> >> Thanks
> >> Vivek
> >>
> >>
> >>> The second patch changes the seeky detection logic to be meaningful
> >>> also for SSDs. A seeky request is one that doesn't utilize the full
> >>> bandwidth for the device. For SSDs, this happens for small requests,
> >>> regardless of their location.
> >>> With this change, the grouping of "seeky" requests done by CFQ can
> >>> result in a fairer distribution of disk service time among processes.
> >>
> >


* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-03 19:47       ` Corrado Zoccolo
  2010-03-03 21:21         ` Vivek Goyal
@ 2010-03-03 23:28         ` Vivek Goyal
  2010-03-04 20:34           ` Corrado Zoccolo
  1 sibling, 1 reply; 16+ messages in thread
From: Vivek Goyal @ 2010-03-03 23:28 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
> >> CFQ currently applies the same logic of detecting seeky queues and
> >> grouping them together for rotational disks as well as SSDs.
> >> For SSDs, the time to complete a request doesn't depend on the
> >> request location, but only on the size.
> >> This patch therefore changes the criterion to group queues by
> >> request size in case of SSDs, in order to achieve better fairness.
> >
> > Hi Corrado,
> >
> > Can you give some numbers regarding how are you measuring fairness and
> > how did you decide that we achieve better fairness?
> >
> Please, see the attached fio script. It benchmarks pairs of processes
> performing direct random I/O.
> One is always fixed at bs=4k , while I vary the other from 8K to 64K
> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> 
> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running
> a fio script with high number of parallel readers to make sure ncq
> detection is stabilized, I get the following:
> Run status group 0 (all jobs):
>    READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
> mint=5001msec, maxt=5003msec
> 
> Run status group 1 (all jobs):
>    READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 2 (all jobs):
>    READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
> mint=5001msec, maxt=5004msec
> 
> Run status group 3 (all jobs):
>    READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
> maxb=12486KiB/s, mint=5002msec, maxt=5004msec
> 
> As you can see from minb, the process with smallest I/O size is
> penalized (the fact is that being both marked as noidle, they both end
> up in the noidle tree, where they are serviced round robin, so they
> get fairness in term of IOPS, but bandwidth varies a lot.
> 
> With my patches in place, I get:
> Run status group 0 (all jobs):
>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 1 (all jobs):
>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
> mint=5001msec, maxt=5003msec
> 
> Run status group 2 (all jobs):
>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 3 (all jobs):
>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
> 
> The process doing smaller requests is now not penalized by the fact
> that it is run concurrently with the other one, and the other still
> benefits from larger requests because it uses better its time slice.
> 

Hi Corrado,

I ran your fio script on my SSD which supports NCQ. In my case both the
processes lose.

Vanilla kernel (2.6.33)
=======================
Run status group 0 (all jobs):
   READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s,
mint=5001msec, maxt=5001msec

Run status group 1 (all jobs):
   READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s,
mint=5001msec, maxt=5001msec

Run status group 2 (all jobs):
   READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s,
mint=5001msec, maxt=5001msec

Run status group 3 (all jobs):
   READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s,
mint=5001msec, maxt=5001msec

With two seek patches applied
=============================
Run status group 0 (all jobs):
   READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s,
mint=5001msec, maxt=5001msec

Run status group 1 (all jobs):
   READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s,
mint=5001msec, maxt=5001msec

Run status group 2 (all jobs):
   READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s,
mint=5001msec, maxt=5001msec

Run status group 3 (all jobs):
   READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
mint=5001msec, maxt=5001msec

Note how overall BW suffers for group 2 and group 3. This needs some
debugging to understand why it is happening. I guess we are idling on the
sync-noidle tree and not idling on the sync-idle tree; that is probably why
the bigger-request-size process can't get enough IO pushed. But then the
smaller-request-size process should have got more IO done, and that also
does not seem to be happening.

Both the small-request-size and big-request-size processes are losing.

Thanks
Vivek

> > In case of SSDs with NCQ, we will not idle on any of the queues (either
> > sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
> > if we mark a queue as seeky/non-seeky on SSD?
> >
> 
> I've not tested on NCQ SSD, but I think at worst it will not harm, and
> at best, it will provide similar fairness improvements when the queue
> of processes submitting requests grows above the available NCQ slots.
> 
> > IOW, looking at this patch, now any queue doing IO in smaller chunks than
> > 32K on SSD will be marked as seeky. How does that change the behavior in
> > terms of fairness for the queue?
> >
> Basically, we will have IOPS based fairness for small requests, and
> time based fairness for larger requests.
> 
> Thanks,
> Corrado
> 
> > Thanks
> > Vivek
> >
> >>
> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
> >> ---
> >>  block/cfq-iosched.c |    7 ++++++-
> >>  1 files changed, 6 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index 806d30b..f27e535 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
> >>  #define CFQ_SERVICE_SHIFT       12
> >>
> >>  #define CFQQ_SEEK_THR                (sector_t)(8 * 100)
> >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
> >>  #define CFQQ_SEEKY(cfqq)     (hweight32(cfqq->seek_history) > 32/8)
> >>
> >>  #define RQ_CIC(rq)           \
> >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >>                      struct request *rq)
> >>  {
> >>       sector_t sdist = 0;
> >> +     sector_t n_sec = blk_rq_sectors(rq);
> >>       if (cfqq->last_request_pos) {
> >>               if (cfqq->last_request_pos < blk_rq_pos(rq))
> >>                       sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
> >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >>       }
> >>
> >>       cfqq->seek_history <<= 1;
> >> -     cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
> >> +     if (blk_queue_nonrot(cfqd->queue))
> >> +             cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
> >> +     else
> >> +             cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
> >>  }
> >>
> >>  /*
> >> --
> >> 1.6.4.4
> >



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-03 23:28         ` Vivek Goyal
@ 2010-03-04 20:34           ` Corrado Zoccolo
  2010-03-04 22:27             ` Vivek Goyal
  0 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-04 20:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

Hi Vivek,
On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
>> >> CFQ currently applies the same logic of detecting seeky queues and
>> >> grouping them together for rotational disks as well as SSDs.
>> >> For SSDs, the time to complete a request doesn't depend on the
>> >> request location, but only on the size.
>> >> This patch therefore changes the criterion to group queues by
>> >> request size in case of SSDs, in order to achieve better fairness.
>> >
>> > Hi Corrado,
>> >
>> > Can you give some numbers regarding how are you measuring fairness and
>> > how did you decide that we achieve better fairness?
>> >
>> Please, see the attached fio script. It benchmarks pairs of processes
>> performing direct random I/O.
>> One is always fixed at bs=4k , while I vary the other from 8K to 64K
>> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
>> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
>> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
>> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
>> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>>
>> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running
>> a fio script with high number of parallel readers to make sure ncq
>> detection is stabilized, I get the following:
>> Run status group 0 (all jobs):
>>    READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
>> mint=5001msec, maxt=5003msec
>>
>> Run status group 1 (all jobs):
>>    READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
>> mint=5002msec, maxt=5003msec
>>
>> Run status group 2 (all jobs):
>>    READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
>> mint=5001msec, maxt=5004msec
>>
>> Run status group 3 (all jobs):
>>    READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
>> maxb=12486KiB/s, mint=5002msec, maxt=5004msec
>>
>> As you can see from minb, the process with smallest I/O size is
>> penalized (the fact is that being both marked as noidle, they both end
>> up in the noidle tree, where they are serviced round robin, so they
>> get fairness in term of IOPS, but bandwidth varies a lot.
>>
>> With my patches in place, I get:
>> Run status group 0 (all jobs):
>>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
>> mint=5002msec, maxt=5003msec
>>
>> Run status group 1 (all jobs):
>>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
>> mint=5001msec, maxt=5003msec
>>
>> Run status group 2 (all jobs):
>>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
>> mint=5002msec, maxt=5003msec
>>
>> Run status group 3 (all jobs):
>>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
>> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
>>
>> The process doing smaller requests is now not penalized by the fact
>> that it is run concurrently with the other one, and the other still
>> benefits from larger requests because it uses better its time slice.
>>
>
> Hi Corrado,
>
> I ran your fio script on my SSD which supports NCQ. In my case both the
> processes lose.
>
> Vanilla kernel (2.6.33)
> =======================
> Run status group 0 (all jobs):
>   READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 1 (all jobs):
>   READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 2 (all jobs):
>   READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 3 (all jobs):
>   READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s,
> mint=5001msec, maxt=5001msec
>
> With two seek patches applied
> =============================
> Run status group 0 (all jobs):
>   READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 1 (all jobs):
>   READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 2 (all jobs):
>   READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s,
> mint=5001msec, maxt=5001msec
>
> Run status group 3 (all jobs):
>   READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
> mint=5001msec, maxt=5001msec
>
> Note, how overall BW suffers for group 2 and group 3. This needs some
> debugging why it is happening. I guess we are idling on sync-noidle
> tree and not idling on sync-idle tree, probably that's why bigger request
> size process can't get enough IO pushed. But then smaller request size
> process should have got more IO done but that also does not seem to
> be happening.

I think this is a preexisting bug. You can probably trigger it in
vanilla 2.6.33 by having one process doing random reads and another doing
sequential reads.
I think the issue is:
* cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs:

static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
        enum wl_prio_t prio = cfqq_prio(cfqq);
        struct cfq_rb_root *service_tree = cfqq->service_tree;

        BUG_ON(!service_tree);
        BUG_ON(!service_tree->count);

        /* We never do for idle class queues. */
        if (prio == IDLE_WORKLOAD)
                return false;

        /* We do for queues that were marked with idle window flag. */
        if (cfq_cfqq_idle_window(cfqq) &&
            !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
                return true;

        /*
         * Otherwise, we do only if they are the last ones
         * in their service tree.
         */
        return service_tree->count == 1 && cfq_cfqq_sync(cfqq);
}

* Even if we never activate a timer, we wait for request completion by
keeping a queue that has requests in flight, when cfq_should_idle
returns true (in two places):
        /*
         * The active queue has run out of time, expire it and select new.
         */
        if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq)) {
                /*
                 * If slice had not expired at the completion of last request
                 * we might not have turned on wait_busy flag. Don't expire
                 * the queue yet. Allow the group to get backlogged.
                 *
                 * The very fact that we have used the slice, that means we
                 * have been idling all along on this queue and it should be
                 * ok to wait for this request to complete.
                 */
                if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
                    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
                        cfqq = NULL;
                        goto keep_queue;
                } else
                        goto expire;
        }

        /*
         * No requests pending. If the active queue still has requests in
         * flight or is idling for a new request, allow either of these
         * conditions to happen (or time out) before selecting a new queue.
         */
        if (timer_pending(&cfqd->idle_slice_timer) ||
            (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
                cfqq = NULL;
                goto keep_queue;
        }

Can you change cfq_should_idle to always return false for NCQ SSDs and retest?
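
For reference, a minimal sketch of the change being suggested here (it is
essentially what Vivek's patch in the reply below does), applied to the
2.6.33 function quoted above:

static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
        enum wl_prio_t prio = cfqq_prio(cfqq);

        /* We never do for idle class queues. */
        if (prio == IDLE_WORKLOAD)
                return false;

        /* Sketch: never wait on NCQ SSDs, the device can absorb more requests. */
        if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
                return false;

        /* We do for queues that were marked with idle window flag. */
        if (cfq_cfqq_idle_window(cfqq))
                return true;

        /* Otherwise, only for the last sync queue on its service tree. */
        return cfqq->service_tree->count == 1 && cfq_cfqq_sync(cfqq);
}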

Thanks
Corrado

>
> Both small request size and big request size processes are losing.
>
> Thanks
> Vivek
>
>> > In case of SSDs with NCQ, we will not idle on any of the queues (either
>> > sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
>> > if we mark a queue as seeky/non-seeky on SSD?
>> >
>>
>> I've not tested on NCQ SSD, but I think at worst it will not harm, and
>> at best, it will provide similar fairness improvements when the queue
>> of processes submitting requests grows above the available NCQ slots.
>>
>> > IOW, looking at this patch, now any queue doing IO in smaller chunks than
>> > 32K on SSD will be marked as seeky. How does that change the behavior in
>> > terms of fairness for the queue?
>> >
>> Basically, we will have IOPS based fairness for small requests, and
>> time based fairness for larger requests.
>>
>> Thanks,
>> Corrado
>>
>> > Thanks
>> > Vivek
>> >
>> >>
>> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
>> >> ---
>> >>  block/cfq-iosched.c |    7 ++++++-
>> >>  1 files changed, 6 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> >> index 806d30b..f27e535 100644
>> >> --- a/block/cfq-iosched.c
>> >> +++ b/block/cfq-iosched.c
>> >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
>> >>  #define CFQ_SERVICE_SHIFT       12
>> >>
>> >>  #define CFQQ_SEEK_THR                (sector_t)(8 * 100)
>> >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
>> >>  #define CFQQ_SEEKY(cfqq)     (hweight32(cfqq->seek_history) > 32/8)
>> >>
>> >>  #define RQ_CIC(rq)           \
>> >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> >>                      struct request *rq)
>> >>  {
>> >>       sector_t sdist = 0;
>> >> +     sector_t n_sec = blk_rq_sectors(rq);
>> >>       if (cfqq->last_request_pos) {
>> >>               if (cfqq->last_request_pos < blk_rq_pos(rq))
>> >>                       sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
>> >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> >>       }
>> >>
>> >>       cfqq->seek_history <<= 1;
>> >> -     cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>> >> +     if (blk_queue_nonrot(cfqd->queue))
>> >> +             cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
>> >> +     else
>> >> +             cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>> >>  }
>> >>
>> >>  /*
>> >> --
>> >> 1.6.4.4
>> >
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-04 20:34           ` Corrado Zoccolo
@ 2010-03-04 22:27             ` Vivek Goyal
  2010-03-05 22:31               ` Corrado Zoccolo
  0 siblings, 1 reply; 16+ messages in thread
From: Vivek Goyal @ 2010-03-04 22:27 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Thu, Mar 04, 2010 at 09:34:51PM +0100, Corrado Zoccolo wrote:
> Hi Vivek,
> On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
> >> >> CFQ currently applies the same logic of detecting seeky queues and
> >> >> grouping them together for rotational disks as well as SSDs.
> >> >> For SSDs, the time to complete a request doesn't depend on the
> >> >> request location, but only on the size.
> >> >> This patch therefore changes the criterion to group queues by
> >> >> request size in case of SSDs, in order to achieve better fairness.
> >> >
> >> > Hi Corrado,
> >> >
> >> > Can you give some numbers regarding how are you measuring fairness and
> >> > how did you decide that we achieve better fairness?
> >> >
> >> Please, see the attached fio script. It benchmarks pairs of processes
> >> performing direct random I/O.
> >> One is always fixed at bs=4k , while I vary the other from 8K to 64K
> >> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
> >> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> >> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
> >> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> >> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
> >> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> >> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
> >> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
> >>
> >> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running
> >> a fio script with high number of parallel readers to make sure ncq
> >> detection is stabilized, I get the following:
> >> Run status group 0 (all jobs):
> >>    READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
> >> mint=5001msec, maxt=5003msec
> >>
> >> Run status group 1 (all jobs):
> >>    READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
> >> mint=5002msec, maxt=5003msec
> >>
> >> Run status group 2 (all jobs):
> >>    READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
> >> mint=5001msec, maxt=5004msec
> >>
> >> Run status group 3 (all jobs):
> >>    READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
> >> maxb=12486KiB/s, mint=5002msec, maxt=5004msec
> >>
> >> As you can see from minb, the process with smallest I/O size is
> >> penalized (the fact is that being both marked as noidle, they both end
> >> up in the noidle tree, where they are serviced round robin, so they
> >> get fairness in term of IOPS, but bandwidth varies a lot.
> >>
> >> With my patches in place, I get:
> >> Run status group 0 (all jobs):
> >>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
> >> mint=5002msec, maxt=5003msec
> >>
> >> Run status group 1 (all jobs):
> >>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
> >> mint=5001msec, maxt=5003msec
> >>
> >> Run status group 2 (all jobs):
> >>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
> >> mint=5002msec, maxt=5003msec
> >>
> >> Run status group 3 (all jobs):
> >>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
> >> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
> >>
> >> The process doing smaller requests is now not penalized by the fact
> >> that it is run concurrently with the other one, and the other still
> >> benefits from larger requests because it uses better its time slice.
> >>
> >
> > Hi Corrado,
> >
> > I ran your fio script on my SSD which supports NCQ. In my case both the
> > processes lose.
> >
> > Vanilla kernel (2.6.33)
> > =======================
> > Run status group 0 (all jobs):
> >   READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 1 (all jobs):
> >   READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 2 (all jobs):
> >   READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 3 (all jobs):
> >   READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s,
> > mint=5001msec, maxt=5001msec
> >
> > With two seek patches applied
> > =============================
> > Run status group 0 (all jobs):
> >   READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 1 (all jobs):
> >   READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 2 (all jobs):
> >   READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 3 (all jobs):
> >   READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Note, how overall BW suffers for group 2 and group 3. This needs some
> > debugging why it is happening. I guess we are idling on sync-noidle
> > tree and not idling on sync-idle tree, probably that's why bigger request
> > size process can't get enough IO pushed. But then smaller request size
> > process should have got more IO done but that also does not seem to
> > be happening.
> 
> I think this is a preexisting bug. You can probably trigger it in
> vanilla 2.6.33 by having one process doing random and an other doing
> sequential reads.
> I think the issue is:
> * cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs:

You are right. The way cfq_should_idle() is written, it will return true
for both the sync-idle and sync-noidle trees. With your patch, two random
readers go onto different service trees, and now we start driving queue
depth 1 because of the following.

        if (timer_pending(&cfqd->idle_slice_timer) ||
            (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
                cfqq = NULL;
                goto keep_queue;
        }


I tested the following patch, and now throughput is almost back at the same
levels as without your patch.

---
 block/cfq-iosched.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 40840da..296aa23 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1788,9 +1788,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	if (prio == IDLE_WORKLOAD)
 		return false;
 
+	/* Don't idle on NCQ SSDs */
+	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
+		return false;
+
 	/* We do for queues that were marked with idle window flag. */
-	if (cfq_cfqq_idle_window(cfqq) &&
-	   !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
+	if (cfq_cfqq_idle_window(cfqq))
 		return true;
 
 	/*
-- 
1.6.2.5


Now I am wondering what the side effects of the above change are.

One advantage of idling on the service tree even on an NCQ SSD is that READS
get more protection from WRITES, especially on SSDs which implement NCQ but
really suck at WRITE speed or don't implement parallel IO channels.

I see cfq_should_idle() being used in four places.

- cfq_select_queue()
                if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
                    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
                        cfqq = NULL;
                        goto keep_queue;
		}

This is for fairness in group scheduling. I think the impact here should
not be much, because this condition is primarily hit on media where we idle
on the queue.

On an NCQ SSD, I don't think we get much fairness in group scheduling unless
a group is generating enough traffic to keep the request queue full.

- cfq_select_queue()
            (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
                cfqq = NULL;

Should we really wait for a dispatched request to complete on an NCQ SSD?
On an SSD with parallel channels, I think this will reduce throughput.

- cfq_may_dispatch()

        /*
         * Drain async requests before we start sync IO
         */
        if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
                return false;

If an NCQ SSD implements parallel IO channels, there is probably no need
to clear out WRITES before we start sync IO.

- cfq_arm_slice_timer()

	       if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
                return;

cfq_arm_slice_timer() already checks for NCQ SSDs and does not arm the timer,
so this function should remain unaffected by this change.

So the question is, should we wait on service tree on an NCQ SSD or not?

Secondly, your patch determining a cfqq's seeky nature based on request
size on SSDs helps only on non-NCQ SSDs.

Thanks
Vivek

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-04 22:27             ` Vivek Goyal
@ 2010-03-05 22:31               ` Corrado Zoccolo
  2010-03-08 14:08                 ` Vivek Goyal
  0 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-05 22:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Thu, Mar 4, 2010 at 11:27 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Mar 04, 2010 at 09:34:51PM +0100, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote:
>> >> Hi Vivek,
>> >> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> >> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
>> >> >> CFQ currently applies the same logic of detecting seeky queues and
>> >> >> grouping them together for rotational disks as well as SSDs.
>> >> >> For SSDs, the time to complete a request doesn't depend on the
>> >> >> request location, but only on the size.
>> >> >> This patch therefore changes the criterion to group queues by
>> >> >> request size in case of SSDs, in order to achieve better fairness.
>> >> >
>> >> > Hi Corrado,
>> >> >
>> >> > Can you give some numbers regarding how are you measuring fairness and
>> >> > how did you decide that we achieve better fairness?
>> >> >
>> >> Please, see the attached fio script. It benchmarks pairs of processes
>> >> performing direct random I/O.
>> >> One is always fixed at bs=4k , while I vary the other from 8K to 64K
>> >> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
>> >> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> >> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
>> >> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> >> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
>> >> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> >> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
>> >> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
>> >>
>> >> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running
>> >> a fio script with high number of parallel readers to make sure ncq
>> >> detection is stabilized, I get the following:
>> >> Run status group 0 (all jobs):
>> >>    READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s,
>> >> mint=5001msec, maxt=5003msec
>> >>
>> >> Run status group 1 (all jobs):
>> >>    READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s,
>> >> mint=5002msec, maxt=5003msec
>> >>
>> >> Run status group 2 (all jobs):
>> >>    READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s,
>> >> mint=5001msec, maxt=5004msec
>> >>
>> >> Run status group 3 (all jobs):
>> >>    READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s,
>> >> maxb=12486KiB/s, mint=5002msec, maxt=5004msec
>> >>
>> >> As you can see from minb, the process with smallest I/O size is
>> >> penalized (the fact is that being both marked as noidle, they both end
>> >> up in the noidle tree, where they are serviced round robin, so they
>> >> get fairness in term of IOPS, but bandwidth varies a lot.
>> >>
>> >> With my patches in place, I get:
>> >> Run status group 0 (all jobs):
>> >>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
>> >> mint=5002msec, maxt=5003msec
>> >>
>> >> Run status group 1 (all jobs):
>> >>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
>> >> mint=5001msec, maxt=5003msec
>> >>
>> >> Run status group 2 (all jobs):
>> >>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
>> >> mint=5002msec, maxt=5003msec
>> >>
>> >> Run status group 3 (all jobs):
>> >>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
>> >> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
>> >>
>> >> The process doing smaller requests is now not penalized by the fact
>> >> that it is run concurrently with the other one, and the other still
>> >> benefits from larger requests because it uses better its time slice.
>> >>
>> >
>> > Hi Corrado,
>> >
>> > I ran your fio script on my SSD which supports NCQ. In my case both the
>> > processes lose.
>> >
>> > Vanilla kernel (2.6.33)
>> > =======================
>> > Run status group 0 (all jobs):
>> >   READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Run status group 1 (all jobs):
>> >   READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Run status group 2 (all jobs):
>> >   READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Run status group 3 (all jobs):
>> >   READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > With two seek patches applied
>> > =============================
>> > Run status group 0 (all jobs):
>> >   READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Run status group 1 (all jobs):
>> >   READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Run status group 2 (all jobs):
>> >   READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Run status group 3 (all jobs):
>> >   READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
>> > mint=5001msec, maxt=5001msec
>> >
>> > Note, how overall BW suffers for group 2 and group 3. This needs some
>> > debugging why it is happening. I guess we are idling on sync-noidle
>> > tree and not idling on sync-idle tree, probably that's why bigger request
>> > size process can't get enough IO pushed. But then smaller request size
>> > process should have got more IO done but that also does not seem to
>> > be happening.
>>
>> I think this is a preexisting bug. You can probably trigger it in
>> vanilla 2.6.33 by having one process doing random and an other doing
>> sequential reads.
>> I think the issue is:
>> * cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs:
>
> you are right. The way cfq_should_idle() is written, it will return true
> for bothe sync-idle and sync-noidle trees. With your patch two random
> readers go onto different service trees and now we start driving queue
> depth 1 because of follwing.
>
>        if (timer_pending(&cfqd->idle_slice_timer) ||
>            (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
>                cfqq = NULL;
>                goto keep_queue;
>        }
>
>
> I tested following patch and now throughput is almost back at the same levels
> as wihtout your patch.
>
> ---
>  block/cfq-iosched.c |    7 +++++--
>  1 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 40840da..296aa23 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1788,9 +1788,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>        if (prio == IDLE_WORKLOAD)
>                return false;
>
> +       /* Don't idle on NCQ SSDs */
> +       if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
> +               return false;
> +
>        /* We do for queues that were marked with idle window flag. */
> -       if (cfq_cfqq_idle_window(cfqq) &&
> -          !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
> +       if (cfq_cfqq_idle_window(cfqq))
>                return true;
>
>        /*
> --
> 1.6.2.5
>
>
> Now I am wondering what are the side affects of above change.
>
> One advantage of idling on service tree even on NCQ SSD is that READS get
> more protection from WRITES. Especially if there are SSDs which implement
> NCQ but really suck at WRITE speed or don't implement parallel IO
> channels.
I think the following code:
        /*
         * If this is an async queue and we have sync IO in flight, let it wait
         */
        if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq))
                return false;
already provides this protection, regardless of the result of cfq_should_idle().

>
> I see cfq_should_idle() being used at four places.
>
> - cfq_select_queue()
>                if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
>                    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
>                        cfqq = NULL;
>                        goto keep_queue;
>                }
>
> This is for fairness in group scheduling. I think that impact here should
> not be much because this condition is primarily hit on media where we idle
> on the queue.
>
> On NCQ SSD, I don't think we get fairness in group scheduling much until
> and unless a group is generating enough traffic to keep the request queue
> full.
>
> - cfq_select_queue()
>            (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
>                cfqq = NULL;
>
> Should we really wait for a dispatched request to complete on NCQ SSD?
I don't think so. This is the cause for the bad performance we are looking at.
And remember that, without my patch, you will still get the same bad
behaviour, just with different conditions (one random and one
sequential reader, regardless of request size).
> On SSD with parallel channel, I think this will reduce throughput?
>
> - cfq_may_dispatch()
>
>        /*
>         * Drain async requests before we start sync IO
>         */
>        if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
>                return false;
>
> If NCQ SSD is implementing parallel IO channels, probably there is no need
> to clear out WRITES before we start sync IO?
Right. I think this is present for crazy NCQ rotational disks, where a
write could sit in cache forever if a stream of reads is coming.
>
> - cfq_arm_slice_timer()
>
>               if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
>                return;
>
> cfq_arm_slice_timer()  already check for NCQ SSD and does not arm timer
> so this function should remain unaffected with this change.
>
> So the question is, should we wait on service tree on an NCQ SSD or not?
I don't see a good reason to wait, and we get bad results if we do.
>
> Secondly, your patch of determining cfqq seeky nature based on request
> size on SSD, helps only on non NCQ SSD.
We know it helps on non-NCQ SSDs. If we fix the waiting bug, it could also
help on NCQ, in scenarios where lots of processes (> NCQ depth) are
submitting requests with largely different sizes.

Thanks
Corrado

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-05 22:31               ` Corrado Zoccolo
@ 2010-03-08 14:08                 ` Vivek Goyal
  0 siblings, 0 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-08 14:08 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Fri, Mar 05, 2010 at 11:31:43PM +0100, Corrado Zoccolo wrote:

[..]
> > ---
> >  block/cfq-iosched.c |    7 +++++--
> >  1 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 40840da..296aa23 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -1788,9 +1788,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >        if (prio == IDLE_WORKLOAD)
> >                return false;
> >
> > +       /* Don't idle on NCQ SSDs */
> > +       if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
> > +               return false;
> > +
> >        /* We do for queues that were marked with idle window flag. */
> > -       if (cfq_cfqq_idle_window(cfqq) &&
> > -          !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
> > +       if (cfq_cfqq_idle_window(cfqq))
> >                return true;
> >
> >        /*
> > --
> > 1.6.2.5
> >
> >
> > Now I am wondering what are the side affects of above change.
> >
> > One advantage of idling on service tree even on NCQ SSD is that READS get
> > more protection from WRITES. Especially if there are SSDs which implement
> > NCQ but really suck at WRITE speed or don't implement parallel IO
> > channels.
> I think the following code:
>         /*
>          * If this is an async queue and we have sync IO in flight, let it wait
>          */
>         if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq))
>                 return false;
> already provides this protection, regardless of the result of cfq_should_idle().

It does, but in some cases it does not. For example, consider the case of one
writer and one reader (sequential or seeky) trying to do some IO. The reader
will do one read, then the writer will immediately pump a few requests (4-5 in
the dispatch queue), then the reader does one read, and so on; this can add to
the reader's latency on bad NCQ SSDs.
  
> 
> >
> > I see cfq_should_idle() being used at four places.
> >
> > - cfq_select_queue()
> >                if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
> >                    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
> >                        cfqq = NULL;
> >                        goto keep_queue;
> >                }
> >
> > This is for fairness in group scheduling. I think that impact here should
> > not be much because this condition is primarily hit on media where we idle
> > on the queue.
> >
> > On NCQ SSD, I don't think we get fairness in group scheduling much until
> > and unless a group is generating enough traffic to keep the request queue
> > full.
> >
> > - cfq_select_queue()
> >            (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
> >                cfqq = NULL;
> >
> > Should we really wait for a dispatched request to complete on NCQ SSD?
> I don't think so. This is the cause for the bad performance we are looking at.
> And remember that, without my patch, you will still get the same bad
> behaviour, just with different conditions (one random and one
> sequential reader, regardless of request size).
> > On SSD with parallel channel, I think this will reduce throughput?
> >
> > - cfq_may_dispatch()
> >
> >        /*
> >         * Drain async requests before we start sync IO
> >         */
> >        if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> >                return false;
> >
> > If NCQ SSD is implementing parallel IO channels, probably there is no need
> > to clear out WRITES before we start sync IO?
> Right. I think this is present for crazy NCQ rotational disks, where a
> write could sit in cache forever if a stream of reads is coming.
> >
> > - cfq_arm_slice_timer()
> >
> >               if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> >                return;
> >
> > cfq_arm_slice_timer()  already check for NCQ SSD and does not arm timer
> > so this function should remain unaffected with this change.
> >
> > So the question is, should we wait on service tree on an NCQ SSD or not?
> I don't see a good reason to wait, and bad results if we do.

Actually, with my patch we will not get group-scheduling fairness on an NCQ
SSD unless a group has enough traffic to keep the SSD busy (or the dispatch
queue full). Previously we used to wait on an empty service tree on NCQ SSDs,
and now we will not.

But I guess we can move the wait on the group under the tunable
group_isolation=1, so that whoever prefers more isolation between groups,
even at the expense of total throughput, can choose to do so.
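
A rough sketch of that idea (not the posted patch; it assumes the
group_isolation tunable is visible as cfqd->cfq_group_isolation, as in the
2.6.33 code):

static inline bool cfq_should_wait_on_group(struct cfq_data *cfqd,
                                            struct cfq_queue *cfqq)
{
        /* only the last queue of its group, with all requests dispatched */
        if (cfqq->cfqg->nr_cfqq != 1 || !RB_EMPTY_ROOT(&cfqq->sort_list) ||
            !cfqq->dispatched)
                return false;

        /* on NCQ SSDs, trade throughput for isolation only if asked to */
        if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
                return cfqd->cfq_group_isolation != 0;

        return cfq_should_idle(cfqd, cfqq);
}

cfq_select_queue() could then call this helper in place of the open-coded
check quoted earlier in the thread.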
 
> >
> > Secondly, your patch of determining cfqq seeky nature based on request
> > size on SSD, helps only on non NCQ SSD.
> We know it helps on non NCQ. If we fix the waiting bug, it could help
> also on NCQ, in scenarios where lot of processes (> NCQ depth) are
> submitting requests with largely different sizes.

Ok, I will run some more tests with my patch and if I don't see any major
regressions, I will send it to Jens.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-03-08 14:08 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
2010-02-27 18:45   ` [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs Corrado Zoccolo
2010-03-01 14:25     ` Vivek Goyal
2010-03-03 19:47       ` Corrado Zoccolo
2010-03-03 21:21         ` Vivek Goyal
2010-03-03 23:28         ` Vivek Goyal
2010-03-04 20:34           ` Corrado Zoccolo
2010-03-04 22:27             ` Vivek Goyal
2010-03-05 22:31               ` Corrado Zoccolo
2010-03-08 14:08                 ` Vivek Goyal
2010-02-28 18:41 ` [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 Jens Axboe
2010-03-01 16:35 ` Vivek Goyal
2010-03-01 19:45   ` Vivek Goyal
2010-03-01 23:01   ` Corrado Zoccolo
2010-03-03 22:39     ` Corrado Zoccolo
2010-03-03 23:11       ` Vivek Goyal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).