All of lore.kernel.org
* [patch 0/3 v3] Optimize raid1 read balance for SSD
@ 2012-07-02  1:08 Shaohua Li
  2012-07-02  1:08 ` [patch 1/3 v3] raid1: make sequential read detection per disk based Shaohua Li
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Shaohua Li @ 2012-07-02  1:08 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

raid1 read balance is an important algorithm for getting optimal read
performance. It is distance based: for each dispatched request, it chooses the
disk whose last finished request is closest to the new request. This is great
for hard disks, but SSDs have some special characteristics:

1. Non-rotational. Distance means nothing for an SSD, though merging small
requests into a big request is still beneficial. When requests can't be merged,
distributing them across the raid disks as widely as possible is better.

2. A very big request isn't always optimal. For a hard disk, data transfer
overhead is trivial compared to spindle movement, so we always prefer a bigger
request. For an SSD, once the request size exceeds a specific value, performance
no longer increases with request size. An example is readahead: if readahead
merges requests so large that some disks go idle, performance is worse than
when all disks are busy running small requests.

The patches try to address the issues.

V3:
Make the algorithm for the first issue work for mixed hard disk/SSD raids, as
suggested by Neil. The algorithm for the second issue only applies to SSDs,
because in my tests it degrades performance for hard disk raids and for mixed
hard disk/SSD raids.

V1->V2:
rebase to latest kernel.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 1/3 v3] raid1: make sequential read detection per disk based
  2012-07-02  1:08 [patch 0/3 v3] Optimize raid1 read balance for SSD Shaohua Li
@ 2012-07-02  1:08 ` Shaohua Li
  2012-07-04  5:38   ` NeilBrown
  2012-07-02  1:08 ` [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD Shaohua Li
  2012-07-02  1:08 ` [patch 3/3 v3] raid1: prevent merging too large request Shaohua Li
  2 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-07-02  1:08 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

[-- Attachment #1: raid1-seq-detection.patch --]
[-- Type: text/plain, Size: 4261 bytes --]

Currently the sequential read detection is global. It's natural to make it per
disk, which improves detection of multiple concurrent sequential reads. The
next patch will stop using the distance-based algorithm for SSD read balance,
and this change helps detect truly sequential reads for SSDs there.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   29 ++++++-----------------------
 drivers/md/raid1.h |   11 +++++------
 2 files changed, 11 insertions(+), 29 deletions(-)

Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-28 10:44:47.550666575 +0800
+++ linux/drivers/md/raid1.c	2012-06-28 12:01:03.513137377 +0800
@@ -483,7 +483,6 @@ static int read_balance(struct r1conf *c
 	const sector_t this_sector = r1_bio->sector;
 	int sectors;
 	int best_good_sectors;
-	int start_disk;
 	int best_disk;
 	int i;
 	sector_t best_dist;
@@ -503,20 +502,17 @@ static int read_balance(struct r1conf *c
 	best_good_sectors = 0;
 
 	if (conf->mddev->recovery_cp < MaxSector &&
-	    (this_sector + sectors >= conf->next_resync)) {
+	    (this_sector + sectors >= conf->next_resync))
 		choose_first = 1;
-		start_disk = 0;
-	} else {
+	else
 		choose_first = 0;
-		start_disk = conf->last_used;
-	}
 
 	for (i = 0 ; i < conf->raid_disks * 2 ; i++) {
 		sector_t dist;
 		sector_t first_bad;
 		int bad_sectors;
 
-		int disk = start_disk + i;
+		int disk = i;
 		if (disk >= conf->raid_disks)
 			disk -= conf->raid_disks;
 
@@ -580,7 +576,7 @@ static int read_balance(struct r1conf *c
 		dist = abs(this_sector - conf->mirrors[disk].head_position);
 		if (choose_first
 		    /* Don't change to another disk for sequential reads */
-		    || conf->next_seq_sect == this_sector
+		    || conf->mirrors[disk].next_seq_sect == this_sector
 		    || dist == 0
 		    /* If device is idle, use it */
 		    || atomic_read(&rdev->nr_pending) == 0) {
@@ -606,8 +602,7 @@ static int read_balance(struct r1conf *c
 			goto retry;
 		}
 		sectors = best_good_sectors;
-		conf->next_seq_sect = this_sector + sectors;
-		conf->last_used = best_disk;
+		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
 	}
 	rcu_read_unlock();
 	*max_sectors = sectors;
@@ -2581,7 +2576,6 @@ static struct r1conf *setup_conf(struct
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
 	err = -EIO;
-	conf->last_used = -1;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
 		disk = conf->mirrors + i;
@@ -2607,19 +2601,9 @@ static struct r1conf *setup_conf(struct
 			if (disk->rdev &&
 			    (disk->rdev->saved_raid_disk < 0))
 				conf->fullsync = 1;
-		} else if (conf->last_used < 0)
-			/*
-			 * The first working device is used as a
-			 * starting point to read balancing.
-			 */
-			conf->last_used = i;
+		}
 	}
 
-	if (conf->last_used < 0) {
-		printk(KERN_ERR "md/raid1:%s: no operational mirrors\n",
-		       mdname(mddev));
-		goto abort;
-	}
 	err = -ENOMEM;
 	conf->thread = md_register_thread(raid1d, mddev, NULL);
 	if (!conf->thread) {
@@ -2876,7 +2860,6 @@ static int raid1_reshape(struct mddev *m
 	conf->raid_disks = mddev->raid_disks = raid_disks;
 	mddev->delta_disks = 0;
 
-	conf->last_used = 0; /* just make sure it is in-range */
 	lower_barrier(conf);
 
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h	2012-06-13 16:18:20.000000000 +0800
+++ linux/drivers/md/raid1.h	2012-06-28 11:55:37.757235867 +0800
@@ -4,6 +4,11 @@
 struct mirror_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
+
 +	/* When choosing the best device for a read (read_balance())
 +	 * we try to keep sequential reads on the same device
+	 */
+	sector_t	next_seq_sect;
 };
 
 /*
@@ -29,12 +34,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* When choose the best device for a read (read_balance())
-	 * we try to keep sequential reads one the same device
-	 * using 'last_used' and 'next_seq_sect'
-	 */
-	int			last_used;
-	sector_t		next_seq_sect;
 	/* During resync, read_balancing is only allowed on the part
 	 * of the array that has been resynced.  'next_resync' tells us
 	 * where that is.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  1:08 [patch 0/3 v3] Optimize raid1 read balance for SSD Shaohua Li
  2012-07-02  1:08 ` [patch 1/3 v3] raid1: make sequential read detection per disk based Shaohua Li
@ 2012-07-02  1:08 ` Shaohua Li
  2012-07-02  2:13   ` Roberto Spadim
  2012-07-04  5:45   ` NeilBrown
  2012-07-02  1:08 ` [patch 3/3 v3] raid1: prevent merging too large request Shaohua Li
  2 siblings, 2 replies; 14+ messages in thread
From: Shaohua Li @ 2012-07-02  1:08 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

[-- Attachment #1: raid1-ssd-read-balance.patch --]
[-- Type: text/plain, Size: 3583 bytes --]

An SSD has no spindle, so distance between requests means nothing. And the
original distance-based algorithm can sometimes cause severe performance
issues for an SSD raid.

Consider two thread groups, one accessing file A and the other accessing file
B. The first group will access one disk and the second the other disk, because
requests within a group are near each other while requests between groups are
far apart. In this case, read balance might keep one disk very busy while the
other is relatively idle. For SSDs, we should try our best to distribute
requests across as many disks as possible. There is no spindle move penalty
anyway.

With the patch below, I can see more than 50% throughput improvement in some
cases, depending on the workload.

The only exception is small requests that can be merged into a big request,
which typically drives higher throughput for SSDs too. Such small requests are
sequential reads. Unlike on a hard disk, sequential reads that can't be merged
(for example direct IO, or reads without readahead) can be ignored for SSDs;
again, there is no spindle move penalty. Readahead dispatches small requests,
and such requests can be merged.

The previous patch helps detect sequential reads well, at least when the
number of concurrent readers isn't greater than the number of raid disks. In
that case, the distance-based algorithm doesn't work well either.

V2: For mixed hard disk/SSD raids, don't use the distance-based algorithm for
random IO either. This makes the algorithm generic for raids containing an SSD.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-28 16:56:20.846401902 +0800
+++ linux/drivers/md/raid1.c	2012-06-29 14:13:23.856781798 +0800
@@ -486,6 +486,7 @@ static int read_balance(struct r1conf *c
 	int best_disk;
 	int i;
 	sector_t best_dist;
+	unsigned int min_pending;
 	struct md_rdev *rdev;
 	int choose_first;
 
@@ -499,6 +500,7 @@ static int read_balance(struct r1conf *c
 	sectors = r1_bio->sectors;
 	best_disk = -1;
 	best_dist = MaxSector;
+	min_pending = -1;
 	best_good_sectors = 0;
 
 	if (conf->mddev->recovery_cp < MaxSector &&
@@ -511,6 +513,8 @@ static int read_balance(struct r1conf *c
 		sector_t dist;
 		sector_t first_bad;
 		int bad_sectors;
+		bool nonrot;
+		unsigned int pending;
 
 		int disk = i;
 		if (disk >= conf->raid_disks)
@@ -573,17 +577,32 @@ static int read_balance(struct r1conf *c
 		} else
 			best_good_sectors = sectors;
 
+		nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
+		pending = atomic_read(&rdev->nr_pending);
 		dist = abs(this_sector - conf->mirrors[disk].head_position);
 		if (choose_first
 		    /* Don't change to another disk for sequential reads */
 		    || conf->mirrors[disk].next_seq_sect == this_sector
 		    || dist == 0
 		    /* If device is idle, use it */
-		    || atomic_read(&rdev->nr_pending) == 0) {
+		    || pending == 0) {
 			best_disk = disk;
 			break;
 		}
-		if (dist < best_dist) {
+
+		/*
 +		 * If all disks are rotational, choose the closest disk. If any
 +		 * disk is non-rotational, choose the disk with fewer pending
 +		 * requests, even if that disk is rotational, which may or may
 +		 * not be optimal for raids with mixed rotational/non-rotational
 +		 * disks depending on the workload.
+		 */
+		if (nonrot || min_pending != -1) {
+			if (min_pending > pending) {
+				min_pending = pending;
+				best_disk = disk;
+			}
+		} else if (dist < best_dist) {
 			best_dist = dist;
 			best_disk = disk;
 		}


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 3/3 v3] raid1: prevent merging too large request
  2012-07-02  1:08 [patch 0/3 v3] Optimize raid1 read balance for SSD Shaohua Li
  2012-07-02  1:08 ` [patch 1/3 v3] raid1: make sequential read detection per disk based Shaohua Li
  2012-07-02  1:08 ` [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD Shaohua Li
@ 2012-07-02  1:08 ` Shaohua Li
  2012-07-04  5:59   ` NeilBrown
  2 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-07-02  1:08 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

[-- Attachment #1: raid1-ssd-split-large-read.patch --]
[-- Type: text/plain, Size: 4751 bytes --]

For an SSD, once the request size exceeds a specific value (the optimal IO
size), request size no longer matters for bandwidth. In that situation, if
making a request bigger causes some disks to go idle, total throughput
actually drops. A good example is doing readahead in a two-disk raid1 setup.

So when should we split a big request? We absolutely don't want to split a big
request into very small requests; even on an SSD, a big transfer is more
efficient. The patch below only considers requests larger than the optimal IO
size.

If all disks are busy, is splitting worthwhile? Say the optimal IO size is
16k, and we have two 32k requests and two disks. We can let each disk run one
32k request, or split them into four 16k requests and let each disk run two.
It's hard to say which is better; it depends on the hardware.

So only consider the case where there are idle disks. For readahead, splitting
is always better in this case. In my tests, the patch below improves
throughput by more than 30%. Not 100%, because the disks aren't 100% busy.

This can happen outside readahead too, for example with direct IO. But I
assume direct IO usually has a bigger IO depth and keeps all disks busy, so I
ignored it.

Note: if the raid uses any hard disk, we don't prevent merging; that would
make performance worse.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 drivers/md/raid1.h |    1 +
 2 files changed, 39 insertions(+), 8 deletions(-)

Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-29 15:17:09.160691540 +0800
+++ linux/drivers/md/raid1.c	2012-06-29 15:57:54.993943576 +0800
@@ -489,6 +489,7 @@ static int read_balance(struct r1conf *c
 	unsigned int min_pending;
 	struct md_rdev *rdev;
 	int choose_first;
+	int choose_idle;
 
 	rcu_read_lock();
 	/*
@@ -502,6 +503,7 @@ static int read_balance(struct r1conf *c
 	best_dist = MaxSector;
 	min_pending = -1;
 	best_good_sectors = 0;
+	choose_idle = 0;
 
 	if (conf->mddev->recovery_cp < MaxSector &&
 	    (this_sector + sectors >= conf->next_resync))
@@ -580,16 +582,35 @@ static int read_balance(struct r1conf *c
 		nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
 		pending = atomic_read(&rdev->nr_pending);
 		dist = abs(this_sector - conf->mirrors[disk].head_position);
-		if (choose_first
-		    /* Don't change to another disk for sequential reads */
-		    || conf->mirrors[disk].next_seq_sect == this_sector
-		    || dist == 0
-		    /* If device is idle, use it */
-		    || pending == 0) {
-			best_disk = disk;
-			break;
+		if (choose_first)
+			goto done;
+		/* Don't change to another disk for sequential reads */
+		if (conf->mirrors[disk].next_seq_sect == this_sector
+		    || dist == 0) {
+			int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
+			struct mirror_info *mirror = &conf->mirrors[disk];
+
+			/*
+			 * If buffered sequential IO size exceeds optimal
+			 * iosize and there is idle disk, choose idle disk
+			 */
+			if (nonrot && opt_iosize > 0 && choose_idle == 0 &&
+			    mirror->seq_start != MaxSector &&
+			    mirror->next_seq_sect > opt_iosize &&
+			    mirror->next_seq_sect - opt_iosize >=
+			    mirror->seq_start) {
+				choose_idle = 1;
+				best_disk = disk;
+				continue;
+			}
+			goto done;
 		}
+		/* If device is idle, use it */
+		if (pending == 0 && (!choose_idle || nonrot))
+			goto done;
 
+		if (choose_idle)
+			continue;
 		/*
 		 * If all disks are rotational, choose the closest disk. If any
 	 * disk is non-rotational, choose the disk with fewer pending
@@ -606,6 +627,10 @@ static int read_balance(struct r1conf *c
 			best_dist = dist;
 			best_disk = disk;
 		}
+		continue;
+done:
+		best_disk = disk;
+		break;
 	}
 
 	if (best_disk >= 0) {
@@ -621,6 +646,10 @@ static int read_balance(struct r1conf *c
 			goto retry;
 		}
 		sectors = best_good_sectors;
+
+		if (conf->mirrors[best_disk].next_seq_sect != this_sector)
+			conf->mirrors[best_disk].seq_start = this_sector;
+
 		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
 	}
 	rcu_read_unlock();
@@ -2582,6 +2611,7 @@ static struct r1conf *setup_conf(struct
 			mddev->merge_check_needed = 1;
 
 		disk->head_position = 0;
+		disk->seq_start = MaxSector;
 	}
 	conf->raid_disks = mddev->raid_disks;
 	conf->mddev = mddev;
Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h	2012-06-29 15:17:09.152691892 +0800
+++ linux/drivers/md/raid1.h	2012-06-29 15:17:09.172691231 +0800
@@ -9,6 +9,7 @@ struct mirror_info {
 	 * we try to keep sequential reads on the same device
 	 */
 	sector_t	next_seq_sect;
+	sector_t	seq_start;
 };
 
 /*


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  1:08 ` [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD Shaohua Li
@ 2012-07-02  2:13   ` Roberto Spadim
  2012-07-02  3:02     ` Shaohua Li
  2012-07-04  5:45   ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: Roberto Spadim @ 2012-07-02  2:13 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe

nice =) very very nice =)
maybe to get better than this... select the disk with min(pending time)

time could be estimated with something like:
(distance * time/distance unit) + (blocks to read/write * time to
read/write 1 block) + (non-sequential penalty time)

for ssd:
    time/distance unit = 0
    time to read/write 1 block = must be tested for each device
    non-sequential penalty time = must be tested for each device, but some
tests show that non-sequential reads are close to sequential reads
for hd:
    time/distance unit and time to read/write 1 block are proportional
to disk speed (rpm)
    non-sequential penalty time is proportional to distance and head
position; many disk specs show that, in the worst case, it takes
near 10ms to start reading/writing; this is the time for the disk to spin
one revolution and put the head in the right position
        note that time to read/write on rotational disks changes with block
position and the number of heads reading (blocks at the center of the disk
are slower, blocks far from the center are faster)
        for ssd it changes with allocation 'problems' (for writes):
if a block is 'trimmed' it's very fast; if the block is used (dirty) it
must read the block, change it and write it back, which is slower... in
other words, the time to read is related to position and mean disk/ssd
read/write times (for a 'good' approximation, not an ideal one...); this
algorithm (without pending information) gave me a 1% mean improvement
in kernel 2.6.33 (must check, but i think that's right)



2012/7/1 Shaohua Li <shli@kernel.org>
>
> SSD hasn't spindle, distance between requests means nothing. And the original
> distance based algorithm sometimes can cause severe performance issue for SSD
> raid.
>
> Considering two thread groups, one accesses file A, the other access file B.
> The first group will access one disk and the second will access the other disk,
> because requests are near from one group and far between groups. In this case,
> read balance might keep one disk very busy but the other relative idle.  For
> SSD, we should try best to distribute requests to as more disks as possible.
> There isn't spindle move penality anyway.
>
> With below patch, I can see more than 50% throughput improvement sometimes
> depending on workloads.
>
> The only exception is small requests can be merged to a big request which
> typically can drive higher throughput for SSD too. Such small requests are
> sequential reads. Unlike hard disk, sequential read which can't be merged (for
> example direct IO, or read without readahead) can be ignored for SSD. Again
> there is no spindle move penality. readahead dispatches small requests and such
> requests can be merged.
>
> Last patch can help detect sequential read well, at least if concurrent read
> number isn't greater than raid disk number. In that case, distance based
> algorithm doesn't work well too.
>
> V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
> random IO too. This makes the algorithm generic for raid with SSD.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid1.c |   23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c       2012-06-28 16:56:20.846401902 +0800
> +++ linux/drivers/md/raid1.c    2012-06-29 14:13:23.856781798 +0800
> @@ -486,6 +486,7 @@ static int read_balance(struct r1conf *c
>         int best_disk;
>         int i;
>         sector_t best_dist;
> +       unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
>
> @@ -499,6 +500,7 @@ static int read_balance(struct r1conf *c
>         sectors = r1_bio->sectors;
>         best_disk = -1;
>         best_dist = MaxSector;
> +       min_pending = -1;
>         best_good_sectors = 0;
>
>         if (conf->mddev->recovery_cp < MaxSector &&
> @@ -511,6 +513,8 @@ static int read_balance(struct r1conf *c
>                 sector_t dist;
>                 sector_t first_bad;
>                 int bad_sectors;
> +               bool nonrot;
> +               unsigned int pending;
>
>                 int disk = i;
>                 if (disk >= conf->raid_disks)
> @@ -573,17 +577,32 @@ static int read_balance(struct r1conf *c
>                 } else
>                         best_good_sectors = sectors;
>
> +               nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
> +               pending = atomic_read(&rdev->nr_pending);
>                 dist = abs(this_sector - conf->mirrors[disk].head_position);
>                 if (choose_first
>                     /* Don't change to another disk for sequential reads */
>                     || conf->mirrors[disk].next_seq_sect == this_sector
>                     || dist == 0
>                     /* If device is idle, use it */
> -                   || atomic_read(&rdev->nr_pending) == 0) {
> +                   || pending == 0) {
>                         best_disk = disk;
>                         break;
>                 }
> -               if (dist < best_dist) {
> +
> +               /*
> +                * If all disks are rotational, choose the closest disk. If any
> +                * disk is non-rotational, choose the disk with less pending
> +                * request even the disk is rotational, which might/might not
> +                * be optimal for raids with mixed ratation/non-rotational
> +                * disks depending on workload.
> +                */
> +               if (nonrot || min_pending != -1) {
> +                       if (min_pending > pending) {
> +                               min_pending = pending;
> +                               best_disk = disk;
> +                       }
> +               } else if (dist < best_dist) {
>                         best_dist = dist;
>                         best_disk = disk;
>                 }
>




--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  2:13   ` Roberto Spadim
@ 2012-07-02  3:02     ` Shaohua Li
  2012-07-02  3:57       ` Roberto Spadim
  2012-07-02  4:31       ` Roberto Spadim
  0 siblings, 2 replies; 14+ messages in thread
From: Shaohua Li @ 2012-07-02  3:02 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: linux-raid, neilb, axboe

On Sun, Jul 01, 2012 at 11:13:42PM -0300, Roberto Spadim wrote:
> nice =) very very nice =)
> maybe to get better than this... select the disk with min(pending time)
> 
> time could be estimated with something like:
> (distance * time/distance unit) + (blocks to read/write * time to
> read/write 1 block) + (non sequencial penalty time)

I don't think there is a way to measure request time. Disks support NCQ; they
can dispatch several requests at once and finish them almost at the same time.
I tried once to measure this time (to make the CFQ ioscheduler self-tune), but
failed.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  3:02     ` Shaohua Li
@ 2012-07-02  3:57       ` Roberto Spadim
  2012-07-02  4:33         ` Roberto Spadim
  2012-07-02  4:31       ` Roberto Spadim
  1 sibling, 1 reply; 14+ messages in thread
From: Roberto Spadim @ 2012-07-02  3:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe

hmmm, well, that's true... there is a queue inside the disk hardware that
we can't measure... but... if you want, i can run tests for you :)
i used a slightly different configuration some time ago: instead of an SSD
and a hard disk, i used a 7200rpm disk and a 15000rpm disk. the
"time based" algorithm ran nicely in that case; maybe it could give just a
little more 'performance' (maybe none). like i said, the mean improvement
i got was 1% (i ran tests with different disk speeds and with
ssd+disks; i had an ocz vertex2, a sata 7200rpm (500gb) and a sas
15000rpm (142gb)). some other guy here on the kernel list tested too, but
they didn't confirm whether the improvement was a mean improvement or just
an 'error' in measurement.

when i did this i got some 'empirical' values to 'tune' the
algorithm. i don't remember all the 'theory', but i did something like
this:


1) (distance * time/distance unit)
time/distance unit:
    i don't remember the distance unit, i think it's 1 block = 512 bytes,
right? well, just check the idea...
    for disks:
        1 revolution time / total disk capacity in distance units
        1 revolution time = 1/rpm, for example
              7200 rpm => 120 hz => 8.333ms = 8333us (near the 10ms random
access time told in the disk spec)
              15000 rpm => 250 hz => 4ms = 4000us (near the 5ms told in the
disk spec)
    for ssd: 0 seconds
	7200rpm => 500gb (*1024*1024*1024/512) = 1048576000 blocks =>
8333us / 1048576000 blocks = 0.000'007'946'968 us/block
	15000rpm => 142gb (*1024*1024*1024/512) = 297795584 blocks =>
4000us / 297795584 blocks = 0.000'013'432'032 us/block
	ssd => 0 us/block
		0.000007946 for 7200rpm,
		0.000013432 for 15000rpm,
		0 for ssd



2) (blocks to read/write * time to read/write 1 block)
 for this part i put dd to work...
  dd if=/dev/sda of=/dev/null (there were some flags to bypass the cache
too, but i don't remember them now...)
   and used iostat -d 1 -k to get the mean read performance
 i don't remember the exact numbers, but they were something near this:
    ssd - 230mb/s = 230*1024*1024 bytes / 512 bytes => 471040
blocks/second => 2.122us / block
    hd 7200 - 120mb/s => 245760 blocks/second => 4.069us / block
    hd 15000 - 170mb/s => 348160 blocks/second => 2.872us / block

3) (non-sequential penalty time)
here i used two dd instances (some seconds between the first and second dd)
and got the new mb/s values:
the ssd went down a bit, but not much: 230 -> 200
hd 7200: 120 -> 90
hd 15000: 170 -> 150

from these losses i derived a 'penalty' value:
(230-200)/230 = 13.043%
(120-90)/120 = 25%
(170-150)/170 = 11.76%

i don't remember if i applied the penalty when distance=0, or if i used
it like today's implementation, which selects the previous disk when
reading the full md device

======
with these numbers... here are some selections the algorithm would make...
sda=ssd, sdb=15000rpm, sdc=7200rpm

sda|sdb|sdc
disk positions: 0 | 0 | 0
read 100 blocks at position 20000...
sda => distance = 20000, estimated time = 20000*0 + 2.122*100 + 13.043%
		in other words...
			(        0 + 212.2) * 1.13043 = 239.877246
sdb => distance = 20000, estimated time = 20000*0.000013432 + 2.872*100
+ 11.76% =
			(0.26864 + 287.2) * 1.1176 = 321.274952064
sdc => distance = 20000, estimated time = 20000*0.000007946 + 4.069*100 + 25% =
			(0.15892 + 406.9) * 1.25 = 508.82365
	HERE WE SELECT sda (239.877)

disk positions: 200 | 0 | 0
read 100 blocks at position 0...
sda => distance = 200, estimated time = 200*0 + 2.122*100 + 13.043%
			(        0 + 212.2) * 1.13043 = 239.877246
sdb => distance = 0, estimated time = 0*0.000013432 + 2.872*100 + 0% =
	(no penalty here since we are at the right place)
			(        0 + 287.2) * 1 = 287.2
sdc => distance = 0, estimated time = 0*0.000007946 + 4.069*100 + 0% =
			(        0 + 406.9) * 1 = 406.9
	sda again...
	note that it will always select sda... since it is fastest on
distance (0 seconds) and has the highest transfer rate

that's where my algorithm didn't work well... (it knows nothing about
the past or the queue, just the current read)

but now... with someone who knows the kernel code... we have this
information about pending requests =D

i think we can go inside the queue and calculate the total estimated
time =), or not?
	for each pending request we would calculate these times... and sum
them to select the 'best' disk
	i didn't code this since i don't know how to get information from
the queue in the kernel =( and my hobby time ended ='(

thanks for reading...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  3:02     ` Shaohua Li
  2012-07-02  3:57       ` Roberto Spadim
@ 2012-07-02  4:31       ` Roberto Spadim
  2012-07-02  4:36         ` Roberto Spadim
  1 sibling, 1 reply; 14+ messages in thread
From: Roberto Spadim @ 2012-07-02  4:31 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe

for SSDs without intelligent parts (cache and queue) i think this could
work nicely... (at least in theory)
for USB sticks and microsd cards it may work too...


2012/7/2 Shaohua Li <shli@kernel.org>:
> On Sun, Jul 01, 2012 at 11:13:42PM -0300, Roberto Spadim wrote:
>> nice =) very very nice =)
>> maybe to get better than this... select the disk with min(pending time)
>>
>> time could be estimated with something like:
>> (distance * time/distance unit) + (blocks to read/write * time to
>> read/write 1 block) + (non sequencial penalty time)
>
> I didn't think there is way to measure request time. Disks support NCQ, they
> can dispatch several requests at one time and finish them almost at the same
> time. I used to measure this time (to make CFQ ioscheduler self tune), but
> failed.
>
> Thanks,
> Shaohua



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  3:57       ` Roberto Spadim
@ 2012-07-02  4:33         ` Roberto Spadim
  0 siblings, 0 replies; 14+ messages in thread
From: Roberto Spadim @ 2012-07-02  4:33 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe

note that if you don't want this algorithm, you could use:
distance time = 1
read time = 0
penalty = 0
and it would work like today's implementation... (ok, i must check whether
this works for the single-disk versus full-array read case, but it's close)
2012/7/2 Roberto Spadim <roberto@spadim.com.br>:
> hmmm, well, that's true... there is a queue inside the disk hardware that we
> can't measure... but if you want I can run tests for you :)
> I used a slightly different configuration some time ago: instead of an SSD
> and a hard disk, I used a 7200rpm disk and a 15000rpm disk. the
> "time based" algorithm ran nicely in that case; maybe it could give just a
> little more 'performance' (maybe none). like I said, the mean improvement
> I got was 1% (I ran tests with different disk speeds and
> ssd+disks; I had an OCZ Vertex2, a 7200rpm SATA (500gb) and a 15000rpm SAS
> (142gb)). some other guy here on the kernel list tested too, but
> they didn't confirm whether the gain was a real mean improvement or just
> measurement error
>
> when I did this I got some 'empirical' values to 'tune' the
> algorithm; I don't remember all the 'theory', but I did something like
> this:
>
>
> 1) (distance * time/distance unit)
> time/distance unit:
>     I don't remember the distance unit; I think it's 1 block = 512 bytes,
> right? well, just check the idea...
>     for disks:
>         1 revolution time / total disk capacity in distance units
>         1 revolution time = 60/rpm, for example
>               7200 rpm => 120 Hz => 8.333ms = 8333us (near the 10ms random
> access time in the disk spec)
>               15000 rpm => 250 Hz => 4ms = 4000us (near the 5ms in the disk
> spec)
>     for ssd: 0 seconds
>         7200 => 500gb => 500*(1024*1024*1024)/512 = 1048576000 blocks;
> 8333us / 1048576000 blocks = 0.000'007'946 us/block
>         15000 => 142gb => 142*(1024*1024*1024)/512 = 297795584 blocks;
> 4000us / 297795584 blocks = 0.000'013'432 us/block
>         ssd => 0 us/block
>                 0.000007946 us/block for 7200rpm,
>                 0.000013432 us/block for 15000rpm,
>                 0 for ssd
>
>
>
> 2) (blocks to read/write * time to read/write 1 block)
>  for this part I put dd to work...
>   dd if=/dev/sda of=/dev/null (there were some flags to bypass the cache
> too, but I don't remember them now...)
>    and used iostat -d 1 -k to get the mean read performance
>  I don't remember the exact numbers, but they were something near this:
>     ssd - 230MB/s => 230*(1024*1024)/512 = 471040 blocks/second =>
> 2.122us/block
>     hd 7200 - 120MB/s => 245760 blocks/second => 4.069us/block
>     hd 15000 - 170MB/s => 348160 blocks/second => 2.872us/block
>
> 3) (non-sequential penalty time)
> here I used two concurrent dd runs (with some seconds between the first and
> second dd) and got the new MB/s values:
> the ssd dropped a bit, but not much: 230 -> 200
> hd 7200: 120 -> 90
> hd 15000: 170 -> 150
>
> from these losses I derived a 'penalty' value:
> (230-200)/230 = 13.043%
> (120-90)/120 = 25%
> (170-150)/170 = 11.76%
>
> I don't remember if I applied the penalty when distance=0, or if I did it
> like today's implementation, which selects the previous disk when
> reading the full md device
>
> ======
> with these numbers... some example selections the algorithm would make...
> sda=ssd, sdb=15000rpm, sdc=7200rpm
>
> sda|sdb|sdc
> disk positions: 0 | 0 | 0
> read 100 blocks at position 20000...
> sda => distance = 20000, estimated time = 20000*0 + 2.122*100 + 13.043%
>                 in other words...
>                         (      0 + 212.2) * 1.13043 = 239.877246
> sdb => distance = 20000, estimated time = 20000*0.000013432 + 2.872*100
> + 11.76% =
>                         (0.26864 + 287.2) * 1.1176 = 321.274952064
> sdc => distance = 20000, estimated time = 20000*0.000007946 + 4.069*100 + 25% =
>                         (0.15892 + 406.9) * 1.25 = 508.82365
>         HERE WE SELECT sda (239.877)
>
> disk positions: 200 | 0 | 0
> read 100 blocks at position 0...
> sda => distance = 200, estimated time = 200*0 + 2.122*100 + 13.043%
>                         (      0 + 212.2) * 1.13043 = 239.877246
> sdb => distance = 0, estimated time = 0*0.000013432 + 2.872*100 + 0% =
>         (no penalty here since we are at the right place)
>                         (      0 + 287.2) * 1 = 287.2
> sdc => distance = 0, estimated time = 0*0.000007946 + 4.069*100 + 0% =
>                         (      0 + 406.9) * 1 = 406.9
>         sda again...
>         note that it will always select sda... since distance costs it
> nothing (0 seconds) and it has the highest transfer rate
>
> that's where my algorithm didn't work well... (it knows nothing
> about the past or the queue, just the current read)
>
> but now... with someone who knows the kernel code... we have this
> information about pending requests =D
>
> i think we can go inside the queue and calculate the total estimated time =), or not?
>         for each pending request we would calculate these times... and sum
> them to select the 'best' disk
>         I never coded this since I don't know how to get information from
> the queue in the kernel =( and my hobby time ended ='(
>
> thanks for reading....



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  4:31       ` Roberto Spadim
@ 2012-07-02  4:36         ` Roberto Spadim
  0 siblings, 0 replies; 14+ messages in thread
From: Roberto Spadim @ 2012-07-02  4:36 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe

sorry about the many posts, I will stop for a bit now...

could we turn off disk NCQ? ok, nobody wants to read this, but could we
turn off NCQ and get better performance, since we do a good job of
elevator scheduling in the Linux code? maybe turn off the disk cache too...
(it's only 32MB for sata/sas, though it can be a lot on a raid
controller...), just some ideas to get better performance... (I hadn't
thought about whether the cache helps or not when we send I/O to the disk,
but consider whether it's a good question or not)

2012/7/2 Roberto Spadim <roberto@spadim.com.br>:
> for SSDs without intelligent parts (cache and queue) I think this could
> work nicely... (at least in theory)
> for USB sticks and microSD cards it may work too...
>
>
> 2012/7/2 Shaohua Li <shli@kernel.org>:
>> On Sun, Jul 01, 2012 at 11:13:42PM -0300, Roberto Spadim wrote:
>>> nice =) very very nice =)
>>> maybe to get better than this... select the disk with min(pending time)
>>>
>>> time could be estimated with something like:
>>> (distance * time/distance unit) + (blocks to read/write * time to
>>> read/write 1 block) + (non-sequential penalty time)
>>
>> I don't think there is a way to measure request time. Disks support NCQ; they
>> can dispatch several requests at a time and finish them at almost the same
>> time. I once tried to measure this time (to make the CFQ ioscheduler
>> self-tune), but failed.
>>
>> Thanks,
>> Shaohua
>
>
>
> --
> Roberto Spadim
> Spadim Technology / SPAEmpresarial



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 1/3 v3] raid1: make sequential read detection per disk based
  2012-07-02  1:08 ` [patch 1/3 v3] raid1: make sequential read detection per disk based Shaohua Li
@ 2012-07-04  5:38   ` NeilBrown
  0 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2012-07-04  5:38 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe

[-- Attachment #1: Type: text/plain, Size: 2076 bytes --]

On Mon, 02 Jul 2012 09:08:41 +0800 Shaohua Li <shli@kernel.org> wrote:

> Currently the sequential read detection is global. It's natural to make it
> per-disk based, which improves detection for concurrent multiple sequential
> reads. The next patch will make SSD read balancing not use the distance-based
> algorithm, where this change helps detect truly sequential reads for SSDs.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid1.c |   29 ++++++-----------------------
>  drivers/md/raid1.h |   11 +++++------
>  2 files changed, 11 insertions(+), 29 deletions(-)
> 
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c	2012-06-28 10:44:47.550666575 +0800
> +++ linux/drivers/md/raid1.c	2012-06-28 12:01:03.513137377 +0800
> @@ -483,7 +483,6 @@ static int read_balance(struct r1conf *c
>  	const sector_t this_sector = r1_bio->sector;
>  	int sectors;
>  	int best_good_sectors;
> -	int start_disk;
>  	int best_disk;
>  	int i;
>  	sector_t best_dist;
> @@ -503,20 +502,17 @@ static int read_balance(struct r1conf *c
>  	best_good_sectors = 0;
>  
>  	if (conf->mddev->recovery_cp < MaxSector &&
> -	    (this_sector + sectors >= conf->next_resync)) {
> +	    (this_sector + sectors >= conf->next_resync))
>  		choose_first = 1;
> -		start_disk = 0;
> -	} else {
> +	else
>  		choose_first = 0;
> -		start_disk = conf->last_used;
> -	}
>  
>  	for (i = 0 ; i < conf->raid_disks * 2 ; i++) {
>  		sector_t dist;
>  		sector_t first_bad;
>  		int bad_sectors;
>  
> -		int disk = start_disk + i;
> +		int disk = i;
>  		if (disk >= conf->raid_disks)
>  			disk -= conf->raid_disks;

One last little change please.  Could you make that loop:


     for (disk = 0; disk < conf->raid_disks * 2; disk++)

and remove the 'if' statement that wraps 'disk' back to zero
when it gets too big (which is buggy and has been fixed in my for-next).

With that change I'll take the patch.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD
  2012-07-02  1:08 ` [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD Shaohua Li
  2012-07-02  2:13   ` Roberto Spadim
@ 2012-07-04  5:45   ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2012-07-04  5:45 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe

[-- Attachment #1: Type: text/plain, Size: 3663 bytes --]

On Mon, 02 Jul 2012 09:08:42 +0800 Shaohua Li <shli@kernel.org> wrote:

> An SSD has no spindle, so the distance between requests means nothing. And the
> original distance-based algorithm can sometimes cause severe performance
> issues for an SSD raid.
>
> Consider two thread groups, one accessing file A, the other accessing file B.
> The first group will access one disk and the second will access the other
> disk, because requests within a group are near each other and requests between
> groups are far apart. In this case, read balancing might keep one disk very
> busy while the other stays relatively idle. For SSDs, we should try our best
> to distribute requests across as many disks as possible. There is no
> spindle-move penalty anyway.
>
> With the patch below, I can see more than 50% throughput improvement,
> depending on workload.
>
> The only exception is small requests that can be merged into a big request,
> which typically drives higher throughput for SSDs too. Such small requests are
> sequential reads. Unlike for hard disks, sequential reads which can't be
> merged (for example direct IO, or reads without readahead) can be ignored for
> SSDs. Again, there is no spindle-move penalty. Readahead dispatches small
> requests, and such requests can be merged.
>
> The last patch helps detect sequential reads well, at least when the number of
> concurrent reads isn't greater than the number of raid disks. In that case,
> the distance-based algorithm doesn't work well either.
>
> V2: For a hard disk and SSD mixed raid, don't use the distance-based algorithm
> for random IO either. This makes the algorithm generic for raids with SSDs.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid1.c |   23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c	2012-06-28 16:56:20.846401902 +0800
> +++ linux/drivers/md/raid1.c	2012-06-29 14:13:23.856781798 +0800
> @@ -486,6 +486,7 @@ static int read_balance(struct r1conf *c
>  	int best_disk;
>  	int i;
>  	sector_t best_dist;
> +	unsigned int min_pending;
>  	struct md_rdev *rdev;
>  	int choose_first;
>  
> @@ -499,6 +500,7 @@ static int read_balance(struct r1conf *c
>  	sectors = r1_bio->sectors;
>  	best_disk = -1;
>  	best_dist = MaxSector;
> +	min_pending = -1;
>  	best_good_sectors = 0;
>  
>  	if (conf->mddev->recovery_cp < MaxSector &&


That's not good form - declaring a variable "unsigned int" then assigning
"-1" to it.
Maybe initialise it to "UINT_MAX".

> -		if (dist < best_dist) {
> +
> +		/*
> +		 * If all disks are rotational, choose the closest disk. If any
> +		 * disk is non-rotational, choose the disk with less pending
> +		 * request even the disk is rotational, which might/might not
> +		 * be optimal for raids with mixed ratation/non-rotational
> +		 * disks depending on workload.
> +		 */
> +		if (nonrot || min_pending != -1) {
> +			if (min_pending > pending) {
> +				min_pending = pending;
> +				best_disk = disk;
> +			}
> +		} else if (dist < best_dist) {
>  			best_dist = dist;
>  			best_disk = disk;
>  		}

I don't think it is clear that the code matches the comment in all cases.
e.g. if you have 3 disks and the first 2 are rotational, then you wouldn't
examine the 'pending' count of the 2nd disk at all.
Also you never examine the distance for the first disk.

Maybe have a 'best_pending_disk' and a 'best_dist_disk' and, after the loop,
use best_pending_disk if any device was non-rotational, and best_dist_disk
if all were rotating.

Thanks,
NeilBrown



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 3/3 v3] raid1: prevent merging too large request
  2012-07-02  1:08 ` [patch 3/3 v3] raid1: prevent merging too large request Shaohua Li
@ 2012-07-04  5:59   ` NeilBrown
  2012-07-04  8:01     ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-07-04  5:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe

[-- Attachment #1: Type: text/plain, Size: 4333 bytes --]

On Mon, 02 Jul 2012 09:08:43 +0800 Shaohua Li <shli@kernel.org> wrote:

> For an SSD, once request size exceeds a specific value (the optimal IO size),
> request size isn't important for bandwidth. In that situation, if making a
> request bigger causes some disks to go idle, total throughput will actually
> drop. A good example is doing readahead in a two-disk raid1 setup.
>
> So when should we split a big request? We absolutely don't want to split a big
> request into very small requests. Even on an SSD, a big transfer is more
> efficient. The patch below only considers requests with a size above the
> optimal IO size.
>
> If all disks are busy, is splitting worth it? Say the optimal IO size is 16k,
> with two 32k requests and two disks. We can let each disk run one 32k request,
> or split the requests into four 16k requests with each disk running two. It's
> hard to say which case is better; it depends on the hardware.
>
> So only consider the case where there are idle disks. For readahead, splitting
> is always better in this case. And in my test, the patch below improves
> throughput by more than 30%. Not 100%, because the disks aren't 100% busy.
>
> This case can happen not just with readahead but, for example, with direct IO.
> But I suppose direct IO usually has a bigger IO depth and keeps all disks
> busy, so I ignored it.
>
> Note: if the raid uses any hard disk, we don't prevent merging. That would
> make performance worse.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid1.c |   46 ++++++++++++++++++++++++++++++++++++++--------
>  drivers/md/raid1.h |    1 +
>  2 files changed, 39 insertions(+), 8 deletions(-)
> 
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c	2012-06-29 15:17:09.160691540 +0800
> +++ linux/drivers/md/raid1.c	2012-06-29 15:57:54.993943576 +0800
> @@ -489,6 +489,7 @@ static int read_balance(struct r1conf *c
>  	unsigned int min_pending;
>  	struct md_rdev *rdev;
>  	int choose_first;
> +	int choose_idle;
>  
>  	rcu_read_lock();
>  	/*
> @@ -502,6 +503,7 @@ static int read_balance(struct r1conf *c
>  	best_dist = MaxSector;
>  	min_pending = -1;
>  	best_good_sectors = 0;
> +	choose_idle = 0;
>  
>  	if (conf->mddev->recovery_cp < MaxSector &&
>  	    (this_sector + sectors >= conf->next_resync))
> @@ -580,16 +582,35 @@ static int read_balance(struct r1conf *c
>  		nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
>  		pending = atomic_read(&rdev->nr_pending);
>  		dist = abs(this_sector - conf->mirrors[disk].head_position);
> -		if (choose_first
> -		    /* Don't change to another disk for sequential reads */
> -		    || conf->mirrors[disk].next_seq_sect == this_sector
> -		    || dist == 0
> -		    /* If device is idle, use it */
> -		    || pending == 0) {
> -			best_disk = disk;
> -			break;
> +		if (choose_first)
> +			goto done;
> +		/* Don't change to another disk for sequential reads */
> +		if (conf->mirrors[disk].next_seq_sect == this_sector
> +		    || dist == 0) {
> +			int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> +			struct mirror_info *mirror = &conf->mirrors[disk];
> +
> +			/*
> +			 * If buffered sequential IO size exceeds optimal
> +			 * iosize and there is idle disk, choose idle disk
> +			 */
> +			if (nonrot && opt_iosize > 0 && choose_idle == 0 &&
> +			    mirror->seq_start != MaxSector &&
> +			    mirror->next_seq_sect > opt_iosize &&
> +			    mirror->next_seq_sect - opt_iosize >=
> +			    mirror->seq_start) {
> +				choose_idle = 1;
> +				best_disk = disk;
> +				continue;
> +			}
> +			goto done;
>  		}
> +		/* If device is idle, use it */
> +		if (pending == 0 && (!choose_idle || nonrot))
> +			goto done;

That doesn't make sense to me.
Ignoring the nonrot bit, it says:
   if (this device is idle, and we haven't decided to choose an idle device)
        then choose this device

which is wrong.

Also, what about the case where an idle device is found before we find a
device that the sequentially-previous read was from?  Will that still do the
right thing?  It isn't clear.


And I don't really like the "goto done".  Just have
        best_disk = disk;
        break;

it isn't that much more code.


Thanks,
NeilBrown


>  

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 3/3 v3] raid1: prevent merging too large request
  2012-07-04  5:59   ` NeilBrown
@ 2012-07-04  8:01     ` Shaohua Li
  0 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2012-07-04  8:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe

2012/7/4 NeilBrown <neilb@suse.de>:
> On Mon, 02 Jul 2012 09:08:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>> For an SSD, once request size exceeds a specific value (the optimal IO size),
>> request size isn't important for bandwidth. In that situation, if making a
>> request bigger causes some disks to go idle, total throughput will actually
>> drop. A good example is doing readahead in a two-disk raid1 setup.
>>
>> So when should we split a big request? We absolutely don't want to split a big
>> request into very small requests. Even on an SSD, a big transfer is more
>> efficient. The patch below only considers requests with a size above the
>> optimal IO size.
>>
>> If all disks are busy, is splitting worth it? Say the optimal IO size is 16k,
>> with two 32k requests and two disks. We can let each disk run one 32k request,
>> or split the requests into four 16k requests with each disk running two. It's
>> hard to say which case is better; it depends on the hardware.
>>
>> So only consider the case where there are idle disks. For readahead, splitting
>> is always better in this case. And in my test, the patch below improves
>> throughput by more than 30%. Not 100%, because the disks aren't 100% busy.
>>
>> This case can happen not just with readahead but, for example, with direct IO.
>> But I suppose direct IO usually has a bigger IO depth and keeps all disks
>> busy, so I ignored it.
>>
>> Note: if the raid uses any hard disk, we don't prevent merging. That would
>> make performance worse.
>>
>> Signed-off-by: Shaohua Li <shli@fusionio.com>
>> ---
>>  drivers/md/raid1.c |   46 ++++++++++++++++++++++++++++++++++++++--------
>>  drivers/md/raid1.h |    1 +
>>  2 files changed, 39 insertions(+), 8 deletions(-)
>>
>> Index: linux/drivers/md/raid1.c
>> ===================================================================
>> --- linux.orig/drivers/md/raid1.c     2012-06-29 15:17:09.160691540 +0800
>> +++ linux/drivers/md/raid1.c  2012-06-29 15:57:54.993943576 +0800
>> @@ -489,6 +489,7 @@ static int read_balance(struct r1conf *c
>>       unsigned int min_pending;
>>       struct md_rdev *rdev;
>>       int choose_first;
>> +     int choose_idle;
>>
>>       rcu_read_lock();
>>       /*
>> @@ -502,6 +503,7 @@ static int read_balance(struct r1conf *c
>>       best_dist = MaxSector;
>>       min_pending = -1;
>>       best_good_sectors = 0;
>> +     choose_idle = 0;
>>
>>       if (conf->mddev->recovery_cp < MaxSector &&
>>           (this_sector + sectors >= conf->next_resync))
>> @@ -580,16 +582,35 @@ static int read_balance(struct r1conf *c
>>               nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
>>               pending = atomic_read(&rdev->nr_pending);
>>               dist = abs(this_sector - conf->mirrors[disk].head_position);
>> -             if (choose_first
>> -                 /* Don't change to another disk for sequential reads */
>> -                 || conf->mirrors[disk].next_seq_sect == this_sector
>> -                 || dist == 0
>> -                 /* If device is idle, use it */
>> -                 || pending == 0) {
>> -                     best_disk = disk;
>> -                     break;
>> +             if (choose_first)
>> +                     goto done;
>> +             /* Don't change to another disk for sequential reads */
>> +             if (conf->mirrors[disk].next_seq_sect == this_sector
>> +                 || dist == 0) {
>> +                     int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>> +                     struct mirror_info *mirror = &conf->mirrors[disk];
>> +
>> +                     /*
>> +                      * If buffered sequential IO size exceeds optimal
>> +                      * iosize and there is idle disk, choose idle disk
>> +                      */
>> +                     if (nonrot && opt_iosize > 0 && choose_idle == 0 &&
>> +                         mirror->seq_start != MaxSector &&
>> +                         mirror->next_seq_sect > opt_iosize &&
>> +                         mirror->next_seq_sect - opt_iosize >=
>> +                         mirror->seq_start) {
>> +                             choose_idle = 1;
>> +                             best_disk = disk;
>> +                             continue;
>> +                     }
>> +                     goto done;
>>               }
>> +             /* If device is idle, use it */
>> +             if (pending == 0 && (!choose_idle || nonrot))
>> +                     goto done;
>
> That doesn't make sense to me.
> Ignoring the nonrot bit, it says:
>    if (this device is idle, and we haven't decided to choose an idle device)
>         then choose this device

The variable name may be misleading. choose_idle means a sequential read
has hit the optimal request size on a disk, and we check whether other disks
are idle. In that case, we only consider idle disks.

> which is wrong.
>
> Also, what about the case where an idle device is found before we find a
> device that the sequentially-previous read was from?  Will that still do the
> right thing?  It isn't clear.

This doesn't matter. The first disk will run and the second disk will be
idle. When the first disk's sequential IO exceeds the optimal size, the
second disk will take over.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-07-04  8:01 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-02  1:08 [patch 0/3 v3] Optimize raid1 read balance for SSD Shaohua Li
2012-07-02  1:08 ` [patch 1/3 v3] raid1: make sequential read detection per disk based Shaohua Li
2012-07-04  5:38   ` NeilBrown
2012-07-02  1:08 ` [patch 2/3 v3] raid1: read balance chooses idlest disk for SSD Shaohua Li
2012-07-02  2:13   ` Roberto Spadim
2012-07-02  3:02     ` Shaohua Li
2012-07-02  3:57       ` Roberto Spadim
2012-07-02  4:33         ` Roberto Spadim
2012-07-02  4:31       ` Roberto Spadim
2012-07-02  4:36         ` Roberto Spadim
2012-07-04  5:45   ` NeilBrown
2012-07-02  1:08 ` [patch 3/3 v3] raid1: prevent merging too large request Shaohua Li
2012-07-04  5:59   ` NeilBrown
2012-07-04  8:01     ` Shaohua Li
