From: Coly Li <colyli@suse.de>
To: Avi Kivity <avi@scylladb.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: raid0 vs. mkfs
Date: Mon, 23 Jan 2017 20:26:39 +0800	[thread overview]
Message-ID: <a89fd8ba-9238-2490-1c3f-3e8ac62212d7@suse.de> (raw)
In-Reply-To: <06411e37-ee35-c8d9-a578-4a9cd2fbb0d9@scylladb.com>

[-- Attachment #1: Type: text/plain, Size: 2130 bytes --]

On 2017/1/23 2:01 AM, Avi Kivity wrote:
> Hello,
> 
> 
> On 11/27/2016 05:24 PM, Avi Kivity wrote:
>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>> disk that supports TRIM/DISCARD (erase whichever is inappropriate). 
>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>> inappropriate) for the entire partition. As far as I can tell, md
>> converts the large TRIM/DISCARD (erase whichever is inappropriate)
>> into a large number of TRIM/DISCARD (erase whichever is inappropriate)
>> requests, one per chunk-size worth of disk, and issues them to the
>> RAID components individually.
>>
>>
>> It seems to me that md can convert the large TRIM/DISCARD (erase
>> whichever is inappropriate) request it gets into one TRIM/DISCARD
>> (erase whichever is inappropriate) per RAID component, converting an
>> O(disk size / chunk size) operation into an O(number of RAID
>> components) operation, which is much faster.
>>
>>
>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices,
>> with the operation taking about a quarter of an hour, continuously
>> pushing half-megabyte TRIM/DISCARD (erase whichever is inappropriate)
>> requests to the disk. Linux 4.1.12.
>>
> 
> Did anyone pick this up by any chance?  The only thing I could find is
> more people complaining about the same issue.

Hi Avi,

I proposed a POC patch; Shaohua and Neil provided review comments and
suggested that I simplify it.

If you look back through this email thread, there is a patch I sent out
on Dec 9, 2016. That patch works, but it is not the final version to be
accepted upstream. I quote the performance numbers here:

“ On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel
spends 306 seconds, the patched kernel spends 15 seconds. I see average
request size increases from 1 chunk (1024 sectors) to 2048 chunks
(2097152 sectors).”
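
(A quick unit conversion of those request sizes, assuming 512-byte
sectors, for anyone who prefers bytes to sector counts; just a trivial
reference check, not part of the patch:)

/* Quick sanity check of the quoted request sizes, assuming 512-byte
 * sectors: 1024 sectors is the half-megabyte request Avi reported,
 * 2097152 sectors is a 1 GiB request after the patch.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long before = 1024, after = 2097152;	/* sectors */

	printf("before: %llu KiB per request\n", before * 512 / 1024);
	printf("after:  %llu MiB per request\n", after * 512 / (1024 * 1024));
	return 0;
}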

I call this the RFC v2 patch, and attach it again to this email. I am
now working on a new version based on the suggestions from Shaohua and
Neil, but any testing feedback on the RFC v2 patch is welcome. It works,
it is just too complex to review.

Thanks.

Coly

[-- Attachment #2: raid0_handle_large_discard_bio.patch --]
[-- Type: text/plain, Size: 11145 bytes --]

Subject: [RFC v2] optimization for large DISCARD bios by per-device bios

This is a very early prototype; it still needs more block layer
modifications before it can work.

The current upstream raid0_make_request() handles a TRIM/DISCARD bio
only chunk by chunk, which means that for a large raid0 device built
from SSDs it calls generic_make_request() millions of times for the
split bios. This patch tries to combine the small bios into a large one
when they map to the same real device and are contiguous on that device,
then sends the combined large bio to the underlying device with a single
call to generic_make_request().

For example, when mkfs.xfs trims a raid0 device built from 4 x 3TB
NVMe SSDs, the current upstream raid0_make_request() calls
generic_make_request() 5.7 million times; with this patch only 4 calls
to generic_make_request() are required.

This patch won't work in the real world yet, because
__blkdev_issue_discard() in block/blk-lib.c splits the original large
bio into smaller ones due to the discard_granularity restriction.

If some day an SSD supports a whole-device-sized discard_granularity,
things will get very interesting...
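
The splitting above raid0 works roughly like the sketch below. This is
only a simplified userspace model with made-up numbers, not the real
blk-lib.c code, but it shows why raid0_make_request() ends up seeing a
huge number of small bios instead of one large one:

/* Simplified model of how an upper layer chops a big discard range into
 * limit-sized pieces before submitting them. Made-up numbers, userspace
 * only; the real logic lives in block/blk-lib.c and the queue limits.
 */
#include <stdio.h>

typedef unsigned long long sector_t;

static void issue_discard_model(sector_t nr_sects, sector_t split_limit)
{
	unsigned long nr_bios = 0;

	while (nr_sects) {
		sector_t req_sects = nr_sects < split_limit ?
				     nr_sects : split_limit;

		nr_bios++;		/* one bio submitted downstream */
		nr_sects -= req_sects;
	}
	printf("%lu bios submitted for the range\n", nr_bios);
}

int main(void)
{
	/* hypothetical: discard 1 TiB (2^31 sectors), split at 1024 sectors */
	issue_discard_model(1ULL << 31, 1024);
	return 0;
}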

The basic idea: suppose raid0_make_request() receives a large discard
bio, for example one that requests to discard chunks 1 to 24 on a raid0
device built from 4 SSDs. The large discard bio will be split and sent
to each SSD with the following layout,
	SSD1: C1,C5,C9,C13,C17,C21
	SSD2: C2,C6,C10,C14,C18,C22
	SSD3: C3,C7,C11,C15,C19,C23
	SSD4: C4,C8,C12,C16,C20,C24
The current raid0 code calls generic_make_request() 24 times, once for
each split bio. But it is possible to calculate the final layout of
every split bio, so we can combine all of them into four per-SSD large
bios, like this,
	bio1 (on SSD1): C{1,5,9,13,17,21}
	bio2 (on SSD2): C{2,6,10,14,18,22}
	bio3 (on SSD3): C{3,7,11,15,19,23}
	bio4 (on SSD4): C{4,8,12,16,20,24}
Now we only need to call generic_make_request() 4 times.
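
The calculation behind this combining is plain round-robin striping. The
sketch below is only a userspace illustration of the idea (chunk-aligned
bio, one strip zone); the attached patch also has to handle multiple
zones and an unaligned head or tail:

/* Userspace sketch of the per-device combining for the 24-chunk example
 * above: plain round-robin striping, one strip zone, chunk-aligned bio.
 */
#include <stdio.h>

int main(void)
{
	const int nr_devs = 4;
	const int nr_chunks = 24;	/* discard covers chunks C1..C24 */
	const int chunk_sects = 1024;	/* 512 KiB chunks */

	for (int dev = 0; dev < nr_devs; dev++) {
		/* chunks C(dev+1), C(dev+1+nr_devs), ... land on this device,
		 * and they are contiguous there, so one bio covers them all
		 */
		int chunks_on_dev = (nr_chunks - dev + nr_devs - 1) / nr_devs;

		printf("SSD%d: one bio of %d chunks (%d sectors)\n",
		       dev + 1, chunks_on_dev, chunks_on_dev * chunk_sects);
	}
	return 0;
}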

The code is not simple; I need more time to write text explaining how it
works. For now, please treat it as a proof of concept.

Changelogs
v1, Initial prototype.
v2, Major changes include,
    - Rename functions: handle_discard_bio() now takes care of
      chunk-sized DISCARD bios and the single-disk situation; large
      DISCARD bios are handled in handle_large_discard_bio().
    - Set max_discard_sectors to the raid0 device size.
    - Fix several bugs found in basic testing.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/raid0.c | 267 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 266 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 258986a..c7afe0c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -378,7 +378,7 @@ static int raid0_run(struct mddev *mddev)
 
 		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
-		blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors);
+		blk_queue_max_discard_sectors(mddev->queue, raid0_size(mddev, 0, 0));
 
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		blk_queue_io_opt(mddev->queue,
@@ -452,6 +452,266 @@ static inline int is_io_in_chunk_boundary(struct mddev *mddev,
 	}
 }
 
+
+struct bio_record {
+	sector_t	bi_sector;
+	unsigned long	sectors;
+	struct md_rdev	*rdev;
+};
+
+static void handle_large_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct bio_record *recs = NULL;
+	struct bio *split;
+	struct r0conf *conf = mddev->private;
+	sector_t sectors, sector;
+	struct strip_zone *first_zone;
+	int zone_idx;
+	sector_t zone_start, zone_end;
+	int nr_strip_zones = conf->nr_strip_zones;
+	int disks;
+	int first_rdev_idx = -1, rdev_idx;
+	struct md_rdev *first_rdev;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+
+	sector = bio->bi_iter.bi_sector;
+	first_zone = find_zone(conf, &sector);
+	first_rdev = map_sector(mddev, first_zone, sector, &sector);
+
+	/* bio is large enough to be split, allocate recs firstly */
+	disks = mddev->raid_disks;
+	recs = kcalloc(disks, sizeof(struct bio_record), GFP_NOIO);
+	if (recs == NULL) {
+		printk(KERN_ERR "md/raid0:%s: failed to allocate memory " \
+				"for bio_record", mdname(mddev));
+		bio->bi_error = -ENOMEM;
+		bio_endio(bio);
+		return;
+	}
+
+	zone_idx = first_zone - conf->strip_zone;
+	for (rdev_idx = 0; rdev_idx < first_zone->nb_dev; rdev_idx++) {
+		struct md_rdev *rdev;
+
+		rdev = conf->devlist[zone_idx * disks + rdev_idx];
+		recs[rdev_idx].rdev = rdev;
+		if (rdev == first_rdev)
+			first_rdev_idx = rdev_idx;
+	}
+
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		? (sector & (chunk_sects - 1))
+		: sector_div(sector, chunk_sects));
+	sector = bio->bi_iter.bi_sector;
+
+	recs[first_rdev_idx].bi_sector = sector + first_zone->dev_start;
+	recs[first_rdev_idx].sectors = sectors;
+
+	/* recs[first_rdev_idx] is initialized with 'sectors'; now we need to
+	 * handle the remaining sectors, which are also stored in 'sectors'.
+	 */
+	sectors = bio_sectors(bio) - sectors;
+
+	/* The bio may not be chunk-size aligned, so the split bio on the
+	 * first rdev may not be chunk-size aligned either. But the remaining
+	 * split bios on the remaining rdevs must be chunk-size aligned,
+	 * starting on a round-down chunk boundary.
+	 */
+	zone_end = first_zone->zone_end;
+	rdev_idx = first_rdev_idx + 1;
+	sector = likely(is_power_of_2(chunk_sects))
+		 ? sector & (~(chunk_sects - 1))
+		 : chunk_sects * (sector/chunk_sects);
+
+	while (rdev_idx < first_zone->nb_dev) {
+		if (recs[rdev_idx].sectors == 0) {
+			recs[rdev_idx].bi_sector = sector + first_zone->dev_start;
+			if (sectors <= chunk_sects) {
+				recs[rdev_idx].sectors = sectors;
+				goto issue;
+			}
+			recs[rdev_idx].sectors = chunk_sects;
+			sectors -= chunk_sects;
+		}
+		rdev_idx++;
+	}
+
+	sector += chunk_sects;
+	zone_start = sector + first_zone->dev_start;
+	if (zone_start == zone_end) {
+		zone_idx++;
+		if (zone_idx == nr_strip_zones) {
+			if (sectors != 0)
+				printk(KERN_INFO "bio size exceeds raid0 " \
+					"capability, ignore extra " \
+					"TRIM/DISCARD range.\n");
+			goto issue;
+		}
+		zone_start = conf->strip_zone[zone_idx].dev_start;
+	}
+
+	while (zone_idx < nr_strip_zones) {
+		int rdevs_in_zone = conf->strip_zone[zone_idx].nb_dev;
+		int chunks_per_rdev, rested_chunks, rested_sectors;
+		sector_t zone_sectors, grow_sectors;
+		int add_rested_sectors = 0;
+
+		zone_end = conf->strip_zone[zone_idx].zone_end;
+		zone_sectors = zone_end - zone_start;
+		chunks_per_rdev = sectors;
+		rested_sectors =
+			sector_div(chunks_per_rdev, chunk_sects * rdevs_in_zone);
+		rested_chunks = rested_sectors;
+		rested_sectors = sector_div(rested_chunks, chunk_sects);
+
+		if ((chunks_per_rdev * chunk_sects) > zone_sectors)
+			chunks_per_rdev = zone_sectors/chunk_sects;
+
+		/* rested_chunks and rested_sectors belong to the next zone;
+		 * we won't handle them in this zone, so set them to 0.
+		 */
+		if ((chunks_per_rdev * chunk_sects) == zone_sectors &&
+		    (rested_chunks != 0 || rested_sectors != 0)) {
+			if (rested_chunks != 0)
+				rested_chunks = 0;
+			if (rested_sectors != 0)
+				rested_sectors = 0;
+		}
+
+		if (rested_chunks == 0 && rested_sectors != 0)
+			add_rested_sectors ++;
+
+		for (rdev_idx = 0; rdev_idx < rdevs_in_zone; rdev_idx++) {
+			/* If .sectors is not initialized (== 0), then
+			 * .bi_sector is not initialized either. Initialize
+			 * .bi_sector first, then grow .sectors by
+			 * grow_sectors.
+			 */
+			if (recs[rdev_idx].sectors == 0)
+				recs[rdev_idx].bi_sector = zone_start;
+			grow_sectors = chunks_per_rdev * chunk_sects;
+			if (rested_chunks) {
+				grow_sectors += chunk_sects;
+				rested_chunks--;
+				if (rested_chunks == 0 &&
+				    rested_sectors != 0) {
+					recs[rdev_idx].sectors += grow_sectors;
+					sectors -= grow_sectors;
+					add_rested_sectors ++;
+					continue;
+				}
+			}
+
+			/* if add_rested_sectors != 0, it indicates
+			 * rested_sectors != 0
+			 */
+			if (add_rested_sectors == 1) {
+				grow_sectors += rested_sectors;
+				add_rested_sectors ++;
+			}
+			recs[rdev_idx].sectors += grow_sectors;
+			sectors -= grow_sectors;
+			if (sectors == 0)
+				break;
+		}
+
+		if (sectors == 0)
+			break;
+		zone_start = zone_end;
+		zone_idx++;
+		if (zone_idx < nr_strip_zones)
+			BUG_ON(zone_start != conf->strip_zone[zone_idx].dev_start);
+	}
+
+
+issue:
+	/* recs now contains the re-ordered request layout, so we can
+	 * chain split bios from it.
+	 */
+	for (rdev_idx = 0; rdev_idx < disks; rdev_idx++) {
+		if (rdev_idx == first_rdev_idx ||
+		    recs[rdev_idx].sectors == 0)
+			continue;
+		split = bio_split(bio,
+				  recs[rdev_idx].sectors,
+				  GFP_NOIO,
+				  fs_bio_set);
+		if (split == NULL)
+			break;
+		bio_chain(split, bio);
+		BUG_ON(split->bi_iter.bi_size != recs[rdev_idx].sectors << 9);
+		split->bi_bdev = recs[rdev_idx].rdev->bdev;
+		split->bi_iter.bi_sector = recs[rdev_idx].bi_sector +
+				recs[rdev_idx].rdev->data_offset;
+
+		if (unlikely(!blk_queue_discard(
+				bdev_get_queue(split->bi_bdev))))
+			/* Just ignore it */
+			bio_endio(split);
+		else
+			generic_make_request(split);
+	}
+	BUG_ON(bio->bi_iter.bi_size != recs[first_rdev_idx].sectors << 9);
+	bio->bi_iter.bi_sector = recs[first_rdev_idx].bi_sector +
+					recs[first_rdev_idx].rdev->data_offset;
+	bio->bi_bdev = recs[first_rdev_idx].rdev->bdev;
+
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	kfree(recs);
+}
+
+static void handle_discard_bio(struct mddev *mddev, struct bio *bio)
+{
+	struct r0conf *conf = mddev->private;
+	unsigned int chunk_sects = mddev->chunk_sectors;
+	sector_t sector, sectors;
+	struct md_rdev *rdev;
+	struct strip_zone *zone;
+
+	sector = bio->bi_iter.bi_sector;
+	zone = find_zone(conf, &sector);
+	rdev = map_sector(mddev, zone, sector, &sector);
+	bio->bi_bdev = rdev->bdev;
+	sectors = chunk_sects -
+		(likely(is_power_of_2(chunk_sects))
+		 ? (sector & (chunk_sects - 1))
+		 : sector_div(sector, chunk_sects));
+
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		bio->bi_iter.bi_sector = sector + zone->dev_start +
+					 rdev->data_offset;
+		goto single_bio;
+	}
+
+	if (unlikely(zone->nb_dev == 1)) {
+		sectors = conf->strip_zone[0].zone_end -
+			  sector;
+		if (bio_sectors(bio) > sectors)
+			bio->bi_iter.bi_size = sectors << 9;
+		bio->bi_iter.bi_sector = sector + rdev->data_offset;
+		goto single_bio;
+	}
+
+	handle_large_discard_bio(mddev, bio);
+	return;
+
+single_bio:
+	if (unlikely(!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		/* Just ignore it */
+		bio_endio(bio);
+	else
+		generic_make_request(bio);
+
+	return;
+}
+
+
 static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
@@ -463,6 +723,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+		handle_discard_bio(mddev, bio);
+		return;
+	}
+
 	do {
 		sector_t sector = bio->bi_iter.bi_sector;
 		unsigned chunk_sects = mddev->chunk_sectors;

Thread overview: 31+ messages
2016-11-27 15:24 raid0 vs. mkfs Avi Kivity
2016-11-27 17:09 ` Coly Li
2016-11-27 17:25   ` Avi Kivity
2016-11-27 19:25     ` Doug Dumitru
2016-11-28  4:11 ` Chris Murphy
2016-11-28  7:28   ` Avi Kivity
2016-11-28  7:33     ` Avi Kivity
2016-11-28  5:09 ` NeilBrown
2016-11-28  6:08   ` Shaohua Li
2016-11-28  7:38   ` Avi Kivity
2016-11-28  8:40     ` NeilBrown
2016-11-28  8:58       ` Avi Kivity
2016-11-28  9:00         ` Christoph Hellwig
2016-11-28  9:11           ` Avi Kivity
2016-11-28  9:15             ` Coly Li
2016-11-28 17:47             ` Shaohua Li
2016-11-29 21:14         ` NeilBrown
2016-11-29 22:45           ` Avi Kivity
2016-12-07  5:08             ` Mike Snitzer
2016-12-07 11:50             ` Coly Li
2016-12-07 12:03               ` Coly Li
2016-12-07 16:59               ` Shaohua Li
2016-12-08 16:44                 ` Coly Li
2016-12-08 19:19                   ` Shaohua Li
2016-12-09  7:34                     ` Coly Li
2016-12-12  3:17                       ` NeilBrown
2017-06-29 15:15                   ` Avi Kivity
2017-06-29 15:31                     ` Coly Li
2017-06-29 15:36                       ` Avi Kivity
2017-01-22 18:01 ` Avi Kivity
2017-01-23 12:26   ` Coly Li [this message]
