linux-raid.vger.kernel.org archive mirror
* [PATCH V1 0/3] Improve handling raid10 discard request
@ 2020-07-20  4:58 Xiao Ni
  2020-07-20  4:58 ` [PATCH V1 1/3] Move codes related with submitting discard bio into one function Xiao Ni
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Xiao Ni @ 2020-07-20  4:58 UTC (permalink / raw)
  To: song, linux-raid; +Cc: colyli, ncroxon, heinzm

Hi all

Currently, mkfs on a raid10 array built from SSD/NVMe disks takes a long time.
This patch set tries to resolve this problem.

Xiao Ni (3):
  Move codes related with submitting discard bio into one function
  extend r10bio devs to raid disks
  improve raid10 discard request

 drivers/md/md.c     |  22 ++++
 drivers/md/md.h     |   3 +
 drivers/md/raid0.c  |  15 +--
 drivers/md/raid10.c | 286 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 307 insertions(+), 19 deletions(-)

-- 
2.7.5


* [PATCH V1 1/3] Move codes related with submitting discard bio into one function
  2020-07-20  4:58 [PATCH V1 0/3] Improve handling raid10 discard request Xiao Ni
@ 2020-07-20  4:58 ` Xiao Ni
  2020-07-20  5:14   ` Coly Li
  2020-07-20  4:58 ` [PATCH V1 2/3] extend r10bio devs to raid disks Xiao Ni
  2020-07-20  4:58 ` [PATCH V1 3/3] improve raid10 discard request Xiao Ni
  2 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2020-07-20  4:58 UTC (permalink / raw)
  To: song, linux-raid; +Cc: colyli, ncroxon, heinzm

This code can also be used in raid10, so move it into md.c where raid0 and
raid10 can share it.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c    | 22 ++++++++++++++++++++++
 drivers/md/md.h    |  3 +++
 drivers/md/raid0.c | 15 ++-------------
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 07e5b67..2b8f654 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8559,6 +8559,28 @@ void md_write_end(struct mddev *mddev)
 
 EXPORT_SYMBOL(md_write_end);
 
+void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
+				struct bio *bio,
+				sector_t dev_start, sector_t dev_end)
+{
+	struct bio *discard_bio = NULL;
+
+	if (__blkdev_issue_discard(rdev->bdev,
+	    dev_start + rdev->data_offset,
+	    dev_end - dev_start, GFP_NOIO, 0, &discard_bio) ||
+	    !discard_bio)
+		return;
+
+	bio_chain(discard_bio, bio);
+	bio_clone_blkg_association(discard_bio, bio);
+	if (mddev->gendisk)
+		trace_block_bio_remap(bdev_get_queue(rdev->bdev),
+			discard_bio, disk_devt(mddev->gendisk),
+			bio->bi_iter.bi_sector);
+	submit_bio_noacct(discard_bio);
+}
+EXPORT_SYMBOL(md_submit_discard_bio);
+
 /* md_allow_write(mddev)
  * Calling this ensures that the array is marked 'active' so that writes
  * may proceed without blocking.  It is important to call this before
diff --git a/drivers/md/md.h b/drivers/md/md.h
index c26fa8b..83ae77e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -710,6 +710,9 @@ extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_finish_reshape(struct mddev *mddev);
+extern void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
+				struct bio *bio,
+				sector_t dev_start, sector_t dev_end);
 
 extern int mddev_congested(struct mddev *mddev, int bits);
 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e9e91c8..e37fe8a 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -525,7 +525,6 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 
 	for (disk = 0; disk < zone->nb_dev; disk++) {
 		sector_t dev_start, dev_end;
-		struct bio *discard_bio = NULL;
 		struct md_rdev *rdev;
 
 		if (disk < start_disk_index)
@@ -548,18 +547,8 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 
 		rdev = conf->devlist[(zone - conf->strip_zone) *
 			conf->strip_zone[0].nb_dev + disk];
-		if (__blkdev_issue_discard(rdev->bdev,
-			dev_start + zone->dev_start + rdev->data_offset,
-			dev_end - dev_start, GFP_NOIO, 0, &discard_bio) ||
-		    !discard_bio)
-			continue;
-		bio_chain(discard_bio, bio);
-		bio_clone_blkg_association(discard_bio, bio);
-		if (mddev->gendisk)
-			trace_block_bio_remap(bdev_get_queue(rdev->bdev),
-				discard_bio, disk_devt(mddev->gendisk),
-				bio->bi_iter.bi_sector);
-		submit_bio_noacct(discard_bio);
+		dev_start += zone->dev_start;
+		md_submit_discard_bio(mddev, rdev, bio, dev_start, dev_end);
 	}
 	bio_endio(bio);
 }
-- 
2.7.5


* [PATCH V1 2/3] extend r10bio devs to raid disks
  2020-07-20  4:58 [PATCH V1 0/3] Improve handling raid10 discard request Xiao Ni
  2020-07-20  4:58 ` [PATCH V1 1/3] Move codes related with submitting discard bio into one function Xiao Ni
@ 2020-07-20  4:58 ` Xiao Ni
  2020-07-20  5:15   ` Coly Li
  2020-07-20  4:58 ` [PATCH V1 3/3] improve raid10 discard request Xiao Ni
  2 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2020-07-20  4:58 UTC (permalink / raw)
  To: song, linux-raid; +Cc: colyli, ncroxon, heinzm

Now it allocates r10bio->devs[conf->copies]. A discard bio needs to be submitted to
all member disks, and it needs to use r10bio. So extend the array to
r10bio->devs[geo.raid_disks].
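
As a side note (not part of the patch), the allocation relies on sizing the
trailing devs[] array with offsetof(), so this change only grows the allocation
size. A stand-alone user-space sketch of that pattern, using a simplified
stand-in struct rather than the real r10bio:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct dev_slot { void *bio; void *repl_bio; int devnum; };

struct demo_r10bio {
	int remaining;
	struct dev_slot devs[];		/* trailing flexible array */
};

int main(void)
{
	int raid_disks = 5;
	/* Same idiom as r10bio_pool_alloc: offsetof() with an array index
	 * sizes the struct plus raid_disks trailing entries.  GCC accepts
	 * a non-constant index here, which the kernel code also relies on.
	 */
	size_t size = offsetof(struct demo_r10bio, devs[raid_disks]);
	struct demo_r10bio *r10_bio = calloc(1, size);

	if (!r10_bio)
		return 1;
	printf("allocated %zu bytes for %d dev slots\n", size, raid_disks);
	free(r10_bio);
	return 0;
}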

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/raid10.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e45fd56..2a7423e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -91,7 +91,7 @@ static inline struct r10bio *get_resync_r10bio(struct bio *bio)
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct r10conf *conf = data;
-	int size = offsetof(struct r10bio, devs[conf->copies]);
+	int size = offsetof(struct r10bio, devs[conf->geo.raid_disks]);
 
 	/* allocate a r10bio with room for raid_disks entries in the
 	 * bios array */
@@ -238,7 +238,7 @@ static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio)
 {
 	int i;
 
-	for (i = 0; i < conf->copies; i++) {
+	for (i = 0; i < conf->geo.raid_disks; i++) {
 		struct bio **bio = & r10_bio->devs[i].bio;
 		if (!BIO_SPECIAL(*bio))
 			bio_put(*bio);
@@ -327,7 +327,7 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
 	int slot;
 	int repl = 0;
 
-	for (slot = 0; slot < conf->copies; slot++) {
+	for (slot = 0; slot < conf->geo.raid_disks; slot++) {
 		if (r10_bio->devs[slot].bio == bio)
 			break;
 		if (r10_bio->devs[slot].repl_bio == bio) {
@@ -336,7 +336,7 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
 		}
 	}
 
-	BUG_ON(slot == conf->copies);
+	BUG_ON(slot == conf->geo.raid_disks);
 	update_head_pos(slot, r10_bio);
 
 	if (slotp)
@@ -1510,7 +1510,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	r10_bio->mddev = mddev;
 	r10_bio->sector = bio->bi_iter.bi_sector;
 	r10_bio->state = 0;
-	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->copies);
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->geo.raid_disks);
 
 	if (bio_data_dir(bio) == READ)
 		raid10_read_request(mddev, bio, r10_bio);
-- 
2.7.5


* [PATCH V1 3/3] improve raid10 discard request
  2020-07-20  4:58 [PATCH V1 0/3] Improve handling raid10 discard request Xiao Ni
  2020-07-20  4:58 ` [PATCH V1 1/3] Move codes related with submitting discard bio into one function Xiao Ni
  2020-07-20  4:58 ` [PATCH V1 2/3] extend r10bio devs to raid disks Xiao Ni
@ 2020-07-20  4:58 ` Xiao Ni
  2020-07-20  5:26   ` Coly Li
  2 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2020-07-20  4:58 UTC (permalink / raw)
  To: song, linux-raid; +Cc: colyli, ncroxon, heinzm

Currently the discard request is split by chunk size, so it takes a long time to
finish mkfs on disks which support the discard function. This patch improves
handling of raid10 discard requests. It uses a similar approach to raid0
(commit 29efc390b).

But it's a little more complex than raid0, because raid10 has different layouts.
If raid10 uses the offset layout and the discard request is smaller than the
stripe size, there are some holes when we submit discard bios to the underlying
disks.

For example: five disks (disk1 - disk5)
D01 D02 D03 D04 D05
D05 D01 D02 D03 D04
D06 D07 D08 D09 D10
D10 D06 D07 D08 D09
The discard bio just wants to discard from D03 to D10. For disk3, there is a
hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03
is one chunk, and raid10_write_request can handle one chunk perfectly. So the
part that is not aligned with the stripe size is still handled by
raid10_write_request.

If a reshape is running when the discard bio arrives and the discard bio spans
the reshape position, raid10_write_request is responsible for handling this
discard bio.
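
To make the address calculation concrete, here is a minimal user-space sketch
(not part of the patch) of how a discard range on the array maps to a
[dev_start, dev_end) range on each member disk. It mirrors the near-layout
arithmetic used by the patch and ignores far_offset and replacement handling;
the sample geometry (4 disks, 2 near copies, 512KiB chunks) is only an
assumption for illustration:

#include <stdio.h>

typedef unsigned long long sector_t;

int main(void)
{
	int raid_disks = 4, near_copies = 2, chunk_shift = 10;
	sector_t chunk_sectors = 1ULL << chunk_shift;	/* 512 KiB chunks */
	sector_t chunk_mask = chunk_sectors - 1;
	sector_t bio_start = 3 * chunk_sectors;		/* sample array range */
	sector_t bio_end = 40 * chunk_sectors;
	sector_t chunk, first_stripe_index, last_stripe_index;
	sector_t start_disk_offset, end_disk_offset;
	int start_disk_index, end_disk_index, disk;

	/* Which stripe row and which disk does the first sector land on? */
	chunk = (bio_start >> chunk_shift) * near_copies;
	first_stripe_index = chunk / raid_disks;
	start_disk_index = chunk % raid_disks;
	start_disk_offset = (bio_start & chunk_mask) +
			    (first_stripe_index << chunk_shift);

	chunk = (bio_end >> chunk_shift) * near_copies;
	last_stripe_index = chunk / raid_disks;
	end_disk_index = chunk % raid_disks;
	end_disk_offset = (bio_end & chunk_mask) +
			  (last_stripe_index << chunk_shift);

	for (disk = 0; disk < raid_disks; disk++) {
		sector_t dev_start, dev_end;

		/* Disks before the start disk begin at the next stripe row,
		 * later disks at the current row, and the start disk itself
		 * at the exact offset; likewise for the end of the range.
		 */
		if (disk < start_disk_index)
			dev_start = (first_stripe_index + 1) * chunk_sectors;
		else if (disk > start_disk_index)
			dev_start = first_stripe_index * chunk_sectors;
		else
			dev_start = start_disk_offset;

		if (disk < end_disk_index)
			dev_end = (last_stripe_index + 1) * chunk_sectors;
		else if (disk > end_disk_index)
			dev_end = last_stripe_index * chunk_sectors;
		else
			dev_end = end_disk_offset;

		printf("disk %d: discard sectors [%llu, %llu)\n",
		       disk, dev_start, dev_end);
	}
	return 0;
}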

I did a test with this patch set.
Without patch:
time mkfs.xfs /dev/md0
real    4m39.775s
user    0m0.000s
sys     0m0.298s

With patch:
time mkfs.xfs /dev/md0
real    0m0.105s
user    0m0.000s
sys     0m0.007s

nvme3n1           259:1    0   477G  0 disk
└─nvme3n1p1       259:10   0    50G  0 part
nvme4n1           259:2    0   477G  0 disk
└─nvme4n1p1       259:11   0    50G  0 part
nvme5n1           259:6    0   477G  0 disk
└─nvme5n1p1       259:12   0    50G  0 part
nvme2n1           259:9    0   477G  0 disk
└─nvme2n1p1       259:15   0    50G  0 part
nvme0n1           259:13   0   477G  0 disk
└─nvme0n1p1       259:14   0    50G  0 part

v1:
Coly helped to review these patches and gave some suggestions:
One bug was found: if a discard bio crosses one stripe but the bio size is
bigger than one stripe size, the bio will be NULL after splitting. In this
version, it checks whether the discard bio size is bigger than 2*stripe_size.
In raid10_end_discard_request, it's better to check whether R10BIO_Uptodate is
already set. This avoids a memory write and improves performance.
Added more comments for calculating addresses.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/raid10.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 275 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2a7423e..9568c23 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1518,6 +1518,271 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 		raid10_write_request(mddev, bio, r10_bio);
 }
 
+static void wait_blocked_dev(struct mddev *mddev, int cnt)
+{
+	int i;
+	struct r10conf *conf = mddev->private;
+	struct md_rdev *blocked_rdev = NULL;
+
+retry_wait:
+	rcu_read_lock();
+	for (i = 0; i < cnt; i++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[i].replacement);
+		if (rdev == rrdev)
+			rrdev = NULL;
+		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
+			atomic_inc(&rdev->nr_pending);
+			blocked_rdev = rdev;
+			break;
+		}
+		if (rrdev && unlikely(test_bit(Blocked, &rrdev->flags))) {
+			atomic_inc(&rrdev->nr_pending);
+			blocked_rdev = rrdev;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	if (unlikely(blocked_rdev)) {
+		/* Have to wait for this device to get unblocked, then retry */
+		allow_barrier(conf);
+		raid10_log(conf->mddev, "%s wait rdev %d blocked", __func__, blocked_rdev->raid_disk);
+		md_wait_for_blocked_rdev(blocked_rdev, mddev);
+		wait_barrier(conf);
+		goto retry_wait;
+	}
+}
+
+static struct bio *raid10_split_bio(struct r10conf *conf,
+			struct bio *bio, sector_t sectors, bool want_first)
+{
+	struct bio *split;
+
+	split = bio_split(bio, sectors,	GFP_NOIO, &conf->bio_split);
+	bio_chain(split, bio);
+	allow_barrier(conf);
+	if (want_first) {
+		submit_bio_noacct(bio);
+		bio = split;
+	} else
+		submit_bio_noacct(split);
+	wait_barrier(conf);
+
+	return bio;
+}
+
+static void raid10_end_discard_request(struct bio *bio)
+{
+	struct r10bio *r10_bio = bio->bi_private;
+        struct r10conf *conf = r10_bio->mddev->private;
+	struct md_rdev *rdev = NULL;
+        int dev;
+        int slot, repl;
+
+	/*
+	 * We don't care the return value of discard bio
+	 */
+	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+		set_bit(R10BIO_Uptodate, &r10_bio->state);
+
+	dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
+	if (repl)
+		rdev = conf->mirrors[dev].replacement;
+	if (!rdev) {
+		smp_rmb();
+		repl = 0;
+		rdev = conf->mirrors[dev].rdev;
+	}
+
+	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		md_write_end(r10_bio->mddev);
+		raid_end_bio_io(r10_bio);
+	}
+
+	rdev_dec_pending(rdev, conf->mddev);
+}
+
+static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
+{
+	struct r10conf *conf = mddev->private;
+	struct geom geo = conf->geo;
+	struct r10bio *r10_bio;
+
+	int disk;
+	sector_t chunk;
+	int stripe_size, stripe_mask;
+
+	sector_t bio_start, bio_end;
+	sector_t first_stripe_index, last_stripe_index;
+	sector_t start_disk_offset;
+	unsigned int start_disk_index;
+	sector_t end_disk_offset;
+	unsigned int end_disk_index;
+
+	wait_barrier(conf);
+
+	if (conf->reshape_progress != MaxSector &&
+	    ((bio->bi_iter.bi_sector >= conf->reshape_progress) !=
+	     conf->mddev->reshape_backwards))
+		geo = conf->prev;
+
+	stripe_size = (1<<geo.chunk_shift) * geo.raid_disks;
+	stripe_mask = stripe_size - 1;
+
+	/* Maybe one discard bio is smaller than strip size or across one stripe
+	 * and discard region is larger than one stripe size. For far offset layout,
+	 * if the discard region is not aligned with stripe size, there is hole
+	 * when we submit discard bio to member disk. For simplicity, we only
+	 * handle discard bio which discard region is bigger than stripe_size*2
+	 */
+	if (bio_sectors(bio) < stripe_size*2)
+		goto out;
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+	     bio->bi_iter.bi_sector < conf->reshape_progress &&
+	     bio->bi_iter.bi_sector + bio_sectors(bio) > conf->reshape_progress)
+		goto out;
+
+	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
+	r10_bio->mddev = mddev;
+	r10_bio->state = 0;
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->geo.raid_disks);
+
+	wait_blocked_dev(mddev, geo.raid_disks);
+
+	/* For far offset layout, if bio is not aligned with stripe size, it splits
+	 * the part that is not aligned with strip size.
+	 */
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+	if (geo.far_offset && (bio_start & stripe_mask)) {
+		sector_t split_size;
+		split_size = round_up(bio_start, stripe_size) - bio_start;
+		bio = raid10_split_bio(conf, bio, split_size, false);
+	}
+	if (geo.far_offset && ((bio_end & stripe_mask) != stripe_mask)) {
+		sector_t split_size;
+		split_size = bio_end - (bio_end & stripe_mask);
+		bio = raid10_split_bio(conf, bio, split_size, true);
+	}
+	r10_bio->master_bio = bio;
+
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+
+	/* raid10 uses chunk as the unit to store data. It's similar like raid0.
+	 * One stripe contains the chunks from all member disk (one chunk from
+	 * one disk at the same HAB address). For layout detail, see 'man md 4'
+	 */
+	chunk = bio_start >> geo.chunk_shift;
+	chunk *= geo.near_copies;
+	first_stripe_index = chunk;
+	start_disk_index = sector_div(first_stripe_index, geo.raid_disks);
+	if (geo.far_offset)
+		first_stripe_index *= geo.far_copies;
+	start_disk_offset = (bio_start & geo.chunk_mask) +
+				(first_stripe_index << geo.chunk_shift);
+
+	chunk = bio_end >> geo.chunk_shift;
+	chunk *= geo.near_copies;
+	last_stripe_index = chunk;
+	end_disk_index = sector_div(last_stripe_index, geo.raid_disks);
+	if (geo.far_offset)
+		last_stripe_index *= geo.far_copies;
+	end_disk_offset = (bio_end & geo.chunk_mask) +
+				(last_stripe_index << geo.chunk_shift);
+
+	rcu_read_lock();
+	for (disk = 0; disk < geo.raid_disks; disk++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[disk].replacement);
+
+		r10_bio->devs[disk].bio = NULL;
+		r10_bio->devs[disk].repl_bio = NULL;
+
+		if (rdev && (test_bit(Faulty, &rdev->flags)))
+			rdev = NULL;
+		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
+			rrdev = NULL;
+		if (!rdev && !rrdev)
+			continue;
+
+		if (rdev) {
+			r10_bio->devs[disk].bio = bio;
+			atomic_inc(&rdev->nr_pending);
+		}
+		if (rrdev) {
+			r10_bio->devs[disk].repl_bio = bio;
+			atomic_inc(&rrdev->nr_pending);
+		}
+	}
+	rcu_read_unlock();
+
+	atomic_set(&r10_bio->remaining, 1);
+	for (disk = 0; disk < geo.raid_disks; disk++) {
+		sector_t dev_start, dev_end;
+		struct bio *mbio, *rbio = NULL;
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[disk].replacement);
+
+		/*
+		 * Now start to calculate the start and end address for each disk.
+		 * The space between dev_start and dev_end is the discard region.
+		 */
+		if (disk < start_disk_index)
+			dev_start = (first_stripe_index + 1) * mddev->chunk_sectors;
+		else if (disk > start_disk_index)
+			dev_start = first_stripe_index * mddev->chunk_sectors;
+		else
+			dev_start = start_disk_offset;
+
+		if (disk < end_disk_index)
+			dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
+		else if (disk > end_disk_index)
+			dev_end = last_stripe_index * mddev->chunk_sectors;
+		else
+			dev_end = end_disk_offset;
+
+		/* It only handles discard bio which size is >= stripe size, so
+		 * dev_end > dev_start all the time
+		 */
+		if (r10_bio->devs[disk].bio) {
+			mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
+			mbio->bi_end_io = raid10_end_discard_request;
+			mbio->bi_private = r10_bio;
+			r10_bio->devs[disk].bio = mbio;
+			r10_bio->devs[disk].devnum = disk;
+			atomic_inc(&r10_bio->remaining);
+			md_submit_discard_bio(mddev, rdev, mbio, dev_start, dev_end);
+			bio_endio(mbio);
+		}
+		if (r10_bio->devs[disk].repl_bio) {
+			rbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
+			rbio->bi_end_io = raid10_end_discard_request;
+			rbio->bi_private = r10_bio;
+			r10_bio->devs[disk].repl_bio = rbio;
+			r10_bio->devs[disk].devnum = disk;
+			atomic_inc(&r10_bio->remaining);
+			md_submit_discard_bio(mddev, rrdev, rbio, dev_start, dev_end);
+			bio_endio(rbio);
+		}
+	}
+
+	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		md_write_end(r10_bio->mddev);
+		raid_end_bio_io(r10_bio);
+	}
+
+	return 0;
+out:
+	allow_barrier(conf);
+	return -EAGAIN;
+}
+
 static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
@@ -1532,6 +1797,15 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 	if (!md_write_start(mddev, bio))
 		return false;
 
+	/* There are some limitations to handle discard bio
+	 * 1st, the discard size is bigger than stripe size.
+	 * 2st, if the discard bio spans reshape progress, we use the old way to
+	 * handle discard bio
+	 */
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
+		if (!raid10_handle_discard(mddev, bio))
+			return true;
+
 	/*
 	 * If this request crosses a chunk boundary, we need to split
 	 * it.
@@ -3762,7 +4036,7 @@ static int raid10_run(struct mddev *mddev)
 	chunk_size = mddev->chunk_sectors << 9;
 	if (mddev->queue) {
 		blk_queue_max_discard_sectors(mddev->queue,
-					      mddev->chunk_sectors);
+					      UINT_MAX);
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, chunk_size);
-- 
2.7.5


* Re: [PATCH V1 1/3] Move codes related with submitting discard bio into one function
  2020-07-20  4:58 ` [PATCH V1 1/3] Move codes related with submitting discard bio into one function Xiao Ni
@ 2020-07-20  5:14   ` Coly Li
  0 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2020-07-20  5:14 UTC (permalink / raw)
  To: Xiao Ni, song, linux-raid; +Cc: ncroxon, heinzm

On 2020/7/20 12:58, Xiao Ni wrote:
> This code can also be used in raid10, so move it into md.c where raid0 and
> raid10 can share it.
> 
> Signed-off-by: Xiao Ni <xni@redhat.com>

It looks good to me.

Reviewed-by: Coly Li <colyli@suse.de>




> ---
>  drivers/md/md.c    | 22 ++++++++++++++++++++++
>  drivers/md/md.h    |  3 +++
>  drivers/md/raid0.c | 15 ++-------------
>  3 files changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 07e5b67..2b8f654 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8559,6 +8559,28 @@ void md_write_end(struct mddev *mddev)
>  
>  EXPORT_SYMBOL(md_write_end);
>  
> +void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
> +				struct bio *bio,
> +				sector_t dev_start, sector_t dev_end)
> +{
> +	struct bio *discard_bio = NULL;
> +
> +	if (__blkdev_issue_discard(rdev->bdev,
> +	    dev_start + rdev->data_offset,
> +	    dev_end - dev_start, GFP_NOIO, 0, &discard_bio) ||
> +	    !discard_bio)
> +		return;
> +
> +	bio_chain(discard_bio, bio);
> +	bio_clone_blkg_association(discard_bio, bio);
> +	if (mddev->gendisk)
> +		trace_block_bio_remap(bdev_get_queue(rdev->bdev),
> +			discard_bio, disk_devt(mddev->gendisk),
> +			bio->bi_iter.bi_sector);
> +	submit_bio_noacct(discard_bio);
> +}
> +EXPORT_SYMBOL(md_submit_discard_bio);
> +
>  /* md_allow_write(mddev)
>   * Calling this ensures that the array is marked 'active' so that writes
>   * may proceed without blocking.  It is important to call this before
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index c26fa8b..83ae77e 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -710,6 +710,9 @@ extern void md_write_end(struct mddev *mddev);
>  extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>  extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>  extern void md_finish_reshape(struct mddev *mddev);
> +extern void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
> +				struct bio *bio,
> +				sector_t dev_start, sector_t dev_end);
>  
>  extern int mddev_congested(struct mddev *mddev, int bits);
>  extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index e9e91c8..e37fe8a 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -525,7 +525,6 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
>  
>  	for (disk = 0; disk < zone->nb_dev; disk++) {
>  		sector_t dev_start, dev_end;
> -		struct bio *discard_bio = NULL;
>  		struct md_rdev *rdev;
>  
>  		if (disk < start_disk_index)
> @@ -548,18 +547,8 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
>  
>  		rdev = conf->devlist[(zone - conf->strip_zone) *
>  			conf->strip_zone[0].nb_dev + disk];
> -		if (__blkdev_issue_discard(rdev->bdev,
> -			dev_start + zone->dev_start + rdev->data_offset,
> -			dev_end - dev_start, GFP_NOIO, 0, &discard_bio) ||
> -		    !discard_bio)
> -			continue;
> -		bio_chain(discard_bio, bio);
> -		bio_clone_blkg_association(discard_bio, bio);
> -		if (mddev->gendisk)
> -			trace_block_bio_remap(bdev_get_queue(rdev->bdev),
> -				discard_bio, disk_devt(mddev->gendisk),
> -				bio->bi_iter.bi_sector);
> -		submit_bio_noacct(discard_bio);
> +		dev_start += zone->dev_start;
> +		md_submit_discard_bio(mddev, rdev, bio, dev_start, dev_end);
>  	}
>  	bio_endio(bio);
>  }
> 


* Re: [PATCH V1 2/3] extend r10bio devs to raid disks
  2020-07-20  4:58 ` [PATCH V1 2/3] extend r10bio devs to raid disks Xiao Ni
@ 2020-07-20  5:15   ` Coly Li
  0 siblings, 0 replies; 12+ messages in thread
From: Coly Li @ 2020-07-20  5:15 UTC (permalink / raw)
  To: Xiao Ni, song, linux-raid; +Cc: ncroxon, heinzm

On 2020/7/20 12:58, Xiao Ni wrote:
> Now it allocates r10bio->devs[conf->copies]. A discard bio needs to be submitted to
> all member disks, and it needs to use r10bio. So extend the array to
> r10bio->devs[geo.raid_disks].
> 
> Signed-off-by: Xiao Ni <xni@redhat.com>

Reviewed-by: Coly Li <colyli@suse.de>



Coly Li


> ---
>  drivers/md/raid10.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index e45fd56..2a7423e 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -91,7 +91,7 @@ static inline struct r10bio *get_resync_r10bio(struct bio *bio)
>  static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
>  {
>  	struct r10conf *conf = data;
> -	int size = offsetof(struct r10bio, devs[conf->copies]);
> +	int size = offsetof(struct r10bio, devs[conf->geo.raid_disks]);
>  
>  	/* allocate a r10bio with room for raid_disks entries in the
>  	 * bios array */
> @@ -238,7 +238,7 @@ static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio)
>  {
>  	int i;
>  
> -	for (i = 0; i < conf->copies; i++) {
> +	for (i = 0; i < conf->geo.raid_disks; i++) {
>  		struct bio **bio = & r10_bio->devs[i].bio;
>  		if (!BIO_SPECIAL(*bio))
>  			bio_put(*bio);
> @@ -327,7 +327,7 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
>  	int slot;
>  	int repl = 0;
>  
> -	for (slot = 0; slot < conf->copies; slot++) {
> +	for (slot = 0; slot < conf->geo.raid_disks; slot++) {
>  		if (r10_bio->devs[slot].bio == bio)
>  			break;
>  		if (r10_bio->devs[slot].repl_bio == bio) {
> @@ -336,7 +336,7 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
>  		}
>  	}
>  
> -	BUG_ON(slot == conf->copies);
> +	BUG_ON(slot == conf->geo.raid_disks);
>  	update_head_pos(slot, r10_bio);
>  
>  	if (slotp)
> @@ -1510,7 +1510,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
>  	r10_bio->mddev = mddev;
>  	r10_bio->sector = bio->bi_iter.bi_sector;
>  	r10_bio->state = 0;
> -	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->copies);
> +	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->geo.raid_disks);
>  
>  	if (bio_data_dir(bio) == READ)
>  		raid10_read_request(mddev, bio, r10_bio);
> 


* Re: [PATCH V1 3/3] improve raid10 discard request
  2020-07-20  4:58 ` [PATCH V1 3/3] improve raid10 discard request Xiao Ni
@ 2020-07-20  5:26   ` Coly Li
  2020-07-20  5:38     ` Coly Li
  0 siblings, 1 reply; 12+ messages in thread
From: Coly Li @ 2020-07-20  5:26 UTC (permalink / raw)
  To: Xiao Ni, song, linux-raid; +Cc: ncroxon, heinzm

On 2020/7/20 12:58, Xiao Ni wrote:
> Currently the discard request is split by chunk size, so it takes a long time to
> finish mkfs on disks which support the discard function. This patch improves
> handling of raid10 discard requests. It uses a similar approach to raid0
> (commit 29efc390b).
> 
> But it's a little more complex than raid0, because raid10 has different layouts.
> If raid10 uses the offset layout and the discard request is smaller than the
> stripe size, there are some holes when we submit discard bios to the underlying
> disks.
> 
> For example: five disks (disk1 - disk5)
> D01 D02 D03 D04 D05
> D05 D01 D02 D03 D04
> D06 D07 D08 D09 D10
> D10 D06 D07 D08 D09
> The discard bio just wants to discard from D03 to D10. For disk3, there is a
> hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03
> is one chunk, and raid10_write_request can handle one chunk perfectly. So the
> part that is not aligned with the stripe size is still handled by
> raid10_write_request.
> 
> If a reshape is running when the discard bio arrives and the discard bio spans
> the reshape position, raid10_write_request is responsible for handling this
> discard bio.
> 
> I did a test with this patch set.
> Without patch:
> time mkfs.xfs /dev/md0
> real    4m39.775s
> user    0m0.000s
> sys     0m0.298s
> 
> With patch:
> time mkfs.xfs /dev/md0
> real    0m0.105s
> user    0m0.000s
> sys     0m0.007s
> 
> nvme3n1           259:1    0   477G  0 disk
> └─nvme3n1p1       259:10   0    50G  0 part
> nvme4n1           259:2    0   477G  0 disk
> └─nvme4n1p1       259:11   0    50G  0 part
> nvme5n1           259:6    0   477G  0 disk
> └─nvme5n1p1       259:12   0    50G  0 part
> nvme2n1           259:9    0   477G  0 disk
> └─nvme2n1p1       259:15   0    50G  0 part
> nvme0n1           259:13   0   477G  0 disk
> └─nvme0n1p1       259:14   0    50G  0 part
> 
> v1:
> Coly helped to review these patches and gave some suggestions:
> One bug was found: if a discard bio crosses one stripe but the bio size is
> bigger than one stripe size, the bio will be NULL after splitting. In this
> version, it checks whether the discard bio size is bigger than 2*stripe_size.
> In raid10_end_discard_request, it's better to check whether R10BIO_Uptodate is
> already set. This avoids a memory write and improves performance.
> Added more comments for calculating addresses.
> 
> Signed-off-by: Xiao Ni <xni@redhat.com>

The code looks good to me, you may add my Reviewed-by: Coly Li
<colyli@suse.de>

But I still suggest you add a more detailed commit log to explain
how the discard range of each component disk is decided. Anyway, this is
just a suggestion; it is up to you.

Thank you for this work.


Coly Li

> ---
>  drivers/md/raid10.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 275 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 2a7423e..9568c23 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1518,6 +1518,271 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
>  		raid10_write_request(mddev, bio, r10_bio);
>  }
>  
> +static void wait_blocked_dev(struct mddev *mddev, int cnt)
> +{
> +	int i;
> +	struct r10conf *conf = mddev->private;
> +	struct md_rdev *blocked_rdev = NULL;
> +
> +retry_wait:
> +	rcu_read_lock();
> +	for (i = 0; i < cnt; i++) {
> +		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
> +		struct md_rdev *rrdev = rcu_dereference(
> +			conf->mirrors[i].replacement);
> +		if (rdev == rrdev)
> +			rrdev = NULL;
> +		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> +			atomic_inc(&rdev->nr_pending);
> +			blocked_rdev = rdev;
> +			break;
> +		}
> +		if (rrdev && unlikely(test_bit(Blocked, &rrdev->flags))) {
> +			atomic_inc(&rrdev->nr_pending);
> +			blocked_rdev = rrdev;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	if (unlikely(blocked_rdev)) {
> +		/* Have to wait for this device to get unblocked, then retry */
> +		allow_barrier(conf);
> +		raid10_log(conf->mddev, "%s wait rdev %d blocked", __func__, blocked_rdev->raid_disk);
> +		md_wait_for_blocked_rdev(blocked_rdev, mddev);
> +		wait_barrier(conf);
> +		goto retry_wait;
> +	}
> +}
> +
> +static struct bio *raid10_split_bio(struct r10conf *conf,
> +			struct bio *bio, sector_t sectors, bool want_first)
> +{
> +	struct bio *split;
> +
> +	split = bio_split(bio, sectors,	GFP_NOIO, &conf->bio_split);
> +	bio_chain(split, bio);
> +	allow_barrier(conf);
> +	if (want_first) {
> +		submit_bio_noacct(bio);
> +		bio = split;
> +	} else
> +		submit_bio_noacct(split);
> +	wait_barrier(conf);
> +
> +	return bio;
> +}
> +
> +static void raid10_end_discard_request(struct bio *bio)
> +{
> +	struct r10bio *r10_bio = bio->bi_private;
> +        struct r10conf *conf = r10_bio->mddev->private;
> +	struct md_rdev *rdev = NULL;
> +        int dev;
> +        int slot, repl;
> +
> +	/*
> +	 * We don't care the return value of discard bio
> +	 */
> +	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
> +		set_bit(R10BIO_Uptodate, &r10_bio->state);
> +
> +	dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
> +	if (repl)
> +		rdev = conf->mirrors[dev].replacement;
> +	if (!rdev) {
> +		smp_rmb();
> +		repl = 0;
> +		rdev = conf->mirrors[dev].rdev;
> +	}
> +
> +	if (atomic_dec_and_test(&r10_bio->remaining)) {
> +		md_write_end(r10_bio->mddev);
> +		raid_end_bio_io(r10_bio);
> +	}
> +
> +	rdev_dec_pending(rdev, conf->mddev);
> +}
> +
> +static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
> +{
> +	struct r10conf *conf = mddev->private;
> +	struct geom geo = conf->geo;
> +	struct r10bio *r10_bio;
> +
> +	int disk;
> +	sector_t chunk;
> +	int stripe_size, stripe_mask;
> +
> +	sector_t bio_start, bio_end;
> +	sector_t first_stripe_index, last_stripe_index;
> +	sector_t start_disk_offset;
> +	unsigned int start_disk_index;
> +	sector_t end_disk_offset;
> +	unsigned int end_disk_index;
> +
> +	wait_barrier(conf);
> +
> +	if (conf->reshape_progress != MaxSector &&
> +	    ((bio->bi_iter.bi_sector >= conf->reshape_progress) !=
> +	     conf->mddev->reshape_backwards))
> +		geo = conf->prev;
> +
> +	stripe_size = (1<<geo.chunk_shift) * geo.raid_disks;
> +	stripe_mask = stripe_size - 1;
> +
> +	/* Maybe one discard bio is smaller than strip size or across one stripe
> +	 * and discard region is larger than one stripe size. For far offset layout,
> +	 * if the discard region is not aligned with stripe size, there is hole
> +	 * when we submit discard bio to member disk. For simplicity, we only
> +	 * handle discard bio which discard region is bigger than stripe_size*2
> +	 */
> +	if (bio_sectors(bio) < stripe_size*2)
> +		goto out;
> +
> +	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> +	     bio->bi_iter.bi_sector < conf->reshape_progress &&
> +	     bio->bi_iter.bi_sector + bio_sectors(bio) > conf->reshape_progress)
> +		goto out;
> +
> +	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
> +	r10_bio->mddev = mddev;
> +	r10_bio->state = 0;
> +	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->geo.raid_disks);
> +
> +	wait_blocked_dev(mddev, geo.raid_disks);
> +
> +	/* For far offset layout, if bio is not aligned with stripe size, it splits
> +	 * the part that is not aligned with strip size.
> +	 */
> +	bio_start = bio->bi_iter.bi_sector;
> +	bio_end = bio_end_sector(bio);
> +	if (geo.far_offset && (bio_start & stripe_mask)) {
> +		sector_t split_size;
> +		split_size = round_up(bio_start, stripe_size) - bio_start;
> +		bio = raid10_split_bio(conf, bio, split_size, false);
> +	}
> +	if (geo.far_offset && ((bio_end & stripe_mask) != stripe_mask)) {
> +		sector_t split_size;
> +		split_size = bio_end - (bio_end & stripe_mask);
> +		bio = raid10_split_bio(conf, bio, split_size, true);
> +	}
> +	r10_bio->master_bio = bio;
> +
> +	bio_start = bio->bi_iter.bi_sector;
> +	bio_end = bio_end_sector(bio);
> +
> +	/* raid10 uses chunk as the unit to store data. It's similar like raid0.
> +	 * One stripe contains the chunks from all member disk (one chunk from
> +	 * one disk at the same HAB address). For layout detail, see 'man md 4'
> +	 */
> +	chunk = bio_start >> geo.chunk_shift;
> +	chunk *= geo.near_copies;
> +	first_stripe_index = chunk;
> +	start_disk_index = sector_div(first_stripe_index, geo.raid_disks);
> +	if (geo.far_offset)
> +		first_stripe_index *= geo.far_copies;
> +	start_disk_offset = (bio_start & geo.chunk_mask) +
> +				(first_stripe_index << geo.chunk_shift);
> +
> +	chunk = bio_end >> geo.chunk_shift;
> +	chunk *= geo.near_copies;
> +	last_stripe_index = chunk;
> +	end_disk_index = sector_div(last_stripe_index, geo.raid_disks);
> +	if (geo.far_offset)
> +		last_stripe_index *= geo.far_copies;
> +	end_disk_offset = (bio_end & geo.chunk_mask) +
> +				(last_stripe_index << geo.chunk_shift);
> +
> +	rcu_read_lock();
> +	for (disk = 0; disk < geo.raid_disks; disk++) {
> +		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
> +		struct md_rdev *rrdev = rcu_dereference(
> +			conf->mirrors[disk].replacement);
> +
> +		r10_bio->devs[disk].bio = NULL;
> +		r10_bio->devs[disk].repl_bio = NULL;
> +
> +		if (rdev && (test_bit(Faulty, &rdev->flags)))
> +			rdev = NULL;
> +		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
> +			rrdev = NULL;
> +		if (!rdev && !rrdev)
> +			continue;
> +
> +		if (rdev) {
> +			r10_bio->devs[disk].bio = bio;
> +			atomic_inc(&rdev->nr_pending);
> +		}
> +		if (rrdev) {
> +			r10_bio->devs[disk].repl_bio = bio;
> +			atomic_inc(&rrdev->nr_pending);
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	atomic_set(&r10_bio->remaining, 1);
> +	for (disk = 0; disk < geo.raid_disks; disk++) {
> +		sector_t dev_start, dev_end;
> +		struct bio *mbio, *rbio = NULL;
> +		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
> +		struct md_rdev *rrdev = rcu_dereference(
> +			conf->mirrors[disk].replacement);
> +
> +		/*
> +		 * Now start to calculate the start and end address for each disk.
> +		 * The space between dev_start and dev_end is the discard region.
> +		 */
> +		if (disk < start_disk_index)
> +			dev_start = (first_stripe_index + 1) * mddev->chunk_sectors;
> +		else if (disk > start_disk_index)
> +			dev_start = first_stripe_index * mddev->chunk_sectors;
> +		else
> +			dev_start = start_disk_offset;
> +
> +		if (disk < end_disk_index)
> +			dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
> +		else if (disk > end_disk_index)
> +			dev_end = last_stripe_index * mddev->chunk_sectors;
> +		else
> +			dev_end = end_disk_offset;
> +
> +		/* It only handles discard bio which size is >= stripe size, so
> +		 * dev_end > dev_start all the time
> +		 */
> +		if (r10_bio->devs[disk].bio) {
> +			mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
> +			mbio->bi_end_io = raid10_end_discard_request;
> +			mbio->bi_private = r10_bio;
> +			r10_bio->devs[disk].bio = mbio;
> +			r10_bio->devs[disk].devnum = disk;
> +			atomic_inc(&r10_bio->remaining);
> +			md_submit_discard_bio(mddev, rdev, mbio, dev_start, dev_end);
> +			bio_endio(mbio);
> +		}
> +		if (r10_bio->devs[disk].repl_bio) {
> +			rbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
> +			rbio->bi_end_io = raid10_end_discard_request;
> +			rbio->bi_private = r10_bio;
> +			r10_bio->devs[disk].repl_bio = rbio;
> +			r10_bio->devs[disk].devnum = disk;
> +			atomic_inc(&r10_bio->remaining);
> +			md_submit_discard_bio(mddev, rrdev, rbio, dev_start, dev_end);
> +			bio_endio(rbio);
> +		}
> +	}
> +
> +	if (atomic_dec_and_test(&r10_bio->remaining)) {
> +		md_write_end(r10_bio->mddev);
> +		raid_end_bio_io(r10_bio);
> +	}
> +
> +	return 0;
> +out:
> +	allow_barrier(conf);
> +	return -EAGAIN;
> +}
> +
>  static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
>  {
>  	struct r10conf *conf = mddev->private;
> @@ -1532,6 +1797,15 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
>  	if (!md_write_start(mddev, bio))
>  		return false;
>  
> +	/* There are some limitations to handle discard bio
> +	 * 1st, the discard size is bigger than stripe size.
> +	 * 2st, if the discard bio spans reshape progress, we use the old way to
> +	 * handle discard bio
> +	 */
> +	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
> +		if (!raid10_handle_discard(mddev, bio))
> +			return true;
> +
>  	/*
>  	 * If this request crosses a chunk boundary, we need to split
>  	 * it.
> @@ -3762,7 +4036,7 @@ static int raid10_run(struct mddev *mddev)
>  	chunk_size = mddev->chunk_sectors << 9;
>  	if (mddev->queue) {
>  		blk_queue_max_discard_sectors(mddev->queue,
> -					      mddev->chunk_sectors);
> +					      UINT_MAX);
>  		blk_queue_max_write_same_sectors(mddev->queue, 0);
>  		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
>  		blk_queue_io_min(mddev->queue, chunk_size);
> 


* Re: [PATCH V1 3/3] improve raid10 discard request
  2020-07-20  5:26   ` Coly Li
@ 2020-07-20  5:38     ` Coly Li
  2020-07-22  0:17       ` Song Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Coly Li @ 2020-07-20  5:38 UTC (permalink / raw)
  To: Xiao Ni; +Cc: song, linux-raid, ncroxon, heinzm

On 2020/7/20 13:26, Coly Li wrote:
> On 2020/7/20 12:58, Xiao Ni wrote:
>> Currently the discard request is split by chunk size, so it takes a long time to
>> finish mkfs on disks which support the discard function. This patch improves
>> handling of raid10 discard requests. It uses a similar approach to raid0
>> (commit 29efc390b).
>>
>> But it's a little more complex than raid0, because raid10 has different layouts.
>> If raid10 uses the offset layout and the discard request is smaller than the
>> stripe size, there are some holes when we submit discard bios to the underlying
>> disks.
>>
>> For example: five disks (disk1 - disk5)
>> D01 D02 D03 D04 D05
>> D05 D01 D02 D03 D04
>> D06 D07 D08 D09 D10
>> D10 D06 D07 D08 D09
>> The discard bio just wants to discard from D03 to D10. For disk3, there is a
>> hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03
>> is one chunk, and raid10_write_request can handle one chunk perfectly. So the
>> part that is not aligned with the stripe size is still handled by
>> raid10_write_request.
>>
>> If a reshape is running when the discard bio arrives and the discard bio spans
>> the reshape position, raid10_write_request is responsible for handling this
>> discard bio.
>>
>> I did a test with this patch set.
>> Without patch:
>> time mkfs.xfs /dev/md0
>> real    4m39.775s
>> user    0m0.000s
>> sys     0m0.298s
>>
>> With patch:
>> time mkfs.xfs /dev/md0
>> real    0m0.105s
>> user    0m0.000s
>> sys     0m0.007s
>>
>> nvme3n1           259:1    0   477G  0 disk
>> └─nvme3n1p1       259:10   0    50G  0 part
>> nvme4n1           259:2    0   477G  0 disk
>> └─nvme4n1p1       259:11   0    50G  0 part
>> nvme5n1           259:6    0   477G  0 disk
>> └─nvme5n1p1       259:12   0    50G  0 part
>> nvme2n1           259:9    0   477G  0 disk
>> └─nvme2n1p1       259:15   0    50G  0 part
>> nvme0n1           259:13   0   477G  0 disk
>> └─nvme0n1p1       259:14   0    50G  0 part
>>
>> v1:
>> Coly helped to review these patches and gave some suggestions:
>> One bug was found: if a discard bio crosses one stripe but the bio size is
>> bigger than one stripe size, the bio will be NULL after splitting. In this
>> version, it checks whether the discard bio size is bigger than 2*stripe_size.
>> In raid10_end_discard_request, it's better to check whether R10BIO_Uptodate is
>> already set. This avoids a memory write and improves performance.
>> Added more comments for calculating addresses.
>>
>> Signed-off-by: Xiao Ni <xni@redhat.com>
> 
> The code looks good to me, you may add my Reviewed-by: Coly Li
> <colyli@suse.de>
> 
> But I still suggest you add a more detailed commit log to explain
> how the discard range of each component disk is decided. Anyway, this is
> just a suggestion; it is up to you.
> 
> Thank you for this work.

In raid10_end_discard_request(), among the first 5 lines of local variable
declarations, there are 3 lines that start with improper blanks (they should
be tab '\t'). You may find this minor issue with checkpatch.pl, I guess.

After fixing this format issue, you still have my Reviewed-by :-)

Coly Li

> 
>> ---
>>  drivers/md/raid10.c | 276 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 275 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 2a7423e..9568c23 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1518,6 +1518,271 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
>>  		raid10_write_request(mddev, bio, r10_bio);
>>  }
>>  
>> +static void wait_blocked_dev(struct mddev *mddev, int cnt)
>> +{
>> +	int i;
>> +	struct r10conf *conf = mddev->private;
>> +	struct md_rdev *blocked_rdev = NULL;
>> +
>> +retry_wait:
>> +	rcu_read_lock();
>> +	for (i = 0; i < cnt; i++) {
>> +		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
>> +		struct md_rdev *rrdev = rcu_dereference(
>> +			conf->mirrors[i].replacement);
>> +		if (rdev == rrdev)
>> +			rrdev = NULL;
>> +		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
>> +			atomic_inc(&rdev->nr_pending);
>> +			blocked_rdev = rdev;
>> +			break;
>> +		}
>> +		if (rrdev && unlikely(test_bit(Blocked, &rrdev->flags))) {
>> +			atomic_inc(&rrdev->nr_pending);
>> +			blocked_rdev = rrdev;
>> +			break;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	if (unlikely(blocked_rdev)) {
>> +		/* Have to wait for this device to get unblocked, then retry */
>> +		allow_barrier(conf);
>> +		raid10_log(conf->mddev, "%s wait rdev %d blocked", __func__, blocked_rdev->raid_disk);
>> +		md_wait_for_blocked_rdev(blocked_rdev, mddev);
>> +		wait_barrier(conf);
>> +		goto retry_wait;
>> +	}
>> +}
>> +
>> +static struct bio *raid10_split_bio(struct r10conf *conf,
>> +			struct bio *bio, sector_t sectors, bool want_first)
>> +{
>> +	struct bio *split;
>> +
>> +	split = bio_split(bio, sectors,	GFP_NOIO, &conf->bio_split);
>> +	bio_chain(split, bio);
>> +	allow_barrier(conf);
>> +	if (want_first) {
>> +		submit_bio_noacct(bio);
>> +		bio = split;
>> +	} else
>> +		submit_bio_noacct(split);
>> +	wait_barrier(conf);
>> +
>> +	return bio;
>> +}
>> +
>> +static void raid10_end_discard_request(struct bio *bio)
>> +{
>> +	struct r10bio *r10_bio = bio->bi_private;
>> +        struct r10conf *conf = r10_bio->mddev->private;
>> +	struct md_rdev *rdev = NULL;
>> +        int dev;
>> +        int slot, repl;
>> +
>> +	/*
>> +	 * We don't care the return value of discard bio
>> +	 */
>> +	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>> +		set_bit(R10BIO_Uptodate, &r10_bio->state);
>> +
>> +	dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
>> +	if (repl)
>> +		rdev = conf->mirrors[dev].replacement;
>> +	if (!rdev) {
>> +		smp_rmb();
>> +		repl = 0;
>> +		rdev = conf->mirrors[dev].rdev;
>> +	}
>> +
>> +	if (atomic_dec_and_test(&r10_bio->remaining)) {
>> +		md_write_end(r10_bio->mddev);
>> +		raid_end_bio_io(r10_bio);
>> +	}
>> +
>> +	rdev_dec_pending(rdev, conf->mddev);
>> +}
>> +
>> +static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
>> +{
>> +	struct r10conf *conf = mddev->private;
>> +	struct geom geo = conf->geo;
>> +	struct r10bio *r10_bio;
>> +
>> +	int disk;
>> +	sector_t chunk;
>> +	int stripe_size, stripe_mask;
>> +
>> +	sector_t bio_start, bio_end;
>> +	sector_t first_stripe_index, last_stripe_index;
>> +	sector_t start_disk_offset;
>> +	unsigned int start_disk_index;
>> +	sector_t end_disk_offset;
>> +	unsigned int end_disk_index;
>> +
>> +	wait_barrier(conf);
>> +
>> +	if (conf->reshape_progress != MaxSector &&
>> +	    ((bio->bi_iter.bi_sector >= conf->reshape_progress) !=
>> +	     conf->mddev->reshape_backwards))
>> +		geo = conf->prev;
>> +
>> +	stripe_size = (1<<geo.chunk_shift) * geo.raid_disks;
>> +	stripe_mask = stripe_size - 1;
>> +
>> +	/* Maybe one discard bio is smaller than strip size or across one stripe
>> +	 * and discard region is larger than one stripe size. For far offset layout,
>> +	 * if the discard region is not aligned with stripe size, there is hole
>> +	 * when we submit discard bio to member disk. For simplicity, we only
>> +	 * handle discard bio which discard region is bigger than stripe_size*2
>> +	 */
>> +	if (bio_sectors(bio) < stripe_size*2)
>> +		goto out;
>> +
>> +	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> +	     bio->bi_iter.bi_sector < conf->reshape_progress &&
>> +	     bio->bi_iter.bi_sector + bio_sectors(bio) > conf->reshape_progress)
>> +		goto out;
>> +
>> +	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
>> +	r10_bio->mddev = mddev;
>> +	r10_bio->state = 0;
>> +	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->geo.raid_disks);
>> +
>> +	wait_blocked_dev(mddev, geo.raid_disks);
>> +
>> +	/* For far offset layout, if bio is not aligned with stripe size, it splits
>> +	 * the part that is not aligned with strip size.
>> +	 */
>> +	bio_start = bio->bi_iter.bi_sector;
>> +	bio_end = bio_end_sector(bio);
>> +	if (geo.far_offset && (bio_start & stripe_mask)) {
>> +		sector_t split_size;
>> +		split_size = round_up(bio_start, stripe_size) - bio_start;
>> +		bio = raid10_split_bio(conf, bio, split_size, false);
>> +	}
>> +	if (geo.far_offset && ((bio_end & stripe_mask) != stripe_mask)) {
>> +		sector_t split_size;
>> +		split_size = bio_end - (bio_end & stripe_mask);
>> +		bio = raid10_split_bio(conf, bio, split_size, true);
>> +	}
>> +	r10_bio->master_bio = bio;
>> +
>> +	bio_start = bio->bi_iter.bi_sector;
>> +	bio_end = bio_end_sector(bio);
>> +
>> +	/* raid10 uses chunk as the unit to store data. It's similar like raid0.
>> +	 * One stripe contains the chunks from all member disk (one chunk from
>> +	 * one disk at the same HAB address). For layout detail, see 'man md 4'
>> +	 */
>> +	chunk = bio_start >> geo.chunk_shift;
>> +	chunk *= geo.near_copies;
>> +	first_stripe_index = chunk;
>> +	start_disk_index = sector_div(first_stripe_index, geo.raid_disks);
>> +	if (geo.far_offset)
>> +		first_stripe_index *= geo.far_copies;
>> +	start_disk_offset = (bio_start & geo.chunk_mask) +
>> +				(first_stripe_index << geo.chunk_shift);
>> +
>> +	chunk = bio_end >> geo.chunk_shift;
>> +	chunk *= geo.near_copies;
>> +	last_stripe_index = chunk;
>> +	end_disk_index = sector_div(last_stripe_index, geo.raid_disks);
>> +	if (geo.far_offset)
>> +		last_stripe_index *= geo.far_copies;
>> +	end_disk_offset = (bio_end & geo.chunk_mask) +
>> +				(last_stripe_index << geo.chunk_shift);
>> +
>> +	rcu_read_lock();
>> +	for (disk = 0; disk < geo.raid_disks; disk++) {
>> +		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
>> +		struct md_rdev *rrdev = rcu_dereference(
>> +			conf->mirrors[disk].replacement);
>> +
>> +		r10_bio->devs[disk].bio = NULL;
>> +		r10_bio->devs[disk].repl_bio = NULL;
>> +
>> +		if (rdev && (test_bit(Faulty, &rdev->flags)))
>> +			rdev = NULL;
>> +		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
>> +			rrdev = NULL;
>> +		if (!rdev && !rrdev)
>> +			continue;
>> +
>> +		if (rdev) {
>> +			r10_bio->devs[disk].bio = bio;
>> +			atomic_inc(&rdev->nr_pending);
>> +		}
>> +		if (rrdev) {
>> +			r10_bio->devs[disk].repl_bio = bio;
>> +			atomic_inc(&rrdev->nr_pending);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	atomic_set(&r10_bio->remaining, 1);
>> +	for (disk = 0; disk < geo.raid_disks; disk++) {
>> +		sector_t dev_start, dev_end;
>> +		struct bio *mbio, *rbio = NULL;
>> +		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
>> +		struct md_rdev *rrdev = rcu_dereference(
>> +			conf->mirrors[disk].replacement);
>> +
>> +		/*
>> +		 * Now start to calculate the start and end address for each disk.
>> +		 * The space between dev_start and dev_end is the discard region.
>> +		 */
>> +		if (disk < start_disk_index)
>> +			dev_start = (first_stripe_index + 1) * mddev->chunk_sectors;
>> +		else if (disk > start_disk_index)
>> +			dev_start = first_stripe_index * mddev->chunk_sectors;
>> +		else
>> +			dev_start = start_disk_offset;
>> +
>> +		if (disk < end_disk_index)
>> +			dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
>> +		else if (disk > end_disk_index)
>> +			dev_end = last_stripe_index * mddev->chunk_sectors;
>> +		else
>> +			dev_end = end_disk_offset;
>> +
>> +		/* It only handles discard bio which size is >= stripe size, so
>> +		 * dev_end > dev_start all the time
>> +		 */
>> +		if (r10_bio->devs[disk].bio) {
>> +			mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
>> +			mbio->bi_end_io = raid10_end_discard_request;
>> +			mbio->bi_private = r10_bio;
>> +			r10_bio->devs[disk].bio = mbio;
>> +			r10_bio->devs[disk].devnum = disk;
>> +			atomic_inc(&r10_bio->remaining);
>> +			md_submit_discard_bio(mddev, rdev, mbio, dev_start, dev_end);
>> +			bio_endio(mbio);
>> +		}
>> +		if (r10_bio->devs[disk].repl_bio) {
>> +			rbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
>> +			rbio->bi_end_io = raid10_end_discard_request;
>> +			rbio->bi_private = r10_bio;
>> +			r10_bio->devs[disk].repl_bio = rbio;
>> +			r10_bio->devs[disk].devnum = disk;
>> +			atomic_inc(&r10_bio->remaining);
>> +			md_submit_discard_bio(mddev, rrdev, rbio, dev_start, dev_end);
>> +			bio_endio(rbio);
>> +		}
>> +	}
>> +
>> +	if (atomic_dec_and_test(&r10_bio->remaining)) {
>> +		md_write_end(r10_bio->mddev);
>> +		raid_end_bio_io(r10_bio);
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	allow_barrier(conf);
>> +	return -EAGAIN;
>> +}
>> +
>>  static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
>>  {
>>  	struct r10conf *conf = mddev->private;
>> @@ -1532,6 +1797,15 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
>>  	if (!md_write_start(mddev, bio))
>>  		return false;
>>  
>> +	/* There are some limitations to handle discard bio
>> +	 * 1st, the discard size is bigger than stripe size.
>> +	 * 2st, if the discard bio spans reshape progress, we use the old way to
>> +	 * handle discard bio
>> +	 */
>> +	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
>> +		if (!raid10_handle_discard(mddev, bio))
>> +			return true;
>> +
>>  	/*
>>  	 * If this request crosses a chunk boundary, we need to split
>>  	 * it.
>> @@ -3762,7 +4036,7 @@ static int raid10_run(struct mddev *mddev)
>>  	chunk_size = mddev->chunk_sectors << 9;
>>  	if (mddev->queue) {
>>  		blk_queue_max_discard_sectors(mddev->queue,
>> -					      mddev->chunk_sectors);
>> +					      UINT_MAX);
>>  		blk_queue_max_write_same_sectors(mddev->queue, 0);
>>  		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
>>  		blk_queue_io_min(mddev->queue, chunk_size);
>>
> 


* Re: [PATCH V1 3/3] improve raid10 discard request
  2020-07-20  5:38     ` Coly Li
@ 2020-07-22  0:17       ` Song Liu
  2020-07-22  2:43         ` Xiao Ni
  0 siblings, 1 reply; 12+ messages in thread
From: Song Liu @ 2020-07-22  0:17 UTC (permalink / raw)
  To: Coly Li; +Cc: Xiao Ni, linux-raid, Nigel Croxon, Heinz Mauelshagen

On Sun, Jul 19, 2020 at 10:38 PM Coly Li <colyli@suse.de> wrote:
>
> On 2020/7/20 13:26, Coly Li wrote:
> > On 2020/7/20 12:58, Xiao Ni wrote:
> >> Currently the discard request is split by chunk size, so it takes a long time to
> >> finish mkfs on disks which support the discard function. This patch improves
> >> handling of raid10 discard requests. It uses a similar approach to raid0
> >> (commit 29efc390b).
> >>
> >> But it's a little more complex than raid0, because raid10 has different layouts.
> >> If raid10 uses the offset layout and the discard request is smaller than the
> >> stripe size, there are some holes when we submit discard bios to the underlying
> >> disks.
> >>
> >> For example: five disks (disk1 - disk5)
> >> D01 D02 D03 D04 D05
> >> D05 D01 D02 D03 D04
> >> D06 D07 D08 D09 D10
> >> D10 D06 D07 D08 D09
> >> The discard bio just wants to discard from D03 to D10. For disk3, there is a
> >> hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03
> >> is one chunk, and raid10_write_request can handle one chunk perfectly. So the
> >> part that is not aligned with the stripe size is still handled by
> >> raid10_write_request.
> >>
> >> If a reshape is running when the discard bio arrives and the discard bio spans
> >> the reshape position, raid10_write_request is responsible for handling this
> >> discard bio.
> >>
> >> I did a test with this patch set.
> >> Without patch:
> >> time mkfs.xfs /dev/md0
> >> real    4m39.775s
> >> user    0m0.000s
> >> sys     0m0.298s
> >>
> >> With patch:
> >> time mkfs.xfs /dev/md0
> >> real    0m0.105s
> >> user    0m0.000s
> >> sys     0m0.007s
> >>
> >> nvme3n1           259:1    0   477G  0 disk
> >> └─nvme3n1p1       259:10   0    50G  0 part
> >> nvme4n1           259:2    0   477G  0 disk
> >> └─nvme4n1p1       259:11   0    50G  0 part
> >> nvme5n1           259:6    0   477G  0 disk
> >> └─nvme5n1p1       259:12   0    50G  0 part
> >> nvme2n1           259:9    0   477G  0 disk
> >> └─nvme2n1p1       259:15   0    50G  0 part
> >> nvme0n1           259:13   0   477G  0 disk
> >> └─nvme0n1p1       259:14   0    50G  0 part
> >>
> >> v1:
> >> Coly helped to review these patches and gave some suggestions:
> >> One bug was found: if a discard bio crosses one stripe but the bio size is
> >> bigger than one stripe size, the bio will be NULL after splitting. In this
> >> version, it checks whether the discard bio size is bigger than 2*stripe_size.
> >> In raid10_end_discard_request, it's better to check whether R10BIO_Uptodate is
> >> already set. This avoids a memory write and improves performance.
> >> Added more comments for calculating addresses.
> >>
> >> Signed-off-by: Xiao Ni <xni@redhat.com>
> >
> > The code looks good to me, you may add my Reviewed-by: Coly Li
> > <colyli@suse.de>
> >
> > But I still suggest you add a more detailed commit log to explain
> > how the discard range of each component disk is decided. Anyway, this is
> > just a suggestion; it is up to you.
> >
> > Thank you for this work.
>
> In raid10_end_discard_request(), among the first 5 lines of local variable
> declarations, there are 3 lines that start with improper blanks (they should
> be tab '\t'). You may find this minor issue with checkpatch.pl, I guess.
>
> After fixing this format issue, you still have my Reviewed-by :-)
>
> Coly Li

Thanks to Xiao for the patch, and thanks to Coly for the review.

I did an experiment with a patch to increase max_discard_sectors,
like:

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 14b1ba732cd7d..0b15397ebfaf9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3762,7 +3762,7 @@ static int raid10_run(struct mddev *mddev)
        chunk_size = mddev->chunk_sectors << 9;
        if (mddev->queue) {
                blk_queue_max_discard_sectors(mddev->queue,
-                                             mddev->chunk_sectors);
+                                             UINT_MAX);
                blk_queue_max_write_same_sectors(mddev->queue, 0);
                blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
                blk_queue_io_min(mddev->queue, chunk_size);

This simple change reduces the run time of mkfs.xfs on my SSD
from 5+ minutes to about 14 seconds. Of course this is not as good
as the ~2 seconds with the patch set. But I wonder whether the time
saving justifies the extra complexity. Thoughts on this?

Also, as Coly mentioned, please run checkpatch.pl on the patches.
I have seen warnings on all 3 patches in the set.

Thanks,
Song

^ permalink raw reply related	[flat|nested] 12+ messages in thread
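
To make the hole example in the quoted commit message concrete, here is a
minimal user-space model of the 2-copy offset layout (a sketch with
illustrative helper names, not code taken from raid10.c):

#include <stdio.h>

#define NDISKS  5
#define NCOPIES 2

/* In the offset layout above, copy c of chunk j sits on disk
 * (j + c) % NDISKS, and each stripe occupies NCOPIES consecutive rows. */
static int disk_of_copy(int chunk, int copy)
{
	return (chunk + copy) % NDISKS;
}

static int row_of_copy(int chunk, int copy)
{
	return (chunk / NDISKS) * NCOPIES + copy;
}

int main(void)
{
	/* A discard covering D03..D10 (chunks 2..9) touches these spots. */
	for (int chunk = 2; chunk <= 9; chunk++)
		for (int copy = 0; copy < NCOPIES; copy++)
			printf("D%02d copy%d -> disk%d row%d\n", chunk + 1,
			       copy, disk_of_copy(chunk, copy) + 1,
			       row_of_copy(chunk, copy) + 1);
	return 0;
}

Running it shows that within the discard range disk3 is touched on rows 1,
3 and 4 but not row 2 (which holds D02's second copy): exactly the hole
between D03 and D08 described in the commit message.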

* Re: [PATCH V1 3/3] improve raid10 discard request
  2020-07-22  0:17       ` Song Liu
@ 2020-07-22  2:43         ` Xiao Ni
  2020-07-22  5:10           ` Xiao Ni
  0 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2020-07-22  2:43 UTC (permalink / raw)
  To: Song Liu, Coly Li; +Cc: linux-raid, Nigel Croxon, Heinz Mauelshagen



On 07/22/2020 08:17 AM, Song Liu wrote:
> Thanks to Xiao for the patch, and thanks to Coly for the review.
>
> I did an experiment with a patch to increase max_discard_sectors,
> like:
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 14b1ba732cd7d..0b15397ebfaf9 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3762,7 +3762,7 @@ static int raid10_run(struct mddev *mddev)
>          chunk_size = mddev->chunk_sectors << 9;
>          if (mddev->queue) {
>                  blk_queue_max_discard_sectors(mddev->queue,
> -                                             mddev->chunk_sectors);
> +                                             UINT_MAX);
>                  blk_queue_max_write_same_sectors(mddev->queue, 0);
>                  blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
>                  blk_queue_io_min(mddev->queue, chunk_size);
>
> This simple change reduces the run time of mkfs.xfs on my SSD
> from 5+ minutes to about 14 seconds. Of course this is not as good
> as the ~2 seconds with the patch set. But I wonder whether the time
> saving justifies the extra complexity. Thoughts on this?
>
> Also, as Coly mentioned, please run checkpatch.pl on the patches.
> I have seen warnings on all 3 patches in the set.
>
> Thanks,
> Song
>
Hi Song

The mkfs.xfs still takes a very long time in my environment if I just
modify the max discard sectors.
But your test result doesn't make sense to me: the bio still needs to be
split based on the chunk size, so shouldn't modifying only the max
discard sectors make no difference?

And thanks to Coly for the review. I'll fix the problem in the next version.

Regards
Xiao

^ permalink raw reply	[flat|nested] 12+ messages in thread
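
A rough model of the splitting Xiao describes (a hypothetical user-space
sketch, not the actual raid10.c request path): as long as every bio is cut
at chunk boundaries, even a queue limit of UINT_MAX still yields one bio
per chunk, so a big discard fans out into many small requests.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

/* Stand-in for submitting one piece to a component device. */
static void submit_piece(sector_t start, sector_t len)
{
	printf("discard [%llu, %llu)\n", (unsigned long long)start,
	       (unsigned long long)(start + len));
}

/* Cut a discard at chunk boundaries, one piece per chunk. */
static void split_by_chunk(sector_t start, sector_t len, sector_t chunk)
{
	while (len) {
		sector_t in_chunk = chunk - (start % chunk);
		sector_t n = len < in_chunk ? len : in_chunk;

		submit_piece(start, n);
		start += n;
		len -= n;
	}
}

int main(void)
{
	/* 8 pieces here; a 50G discard with a 512K chunk would become
	 * roughly 100,000 per-chunk pieces, which is where the mkfs time
	 * goes. Sizes are in 512-byte sectors, chosen for illustration. */
	split_by_chunk(0, 8192, 1024);
	return 0;
}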

* Re: [PATCH V1 3/3] improve raid10 discard request
  2020-07-22  2:43         ` Xiao Ni
@ 2020-07-22  5:10           ` Xiao Ni
  2020-07-22  6:20             ` Song Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2020-07-22  5:10 UTC (permalink / raw)
  To: Song Liu, Coly Li; +Cc: linux-raid, Nigel Croxon, Heinz Mauelshagen



On 07/22/2020 10:43 AM, Xiao Ni wrote:
> Hi Song
>
> The mkfs.xfs still takes a very long time in my environment if I just
> modify the max discard sectors.
> But your test result doesn't make sense to me: the bio still needs to be
> split based on the chunk size, so shouldn't modifying only the max
> discard sectors make no difference?
>
> And thanks to Coly for the review. I'll fix the problem in the next version.
>
Changing only max discard sectors to UINT_MAX does take effect in my
environment:
time mkfs.xfs /dev/md0 -f
real    50m1.674s  (no change)
real    13m19.367s (with change)

Maybe it depends on the disks. In my environment, it still takes a long 
time.

Regards
Xiao

^ permalink raw reply	[flat|nested] 12+ messages in thread
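
For anyone reproducing these numbers, the discard limit an array actually
advertises can be read back from sysfs (md0 assumed as the array name,
matching the tests above):

cat /sys/block/md0/queue/discard_max_bytes
cat /sys/block/md0/queue/discard_granularity

Before Song's one-line change, discard_max_bytes should correspond to a
single chunk; after it, to the much larger queue cap. Checking this first
makes it easier to tell which behavior a given test machine is exercising.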

* Re: [PATCH V1 3/3] improve raid10 discard request
  2020-07-22  5:10           ` Xiao Ni
@ 2020-07-22  6:20             ` Song Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Song Liu @ 2020-07-22  6:20 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Coly Li, linux-raid, Nigel Croxon, Heinz Mauelshagen

On Tue, Jul 21, 2020 at 10:10 PM Xiao Ni <xni@redhat.com> wrote:
> Changing only max discard sectors to UINT_MAX does take effect in my
> environment:
> time mkfs.xfs /dev/md0 -f
> real    50m1.674s  (no change)
> real    13m19.367s (with change)
>
> Maybe it depends on the disks. In my environment, it still takes a long
> time.

I see. Let's fix it with this patch set then. Please fix the checkpatch.pl
warnings and resubmit with Coly's Reviewed-by tag.

Thanks,
Song

^ permalink raw reply	[flat|nested] 12+ messages in thread
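
As an aside on the v1 notes quoted earlier, the 2*stripe_size check can be
pictured with a small helper (a hypothetical sketch, not the patch's code):
only a discard covering at least two full stripes is guaranteed to contain
a stripe-aligned middle worth fast-pathing; shorter requests fall back
entirely to raid10_write_request.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

/*
 * Compute the stripe-aligned middle [mid_start, mid_end) of a discard.
 * Returns 0 when there is no usable middle and the whole request should
 * take the normal per-chunk write path.
 */
static int aligned_middle(sector_t start, sector_t len, sector_t stripe,
			  sector_t *mid_start, sector_t *mid_end)
{
	sector_t first = (start + stripe - 1) / stripe * stripe; /* round up */
	sector_t last = (start + len) / stripe * stripe;         /* round down */

	if (len < 2 * stripe || last <= first)
		return 0;
	*mid_start = first;
	*mid_end = last;
	return 1;
}

int main(void)
{
	sector_t s, e;

	/* stripe = 5120 sectors in the 5-disk example (5 x 1024) */
	if (aligned_middle(3000, 20480, 5120, &s, &e))
		printf("fast-path middle: [%llu, %llu)\n",
		       (unsigned long long)s, (unsigned long long)e);
	return 0;
}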

end of thread

Thread overview: 12+ messages
2020-07-20  4:58 [PATCH V1 0/3] Improve handling raid10 discard request Xiao Ni
2020-07-20  4:58 ` [PATCH V1 1/3] Move codes related with submitting discard bio into one function Xiao Ni
2020-07-20  5:14   ` Coly Li
2020-07-20  4:58 ` [PATCH V1 2/3] extend r10bio devs to raid disks Xiao Ni
2020-07-20  5:15   ` Coly Li
2020-07-20  4:58 ` [PATCH V1 3/3] improve raid10 discard request Xiao Ni
2020-07-20  5:26   ` Coly Li
2020-07-20  5:38     ` Coly Li
2020-07-22  0:17       ` Song Liu
2020-07-22  2:43         ` Xiao Ni
2020-07-22  5:10           ` Xiao Ni
2020-07-22  6:20             ` Song Liu
