* [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request
@ 2020-08-25  5:42 Xiao Ni
  2020-08-25  5:42 ` [PATCH V5 1/5] md/raid10: move codes related with submitting discard bio into one function Xiao Ni
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Xiao Ni @ 2020-08-25  5:42 UTC (permalink / raw)
  To: linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

Hi all

Currently, mkfs on a raid10 array built from SSD/NVMe disks takes a long
time. This patch set tries to resolve this problem.

v1:
Coly helped to review these patches and gave some suggestions:
One bug was found: if a discard bio crosses a stripe boundary and the bio
size is bigger than one stripe size, the bio will be NULL after splitting.
In this version, it checks whether the discard bio size is bigger than
(2 * stripe_size).
In raid10_end_discard_request, it's better to check whether R10BIO_Uptodate
is already set. This avoids a memory write and improves performance.
Added more comments for the address calculations.

v2:
Fixed errors reported by checkpatch.pl.
Fixed a bug in the offset layout: v1 calculated the split size wrongly.
Added more comments to explain how the discard range of each component
disk is decided.

v3:
Added support for the far layout.
Renamed the patches.

v4:
Pulled the code that waits for blocked devices into a separate function.
Stopped using (stripe_size - 1) as a mask in the calculations, because
stripe_size may not be a power of 2.
Dropped the unneeded full copy of geo.
Fixed warnings reported by checkpatch.pl.

v5:
Fixed the build on 32-bit platforms, which do not support a 64-bit divide
by a 32-bit value directly.
Reported-by: kernel test robot <lkp@intel.com>

Xiao Ni (5):
  md/raid10: move codes related with submitting discard bio into one
    function
  md/raid10: extend r10bio devs to raid disks
  md/raid10: pull codes that wait for blocked dev into one function
  md/raid10: improve raid10 discard request
  md/raid10: improve discard request for far layout

 drivers/md/md.c     |  23 +++
 drivers/md/md.h     |   3 +
 drivers/md/raid0.c  |  15 +-
 drivers/md/raid10.c | 422 +++++++++++++++++++++++++++++++++++++++++++++-------
 drivers/md/raid10.h |   1 +
 5 files changed, 394 insertions(+), 70 deletions(-)

-- 
2.7.5


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V5 1/5] md/raid10: move codes related with submitting discard bio into one function
  2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
@ 2020-08-25  5:42 ` Xiao Ni
  2020-08-25  5:43 ` [PATCH V5 2/5] md/raid10: extend r10bio devs to raid disks Xiao Ni
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Xiao Ni @ 2020-08-25  5:42 UTC (permalink / raw)
  To: linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

This code can also be used by raid10, so move it into md.c where raid0
and raid10 can share it.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c    | 23 +++++++++++++++++++++++
 drivers/md/md.h    |  3 +++
 drivers/md/raid0.c | 15 ++-------------
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6072782..10743be 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8583,6 +8583,29 @@ void md_write_end(struct mddev *mddev)
 
 EXPORT_SYMBOL(md_write_end);
 
+/* This is used by raid0 and raid10 */
+void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
+				struct bio *bio,
+				sector_t dev_start, sector_t dev_end)
+{
+	struct bio *discard_bio = NULL;
+
+	if (__blkdev_issue_discard(rdev->bdev,
+	    dev_start + rdev->data_offset,
+	    dev_end - dev_start, GFP_NOIO, 0, &discard_bio) ||
+	    !discard_bio)
+		return;
+
+	bio_chain(discard_bio, bio);
+	bio_clone_blkg_association(discard_bio, bio);
+	if (mddev->gendisk)
+		trace_block_bio_remap(bdev_get_queue(rdev->bdev),
+			discard_bio, disk_devt(mddev->gendisk),
+			bio->bi_iter.bi_sector);
+	submit_bio_noacct(discard_bio);
+}
+EXPORT_SYMBOL(md_submit_discard_bio);
+
 /* md_allow_write(mddev)
  * Calling this ensures that the array is marked 'active' so that writes
  * may proceed without blocking.  It is important to call this before
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d9c4e6b..bae3bd5 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -713,6 +713,9 @@ extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_finish_reshape(struct mddev *mddev);
+extern void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
+				struct bio *bio,
+				sector_t dev_start, sector_t dev_end);
 
 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
 extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f54a449..2868294 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -510,7 +510,6 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 
 	for (disk = 0; disk < zone->nb_dev; disk++) {
 		sector_t dev_start, dev_end;
-		struct bio *discard_bio = NULL;
 		struct md_rdev *rdev;
 
 		if (disk < start_disk_index)
@@ -533,18 +532,8 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 
 		rdev = conf->devlist[(zone - conf->strip_zone) *
 			conf->strip_zone[0].nb_dev + disk];
-		if (__blkdev_issue_discard(rdev->bdev,
-			dev_start + zone->dev_start + rdev->data_offset,
-			dev_end - dev_start, GFP_NOIO, 0, &discard_bio) ||
-		    !discard_bio)
-			continue;
-		bio_chain(discard_bio, bio);
-		bio_clone_blkg_association(discard_bio, bio);
-		if (mddev->gendisk)
-			trace_block_bio_remap(bdev_get_queue(rdev->bdev),
-				discard_bio, disk_devt(mddev->gendisk),
-				bio->bi_iter.bi_sector);
-		submit_bio_noacct(discard_bio);
+		dev_start += zone->dev_start;
+		md_submit_discard_bio(mddev, rdev, bio, dev_start, dev_end);
 	}
 	bio_endio(bio);
 }
-- 
2.7.5



* [PATCH V5 2/5] md/raid10: extend r10bio devs to raid disks
  2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
  2020-08-25  5:42 ` [PATCH V5 1/5] md/raid10: move codes related with submitting discard bio into one function Xiao Ni
@ 2020-08-25  5:43 ` Xiao Ni
  2020-08-25  5:43 ` [PATCH V5 3/5] md/raid10: pull codes that wait for blocked dev into one function Xiao Ni
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Xiao Ni @ 2020-08-25  5:43 UTC (permalink / raw)
  To: linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

Currently it allocates r10bio->devs[conf->copies]. A discard bio needs to
be submitted to all member disks, and it needs to use an r10bio, so extend
the allocation to r10bio->devs[geo.raid_disks].

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/raid10.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index da12e3d..c4c8477 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -91,7 +91,7 @@ static inline struct r10bio *get_resync_r10bio(struct bio *bio)
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct r10conf *conf = data;
-	int size = offsetof(struct r10bio, devs[conf->copies]);
+	int size = offsetof(struct r10bio, devs[conf->geo.raid_disks]);
 
 	/* allocate a r10bio with room for raid_disks entries in the
 	 * bios array */
@@ -238,7 +238,7 @@ static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio)
 {
 	int i;
 
-	for (i = 0; i < conf->copies; i++) {
+	for (i = 0; i < conf->geo.raid_disks; i++) {
 		struct bio **bio = & r10_bio->devs[i].bio;
 		if (!BIO_SPECIAL(*bio))
 			bio_put(*bio);
@@ -327,7 +327,7 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
 	int slot;
 	int repl = 0;
 
-	for (slot = 0; slot < conf->copies; slot++) {
+	for (slot = 0; slot < conf->geo.raid_disks; slot++) {
 		if (r10_bio->devs[slot].bio == bio)
 			break;
 		if (r10_bio->devs[slot].repl_bio == bio) {
@@ -336,7 +336,6 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
 		}
 	}
 
-	BUG_ON(slot == conf->copies);
 	update_head_pos(slot, r10_bio);
 
 	if (slotp)
@@ -1493,7 +1492,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	r10_bio->mddev = mddev;
 	r10_bio->sector = bio->bi_iter.bi_sector;
 	r10_bio->state = 0;
-	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->copies);
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * conf->geo.raid_disks);
 
 	if (bio_data_dir(bio) == READ)
 		raid10_read_request(mddev, bio, r10_bio);
-- 
2.7.5



* [PATCH V5 3/5] md/raid10: pull codes that wait for blocked dev into one function
  2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
  2020-08-25  5:42 ` [PATCH V5 1/5] md/raid10: move codes related with submitting discard bio into one function Xiao Ni
  2020-08-25  5:43 ` [PATCH V5 2/5] md/raid10: extend r10bio devs to raid disks Xiao Ni
@ 2020-08-25  5:43 ` Xiao Ni
  2020-08-25  5:43 ` [PATCH V5 4/5] md/raid10: improve raid10 discard request Xiao Ni
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Xiao Ni @ 2020-08-25  5:43 UTC (permalink / raw)
  To: linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

The following patch needs to do the same job, so pull the duplicated code
into one function.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/raid10.c | 118 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 67 insertions(+), 51 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c4c8477..05e7f8d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1275,12 +1275,75 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 	}
 }
 
+static void wait_blocked_dev(struct mddev *mddev, struct r10bio *r10_bio)
+{
+	int i;
+	struct r10conf *conf = mddev->private;
+	struct md_rdev *blocked_rdev;
+
+retry_wait:
+	blocked_rdev = NULL;
+	rcu_read_lock();
+	for (i = 0; i < conf->copies; i++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[i].replacement);
+		if (rdev == rrdev)
+			rrdev = NULL;
+		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
+			atomic_inc(&rdev->nr_pending);
+			blocked_rdev = rdev;
+			break;
+		}
+		if (rrdev && unlikely(test_bit(Blocked, &rrdev->flags))) {
+			atomic_inc(&rrdev->nr_pending);
+			blocked_rdev = rrdev;
+			break;
+		}
+
+		if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
+			sector_t first_bad;
+			sector_t dev_sector = r10_bio->devs[i].addr;
+			int bad_sectors;
+			int is_bad;
+
+			/* A discard request doesn't care about the write result,
+			 * so it doesn't need to wait for a blocked disk here.
+			 */
+			if (!r10_bio->sectors)
+				continue;
+
+			is_bad = is_badblock(rdev, dev_sector, r10_bio->sectors,
+					     &first_bad, &bad_sectors);
+			if (is_bad < 0) {
+				/* Mustn't write here until the bad block
+				 * is acknowledged
+				 */
+				atomic_inc(&rdev->nr_pending);
+				set_bit(BlockedBadBlocks, &rdev->flags);
+				blocked_rdev = rdev;
+				break;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	if (unlikely(blocked_rdev)) {
+		/* Have to wait for this device to get unblocked, then retry */
+		allow_barrier(conf);
+		raid10_log(conf->mddev, "%s wait rdev %d blocked",
+				__func__, blocked_rdev->raid_disk);
+		md_wait_for_blocked_rdev(blocked_rdev, mddev);
+		wait_barrier(conf);
+		goto retry_wait;
+	}
+}
+
 static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 				 struct r10bio *r10_bio)
 {
 	struct r10conf *conf = mddev->private;
 	int i;
-	struct md_rdev *blocked_rdev;
 	sector_t sectors;
 	int max_sectors;
 
@@ -1338,8 +1401,9 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
 	raid10_find_phys(conf, r10_bio);
-retry_write:
-	blocked_rdev = NULL;
+
+	wait_blocked_dev(mddev, r10_bio);
+
 	rcu_read_lock();
 	max_sectors = r10_bio->sectors;
 
@@ -1350,16 +1414,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 			conf->mirrors[d].replacement);
 		if (rdev == rrdev)
 			rrdev = NULL;
-		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
-			atomic_inc(&rdev->nr_pending);
-			blocked_rdev = rdev;
-			break;
-		}
-		if (rrdev && unlikely(test_bit(Blocked, &rrdev->flags))) {
-			atomic_inc(&rrdev->nr_pending);
-			blocked_rdev = rrdev;
-			break;
-		}
 		if (rdev && (test_bit(Faulty, &rdev->flags)))
 			rdev = NULL;
 		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
@@ -1380,15 +1434,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 			is_bad = is_badblock(rdev, dev_sector, max_sectors,
 					     &first_bad, &bad_sectors);
-			if (is_bad < 0) {
-				/* Mustn't write here until the bad block
-				 * is acknowledged
-				 */
-				atomic_inc(&rdev->nr_pending);
-				set_bit(BlockedBadBlocks, &rdev->flags);
-				blocked_rdev = rdev;
-				break;
-			}
 			if (is_bad && first_bad <= dev_sector) {
 				/* Cannot write here at all */
 				bad_sectors -= (dev_sector - first_bad);
@@ -1424,35 +1469,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	}
 	rcu_read_unlock();
 
-	if (unlikely(blocked_rdev)) {
-		/* Have to wait for this device to get unblocked, then retry */
-		int j;
-		int d;
-
-		for (j = 0; j < i; j++) {
-			if (r10_bio->devs[j].bio) {
-				d = r10_bio->devs[j].devnum;
-				rdev_dec_pending(conf->mirrors[d].rdev, mddev);
-			}
-			if (r10_bio->devs[j].repl_bio) {
-				struct md_rdev *rdev;
-				d = r10_bio->devs[j].devnum;
-				rdev = conf->mirrors[d].replacement;
-				if (!rdev) {
-					/* Race with remove_disk */
-					smp_mb();
-					rdev = conf->mirrors[d].rdev;
-				}
-				rdev_dec_pending(rdev, mddev);
-			}
-		}
-		allow_barrier(conf);
-		raid10_log(conf->mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
-		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		wait_barrier(conf);
-		goto retry_write;
-	}
-
 	if (max_sectors < r10_bio->sectors)
 		r10_bio->sectors = max_sectors;
 
-- 
2.7.5



* [PATCH V5 4/5] md/raid10: improve raid10 discard request
  2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
                   ` (2 preceding siblings ...)
  2020-08-25  5:43 ` [PATCH V5 3/5] md/raid10: pull codes that wait for blocked dev into one function Xiao Ni
@ 2020-08-25  5:43 ` Xiao Ni
  2020-08-28 22:16   ` Song Liu
  2020-08-25  5:43 ` [PATCH V5 5/5] md/raid10: improve discard request for far layout Xiao Ni
  2020-08-25 10:19 ` [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Michal Soltys
  5 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2020-08-25  5:43 UTC (permalink / raw)
  To: linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

Currently the discard request is split by the chunk size, so it takes a
long time to finish mkfs on disks that support the discard function. This
patch improves the handling of raid10 discard requests. It uses an
approach similar to patch 29efc390b (md/md0: optimize raid0 discard
handling).

But it is a little more complex than raid0, because raid10 has different
layouts. If raid10 uses the offset layout and the discard request is
smaller than the stripe size, there are holes when we submit discard bios
to the underlying disks.

For example: five disks (disk1 - disk5)
D01 D02 D03 D04 D05
D05 D01 D02 D03 D04
D06 D07 D08 D09 D10
D10 D06 D07 D08 D09
The discard bio just wants to discard from D03 to D10. For disk3, there is
a hole between D03 and D08. For disk4, there is a hole between D04 and D09.
D03 is a single chunk, and raid10_write_request can handle one chunk
perfectly, so the parts that are not aligned with the stripe size are
still handled by raid10_write_request.

If a reshape is running when a discard bio arrives and the bio spans the
reshape position, raid10_write_request is responsible for handling it.

I did a test with this patch set.
Without patch:
time mkfs.xfs /dev/md0
real	4m39.775s
user	0m0.000s
sys	0m0.298s

With patch:
time mkfs.xfs /dev/md0
real	0m0.105s
user	0m0.000s
sys	0m0.007s

nvme3n1           259:1    0   477G  0 disk
└─nvme3n1p1       259:10   0    50G  0 part
nvme4n1           259:2    0   477G  0 disk
└─nvme4n1p1       259:11   0    50G  0 part
nvme5n1           259:6    0   477G  0 disk
└─nvme5n1p1       259:12   0    50G  0 part
nvme2n1           259:9    0   477G  0 disk
└─nvme2n1p1       259:15   0    50G  0 part
nvme0n1           259:13   0   477G  0 disk
└─nvme0n1p1       259:14   0    50G  0 part

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/raid10.c | 254 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 253 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 05e7f8d..257791e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1516,6 +1516,254 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 		raid10_write_request(mddev, bio, r10_bio);
 }
 
+static struct bio *raid10_split_bio(struct r10conf *conf,
+			struct bio *bio, sector_t sectors, bool want_first)
+{
+	struct bio *split;
+
+	split = bio_split(bio, sectors,	GFP_NOIO, &conf->bio_split);
+	bio_chain(split, bio);
+	allow_barrier(conf);
+	if (want_first) {
+		submit_bio_noacct(bio);
+		bio = split;
+	} else
+		submit_bio_noacct(split);
+	wait_barrier(conf);
+
+	return bio;
+}
+
+static void raid10_end_discard_request(struct bio *bio)
+{
+	struct r10bio *r10_bio = bio->bi_private;
+	struct r10conf *conf = r10_bio->mddev->private;
+	struct md_rdev *rdev = NULL;
+	int dev;
+	int slot, repl;
+
+	/*
+	 * We don't care about the return value of the discard bio
+	 */
+	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+		set_bit(R10BIO_Uptodate, &r10_bio->state);
+
+	dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
+	if (repl)
+		rdev = conf->mirrors[dev].replacement;
+	if (!rdev) {
+		/* raid10_remove_disk uses smp_mb to make sure rdev is set to
+		 * replacement before setting replacement to NULL. It can read
+		 * rdev first without barrier protection even if replacement is NULL.
+		 */
+		smp_rmb();
+		repl = 0;
+		rdev = conf->mirrors[dev].rdev;
+	}
+
+	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		md_write_end(r10_bio->mddev);
+		raid_end_bio_io(r10_bio);
+	}
+
+	rdev_dec_pending(rdev, conf->mddev);
+}
+
+/* There are some limitations to handling a discard bio:
+ * 1st, the discard size must be bigger than stripe_size*2.
+ * 2nd, if the discard bio spans the reshape progress, we use the old
+ * way to handle the discard bio.
+ */
+static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
+{
+	struct r10conf *conf = mddev->private;
+	struct geom *geo = &conf->geo;
+	struct r10bio *r10_bio;
+
+	int disk;
+	sector_t chunk;
+	unsigned int stripe_size;
+	sector_t split_size;
+
+	sector_t bio_start, bio_end;
+	sector_t first_stripe_index, last_stripe_index;
+	sector_t start_disk_offset;
+	unsigned int start_disk_index;
+	sector_t end_disk_offset;
+	unsigned int end_disk_index;
+	unsigned int remainder;
+
+	wait_barrier(conf);
+
+	if (conf->reshape_progress != MaxSector &&
+	    ((bio->bi_iter.bi_sector >= conf->reshape_progress) !=
+	     conf->mddev->reshape_backwards))
+		geo = &conf->prev;
+
+	stripe_size = geo->raid_disks << geo->chunk_shift;
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+
+	/* A discard bio may be smaller than the stripe size, or cross a stripe
+	 * boundary while being larger than one stripe. For the far offset layout,
+	 * if the discard region is not aligned with the stripe size, there are
+	 * holes when we submit the discard bios to the member disks. For
+	 * simplicity, we only handle discard bios bigger than stripe_size*2.
+	 */
+	if (bio_sectors(bio) < stripe_size*2)
+		goto out;
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+		bio_start < conf->reshape_progress &&
+		bio_end > conf->reshape_progress)
+		goto out;
+
+	/* For far offset layout, if bio is not aligned with stripe size, it splits
+	 * the part that is not aligned with strip size.
+	 */
+	div_u64_rem(bio_start, stripe_size, &remainder);
+	if (geo->far_offset && remainder) {
+		split_size = stripe_size - remainder;
+		bio = raid10_split_bio(conf, bio, split_size, false);
+	}
+	div_u64_rem(bio_end, stripe_size, &remainder);
+	if (geo->far_offset && remainder) {
+		split_size = bio_sectors(bio) - remainder;
+		bio = raid10_split_bio(conf, bio, split_size, true);
+	}
+
+	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
+	r10_bio->mddev = mddev;
+	r10_bio->state = 0;
+	r10_bio->sectors = 0;
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * geo->raid_disks);
+
+	wait_blocked_dev(mddev, r10_bio);
+
+	r10_bio->master_bio = bio;
+
+	bio_start = bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(bio);
+
+	/* raid10 uses the chunk as the unit to store data, similar to raid0.
+	 * One stripe contains the chunks from all member disks (one chunk from
+	 * each disk at the same HBA address). For layout details, see 'man 4 md'.
+	 */
+	chunk = bio_start >> geo->chunk_shift;
+	chunk *= geo->near_copies;
+	first_stripe_index = chunk;
+	start_disk_index = sector_div(first_stripe_index, geo->raid_disks);
+	if (geo->far_offset)
+		first_stripe_index *= geo->far_copies;
+	start_disk_offset = (bio_start & geo->chunk_mask) +
+				(first_stripe_index << geo->chunk_shift);
+
+	chunk = bio_end >> geo->chunk_shift;
+	chunk *= geo->near_copies;
+	last_stripe_index = chunk;
+	end_disk_index = sector_div(last_stripe_index, geo->raid_disks);
+	if (geo->far_offset)
+		last_stripe_index *= geo->far_copies;
+	end_disk_offset = (bio_end & geo->chunk_mask) +
+				(last_stripe_index << geo->chunk_shift);
+
+	rcu_read_lock();
+	for (disk = 0; disk < geo->raid_disks; disk++) {
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[disk].replacement);
+
+		r10_bio->devs[disk].bio = NULL;
+		r10_bio->devs[disk].repl_bio = NULL;
+
+		if (rdev && (test_bit(Faulty, &rdev->flags)))
+			rdev = NULL;
+		if (rrdev && (test_bit(Faulty, &rrdev->flags)))
+			rrdev = NULL;
+		if (!rdev && !rrdev)
+			continue;
+
+		if (rdev) {
+			r10_bio->devs[disk].bio = bio;
+			atomic_inc(&rdev->nr_pending);
+		}
+		if (rrdev) {
+			r10_bio->devs[disk].repl_bio = bio;
+			atomic_inc(&rrdev->nr_pending);
+		}
+	}
+	rcu_read_unlock();
+
+	atomic_set(&r10_bio->remaining, 1);
+	for (disk = 0; disk < geo->raid_disks; disk++) {
+		sector_t dev_start, dev_end;
+		struct bio *mbio, *rbio = NULL;
+		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
+		struct md_rdev *rrdev = rcu_dereference(
+			conf->mirrors[disk].replacement);
+
+		/*
+		 * Now calculate the start and end addresses for each disk.
+		 * The space between dev_start and dev_end is the discard region.
+		 *
+		 * For dev_start, three conditions need to be considered:
+		 * 1st, the disk is before start_disk_index: imagine the disk as
+		 * being in the next stripe, so dev_start is the start address
+		 * of the next stripe.
+		 * 2nd, the disk is after start_disk_index: the disk is in the
+		 * same stripe as the first disk.
+		 * 3rd, the first disk itself: use start_disk_offset directly.
+		 */
+		if (disk < start_disk_index)
+			dev_start = (first_stripe_index + 1) * mddev->chunk_sectors;
+		else if (disk > start_disk_index)
+			dev_start = first_stripe_index * mddev->chunk_sectors;
+		else
+			dev_start = start_disk_offset;
+
+		if (disk < end_disk_index)
+			dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
+		else if (disk > end_disk_index)
+			dev_end = last_stripe_index * mddev->chunk_sectors;
+		else
+			dev_end = end_disk_offset;
+
+		/* Only discard bios whose size is >= the stripe size are handled,
+		 * so dev_end > dev_start all the time.
+		 */
+		if (r10_bio->devs[disk].bio) {
+			mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
+			mbio->bi_end_io = raid10_end_discard_request;
+			mbio->bi_private = r10_bio;
+			r10_bio->devs[disk].bio = mbio;
+			r10_bio->devs[disk].devnum = disk;
+			atomic_inc(&r10_bio->remaining);
+			md_submit_discard_bio(mddev, rdev, mbio, dev_start, dev_end);
+			bio_endio(mbio);
+		}
+		if (r10_bio->devs[disk].repl_bio) {
+			rbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
+			rbio->bi_end_io = raid10_end_discard_request;
+			rbio->bi_private = r10_bio;
+			r10_bio->devs[disk].repl_bio = rbio;
+			r10_bio->devs[disk].devnum = disk;
+			atomic_inc(&r10_bio->remaining);
+			md_submit_discard_bio(mddev, rrdev, rbio, dev_start, dev_end);
+			bio_endio(rbio);
+		}
+	}
+
+	if (atomic_dec_and_test(&r10_bio->remaining)) {
+		md_write_end(r10_bio->mddev);
+		raid_end_bio_io(r10_bio);
+	}
+
+	return 0;
+out:
+	allow_barrier(conf);
+	return -EAGAIN;
+}
+
 static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
@@ -1530,6 +1778,10 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 	if (!md_write_start(mddev, bio))
 		return false;
 
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
+		if (!raid10_handle_discard(mddev, bio))
+			return true;
+
 	/*
 	 * If this request crosses a chunk boundary, we need to split
 	 * it.
@@ -3760,7 +4012,7 @@ static int raid10_run(struct mddev *mddev)
 	chunk_size = mddev->chunk_sectors << 9;
 	if (mddev->queue) {
 		blk_queue_max_discard_sectors(mddev->queue,
-					      mddev->chunk_sectors);
+					      UINT_MAX);
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, chunk_size);
-- 
2.7.5



* [PATCH V5 5/5] md/raid10: improve discard request for far layout
  2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
                   ` (3 preceding siblings ...)
  2020-08-25  5:43 ` [PATCH V5 4/5] md/raid10: improve raid10 discard request Xiao Ni
@ 2020-08-25  5:43 ` Xiao Ni
  2020-08-28  7:03   ` Song Liu
  2020-08-25 10:19 ` [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Michal Soltys
  5 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2020-08-25  5:43 UTC (permalink / raw)
  To: linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

For the far layout, the discard region is not contiguous on the disks, so
it needs far_copies r10bios to cover all regions, and a way to know when
all the r10bios have finished. Similar to raid10_sync_request, only the
first r10bio's master_bio records the discard bio; the other r10bios'
master_bio record the first r10bio. The first r10bio can finish only after
the other r10bios finish, and it then completes the discard bio.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/raid10.c | 87 +++++++++++++++++++++++++++++++++++++++--------------
 drivers/md/raid10.h |  1 +
 2 files changed, 65 insertions(+), 23 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 257791e..f6518ea 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1534,6 +1534,29 @@ static struct bio *raid10_split_bio(struct r10conf *conf,
 	return bio;
 }
 
+static void raid_end_discard_bio(struct r10bio *r10bio)
+{
+	struct r10conf *conf = r10bio->mddev->private;
+	struct r10bio *first_r10bio;
+
+	while (atomic_dec_and_test(&r10bio->remaining)) {
+
+		allow_barrier(conf);
+
+		if (!test_bit(R10BIO_Discard, &r10bio->state)) {
+			first_r10bio = (struct r10bio *)r10bio->master_bio;
+			free_r10bio(r10bio);
+			r10bio = first_r10bio;
+		} else {
+			md_write_end(r10bio->mddev);
+			bio_endio(r10bio->master_bio);
+			free_r10bio(r10bio);
+			break;
+		}
+	}
+}
+
+
 static void raid10_end_discard_request(struct bio *bio)
 {
 	struct r10bio *r10_bio = bio->bi_private;
@@ -1561,11 +1584,7 @@ static void raid10_end_discard_request(struct bio *bio)
 		rdev = conf->mirrors[dev].rdev;
 	}
 
-	if (atomic_dec_and_test(&r10_bio->remaining)) {
-		md_write_end(r10_bio->mddev);
-		raid_end_bio_io(r10_bio);
-	}
-
+	raid_end_discard_bio(r10_bio);
 	rdev_dec_pending(rdev, conf->mddev);
 }
 
@@ -1578,7 +1597,9 @@ static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
 	struct geom *geo = &conf->geo;
-	struct r10bio *r10_bio;
+	struct r10bio *r10_bio, *first_r10bio;
+	int far_copies = geo->far_copies;
+	bool first_copy = true;
 
 	int disk;
 	sector_t chunk;
@@ -1618,30 +1639,20 @@ static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 		bio_end > conf->reshape_progress)
 		goto out;
 
-	/* For far offset layout, if bio is not aligned with stripe size, it splits
-	 * the part that is not aligned with strip size.
+	/* For the far and far offset layouts, if the bio is not aligned with
+	 * the stripe size, the part that is not aligned is split off.
 	 */
 	div_u64_rem(bio_start, stripe_size, &remainder);
-	if (geo->far_offset && remainder) {
+	if ((far_copies > 1) && remainder) {
 		split_size = stripe_size - remainder;
 		bio = raid10_split_bio(conf, bio, split_size, false);
 	}
 	div_u64_rem(bio_end, stripe_size, &remainder);
-	if (geo->far_offset && remainder) {
+	if ((far_copies > 1) && remainder) {
 		split_size = bio_sectors(bio) - remainder;
 		bio = raid10_split_bio(conf, bio, split_size, true);
 	}
 
-	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
-	r10_bio->mddev = mddev;
-	r10_bio->state = 0;
-	r10_bio->sectors = 0;
-	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * geo->raid_disks);
-
-	wait_blocked_dev(mddev, r10_bio);
-
-	r10_bio->master_bio = bio;
-
 	bio_start = bio->bi_iter.bi_sector;
 	bio_end = bio_end_sector(bio);
 
@@ -1667,6 +1678,28 @@ static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	end_disk_offset = (bio_end & geo->chunk_mask) +
 				(last_stripe_index << geo->chunk_shift);
 
+retry_discard:
+	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
+	r10_bio->mddev = mddev;
+	r10_bio->state = 0;
+	r10_bio->sectors = 0;
+	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * geo->raid_disks);
+	wait_blocked_dev(mddev, r10_bio);
+
+	/* The far layout needs more than one r10bio to cover all regions.
+	 * Inspired by raid10_sync_request, only the first r10bio->master_bio
+	 * records the discard bio; the other r10bios' master_bio record the
+	 * first r10bio. The first r10bio is released only after all the other
+	 * r10bios finish, and the discard bio returns when the first finishes.
+	 */
+	if (first_copy) {
+		r10_bio->master_bio = bio;
+		set_bit(R10BIO_Discard, &r10_bio->state);
+		first_copy = false;
+		first_r10bio = r10_bio;
+	} else
+		r10_bio->master_bio = (struct bio *)first_r10bio;
+
 	rcu_read_lock();
 	for (disk = 0; disk < geo->raid_disks; disk++) {
 		struct md_rdev *rdev = rcu_dereference(conf->mirrors[disk].rdev);
@@ -1753,11 +1786,19 @@ static bool raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 		}
 	}
 
-	if (atomic_dec_and_test(&r10_bio->remaining)) {
-		md_write_end(r10_bio->mddev);
-		raid_end_bio_io(r10_bio);
+	if (!geo->far_offset && --far_copies) {
+		first_stripe_index += geo->stride >> geo->chunk_shift;
+		start_disk_offset += geo->stride;
+		last_stripe_index += geo->stride >> geo->chunk_shift;
+		end_disk_offset += geo->stride;
+		atomic_inc(&first_r10bio->remaining);
+		raid_end_discard_bio(r10_bio);
+		wait_barrier(conf);
+		goto retry_discard;
 	}
 
+	raid_end_discard_bio(r10_bio);
+
 	return 0;
 out:
 	allow_barrier(conf);
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 79cd2b7..1461fd5 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -179,5 +179,6 @@ enum r10bio_state {
 	R10BIO_Previous,
 /* failfast devices did receive failfast requests. */
 	R10BIO_FailFast,
+	R10BIO_Discard,
 };
 #endif
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request
  2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
                   ` (4 preceding siblings ...)
  2020-08-25  5:43 ` [PATCH V5 5/5] md/raid10: improve discard request for far layout Xiao Ni
@ 2020-08-25 10:19 ` Michal Soltys
  2020-08-25 13:25   ` Xiao Ni
  5 siblings, 1 reply; 16+ messages in thread
From: Michal Soltys @ 2020-08-25 10:19 UTC (permalink / raw)
  To: Xiao Ni, linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

On 8/25/20 7:42 AM, Xiao Ni wrote:
> Hi all
> 
> Now mkfs on raid10 which is combined with ssd/nvme disks takes a long time.
> This patch set tries to resolve this problem.
> 

Are those fixes also possibly related to the issues I found earlier
this year about its very weird discard handling whenever the
originating request wasn't essentially chunk-aligned?

What I found back then is e.g. a discard of a 4x32gb raid10 taking a
good 11 minutes via blkdiscard without an explicit step option.

I still have blktraces of that available, the relevant thread part with 
further more detailed followups can be found at:

https://www.spinics.net/lists/raid/msg62115.html


* Re: [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request
  2020-08-25 10:19 ` [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Michal Soltys
@ 2020-08-25 13:25   ` Xiao Ni
  2020-08-25 15:14     ` Michal Soltys
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2020-08-25 13:25 UTC (permalink / raw)
  To: Michal Soltys, linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli



On 08/25/2020 06:19 PM, Michal Soltys wrote:
> On 8/25/20 7:42 AM, Xiao Ni wrote:
>> Hi all
>>
>> Now mkfs on raid10 which is combined with ssd/nvme disks takes a long 
>> time.
>> This patch set tries to resolve this problem.
>>
>
Hi Michal

> Are those fixes also possibly related to the issues I found earlier 
> this year about it's very weird discard handling whenever the 
> originating request wasn't essentially chunk-aligend ?
I searched for your email address in my mailbox and I didn't find any
emails from you from earlier this year. Is there a link? Discard
requests are usually very big.
If the discard request is not chunk aligned, raid10 can handle this
without my patch. It splits the I/O by chunk size and writes/discards
each chunk to all copies.
>
> What I found back then is e.g. discard of 4x32gb raid10 taking good 11 
> minutes via blkdiscard w/o explicit step option.
4x32gb means 4 disks and each disk is 32GB?
For the discard time problem, as mentioned just now, raid10 splits a
big discard request into chunk-sized pieces, so it takes a very long
time. My patch resolves this problem.
>
> I still have blktraces of that available, the relevant thread part 
> with further more detailed followups can be found at:
>
> https://www.spinics.net/lists/raid/msg62115.html
>
It's a raid456 problem, not raid10.

Regards
Xiao



* Re: [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request
  2020-08-25 13:25   ` Xiao Ni
@ 2020-08-25 15:14     ` Michal Soltys
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Soltys @ 2020-08-25 15:14 UTC (permalink / raw)
  To: Xiao Ni, linux-raid, song; +Cc: heinzm, ncroxon, guoqing.jiang, colyli

On 8/25/20 3:25 PM, Xiao Ni wrote:
> 
> 
> On 08/25/2020 06:19 PM, Michal Soltys wrote:
>> On 8/25/20 7:42 AM, Xiao Ni wrote:
>> Are those fixes also possibly related to the issues I found earlier 
>> this year about it's very weird discard handling whenever the 
>> originating request wasn't essentially chunk-aligend ?
> I searched by your email in my email box and I didn't find emails from 
> you at earlier this year. Is there a link? Discard request is usually 
> very big.

I was using a different email address then (soltys@ziu.info), but vger
seems to be refusing the .info domain now.

> If the discard request is not chunk aligned, raid10 can handle this 
> problem without my patch. It splits I/O by chunk size and write/discard
> this chunk to all copies.
>>
>> What I found back then is e.g. discard of 4x32gb raid10 taking good 11 
>> minutes via blkdiscard w/o explicit step option.
> 4x32gb means 4 disks and each disk is 32GB?

Using partitions on the disks to be precise, but yea.

> For the discard time problem, as mentioned just now, it splits big 
> discard request into small chunks. So it takes very long time.
> My patch resolves this problem.
>>
>> I still have blktraces of that available, the relevant thread part 
>> with further more detailed followups can be found at:
>>
>> https://www.spinics.net/lists/raid/msg62115.html
>>
> It's a raid456 problem, not raid10.

The part related to raid10 starts at that point (as well as all the
followups), after Song asked me about raid10 behavior.

The above spinics link has links to blktrace dumps from when blkdiscard
was executed on such a raid10. The request was split into tiny chunks
and executed in a really weird fashion, as the subsequent replies
outlined:

https://www.spinics.net/lists/raid/msg62134.html
https://www.spinics.net/lists/raid/msg62164.html

Anyway, I'm just mentioning it after seeing a set of patches related to
discards and raid10.

I'll double-check that behavior with a more current kernel version and
the patches applied.


* Re: [PATCH V5 5/5] md/raid10: improve discard request for far layout
  2020-08-25  5:43 ` [PATCH V5 5/5] md/raid10: improve discard request for far layout Xiao Ni
@ 2020-08-28  7:03   ` Song Liu
  2020-08-28  9:50     ` Xiao Ni
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2020-08-28  7:03 UTC (permalink / raw)
  To: Xiao Ni
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li

On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
>
> For far layout, the discard region is not continuous on disks. So it needs
> far copies r10bio to cover all regions. It needs a way to know all r10bios
> have finish or not. Similar with raid10_sync_request, only the first r10bio
> master_bio records the discard bio. Other r10bios master_bio record the
> first r10bio. The first r10bio can finish after other r10bios finish and
> then return the discard bio.
>
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>  drivers/md/raid10.c | 87 +++++++++++++++++++++++++++++++++++++++--------------
>  drivers/md/raid10.h |  1 +
>  2 files changed, 65 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 257791e..f6518ea 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1534,6 +1534,29 @@ static struct bio *raid10_split_bio(struct r10conf *conf,
>         return bio;
>  }
>
> +static void raid_end_discard_bio(struct r10bio *r10bio)

Let's name this raid10_*

> +{
> +       struct r10conf *conf = r10bio->mddev->private;
> +       struct r10bio *first_r10bio;
> +
> +       while (atomic_dec_and_test(&r10bio->remaining)) {

Should this be "if (atomic_*"?

Thanks,
Song

[...]


* Re: [PATCH V5 5/5] md/raid10: improve discard request for far layout
  2020-08-28  7:03   ` Song Liu
@ 2020-08-28  9:50     ` Xiao Ni
  2020-08-28 21:44       ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2020-08-28  9:50 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li



On 08/28/2020 03:03 PM, Song Liu wrote:
> On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
>> For far layout, the discard region is not continuous on disks. So it needs
>> far copies r10bio to cover all regions. It needs a way to know all r10bios
>> have finish or not. Similar with raid10_sync_request, only the first r10bio
>> master_bio records the discard bio. Other r10bios master_bio record the
>> first r10bio. The first r10bio can finish after other r10bios finish and
>> then return the discard bio.
>>
>> Signed-off-by: Xiao Ni <xni@redhat.com>
>> ---
>>   drivers/md/raid10.c | 87 +++++++++++++++++++++++++++++++++++++++--------------
>>   drivers/md/raid10.h |  1 +
>>   2 files changed, 65 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 257791e..f6518ea 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1534,6 +1534,29 @@ static struct bio *raid10_split_bio(struct r10conf *conf,
>>          return bio;
>>   }
>>
>> +static void raid_end_discard_bio(struct r10bio *r10bio)
> Let's name this raid10_*
Ok
>
>> +{
>> +       struct r10conf *conf = r10bio->mddev->private;
>> +       struct r10bio *first_r10bio;
>> +
>> +       while (atomic_dec_and_test(&r10bio->remaining)) {
> Should this be "if (atomic_*"?
>
The usage of while is right here. For the far layout, it needs
far-copies r10bios, and it needs a way to know when all the r10bios
have finished. The first r10bio->remaining is used to achieve that: it
is incremented when preparing each of the other r10bios. I was
inspired by end_sync_request, so it should use while here: when one of
the other r10bios reaches zero, the loop also decrements the first
r10bio's remaining count.

Are there more things you want me to modify or add? If not, I'll send
v6 with the function renamed. Thanks for reviewing these patches :)

Regards
Xiao



* Re: [PATCH V5 5/5] md/raid10: improve discard request for far layout
  2020-08-28  9:50     ` Xiao Ni
@ 2020-08-28 21:44       ` Song Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2020-08-28 21:44 UTC (permalink / raw)
  To: Xiao Ni
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li

On Fri, Aug 28, 2020 at 2:50 AM Xiao Ni <xni@redhat.com> wrote:
>
>
>
> On 08/28/2020 03:03 PM, Song Liu wrote:
> > On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
> >> For far layout, the discard region is not continuous on disks. So it needs
> >> far copies r10bio to cover all regions. It needs a way to know all r10bios
> >> have finish or not. Similar with raid10_sync_request, only the first r10bio
> >> master_bio records the discard bio. Other r10bios master_bio record the
> >> first r10bio. The first r10bio can finish after other r10bios finish and
> >> then return the discard bio.
> >>
> >> Signed-off-by: Xiao Ni <xni@redhat.com>
> >> ---
> >>   drivers/md/raid10.c | 87 +++++++++++++++++++++++++++++++++++++++--------------
> >>   drivers/md/raid10.h |  1 +
> >>   2 files changed, 65 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 257791e..f6518ea 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -1534,6 +1534,29 @@ static struct bio *raid10_split_bio(struct r10conf *conf,
> >>          return bio;
> >>   }
> >>
> >> +static void raid_end_discard_bio(struct r10bio *r10bio)
> > Let's name this raid10_*
> Ok
> >
> >> +{
> >> +       struct r10conf *conf = r10bio->mddev->private;
> >> +       struct r10bio *first_r10bio;
> >> +
> >> +       while (atomic_dec_and_test(&r10bio->remaining)) {
> > Should this be "if (atomic_*"?
> >
> The usage of while is right here. For far layout, it needs far copies
> r10bio. It needs to find a method
> to know all r10bios finish. The first r10bio->remaining is used to
> achieve the target. It adds the first
> r10bio->remaining when preparing other r10bios. I was inspired by
> end_sync_request. So it should
> use while here. It needs to decrease the first r10bio remaining for
> other r10bios in the second loop.

Thanks for the explanation.

>
> Are there more things you want me to modify or add? If not, I'll send
> the v6 to rename the function
> name.  Thanks for reviewing these patches :)

1/5 to 3/5 look good so far. I applied them to md-next. I have some
comments on 4/5.

Thanks,
Song


* Re: [PATCH V5 4/5] md/raid10: improve raid10 discard request
  2020-08-25  5:43 ` [PATCH V5 4/5] md/raid10: improve raid10 discard request Xiao Ni
@ 2020-08-28 22:16   ` Song Liu
  2020-08-31  8:37     ` Xiao Ni
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2020-08-28 22:16 UTC (permalink / raw)
  To: Xiao Ni
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li

On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
>
[...]
> ---
>  drivers/md/raid10.c | 254 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 253 insertions(+), 1 deletion(-)
>
[...]
> +
> +static void raid10_end_discard_request(struct bio *bio)
> +{
> +       struct r10bio *r10_bio = bio->bi_private;
> +       struct r10conf *conf = r10_bio->mddev->private;
> +       struct md_rdev *rdev = NULL;
> +       int dev;
> +       int slot, repl;
> +
> +       /*
> +        * We don't care the return value of discard bio
> +        */
> +       if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
> +               set_bit(R10BIO_Uptodate, &r10_bio->state);

We don't need the test_bit(), just do set_bit().

> +
> +       dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
> +       if (repl)
> +               rdev = conf->mirrors[dev].replacement;
> +       if (!rdev) {
> +               /* raid10_remove_disk uses smp_mb to make sure rdev is set to
> +                * replacement before setting replacement to NULL. It can read
> +                * rdev first without barrier protect even replacment is NULL
> +                */
> +               smp_rmb();
> +               repl = 0;
repl is no longer used, right?

> +               rdev = conf->mirrors[dev].rdev;
[...]

> +
> +       if (conf->reshape_progress != MaxSector &&
> +           ((bio->bi_iter.bi_sector >= conf->reshape_progress) !=
> +            conf->mddev->reshape_backwards))
> +               geo = &conf->prev;

Do we need to set R10BIO_Previous here? Also, please run some tests with
reshape in progress.

Thanks,
Song


* Re: [PATCH V5 4/5] md/raid10: improve raid10 discard request
  2020-08-28 22:16   ` Song Liu
@ 2020-08-31  8:37     ` Xiao Ni
  2020-08-31 14:36       ` Xiao Ni
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2020-08-31  8:37 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li



On 08/29/2020 06:16 AM, Song Liu wrote:
> On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
> [...]
>> ---
>>   drivers/md/raid10.c | 254 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 253 insertions(+), 1 deletion(-)
>>
> [...]
>> +
>> +static void raid10_end_discard_request(struct bio *bio)
>> +{
>> +       struct r10bio *r10_bio = bio->bi_private;
>> +       struct r10conf *conf = r10_bio->mddev->private;
>> +       struct md_rdev *rdev = NULL;
>> +       int dev;
>> +       int slot, repl;
>> +
>> +       /*
>> +        * We don't care the return value of discard bio
>> +        */
>> +       if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>> +               set_bit(R10BIO_Uptodate, &r10_bio->state);
> We don't need the test_bit(), just do set_bit().
Coly suggested doing test_bit first to avoid the memory write. If
there are many requests and the requests fail, this can improve
performance a lot.

But we don't care about the return value of the discard bio, so it
should be OK not to set R10BIO_Uptodate here at all.
I'll remove this code. What do you think?
>
>> +
>> +       dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
>> +       if (repl)
>> +               rdev = conf->mirrors[dev].replacement;
>> +       if (!rdev) {
>> +               /* raid10_remove_disk uses smp_mb to make sure rdev is set to
>> +                * replacement before setting replacement to NULL. It can read
>> +                * rdev first without barrier protect even replacment is NULL
>> +                */
>> +               smp_rmb();
>> +               repl = 0;
> repl is no longer used, right?

Right, I'll remove this line
>
>> +               rdev = conf->mirrors[dev].rdev;
> [...]
>
>> +
>> +       if (conf->reshape_progress != MaxSector &&
>> +           ((bio->bi_iter.bi_sector >= conf->reshape_progress) !=
>> +            conf->mddev->reshape_backwards))
>> +               geo = &conf->prev;
> Do we need to set R10BIO_Previous here? Also, please run some tests with
> reshape in progress.
>
> Thanks,
> Song
>
Thanks for pointing this out. Yes, it needs to set R10BIO_Previous
here, because choose_data_offset is needed when submitting the bio to
member disks. But this makes me realize that patch 1/5 has a problem
too: it uses rdev->data_offset directly in md_submit_discard_bio.
That's OK for raid0, because raid0 is changed to other levels during
reshape, but for raid10 it's a bug. It's now in the md-next branch. Do
you want me to resend all the patches or send a new patch to fix the
1/5 problem? Sorry for the trouble.

Regards
Xiao



* Re: [PATCH V5 4/5] md/raid10: improve raid10 discard request
  2020-08-31  8:37     ` Xiao Ni
@ 2020-08-31 14:36       ` Xiao Ni
  2020-09-01  5:45         ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2020-08-31 14:36 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li



On 08/31/2020 04:37 PM, Xiao Ni wrote:
>
>
> On 08/29/2020 06:16 AM, Song Liu wrote:
>> On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
>> [...]
>>> ---
>>>   drivers/md/raid10.c | 254 
>>> +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 253 insertions(+), 1 deletion(-)
>>>
>> [...]
>>> +
>>> +static void raid10_end_discard_request(struct bio *bio)
>>> +{
>>> +       struct r10bio *r10_bio = bio->bi_private;
>>> +       struct r10conf *conf = r10_bio->mddev->private;
>>> +       struct md_rdev *rdev = NULL;
>>> +       int dev;
>>> +       int slot, repl;
>>> +
>>> +       /*
>>> +        * We don't care the return value of discard bio
>>> +        */
>>> +       if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>>> +               set_bit(R10BIO_Uptodate, &r10_bio->state);
>> We don't need the test_bit(), just do set_bit().
> Coly suggested to do test_bit first to avoid write memory. If there 
> are so many requests and the
> requests fail, this way can improve performance very much.
>
> But it doesn't care the return value of discard bio. So it should be 
> ok that doesn't set R10BIO_Uptodate here.
> I'll remove these codes. What do you think?

Hi Song

Sorry, it actually still needs to set R10BIO_Uptodate, because
raid_end_bio_io uses this flag to decide whether to set BLK_STS_IOERR.
So is it OK to test this bit before setting it here?

Regards
Xiao




* Re: [PATCH V5 4/5] md/raid10: improve raid10 discard request
  2020-08-31 14:36       ` Xiao Ni
@ 2020-09-01  5:45         ` Song Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2020-09-01  5:45 UTC (permalink / raw)
  To: Xiao Ni
  Cc: linux-raid, Heinz Mauelshagen, Nigel Croxon, Guoqing Jiang, Coly Li

On Mon, Aug 31, 2020 at 7:36 AM Xiao Ni <xni@redhat.com> wrote:
>
>
>
> On 08/31/2020 04:37 PM, Xiao Ni wrote:
> >
> >
> > On 08/29/2020 06:16 AM, Song Liu wrote:
> >> On Mon, Aug 24, 2020 at 10:43 PM Xiao Ni <xni@redhat.com> wrote:
> >> [...]
> >>> ---
> >>>   drivers/md/raid10.c | 254
> >>> +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >>>   1 file changed, 253 insertions(+), 1 deletion(-)
> >>>
> >> [...]
> >>> +
> >>> +static void raid10_end_discard_request(struct bio *bio)
> >>> +{
> >>> +       struct r10bio *r10_bio = bio->bi_private;
> >>> +       struct r10conf *conf = r10_bio->mddev->private;
> >>> +       struct md_rdev *rdev = NULL;
> >>> +       int dev;
> >>> +       int slot, repl;
> >>> +
> >>> +       /*
> >>> +        * We don't care the return value of discard bio
> >>> +        */
> >>> +       if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
> >>> +               set_bit(R10BIO_Uptodate, &r10_bio->state);
> >> We don't need the test_bit(), just do set_bit().
> > Coly suggested to do test_bit first to avoid write memory. If there
> > are so many requests and the
> > requests fail, this way can improve performance very much.
> >
> > But it doesn't care the return value of discard bio. So it should be
> > ok that doesn't set R10BIO_Uptodate here.
> > I'll remove these codes. What do you think?
>
> Hi Song
>
> Sorry, for this problem, it still needs to set R10BIO_Uptodate. Because
> in function raid_end_bio_io it needs to use this
> flag to justify whether set BLK_STS_IOERR or not. So is it ok to test
> this bit first before setting this bit here?
>
Let's keep the test_bit() and set_bit().

Please send fixes on top of 1/5. I will handle them when I apply the patch.

Thanks,
Song


Thread overview: 16+ messages
2020-08-25  5:42 [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Xiao Ni
2020-08-25  5:42 ` [PATCH V5 1/5] md/raid10: move codes related with submitting discard bio into one function Xiao Ni
2020-08-25  5:43 ` [PATCH V5 2/5] md/raid10: extend r10bio devs to raid disks Xiao Ni
2020-08-25  5:43 ` [PATCH V5 3/5] md/raid10: pull codes that wait for blocked dev into one function Xiao Ni
2020-08-25  5:43 ` [PATCH V5 4/5] md/raid10: improve raid10 discard request Xiao Ni
2020-08-28 22:16   ` Song Liu
2020-08-31  8:37     ` Xiao Ni
2020-08-31 14:36       ` Xiao Ni
2020-09-01  5:45         ` Song Liu
2020-08-25  5:43 ` [PATCH V5 5/5] md/raid10: improve discard request for far layout Xiao Ni
2020-08-28  7:03   ` Song Liu
2020-08-28  9:50     ` Xiao Ni
2020-08-28 21:44       ` Song Liu
2020-08-25 10:19 ` [PATCH V5 0/5] md/raid10: Improve handling raid10 discard request Michal Soltys
2020-08-25 13:25   ` Xiao Ni
2020-08-25 15:14     ` Michal Soltys
