* [PATCH 0/4] Fix order when split bio and send remaining back to itself
@ 2020-12-29 9:18 dannyshih
2020-12-29 9:18 ` [PATCH 1/4] block: introduce submit_bio_noacct_add_head dannyshih
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: dannyshih @ 2020-12-29 9:18 UTC (permalink / raw)
To: axboe; +Cc: agk, snitzer, dm-devel, song, linux-block, linux-raid, Danny Shih
From: Danny Shih <dannyshih@synology.com>
We found out that split bios might handle not in order when a big bio
had split by blk_queue_split() and also split in stacking block device,
such as md device because chunk size boundary limit.
Stacking block device normally use submit_bio_noacct() add the remaining
bio to current->bio_list's tail after they split original bio. Therefore,
when bio split first time, the last part of bio was add to bio_list.
After then, when bio split second time, the middle part of bio was add to
bio_list. Results that the middle part is now behind the last part of bio.
For example:
There is a RAID0 md device, with max_sectors_kb = 2 KB,
and chunk_size = 1 KB
1. a read bio come to md device wants to read 0-7 KB
2. In blk_queue_split(), bio split into (0-1), (2-7),
and send (2-7) back to md device
current->bio_list = bio_list_on_stack[0]: (md 2-7)
3. RAID0 split bio (0-1) into (0) and (1), since chunk size is 1 KB
and send (1) back to md device
bio_list_on_stack[0]: (md 2-7) -> (md 1)
4. remap and send (0) to lower layer device
bio_list_on_stack[0]: (md 2-7) -> (md 1) -> (lower 0)
5. __submit_bio_noacct() sorting bio let lower bio handle firstly
bio_list_on_stack[0]: (lower 0) -> (md 2-7) -> (md 1)
pop (lower 0)
move bio_list_on_stack[0] to bio_list_on_stack[1]
bio_list_on_stack[1]: (md 2-7) -> (md 1)
6. after handle lower bio, it handle (md 2-7) firstly, and split
in blk_queue_split() into (2-3), (4-7), send (4-7) back
bio_list_on_stack[0]: (md 4-7)
bio_list_on_stack[1]: (md 1)
7. RAID0 split bio (2-3) into (2) and (3) and send (3) back
bio_list_on_stack[0]: (md 4-7) -> (md 3)
bio_list_on_stack[1]: (md 1)
...
In the end, the split bio handle's order will become
0 -> 2 -> 4 -> 6 -> 7 -> 5 -> 3 -> 1
Reverse the order of same queue bio when sorting bio in
__submit_bio_noacct() can solve this issue, but it might influence
too much. So we provide alternative version of submit_bio_noacct(),
named submit_bio_noacct_add_head(), for the case which need to add bio
to the head of current->bio_list. And replace submit_bio_noacct() with
submit_bio_noacct_add_head() in block device layer when we want to
split bio and send remaining back to itself.
Danny Shih (4):
block: introduce submit_bio_noacct_add_head
block: use submit_bio_noacct_add_head for split bio sending back
dm: use submit_bio_noacct_add_head for split bio sending back
md: use submit_bio_noacct_add_head for split bio sending back
block/blk-core.c | 44 +++++++++++++++++++++++++++++++++-----------
block/blk-merge.c | 2 +-
block/bounce.c | 2 +-
drivers/md/dm.c | 2 +-
drivers/md/md-linear.c | 2 +-
drivers/md/raid0.c | 4 ++--
drivers/md/raid1.c | 4 ++--
drivers/md/raid10.c | 4 ++--
drivers/md/raid5.c | 2 +-
include/linux/blkdev.h | 1 +
10 files changed, 45 insertions(+), 22 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-29 9:18 [PATCH 0/4] Fix order when split bio and send remaining back to itself dannyshih
@ 2020-12-29 9:18 ` dannyshih
2020-12-30 0:00 ` John Stoffel
2020-12-29 9:18 ` [PATCH 2/4] block: use submit_bio_noacct_add_head for split bio sending back dannyshih
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: dannyshih @ 2020-12-29 9:18 UTC (permalink / raw)
To: axboe; +Cc: agk, snitzer, dm-devel, song, linux-block, linux-raid, Danny Shih
From: Danny Shih <dannyshih@synology.com>
Porvide a way for stacking block device to re-submit the bio
which sholud be handled firstly.
Signed-off-by: Danny Shih <dannyshih@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
---
block/blk-core.c | 44 +++++++++++++++++++++++++++++++++-----------
include/linux/blkdev.h | 1 +
2 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 96e5fcd..693dc83 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
return ret;
}
-/**
- * submit_bio_noacct - re-submit a bio to the block device layer for I/O
- * @bio: The bio describing the location in memory and on the device.
- *
- * This is a version of submit_bio() that shall only be used for I/O that is
- * resubmitted to lower level drivers by stacking block drivers. All file
- * systems and other upper level users of the block layer should use
- * submit_bio() instead.
- */
-blk_qc_t submit_bio_noacct(struct bio *bio)
+static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
{
if (!submit_bio_checks(bio))
return BLK_QC_T_NONE;
@@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
* it is active, and then process them after it returned.
*/
if (current->bio_list) {
- bio_list_add(¤t->bio_list[0], bio);
+ if (add_head)
+ bio_list_add_head(¤t->bio_list[0], bio);
+ else
+ bio_list_add(¤t->bio_list[0], bio);
return BLK_QC_T_NONE;
}
@@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
return __submit_bio_noacct_mq(bio);
return __submit_bio_noacct(bio);
}
+
+/**
+ * submit_bio_noacct - re-submit a bio to the block device layer for I/O
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level drivers by stacking block drivers. All file
+ * systems and other upper level users of the block layer should use
+ * submit_bio() instead.
+ */
+blk_qc_t submit_bio_noacct(struct bio *bio)
+{
+ return do_submit_bio_noacct(bio, false);
+}
EXPORT_SYMBOL(submit_bio_noacct);
/**
+ * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
+ * to the block device layer for I/O
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * alternative submit_bio_noacct() which add bio to the head of
+ * current->bio_list.
+ */
+blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
+{
+ return do_submit_bio_noacct(bio, true);
+}
+EXPORT_SYMBOL(submit_bio_noacct_add_head);
+
+/**
* submit_bio - submit a bio to the block device layer for I/O
* @bio: The &struct bio which describes the I/O
*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 070de09..b0080d0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -905,6 +905,7 @@ static inline void rq_flush_dcache_pages(struct request *rq)
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
blk_qc_t submit_bio_noacct(struct bio *bio);
+blk_qc_t submit_bio_noacct_add_head(struct bio *bio);
extern void blk_rq_init(struct request_queue *q, struct request *rq);
extern void blk_put_request(struct request *);
extern struct request *blk_get_request(struct request_queue *, unsigned int op,
--
2.7.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/4] block: use submit_bio_noacct_add_head for split bio sending back
2020-12-29 9:18 [PATCH 0/4] Fix order when split bio and send remaining back to itself dannyshih
2020-12-29 9:18 ` [PATCH 1/4] block: introduce submit_bio_noacct_add_head dannyshih
@ 2020-12-29 9:18 ` dannyshih
2020-12-29 9:18 ` [PATCH 3/4] dm: " dannyshih
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: dannyshih @ 2020-12-29 9:18 UTC (permalink / raw)
To: axboe; +Cc: agk, snitzer, dm-devel, song, linux-block, linux-raid, Danny Shih
From: Danny Shih <dannyshih@synology.com>
Use submit_bio_noacct_add_head when sending split bio back to itself.
Otherwise, it might be handled after the lately split bio.
Signed-off-by: Danny Shih <dannyshih@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
---
block/blk-merge.c | 2 +-
block/bounce.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 808768f..e6ddcef0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -347,7 +347,7 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
bio_chain(split, *bio);
trace_block_split(split, (*bio)->bi_iter.bi_sector);
- submit_bio_noacct(*bio);
+ submit_bio_noacct_add_head(*bio);
*bio = split;
}
}
diff --git a/block/bounce.c b/block/bounce.c
index d3f51ac..0b4db65 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -308,7 +308,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
if (!passthrough && sectors < bio_sectors(*bio_orig)) {
bio = bio_split(*bio_orig, sectors, GFP_NOIO, &bounce_bio_split);
bio_chain(bio, *bio_orig);
- submit_bio_noacct(*bio_orig);
+ submit_bio_noacct_add_head(*bio_orig);
*bio_orig = bio;
}
bio = bounce_clone_bio(*bio_orig, GFP_NOIO, passthrough ? NULL :
--
2.7.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/4] dm: use submit_bio_noacct_add_head for split bio sending back
2020-12-29 9:18 [PATCH 0/4] Fix order when split bio and send remaining back to itself dannyshih
2020-12-29 9:18 ` [PATCH 1/4] block: introduce submit_bio_noacct_add_head dannyshih
2020-12-29 9:18 ` [PATCH 2/4] block: use submit_bio_noacct_add_head for split bio sending back dannyshih
@ 2020-12-29 9:18 ` dannyshih
2020-12-29 9:18 ` [PATCH 4/4] md: " dannyshih
2020-12-30 23:34 ` [PATCH 0/4] Fix order when split bio and send remaining back to itself Mike Snitzer
4 siblings, 0 replies; 13+ messages in thread
From: dannyshih @ 2020-12-29 9:18 UTC (permalink / raw)
To: axboe; +Cc: agk, snitzer, dm-devel, song, linux-block, linux-raid, Danny Shih
From: Danny Shih <dannyshih@synology.com>
Use submit_bio_noacct_add_head when sending split bio back to dm device.
Otherwise, it might be handled after the lately split bio.
Signed-off-by: Danny Shih <dannyshih@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
---
drivers/md/dm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b3c3c8b..1a651d5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1613,7 +1613,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
bio_chain(b, bio);
trace_block_split(b, bio->bi_iter.bi_sector);
- ret = submit_bio_noacct(bio);
+ ret = submit_bio_noacct_add_head(bio);
break;
}
}
--
2.7.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/4] md: use submit_bio_noacct_add_head for split bio sending back
2020-12-29 9:18 [PATCH 0/4] Fix order when split bio and send remaining back to itself dannyshih
` (2 preceding siblings ...)
2020-12-29 9:18 ` [PATCH 3/4] dm: " dannyshih
@ 2020-12-29 9:18 ` dannyshih
2020-12-30 23:34 ` [PATCH 0/4] Fix order when split bio and send remaining back to itself Mike Snitzer
4 siblings, 0 replies; 13+ messages in thread
From: dannyshih @ 2020-12-29 9:18 UTC (permalink / raw)
To: axboe; +Cc: agk, snitzer, dm-devel, song, linux-block, linux-raid, Danny Shih
From: Danny Shih <dannyshih@synology.com>
Use submit_bio_noacct_add_head when sending split bio back to md device.
Otherwise, it might be handled after the lately split bio.
Signed-off-by: Danny Shih <dannyshih@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
---
drivers/md/md-linear.c | 2 +-
drivers/md/raid0.c | 4 ++--
drivers/md/raid1.c | 4 ++--
drivers/md/raid10.c | 4 ++--
drivers/md/raid5.c | 2 +-
5 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 68cac7d..24418ee 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -243,7 +243,7 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
struct bio *split = bio_split(bio, end_sector - bio_sector,
GFP_NOIO, &mddev->bio_set);
bio_chain(split, bio);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
bio = split;
}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 67f157f..92e82d5 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -447,7 +447,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
&mddev->bio_set);
bio_chain(split, bio);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
bio = split;
end = zone->zone_end;
} else
@@ -552,7 +552,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
struct bio *split = bio_split(bio, sectors, GFP_NOIO,
&mddev->bio_set);
bio_chain(split, bio);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
bio = split;
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c034799..31cec76 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1282,7 +1282,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
struct bio *split = bio_split(bio, max_sectors,
gfp, &conf->bio_split);
bio_chain(split, bio);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
bio = split;
r1_bio->master_bio = bio;
r1_bio->sectors = max_sectors;
@@ -1453,7 +1453,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
struct bio *split = bio_split(bio, max_sectors,
GFP_NOIO, &conf->bio_split);
bio_chain(split, bio);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
bio = split;
r1_bio->master_bio = bio;
r1_bio->sectors = max_sectors;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c5d88ef..c4dc970 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1177,7 +1177,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
gfp, &conf->bio_split);
bio_chain(split, bio);
allow_barrier(conf);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
wait_barrier(conf);
bio = split;
r10_bio->master_bio = bio;
@@ -1460,7 +1460,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
GFP_NOIO, &conf->bio_split);
bio_chain(split, bio);
allow_barrier(conf);
- submit_bio_noacct(bio);
+ submit_bio_noacct_add_head(bio);
wait_barrier(conf);
bio = split;
r10_bio->master_bio = bio;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3a90cc0..17458ac 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5490,7 +5490,7 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
struct r5conf *conf = mddev->private;
split = bio_split(raid_bio, sectors, GFP_NOIO, &conf->bio_split);
bio_chain(split, raid_bio);
- submit_bio_noacct(raid_bio);
+ submit_bio_noacct_add_head(raid_bio);
raid_bio = split;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-29 9:18 ` [PATCH 1/4] block: introduce submit_bio_noacct_add_head dannyshih
@ 2020-12-30 0:00 ` John Stoffel
2020-12-30 9:51 ` Danny Shih
2020-12-30 11:35 ` antlists
0 siblings, 2 replies; 13+ messages in thread
From: John Stoffel @ 2020-12-30 0:00 UTC (permalink / raw)
To: dannyshih; +Cc: axboe, agk, snitzer, dm-devel, song, linux-block, linux-raid
>>>>> "dannyshih" == dannyshih <dannyshih@synology.com> writes:
dannyshih> From: Danny Shih <dannyshih@synology.com>
dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
You're spelling needs to be fixed in these messages.
dannyshih> Signed-off-by: Danny Shih <dannyshih@synology.com>
dannyshih> Reviewed-by: Allen Peng <allenpeng@synology.com>
dannyshih> Reviewed-by: Alex Wu <alexwu@synology.com>
dannyshih> ---
dannyshih> block/blk-core.c | 44 +++++++++++++++++++++++++++++++++-----------
dannyshih> include/linux/blkdev.h | 1 +
dannyshih> 2 files changed, 34 insertions(+), 11 deletions(-)
dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
dannyshih> index 96e5fcd..693dc83 100644
dannyshih> --- a/block/blk-core.c
dannyshih> +++ b/block/blk-core.c
dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
dannyshih> return ret;
dannyshih> }
dannyshih> -/**
dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> - * @bio: The bio describing the location in memory and on the device.
dannyshih> - *
dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> - * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> - * systems and other upper level users of the block layer should use
dannyshih> - * submit_bio() instead.
dannyshih> - */
dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
dannyshih> {
dannyshih> if (!submit_bio_checks(bio))
dannyshih> return BLK_QC_T_NONE;
dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> * it is active, and then process them after it returned.
dannyshih> */
dannyshih> if (current->bio_list) {
dannyshih> - bio_list_add(¤t->bio_list[0], bio);
dannyshih> + if (add_head)
dannyshih> + bio_list_add_head(¤t->bio_list[0], bio);
dannyshih> + else
dannyshih> + bio_list_add(¤t->bio_list[0], bio);
dannyshih> return BLK_QC_T_NONE;
dannyshih> }
dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> return __submit_bio_noacct_mq(bio);
dannyshih> return __submit_bio_noacct(bio);
dannyshih> }
dannyshih> +
dannyshih> +/**
dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> + * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> + * systems and other upper level users of the block layer should use
dannyshih> + * submit_bio() instead.
dannyshih> + */
dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +{
dannyshih> + return do_submit_bio_noacct(bio, false);
dannyshih> +}
dannyshih> EXPORT_SYMBOL(submit_bio_noacct);
So why is it named "submit_bio_noacct" when it's supposed to be only
used by layers submitting to lower level drivers. How can this be
figured out by drivers automatically, so the writed doesn't have to
know about this?
dannyshih> /**
dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
dannyshih> + * to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
dannyshih> + * current->bio_list.
dannyshih> + */
Firstly isn't proper english. Maybe something like:
submit_bio_noacct - re-submit a bio which needs to be handled first
because <reasons> to the block device layer for I/O
But the name still sucks, and the *reason* the bio needs to be handled
differently isn't well explained.
dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
dannyshih> +{
dannyshih> + return do_submit_bio_noacct(bio, true);
dannyshih> +}
dannyshih> +EXPORT_SYMBOL(submit_bio_noacct_add_head);
dannyshih> +
dannyshih> +/**
dannyshih> * submit_bio - submit a bio to the block device layer for I/O
dannyshih> * @bio: The &struct bio which describes the I/O
dannyshih> *
dannyshih> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
dannyshih> index 070de09..b0080d0 100644
dannyshih> --- a/include/linux/blkdev.h
dannyshih> +++ b/include/linux/blkdev.h
dannyshih> @@ -905,6 +905,7 @@ static inline void rq_flush_dcache_pages(struct request *rq)
dannyshih> extern int blk_register_queue(struct gendisk *disk);
dannyshih> extern void blk_unregister_queue(struct gendisk *disk);
dannyshih> blk_qc_t submit_bio_noacct(struct bio *bio);
dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio);
dannyshih> extern void blk_rq_init(struct request_queue *q, struct request *rq);
dannyshih> extern void blk_put_request(struct request *);
dannyshih> extern struct request *blk_get_request(struct request_queue *, unsigned int op,
dannyshih> --
dannyshih> 2.7.4
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-30 0:00 ` John Stoffel
@ 2020-12-30 9:51 ` Danny Shih
2020-12-30 17:06 ` John Stoffel
2020-12-30 11:35 ` antlists
1 sibling, 1 reply; 13+ messages in thread
From: Danny Shih @ 2020-12-30 9:51 UTC (permalink / raw)
To: John Stoffel; +Cc: axboe, agk, snitzer, dm-devel, song, linux-block, linux-raid
Hi, John,
Thank you for taking the time to write a review.
John Stoffel <john@stoffel.org> writes:
>>>>>> "dannyshih" == dannyshih <dannyshih@synology.com> writes:
> dannyshih> From: Danny Shih <dannyshih@synology.com>
> dannyshih> Porvide a way for stacking block device to re-submit the bio
> dannyshih> which sholud be handled firstly.
>
> You're spelling needs to be fixed in these messages.
Sorry for so many spelling errors.
The message should be
"Provide a way for stacking block device to re-submit
the bio which should be handled first."
I will fix it.
> dannyshih> Signed-off-by: Danny Shih <dannyshih@synology.com>
> dannyshih> Reviewed-by: Allen Peng <allenpeng@synology.com>
> dannyshih> Reviewed-by: Alex Wu <alexwu@synology.com>
> dannyshih> ---
> dannyshih> block/blk-core.c | 44 +++++++++++++++++++++++++++++++++-----------
> dannyshih> include/linux/blkdev.h | 1 +
> dannyshih> 2 files changed, 34 insertions(+), 11 deletions(-)
>
> dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
> dannyshih> index 96e5fcd..693dc83 100644
> dannyshih> --- a/block/blk-core.c
> dannyshih> +++ b/block/blk-core.c
> dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> dannyshih> return ret;
> dannyshih> }
>
> dannyshih> -/**
> dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> dannyshih> - * @bio: The bio describing the location in memory and on the device.
> dannyshih> - *
> dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
> dannyshih> - * resubmitted to lower level drivers by stacking block drivers. All file
> dannyshih> - * systems and other upper level users of the block layer should use
> dannyshih> - * submit_bio() instead.
> dannyshih> - */
> dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
> dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
> dannyshih> {
> dannyshih> if (!submit_bio_checks(bio))
> dannyshih> return BLK_QC_T_NONE;
> dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
> dannyshih> * it is active, and then process them after it returned.
> dannyshih> */
> dannyshih> if (current->bio_list) {
> dannyshih> - bio_list_add(¤t->bio_list[0], bio);
> dannyshih> + if (add_head)
> dannyshih> + bio_list_add_head(¤t->bio_list[0], bio);
> dannyshih> + else
> dannyshih> + bio_list_add(¤t->bio_list[0], bio);
> dannyshih> return BLK_QC_T_NONE;
> dannyshih> }
>
> dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
> dannyshih> return __submit_bio_noacct_mq(bio);
> dannyshih> return __submit_bio_noacct(bio);
> dannyshih> }
> dannyshih> +
> dannyshih> +/**
> dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> dannyshih> + * @bio: The bio describing the location in memory and on the device.
> dannyshih> + *
> dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
> dannyshih> + * resubmitted to lower level drivers by stacking block drivers. All file
> dannyshih> + * systems and other upper level users of the block layer should use
> dannyshih> + * submit_bio() instead.
> dannyshih> + */
> dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
> dannyshih> +{
> dannyshih> + return do_submit_bio_noacct(bio, false);
> dannyshih> +}
> dannyshih> EXPORT_SYMBOL(submit_bio_noacct);
>
> So why is it named "submit_bio_noacct" when it's supposed to be only
> used by layers submitting to lower level drivers. How can this be
> figured out by drivers automatically, so the writed doesn't have to
> know about this?
There is no logical change while using submit_bio_noacct() after my
patch. So I didn't change
the name and the documentation of submit_bio_noacct().
>
> dannyshih> /**
> dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
> dannyshih> + * to the block device layer for I/O
> dannyshih> + * @bio: The bio describing the location in memory and on the device.
> dannyshih> + *
> dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
> dannyshih> + * current->bio_list.
> dannyshih> + */
>
> Firstly isn't proper english. Maybe something like:
>
> submit_bio_noacct - re-submit a bio which needs to be handled first
> because <reasons> to the block device layer for I/O
>
> But the name still sucks, and the *reason* the bio needs to be handled
> differently isn't well explained.
Sorry for the grammar mistake. And I wrote the wrong function name here.
It should be replaced by submit_bio_noacct_add_head.
About the function name, the name of submit_bio_noacct_add_head()
is trying to let drivers know that this is just an alternative version of
submit_bio_noacct(). The only difference is that this function adds bio to
the head of current->bio_list, and submit_bio_noacct() adds it to the tail.
About the documentation, what if I change it like:
"submit_bio_noacct_add_head - re-submit a bio which needs to
be handled first to the block device layer for I/O, because it has
sequential relevance with the bio handling in current ->submit_bio.
Alternative submit_bio_noacct() adds bio to the head of
current->bio_list. To keep bio sequence, this function is used
when a block device splits bio and re-submits the remainder back
to itself. This makes sure that the re-submitted bio will be handle
just after the split part of the original bio."
Thanks for your suggestion.
> dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
> dannyshih> +{
> dannyshih> + return do_submit_bio_noacct(bio, true);
> dannyshih> +}
> dannyshih> +EXPORT_SYMBOL(submit_bio_noacct_add_head);
> dannyshih> +
> dannyshih> +/**
> dannyshih> * submit_bio - submit a bio to the block device layer for I/O
> dannyshih> * @bio: The &struct bio which describes the I/O
> dannyshih> *
> dannyshih> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> dannyshih> index 070de09..b0080d0 100644
> dannyshih> --- a/include/linux/blkdev.h
> dannyshih> +++ b/include/linux/blkdev.h
> dannyshih> @@ -905,6 +905,7 @@ static inline void rq_flush_dcache_pages(struct request *rq)
> dannyshih> extern int blk_register_queue(struct gendisk *disk);
> dannyshih> extern void blk_unregister_queue(struct gendisk *disk);
> dannyshih> blk_qc_t submit_bio_noacct(struct bio *bio);
> dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio);
> dannyshih> extern void blk_rq_init(struct request_queue *q, struct request *rq);
> dannyshih> extern void blk_put_request(struct request *);
> dannyshih> extern struct request *blk_get_request(struct request_queue *, unsigned int op,
> dannyshih> --
> dannyshih> 2.7.4
Best Regards,
Danny Shih
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-30 0:00 ` John Stoffel
2020-12-30 9:51 ` Danny Shih
@ 2020-12-30 11:35 ` antlists
2020-12-30 16:53 ` John Stoffel
1 sibling, 1 reply; 13+ messages in thread
From: antlists @ 2020-12-30 11:35 UTC (permalink / raw)
To: John Stoffel, dannyshih
Cc: axboe, agk, snitzer, dm-devel, song, linux-block, linux-raid
On 30/12/2020 00:00, John Stoffel wrote:
> dannyshih> From: Danny Shih<dannyshih@synology.com>
> dannyshih> Porvide a way for stacking block device to re-submit the bio
> dannyshih> which sholud be handled firstly.
>
> You're spelling needs to be fixed in these messages.
^^^^^^
It is traditional, when correcting someone else's spelling, to make one
of your own ... :-)
Sorry, but if we're being pedantic for furriners, it behoves us to be
correct ourselves :-)
Cheers,
Wol
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-30 11:35 ` antlists
@ 2020-12-30 16:53 ` John Stoffel
0 siblings, 0 replies; 13+ messages in thread
From: John Stoffel @ 2020-12-30 16:53 UTC (permalink / raw)
To: antlists
Cc: John Stoffel, dannyshih, axboe, agk, snitzer, dm-devel, song,
linux-block, linux-raid
>>>>> "antlists" == antlists <antlists@youngman.org.uk> writes:
antlists> On 30/12/2020 00:00, John Stoffel wrote:
dannyshih> From: Danny Shih<dannyshih@synology.com>
dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
>>
>> You're spelling needs to be fixed in these messages.
antlists> ^^^^^^
antlists> It is traditional, when correcting someone else's spelling,
antlists> to make one of your own ... :-)
LOL! Yes, touche! I'm completely abashed.
antlists> Sorry, but if we're being pedantic for furriners, it behoves
antlists> us to be correct ourselves :-)
It does for sure, thanks for the nudge.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-30 9:51 ` Danny Shih
@ 2020-12-30 17:06 ` John Stoffel
2020-12-30 17:53 ` antlists
0 siblings, 1 reply; 13+ messages in thread
From: John Stoffel @ 2020-12-30 17:06 UTC (permalink / raw)
To: Danny Shih
Cc: John Stoffel, axboe, agk, snitzer, dm-devel, song, linux-block,
linux-raid
>>>>> "Danny" == Danny Shih <dannyshih@synology.com> writes:
Danny> Hi, John,
Danny> Thank you for taking the time to write a review.
Danny> John Stoffel <john@stoffel.org> writes:
>>>>>>> "dannyshih" == dannyshih <dannyshih@synology.com> writes:
dannyshih> From: Danny Shih <dannyshih@synology.com>
dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
>>
>> You're spelling needs to be fixed in these messages.
Danny> Sorry for so many spelling errors.
Danny> The message should be
Danny> "Provide a way for stacking block device to re-submit
Danny> the bio which should be handled first."
Danny> I will fix it.
Great, though my second question is *why* it needs to be handled
first? What is the difference between stacked and un-stacked devices
and how could it be done in a way that doesn't require a seperate
function like this?
dannyshih> Signed-off-by: Danny Shih <dannyshih@synology.com>
dannyshih> Reviewed-by: Allen Peng <allenpeng@synology.com>
dannyshih> Reviewed-by: Alex Wu <alexwu@synology.com>
dannyshih> ---
dannyshih> block/blk-core.c | 44 +++++++++++++++++++++++++++++++++-----------
dannyshih> include/linux/blkdev.h | 1 +
dannyshih> 2 files changed, 34 insertions(+), 11 deletions(-)
>>
dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
dannyshih> index 96e5fcd..693dc83 100644
dannyshih> --- a/block/blk-core.c
dannyshih> +++ b/block/blk-core.c
dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
dannyshih> return ret;
dannyshih> }
>>
dannyshih> -/**
dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> - * @bio: The bio describing the location in memory and on the device.
dannyshih> - *
dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> - * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> - * systems and other upper level users of the block layer should use
dannyshih> - * submit_bio() instead.
dannyshih> - */
dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
dannyshih> {
dannyshih> if (!submit_bio_checks(bio))
dannyshih> return BLK_QC_T_NONE;
dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> * it is active, and then process them after it returned.
dannyshih> */
dannyshih> if (current->bio_list) {
dannyshih> - bio_list_add(¤t->bio_list[0], bio);
dannyshih> + if (add_head)
dannyshih> + bio_list_add_head(¤t->bio_list[0], bio);
dannyshih> + else
dannyshih> + bio_list_add(¤t->bio_list[0], bio);
dannyshih> return BLK_QC_T_NONE;
dannyshih> }
>>
dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> return __submit_bio_noacct_mq(bio);
dannyshih> return __submit_bio_noacct(bio);
dannyshih> }
dannyshih> +
dannyshih> +/**
dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> + * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> + * systems and other upper level users of the block layer should use
dannyshih> + * submit_bio() instead.
dannyshih> + */
dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +{
dannyshih> + return do_submit_bio_noacct(bio, false);
dannyshih> +}
dannyshih> EXPORT_SYMBOL(submit_bio_noacct);
>>
>> So why is it named "submit_bio_noacct" when it's supposed to be only
>> used by layers submitting to lower level drivers. How can this be
>> figured out by drivers automatically, so the writed doesn't have to
>> know about this?
Danny> There is no logical change while using submit_bio_noacct() after my
Danny> patch. So I didn't change
Danny> the name and the documentation of submit_bio_noacct().
>>
dannyshih> /**
dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
dannyshih> + * to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
dannyshih> + * current->bio_list.
dannyshih> + */
>>
>> Firstly isn't proper english. Maybe something like:
>>
>> submit_bio_noacct - re-submit a bio which needs to be handled first
>> because <reasons> to the block device layer for I/O
>>
>> But the name still sucks, and the *reason* the bio needs to be handled
>> differently isn't well explained.
Danny> Sorry for the grammar mistake. And I wrote the wrong function name here.
Danny> It should be replaced by submit_bio_noacct_add_head.
Danny> About the function name, the name of submit_bio_noacct_add_head()
Danny> is trying to let drivers know that this is just an alternative version of
Danny> submit_bio_noacct(). The only difference is that this function adds bio to
Danny> the head of current->bio_list, and submit_bio_noacct() adds it to the tail.
Danny> About the documentation, what if I change it like:
Danny> "submit_bio_noacct_add_head - re-submit a bio which needs to
Danny> be handled first to the block device layer for I/O, because it has
Danny> sequential relevance with the bio handling in current ->submit_bio.
Danny> Alternative submit_bio_noacct() adds bio to the head of
current-> bio_list. To keep bio sequence, this function is used
Danny> when a block device splits bio and re-submits the remainder back
Danny> to itself. This makes sure that the re-submitted bio will be handle
Danny> just after the split part of the original bio."
Danny> Thanks for your suggestion.
dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
dannyshih> +{
dannyshih> + return do_submit_bio_noacct(bio, true);
dannyshih> +}
dannyshih> +EXPORT_SYMBOL(submit_bio_noacct_add_head);
dannyshih> +
dannyshih> +/**
dannyshih> * submit_bio - submit a bio to the block device layer for I/O
dannyshih> * @bio: The &struct bio which describes the I/O
dannyshih> *
dannyshih> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
dannyshih> index 070de09..b0080d0 100644
dannyshih> --- a/include/linux/blkdev.h
dannyshih> +++ b/include/linux/blkdev.h
dannyshih> @@ -905,6 +905,7 @@ static inline void rq_flush_dcache_pages(struct request *rq)
dannyshih> extern int blk_register_queue(struct gendisk *disk);
dannyshih> extern void blk_unregister_queue(struct gendisk *disk);
dannyshih> blk_qc_t submit_bio_noacct(struct bio *bio);
dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio);
dannyshih> extern void blk_rq_init(struct request_queue *q, struct request *rq);
dannyshih> extern void blk_put_request(struct request *);
dannyshih> extern struct request *blk_get_request(struct request_queue *, unsigned int op,
dannyshih> --
dannyshih> 2.7.4
Danny> Best Regards,
Danny> Danny Shih
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] block: introduce submit_bio_noacct_add_head
2020-12-30 17:06 ` John Stoffel
@ 2020-12-30 17:53 ` antlists
0 siblings, 0 replies; 13+ messages in thread
From: antlists @ 2020-12-30 17:53 UTC (permalink / raw)
To: John Stoffel, Danny Shih
Cc: axboe, agk, snitzer, dm-devel, song, linux-block, linux-raid
On 30/12/2020 17:06, John Stoffel wrote:
> Danny> "Provide a way for stacking block device to re-submit
>
> Danny> the bio which should be handled first."
>
> Danny> I will fix it.
>
> Great, though my second question is*why* it needs to be handled
> first? What is the difference between stacked and un-stacked devices
> and how could it be done in a way that doesn't require a seperate
> function like this?
Is this anything to do with what's on my mind as a database guy? I've
heard that things like RAID and LVM have difficulty providing write
barriers.
I want to be confident that, at EVERY level of the stack, I can put a
barrier in that guarantees that everything written by user-space BEFORE
the barrier is handled before anything written AFTER the barrier. That
way, I can be confident that, after a problem, I know whether I can
safely roll the log forward or back.
Cheers,
Wol
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/4] Fix order when split bio and send remaining back to itself
2020-12-29 9:18 [PATCH 0/4] Fix order when split bio and send remaining back to itself dannyshih
` (3 preceding siblings ...)
2020-12-29 9:18 ` [PATCH 4/4] md: " dannyshih
@ 2020-12-30 23:34 ` Mike Snitzer
2020-12-31 8:28 ` Danny Shih
4 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2020-12-30 23:34 UTC (permalink / raw)
To: dannyshih; +Cc: axboe, agk, dm-devel, song, linux-block, linux-raid
On Tue, Dec 29 2020 at 4:18am -0500,
dannyshih <dannyshih@synology.com> wrote:
> From: Danny Shih <dannyshih@synology.com>
>
> We found out that split bios might handle not in order when a big bio
> had split by blk_queue_split() and also split in stacking block device,
> such as md device because chunk size boundary limit.
>
> Stacking block device normally use submit_bio_noacct() add the remaining
> bio to current->bio_list's tail after they split original bio. Therefore,
> when bio split first time, the last part of bio was add to bio_list.
> After then, when bio split second time, the middle part of bio was add to
> bio_list. Results that the middle part is now behind the last part of bio.
>
> For example:
> There is a RAID0 md device, with max_sectors_kb = 2 KB,
> and chunk_size = 1 KB
>
> 1. a read bio come to md device wants to read 0-7 KB
> 2. In blk_queue_split(), bio split into (0-1), (2-7),
> and send (2-7) back to md device
>
> current->bio_list = bio_list_on_stack[0]: (md 2-7)
> 3. RAID0 split bio (0-1) into (0) and (1), since chunk size is 1 KB
> and send (1) back to md device
>
> bio_list_on_stack[0]: (md 2-7) -> (md 1)
> 4. remap and send (0) to lower layer device
>
> bio_list_on_stack[0]: (md 2-7) -> (md 1) -> (lower 0)
> 5. __submit_bio_noacct() sorting bio let lower bio handle firstly
> bio_list_on_stack[0]: (lower 0) -> (md 2-7) -> (md 1)
> pop (lower 0)
> move bio_list_on_stack[0] to bio_list_on_stack[1]
>
> bio_list_on_stack[1]: (md 2-7) -> (md 1)
> 6. after handle lower bio, it handle (md 2-7) firstly, and split
> in blk_queue_split() into (2-3), (4-7), send (4-7) back
>
> bio_list_on_stack[0]: (md 4-7)
> bio_list_on_stack[1]: (md 1)
> 7. RAID0 split bio (2-3) into (2) and (3) and send (3) back
>
> bio_list_on_stack[0]: (md 4-7) -> (md 3)
> bio_list_on_stack[1]: (md 1)
> ...
> In the end, the split bio handle's order will become
> 0 -> 2 -> 4 -> 6 -> 7 -> 5 -> 3 -> 1
>
> Reverse the order of same queue bio when sorting bio in
> __submit_bio_noacct() can solve this issue, but it might influence
> too much. So we provide alternative version of submit_bio_noacct(),
> named submit_bio_noacct_add_head(), for the case which need to add bio
> to the head of current->bio_list. And replace submit_bio_noacct() with
> submit_bio_noacct_add_head() in block device layer when we want to
> split bio and send remaining back to itself.
Ordering aside, you cannot split more than once. So your proposed fix
to insert at head isn't valid because you're still implicitly allocating
more than one bio from the bioset which could cause deadlock in a low
memory situation.
I had to deal with a comparable issue with DM core not too long ago, see
this commit:
commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a
Author: Mike Snitzer <snitzer@redhat.com>
Date: Mon Sep 14 13:04:19 2020 -0400
dm: fix bio splitting and its bio completion order for regular IO
dm_queue_split() is removed because __split_and_process_bio() _must_
handle splitting bios to ensure proper bio submission and completion
ordering as a bio is split.
Otherwise, multiple recursive calls to ->submit_bio will cause multiple
split bios to be allocated from the same ->bio_split mempool at the same
time. This would result in deadlock in low memory conditions because no
progress could be made (only one bio is available in ->bio_split
mempool).
This fix has been verified to still fix the loss of performance, due
to excess splitting, that commit 120c9257f5f1 provided.
Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Basically you cannot split the same bio more than once without
recursing. Your elaborate documentation shows things going wrong quite
early in step 3. That additional split and recursing back to MD
shouldn't happen before the first bio split completes.
Seems the proper fix is to disallow max_sectors_kb to be imposed, via
blk_queue_split(), if MD has further splitting constraints, via
chunk_sectors, that negate max_sectors_kb anyway.
Mike
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/4] Fix order when split bio and send remaining back to itself
2020-12-30 23:34 ` [PATCH 0/4] Fix order when split bio and send remaining back to itself Mike Snitzer
@ 2020-12-31 8:28 ` Danny Shih
0 siblings, 0 replies; 13+ messages in thread
From: Danny Shih @ 2020-12-31 8:28 UTC (permalink / raw)
To: Mike Snitzer; +Cc: axboe, agk, dm-devel, song, linux-block, linux-raid
Mike Snitzer writes:
>> submit_bio_noacct_add_head() in block device layer when we want to
>> split bio and send remaining back to itself.
> Ordering aside, you cannot split more than once. So your proposed fix
> to insert at head isn't valid because you're still implicitly allocating
> more than one bio from the bioset which could cause deadlock in a low
> memory situation.
>
> I had to deal with a comparable issue with DM core not too long ago, see
> this commit:
>
> commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Mon Sep 14 13:04:19 2020 -0400
>
> dm: fix bio splitting and its bio completion order for regular IO
>
> dm_queue_split() is removed because __split_and_process_bio() _must_
> handle splitting bios to ensure proper bio submission and completion
> ordering as a bio is split.
>
> Otherwise, multiple recursive calls to ->submit_bio will cause multiple
> split bios to be allocated from the same ->bio_split mempool at the same
> time. This would result in deadlock in low memory conditions because no
> progress could be made (only one bio is available in ->bio_split
> mempool).
>
> This fix has been verified to still fix the loss of performance, due
> to excess splitting, that commit 120c9257f5f1 provided.
>
> Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
> Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>
> Basically you cannot split the same bio more than once without
> recursing. Your elaborate documentation shows things going wrong quite
> early in step 3. That additional split and recursing back to MD
> shouldn't happen before the first bio split completes.
>
> Seems the proper fix is to disallow max_sectors_kb to be imposed, via
> blk_queue_split(), if MD has further splitting constraints, via
> chunk_sectors, that negate max_sectors_kb anyway.
>
> Mike
Hi Mike,
I think you're right that a driver should not split the same bio more
than once without recursing when using the same mempool.
If a driver only split bio once, the out-of-order issue no longer exists.
(Therefore, this problem won't occur on DM device.)
But the MD devices are using their private bioset (mddev->bio_set
or conf->bio_split) for splitting by themselves that are not the same
bioset used in blk_queue_split() (i.e. q->bio_split). The deadlock
you have mentioned might not happen to them.
I think there are two solutions:
1. In case MD devices want to change to use q->bio_split someday
without this out-of-order issue, make them do split once would be
a solution.
2. If MD devices should split the bio twice, so we can separately handle
limits in blk_queue_split() and each raid level's (raid0, raid5,
raid1, ...).
I will try to find another solution in this case.
My proposal is not suitable after I reconsider the problem:
If a bio is split into A part and B part.
+------|------+
| A | B |
+------|------+
I think a driver should make sure A part is always handled before B
part.
Inserting bio at head of current->bio_list and submitting bio in the
same
time while handling A part could make bios generated from A part be
handled before B part. This broke the order of those bios that generated
form A part.
(Maybe I should find a way to make B part at the head of
bio_list_on_stack[1]
while submitting it...)
Thanks for your comments.
I will try to figure out a better way to fix it in the next version.
Best regards,
Danny Shih
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2020-12-31 8:29 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-29 9:18 [PATCH 0/4] Fix order when split bio and send remaining back to itself dannyshih
2020-12-29 9:18 ` [PATCH 1/4] block: introduce submit_bio_noacct_add_head dannyshih
2020-12-30 0:00 ` John Stoffel
2020-12-30 9:51 ` Danny Shih
2020-12-30 17:06 ` John Stoffel
2020-12-30 17:53 ` antlists
2020-12-30 11:35 ` antlists
2020-12-30 16:53 ` John Stoffel
2020-12-29 9:18 ` [PATCH 2/4] block: use submit_bio_noacct_add_head for split bio sending back dannyshih
2020-12-29 9:18 ` [PATCH 3/4] dm: " dannyshih
2020-12-29 9:18 ` [PATCH 4/4] md: " dannyshih
2020-12-30 23:34 ` [PATCH 0/4] Fix order when split bio and send remaining back to itself Mike Snitzer
2020-12-31 8:28 ` Danny Shih
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).