Linux-Block Archive on lore.kernel.org
* fix nr_phys_segments vs iterators accounting
@ 2019-05-13  6:37 Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 01/10] block: don't decrement nr_phys_segments for physically contiguous segments Christoph Hellwig
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

Hi all,

we have had a problem for a while where the number of segments that
the bvec iterators will iterate over doesn't match the value in
req->nr_phys_segments, causing problems for anyone looking at
nr_phys_segments and iterating over bvecs directly instead of using
blk_rq_map_sg.  The first patch in this series fixes this by
making sure nr_phys_segments matches the actual number of segments.
Drivers using blk_rq_map_sg will eventually still get the lower
number returned from that function, but because we no longer reduce
the value earlier we may reject some merges that we could otherwise
have allowed.
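
To illustrate the mismatch with a toy model (simplified stand-ins for
struct bio_vec, not the kernel's real types or helpers): an accounting
pass that collapses physically contiguous entries reports fewer
segments than an iterator that visits every entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a bio_vec: a physical address
 * plus a length. */
struct seg {
	unsigned long phys;
	unsigned int len;
};

/* What the bvec iterators see: one segment per entry. */
static unsigned int count_iterated(const struct seg *v, size_t n)
{
	return (unsigned int)n;
}

/* What the old merge accounting computed: physically contiguous
 * neighbours collapse into a single segment. */
static unsigned int count_merged(const struct seg *v, size_t n)
{
	unsigned int nsegs = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i && v[i - 1].phys + v[i - 1].len == v[i].phys)
			continue;	/* contiguous with the previous entry */
		nsegs++;
	}
	return nsegs;
}
```

For three entries where the first two are physically contiguous, the
iterator reports 3 segments while the merged count is 2, which is
exactly the req->nr_phys_segments mismatch described above.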

With that in place I also noticed that we do not properly account
segment sizes on devices with a virt_boundary, but it turns out that
segment sizes fundamentally don't make sense for such devices, as
their "segment" is a fixed-size "device page" and not a variable-sized
scatter/gather element as in the block layer, so we make that fact
formal.

Once all that is sorted out it is pretty clear that there is no
good reason to keep the front/back segment size accounting, and that
the bio nr_phys_segments is only needed in the submission path,
so we can remove three members of struct bio.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 01/10] block: don't decrement nr_phys_segments for physically contiguous segments
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13  9:45   ` Ming Lei
  2019-05-13  6:37 ` [PATCH 02/10] block: force an unlimited segment size on queues with a virt boundary Christoph Hellwig
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

Currently ll_merge_requests_fn, unlike all other merge functions,
reduces nr_phys_segments by one if the last segment of the previous
request and the first segment of the next request are physically
contiguous.  While this seems like a nice solution to avoid building
smaller than possible requests, it causes a mismatch between the
segments actually present in the request and those iterated over by
the bvec iterators, including __rq_for_each_bio.  This can cause
overwrites of too-small kmalloc allocations in any driver using ranged
discards, or mistrigger the single segment optimization in the
nvme-pci driver.

We could possibly work around this by making the bvec iterators take
the front and back segment sizes into account, but that would require
moving them from the bio to the bio_iter and spreading this mess
over all users of bvecs.  Or we could simply remove this optimization
under the assumption that most users already build good enough bvecs,
and that the bio merge path never cared about this optimization
either.  The latter is what this patch does.
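
As a sketch of the overwrite hazard (a toy simulation, not any real
driver's code): a ranged-discard driver that sizes its descriptor
allocation from nr_phys_segments but emits one descriptor per iterated
bvec writes past the end of the allocation whenever the two counts
disagree.

```c
#include <assert.h>

/* Toy simulation: count how many descriptor writes would land past an
 * allocation sized from nr_phys_segments when the iterators actually
 * produce nr_iterated entries.  Any nonzero result would mean heap
 * corruption in a real driver. */
static unsigned int writes_past_allocation(unsigned int nr_phys_segments,
					   unsigned int nr_iterated)
{
	unsigned int i, past_end = 0;

	for (i = 0; i < nr_iterated; i++)
		if (i >= nr_phys_segments)
			past_end++;	/* write beyond the allocation */
	return past_end;
}
```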

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..80a5a0facb87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,7 +358,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
-	bool new_bio = false;
 
 	if (!bio)
 		return 0;
@@ -379,31 +378,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	nr_phys_segs = 0;
 	for_each_bio(bio) {
 		bio_for_each_bvec(bv, bio, iter) {
-			if (new_bio) {
-				if (seg_size + bv.bv_len
-				    > queue_max_segment_size(q))
-					goto new_segment;
-				if (!biovec_phys_mergeable(q, &bvprv, &bv))
-					goto new_segment;
-
-				seg_size += bv.bv_len;
-
-				if (nr_phys_segs == 1 && seg_size >
-						front_seg_size)
-					front_seg_size = seg_size;
-
-				continue;
-			}
-new_segment:
 			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
 					&front_seg_size, NULL, UINT_MAX);
-			new_bio = false;
 		}
 		bbio = bio;
-		if (likely(bio->bi_iter.bi_size)) {
+		if (likely(bio->bi_iter.bi_size))
 			bvprv = bv;
-			new_bio = true;
-		}
 	}
 
 	fbio->bi_seg_front_size = front_seg_size;
@@ -725,7 +705,6 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 			req->bio->bi_seg_front_size = seg_size;
 		if (next->nr_phys_segments == 1)
 			next->biotail->bi_seg_back_size = seg_size;
-		total_phys_segments--;
 	}
 
 	if (total_phys_segments > queue_max_segments(q))
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 02/10] block: force an unlimited segment size on queues with a virt boundary
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 01/10] block: don't decrement nr_phys_segments for physically contiguous segments Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-15  8:19   ` Ming Lei
  2019-05-13  6:37 ` [PATCH 03/10] block: remove the segment size check in bio_will_gap Christoph Hellwig
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

We currently fail to update the front/back segment size in the bio when
deciding to allow an otherwise gappy segment on a device with a
virt boundary.  The reason this did not cause problems is that
devices with a virt boundary fundamentally don't use segments as we
know them and thus don't care.  Make that assumption formal by forcing
an unlimited segment size in this case.
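
A toy illustration of the "device page" point (assuming a power-of-two
boundary; this is not the kernel's implementation): for a
virt_boundary device the descriptor count is determined purely by how
many fixed-size device pages a byte range touches, so a segment-size
limit has nothing left to constrain.

```c
#include <assert.h>

/* How many fixed-size "device pages" a byte range touches, where
 * mask == device_page_size - 1 (power of two assumed).  A toy model
 * of a virt_boundary device's view of an I/O. */
static unsigned int device_pages(unsigned long addr, unsigned int len,
				 unsigned long mask)
{
	unsigned long first = addr & ~mask;
	unsigned long last = (addr + len - 1) & ~mask;

	return (unsigned int)((last - first) / (mask + 1)) + 1;
}
```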

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-settings.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3facc41476be..2ae348c101a0 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -310,6 +310,9 @@ void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
 		       __func__, max_size);
 	}
 
+	/* see blk_queue_virt_boundary() for the explanation */
+	WARN_ON_ONCE(q->limits.virt_boundary_mask);
+
 	q->limits.max_segment_size = max_size;
 }
 EXPORT_SYMBOL(blk_queue_max_segment_size);
@@ -742,6 +745,14 @@ EXPORT_SYMBOL(blk_queue_segment_boundary);
 void blk_queue_virt_boundary(struct request_queue *q, unsigned long mask)
 {
 	q->limits.virt_boundary_mask = mask;
+
+	/*
+	 * Devices that require a virtual boundary do not support scatter/gather
+	 * I/O natively, but instead require a descriptor list entry for each
+	 * page (which might not be identical to the Linux PAGE_SIZE).  Because
+	 * of that they are not limited by our notion of "segment size".
+	 */
+	q->limits.max_segment_size = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_queue_virt_boundary);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 03/10] block: remove the segment size check in bio_will_gap
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 01/10] block: don't decrement nr_phys_segments for physically contiguous segments Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 02/10] block: force an unlimited segment size on queues with a virt boundary Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-15  8:34   ` Ming Lei
  2019-05-13  6:37 ` [PATCH 04/10] block: remove the bi_seg_{front,back}_size fields in struct bio Christoph Hellwig
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

We fundamentally do not have a maximum segment size for devices with a
virt boundary.  So don't bother checking it, especially given that the
existing checks didn't properly work to start with, as we never update
bi_seg_back_size after a successful merge, and front merges would
have had to check bi_seg_front_size anyway.
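
The check that remains after this patch can be sketched as follows (a
simplified model loosely based on __bvec_gap_to_prev(); the offsets
and mask are toy values): two vectors avoid a gap on a virt_boundary
device only if the first ends on the boundary and the second starts on
it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy version of the virt_boundary gap check, loosely modelled on
 * __bvec_gap_to_prev(): a gap exists unless the previous vector ends
 * on the boundary and the next one starts on it. */
static bool ranges_gap(unsigned int prev_offset, unsigned int prev_len,
		       unsigned int next_offset, unsigned long mask)
{
	return (next_offset & mask) ||
		((prev_offset + prev_len) & mask);
}
```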

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 80a5a0facb87..eee2c02c50ce 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -12,23 +12,6 @@
 
 #include "blk.h"
 
-/*
- * Check if the two bvecs from two bios can be merged to one segment.  If yes,
- * no need to check gap between the two bios since the 1st bio and the 1st bvec
- * in the 2nd bio can be handled in one segment.
- */
-static inline bool bios_segs_mergeable(struct request_queue *q,
-		struct bio *prev, struct bio_vec *prev_last_bv,
-		struct bio_vec *next_first_bv)
-{
-	if (!biovec_phys_mergeable(q, prev_last_bv, next_first_bv))
-		return false;
-	if (prev->bi_seg_back_size + next_first_bv->bv_len >
-			queue_max_segment_size(q))
-		return false;
-	return true;
-}
-
 static inline bool bio_will_gap(struct request_queue *q,
 		struct request *prev_rq, struct bio *prev, struct bio *next)
 {
@@ -60,7 +43,7 @@ static inline bool bio_will_gap(struct request_queue *q,
 	 */
 	bio_get_last_bvec(prev, &pb);
 	bio_get_first_bvec(next, &nb);
-	if (bios_segs_mergeable(q, prev, &pb, &nb))
+	if (biovec_phys_mergeable(q, &pb, &nb))
 		return false;
 	return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 04/10] block: remove the bi_seg_{front,back}_size fields in struct bio
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (2 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 03/10] block: remove the segment size check in bio_will_gap Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep Christoph Hellwig
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

At this point these fields aren't used for anything, so we can remove
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c         | 94 +++++----------------------------------
 include/linux/blk_types.h |  7 ---
 2 files changed, 12 insertions(+), 89 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index eee2c02c50ce..17713d7d98d5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -162,8 +162,7 @@ static unsigned get_max_segment_size(struct request_queue *q,
  * variables.
  */
 static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
-		unsigned *nsegs, unsigned *last_seg_size,
-		unsigned *front_seg_size, unsigned *sectors, unsigned max_segs)
+		unsigned *nsegs, unsigned *sectors, unsigned max_segs)
 {
 	unsigned len = bv->bv_len;
 	unsigned total_len = 0;
@@ -185,28 +184,12 @@ static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
 			break;
 	}
 
-	if (!new_nsegs)
-		return !!len;
-
-	/* update front segment size */
-	if (!*nsegs) {
-		unsigned first_seg_size;
-
-		if (new_nsegs == 1)
-			first_seg_size = get_max_segment_size(q, bv->bv_offset);
-		else
-			first_seg_size = queue_max_segment_size(q);
-
-		if (*front_seg_size < first_seg_size)
-			*front_seg_size = first_seg_size;
+	if (new_nsegs) {
+		*nsegs += new_nsegs;
+		if (sectors)
+			*sectors += total_len >> 9;
 	}
 
-	/* update other varibles */
-	*last_seg_size = seg_size;
-	*nsegs += new_nsegs;
-	if (sectors)
-		*sectors += total_len >> 9;
-
 	/* split in the middle of the bvec if len != 0 */
 	return !!len;
 }
@@ -218,8 +201,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
-	unsigned seg_size = 0, nsegs = 0, sectors = 0;
-	unsigned front_seg_size = bio->bi_seg_front_size;
+	unsigned nsegs = 0, sectors = 0;
 	bool do_split = true;
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
@@ -243,8 +225,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 				/* split in the middle of bvec */
 				bv.bv_len = (max_sectors - sectors) << 9;
 				bvec_split_segs(q, &bv, &nsegs,
-						&seg_size,
-						&front_seg_size,
 						&sectors, max_segs);
 			}
 			goto split;
@@ -258,12 +238,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 
 		if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
 			nsegs++;
-			seg_size = bv.bv_len;
 			sectors += bv.bv_len >> 9;
-			if (nsegs == 1 && seg_size > front_seg_size)
-				front_seg_size = seg_size;
-		} else if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
-				    &front_seg_size, &sectors, max_segs)) {
+		} else if (bvec_split_segs(q, &bv, &nsegs, &sectors,
+				max_segs)) {
 			goto split;
 		}
 	}
@@ -278,10 +255,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 			bio = new;
 	}
 
-	bio->bi_seg_front_size = front_seg_size;
-	if (seg_size > bio->bi_seg_back_size)
-		bio->bi_seg_back_size = seg_size;
-
 	return do_split ? new : NULL;
 }
 
@@ -336,17 +309,13 @@ EXPORT_SYMBOL(blk_queue_split);
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio)
 {
-	struct bio_vec uninitialized_var(bv), bvprv = { NULL };
-	unsigned int seg_size, nr_phys_segs;
-	unsigned front_seg_size;
-	struct bio *fbio, *bbio;
+	unsigned int nr_phys_segs = 0;
 	struct bvec_iter iter;
+	struct bio_vec bv;
 
 	if (!bio)
 		return 0;
 
-	front_seg_size = bio->bi_seg_front_size;
-
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
@@ -356,23 +325,11 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 		return 1;
 	}
 
-	fbio = bio;
-	seg_size = 0;
-	nr_phys_segs = 0;
 	for_each_bio(bio) {
-		bio_for_each_bvec(bv, bio, iter) {
-			bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
-					&front_seg_size, NULL, UINT_MAX);
-		}
-		bbio = bio;
-		if (likely(bio->bi_iter.bi_size))
-			bvprv = bv;
+		bio_for_each_bvec(bv, bio, iter)
+			bvec_split_segs(q, &bv, &nr_phys_segs, NULL, UINT_MAX);
 	}
 
-	fbio->bi_seg_front_size = front_seg_size;
-	if (seg_size > bbio->bi_seg_back_size)
-		bbio->bi_seg_back_size = seg_size;
-
 	return nr_phys_segs;
 }
 
@@ -392,24 +349,6 @@ void blk_recount_segments(struct request_queue *q, struct bio *bio)
 	bio_set_flag(bio, BIO_SEG_VALID);
 }
 
-static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
-				   struct bio *nxt)
-{
-	struct bio_vec end_bv = { NULL }, nxt_bv;
-
-	if (bio->bi_seg_back_size + nxt->bi_seg_front_size >
-	    queue_max_segment_size(q))
-		return 0;
-
-	if (!bio_has_data(bio))
-		return 1;
-
-	bio_get_last_bvec(bio, &end_bv);
-	bio_get_first_bvec(nxt, &nxt_bv);
-
-	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
-}
-
 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
@@ -669,8 +608,6 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 				struct request *next)
 {
 	int total_phys_segments;
-	unsigned int seg_size =
-		req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
 
 	if (req_gap_back_merge(req, next->bio))
 		return 0;
@@ -683,13 +620,6 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 		return 0;
 
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
-	if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
-		if (req->nr_phys_segments == 1)
-			req->bio->bi_seg_front_size = seg_size;
-		if (next->nr_phys_segments == 1)
-			next->biotail->bi_seg_back_size = seg_size;
-	}
-
 	if (total_phys_segments > queue_max_segments(q))
 		return 0;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index be418275763c..95202f80676c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -159,13 +159,6 @@ struct bio {
 	 */
 	unsigned int		bi_phys_segments;
 
-	/*
-	 * To keep track of the max segment size, we account for the
-	 * sizes of the first and last mergeable segments in this bio.
-	 */
-	unsigned int		bi_seg_front_size;
-	unsigned int		bi_seg_back_size;
-
 	struct bvec_iter	bi_iter;
 
 	atomic_t		__bi_remaining;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (3 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 04/10] block: remove the bi_seg_{front,back}_size fields in struct bio Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13 15:04   ` Chaitanya Kulkarni
  2019-05-13  6:37 ` [PATCH 06/10] block: remove blk_init_request_from_bio Christoph Hellwig
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

The priority field also makes sense for passthrough requests, so
initialize it in blk_rq_bio_prep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 419d600e6637..7fb394dd3e11 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -716,7 +716,6 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
 		req->cmd_flags |= REQ_FAILFAST_MASK;
 
 	req->__sector = bio->bi_iter.bi_sector;
-	req->ioprio = bio_prio(bio);
 	req->write_hint = bio->bi_write_hint;
 	blk_rq_bio_prep(req->q, req, bio);
 }
@@ -1494,6 +1493,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 
 	rq->__data_len = bio->bi_iter.bi_size;
 	rq->bio = rq->biotail = bio;
+	rq->ioprio = bio_prio(bio);
 
 	if (bio->bi_disk)
 		rq->rq_disk = bio->bi_disk;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 06/10] block: remove blk_init_request_from_bio
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (4 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 07/10] block: remove the bi_phys_segments field in struct bio Christoph Hellwig
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

lightnvm should have never used this function, as it is sending
passthrough requests, so switch it to blk_rq_append_bio like all the
other passthrough request users.  Inline blk_init_request_from_bio into
the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c             | 11 -----------
 block/blk-mq.c               |  7 ++++++-
 drivers/nvme/host/lightnvm.c |  2 +-
 include/linux/blkdev.h       |  1 -
 4 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7fb394dd3e11..b46ea531cb07 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -710,17 +710,6 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	return false;
 }
 
-void blk_init_request_from_bio(struct request *req, struct bio *bio)
-{
-	if (bio->bi_opf & REQ_RAHEAD)
-		req->cmd_flags |= REQ_FAILFAST_MASK;
-
-	req->__sector = bio->bi_iter.bi_sector;
-	req->write_hint = bio->bi_write_hint;
-	blk_rq_bio_prep(req->q, req, bio);
-}
-EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
-
 static void handle_bad_sector(struct bio *bio, sector_t maxsector)
 {
 	char b[BDEVNAME_SIZE];
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a6248d8536..c0e5132d9103 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1765,7 +1765,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
 {
-	blk_init_request_from_bio(rq, bio);
+	if (bio->bi_opf & REQ_RAHEAD)
+		rq->cmd_flags |= REQ_FAILFAST_MASK;
+
+	rq->__sector = bio->bi_iter.bi_sector;
+	rq->write_hint = bio->bi_write_hint;
+	blk_rq_bio_prep(rq->q, rq, bio);
 
 	blk_account_io_start(rq, true);
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 949e29e1d782..6cc050894d3a 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -660,7 +660,7 @@ static struct request *nvme_nvm_alloc_request(struct request_queue *q,
 	rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
 
 	if (rqd->bio)
-		blk_init_request_from_bio(rq, rqd->bio);
+		blk_rq_append_bio(rq, &rqd->bio);
 	else
 		rq->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1aafeb923e7b..05bc85cdbc25 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -823,7 +823,6 @@ extern void blk_unregister_queue(struct gendisk *disk);
 extern blk_qc_t generic_make_request(struct bio *bio);
 extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
-extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
 extern struct request *blk_get_request(struct request_queue *, unsigned int op,
 				       blk_mq_req_flags_t flags);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 07/10] block: remove the bi_phys_segments field in struct bio
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (5 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 06/10] block: remove blk_init_request_from_bio Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 08/10] block: simplify blk_recalc_rq_segments Christoph Hellwig
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio and return it from a variant of
blk_queue_split instead, so that it can be passed as an argument to
those functions that need the value.

This also means we stop recounting segments except for cloning
and partial segments.

To keep the number of arguments in this hot path down, remove the
now pointless struct request_queue argument from any function that
grew a nr_segs argument.
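
The new flow can be sketched with a toy model (hypothetical types, not
the kernel's real structures): the segment count is computed once
while walking the vectors and handed straight to the request, instead
of being cached in each bio behind a validity flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for bio_vec and request. */
struct toy_bvec { unsigned int len; };
struct toy_rq {
	unsigned int nr_phys_segments;
	unsigned int data_len;
};

/* Toy counterpart of the split path: count segments in one walk and
 * return the count so the caller can pass it down, rather than
 * stashing it in a per-bio field. */
static unsigned int toy_split_and_prep(const struct toy_bvec *v, size_t n,
				       struct toy_rq *rq)
{
	unsigned int nr_segs = 0, bytes = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		nr_segs++;		/* one segment per vector here */
		bytes += v[i].len;
	}
	rq->nr_phys_segments = nr_segs;	/* set once, no per-bio cache */
	rq->data_len = bytes;
	return nr_segs;
}
```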

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/block/biodoc.txt |  1 -
 block/bfq-iosched.c            |  5 ++-
 block/bio.c                    | 15 +------
 block/blk-core.c               | 32 +++++++--------
 block/blk-map.c                | 10 ++++-
 block/blk-merge.c              | 75 ++++++++++++----------------------
 block/blk-mq-sched.c           | 26 +++++++-----
 block/blk-mq-sched.h           | 10 +++--
 block/blk-mq.c                 | 23 ++++++-----
 block/blk.h                    | 23 ++++++-----
 block/kyber-iosched.c          |  5 ++-
 block/mq-deadline.c            |  5 ++-
 drivers/md/raid5.c             |  1 -
 include/linux/bio.h            |  1 -
 include/linux/blk-mq.h         |  2 +-
 include/linux/blk_types.h      |  6 ---
 include/linux/blkdev.h         |  1 -
 include/linux/elevator.h       |  2 +-
 18 files changed, 106 insertions(+), 137 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index ac18b488cb5e..31c177663ed5 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -436,7 +436,6 @@ struct bio {
        struct bvec_iter	bi_iter;	/* current index into bio_vec array */
 
        unsigned int	bi_size;     /* total size in bytes */
-       unsigned short 	bi_phys_segments; /* segments after physaddr coalesce*/
        unsigned short	bi_hw_segments; /* segments after DMA remapping */
        unsigned int	bi_max;	     /* max bio_vecs we can hold
                                         used as index into pool */
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f8d430f88d25..a6bf842cbe16 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2027,7 +2027,8 @@ static void bfq_remove_request(struct request_queue *q,
 
 }
 
-static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+		unsigned int nr_segs)
 {
 	struct request_queue *q = hctx->queue;
 	struct bfq_data *bfqd = q->elevator->elevator_data;
@@ -2050,7 +2051,7 @@ static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
 		bfqd->bio_bfqq = NULL;
 	bfqd->bio_bic = bic;
 
-	ret = blk_mq_sched_try_merge(q, bio, &free);
+	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
 
 	if (free)
 		blk_mq_free_request(free);
diff --git a/block/bio.c b/block/bio.c
index 683cbb40f051..d550d36392e9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -558,14 +558,6 @@ void bio_put(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_put);
 
-int bio_phys_segments(struct request_queue *q, struct bio *bio)
-{
-	if (unlikely(!bio_flagged(bio, BIO_SEG_VALID)))
-		blk_recount_segments(q, bio);
-
-	return bio->bi_phys_segments;
-}
-
 /**
  * 	__bio_clone_fast - clone a bio that shares the original bio's biovec
  * 	@bio: destination bio
@@ -739,7 +731,7 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	if (bio_full(bio))
 		return 0;
 
-	if (bio->bi_phys_segments >= queue_max_segments(q))
+	if (bio->bi_vcnt >= queue_max_segments(q))
 		return 0;
 
 	bvec = &bio->bi_io_vec[bio->bi_vcnt];
@@ -749,8 +741,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	bio->bi_vcnt++;
  done:
 	bio->bi_iter.bi_size += len;
-	bio->bi_phys_segments = bio->bi_vcnt;
-	bio_set_flag(bio, BIO_SEG_VALID);
 	return len;
 }
 
@@ -1910,10 +1900,7 @@ void bio_trim(struct bio *bio, int offset, int size)
 	if (offset == 0 && size == bio->bi_iter.bi_size)
 		return;
 
-	bio_clear_flag(bio, BIO_SEG_VALID);
-
 	bio_advance(bio, offset << 9);
-
 	bio->bi_iter.bi_size = size;
 
 	if (bio_integrity(bio))
diff --git a/block/blk-core.c b/block/blk-core.c
index b46ea531cb07..84118411626c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -573,15 +573,15 @@ void blk_put_request(struct request *req)
 }
 EXPORT_SYMBOL(blk_put_request);
 
-bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
-			    struct bio *bio)
+bool bio_attempt_back_merge(struct request *req, struct bio *bio,
+		unsigned int nr_segs)
 {
 	const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
 
-	if (!ll_back_merge_fn(q, req, bio))
+	if (!ll_back_merge_fn(req, bio, nr_segs))
 		return false;
 
-	trace_block_bio_backmerge(q, req, bio);
+	trace_block_bio_backmerge(req->q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
@@ -594,15 +594,15 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 	return true;
 }
 
-bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
-			     struct bio *bio)
+bool bio_attempt_front_merge(struct request *req, struct bio *bio,
+		unsigned int nr_segs)
 {
 	const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
 
-	if (!ll_front_merge_fn(q, req, bio))
+	if (!ll_front_merge_fn(req, bio, nr_segs))
 		return false;
 
-	trace_block_bio_frontmerge(q, req, bio);
+	trace_block_bio_frontmerge(req->q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
@@ -644,6 +644,7 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
  * blk_attempt_plug_merge - try to merge with %current's plugged list
  * @q: request_queue new bio is being queued at
  * @bio: new bio being queued
+ * @nr_segs: number of segments in @bio
  * @same_queue_rq: pointer to &struct request that gets filled in when
  * another request associated with @q is found on the plug list
  * (optional, may be %NULL)
@@ -662,7 +663,7 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
  * Caller must ensure !blk_queue_nomerges(q) beforehand.
  */
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-			    struct request **same_queue_rq)
+		unsigned int nr_segs, struct request **same_queue_rq)
 {
 	struct blk_plug *plug;
 	struct request *rq;
@@ -691,10 +692,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 
 		switch (blk_try_merge(rq, bio)) {
 		case ELEVATOR_BACK_MERGE:
-			merged = bio_attempt_back_merge(q, rq, bio);
+			merged = bio_attempt_back_merge(rq, bio, nr_segs);
 			break;
 		case ELEVATOR_FRONT_MERGE:
-			merged = bio_attempt_front_merge(q, rq, bio);
+			merged = bio_attempt_front_merge(rq, bio, nr_segs);
 			break;
 		case ELEVATOR_DISCARD_MERGE:
 			merged = bio_attempt_discard_merge(q, rq, bio);
@@ -1472,14 +1473,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
 }
 EXPORT_SYMBOL_GPL(blk_update_request);
 
-void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
-		     struct bio *bio)
+void blk_rq_bio_prep(struct request *rq, struct bio *bio, unsigned int nr_segs)
 {
-	if (bio_has_data(bio))
-		rq->nr_phys_segments = bio_phys_segments(q, bio);
-	else if (bio_op(bio) == REQ_OP_DISCARD)
-		rq->nr_phys_segments = 1;
-
+	rq->nr_phys_segments = nr_segs;
 	rq->__data_len = bio->bi_iter.bi_size;
 	rq->bio = rq->biotail = bio;
 	rq->ioprio = bio_prio(bio);
diff --git a/block/blk-map.c b/block/blk-map.c
index db9373bd31ac..3a62e471d81b 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -18,13 +18,19 @@
 int blk_rq_append_bio(struct request *rq, struct bio **bio)
 {
 	struct bio *orig_bio = *bio;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	unsigned int nr_segs = 0;
 
 	blk_queue_bounce(rq->q, bio);
 
+	bio_for_each_bvec(bv, *bio, iter)
+		nr_segs++;
+
 	if (!rq->bio) {
-		blk_rq_bio_prep(rq->q, rq, *bio);
+		blk_rq_bio_prep(rq, *bio, nr_segs);
 	} else {
-		if (!ll_back_merge_fn(rq->q, rq, *bio)) {
+		if (!ll_back_merge_fn(rq, *bio, nr_segs)) {
 			if (orig_bio != *bio) {
 				bio_put(*bio);
 				*bio = orig_bio;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 17713d7d98d5..72b4fd89a22d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -258,32 +258,29 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio)
+void __blk_queue_split(struct request_queue *q, struct bio **bio,
+		unsigned int *nr_segs)
 {
-	struct bio *split, *res;
-	unsigned nsegs;
+	struct bio *split;
 
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		split = blk_bio_discard_split(q, *bio, &q->bio_split, &nsegs);
+		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split, &nsegs);
+		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
+				nr_segs);
 		break;
 	case REQ_OP_WRITE_SAME:
-		split = blk_bio_write_same_split(q, *bio, &q->bio_split, &nsegs);
+		split = blk_bio_write_same_split(q, *bio, &q->bio_split,
+				nr_segs);
 		break;
 	default:
-		split = blk_bio_segment_split(q, *bio, &q->bio_split, &nsegs);
+		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
 		break;
 	}
 
-	/* physical segments can be figured out during splitting */
-	res = split ? split : *bio;
-	res->bi_phys_segments = nsegs;
-	bio_set_flag(res, BIO_SEG_VALID);
-
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
 		split->bi_opf |= REQ_NOMERGE;
@@ -304,6 +301,13 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 		*bio = split;
 	}
 }
+
+void blk_queue_split(struct request_queue *q, struct bio **bio)
+{
+	unsigned int nr_segs;
+
+	__blk_queue_split(q, bio, &nr_segs);
+}
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
@@ -338,17 +342,6 @@ void blk_recalc_rq_segments(struct request *rq)
 	rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
-void blk_recount_segments(struct request_queue *q, struct bio *bio)
-{
-	struct bio *nxt = bio->bi_next;
-
-	bio->bi_next = NULL;
-	bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
-	bio->bi_next = nxt;
-
-	bio_set_flag(bio, BIO_SEG_VALID);
-}
-
 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
@@ -519,16 +512,13 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
-static inline int ll_new_hw_segment(struct request_queue *q,
-				    struct request *req,
-				    struct bio *bio)
+static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
+		unsigned int nr_phys_segs)
 {
-	int nr_phys_segs = bio_phys_segments(q, bio);
-
-	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
+	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(req->q))
 		goto no_merge;
 
-	if (blk_integrity_merge_bio(q, req, bio) == false)
+	if (blk_integrity_merge_bio(req->q, req, bio) == false)
 		goto no_merge;
 
 	/*
@@ -539,12 +529,11 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 	return 1;
 
 no_merge:
-	req_set_nomerge(q, req);
+	req_set_nomerge(req->q, req);
 	return 0;
 }
 
-int ll_back_merge_fn(struct request_queue *q, struct request *req,
-		     struct bio *bio)
+int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
 {
 	if (req_gap_back_merge(req, bio))
 		return 0;
@@ -553,21 +542,15 @@ int ll_back_merge_fn(struct request_queue *q, struct request *req,
 		return 0;
 	if (blk_rq_sectors(req) + bio_sectors(bio) >
 	    blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
-		req_set_nomerge(q, req);
+		req_set_nomerge(req->q, req);
 		return 0;
 	}
-	if (!bio_flagged(req->biotail, BIO_SEG_VALID))
-		blk_recount_segments(q, req->biotail);
-	if (!bio_flagged(bio, BIO_SEG_VALID))
-		blk_recount_segments(q, bio);
 
-	return ll_new_hw_segment(q, req, bio);
+	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
-int ll_front_merge_fn(struct request_queue *q, struct request *req,
-		      struct bio *bio)
+int ll_front_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
 {
-
 	if (req_gap_front_merge(req, bio))
 		return 0;
 	if (blk_integrity_rq(req) &&
@@ -575,15 +558,11 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
 		return 0;
 	if (blk_rq_sectors(req) + bio_sectors(bio) >
 	    blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) {
-		req_set_nomerge(q, req);
+		req_set_nomerge(req->q, req);
 		return 0;
 	}
-	if (!bio_flagged(bio, BIO_SEG_VALID))
-		blk_recount_segments(q, bio);
-	if (!bio_flagged(req->bio, BIO_SEG_VALID))
-		blk_recount_segments(q, req->bio);
 
-	return ll_new_hw_segment(q, req, bio);
+	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
 static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74c6bb871f7e..72124d76b96a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -224,7 +224,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-			    struct request **merged_request)
+		unsigned int nr_segs, struct request **merged_request)
 {
 	struct request *rq;
 
@@ -232,7 +232,7 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 	case ELEVATOR_BACK_MERGE:
 		if (!blk_mq_sched_allow_merge(q, rq, bio))
 			return false;
-		if (!bio_attempt_back_merge(q, rq, bio))
+		if (!bio_attempt_back_merge(rq, bio, nr_segs))
 			return false;
 		*merged_request = attempt_back_merge(q, rq);
 		if (!*merged_request)
@@ -241,7 +241,7 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 	case ELEVATOR_FRONT_MERGE:
 		if (!blk_mq_sched_allow_merge(q, rq, bio))
 			return false;
-		if (!bio_attempt_front_merge(q, rq, bio))
+		if (!bio_attempt_front_merge(rq, bio, nr_segs))
 			return false;
 		*merged_request = attempt_front_merge(q, rq);
 		if (!*merged_request)
@@ -260,7 +260,7 @@ EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
  * of them.
  */
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
-			   struct bio *bio)
+			   struct bio *bio, unsigned int nr_segs)
 {
 	struct request *rq;
 	int checked = 8;
@@ -277,11 +277,13 @@ bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
 		switch (blk_try_merge(rq, bio)) {
 		case ELEVATOR_BACK_MERGE:
 			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_back_merge(q, rq, bio);
+				merged = bio_attempt_back_merge(rq, bio,
+						nr_segs);
 			break;
 		case ELEVATOR_FRONT_MERGE:
 			if (blk_mq_sched_allow_merge(q, rq, bio))
-				merged = bio_attempt_front_merge(q, rq, bio);
+				merged = bio_attempt_front_merge(rq, bio,
+						nr_segs);
 			break;
 		case ELEVATOR_DISCARD_MERGE:
 			merged = bio_attempt_discard_merge(q, rq, bio);
@@ -304,13 +306,14 @@ EXPORT_SYMBOL_GPL(blk_mq_bio_list_merge);
  */
 static bool blk_mq_attempt_merge(struct request_queue *q,
 				 struct blk_mq_hw_ctx *hctx,
-				 struct blk_mq_ctx *ctx, struct bio *bio)
+				 struct blk_mq_ctx *ctx, struct bio *bio,
+				 unsigned int nr_segs)
 {
 	enum hctx_type type = hctx->type;
 
 	lockdep_assert_held(&ctx->lock);
 
-	if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio)) {
+	if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs)) {
 		ctx->rq_merged++;
 		return true;
 	}
@@ -318,7 +321,8 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
 	return false;
 }
 
-bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+		unsigned int nr_segs)
 {
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
@@ -328,7 +332,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 
 	if (e && e->type->ops.bio_merge) {
 		blk_mq_put_ctx(ctx);
-		return e->type->ops.bio_merge(hctx, bio);
+		return e->type->ops.bio_merge(hctx, bio, nr_segs);
 	}
 
 	type = hctx->type;
@@ -336,7 +340,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 			!list_empty_careful(&ctx->rq_lists[type])) {
 		/* default per sw-queue merge */
 		spin_lock(&ctx->lock);
-		ret = blk_mq_attempt_merge(q, hctx, ctx, bio);
+		ret = blk_mq_attempt_merge(q, hctx, ctx, bio, nr_segs);
 		spin_unlock(&ctx->lock);
 	}
 
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index c7bdb52367ac..a1e5850ffb1d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -12,8 +12,9 @@ void blk_mq_sched_assign_ioc(struct request *rq);
 
 void blk_mq_sched_request_inserted(struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-				struct request **merged_request);
-bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
+		unsigned int nr_segs, struct request **merged_request);
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+		unsigned int nr_segs);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
@@ -30,12 +31,13 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
 static inline bool
-blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+		unsigned int nr_segs)
 {
 	if (blk_queue_nomerges(q) || !bio_mergeable(bio))
 		return false;
 
-	return __blk_mq_sched_bio_merge(q, bio);
+	return __blk_mq_sched_bio_merge(q, bio, nr_segs);
 }
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c0e5132d9103..dbb58c50654b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1763,14 +1763,15 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	}
 }
 
-static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
+static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
+		unsigned int nr_segs)
 {
 	if (bio->bi_opf & REQ_RAHEAD)
 		rq->cmd_flags |= REQ_FAILFAST_MASK;
 
 	rq->__sector = bio->bi_iter.bi_sector;
 	rq->write_hint = bio->bi_write_hint;
-	blk_rq_bio_prep(rq->q, rq, bio);
+	blk_rq_bio_prep(rq, bio, nr_segs);
 
 	blk_account_io_start(rq, true);
 }
@@ -1940,20 +1941,20 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct request *rq;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
+	unsigned int nr_segs;
 	blk_qc_t cookie;
 
 	blk_queue_bounce(q, &bio);
-
-	blk_queue_split(q, &bio);
+	__blk_queue_split(q, &bio, &nr_segs);
 
 	if (!bio_integrity_prep(bio))
 		return BLK_QC_T_NONE;
 
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &same_queue_rq))
+	    blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
-	if (blk_mq_sched_bio_merge(q, bio))
+	if (blk_mq_sched_bio_merge(q, bio, nr_segs))
 		return BLK_QC_T_NONE;
 
 	rq_qos_throttle(q, bio);
@@ -1976,7 +1977,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	plug = current->plug;
 	if (unlikely(is_flush_fua)) {
 		blk_mq_put_ctx(data.ctx);
-		blk_mq_bio_to_request(rq, bio);
+		blk_mq_bio_to_request(rq, bio, nr_segs);
 
 		/* bypass scheduler for flush rq */
 		blk_insert_flush(rq);
@@ -1990,7 +1991,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		struct request *last = NULL;
 
 		blk_mq_put_ctx(data.ctx);
-		blk_mq_bio_to_request(rq, bio);
+		blk_mq_bio_to_request(rq, bio, nr_segs);
 
 		if (!request_count)
 			trace_block_plug(q);
@@ -2005,7 +2006,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 		blk_add_rq_to_plug(plug, rq);
 	} else if (plug && !blk_queue_nomerges(q)) {
-		blk_mq_bio_to_request(rq, bio);
+		blk_mq_bio_to_request(rq, bio, nr_segs);
 
 		/*
 		 * We do limited plugging. If the bio can be merged, do that.
@@ -2034,11 +2035,11 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	} else if ((q->nr_hw_queues > 1 && is_sync) || (!q->elevator &&
 			!data.hctx->dispatch_busy)) {
 		blk_mq_put_ctx(data.ctx);
-		blk_mq_bio_to_request(rq, bio);
+		blk_mq_bio_to_request(rq, bio, nr_segs);
 		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
 	} else {
 		blk_mq_put_ctx(data.ctx);
-		blk_mq_bio_to_request(rq, bio);
+		blk_mq_bio_to_request(rq, bio, nr_segs);
 		blk_mq_sched_insert_request(rq, false, true, true);
 	}
 
diff --git a/block/blk.h b/block/blk.h
index e27fd1512e4b..a18d0b9fe353 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,8 +51,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_exit_queue(struct request_queue *q);
-void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
-			struct bio *bio);
+void blk_rq_bio_prep(struct request *rq, struct bio *bio, unsigned int nr_segs);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
@@ -154,14 +153,14 @@ static inline bool bio_integrity_endio(struct bio *bio)
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 
-bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
-			     struct bio *bio);
-bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
-			    struct bio *bio);
+bool bio_attempt_front_merge(struct request *req, struct bio *bio,
+		unsigned int nr_segs);
+bool bio_attempt_back_merge(struct request *req, struct bio *bio,
+		unsigned int nr_segs);
 bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
 		struct bio *bio);
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-			    struct request **same_queue_rq);
+		unsigned int nr_segs, struct request **same_queue_rq);
 
 void blk_account_io_start(struct request *req, bool new_io);
 void blk_account_io_completion(struct request *req, unsigned int bytes);
@@ -195,10 +194,12 @@ static inline int blk_should_fake_timeout(struct request_queue *q)
 }
 #endif
 
-int ll_back_merge_fn(struct request_queue *q, struct request *req,
-		     struct bio *bio);
-int ll_front_merge_fn(struct request_queue *q, struct request *req, 
-		      struct bio *bio);
+void __blk_queue_split(struct request_queue *q, struct bio **bio,
+		unsigned int *nr_segs);
+int ll_back_merge_fn(struct request *req, struct bio *bio,
+		unsigned int nr_segs);
+int ll_front_merge_fn(struct request *req,  struct bio *bio,
+		unsigned int nr_segs);
 struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
 struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
 int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index c3b05119cebd..3c2602601741 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -562,7 +562,8 @@ static void kyber_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 	}
 }
 
-static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+		unsigned int nr_segs)
 {
 	struct kyber_hctx_data *khd = hctx->sched_data;
 	struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue);
@@ -572,7 +573,7 @@ static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
 	bool merged;
 
 	spin_lock(&kcq->lock);
-	merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio);
+	merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
 	spin_unlock(&kcq->lock);
 	blk_mq_put_ctx(ctx);
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 1876f5712bfd..b8a682b5a1bb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -469,7 +469,8 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
 	return ELEVATOR_NO_MERGE;
 }
 
-static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+		unsigned int nr_segs)
 {
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
@@ -477,7 +478,7 @@ static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
 	bool ret;
 
 	spin_lock(&dd->lock);
-	ret = blk_mq_sched_try_merge(q, bio, &free);
+	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
 	spin_unlock(&dd->lock);
 
 	if (free)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7fde645d2e90..16358fdd1850 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5259,7 +5259,6 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 		rcu_read_unlock();
 		raid_bio->bi_next = (void*)rdev;
 		bio_set_dev(align_bi, rdev->bdev);
-		bio_clear_flag(align_bi, BIO_SEG_VALID);
 
 		if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
 				bio_sectors(align_bi),
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ea73df36529a..e0f3b8898a81 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -408,7 +408,6 @@ static inline void bio_wouldblock_error(struct bio *bio)
 }
 
 struct request_queue;
-extern int bio_phys_segments(struct request_queue *, struct bio *);
 
 extern int submit_bio_wait(struct bio *bio);
 extern void bio_advance(struct bio *, unsigned);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15d1aa53d96c..3fa1fa59f9b2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -306,7 +306,7 @@ void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs
 bool blk_mq_complete_request(struct request *rq);
 void blk_mq_complete_request_sync(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
-			   struct bio *bio);
+			   struct bio *bio, unsigned int nr_segs);
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
 void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 95202f80676c..6a53799c3fe2 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -154,11 +154,6 @@ struct bio {
 	blk_status_t		bi_status;
 	u8			bi_partno;
 
-	/* Number of segments in this BIO after
-	 * physical address coalescing is performed.
-	 */
-	unsigned int		bi_phys_segments;
-
 	struct bvec_iter	bi_iter;
 
 	atomic_t		__bi_remaining;
@@ -210,7 +205,6 @@ struct bio {
  */
 enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
-	BIO_SEG_VALID,		/* bi_phys_segments valid */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_USER_MAPPED,	/* contains user pages */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 05bc85cdbc25..96edfdf34cd3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -836,7 +836,6 @@ extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
 extern void blk_queue_split(struct request_queue *, struct bio **);
-extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
 			      unsigned int, void __user *);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6e8bc53740f0..169bb2e02516 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -34,7 +34,7 @@ struct elevator_mq_ops {
 	void (*depth_updated)(struct blk_mq_hw_ctx *);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
-	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *, unsigned int);
 	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 08/10] block: simplify blk_recalc_rq_segments
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (6 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 07/10] block: remove the bi_phys_segments field in struct bio Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 09/10] block: untangle the end of blk_bio_segment_split Christoph Hellwig
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

Return the segment count and let the callers assign it, which makes the
code a little more obvious.  Also pass the request instead of q plus bio
chain, allowing for the use of rq_for_each_bvec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c  |  4 ++--
 block/blk-merge.c | 21 ++++++---------------
 block/blk.h       |  2 +-
 3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 84118411626c..c894c9887dca 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1198,7 +1198,7 @@ static int blk_cloned_rq_check_limits(struct request_queue *q,
 	 * Recalculate it to check the request correctly on this queue's
 	 * limitation.
 	 */
-	blk_recalc_rq_segments(rq);
+	rq->nr_phys_segments = blk_recalc_rq_segments(rq);
 	if (rq->nr_phys_segments > queue_max_segments(q)) {
 		printk(KERN_ERR "%s: over max segments limit.\n", __func__);
 		return -EIO;
@@ -1466,7 +1466,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
 		}
 
 		/* recalculate the number of segments */
-		blk_recalc_rq_segments(req);
+		req->nr_phys_segments = blk_recalc_rq_segments(req);
 	}
 
 	return true;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 72b4fd89a22d..2ea21ffd5f72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -310,17 +310,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 }
 EXPORT_SYMBOL(blk_queue_split);
 
-static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-					     struct bio *bio)
+unsigned int blk_recalc_rq_segments(struct request *rq)
 {
 	unsigned int nr_phys_segs = 0;
-	struct bvec_iter iter;
+	struct req_iterator iter;
 	struct bio_vec bv;
 
-	if (!bio)
+	if (!rq->bio)
 		return 0;
 
-	switch (bio_op(bio)) {
+	switch (bio_op(rq->bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
@@ -329,19 +328,11 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 		return 1;
 	}
 
-	for_each_bio(bio) {
-		bio_for_each_bvec(bv, bio, iter)
-			bvec_split_segs(q, &bv, &nr_phys_segs, NULL, UINT_MAX);
-	}
-
+	rq_for_each_bvec(bv, rq, iter)
+		bvec_split_segs(rq->q, &bv, &nr_phys_segs, NULL, UINT_MAX);
 	return nr_phys_segs;
 }
 
-void blk_recalc_rq_segments(struct request *rq)
-{
-	rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
-}
-
 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
diff --git a/block/blk.h b/block/blk.h
index a18d0b9fe353..5352cdb876a6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -204,7 +204,7 @@ struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
 struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
 int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 				struct request *next);
-void blk_recalc_rq_segments(struct request *rq);
+unsigned int blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
 bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 09/10] block: untangle the end of blk_bio_segment_split
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (7 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 08/10] block: simplify blk_recalc_rq_segments Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13  6:37 ` [PATCH 10/10] block: mark blk_rq_bio_prep as inline Christoph Hellwig
       [not found] ` <CGME20190513063855epcas5p33ef8c4c0a0055bd0b66eadc859796f0f@epcms2p6>
  10 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

Now that we don't need to assign the front/back segment sizes, we can
duplicate the segs assignment for the split vs no-split case and
remove a whole chunk of boilerplate code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2ea21ffd5f72..ca45eb51c669 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -202,8 +202,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
 	unsigned nsegs = 0, sectors = 0;
-	bool do_split = true;
-	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
 	const unsigned max_segs = queue_max_segments(q);
 
@@ -245,17 +243,11 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		}
 	}
 
-	do_split = false;
+	*segs = nsegs;
+	return NULL;
 split:
 	*segs = nsegs;
-
-	if (do_split) {
-		new = bio_split(bio, sectors, GFP_NOIO, bs);
-		if (new)
-			bio = new;
-	}
-
-	return do_split ? new : NULL;
+	return bio_split(bio, sectors, GFP_NOIO, bs);
 }
 
 void __blk_queue_split(struct request_queue *q, struct bio **bio,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 10/10] block: mark blk_rq_bio_prep as inline
  2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
                   ` (8 preceding siblings ...)
  2019-05-13  6:37 ` [PATCH 09/10] block: untangle the end of blk_bio_segment_split Christoph Hellwig
@ 2019-05-13  6:37 ` Christoph Hellwig
  2019-05-13 14:57   ` Chaitanya Kulkarni
       [not found] ` <CGME20190513063855epcas5p33ef8c4c0a0055bd0b66eadc859796f0f@epcms2p6>
  10 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13  6:37 UTC (permalink / raw)
  To: axboe; +Cc: ming.lei, Matias Bjorling, linux-block

This function just has a few trivial assignments and has two callers,
one of which is in the fast path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 11 -----------
 block/blk.h      | 13 ++++++++++++-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c894c9887dca..9405388ac658 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1473,17 +1473,6 @@ bool blk_update_request(struct request *req, blk_status_t error,
 }
 EXPORT_SYMBOL_GPL(blk_update_request);
 
-void blk_rq_bio_prep(struct request *rq, struct bio *bio, unsigned int nr_segs)
-{
-	rq->nr_phys_segments = nr_segs;
-	rq->__data_len = bio->bi_iter.bi_size;
-	rq->bio = rq->biotail = bio;
-	rq->ioprio = bio_prio(bio);
-
-	if (bio->bi_disk)
-		rq->rq_disk = bio->bi_disk;
-}
-
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
 /**
  * rq_flush_dcache_pages - Helper function to flush all pages in a request
diff --git a/block/blk.h b/block/blk.h
index 5352cdb876a6..cbb0995ed17e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,7 +51,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_exit_queue(struct request_queue *q);
-void blk_rq_bio_prep(struct request *rq, struct bio *bio, unsigned int nr_segs);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
@@ -100,6 +99,18 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
 	return __bvec_gap_to_prev(q, bprv, offset);
 }
 
+static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
+		unsigned int nr_segs)
+{
+	rq->nr_phys_segments = nr_segs;
+	rq->__data_len = bio->bi_iter.bi_size;
+	rq->bio = rq->biotail = bio;
+	rq->ioprio = bio_prio(bio);
+
+	if (bio->bi_disk)
+		rq->rq_disk = bio->bi_disk;
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep
       [not found] ` <CGME20190513063855epcas5p33ef8c4c0a0055bd0b66eadc859796f0f@epcms2p6>
@ 2019-05-13  7:34   ` Minwoo Im
  0 siblings, 0 replies; 27+ messages in thread
From: Minwoo Im @ 2019-05-13  7:34 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, Minwoo Im
  Cc: ming.lei, Matias Bjorling, linux-block

> The priority fiel also make sense for passthrough requests, so
                           ^^^^
                       `s/fiel/field`
I think it's trivial so that it can be done with merging.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-13  6:37 ` [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments Christoph Hellwig
@ 2019-05-13  9:45   ` Ming Lei
  2019-05-13 12:03     ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2019-05-13  9:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Matias Bjorling, linux-block

On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> Currently ll_merge_requests_fn, unlike all other merge functions,
> reduces nr_phys_segments by one if the last segment of the previous,
> and the first segment of the next segement are contigous.  While this
> seems like a nice solution to avoid building smaller than possible

Some workloads need this optimization, please see 729204ef49ec00b
("block: relax check on sg gap"):

    If the last bvec of the 1st bio and the 1st bvec of the next
    bio are physically contigious, and the latter can be merged
    to last segment of the 1st bio, we should think they don't
    violate sg gap(or virt boundary) limit.

    Both Vitaly and Dexuan reported lots of unmergeable small bios
    are observed when running mkfs on Hyper-V virtual storage, and
    performance becomes quite low. This patch fixes that performance
    issue.

It can be observed that thousands of 512byte bios in one request when
running mkfs related workloads.

> requests it causes a mismatch between the segments actually present
> in the request and those iterated over by the bvec iterators, including
> __rq_for_each_bio.  This could cause overwrites of too small kmalloc

Request based drivers usually shouldn't iterate bio any more.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-13  9:45   ` Ming Lei
@ 2019-05-13 12:03     ` Christoph Hellwig
  2019-05-13 12:37       ` Christoph Hellwig
  2019-05-14  4:36       ` Ming Lei
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13 12:03 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, axboe, Matias Bjorling, linux-block

On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > Currently ll_merge_requests_fn, unlike all other merge functions,
> > reduces nr_phys_segments by one if the last segment of the previous,
> > and the first segment of the next segement are contigous.  While this
> > seems like a nice solution to avoid building smaller than possible
> 
> Some workloads need this optimization, please see 729204ef49ec00b
> ("block: relax check on sg gap"):

And we still allow to merge the segments with this patch.  The only
difference is that these merges do get accounted as extra segments.

> > requests it causes a mismatch between the segments actually present
> > in the request and those iterated over by the bvec iterators, including
> > __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> 
> Request based drivers usually shouldn't iterate bio any more.

We do that in a couple of places.  For one the nvme single segment
optimization that triggered this bug.  Also for range discard support
in nvme and virtio.  Then we have loop, which iterates the segments but
doesn't use the nr_phys_segments count, and plenty of others that
iterate over pages at the moment but should be iterating bvecs,
e.g. ubd or aoe.
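For readers following along, the failure mode is easy to model in plain
userspace C (a toy model with made-up structures, not the kernel API):
a driver sizes an allocation from the accounted segment count, then
writes one entry per element the iterator actually yields.

```c
#include <assert.h>

/* Toy model, not kernel code: the accounted count can be smaller than
 * what iteration really yields when a merge was accounted for but the
 * bios themselves were left untouched. */
struct toy_req {
	unsigned short nr_phys_segments;	/* accounted (possibly decremented) */
	unsigned short nr_iterated;		/* what the bvec iterator visits */
};

/* A driver typically writes one descriptor per iterated element into a
 * buffer sized from nr_phys_segments; return how many entries land past
 * the end of that buffer (0 means no overrun). */
static unsigned short overrun(const struct toy_req *rq)
{
	if (rq->nr_iterated > rq->nr_phys_segments)
		return rq->nr_iterated - rq->nr_phys_segments;
	return 0;
}
```

With two physically contiguous segments accounted as one, overrun()
reports a one-entry overwrite of the too-small allocation, which is the
KASAN hit mentioned later in this thread.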

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-13 12:03     ` Christoph Hellwig
@ 2019-05-13 12:37       ` Christoph Hellwig
  2019-05-14  4:36       ` Ming Lei
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-13 12:37 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, axboe, Matias Bjorling, linux-block

On Mon, May 13, 2019 at 02:03:44PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> > On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > > Currently ll_merge_requests_fn, unlike all other merge functions,
> > > reduces nr_phys_segments by one if the last segment of the previous,
> > > and the first segment of the next segment are contiguous.  While this
> > > seems like a nice solution to avoid building smaller than possible
> > 
> > Some workloads need this optimization, please see 729204ef49ec00b
> > ("block: relax check on sg gap"):
> 
> And we still allow to merge the segments with this patch.  The only
> difference is that these merges do get accounted as extra segments.

Trying mkfs.xfs there were no merges at all, which is expected as
it does perfectly sized direct I/O.

Trying mkfs.ext4 I see lots of bio merges.  But for those this patch
changes nothing at all, as we never decrement nr_phys_segments to start
with; we only ever did that for request merges, and for those also
only on non-gappy devices due to the way the
req_gap_back_merge check was placed.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 10/10] block: mark blk_rq_bio_prep as inline
  2019-05-13  6:37 ` [PATCH 10/10] block: mark blk_rq_bio_prep as inline Christoph Hellwig
@ 2019-05-13 14:57   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 27+ messages in thread
From: Chaitanya Kulkarni @ 2019-05-13 14:57 UTC (permalink / raw)
  To: Christoph Hellwig, axboe; +Cc: ming.lei, Matias Bjorling, linux-block

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

On 05/12/2019 11:39 PM, Christoph Hellwig wrote:
> This function just has a few trivial assignments and has two callers,
> one of them in the fastpath.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-core.c | 11 -----------
>   block/blk.h      | 13 ++++++++++++-
>   2 files changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index c894c9887dca..9405388ac658 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1473,17 +1473,6 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   }
>   EXPORT_SYMBOL_GPL(blk_update_request);
>
> -void blk_rq_bio_prep(struct request *rq, struct bio *bio, unsigned int nr_segs)
> -{
> -	rq->nr_phys_segments = nr_segs;
> -	rq->__data_len = bio->bi_iter.bi_size;
> -	rq->bio = rq->biotail = bio;
> -	rq->ioprio = bio_prio(bio);
> -
> -	if (bio->bi_disk)
> -		rq->rq_disk = bio->bi_disk;
> -}
> -
>   #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
>   /**
>    * rq_flush_dcache_pages - Helper function to flush all pages in a request
> diff --git a/block/blk.h b/block/blk.h
> index 5352cdb876a6..cbb0995ed17e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -51,7 +51,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
>   void blk_free_flush_queue(struct blk_flush_queue *q);
>
>   void blk_exit_queue(struct request_queue *q);
> -void blk_rq_bio_prep(struct request *rq, struct bio *bio, unsigned int nr_segs);
>   void blk_freeze_queue(struct request_queue *q);
>
>   static inline void blk_queue_enter_live(struct request_queue *q)
> @@ -100,6 +99,18 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
>   	return __bvec_gap_to_prev(q, bprv, offset);
>   }
>
> +static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
> +		unsigned int nr_segs)
> +{
> +	rq->nr_phys_segments = nr_segs;
> +	rq->__data_len = bio->bi_iter.bi_size;
> +	rq->bio = rq->biotail = bio;
> +	rq->ioprio = bio_prio(bio);
> +
> +	if (bio->bi_disk)
> +		rq->rq_disk = bio->bi_disk;
> +}
> +
>   #ifdef CONFIG_BLK_DEV_INTEGRITY
>   void blk_flush_integrity(void);
>   bool __bio_integrity_endio(struct bio *);
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep
  2019-05-13  6:37 ` [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep Christoph Hellwig
@ 2019-05-13 15:04   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 27+ messages in thread
From: Chaitanya Kulkarni @ 2019-05-13 15:04 UTC (permalink / raw)
  To: Christoph Hellwig, axboe; +Cc: ming.lei, Matias Bjorling, linux-block

With 's/fiel/field/', which can be done at the time of merge, this looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

On 05/12/2019 11:38 PM, Christoph Hellwig wrote:
> The priority fiel also make sense for passthrough requests, so
> initialize it in blk_rq_bio_prep.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 419d600e6637..7fb394dd3e11 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -716,7 +716,6 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
>   		req->cmd_flags |= REQ_FAILFAST_MASK;
>
>   	req->__sector = bio->bi_iter.bi_sector;
> -	req->ioprio = bio_prio(bio);
>   	req->write_hint = bio->bi_write_hint;
>   	blk_rq_bio_prep(req->q, req, bio);
>   }
> @@ -1494,6 +1493,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
>
>   	rq->__data_len = bio->bi_iter.bi_size;
>   	rq->bio = rq->biotail = bio;
> +	rq->ioprio = bio_prio(bio);
>
>   	if (bio->bi_disk)
>   		rq->rq_disk = bio->bi_disk;
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-13 12:03     ` Christoph Hellwig
  2019-05-13 12:37       ` Christoph Hellwig
@ 2019-05-14  4:36       ` Ming Lei
  2019-05-14  5:14         ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2019-05-14  4:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Matias Bjorling, linux-block

On Mon, May 13, 2019 at 02:03:44PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 05:45:45PM +0800, Ming Lei wrote:
> > On Mon, May 13, 2019 at 08:37:45AM +0200, Christoph Hellwig wrote:
> > > Currently ll_merge_requests_fn, unlike all other merge functions,
> > > reduces nr_phys_segments by one if the last segment of the previous,
> > > and the first segment of the next segment are contiguous.  While this
> > > seems like a nice solution to avoid building smaller than possible
> > 
> > Some workloads need this optimization, please see 729204ef49ec00b
> > ("block: relax check on sg gap"):
> 
> And we still allow to merge the segments with this patch.  The only
> difference is that these merges do get accounted as extra segments.

It is easy for .nr_phys_segments to reach the max segment limit this
way, and then no new bio can be merged any more.

We don't consider segment merges between two bios in ll_new_hw_segment();
in my mkfs test over virtio-blk, the request size can easily be increased
to ~1M (several segments) from 63k (126 bios/segments) if segment merging
between two bios is considered.
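The accounting difference can be sketched like this (a toy model with
made-up names, not the actual ll_new_hw_segment()): with merge-aware
accounting, 126 physically contiguous single-segment bios count as one
segment, while plain accounting counts 126 and runs into the queue's
segment limit much sooner.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: each merged bio contributes bio_segs segments; when the
 * request's last bvec is physically contiguous with the bio's first
 * bvec, the two boundary segments are really one. */
static unsigned int account(unsigned int req_segs, unsigned int bio_segs,
			    bool boundary_contiguous)
{
	if (boundary_contiguous && bio_segs > 0)
		return req_segs + bio_segs - 1;	/* merge-aware */
	return req_segs + bio_segs;		/* plain accounting */
}
```

Starting from one single-segment bio and merging 125 more contiguous
ones, the merge-aware count stays at 1 while the plain count reaches
126, close to a typical 128-segment queue limit.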

> 
> > > requests it causes a mismatch between the segments actually present
> > > in the request and those iterated over by the bvec iterators, including
> > > __rq_for_each_bio.  This could cause overwrites of too small kmalloc
> > 
> > Request based drivers usually shouldn't iterate bio any more.
> 
> We do that in a couple of places.  For one the nvme single segment
> optimization that triggered this bug.  Also for range discard support
> in nvme and virtio.  Then we have loop that  iterate the segments, but
> doesn't use the nr_phys_segments count, and plenty of others that
> iterate over pages at the moment but should be iterating bvecs,
> e.g. ubd or aoe.

It seems discard segment accounting doesn't consider bio merges for nvme
and virtio, so it should be fine this way.  I will take a closer look at
nvme/virtio discard segment merging later.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14  4:36       ` Ming Lei
@ 2019-05-14  5:14         ` Christoph Hellwig
  2019-05-14  9:05           ` Ming Lei
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-14  5:14 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, axboe, Matias Bjorling, linux-block

On Tue, May 14, 2019 at 12:36:43PM +0800, Ming Lei wrote:
> > > Some workloads need this optimization, please see 729204ef49ec00b
> > > ("block: relax check on sg gap"):
> > 
> > And we still allow to merge the segments with this patch.  The only
> > difference is that these merges do get accounted as extra segments.
> 
> It is easy for .nr_phys_segments to reach the max segment limit by this
> way, then no new bio can be merged any more.

As said in my other mail, we only decremented it for request merges
in the non-gap case before, and no one complained.

> We don't consider segment merge between two bios in ll_new_hw_segment(),
> in my mkfs test over virtio-blk, request size can be increased to ~1M(several
> segments) from 63k(126 bios/segments) easily if the segment merge between
> two bios is considered.

With the gap devices we have an unlimited segment size; see my next patch,
which actually enforces that.  This is much more efficient than using
multiple segments.  Also, instead of hacking up the merge path even more,
we can fix the block device buffered I/O path to submit large I/Os
instead of relying on merging, like we do in the direct I/O code and in
every major file system.  I have that on my plate as a todo list item.

> > We do that in a couple of places.  For one the nvme single segment
> > optimization that triggered this bug.  Also for range discard support
> > in nvme and virtio.  Then we have loop that  iterate the segments, but
> > doesn't use the nr_phys_segments count, and plenty of others that
> > iterate over pages at the moment but should be iterating bvecs,
> > e.g. ubd or aoe.
> 
> Seems discard segment doesn't consider bios merge for nvme and virtio,
> so it should be fine in this way. Will take a close look at nvme/virtio
> discard segment merge later.

I found the bio case by looking at doing the proper accounting in the
bio merge path and hitting KASAN warning due to the range kmalloc.
So that issue is real as well.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14  5:14         ` Christoph Hellwig
@ 2019-05-14  9:05           ` Ming Lei
  2019-05-14 13:51             ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2019-05-14  9:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Matias Bjorling, linux-block

[-- Attachment #1: Type: text/plain, Size: 3297 bytes --]

On Tue, May 14, 2019 at 07:14:41AM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 12:36:43PM +0800, Ming Lei wrote:
> > > > Some workloads need this optimization, please see 729204ef49ec00b
> > > > ("block: relax check on sg gap"):
> > > 
> > > And we still allow to merge the segments with this patch.  The only
> > > difference is that these merges do get accounted as extra segments.
> > 
> > It is easy for .nr_phys_segments to reach the max segment limit by this
> > way, then no new bio can be merged any more.
> 
> As said in my other mail we only decremented it for request merges
> in the non-gap case before and no one complained.

However, we may still make it better: for example, the attached patch can
save 10~20% of the time when running 'mkfs.ntfs -s 512 /dev/vda', as lots
of small requests (63KB) can be merged into big I/Os (1MB).

> 
> > We don't consider segment merge between two bios in ll_new_hw_segment(),
> > in my mkfs test over virtio-blk, request size can be increased to ~1M(several
> > segments) from 63k(126 bios/segments) easily if the segment merge between
> > two bios is considered.
> 
> With the gap devices we have unlimited segment size, see my next patch
> to actually enforce that.  Which is much more efficient than using

But this patch does have an effect on non-gap devices, and actually most
devices are of the non-gap type.

> multiple segments.  Also instead of hacking up the merge path even more
> we can fix the block device buffered I/O path to submit large I/Os
> instead of relying on merging like we do in the direct I/O code and every
> major file system.  I have that on my plate as a todo list item.
> 
> > > We do that in a couple of places.  For one the nvme single segment
> > > optimization that triggered this bug.  Also for range discard support
> > > in nvme and virtio.  Then we have loop that  iterate the segments, but
> > > doesn't use the nr_phys_segments count, and plenty of others that
> > > iterate over pages at the moment but should be iterating bvecs,
> > > e.g. ubd or aoe.
> > 
> > Seems discard segment doesn't consider bios merge for nvme and virtio,
> > so it should be fine in this way. Will take a close look at nvme/virtio
> > discard segment merge later.
> 
> I found the bio case by looking at doing the proper accounting in the
> bio merge path and hitting KASAN warning due to the range kmalloc.
> So that issue is real as well.

I guess the following patch may fix it:

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f1d90cd3dc47..a913110de389 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -175,7 +175,7 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
 
 static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
 {
-	unsigned short segments = blk_rq_nr_discard_segments(req);
+	unsigned short segments = 0;
 	unsigned short n = 0;
 	struct virtio_blk_discard_write_zeroes *range;
 	struct bio *bio;
@@ -184,6 +184,9 @@ static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
 	if (unmap)
 		flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
 
+	__rq_for_each_bio(bio, req)
+		segments++;
+
 	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
 	if (!range)
 		return -ENOMEM;
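The idea of that fix, modeled in userspace (a hypothetical toy list, not
struct bio): size the range array from the number of bios actually
present in the request, which is exactly what the __rq_for_each_bio fill
loop will visit, rather than from a segment count that may no longer
match.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio {
	struct toy_bio *next;	/* singly linked chain, like rq->bio */
};

/* Count the bios the per-bio fill loop will actually visit; an array
 * allocated from this count can never be undersized for that loop. */
static unsigned short count_bios(const struct toy_bio *head)
{
	unsigned short n = 0;
	const struct toy_bio *b;

	for (b = head; b; b = b->next)
		n++;
	return n;
}
```

The allocation and the loop then share one source of truth, so no
accounting change elsewhere can reintroduce the mismatch.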


Thanks,
Ming

[-- Attachment #2: 0001-block-merge-bios-segment.patch --]
[-- Type: text/plain, Size: 1110 bytes --]

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 21e87a714a73..af122efeb28d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -622,6 +622,8 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct bio *bio)
 {
 	int nr_phys_segs = bio_phys_segments(q, bio);
+	unsigned int seg_size =
+		req->biotail->bi_seg_back_size + bio->bi_seg_front_size;
 
 	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
 		goto no_merge;
@@ -629,11 +631,15 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 	if (blk_integrity_merge_bio(q, req, bio) == false)
 		goto no_merge;
 
-	/*
-	 * This will form the start of a new hw segment.  Bump both
-	 * counters.
-	 */
-	req->nr_phys_segments += nr_phys_segs;
+	if (blk_phys_contig_segment(q, req->biotail, bio)) {
+		if (nr_phys_segs)
+			req->nr_phys_segments += nr_phys_segs - 1;
+		if (nr_phys_segs == 1)
+			bio->bi_seg_back_size = seg_size;
+		if (req->nr_phys_segments == 1)
+			req->bio->bi_seg_front_size = seg_size;
+	} else
+		req->nr_phys_segments += nr_phys_segs;
 	return 1;
 
 no_merge:
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14  9:05           ` Ming Lei
@ 2019-05-14 13:51             ` Christoph Hellwig
  2019-05-14 13:57               ` Hannes Reinecke
  2019-05-14 14:27               ` Ming Lei
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-14 13:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, axboe, Matias Bjorling, linux-block

On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
> However we still may make it better, for example, the attached patch can
> save 10~20% time when running 'mkfs.ntfs -s 512 /dev/vda', lots of small
> request(63KB) can be merged to big IO(1MB).

And we could save even more time by making the block dev buffered I/O
path not do stupid things to start with.

> > With the gap devices we have unlimited segment size, see my next patch
> > to actually enforce that.  Which is much more efficient than using
> 
> But this patch does effect on non-gap device, and actually most of
> device are non-gap type.

Yes, but only for request merges, and only if merging the requests
goes over max_requests.  The upside is that we actually get a
nr_phys_segments that mirrors what is in the request, fixing bugs
in a few drivers, and allowing for follow on patches that significantly
simplify our I/O path.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14 13:51             ` Christoph Hellwig
@ 2019-05-14 13:57               ` Hannes Reinecke
  2019-05-14 14:27               ` Ming Lei
  1 sibling, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2019-05-14 13:57 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei; +Cc: axboe, Matias Bjorling, linux-block

On 5/14/19 3:51 PM, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
>> However we still may make it better, for example, the attached patch can
>> save 10~20% time when running 'mkfs.ntfs -s 512 /dev/vda', lots of small
>> request(63KB) can be merged to big IO(1MB).
> 
> And we could save even more time by making the block dev buffered I/O
> path not do stupid things to start with.
> 
>>> With the gap devices we have unlimited segment size, see my next patch
>>> to actually enforce that.  Which is much more efficient than using
>>
>> But this patch does effect on non-gap device, and actually most of
>> device are non-gap type.
> 
> Yes, but only for request merges, and only if merging the requests
> goes over max_requests.  The upside is that we actually get a
> nr_phys_segments that mirrors what is in the request, fixing bugs
> in a few drivers, and allowing for follow on patches that significantly
> simplify our I/O path.
> 
And we've seen several of these crashes in real life; signature is 
'Kernel oops at drivers/scsi/scsi_lib.c:1003'.
So I'm really glad that this one is finally being addressed.

Thanks Christoph!

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14 13:51             ` Christoph Hellwig
  2019-05-14 13:57               ` Hannes Reinecke
@ 2019-05-14 14:27               ` Ming Lei
  2019-05-14 14:31                 ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2019-05-14 14:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Matias Bjorling, linux-block

On Tue, May 14, 2019 at 03:51:42PM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 05:05:45PM +0800, Ming Lei wrote:
> > However we still may make it better, for example, the attached patch can
> > save 10~20% time when running 'mkfs.ntfs -s 512 /dev/vda', lots of small
> > request(63KB) can be merged to big IO(1MB).
> 
> And we could save even more time by making the block dev buffered I/O
> path not do stupid things to start with.

I am wondering if that can be done easily, given that mkfs is userspace
code which only calls the write syscall on the block device.  Could you
share something about how to fix the stupid things?

> 
> > > With the gap devices we have unlimited segment size, see my next patch
> > > to actually enforce that.  Which is much more efficient than using
> > 
> > But this patch does effect on non-gap device, and actually most of
> > device are non-gap type.
> 
> Yes, but only for request merges, and only if merging the requests
> goes over max_requests.  The upside is that we actually get a
> nr_phys_segments that mirrors what is in the request, fixing bugs
> in a few drivers, and allowing for follow on patches that significantly
> simplify our I/O path.

Non-gap devices still have a max segment size limit, and I guess it still
needs to be respected.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14 14:27               ` Ming Lei
@ 2019-05-14 14:31                 ` Christoph Hellwig
  2019-05-14 14:32                   ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-14 14:31 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, axboe, Matias Bjorling, linux-block

On Tue, May 14, 2019 at 10:27:17PM +0800, Ming Lei wrote:
> I am wondering if it can be done easily, given mkfs is userspace
> which only calls write syscall on block device. Or could you share
> something about how to fix the stupid things?

mkfs.ext4 at least uses buffered I/O on the block device.  And the
block device uses the really old buffer_head based address_space ops,
which will submit one bio per buffer_head, that is, per logical block.
mkfs probably writes much larger sizes than that.
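The cost is easy to quantify with simple arithmetic (block sizes below
are assumed for illustration): one bio per buffer_head means one bio per
logical block, so a large buffered write is shattered before merging
ever gets a chance to reassemble it.

```c
#include <assert.h>

/* One bio per logical block, as the buffer_head based path submits. */
static unsigned long bios_per_write(unsigned long bytes,
				    unsigned long block_size)
{
	return (bytes + block_size - 1) / block_size;
}
```

A 1 MiB write over 512-byte blocks is submitted as 2048 bios, and even
with 4096-byte blocks it is still 256; an mpage-style path could issue
it as a handful of large bios instead.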

> > nr_phys_segments that mirrors what is in the request, fixing bugs
> > in a few drivers, and allowing for follow on patches that significantly
> > simplify our I/O path.
> 
> non-gap device still has max segment size limit, and I guess it still
> needs to be respected.

It is still respected.  It might just disallow some merges we could
theoretically allow.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments
  2019-05-14 14:31                 ` Christoph Hellwig
@ 2019-05-14 14:32                   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-05-14 14:32 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, axboe, Matias Bjorling, linux-block

On Tue, May 14, 2019 at 04:31:02PM +0200, Christoph Hellwig wrote:
> On Tue, May 14, 2019 at 10:27:17PM +0800, Ming Lei wrote:
> > I am wondering if it can be done easily, given mkfs is userspace
> > which only calls write syscall on block device. Or could you share
> > something about how to fix the stupid things?
> 
> mkfs.ext4 at least uses buffered I/O on the block device.  And the
> block device uses the really old buffer head based address_space ops,
> which will submit one bio per buffer_head, that is per logic block.
> mkfs probably writes much larger sizes than that..

As a first step we could try something like the patch below, although
the mpage ops still aren't exactly optimal:


diff --git a/fs/block_dev.c b/fs/block_dev.c
index bded2ee3788d..b2ee74f1c669 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -608,12 +608,12 @@ EXPORT_SYMBOL(thaw_bdev);
 
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return block_write_full_page(page, blkdev_get_block, wbc);
+	return mpage_writepage(page, blkdev_get_block, wbc);
 }
 
 static int blkdev_readpage(struct file * file, struct page * page)
 {
-	return block_read_full_page(page, blkdev_get_block);
+	return mpage_readpage(page, blkdev_get_block);
 }
 
 static int blkdev_readpages(struct file *file, struct address_space *mapping,
@@ -1984,7 +1984,7 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static int blkdev_writepages(struct address_space *mapping,
 			     struct writeback_control *wbc)
 {
-	return generic_writepages(mapping, wbc);
+	return mpage_writepages(mapping, wbc, blkdev_get_block);
 }
 
 static const struct address_space_operations def_blk_aops = {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 02/10] block: force an unlimited segment size on queues with a virt boundary
  2019-05-13  6:37 ` [PATCH 02/10] block: force an unlimited segment size on queues with a virt boundary Christoph Hellwig
@ 2019-05-15  8:19   ` Ming Lei
  0 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2019-05-15  8:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Matias Bjorling, linux-block

On Mon, May 13, 2019 at 08:37:46AM +0200, Christoph Hellwig wrote:
> We currently fail to update the front/back segment size in the bio when
> deciding to allow an otherwise gappy segment to a device with a
> virt boundary.  The reason why this did not cause problems is that
> devices with a virt boundary fundamentally don't use segments as we
> know it and thus don't care.  Make that assumption formal by forcing
> an unlimited segment size in this case.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-settings.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 3facc41476be..2ae348c101a0 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -310,6 +310,9 @@ void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
>  		       __func__, max_size);
>  	}
>  
> +	/* see blk_queue_virt_boundary() for the explanation */
> +	WARN_ON_ONCE(q->limits.virt_boundary_mask);
> +
>  	q->limits.max_segment_size = max_size;
>  }
>  EXPORT_SYMBOL(blk_queue_max_segment_size);
> @@ -742,6 +745,14 @@ EXPORT_SYMBOL(blk_queue_segment_boundary);
>  void blk_queue_virt_boundary(struct request_queue *q, unsigned long mask)
>  {
>  	q->limits.virt_boundary_mask = mask;
> +
> +	/*
> +	 * Devices that require a virtual boundary do not support scatter/gather
> +	 * I/O natively, but instead require a descriptor list entry for each
> + * page (which might not be identical to the Linux PAGE_SIZE).  Because
> +	 * of that they are not limited by our notion of "segment size".
> +	 */
> +	q->limits.max_segment_size = UINT_MAX;
>  }
>  EXPORT_SYMBOL(blk_queue_virt_boundary);

None of the drivers which set a virt boundary set a max segment size,
and I guess the DMA controller may partition data in units of
(virt_boundary + 1), so in theory this patch is correct.
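That partitioning can be sketched as plain arithmetic (a hypothetical
helper, not a kernel function): a virt-boundary device describes data in
fixed-size units of (virt_boundary_mask + 1) bytes, so the descriptor
count depends only on how many such units the buffer touches, never on a
segment size limit.

```c
#include <assert.h>

/* Number of (mask + 1)-sized device pages a buffer at addr/len touches;
 * mask must be one less than a power of two, len must be non-zero. */
static unsigned long nr_device_pages(unsigned long addr, unsigned long len,
				     unsigned long mask)
{
	unsigned long first = addr & ~mask;
	unsigned long last = (addr + len - 1) & ~mask;

	return (last - first) / (mask + 1) + 1;
}
```

For a 4K boundary (mask 0xfff), an aligned 8K buffer needs two entries,
and a 32-byte buffer straddling a 4K boundary also needs two, even
though it is tiny.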

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 03/10] block: remove the segment size check in bio_will_gap
  2019-05-13  6:37 ` [PATCH 03/10] block: remove the segment size check in bio_will_gap Christoph Hellwig
@ 2019-05-15  8:34   ` Ming Lei
  0 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2019-05-15  8:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Matias Bjorling, linux-block

On Mon, May 13, 2019 at 08:37:47AM +0200, Christoph Hellwig wrote:
> We fundamentally do not have a maximum segment size for devices with a
> virt boundary.  So don't bother checking it, especially given that the
> existing checks didn't properly work to start with as we never update
> bi_seg_back_size after a successful merge, and for front merges would

.bi_seg_back_size only needs to be updated in the case of a single
segment request.

However, ll_new_hw_segment() does not merge segments, so the existing
check works fine.

> have had to check bi_seg_front_size anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-merge.c | 19 +------------------
>  1 file changed, 1 insertion(+), 18 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 80a5a0facb87..eee2c02c50ce 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -12,23 +12,6 @@
>  
>  #include "blk.h"
>  
> -/*
> - * Check if the two bvecs from two bios can be merged to one segment.  If yes,
> - * no need to check gap between the two bios since the 1st bio and the 1st bvec
> - * in the 2nd bio can be handled in one segment.
> - */
> -static inline bool bios_segs_mergeable(struct request_queue *q,
> -		struct bio *prev, struct bio_vec *prev_last_bv,
> -		struct bio_vec *next_first_bv)
> -{
> -	if (!biovec_phys_mergeable(q, prev_last_bv, next_first_bv))
> -		return false;
> -	if (prev->bi_seg_back_size + next_first_bv->bv_len >
> -			queue_max_segment_size(q))
> -		return false;
> -	return true;
> -}
> -
>  static inline bool bio_will_gap(struct request_queue *q,
>  		struct request *prev_rq, struct bio *prev, struct bio *next)
>  {
> @@ -60,7 +43,7 @@ static inline bool bio_will_gap(struct request_queue *q,
>  	 */
>  	bio_get_last_bvec(prev, &pb);
>  	bio_get_first_bvec(next, &nb);
> -	if (bios_segs_mergeable(q, prev, &pb, &nb))
> +	if (biovec_phys_mergeable(q, &pb, &nb))
>  		return false;
>  	return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
>  }
> -- 
> 2.20.1
> 

The patch itself is good, if the commit log is fixed:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

thanks,
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, back to index

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-13  6:37 fix nr_phys_segments vs iterators accounting Christoph Hellwig
2019-05-13  6:37 ` [PATCH 01/10] block: don't decrement nr_phys_segments for physically contigous segments Christoph Hellwig
2019-05-13  9:45   ` Ming Lei
2019-05-13 12:03     ` Christoph Hellwig
2019-05-13 12:37       ` Christoph Hellwig
2019-05-14  4:36       ` Ming Lei
2019-05-14  5:14         ` Christoph Hellwig
2019-05-14  9:05           ` Ming Lei
2019-05-14 13:51             ` Christoph Hellwig
2019-05-14 13:57               ` Hannes Reinecke
2019-05-14 14:27               ` Ming Lei
2019-05-14 14:31                 ` Christoph Hellwig
2019-05-14 14:32                   ` Christoph Hellwig
2019-05-13  6:37 ` [PATCH 02/10] block: force an unlimited segment size on queues with a virt boundary Christoph Hellwig
2019-05-15  8:19   ` Ming Lei
2019-05-13  6:37 ` [PATCH 03/10] block: remove the segment size check in bio_will_gap Christoph Hellwig
2019-05-15  8:34   ` Ming Lei
2019-05-13  6:37 ` [PATCH 04/10] block: remove the bi_seg_{front,back}_size fields in struct bio Christoph Hellwig
2019-05-13  6:37 ` [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep Christoph Hellwig
2019-05-13 15:04   ` Chaitanya Kulkarni
2019-05-13  6:37 ` [PATCH 06/10] block: remove blk_init_request_from_bio Christoph Hellwig
2019-05-13  6:37 ` [PATCH 07/10] block: remove the bi_phys_segments field in struct bio Christoph Hellwig
2019-05-13  6:37 ` [PATCH 08/10] block: simplify blk_recalc_rq_segments Christoph Hellwig
2019-05-13  6:37 ` [PATCH 09/10] block: untangle the end of blk_bio_segment_split Christoph Hellwig
2019-05-13  6:37 ` [PATCH 10/10] block: mark blk_rq_bio_prep as inline Christoph Hellwig
2019-05-13 14:57   ` Chaitanya Kulkarni
     [not found] ` <CGME20190513063855epcas5p33ef8c4c0a0055bd0b66eadc859796f0f@epcms2p6>
2019-05-13  7:34   ` [PATCH 05/10] block: initialize the write priority in blk_rq_bio_prep Minwoo Im

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox