All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/3] Submit zoned requests in LBA order per zone
@ 2023-03-20 23:49 Bart Van Assche
  2023-03-20 23:49 ` [PATCH v2 1/3] block: Split blk_recalc_rq_segments() Bart Van Assche
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Bart Van Assche @ 2023-03-20 23:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jaegeuk Kim, Bart Van Assche

Hi Jens,

For zoned storage it is essential that writes are submitted in LBA order per
zone. This patch series ensures this for bio splitting and requeuing. Please
consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v1:
- Renamed bio_nr_segments() into bio_chain_nr_segments().
- Fixed the number of segments calculation in __bio_split_to_limits().
- Added a patch for the requeuing code path.

Bart Van Assche (3):
  block: Split blk_recalc_rq_segments()
  block: Split and submit bios in LBA order
  block: Preserve LBA order when requeuing

 block/blk-merge.c | 49 ++++++++++++++++++++++++++++++++---------------
 block/blk-mq.c    | 45 +++++++++++++++++++++++++++++++++++++------
 block/blk.h       |  2 ++
 3 files changed, 75 insertions(+), 21 deletions(-)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/3] block: Split blk_recalc_rq_segments()
  2023-03-20 23:49 [PATCH v2 0/3] Submit zoned requests in LBA order per zone Bart Van Assche
@ 2023-03-20 23:49 ` Bart Van Assche
  2023-03-20 23:49 ` [PATCH v2 2/3] block: Split and submit bios in LBA order Bart Van Assche
  2023-03-20 23:49 ` [PATCH v2 3/3] block: Preserve LBA order when requeuing Bart Van Assche
  2 siblings, 0 replies; 11+ messages in thread
From: Bart Van Assche @ 2023-03-20 23:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Bart Van Assche, Christoph Hellwig,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

Prepare for adding an additional call to bio_chain_nr_segments().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-merge.c | 26 ++++++++++++++++----------
 block/blk.h       |  2 ++
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..d6f8552ef209 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -409,22 +409,22 @@ struct bio *bio_split_to_limits(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_split_to_limits);
 
-unsigned int blk_recalc_rq_segments(struct request *rq)
+/* Calculate the number of DMA segments in a bio chain. */
+unsigned int bio_chain_nr_segments(struct bio *bio,
+				   const struct queue_limits *lim)
 {
 	unsigned int nr_phys_segs = 0;
 	unsigned int bytes = 0;
-	struct req_iterator iter;
+	struct bvec_iter iter;
 	struct bio_vec bv;
 
-	if (!rq->bio)
+	if (!bio)
 		return 0;
 
-	switch (bio_op(rq->bio)) {
+	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		if (queue_max_discard_segments(rq->q) > 1) {
-			struct bio *bio = rq->bio;
-
+		if (lim->max_discard_segments > 1) {
 			for_each_bio(bio)
 				nr_phys_segs++;
 			return nr_phys_segs;
@@ -436,12 +436,18 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
 		break;
 	}
 
-	rq_for_each_bvec(bv, rq, iter)
-		bvec_split_segs(&rq->q->limits, &bv, &nr_phys_segs, &bytes,
-				UINT_MAX, UINT_MAX);
+	for_each_bio(bio)
+		bio_for_each_bvec(bv, bio, iter)
+			bvec_split_segs(lim, &bv, &nr_phys_segs, &bytes,
+					UINT_MAX, UINT_MAX);
 	return nr_phys_segs;
 }
 
+unsigned int blk_recalc_rq_segments(struct request *rq)
+{
+	return bio_chain_nr_segments(rq->bio, &rq->q->limits);
+}
+
 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
diff --git a/block/blk.h b/block/blk.h
index d65d96994a94..ea15b1a4c2b7 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -330,6 +330,8 @@ int ll_back_merge_fn(struct request *req, struct bio *bio,
 		unsigned int nr_segs);
 bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 				struct request *next);
+unsigned int bio_chain_nr_segments(struct bio *bio,
+				   const struct queue_limits *lim);
 unsigned int blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
 bool blk_rq_merge_ok(struct request *rq, struct bio *bio);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/3] block: Split and submit bios in LBA order
  2023-03-20 23:49 [PATCH v2 0/3] Submit zoned requests in LBA order per zone Bart Van Assche
  2023-03-20 23:49 ` [PATCH v2 1/3] block: Split blk_recalc_rq_segments() Bart Van Assche
@ 2023-03-20 23:49 ` Bart Van Assche
  2023-03-21  6:03   ` Christoph Hellwig
  2023-03-20 23:49 ` [PATCH v2 3/3] block: Preserve LBA order when requeuing Bart Van Assche
  2 siblings, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2023-03-20 23:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Bart Van Assche, Christoph Hellwig,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

Submit the bio fragment with the lowest LBA first. This approach prevents
write errors when submitting large bios to host-managed zoned block devices.
This patch only modifies the behavior of drivers that call
bio_split_to_limits() directly. This includes DRBD, pktcdvd, dm, md and
the NVMe multipath code.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-merge.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d6f8552ef209..7281f2d91b2f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -345,8 +345,8 @@ EXPORT_SYMBOL_GPL(bio_split_rw);
  * @nr_segs: returns the number of segments in the returned bio
  *
  * Check if @bio needs splitting based on the queue limits, and if so split off
- * a bio fitting the limits from the beginning of @bio and return it.  @bio is
- * shortened to the remainder and re-submitted.
+ * a bio fitting the limits from the beginning of @bio. @bio is shortened to
+ * the remainder.
  *
  * The split bio is allocated from @q->bio_split, which is provided by the
  * block layer.
@@ -379,10 +379,23 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 		split->bi_opf |= REQ_NOMERGE;
 
 		blkcg_bio_issue_init(split);
-		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
-		submit_bio_noacct(bio);
-		return split;
+		if (current->bio_list) {
+			/*
+			 * The caller will submit the first half ('split')
+			 * before the second half ('bio').
+			 */
+			bio_chain(split, bio);
+			submit_bio_noacct(bio);
+			return split;
+		}
+		/*
+		 * Submit the first half ('split') let the caller submit the
+		 * second half ('bio').
+		 */
+		*nr_segs = bio_chain_nr_segments(bio, lim);
+		bio_chain(split, bio);
+		submit_bio_noacct(split);
 	}
 	return bio;
 }

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 3/3] block: Preserve LBA order when requeuing
  2023-03-20 23:49 [PATCH v2 0/3] Submit zoned requests in LBA order per zone Bart Van Assche
  2023-03-20 23:49 ` [PATCH v2 1/3] block: Split blk_recalc_rq_segments() Bart Van Assche
  2023-03-20 23:49 ` [PATCH v2 2/3] block: Split and submit bios in LBA order Bart Van Assche
@ 2023-03-20 23:49 ` Bart Van Assche
  2023-03-21  5:58   ` Christoph Hellwig
  2 siblings, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2023-03-20 23:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Bart Van Assche, Christoph Hellwig,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

When requeuing a request to a zoned block device, preserve the LBA order
per zone.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index cc32ad0cd548..2ec7d6140114 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1495,6 +1495,44 @@ static void blk_mq_requeue_work(struct work_struct *work)
 	blk_mq_run_hw_queues(q, false);
 }
 
+static void blk_mq_insert_rq(struct request *rq, struct list_head *list,
+			     bool at_head)
+{
+	bool zone_in_list = false;
+	struct request *rq2;
+
+	/*
+	 * For request queues associated with a zoned block device, check
+	 * whether another request for the same zone has already been queued.
+	 */
+	if (blk_queue_is_zoned(rq->q)) {
+		const unsigned int zno = blk_rq_zone_no(rq);
+
+		list_for_each_entry(rq2, list, queuelist) {
+			if (blk_rq_zone_no(rq2) == zno) {
+				zone_in_list = true;
+				if (blk_rq_pos(rq) < blk_rq_pos(rq2))
+					break;
+			}
+		}
+	}
+	if (!zone_in_list) {
+		if (at_head) {
+			rq->rq_flags |= RQF_SOFTBARRIER;
+			list_add(&rq->queuelist, list);
+		} else {
+			list_add_tail(&rq->queuelist, list);
+		}
+	} else {
+		/*
+		 * Insert the request in the list before another request for
+		 * the same zone and with a higher LBA. If there is no such
+		 * request, insert the request at the end of the list.
+		 */
+		list_add_tail(&rq->queuelist, &rq2->queuelist);
+	}
+}
+
 void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 				bool kick_requeue_list)
 {
@@ -1508,12 +1546,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 	BUG_ON(rq->rq_flags & RQF_SOFTBARRIER);
 
 	spin_lock_irqsave(&q->requeue_lock, flags);
-	if (at_head) {
-		rq->rq_flags |= RQF_SOFTBARRIER;
-		list_add(&rq->queuelist, &q->requeue_list);
-	} else {
-		list_add_tail(&rq->queuelist, &q->requeue_list);
-	}
+	blk_mq_insert_rq(rq, &q->requeue_list, at_head);
 	spin_unlock_irqrestore(&q->requeue_lock, flags);
 
 	if (kick_requeue_list)

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 3/3] block: Preserve LBA order when requeuing
  2023-03-20 23:49 ` [PATCH v2 3/3] block: Preserve LBA order when requeuing Bart Van Assche
@ 2023-03-21  5:58   ` Christoph Hellwig
  2023-03-21 14:46     ` Bart Van Assche
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-03-21  5:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

On Mon, Mar 20, 2023 at 04:49:05PM -0700, Bart Van Assche wrote:
> When requeuing a request to a zoned block device, preserve the LBA order
> per zone.

What causes this requeue?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/3] block: Split and submit bios in LBA order
  2023-03-20 23:49 ` [PATCH v2 2/3] block: Split and submit bios in LBA order Bart Van Assche
@ 2023-03-21  6:03   ` Christoph Hellwig
  2023-03-21  6:05     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-03-21  6:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

On Mon, Mar 20, 2023 at 04:49:04PM -0700, Bart Van Assche wrote:
> Submit the bio fragment with the lowest LBA first. This approach prevents
> write errors when submitting large bios to host-managed zoned block devices.
> This patch only modifies the behavior of drivers that call
> bio_split_to_limits() directly. This includes DRBD, pktcdvd, dm, md and
> the NVMe multipath code.

Umm, doesn't it also change how blk-mq splits, which is the prime
reason why you're looking into that?

> +		if (current->bio_list) {
> +			/*
> +			 * The caller will submit the first half ('split')
> +			 * before the second half ('bio').
> +			 */
> +			bio_chain(split, bio);
> +			submit_bio_noacct(bio);
> +			return split;
> +		}
> +		/*
> +		 * Submit the first half ('split') let the caller submit the
> +		 * second half ('bio').
> +		 */
> +		*nr_segs = bio_chain_nr_segments(bio, lim);
> +		bio_chain(split, bio);
> +		submit_bio_noacct(split);

I'm really confused on why you want to change the behavior here
for the case where run in a stacking context vs not, and neither
the comments nor the commit log help me trying to figure out why.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/3] block: Split and submit bios in LBA order
  2023-03-21  6:03   ` Christoph Hellwig
@ 2023-03-21  6:05     ` Christoph Hellwig
  2023-03-21 18:44       ` Bart Van Assche
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2023-03-21  6:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

On Tue, Mar 21, 2023 at 07:03:07AM +0100, Christoph Hellwig wrote:
> > +		if (current->bio_list) {
> > +			/*
> > +			 * The caller will submit the first half ('split')
> > +			 * before the second half ('bio').
> > +			 */
> > +			bio_chain(split, bio);
> > +			submit_bio_noacct(bio);
> > +			return split;
> > +		}
> > +		/*
> > +		 * Submit the first half ('split') let the caller submit the
> > +		 * second half ('bio').
> > +		 */
> > +		*nr_segs = bio_chain_nr_segments(bio, lim);
> > +		bio_chain(split, bio);
> > +		submit_bio_noacct(split);
> 
> I'm really confused on why you want to change the behavior here
> for the case where run in a stacking context vs not, and neither
> the comments nor the commit log help me trying to figure out why.

In fact I wonder how you managed to get into __bio_split_to_limits
wtih a NULL current->bio_list at all.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 3/3] block: Preserve LBA order when requeuing
  2023-03-21  5:58   ` Christoph Hellwig
@ 2023-03-21 14:46     ` Bart Van Assche
  2023-03-23  8:19       ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2023-03-21 14:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Ming Lei, Damien Le Moal,
	Johannes Thumshirn

On 3/20/23 22:58, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 04:49:05PM -0700, Bart Van Assche wrote:
>> When requeuing a request to a zoned block device, preserve the LBA order
>> per zone.
> 
> What causes this requeue?

Hi Christoph,

Two examples of why the SCSI core can decide to requeue a command are a 
retryable unit attention or ufshcd_queuecommand() returning 
SCSI_MLQUEUE_HOST_BUSY. For example, ufshcd_queuecommand() returns 
SCSI_MLQUEUE_HOST_BUSY while clock scaling is in progress (changing the 
frequency of the link between host controller and UFS device).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/3] block: Split and submit bios in LBA order
  2023-03-21  6:05     ` Christoph Hellwig
@ 2023-03-21 18:44       ` Bart Van Assche
  2023-03-23  8:22         ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2023-03-21 18:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Ming Lei, Damien Le Moal,
	Johannes Thumshirn

On 3/20/23 23:05, Christoph Hellwig wrote:
> In fact I wonder how you managed to get into __bio_split_to_limits
> wtih a NULL current->bio_list at all.

Hi Christoph,

This patch series is what I came up with after having observed an 
UNALIGNED WRITE COMMAND response with a 5.15 kernel. In that kernel 
version (but probably not in the latest upstream kernel) the device 
mapper core submits split bios in the wrong order. I will check again 
whether this patch is really necessary.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 3/3] block: Preserve LBA order when requeuing
  2023-03-21 14:46     ` Bart Van Assche
@ 2023-03-23  8:19       ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-03-23  8:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

On Tue, Mar 21, 2023 at 07:46:51AM -0700, Bart Van Assche wrote:
> On 3/20/23 22:58, Christoph Hellwig wrote:
>> On Mon, Mar 20, 2023 at 04:49:05PM -0700, Bart Van Assche wrote:
>>> When requeuing a request to a zoned block device, preserve the LBA order
>>> per zone.
>>
>> What causes this requeue?
>
> Hi Christoph,
>
> Two examples of why the SCSI core can decide to requeue a command are a 
> retryable unit attention or ufshcd_queuecommand() returning 
> SCSI_MLQUEUE_HOST_BUSY. For example, ufshcd_queuecommand() returns 
> SCSI_MLQUEUE_HOST_BUSY while clock scaling is in progress (changing the 
> frequency of the link between host controller and UFS device).

None of these should happen as the upper layers enforce a per-zone
queue depth of 1.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/3] block: Split and submit bios in LBA order
  2023-03-21 18:44       ` Bart Van Assche
@ 2023-03-23  8:22         ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2023-03-23  8:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Ming Lei, Damien Le Moal, Johannes Thumshirn

On Tue, Mar 21, 2023 at 11:44:59AM -0700, Bart Van Assche wrote:
> On 3/20/23 23:05, Christoph Hellwig wrote:
>> In fact I wonder how you managed to get into __bio_split_to_limits
>> wtih a NULL current->bio_list at all.
>
> Hi Christoph,
>
> This patch series is what I came up with after having observed an UNALIGNED 
> WRITE COMMAND response with a 5.15 kernel. In that kernel version (but 
> probably not in the latest upstream kernel) the device mapper core submits 
> split bios in the wrong order. I will check again whether this patch is 
> really necessary.

How did that escape the upper layer serialization of zoned writes?

But my point is that all ->submit_bio instances as well as
blk_mq_submit_bio are only called with a non-NULL current->bio_list.
So I don't think you can reach the NULL current->bio_list case in this
patch.

Note that even without any specific zone device isss, submitting bios
in LBA oders works best for both spinning and solid state media,
but all of the arguments here don't add up.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-03-23  8:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-20 23:49 [PATCH v2 0/3] Submit zoned requests in LBA order per zone Bart Van Assche
2023-03-20 23:49 ` [PATCH v2 1/3] block: Split blk_recalc_rq_segments() Bart Van Assche
2023-03-20 23:49 ` [PATCH v2 2/3] block: Split and submit bios in LBA order Bart Van Assche
2023-03-21  6:03   ` Christoph Hellwig
2023-03-21  6:05     ` Christoph Hellwig
2023-03-21 18:44       ` Bart Van Assche
2023-03-23  8:22         ` Christoph Hellwig
2023-03-20 23:49 ` [PATCH v2 3/3] block: Preserve LBA order when requeuing Bart Van Assche
2023-03-21  5:58   ` Christoph Hellwig
2023-03-21 14:46     ` Bart Van Assche
2023-03-23  8:19       ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.