linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Submit split bios in LBA order
@ 2023-03-17 19:59 Bart Van Assche
  2023-03-17 19:59 ` [PATCH 1/2] block: Split blk_recalc_rq_segments() Bart Van Assche
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-03-17 19:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Jan Kara, Bart Van Assche

Hi Jens,

For zoned storage it is essential that split bios are submitted in LBA order.
This patch series realizes this by modifying __bio_split_to_limits() such that
it submits the first bio fragment and returns the remainder instead of
submitting the remainder and returning the first bio fragment. Please consider
this patch series for the next merge window.

Thanks,

Bart.

Bart Van Assche (2):
  block: Split blk_recalc_rq_segments()
  block: Split and submit bios in LBA order

 block/blk-merge.c      | 33 +++++++++++++++++++--------------
 block/blk-mq.c         |  7 +++++--
 block/blk.h            |  1 +
 include/linux/blk-mq.h |  6 ++++++
 4 files changed, 31 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/2] block: Split blk_recalc_rq_segments()
  2023-03-17 19:59 [PATCH 0/2] Submit split bios in LBA order Bart Van Assche
@ 2023-03-17 19:59 ` Bart Van Assche
  2023-03-18  6:38   ` Christoph Hellwig
  2023-03-17 19:59 ` [PATCH 2/2] block: Split and submit bios in LBA order Bart Van Assche
  2023-03-18  6:29 ` [PATCH 0/2] Submit split " Christoph Hellwig
  2 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-17 19:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Jan Kara,
	Bart Van Assche, Ming Lei, Damien Le Moal, Johannes Thumshirn

Prepare for adding a direct call to bio_nr_segments().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-merge.c | 24 ++++++++++++++----------
 block/blk.h       |  1 +
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index b80c3e650588..2e07f6bd96be 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -408,22 +408,20 @@ struct bio *bio_split_to_limits(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_split_to_limits);
 
-unsigned int blk_recalc_rq_segments(struct request *rq)
+unsigned int bio_nr_segments(const struct queue_limits *lim, struct bio *bio)
 {
 	unsigned int nr_phys_segs = 0;
 	unsigned int bytes = 0;
-	struct req_iterator iter;
+	struct bvec_iter iter;
 	struct bio_vec bv;
 
-	if (!rq->bio)
+	if (!bio)
 		return 0;
 
-	switch (bio_op(rq->bio)) {
+	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		if (queue_max_discard_segments(rq->q) > 1) {
-			struct bio *bio = rq->bio;
-
+		if (lim->max_discard_segments > 1) {
 			for_each_bio(bio)
 				nr_phys_segs++;
 			return nr_phys_segs;
@@ -435,12 +433,18 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
 		break;
 	}
 
-	rq_for_each_bvec(bv, rq, iter)
-		bvec_split_segs(&rq->q->limits, &bv, &nr_phys_segs, &bytes,
-				UINT_MAX, UINT_MAX);
+	for_each_bio(bio)
+		bio_for_each_bvec(bv, bio, iter)
+			bvec_split_segs(lim, &bv, &nr_phys_segs, &bytes,
+					UINT_MAX, UINT_MAX);
 	return nr_phys_segs;
 }
 
+unsigned int blk_recalc_rq_segments(struct request *rq)
+{
+	return bio_nr_segments(&rq->q->limits, rq->bio);
+}
+
 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
diff --git a/block/blk.h b/block/blk.h
index d65d96994a94..9686ee808bab 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -330,6 +330,7 @@ int ll_back_merge_fn(struct request *req, struct bio *bio,
 		unsigned int nr_segs);
 bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 				struct request *next);
+unsigned int bio_nr_segments(const struct queue_limits *lim, struct bio *bio);
 unsigned int blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
 bool blk_rq_merge_ok(struct request *rq, struct bio *bio);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 19:59 [PATCH 0/2] Submit split bios in LBA order Bart Van Assche
  2023-03-17 19:59 ` [PATCH 1/2] block: Split blk_recalc_rq_segments() Bart Van Assche
@ 2023-03-17 19:59 ` Bart Van Assche
  2023-03-17 22:28   ` Jan Kara
                     ` (2 more replies)
  2023-03-18  6:29 ` [PATCH 0/2] Submit split " Christoph Hellwig
  2 siblings, 3 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-03-17 19:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Jan Kara,
	Bart Van Assche, Ming Lei, Damien Le Moal, Johannes Thumshirn

Instead of submitting the bio fragment with the highest LBA first,
submit the bio fragment with the lowest LBA first. If plugging is
active, append requests at the end of the list with plugged requests
instead of at the start. This approach prevents write errors when
submitting large bios to host-managed zoned block devices.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-merge.c      | 9 +++++----
 block/blk-mq.c         | 7 +++++--
 include/linux/blk-mq.h | 6 ++++++
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2e07f6bd96be..6031021d7ac0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -344,8 +344,8 @@ static struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
  * @nr_segs: returns the number of segments in the returned bio
  *
  * Check if @bio needs splitting based on the queue limits, and if so split off
- * a bio fitting the limits from the beginning of @bio and return it.  @bio is
- * shortened to the remainder and re-submitted.
+ * a bio fitting the limits from the beginning of @bio and submit it.  @bio is
+ * shortened to the remainder and returned.
  *
  * The split bio is allocated from @q->bio_split, which is provided by the
  * block layer.
@@ -380,8 +380,9 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 		blkcg_bio_issue_init(split);
 		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
-		submit_bio_noacct(bio);
-		return split;
+		submit_bio_noacct(split);
+		*nr_segs = bio_nr_segments(lim, bio);
+		return bio;
 	}
 	return bio;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cc32ad0cd548..9b0f9f3fdba0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1300,7 +1300,7 @@ static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
 
 static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 {
-	struct request *last = rq_list_peek(&plug->mq_list);
+	struct request *last = rq_list_peek(&plug->mq_list), **last_p;
 
 	if (!plug->rq_count) {
 		trace_block_plug(rq->q);
@@ -1317,7 +1317,10 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
 		plug->has_elevator = true;
 	rq->rq_next = NULL;
-	rq_list_add(&plug->mq_list, rq);
+	last_p = &plug->mq_list;
+	while (*last_p)
+		last_p = &(*last_p)->rq_next;
+	rq_list_add_tail(&last_p, rq);
 	plug->rq_count++;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dd5ce1137f04..5e01791967c0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -228,6 +228,12 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	*(listptr) = rq;				\
 } while (0)
 
+#define rq_list_add_tail(lastpptr, rq) do {		\
+	(rq)->rq_next = NULL;				\
+	**(lastpptr) = rq;				\
+	*(lastpptr) = &rq->rq_next;			\
+} while (0)
+
 #define rq_list_pop(listptr)				\
 ({							\
 	struct request *__req = NULL;			\

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 19:59 ` [PATCH 2/2] block: Split and submit bios in LBA order Bart Van Assche
@ 2023-03-17 22:28   ` Jan Kara
  2023-03-18  6:33     ` Christoph Hellwig
  2023-03-17 23:38   ` Ming Lei
  2023-03-18  6:42   ` Christoph Hellwig
  2 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2023-03-17 22:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Ming Lei, Damien Le Moal, Johannes Thumshirn

On Fri 17-03-23 12:59:38, Bart Van Assche wrote:
> Instead of submitting the bio fragment with the highest LBA first,
> submit the bio fragment with the lowest LBA first. If plugging is
> active, append requests at the end of the list with plugged requests
> instead of at the start. This approach prevents write errors when
> submitting large bios to host-managed zoned block devices.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

...

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index cc32ad0cd548..9b0f9f3fdba0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1300,7 +1300,7 @@ static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
>  
>  static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
>  {
> -	struct request *last = rq_list_peek(&plug->mq_list);
> +	struct request *last = rq_list_peek(&plug->mq_list), **last_p;
>  
>  	if (!plug->rq_count) {
>  		trace_block_plug(rq->q);
> @@ -1317,7 +1317,10 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
>  	if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
>  		plug->has_elevator = true;
>  	rq->rq_next = NULL;
> -	rq_list_add(&plug->mq_list, rq);
> +	last_p = &plug->mq_list;
> +	while (*last_p)
> +		last_p = &(*last_p)->rq_next;
> +	rq_list_add_tail(&last_p, rq);
>  	plug->rq_count++;
>  }

Uh, I don't think we want to traverse the whole plug list each time we are
adding a request to it. We have just recently managed to avoid that at
least for single-device cases and apparently it was a win for fast devices
(see commit d38a9c04c0d5 ("block: only check previous entry for plug merge
attempt")).

If anything, I'd rather modify blk_mq_dispatch_plug_list() to add requests
to the queuelist in reverse order in this case, like I did it in
34e0a279a993 ("block: do not reverse request order when flushing plug
list") which is now in Jens' tree.

> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index dd5ce1137f04..5e01791967c0 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -228,6 +228,12 @@ static inline unsigned short req_get_ioprio(struct request *req)
>  	*(listptr) = rq;				\
>  } while (0)
>  
> +#define rq_list_add_tail(lastpptr, rq) do {		\
> +	(rq)->rq_next = NULL;				\
> +	**(lastpptr) = rq;				\
> +	*(lastpptr) = &rq->rq_next;			\
> +} while (0)
> +

And this function should be already in Jens's tree as part of commit
34e0a279a993 ("block: do not reverse request order when flushing plug
list").

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 19:59 ` [PATCH 2/2] block: Split and submit bios in LBA order Bart Van Assche
  2023-03-17 22:28   ` Jan Kara
@ 2023-03-17 23:38   ` Ming Lei
  2023-03-17 23:45     ` Bart Van Assche
  2023-03-18  6:42   ` Christoph Hellwig
  2 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2023-03-17 23:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Damien Le Moal, Johannes Thumshirn, ming.lei

On Fri, Mar 17, 2023 at 12:59:38PM -0700, Bart Van Assche wrote:
> Instead of submitting the bio fragment with the highest LBA first,
> submit the bio fragment with the lowest LBA first. If plugging is
> active, append requests at the end of the list with plugged requests
> instead of at the start. This approach prevents write errors when
> submitting large bios to host-managed zoned block devices.

zoned pages are added via bio_add_zone_append_page(), which is supposed
to not call into split code path, do you have such error log?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 23:38   ` Ming Lei
@ 2023-03-17 23:45     ` Bart Van Assche
  2023-03-20 23:28       ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-17 23:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Damien Le Moal, Johannes Thumshirn

On 3/17/23 16:38, Ming Lei wrote:
> On Fri, Mar 17, 2023 at 12:59:38PM -0700, Bart Van Assche wrote:
>> Instead of submitting the bio fragment with the highest LBA first,
>> submit the bio fragment with the lowest LBA first. If plugging is
>> active, append requests at the end of the list with plugged requests
>> instead of at the start. This approach prevents write errors when
>> submitting large bios to host-managed zoned block devices.
> 
> zoned pages are added via bio_add_zone_append_page(), which is supposed
> to not call into split code path, do you have such error log?

Hi Ming,

Thanks for having taken a look. This patch series is intended for 
REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.

Bart.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-17 19:59 [PATCH 0/2] Submit split bios in LBA order Bart Van Assche
  2023-03-17 19:59 ` [PATCH 1/2] block: Split blk_recalc_rq_segments() Bart Van Assche
  2023-03-17 19:59 ` [PATCH 2/2] block: Split and submit bios in LBA order Bart Van Assche
@ 2023-03-18  6:29 ` Christoph Hellwig
  2023-03-20 17:22   ` Bart Van Assche
  2 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-18  6:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig, Jan Kara

On Fri, Mar 17, 2023 at 12:59:36PM -0700, Bart Van Assche wrote:
> Hi Jens,
> 
> For zoned storage it is essential that split bios are submitted in LBA order.
> This patch series realizes this by modifying __bio_split_to_limits() such that
> it submits the first bio fragment and returns the remainder instead of
> submitting the remainder and returning the first bio fragment. Please consider
> this patch series for the next merge window.

Why are you sending large writes using REQ_OP_WRITE and not
using REQ_OP_ZONE_APPEND which side steps all these issues?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 22:28   ` Jan Kara
@ 2023-03-18  6:33     ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-18  6:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Ming Lei, Damien Le Moal, Johannes Thumshirn

On Fri, Mar 17, 2023 at 11:28:13PM +0100, Jan Kara wrote:
> > -	rq_list_add(&plug->mq_list, rq);
> > +	last_p = &plug->mq_list;
> > +	while (*last_p)
> > +		last_p = &(*last_p)->rq_next;
> > +	rq_list_add_tail(&last_p, rq);
> >  	plug->rq_count++;
> >  }
> 
> Uh, I don't think we want to traverse the whole plug list each time we are
> adding a request to it. We have just recently managed to avoid that at
> least for single-device cases and apparently it was a win for fast devices
> (see commit d38a9c04c0d5 ("block: only check previous entry for plug merge
> attempt")).

REQ_OP_WRITE request for zoned devices are never added to the plug
list, so all of this actually is dead (and probably untested?) code.

See blk_mq_plug() and bdev_op_is_zoned_write().

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] block: Split blk_recalc_rq_segments()
  2023-03-17 19:59 ` [PATCH 1/2] block: Split blk_recalc_rq_segments() Bart Van Assche
@ 2023-03-18  6:38   ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-18  6:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Ming Lei, Damien Le Moal, Johannes Thumshirn

On Fri, Mar 17, 2023 at 12:59:37PM -0700, Bart Van Assche wrote:
> +unsigned int bio_nr_segments(const struct queue_limits *lim, struct bio *bio)

The name is wrong, as it operates not on a single bio, but
rather on a chain of bios.  That being said it seems like your caller
in the next patch only cares about the regulad read/write bio case,
which is kust this:

> +	for_each_bio(bio)
> +		bio_for_each_bvec(bv, bio, iter)
> +			bvec_split_segs(lim, &bv, &nr_phys_segs, &bytes,
> +					UINT_MAX, UINT_MAX);

So maybe split that into a separat patch.

Also please pass the queue_limit after the bio like the other
functions in this file.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 19:59 ` [PATCH 2/2] block: Split and submit bios in LBA order Bart Van Assche
  2023-03-17 22:28   ` Jan Kara
  2023-03-17 23:38   ` Ming Lei
@ 2023-03-18  6:42   ` Christoph Hellwig
  2 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-18  6:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Ming Lei, Damien Le Moal, Johannes Thumshirn

On Fri, Mar 17, 2023 at 12:59:38PM -0700, Bart Van Assche wrote:
> @@ -380,8 +380,9 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>  		blkcg_bio_issue_init(split);
>  		bio_chain(split, bio);
>  		trace_block_split(split, bio->bi_iter.bi_sector);
> -		submit_bio_noacct(bio);
> -		return split;
> +		submit_bio_noacct(split);
> +		*nr_segs = bio_nr_segments(lim, bio);

Please add this recalculation into the individual bio_split_*
functions, as the recalculation is only needed for regular read/write
bios.  It might also be slightly cheaper to just continue the segment
calculation there in the existing loop instead of starting a new
one after the split.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-18  6:29 ` [PATCH 0/2] Submit split " Christoph Hellwig
@ 2023-03-20 17:22   ` Bart Van Assche
  2023-03-20 21:06     ` Khazhy Kumykov
  2023-03-23  8:27     ` Christoph Hellwig
  0 siblings, 2 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-03-20 17:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On 3/17/23 23:29, Christoph Hellwig wrote:
> On Fri, Mar 17, 2023 at 12:59:36PM -0700, Bart Van Assche wrote:
>> For zoned storage it is essential that split bios are submitted in LBA order.
>> This patch series realizes this by modifying __bio_split_to_limits() such that
>> it submits the first bio fragment and returns the remainder instead of
>> submitting the remainder and returning the first bio fragment. Please consider
>> this patch series for the next merge window.
> 
> Why are you sending large writes using REQ_OP_WRITE and not
> using REQ_OP_ZONE_APPEND which side steps all these issues?

Hi Christoph,

How to achieve optimal performance with REQ_OP_ZONE_APPEND for SCSI 
devices? My understanding of how REQ_OP_ZONE_APPEND works for SCSI 
devices is as follows:
* ATA devices cannot support this operation directly because there are
   not enough bits in the ATA sense data to report where appended data
   has been written.
* T10 has not yet started with standardizing a zone append operation.
* The code that emulates REQ_OP_ZONE_APPEND for SCSI devices (in
   sd_zbc.c) serializes REQ_OP_ZONE_APPEND operations (QD=1).
* To achieve optimal performance, QD > 1 is required.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-20 17:22   ` Bart Van Assche
@ 2023-03-20 21:06     ` Khazhy Kumykov
  2023-03-23  8:27     ` Christoph Hellwig
  1 sibling, 0 replies; 43+ messages in thread
From: Khazhy Kumykov @ 2023-03-20 21:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On Mon, Mar 20, 2023 at 10:28 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 3/17/23 23:29, Christoph Hellwig wrote:
> > On Fri, Mar 17, 2023 at 12:59:36PM -0700, Bart Van Assche wrote:
> >> For zoned storage it is essential that split bios are submitted in LBA order.
> >> This patch series realizes this by modifying __bio_split_to_limits() such that
> >> it submits the first bio fragment and returns the remainder instead of
> >> submitting the remainder and returning the first bio fragment. Please consider
> >> this patch series for the next merge window.
> >
> > Why are you sending large writes using REQ_OP_WRITE and not
> > using REQ_OP_ZONE_APPEND which side steps all these issues?
>
> Hi Christoph,
>
> How to achieve optimal performance with REQ_OP_ZONE_APPEND for SCSI
> devices? My understanding of how REQ_OP_ZONE_APPEND works for SCSI
> devices is as follows:
> * ATA devices cannot support this operation directly because there are
>    not enough bits in the ATA sense data to report where appended data
>    has been written.
> * T10 has not yet started with standardizing a zone append operation.
> * The code that emulates REQ_OP_ZONE_APPEND for SCSI devices (in
>    sd_zbc.c) serializes REQ_OP_ZONE_APPEND operations (QD=1).
> * To achieve optimal performance, QD > 1 is required.
I recall there were dragons lurking particularly with how we handle
requeues wherein just submitting in order was not sufficient to
guarantee IO is actually dispatched in order. (of note: when
requeueing a request, we splice it to the _end_ of the hctx dispatch
list, so if you get a requeue in the middle of a multi-segment IO, it
will get re-ordered. I recall this change went in specifically to
re-order requests in case there was a passthrough lurking to un-jam a
device.) Have you looked at this? Perhaps requeues are slowpath
anyways, so we could sort there? There may also be other requeue
weirdness with layered devices...

Khazhy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-17 23:45     ` Bart Van Assche
@ 2023-03-20 23:28       ` Ming Lei
  2023-03-20 23:32         ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2023-03-20 23:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Damien Le Moal, Johannes Thumshirn, ming.lei

On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
> On 3/17/23 16:38, Ming Lei wrote:
> > On Fri, Mar 17, 2023 at 12:59:38PM -0700, Bart Van Assche wrote:
> > > Instead of submitting the bio fragment with the highest LBA first,
> > > submit the bio fragment with the lowest LBA first. If plugging is
> > > active, append requests at the end of the list with plugged requests
> > > instead of at the start. This approach prevents write errors when
> > > submitting large bios to host-managed zoned block devices.
> > 
> > zoned pages are added via bio_add_zone_append_page(), which is supposed
> > to not call into split code path, do you have such error log?
> 
> Hi Ming,
> 
> Thanks for having taken a look. This patch series is intended for
> REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.

But you are talking about host-managed zoned device, and the write
should have to be zone append, right?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-20 23:28       ` Ming Lei
@ 2023-03-20 23:32         ` Bart Van Assche
  2023-03-21  0:44           ` Ming Lei
  2023-03-21  5:55           ` Christoph Hellwig
  0 siblings, 2 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-03-20 23:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Damien Le Moal, Johannes Thumshirn

On 3/20/23 16:28, Ming Lei wrote:
> On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
>> Thanks for having taken a look. This patch series is intended for
>> REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.
> 
> But you are talking about host-managed zoned device, and the write
> should have to be zone append, right?

Hi Ming,

The use case I'm looking at is Android devices with UFS storage. UFS is 
based on SCSI and hence only REQ_OP_WRITE is supported natively. There 
is a REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it 
restricts the queue depth to one.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-20 23:32         ` Bart Van Assche
@ 2023-03-21  0:44           ` Ming Lei
  2023-03-21  1:46             ` Damien Le Moal
  2023-03-21  5:55           ` Christoph Hellwig
  1 sibling, 1 reply; 43+ messages in thread
From: Ming Lei @ 2023-03-21  0:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Damien Le Moal, Johannes Thumshirn, ming.lei

On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
> On 3/20/23 16:28, Ming Lei wrote:
> > On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
> > > Thanks for having taken a look. This patch series is intended for
> > > REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.
> > 
> > But you are talking about host-managed zoned device, and the write
> > should have to be zone append, right?
> 
> Hi Ming,
> 
> The use case I'm looking at is Android devices with UFS storage. UFS is
> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a
> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the
> queue depth to one.

But is this UFS one host-managed zoned device? If yes, this "REQ_OP_WRITE"
still should have been handled as REQ_OP_ZONE_APPEND? Otherwise, I don't
think it is host-managed, and your patch isn't needed too.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  0:44           ` Ming Lei
@ 2023-03-21  1:46             ` Damien Le Moal
  2023-03-21  2:17               ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-21  1:46 UTC (permalink / raw)
  To: Ming Lei, Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Jan Kara, Johannes Thumshirn

On 3/21/23 09:44, Ming Lei wrote:
> On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
>> On 3/20/23 16:28, Ming Lei wrote:
>>> On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
>>>> Thanks for having taken a look. This patch series is intended for
>>>> REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.
>>>
>>> But you are talking about host-managed zoned device, and the write
>>> should have to be zone append, right?
>>
>> Hi Ming,
>>
>> The use case I'm looking at is Android devices with UFS storage. UFS is
>> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a
>> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the
>> queue depth to one.
> 
> But is this UFS one host-managed zoned device? If yes, this "REQ_OP_WRITE"
> still should have been handled as REQ_OP_ZONE_APPEND? Otherwise, I don't
> think it is host-managed, and your patch isn't needed too.

Ming,

Both regular writes and zone append writes are supported by host managed
devices. For ZNS, zone append write is natively supported as a different
command. For SCSI & ATA (and UFS) devices, zone append write is emulated in the
sd driver using the regular write command because the SCSI and ATA standards do
not define a zone append write command.

For BIO splitting, splitting a regular write is fine as the resulting fragments
are sequential writes, so all fine. But zone append splitting is not allowed as
the actual written sector is returned by the device (or the sd emulation) to the
block layer through the bio iter sector. So if a zone append is split and its
fragments reordered, we can end up with a the wrong written sector and mangled
data when considering the original large bio that was split.

> 
> 
> Thanks,
> Ming
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  1:46             ` Damien Le Moal
@ 2023-03-21  2:17               ` Ming Lei
  2023-03-21  3:24                 ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2023-03-21  2:17 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Jan Kara, Johannes Thumshirn, ming.lei

On Tue, Mar 21, 2023 at 10:46:30AM +0900, Damien Le Moal wrote:
> On 3/21/23 09:44, Ming Lei wrote:
> > On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
> >> On 3/20/23 16:28, Ming Lei wrote:
> >>> On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
> >>>> Thanks for having taken a look. This patch series is intended for
> >>>> REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.
> >>>
> >>> But you are talking about host-managed zoned device, and the write
> >>> should have to be zone append, right?
> >>
> >> Hi Ming,
> >>
> >> The use case I'm looking at is Android devices with UFS storage. UFS is
> >> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a
> >> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the
> >> queue depth to one.
> > 
> > But is this UFS one host-managed zoned device? If yes, this "REQ_OP_WRITE"
> > still should have been handled as REQ_OP_ZONE_APPEND? Otherwise, I don't
> > think it is host-managed, and your patch isn't needed too.
> 
> Ming,
> 
> Both regular writes and zone append writes are supported by host managed
> devices. For ZNS, zone append write is natively supported as a different
> command. For SCSI & ATA (and UFS) devices, zone append write is emulated in the
> sd driver using the regular write command because the SCSI and ATA standards do
> not define a zone append write command.

Thanks for the clarification.

> 
> For BIO splitting, splitting a regular write is fine as the resulting fragments
> are sequential writes, so all fine. But zone append splitting is not allowed as

The current bio split code may not make sequential write requests, and
looks Bart is trying to address it. Then looks scsi zdc emulation still
requires sequential writes aiming to same zone.

But I guess it is hard to maintain bio order, especially md/dm is
involved.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  2:17               ` Ming Lei
@ 2023-03-21  3:24                 ` Damien Le Moal
  2023-03-21  8:00                   ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-21  3:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Jan Kara, Johannes Thumshirn

On 3/21/23 11:17, Ming Lei wrote:
> On Tue, Mar 21, 2023 at 10:46:30AM +0900, Damien Le Moal wrote:
>> On 3/21/23 09:44, Ming Lei wrote:
>>> On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
>>>> On 3/20/23 16:28, Ming Lei wrote:
>>>>> On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
>>>>>> Thanks for having taken a look. This patch series is intended for
>>>>>> REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.
>>>>>
>>>>> But you are talking about host-managed zoned device, and the write
>>>>> should have to be zone append, right?
>>>>
>>>> Hi Ming,
>>>>
>>>> The use case I'm looking at is Android devices with UFS storage. UFS is
>>>> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a
>>>> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the
>>>> queue depth to one.
>>>
>>> But is this UFS one host-managed zoned device? If yes, this "REQ_OP_WRITE"
>>> still should have been handled as REQ_OP_ZONE_APPEND? Otherwise, I don't
>>> think it is host-managed, and your patch isn't needed too.
>>
>> Ming,
>>
>> Both regular writes and zone append writes are supported by host managed
>> devices. For ZNS, zone append write is natively supported as a different
>> command. For SCSI & ATA (and UFS) devices, zone append write is emulated in the
>> sd driver using the regular write command because the SCSI and ATA standards do
>> not define a zone append write command.
> 
> Thanks for the clarification.
> 
>>
>> For BIO splitting, splitting a regular write is fine as the resulting fragments
>> are sequential writes, so all fine. But zone append splitting is not allowed as
> 
> The current bio split code may not make sequential write requests, and
> looks Bart is trying to address it. Then looks scsi zdc emulation still
> requires sequential writes aiming to same zone.

Split does create sequential writes, always, but the processing order may not be
sequential in case of plugging. However, writes to sequential zones are never
plugs so reordering due to plugging does not affect zoned devices.

> 
> But I guess it is hard to maintain bio order, especially md/dm is
> involved.

It is not that hard once you get rid of plugging, which we did. So far, with
everything we support, we are not detecting any issues and we test weekly, every rc.

> 
> Thanks, 
> Ming
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-20 23:32         ` Bart Van Assche
  2023-03-21  0:44           ` Ming Lei
@ 2023-03-21  5:55           ` Christoph Hellwig
  2023-03-21 14:36             ` Bart Van Assche
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-21  5:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Jan Kara, Damien Le Moal, Johannes Thumshirn

On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
> The use case I'm looking at is Android devices with UFS storage. UFS is 
> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a 
> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the 
> queue depth to one.

The queue depth (per zone) is limited for regular writes to, for the
same reason that the zone append emulations limits them.  You seem
to be very aware of that too as you've tried various methods to lift
that limit, none of which seems to ultimatively work.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  3:24                 ` Damien Le Moal
@ 2023-03-21  8:00                   ` Ming Lei
  2023-03-21  8:51                     ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2023-03-21  8:00 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Jan Kara, Johannes Thumshirn, ming.lei

On Tue, Mar 21, 2023 at 12:24:51PM +0900, Damien Le Moal wrote:
> On 3/21/23 11:17, Ming Lei wrote:
> > On Tue, Mar 21, 2023 at 10:46:30AM +0900, Damien Le Moal wrote:
> >> On 3/21/23 09:44, Ming Lei wrote:
> >>> On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
> >>>> On 3/20/23 16:28, Ming Lei wrote:
> >>>>> On Fri, Mar 17, 2023 at 04:45:46PM -0700, Bart Van Assche wrote:
> >>>>>> Thanks for having taken a look. This patch series is intended for
> >>>>>> REQ_OP_WRITE requests and not for REQ_OP_ZONE_APPEND requests.
> >>>>>
> >>>>> But you are talking about host-managed zoned device, and the write
> >>>>> should have to be zone append, right?
> >>>>
> >>>> Hi Ming,
> >>>>
> >>>> The use case I'm looking at is Android devices with UFS storage. UFS is
> >>>> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a
> >>>> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the
> >>>> queue depth to one.
> >>>
> >>> But is this UFS one host-managed zoned device? If yes, this "REQ_OP_WRITE"
> >>> still should have been handled as REQ_OP_ZONE_APPEND? Otherwise, I don't
> >>> think it is host-managed, and your patch isn't needed too.
> >>
> >> Ming,
> >>
> >> Both regular writes and zone append writes are supported by host managed
> >> devices. For ZNS, zone append write is natively supported as a different
> >> command. For SCSI & ATA (and UFS) devices, zone append write is emulated in the
> >> sd driver using the regular write command because the SCSI and ATA standards do
> >> not define a zone append write command.
> > 
> > Thanks for the clarification.
> > 
> >>
> >> For BIO splitting, splitting a regular write is fine as the resulting fragments
> >> are sequential writes, so all fine. But zone append splitting is not allowed as
> > 
> > The current bio split code may not make sequential write requests, and
> > looks Bart is trying to address it. Then looks scsi zdc emulation still
> > requires sequential writes aiming to same zone.
> 
> Split does create sequential writes, always, but the processing order may not be
> sequential in case of plugging. However, writes to sequential zones are never
> plugs so reordering due to plugging does not affect zoned devices.

If it is true, I am wondering why Bart sent the patch of 'block: Split and submit
bios in LBA order'?

> 
> > 
> > But I guess it is hard to maintain bio order, especially md/dm is
> > involved.
> 
> It is not that hard once you get rid of plugging, which we did. So far, with
> everything we support, we are not detecting any issues and we test weekly, every rc.

I see, but I meant current->bio_list, see the following bio order issue:

https://lore.kernel.org/linux-block/1609233522-25837-1-git-send-email-dannyshih@synology.com/


Thanks,
Ming


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  8:00                   ` Ming Lei
@ 2023-03-21  8:51                     ` Damien Le Moal
  2023-03-21  9:09                       ` Christoph Hellwig
  0 siblings, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-21  8:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Jan Kara, Johannes Thumshirn

On 3/21/23 17:00, Ming Lei wrote:
>>> But I guess it is hard to maintain bio order, especially md/dm is
>>> involved.
>>
>> It is not that hard once you get rid of plugging, which we did. So far, with
>> everything we support, we are not detecting any issues and we test weekly, every rc.
> 
> I see, but I meant current->bio_list, see the following bio order issue:
> 
> https://lore.kernel.org/linux-block/1609233522-25837-1-git-send-email-dannyshih@synology.com/

Interesting. We have never seen such issue in practice with the device mappers
that support zoned devices. But it seems interesting to try to find a use case
that could trigger this. Will look into it, and if everything is fixed, it would
still be a good regression test for blocktest.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  8:51                     ` Damien Le Moal
@ 2023-03-21  9:09                       ` Christoph Hellwig
  2023-03-21  9:50                         ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-21  9:09 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ming Lei, Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Christoph Hellwig, Jan Kara, Johannes Thumshirn

On Tue, Mar 21, 2023 at 05:51:29PM +0900, Damien Le Moal wrote:
> Interesting. We have never seen such issue in practice with the device mappers
> that support zoned devices. But it seems interesting to try to find a use case
> that could trigger this. Will look into it, and if everything is fixed, it would
> still be a good regression test for blocktest.

I think it requires a non-zoneappend bio that is large enough to
require splitting and which goes through a remapper.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  9:09                       ` Christoph Hellwig
@ 2023-03-21  9:50                         ` Damien Le Moal
  0 siblings, 0 replies; 43+ messages in thread
From: Damien Le Moal @ 2023-03-21  9:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Bart Van Assche, Jens Axboe, linux-block, Jaegeuk Kim,
	Jan Kara, Johannes Thumshirn

On 3/21/23 18:09, Christoph Hellwig wrote:
> On Tue, Mar 21, 2023 at 05:51:29PM +0900, Damien Le Moal wrote:
>> Interesting. We have never seen such issue in practice with the device mappers
>> that support zoned devices. But it seems interesting to try to find a use case
>> that could trigger this. Will look into it, and if everything is fixed, it would
>> still be a good regression test for blocktest.
> 
> I think it requires a non-zoneappend bio that is large enough to
> require splitting and which goes through a remapper.

dm-crypt and dm-linear could be targets potentially affected. Will have a try.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21  5:55           ` Christoph Hellwig
@ 2023-03-21 14:36             ` Bart Van Assche
  2023-03-23  8:26               ` Christoph Hellwig
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-21 14:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Damien Le Moal, Johannes Thumshirn

On 3/20/23 22:55, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 04:32:57PM -0700, Bart Van Assche wrote:
>> The use case I'm looking at is Android devices with UFS storage. UFS is
>> based on SCSI and hence only REQ_OP_WRITE is supported natively. There is a
>> REQ_OP_ZONE_APPEND emulation in drivers/scsi/sd_zbc.c but it restricts the
>> queue depth to one.
> 
> The queue depth (per zone) is limited for regular writes to, for the
> same reason that the zone append emulations limits them.  You seem
> to be very aware of that too as you've tried various methods to lift
> that limit, none of which seems to ultimatively work.

Hi Christoph,

The UFSHCI specification is very clear about the requirement that UFS 
host controllers must process SCSI commands in order if host software 
sets one bit at a time in the UFSHCI 3.0 doorbell register: "For Task 
Management Requests and Transfer Requests, software may issue multiple 
commands at a time, and may issue new commands before previous commands 
have completed. When software sets the corresponding doorbell register, 
the Task Management Requests and Transfer Requests automatically get a 
time stamp with their issue time. The commands within a command list 
(Task Management List or Transfer Request List) shall be processed in
the order of their time stamps, starting from the oldest time stamp. In 
the case multiple commands from the same list have the same time stamp, 
they shall be processed in the order of their command list index,
starting from the lowest index."

Damien and Jens agree about introducing an additional hardware queue for 
preserving the order of zoned writes as one can see here: 
https://lore.kernel.org/linux-block/ed255a4a-a0da-a962-2da4-13321d0a75c5@kernel.dk/

In our tests pipelining zoned writes (REQ_OP_WRITE) works fine as long 
as the UFS error handler is not activated. After the UFS error handler 
has been scheduled and before the SCSI host state is changed into 
SHOST_RECOVERY, the UFS host controller driver responds with 
SCSI_MLQUEUE_HOST_BUSY. I'm still working on a solution for the 
reordering caused by this mechanism.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-21 14:36             ` Bart Van Assche
@ 2023-03-23  8:26               ` Christoph Hellwig
  2023-03-23 10:28                 ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-23  8:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Ming Lei, Jens Axboe, linux-block,
	Jaegeuk Kim, Jan Kara, Damien Le Moal, Johannes Thumshirn

On Tue, Mar 21, 2023 at 07:36:12AM -0700, Bart Van Assche wrote:
> The UFSHCI specification is very clear about the requirement that UFS host 
> controllers must process SCSI commands in order if host software sets one 
> bit at a time in the UFSHCI 3.0 doorbell register: "For Task Management 
> Requests and Transfer Requests, software may issue multiple commands at a 
> time, and may issue new commands before previous commands have completed. 
> When software sets the corresponding doorbell register, the Task Management 
> Requests and Transfer Requests automatically get a time stamp with their 
> issue time. The commands within a command list (Task Management List or 
> Transfer Request List) shall be processed in
> the order of their time stamps, starting from the oldest time stamp. In the 
> case multiple commands from the same list have the same time stamp, they 
> shall be processed in the order of their command list index,
> starting from the lowest index."

But we can't write Linux software just for UFS.  We have no sensible
ordering guarantee anywhere else.

> Damien and Jens agree about introducing an additional hardware queue for 
> preserving the order of zoned writes as one can see here: 
> https://lore.kernel.org/linux-block/ed255a4a-a0da-a962-2da4-13321d0a75c5@kernel.dk/
>
> In our tests pipelining zoned writes (REQ_OP_WRITE) works fine as long as 
> the UFS error handler is not activated. After the UFS error handler has 
> been scheduled and before the SCSI host state is changed into 
> SHOST_RECOVERY, the UFS host controller driver responds with 
> SCSI_MLQUEUE_HOST_BUSY. I'm still working on a solution for the reordering 
> caused by this mechanism.

We'll still need REQ_OP_ZONE_APPEND as the actual file system fast path
interface.  For a low-end device like UFS the sd.c emulation might be
able to take advantage of the above separate queue as an implementation
detail.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-20 17:22   ` Bart Van Assche
  2023-03-20 21:06     ` Khazhy Kumykov
@ 2023-03-23  8:27     ` Christoph Hellwig
  2023-03-24 17:05       ` Bart Van Assche
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-23  8:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On Mon, Mar 20, 2023 at 10:22:41AM -0700, Bart Van Assche wrote:
> How to achieve optimal performance with REQ_OP_ZONE_APPEND for SCSI 
> devices? My understanding of how REQ_OP_ZONE_APPEND works for SCSI devices 
> is as follows:
> * ATA devices cannot support this operation directly because there are
>   not enough bits in the ATA sense data to report where appended data
>   has been written.

ATA doesn't really have autosense in the SCSI way.  It could be handled
the same way that CDL completions are handled.  That is a complete
mess, and between CDL and Zone Append we'll probably eventually need
an extended FIS for SATA if we want to keep ATA alive.

> * T10 has not yet started with standardizing a zone append operation.

Time to get it started then!

> * The code that emulates REQ_OP_ZONE_APPEND for SCSI devices (in
>   sd_zbc.c) serializes REQ_OP_ZONE_APPEND operations (QD=1).

Because that's the only thing that actually works.

> * To achieve optimal performance, QD > 1 is required.

If you have something magic that works, this code is the place to take
advantage of it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-23  8:26               ` Christoph Hellwig
@ 2023-03-23 10:28                 ` Damien Le Moal
  2023-03-23 16:27                   ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-23 10:28 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/23/23 17:26, Christoph Hellwig wrote:
> On Tue, Mar 21, 2023 at 07:36:12AM -0700, Bart Van Assche wrote:
>> The UFSHCI specification is very clear about the requirement that UFS host 
>> controllers must process SCSI commands in order if host software sets one 
>> bit at a time in the UFSHCI 3.0 doorbell register: "For Task Management 
>> Requests and Transfer Requests, software may issue multiple commands at a 
>> time, and may issue new commands before previous commands have completed. 
>> When software sets the corresponding doorbell register, the Task Management 
>> Requests and Transfer Requests automatically get a time stamp with their 
>> issue time. The commands within a command list (Task Management List or 
>> Transfer Request List) shall be processed in
>> the order of their time stamps, starting from the oldest time stamp. In the 
>> case multiple commands from the same list have the same time stamp, they 
>> shall be processed in the order of their command list index,
>> starting from the lowest index."
> 
> But we can't write Linux software just for UFS.  We have no sensible
> ordering guarantee anywhere else.
> 
>> Damien and Jens agree about introducing an additional hardware queue for 
>> preserving the order of zoned writes as one can see here: 
>> https://lore.kernel.org/linux-block/ed255a4a-a0da-a962-2da4-13321d0a75c5@kernel.dk/
>>
>> In our tests pipelining zoned writes (REQ_OP_WRITE) works fine as long as 
>> the UFS error handler is not activated. After the UFS error handler has 
>> been scheduled and before the SCSI host state is changed into 
>> SHOST_RECOVERY, the UFS host controller driver responds with 
>> SCSI_MLQUEUE_HOST_BUSY. I'm still working on a solution for the reordering 
>> caused by this mechanism.
> 
> We'll still need REQ_OP_ZONE_APPEND as the actual file system fast path
> interface.  For a low-end device like UFS the sd.c emulation might be
> able to take advantage of the above separate queue as an implementation
> detail.

For the zone append emulation, the write locking is done by sd.c and the upper
layer does not restrict to one append per zone. So we actually could envision a
UFS version of the sd write locking calls that is optimized for the device
capabilities and we can keep a common upper layer (which is preferable in my
opinion).

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-23 10:28                 ` Damien Le Moal
@ 2023-03-23 16:27                   ` Bart Van Assche
  2023-03-23 22:53                     ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-23 16:27 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/23/23 03:28, Damien Le Moal wrote:
> For the zone append emulation, the write locking is done by sd.c and the upper
> layer does not restrict to one append per zone. So we actually could envision a
> UFS version of the sd write locking calls that is optimized for the device
> capabilities and we can keep a common upper layer (which is preferable in my
> opinion).

I see a blk_req_zone_write_trylock() call in 
sd_zbc_prepare_zone_append() and a blk_req_zone_write_unlock() call in 
sd_zbc_complete() for REQ_OP_ZONE_APPEND operations. Does this mean that 
the sd_zbc.c code restricts the queue depth to one per zone for 
REQ_OP_ZONE_APPEND operations?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-23 16:27                   ` Bart Van Assche
@ 2023-03-23 22:53                     ` Damien Le Moal
  2023-03-24 16:55                       ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-23 22:53 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/24/23 01:27, Bart Van Assche wrote:
> On 3/23/23 03:28, Damien Le Moal wrote:
>> For the zone append emulation, the write locking is done by sd.c and the upper
>> layer does not restrict to one append per zone. So we actually could envision a
>> UFS version of the sd write locking calls that is optimized for the device
>> capabilities and we can keep a common upper layer (which is preferable in my
>> opinion).
> 
> I see a blk_req_zone_write_trylock() call in 
> sd_zbc_prepare_zone_append() and a blk_req_zone_write_unlock() call in 
> sd_zbc_complete() for REQ_OP_ZONE_APPEND operations. Does this mean that 
> the sd_zbc.c code restricts the queue depth to one per zone for 
> REQ_OP_ZONE_APPEND operations?

Yes, since the append becomes a regular write and HBAs are often happy to
reorder these commands, even for SMR, we need the locking.

But if I understand your use case correctly, given that UFS gives guarantees on
the command dispatching order, you could probably relax this locking for zone
append requests. But you cannot for regular writes as the locking is up in the
block layer and needed to avoid block layer level reordering.

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-23 22:53                     ` Damien Le Moal
@ 2023-03-24 16:55                       ` Bart Van Assche
  2023-03-25  2:00                         ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-24 16:55 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/23/23 15:53, Damien Le Moal wrote:
> On 3/24/23 01:27, Bart Van Assche wrote:
>> On 3/23/23 03:28, Damien Le Moal wrote:
>>> For the zone append emulation, the write locking is done by sd.c and the upper
>>> layer does not restrict to one append per zone. So we actually could envision a
>>> UFS version of the sd write locking calls that is optimized for the device
>>> capabilities and we can keep a common upper layer (which is preferable in my
>>> opinion).
>>
>> I see a blk_req_zone_write_trylock() call in
>> sd_zbc_prepare_zone_append() and a blk_req_zone_write_unlock() call in
>> sd_zbc_complete() for REQ_OP_ZONE_APPEND operations. Does this mean that
>> the sd_zbc.c code restricts the queue depth to one per zone for
>> REQ_OP_ZONE_APPEND operations?
> 
> Yes, since the append becomes a regular write and HBAs are often happy to
> reorder these commands, even for SMR, we need the locking.
> 
> But if I understand your use case correctly, given that UFS gives guarantees on
> the command dispatching order, you could probably relax this locking for zone
> append requests. But you cannot for regular writes as the locking is up in the
> block layer and needed to avoid block layer level reordering.

Hi Damien,

I don't think that we can achieve QD > 1 even if we would switch to
REQ_OP_ZONE_APPEND. A SCSI LLD is allowed to respond with 
SCSI_MLQUEUE_HOST_BUSY if the SCSI core asks it to queue a command. This 
may lead to reordering and hence may cause UNALIGNED WRITE COMMAND 
errors. We want to avoid these errors.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-23  8:27     ` Christoph Hellwig
@ 2023-03-24 17:05       ` Bart Van Assche
  2023-03-25  2:15         ` Damien Le Moal
  2023-03-26 23:44         ` Christoph Hellwig
  0 siblings, 2 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-03-24 17:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On 3/23/23 01:27, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 10:22:41AM -0700, Bart Van Assche wrote:
>> * T10 has not yet started with standardizing a zone append operation.
> 
> Time to get it started then!

Hi Christoph,

If someone else wants to work on this that would be great. I do not plan 
to work on this because I do not expect that a SCSI zone append command 
would be standardized by the time we need it. Although there are 
references to T10 drafts in the UFS standard, since a few months JEDEC 
strongly prefers to refer to finalized external standards in its own 
standards. Hence, standardizing zoned storage for UFS would have to wait 
until T10 has published a standard that supports a zone append command. 
INCITS published ZBC-1 in 2016, two years after the first ZBC-1 draft 
was uploaded to the T10 servers. INCITS approved ZBC-2 this month, six 
years after the first ZBC-2 draft was uploaded to the T10 servers. 
Because of the long time it takes to complete new versions of T10 
standards we plan not to wait until T10 has standardized a zone append 
operation.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-24 16:55                       ` Bart Van Assche
@ 2023-03-25  2:00                         ` Damien Le Moal
  2023-03-25 16:31                           ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-25  2:00 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/25/23 01:55, Bart Van Assche wrote:
> On 3/23/23 15:53, Damien Le Moal wrote:
>> On 3/24/23 01:27, Bart Van Assche wrote:
>>> On 3/23/23 03:28, Damien Le Moal wrote:
>>>> For the zone append emulation, the write locking is done by sd.c and the upper
>>>> layer does not restrict to one append per zone. So we actually could envision a
>>>> UFS version of the sd write locking calls that is optimized for the device
>>>> capabilities and we can keep a common upper layer (which is preferable in my
>>>> opinion).
>>>
>>> I see a blk_req_zone_write_trylock() call in
>>> sd_zbc_prepare_zone_append() and a blk_req_zone_write_unlock() call in
>>> sd_zbc_complete() for REQ_OP_ZONE_APPEND operations. Does this mean that
>>> the sd_zbc.c code restricts the queue depth to one per zone for
>>> REQ_OP_ZONE_APPEND operations?
>>
>> Yes, since the append becomes a regular write and HBAs are often happy to
>> reorder these commands, even for SMR, we need the locking.
>>
>> But if I understand your use case correctly, given that UFS gives guarantees on
>> the command dispatching order, you could probably relax this locking for zone
>> append requests. But you cannot for regular writes as the locking is up in the
>> block layer and needed to avoid block layer level reordering.
> 
> Hi Damien,
> 
> I don't think that we can achieve QD > 1 even if we would switch to
> REQ_OP_ZONE_APPEND. A SCSI LLD is allowed to respond with 
> SCSI_MLQUEUE_HOST_BUSY if the SCSI core asks it to queue a command. This 
> may lead to reordering and hence may cause UNALIGNED WRITE COMMAND 
> errors. We want to avoid these errors.

The trick here could be to have the UFS LLD to unlock the target zone of a write
when the command is sent to the device, instead of when the command completes.
This way, the zone is still locked when there is a requeue and there is no
reordering. That could allow for write qd > 1 in the case of UFS. And this
method could actually work for regular writes too.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-24 17:05       ` Bart Van Assche
@ 2023-03-25  2:15         ` Damien Le Moal
  2023-03-26 23:42           ` Christoph Hellwig
  2023-03-26 23:44         ` Christoph Hellwig
  1 sibling, 1 reply; 43+ messages in thread
From: Damien Le Moal @ 2023-03-25  2:15 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On 3/25/23 02:05, Bart Van Assche wrote:
> On 3/23/23 01:27, Christoph Hellwig wrote:
>> On Mon, Mar 20, 2023 at 10:22:41AM -0700, Bart Van Assche wrote:
>>> * T10 has not yet started with standardizing a zone append operation.
>>
>> Time to get it started then!
> 
> Hi Christoph,
> 
> If someone else wants to work on this that would be great. I do not plan 
> to work on this because I do not expect that a SCSI zone append command 
> would be standardized by the time we need it. Although there are 
> references to T10 drafts in the UFS standard, since a few months JEDEC 
> strongly prefers to refer to finalized external standards in its own 
> standards. Hence, standardizing zoned storage for UFS would have to wait 
> until T10 has published a standard that supports a zone append command. 
> INCITS published ZBC-1 in 2016, two years after the first ZBC-1 draft 
> was uploaded to the T10 servers. INCITS approved ZBC-2 this month, six 
> years after the first ZBC-2 draft was uploaded to the T10 servers. 
> Because of the long time it takes to complete new versions of T10 
> standards we plan not to wait until T10 has standardized a zone append 
> operation.

Such standardization effort is likely to face a lot of headwind because defining
a zone append command for ATA (T13 ACS) is not possible with a single
self-contained command (as one cannot return the written sector using sense data
like with scsi). And when it comes to ZBC, keeping it in sync with ZAC is desired...

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-25  2:00                         ` Damien Le Moal
@ 2023-03-25 16:31                           ` Bart Van Assche
  2023-03-26  1:45                             ` Damien Le Moal
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-25 16:31 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/24/23 19:00, Damien Le Moal wrote:
> The trick here could be to have the UFS LLD to unlock the target zone of a write
> when the command is sent to the device, instead of when the command completes.
> This way, the zone is still locked when there is a requeue and there is no
> reordering. That could allow for write qd > 1 in the case of UFS. And this
> method could actually work for regular writes too.

Hi Damien,

Although the above sounds interesting to me, I think the following two 
scenarios are not handled by the above approach and can lead to reordering:
* The SCSI device reporting a unit attention.
* The SCSI device responding with the SCSI status "BUSY". The UFS 
standard explicitly allows this. From the UFS standard: "If the unit is 
not ready to accept a new command (e.g., still processing previous 
command) a STATUS response of BUSY will be returned."

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-25 16:31                           ` Bart Van Assche
@ 2023-03-26  1:45                             ` Damien Le Moal
  2023-03-26 23:45                               ` Christoph Hellwig
  2023-03-27 21:20                               ` Bart Van Assche
  0 siblings, 2 replies; 43+ messages in thread
From: Damien Le Moal @ 2023-03-26  1:45 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/26/23 01:31, Bart Van Assche wrote:
> On 3/24/23 19:00, Damien Le Moal wrote:
>> The trick here could be to have the UFS LLD to unlock the target zone of a write
>> when the command is sent to the device, instead of when the command completes.
>> This way, the zone is still locked when there is a requeue and there is no
>> reordering. That could allow for write qd > 1 in the case of UFS. And this
>> method could actually work for regular writes too.
> 
> Hi Damien,
> 
> Although the above sounds interesting to me, I think the following two 
> scenarios are not handled by the above approach and can lead to reordering:
> * The SCSI device reporting a unit attention.
> * The SCSI device responding with the SCSI status "BUSY". The UFS 
> standard explicitly allows this. From the UFS standard: "If the unit is 
> not ready to accept a new command (e.g., still processing previous 
> command) a STATUS response of BUSY will be returned."

Yes, that likely would be an issue for regular writes, but likely not for zone
append emulation using regular writes though, since a "busy" return for a ZA
emulated regular write can be resent later with a different aligned write location.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-25  2:15         ` Damien Le Moal
@ 2023-03-26 23:42           ` Christoph Hellwig
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-26 23:42 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block,
	Jaegeuk Kim, Jan Kara

On Sat, Mar 25, 2023 at 11:15:40AM +0900, Damien Le Moal wrote:
> Such standardization effort is likely to face a lot of headwind because
> defining a zone append command for ATA (T13 ACS) is not possible with a
> single self-contained command (as one cannot return the written sector
> using sense data like with scsi).

The same was true for CDL and it got in anyway.  And yes, CDL on ATA
is a complete f**king mess, and needs to be fixed.  So ATA needs to byte
the bullet and extent the FIS anyway, so we might as well get started on
it ASAP.  Fortunately the only implementations that really matter now
are AHCI and SAS expanders, so it sounds very doable to get there.

> And when it comes to ZBC, keeping it in sync with ZAC is desired...

There is so many features in SCSI and not ATA, most notably protection
information that this sounds like a BS argument to me.  That being
said supporting Zone Append and properly doing CDL in ATA would be
very useful.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-24 17:05       ` Bart Van Assche
  2023-03-25  2:15         ` Damien Le Moal
@ 2023-03-26 23:44         ` Christoph Hellwig
  2023-04-06 20:32           ` Bart Van Assche
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-26 23:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On Fri, Mar 24, 2023 at 10:05:48AM -0700, Bart Van Assche wrote:
> If someone else wants to work on this that would be great. I do not plan to 
> work on this because I do not expect that a SCSI zone append command would 
> be standardized by the time we need it. Although there are references to 
> T10 drafts in the UFS standard, since a few months JEDEC strongly prefers 
> to refer to finalized external standards in its own standards. Hence, 
> standardizing zoned storage for UFS would have to wait until T10 has 
> published a standard that supports a zone append command. INCITS published 
> ZBC-1 in 2016, two years after the first ZBC-1 draft was uploaded to the 
> T10 servers. INCITS approved ZBC-2 this month, six years after the first 
> ZBC-2 draft was uploaded to the T10 servers. Because of the long time it 
> takes to complete new versions of T10 standards we plan not to wait until 
> T10 has standardized a zone append operation.

Which is why we need to start the work now.  Note that I don't think
your time frames matter too much - the first draft of zbc2 is where
people opened up the process again.  The more relevant time frame is
between getting the main new feature in and publusing, which is way
shorter.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-26  1:45                             ` Damien Le Moal
@ 2023-03-26 23:45                               ` Christoph Hellwig
  2023-03-27 21:06                                 ` Bart Van Assche
  2023-03-27 21:20                               ` Bart Van Assche
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-26 23:45 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Christoph Hellwig, Ming Lei, Jens Axboe,
	linux-block, Jaegeuk Kim, Jan Kara, Johannes Thumshirn

On Sun, Mar 26, 2023 at 10:45:08AM +0900, Damien Le Moal wrote:
> > Although the above sounds interesting to me, I think the following two 
> > scenarios are not handled by the above approach and can lead to reordering:
> > * The SCSI device reporting a unit attention.
> > * The SCSI device responding with the SCSI status "BUSY". The UFS 
> > standard explicitly allows this. From the UFS standard: "If the unit is 
> > not ready to accept a new command (e.g., still processing previous 
> > command) a STATUS response of BUSY will be returned."
> 
> Yes, that likely would be an issue for regular writes, but likely not for zone
> append emulation using regular writes though, since a "busy" return for a ZA
> emulated regular write can be resent later with a different aligned write location.

Exactly.  That's why ZONE_APPEND is the only really viable high-level
interface.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-26 23:45                               ` Christoph Hellwig
@ 2023-03-27 21:06                                 ` Bart Van Assche
  2023-03-27 23:43                                   ` Christoph Hellwig
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2023-03-27 21:06 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/26/23 16:45, Christoph Hellwig wrote:
> Exactly.  That's why ZONE_APPEND is the only really viable high-level
> interface.

Using REQ_OP_ZONE_APPEND may lead to reordering - data being written on 
the storage medium in another order than the order in which the 
REQ_OP_ZONE_APPEND commands were submitted. Hence, the number of extents 
for large files increases and performance when reading large files 
reduces. To me comparing the performance of these two approaches sounds 
like a good topic for a research paper. I'm not sure that 
REQ_OP_ZONE_APPEND is better for all zoned storage workloads than 
REQ_OP_WRITE.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-26  1:45                             ` Damien Le Moal
  2023-03-26 23:45                               ` Christoph Hellwig
@ 2023-03-27 21:20                               ` Bart Van Assche
  1 sibling, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-03-27 21:20 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara,
	Johannes Thumshirn

On 3/25/23 18:45, Damien Le Moal wrote:
> On 3/26/23 01:31, Bart Van Assche wrote:
>> Although the above sounds interesting to me, I think the following two
>> scenarios are not handled by the above approach and can lead to reordering:
>> * The SCSI device reporting a unit attention.
>> * The SCSI device responding with the SCSI status "BUSY". The UFS
>> standard explicitly allows this. From the UFS standard: "If the unit is
>> not ready to accept a new command (e.g., still processing previous
>> command) a STATUS response of BUSY will be returned."
> 
> Yes, that likely would be an issue for regular writes, but likely not for zone
> append emulation using regular writes though, since a "busy" return for a ZA
> emulated regular write can be resent later with a different aligned write location.

Hi Damien,

If a SCSI device responds with status "BUSY" or a unit attention for a 
zone append operation, all pending REQ_OP_ZONE_APPEND operations that 
have already been translated into WRITE commands will fail because of a 
write pointer mismatch. I'm wondering whether the following would be 
sufficient to support multiple pending REQ_OP_ZONE_APPEND operations 
when translating these into SCSI WRITE commands:
* If a zoned write operation is not executed (SAM_STAT_BUSY,
   SAM_STAT_CHECK_CONDITION, ...), change the host state to
   SHOST_RECOVERY to prevent that pending zone append operations fail
   because of a write pointer mismatch.
* After all pending SCSI commands have failed, completed or timed out,
   resubmit these commands. For REQ_OP_ZONE_APPEND, this may result in a
   new LBA being selected when translating these into WRITE commands.
* Retry commands that fail with UNALIGNED WRITE COMMAND until the retry
   count is exhausted.
* Increase the retry count for REQ_OP_ZONE_APPEND operations such that
   the above retry mechanism neither causes an infinite loop nor failures
   because the device responds with SAM_STAT_BUSY or a unit attention.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-27 21:06                                 ` Bart Van Assche
@ 2023-03-27 23:43                                   ` Christoph Hellwig
  2023-04-06 20:30                                     ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2023-03-27 23:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Damien Le Moal, Ming Lei, Jens Axboe,
	linux-block, Jaegeuk Kim, Jan Kara, Johannes Thumshirn

On Mon, Mar 27, 2023 at 02:06:09PM -0700, Bart Van Assche wrote:
> Using REQ_OP_ZONE_APPEND may lead to reordering - data being written on the 
> storage medium in another order than the order in which the 
> REQ_OP_ZONE_APPEND commands were submitted.

Only if that reordering actually happens.  Which we're very good at avoiding.
e.g. when looking at btrfs on ZNS drives we see that this reordering does
not actually happen to a quantifiable amount.

> Hence, the number of extents 
> for large files increases and performance when reading large files reduces. 
> To me comparing the performance of these two approaches sounds like a good 
> topic for a research paper. I'm not sure that REQ_OP_ZONE_APPEND is better 
> for all zoned storage workloads than REQ_OP_WRITE.

For REQ_OP_WRITE you absolutely must avoid reordering, so you need to
globally serialize.  If you can come up with a workload where your write
based approach is fast, please show it!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] block: Split and submit bios in LBA order
  2023-03-27 23:43                                   ` Christoph Hellwig
@ 2023-04-06 20:30                                     ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-04-06 20:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Ming Lei, Jens Axboe, linux-block, Jaegeuk Kim,
	Jan Kara, Johannes Thumshirn

On 3/27/23 16:43, Christoph Hellwig wrote:
> On Mon, Mar 27, 2023 at 02:06:09PM -0700, Bart Van Assche wrote:
>> Hence, the number of extents
>> for large files increases and performance when reading large files reduces.
>> To me comparing the performance of these two approaches sounds like a good
>> topic for a research paper. I'm not sure that REQ_OP_ZONE_APPEND is better
>> for all zoned storage workloads than REQ_OP_WRITE.
> 
> For REQ_OP_WRITE you absolutely must avoid reordering, so you need to
> globally serialize.  If you can come up with a workload where your write
> based approach is fast, please show it!

When using REQ_OP_ZONE_APPEND, a global lock or atomics are necessary in 
the space allocator to prevent attempts to write more data into a zone 
than what fits into a zone.

When using REQ_OP_WRITE, only serialization of the code that assigns 
LBAs is required. The writes themselves do not have to be serialized if 
these won't be reordered.

My point is that it is nontrivial to compare filesystem designs based on 
REQ_OP_WRITE versus REQ_OP_ZONE_APPEND and hence that zoned writes 
implemented with REQ_OP_WRITE should remain supported.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/2] Submit split bios in LBA order
  2023-03-26 23:44         ` Christoph Hellwig
@ 2023-04-06 20:32           ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2023-04-06 20:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jaegeuk Kim, Jan Kara

On 3/26/23 16:44, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 10:05:48AM -0700, Bart Van Assche wrote:
>> If someone else wants to work on this that would be great. I do not plan to
>> work on this because I do not expect that a SCSI zone append command would
>> be standardized by the time we need it. Although there are references to
>> T10 drafts in the UFS standard, since a few months JEDEC strongly prefers
>> to refer to finalized external standards in its own standards. Hence,
>> standardizing zoned storage for UFS would have to wait until T10 has
>> published a standard that supports a zone append command. INCITS published
>> ZBC-1 in 2016, two years after the first ZBC-1 draft was uploaded to the
>> T10 servers. INCITS approved ZBC-2 this month, six years after the first
>> ZBC-2 draft was uploaded to the T10 servers. Because of the long time it
>> takes to complete new versions of T10 standards we plan not to wait until
>> T10 has standardized a zone append operation.
> 
> Which is why we need to start the work now.  Note that I don't think
> your time frames matter too much - the first draft of zbc2 is where
> people opened up the process again.  The more relevant time frame is
> between getting the main new feature in and publusing, which is way
> shorter.

Hi Christoph,

If you help with the npo2 zone size patch series making progress towards 
being integrated in the upstream kernel I will help with the 
standardization of a write append command in the T10 ZBC standard.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2023-04-06 20:35 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-17 19:59 [PATCH 0/2] Submit split bios in LBA order Bart Van Assche
2023-03-17 19:59 ` [PATCH 1/2] block: Split blk_recalc_rq_segments() Bart Van Assche
2023-03-18  6:38   ` Christoph Hellwig
2023-03-17 19:59 ` [PATCH 2/2] block: Split and submit bios in LBA order Bart Van Assche
2023-03-17 22:28   ` Jan Kara
2023-03-18  6:33     ` Christoph Hellwig
2023-03-17 23:38   ` Ming Lei
2023-03-17 23:45     ` Bart Van Assche
2023-03-20 23:28       ` Ming Lei
2023-03-20 23:32         ` Bart Van Assche
2023-03-21  0:44           ` Ming Lei
2023-03-21  1:46             ` Damien Le Moal
2023-03-21  2:17               ` Ming Lei
2023-03-21  3:24                 ` Damien Le Moal
2023-03-21  8:00                   ` Ming Lei
2023-03-21  8:51                     ` Damien Le Moal
2023-03-21  9:09                       ` Christoph Hellwig
2023-03-21  9:50                         ` Damien Le Moal
2023-03-21  5:55           ` Christoph Hellwig
2023-03-21 14:36             ` Bart Van Assche
2023-03-23  8:26               ` Christoph Hellwig
2023-03-23 10:28                 ` Damien Le Moal
2023-03-23 16:27                   ` Bart Van Assche
2023-03-23 22:53                     ` Damien Le Moal
2023-03-24 16:55                       ` Bart Van Assche
2023-03-25  2:00                         ` Damien Le Moal
2023-03-25 16:31                           ` Bart Van Assche
2023-03-26  1:45                             ` Damien Le Moal
2023-03-26 23:45                               ` Christoph Hellwig
2023-03-27 21:06                                 ` Bart Van Assche
2023-03-27 23:43                                   ` Christoph Hellwig
2023-04-06 20:30                                     ` Bart Van Assche
2023-03-27 21:20                               ` Bart Van Assche
2023-03-18  6:42   ` Christoph Hellwig
2023-03-18  6:29 ` [PATCH 0/2] Submit split " Christoph Hellwig
2023-03-20 17:22   ` Bart Van Assche
2023-03-20 21:06     ` Khazhy Kumykov
2023-03-23  8:27     ` Christoph Hellwig
2023-03-24 17:05       ` Bart Van Assche
2023-03-25  2:15         ` Damien Le Moal
2023-03-26 23:42           ` Christoph Hellwig
2023-03-26 23:44         ` Christoph Hellwig
2023-04-06 20:32           ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).