All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices
@ 2020-04-09 16:53 Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When when writing to a zoned block device using zone append, the start
sector of the write is pointing at the start LBA of the zone to write to.
Upon completion the block device will respond with the position the data
has been placed in the zone. This from a high level perspective can be
seen like a file system's block allocator, where the user writes to a
file and the file-system takes care of the data placement on the device.

In order to fully exploit the new zone append command in file-systems and
other interfaces above the block layer, we choose to emulate zone append
in SCSI and null_blk. This way we can have a single write path for both
file-systems and other interfaces above the block-layer, like io_uring on
zoned block devices, without having to care too much about the underlying
characteristics of the device itself.

The emulation works by providing a cache of each zone's write pointer, so
zone append issued to the disk can be translated to a write with a
starting LBA of the write pointer. This LBA is used as input zone number
for the write pointer lookup in the zone write pointer offset cache and
the cached offset is then added to the LBA to get the actual position to
write the data. In SCSI we then turn the REQ_OP_ZONE_APPEND request into a
WRITE(16) command. Upon successful completion of the WRITE(16), the cache
will be updated to the new write pointer location and the written sector
will be noted in the request. On error the cache entry will be marked as
invalid and on the next write an update of the write pointer will be
scheduled, before issuing the actual write.

In order to reduce memory consumption, the only cached item is the offset
of the write pointer from the start of the zone, everything else can be
calculated. On an example drive with 52156 zones, the additional memory
consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
pages. The performance impact is neglectable for a spinning drive.

For null_blk the emulation is way simpler, as null_blk's zoned block
device emulation support already caches the write pointer position, so we
only need to report the position back to the upper layers. Additional
caching is not needed here.

Furthermore we have converted zonefs to run use ZONE_APPEND for synchronous
direct I/Os. Asynchronous I/O still uses the normal path via iomap.

The series is based on v5.6 final, but it should be trivial to re-base onto
Jens' for-next branch once it re-opened.

As Christoph asked for a branch I pushed it to a git repo at:
git://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git zone-append.v5
https://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git/log/?h=zone-append.v5

Changes to v4:
- Added page merging for zone-append bios (Christoph)
- Removed different locking schmes for zone management operations (Christoph)
- Changed wp_ofst assignment from blk_revalidate_zones (Christoph)
- Smaller nitpicks (Christoph)
- Documented my changes to Keith's patch so it's clear where I messed up so he
  doesn't get blamed
- Added Damien as a Co-developer to the sd emulation patch as he wrote as much
  code for it as I did (if not more)

Changes since v3:
- Remove impact of zone-append from bio_full() and bio_add_page()
  fast-path (Christoph)
- All of the zone write pointer offset caching is handled in SCSI now
  (Christoph) 
- Drop null_blk pathces that damien sent separately (Christoph)
- Use EXPORT_SYMBOL_GPL for new exports (Christoph)	

Changes since v2:
- Remove iomap implementation and directly issue zone-appends from within
  zonefs (Christoph)
- Drop already merged patch
- Rebase onto new for-next branch

Changes since v1:
- Too much to mention, treat as a completely new series.
  block: Modify revalidate zones
  null_blk: Support REQ_OP_ZONE_APPEND

Johannes Thumshirn (7):
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce blk_req_zone_write_trylock
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: export scsi_mq_free_sgtables
  scsi: sd_zbc: emulate ZONE_APPEND commands
  block: export bio_release_pages and bio_iov_iter_get_pages
  zonefs: use REQ_OP_ZONE_APPEND for sync DIO

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    |  72 ++++++-
 block/blk-core.c               |  52 +++++
 block/blk-mq.c                 |  27 +++
 block/blk-settings.c           |  23 ++
 block/blk-sysfs.c              |  13 ++
 block/blk-zoned.c              |  33 ++-
 drivers/block/null_blk_zoned.c |  39 +++-
 drivers/scsi/scsi_lib.c        |   8 +-
 drivers/scsi/sd.c              |  26 ++-
 drivers/scsi/sd.h              |  40 +++-
 drivers/scsi/sd_zbc.c          | 380 +++++++++++++++++++++++++++++++--
 fs/zonefs/super.c              |  80 ++++++-
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  32 ++-
 include/scsi/scsi_cmnd.h       |   1 +
 15 files changed, 783 insertions(+), 57 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig

blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with
CONFIG_BLK_DEV_ZONED disabled until now.

The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/blkdev.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..25b63f714619 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -729,6 +729,16 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 {
 	return 0;
 }
+static inline bool blk_queue_zone_is_seq(struct request_queue *q,
+					 sector_t sector)
+{
+	return false;
+}
+static inline unsigned int blk_queue_zone_no(struct request_queue *q,
+					     sector_t sector)
+{
+	return 0;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline bool rq_is_sync(struct request *rq)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-10  7:10   ` Christoph Hellwig
  2020-04-09 16:53 ` [PATCH v5 03/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

From: Keith Busch <kbusch@kernel.org>

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* The target zone must be a sequential write zone
* The BIO must not cross a zone boundary
* The BIO size must not be split to ensure that a single range of LBAs
  is written with a single command.

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid write append BIO
splitting, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size is always lower than this limit.
Export this new limit through sysfs and check these limits in bio_full().

Also when a LLDD can't dispatch a request to a specific zone, it
will return BLK_STS_ZONE_RESOURCE indicating this request needs to
be delayed, e.g.  because the zone it will be dispatched to is still
write-locked. If this happens set the request aside in a local list
to continue trying dispatching requests such as READ requests or a
WRITE/ZONE_APPEND requests targetting other zones. This way we can
still keep a high queue depth without starving other requests even if
one request can't be served due to zone write-locking.

Finally, make sure that the bio sector position indicates the actual
write position as indicated by the device on completion.

Signed-off-by: Keith Busch <kbusch@kernel.org>
[ jth: added zone-append specific add_page and merge_page helpers ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---
Changes to v4:
- fix page merging for zone-append bios
- remove unneeded variable
---
 block/bio.c               | 70 +++++++++++++++++++++++++++++++++++++--
 block/blk-core.c          | 52 +++++++++++++++++++++++++++++
 block/blk-mq.c            | 27 +++++++++++++++
 block/blk-settings.c      | 23 +++++++++++++
 block/blk-sysfs.c         | 13 ++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 include/linux/blk_types.h | 14 ++++++++
 include/linux/blkdev.h    | 11 ++++++
 8 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 94d697217887..689f31357d30 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,6 +679,54 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
+static bool bio_try_merge_zone_append_page(struct bio *bio, struct page *page,
+					   unsigned int len, unsigned int off,
+					   bool *same_page)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	struct bio_vec *bv;
+	unsigned long mask = queue_segment_boundary(q);
+	phys_addr_t addr1, addr2;
+
+	if (bio->bi_vcnt < 1)
+		return false;
+
+	bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+	addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
+	addr2 = page_to_phys(page) + off + len - 1;
+
+	if ((addr1 | mask) != (addr2 | mask))
+		return false;
+	if (bv->bv_len + len > queue_max_segment_size(q))
+		return false;
+	return __bio_try_merge_page(bio, page, len, off, same_page);
+}
+
+static int bio_add_append_page(struct bio *bio, struct page *page, unsigned len,
+			       size_t offset)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	bool same_page = false;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
+		return 0;
+
+	if (bio_try_merge_zone_append_page(bio, page, len, offset, &same_page))
+		return len;
+
+	if (bio->bi_vcnt >= queue_max_segments(q))
+		return 0;
+
+	__bio_add_page(bio, page, len, offset);
+
+	return len;
+}
+
 static inline bool page_is_mergeable(const struct bio_vec *bv,
 		struct page *page, unsigned int len, unsigned int off,
 		bool *same_page)
@@ -944,8 +992,22 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		struct page *page = pages[i];
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
-
-		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			int ret;
+
+			if (bio_try_merge_zone_append_page(bio, page, len,
+							   offset,
+							   &same_page)) {
+				if (same_page)
+					put_page(page);
+			} else {
+				ret = bio_add_append_page(bio, page, len,
+							  offset);
+				if (ret != len)
+					return -EINVAL;
+			}
+		} else if (__bio_try_merge_page(bio, page, len, offset,
+						&same_page)) {
 			if (same_page)
 				put_page(page);
 		} else {
@@ -1895,6 +1957,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc9552ef8d..57127092d816 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -240,6 +241,17 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	bio_advance(bio, nbytes);
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial zone append completions cannot be supported as the
+		 * BIO fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
+
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
@@ -865,6 +877,41 @@ static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -937,6 +984,11 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..ce60a071660f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1178,6 +1178,19 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
 
 #define BLK_MQ_RESOURCE_DELAY	3		/* ms units */
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1189,6 +1202,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1257,6 +1271,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive can
+			 * accept.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1268,6 +1292,9 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8eda2e7b91e..5388965841df 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -257,6 +259,25 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	unsigned int max_sectors;
+
+	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
+	if (max_sectors)
+		max_sectors = min_not_zero(q->limits.chunk_sectors,
+					   max_sectors);
+
+	q->limits.max_zone_append_sectors = max_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -506,6 +527,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@ static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..ea327f320b7f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1706,6 +1706,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..824ec2d89954 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone specific resource and IO to a different zone on the same device could
+ * still be served. Examples of that are zones that are write-locked, but a read
+ * to the same zone could be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -296,6 +308,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25b63f714619..36111b10d514 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -757,6 +758,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1088,6 +1092,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1301,6 +1307,11 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 03/10] block: introduce blk_req_zone_write_trylock
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 04/10] block: Modify revalidate zones Johannes Thumshirn
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig

Introduce blk_req_zone_write_trylock(), which either grabs the write-lock
for a sequential zone or returns false, if the zone is already locked.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-zoned.c      | 14 ++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 05741c6f618b..00b025b8b7c0 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -50,6 +50,20 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
+bool blk_req_zone_write_trylock(struct request *rq)
+{
+	unsigned int zno = blk_rq_zone_no(rq);
+
+	if (test_and_set_bit(zno, rq->q->seq_zones_wlock))
+		return false;
+
+	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
+	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock);
+
 void __blk_req_zone_write_lock(struct request *rq)
 {
 	if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36111b10d514..e591b22ace03 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1746,6 +1746,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 bool blk_req_needs_zone_write_lock(struct request *rq);
+bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
 void __blk_req_zone_write_unlock(struct request *rq);
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 04/10] block: Modify revalidate zones
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 03/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-10  0:27   ` Damien Le Moal
  2020-04-10  6:40   ` Christoph Hellwig
  2020-04-09 16:53 ` [PATCH v5 05/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Damien Le Moal,
	Johannes Thumshirn

From: Damien Le Moal <damien.lemoal@wdc.com>

Modify the interface of blk_revalidate_disk_zones() to add an optional
revalidation callback function that a driver can use to extend checks and
processing done during zone revalidation. The callback, if defined, is
executed time after all zones are inspected and with the queue frozen.
blk_revalidate_disk_zones() is renamed as __blk_revalidate_disk_zones()
and blk_revalidate_disk_zones() implemented as an inline function calling
__blk_revalidate_disk_zones() without no revalidation callback specified,
resulting in an unchanged behavior for all callers of
blk_revalidate_disk_zones().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c      | 19 ++++++++++++++-----
 include/linux/blkdev.h | 10 +++++++++-
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 00b025b8b7c0..6c37fec6859b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -437,20 +437,27 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 }
 
 /**
- * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
- * @disk:	Target disk
+ * __blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
+ * @disk:		Target disk
+ * @revalidate_cb:	LLD callback
+ * @revalidate_data:	LLD callback argument
  *
  * Helper function for low-level device drivers to (re) allocate and initialize
  * a disk request queue zone bitmaps. This functions should normally be called
  * within the disk ->revalidate method for blk-mq based drivers.  For BIO based
  * drivers only q->nr_zones needs to be updated so that the sysfs exposed value
  * is correct.
+ * If the @revalidate_cb callback function is not NULL, the callback will be
+ * executed with the device request queue frozen after all zones have been
+ * checked.
  */
-int blk_revalidate_disk_zones(struct gendisk *disk)
+int __blk_revalidate_disk_zones(struct gendisk *disk,
+				revalidate_zones_cb revalidate_cb,
+				void *revalidate_data)
 {
 	struct request_queue *q = disk->queue;
 	struct blk_revalidate_zone_args args = {
-		.disk		= disk,
+		.disk			= disk,
 	};
 	unsigned int noio_flag;
 	int ret;
@@ -480,6 +487,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		q->nr_zones = args.nr_zones;
 		swap(q->seq_zones_wlock, args.seq_zones_wlock);
 		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
+		if (revalidate_cb)
+			revalidate_cb(disk, revalidate_data);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
@@ -491,4 +500,4 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	kfree(args.conv_zones_bitmap);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
+EXPORT_SYMBOL_GPL(__blk_revalidate_disk_zones);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e591b22ace03..a730cacda0f7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -353,6 +353,8 @@ struct queue_limits {
 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
 			       void *data);
 
+typedef void (*revalidate_zones_cb)(struct gendisk *disk, void *data);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 
 #define BLK_ALL_ZONES  ((unsigned int)-1)
@@ -362,7 +364,13 @@ unsigned int blkdev_nr_zones(struct gendisk *disk);
 extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 			    sector_t sectors, sector_t nr_sectors,
 			    gfp_t gfp_mask);
-extern int blk_revalidate_disk_zones(struct gendisk *disk);
+int __blk_revalidate_disk_zones(struct gendisk *disk,
+				revalidate_zones_cb revalidate_cb,
+				void *revalidate_data);
+static inline int blk_revalidate_disk_zones(struct gendisk *disk)
+{
+	return __blk_revalidate_disk_zones(disk, NULL, NULL);
+}
 
 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 				     unsigned int cmd, unsigned long arg);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 05/10] scsi: sd_zbc: factor out sanity checks for zoned commands
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 04/10] block: Modify revalidate zones Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables Johannes Thumshirn
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn,
	Christoph Hellwig

Factor sanity checks for zoned commands from sd_zbc_setup_zone_mgmt_cmnd().

This will help with the introduction of an emulated ZONE_APPEND command.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd_zbc.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index f45c22b09726..ee156fbf3780 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -209,6 +209,26 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 	return ret;
 }
 
+static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+
+	if (!sd_is_zoned(sdkp))
+		/* Not a zoned device */
+		return BLK_STS_IOERR;
+
+	if (sdkp->device->changed)
+		return BLK_STS_IOERR;
+
+	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
+		/* Unaligned request */
+		return BLK_STS_IOERR;
+
+	return BLK_STS_OK;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -223,20 +243,14 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all)
 {
 	struct request *rq = cmd->request;
-	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t block = sectors_to_logical(sdkp->device, sector);
+	blk_status_t ret;
 
-	if (!sd_is_zoned(sdkp))
-		/* Not a zoned device */
-		return BLK_STS_IOERR;
-
-	if (sdkp->device->changed)
-		return BLK_STS_IOERR;
-
-	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
-		/* Unaligned request */
-		return BLK_STS_IOERR;
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
 
 	cmd->cmd_len = 16;
 	memset(cmd->cmnd, 0, cmd->cmd_len);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 05/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-10  5:58   ` Christoph Hellwig
  2020-04-10 14:22   ` Bart Van Assche
  2020-04-09 16:53 ` [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

scsi_mq_free_sgtables is used to free the sg_tables, uninitialize the
command and delete it from the command list.

Export this function so it can be used from modular code to free the
memory allocated by scsi_init_io() if the caller of scsi_init_io() needs
to do error recovery.

While we're at it, rename scsi_mq_free_sgtables() to scsi_free_sgtables().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---
Changes to v4:
- Export scsi_mq_free_sgtables() instead of scsi_mq_uninit_cmnd()
---
 drivers/scsi/scsi_lib.c  | 7 ++++---
 include/scsi/scsi_cmnd.h | 1 +
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ea327f320b7f..b6b3ccd366da 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -548,7 +548,7 @@ static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 	}
 }
 
-static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
+void scsi_free_sgtables(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
 		sg_free_table_chained(&cmd->sdb.table,
@@ -557,10 +557,11 @@ static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 		sg_free_table_chained(&cmd->prot_sdb->table,
 				SCSI_INLINE_PROT_SG_CNT);
 }
+EXPORT_SYMBOL_GPL(scsi_free_sgtables);
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
 {
-	scsi_mq_free_sgtables(cmd);
+	scsi_free_sgtables(cmd);
 	scsi_uninit_cmd(cmd);
 	scsi_del_cmd_from_list(cmd);
 }
@@ -1060,7 +1061,7 @@ blk_status_t scsi_init_io(struct scsi_cmnd *cmd)
 
 	return BLK_STS_OK;
 out_free_sgtables:
-	scsi_mq_free_sgtables(cmd);
+	scsi_free_sgtables(cmd);
 	return ret;
 }
 EXPORT_SYMBOL(scsi_init_io);
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index a2849bb9cd19..a6383ced6156 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -167,6 +167,7 @@ extern void *scsi_kmap_atomic_sg(struct scatterlist *sg, int sg_count,
 extern void scsi_kunmap_atomic_sg(void *virt);
 
 extern blk_status_t scsi_init_io(struct scsi_cmnd *cmd);
+extern void scsi_free_sgtables(struct scsi_cmnd *cmd);
 
 #ifdef CONFIG_SCSI_DMA
 extern int scsi_dma_map(struct scsi_cmnd *cmd);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-10  0:30   ` Damien Le Moal
                     ` (2 more replies)
  2020-04-09 16:53 ` [PATCH v5 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) with a
start LBA set to the target zone write pointer position.

In order to always know the write pointer position of a sequential write
zone, the queue flag QUEUE_FLAG_ZONE_WP_OFST is set to get an
initialized write pointer offset array attached to the device request
queue. The values of the cache are maintained in sync with the device
as follows:
1) the write pointer offset of a zone is reset to 0 when a
   REQ_OP_ZONE_RESET command completes.
2) the write pointer offset of a zone is set to the zone size when a
   REQ_OP_ZONE_FINISH command completes.
3) the write pointer offset of a zone is incremented by the number of
   512B sectors written when a write or a zone append command completes
4) the write pointer offset of all zones is reset to 0 when a
   REQ_OP_ZONE_RESET_ALL command completes.

Since the block layer does not write lock zones for zone append
commands, to ensure a sequential ordering of the write commands used for
the emulation, the target zone of a zone append command is locked when
the function sd_zbc_prepare_zone_append() is called from
sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
(e.g. a zone append is in-flight or a regular write has already locked
the zone), the zone append command dispatching is delayed by returning
BLK_STS_ZONE_RESOURCE.

To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL
requests, use a spinlock to protect accesses and modifications of the zone
write pointer offsets. This spinlock is initialized from sd_probe() using
the new function sd_zbc_init().

Co-developed-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---
Changes to v4:
- fix brace nit
- fix write-locking rules
- revalidate zone reworked
---
 drivers/scsi/sd.c     |  26 +++-
 drivers/scsi/sd.h     |  40 ++++-
 drivers/scsi/sd_zbc.c | 344 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 391 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 2710a0e5ae6d..05a7abd4295d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1206,6 +1206,14 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 		}
 	}
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
+		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
+		if (ret) {
+			scsi_free_sgtables(cmd);
+			return ret;
+		}
+	}
+
 	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
 	dix = scsi_prot_sg_count(cmd);
 	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
@@ -1287,6 +1295,7 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		return sd_setup_flush_cmnd(cmd);
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
+	case REQ_OP_ZONE_APPEND:
 		return sd_setup_read_write_cmnd(cmd);
 	case REQ_OP_ZONE_RESET:
 		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
@@ -2055,7 +2064,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
  out:
 	if (sd_is_zoned(sdkp))
-		sd_zbc_complete(SCpnt, good_bytes, &sshdr);
+		good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
 
 	SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
 					   "sd_done: completed %d of %d bytes\n",
@@ -3371,6 +3380,10 @@ static int sd_probe(struct device *dev)
 	sdkp->first_scan = 1;
 	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
 
+	error = sd_zbc_init_disk(sdkp);
+	if (error)
+		goto out_free_index;
+
 	sd_revalidate_disk(gd);
 
 	gd->flags = GENHD_FL_EXT_DEVT;
@@ -3408,6 +3421,7 @@ static int sd_probe(struct device *dev)
  out_put:
 	put_disk(gd);
  out_free:
+	sd_zbc_release_disk(sdkp);
 	kfree(sdkp);
  out:
 	scsi_autopm_put_device(sdp);
@@ -3484,6 +3498,8 @@ static void scsi_disk_release(struct device *dev)
 	put_disk(disk);
 	put_device(&sdkp->device->sdev_gendev);
 
+	sd_zbc_release_disk(sdkp);
+
 	kfree(sdkp);
 }
 
@@ -3664,19 +3680,19 @@ static int __init init_sd(void)
 	if (!sd_page_pool) {
 		printk(KERN_ERR "sd: can't init discard page pool\n");
 		err = -ENOMEM;
-		goto err_out_ppool;
+		goto err_out_cdb_pool;
 	}
 
 	err = scsi_register_driver(&sd_template.gendrv);
 	if (err)
-		goto err_out_driver;
+		goto err_out_ppool;
 
 	return 0;
 
-err_out_driver:
+err_out_ppool:
 	mempool_destroy(sd_page_pool);
 
-err_out_ppool:
+err_out_cdb_pool:
 	mempool_destroy(sd_cdb_pool);
 
 err_out_cache:
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 50fff0bf8c8e..f7a4d6fcac6b 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -79,6 +79,12 @@ struct scsi_disk {
 	u32		zones_optimal_open;
 	u32		zones_optimal_nonseq;
 	u32		zones_max_open;
+	u32		*zones_wp_ofst;
+	spinlock_t	zones_wp_ofst_lock;
+	u32		*rev_wp_ofst;
+	struct mutex	rev_mutex;
+	struct work_struct zone_wp_ofst_work;
+	char		*zone_wp_update_buf;
 #endif
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
@@ -207,17 +213,32 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
+int sd_zbc_init_disk(struct scsi_disk *sdkp);
+void sd_zbc_release_disk(struct scsi_disk *sdkp);
 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
 extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-			    struct scsi_sense_hdr *sshdr);
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+			     struct scsi_sense_hdr *sshdr);
 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+				        unsigned int nr_blocks);
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
+static inline int sd_zbc_init(void)
+{
+	return 0;
+}
+
+static inline void sd_zbc_exit(void) {}
+
+static inline void sd_zbc_init_disk(struct scsi_disk *sdkp) {}
+static inline void sd_zbc_release_disk(struct scsi_disk *sdkp) {}
+
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
 				    unsigned char *buf)
 {
@@ -233,9 +254,18 @@ static inline blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_TARGET;
 }
 
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-				   unsigned int good_bytes,
-				   struct scsi_sense_hdr *sshdr) {}
+static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
+			unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
+{
+	return 0;
+}
+
+static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
+						      sector_t *lba,
+						      unsigned int nr_blocks)
+{
+	return BLK_STS_TARGET;
+}
 
 #define sd_zbc_report_zones NULL
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index ee156fbf3780..53cfe998a3f6 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -11,6 +11,7 @@
 #include <linux/blkdev.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/mm.h>
+#include <linux/mutex.h>
 
 #include <asm/unaligned.h>
 
@@ -19,11 +20,36 @@
 
 #include "sd.h"
 
+static unsigned int sd_zbc_get_zone_wp_ofst(struct blk_zone *zone)
+{
+	if (zone->type == ZBC_ZONE_TYPE_CONV)
+		return 0;
+
+	switch (zone->cond) {
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		return zone->wp - zone->start;
+	case BLK_ZONE_COND_FULL:
+		return zone->len;
+	case BLK_ZONE_COND_EMPTY:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+	default:
+		/*
+		 * Offline and read-only zones do not have a valid
+		 * write pointer. Use 0 as for an empty zone.
+		 */
+		return 0;
+	}
+}
+
 static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 			       unsigned int idx, report_zones_cb cb, void *data)
 {
 	struct scsi_device *sdp = sdkp->device;
 	struct blk_zone zone = { 0 };
+	int ret;
 
 	zone.type = buf[0] & 0x0f;
 	zone.cond = (buf[1] >> 4) & 0xf;
@@ -39,7 +65,14 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 	    zone.cond == ZBC_ZONE_COND_FULL)
 		zone.wp = zone.start + zone.len;
 
-	return cb(&zone, idx, data);
+	ret = cb(&zone, idx, data);
+	if (ret)
+		return ret;
+
+	if (sdkp->rev_wp_ofst)
+		sdkp->rev_wp_ofst[idx] = sd_zbc_get_zone_wp_ofst(&zone);
+
+	return 0;
 }
 
 /**
@@ -229,6 +262,134 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
+#define SD_ZBC_INVALID_WP_OFST	(~0u)
+#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
+
+static int sd_zbc_update_wp_ofst_cb(struct blk_zone *zone, unsigned int idx,
+				    void *data)
+{
+	struct scsi_disk *sdkp = data;
+
+	lockdep_assert_held(&sdkp->zones_wp_ofst_lock);
+
+	sdkp->zones_wp_ofst[idx] = sd_zbc_get_zone_wp_ofst(zone);
+
+	return 0;
+}
+
+static void sd_zbc_update_wp_ofst_workfn(struct work_struct *work)
+{
+	struct scsi_disk *sdkp;
+	unsigned int zno;
+	int ret;
+
+	sdkp = container_of(work, struct scsi_disk, zone_wp_ofst_work);
+
+	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
+	for (zno = 0; zno < sdkp->nr_zones; zno++) {
+		if (sdkp->zones_wp_ofst[zno] != SD_ZBC_UPDATING_WP_OFST)
+			continue;
+
+		spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
+		ret = sd_zbc_do_report_zones(sdkp, sdkp->zone_wp_update_buf,
+					     SD_BUF_SIZE,
+					     zno * sdkp->zone_blocks, true);
+		spin_lock_bh(&sdkp->zones_wp_ofst_lock);
+		if (!ret)
+			sd_zbc_parse_report(sdkp, sdkp->zone_wp_update_buf + 64,
+					    zno, sd_zbc_update_wp_ofst_cb,
+					    sdkp);
+	}
+	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
+
+	scsi_device_put(sdkp->device);
+}
+
+static blk_status_t sd_zbc_update_wp_ofst(struct scsi_disk *sdkp,
+					  unsigned int zno)
+{
+	/*
+	 * We are about to schedule work to update a zone write pointer offset,
+	 * which will cause the zone append command to be requeued. So make
+	 * sure that the scsi device does not go away while the work is
+	 * being processed.
+	 */
+	if (scsi_device_get(sdkp->device))
+		return BLK_STS_IOERR;
+
+	sdkp->zones_wp_ofst[zno] = SD_ZBC_UPDATING_WP_OFST;
+
+	schedule_work(&sdkp->zone_wp_ofst_work);
+
+	return BLK_STS_RESOURCE;
+}
+
+/**
+ * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
+ * @cmd: the command to setup
+ * @lba: the LBA to patch
+ * @nr_blocks: the number of LBAs to be written
+ *
+ * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
+ * @sd_zbc_prepare_zone_append() handles the necessary zone wrote locking and
+ * patching of the lba for an emulated ZONE_APPEND command.
+ *
+ * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will
+ * schedule a REPORT ZONES command and return BLK_STS_IOERR.
+ */
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+					unsigned int nr_blocks)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	unsigned int wp_ofst, zno = blk_rq_zone_no(rq);
+	blk_status_t ret;
+
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
+
+	if (!blk_rq_zone_is_seq(rq))
+		return BLK_STS_IOERR;
+
+	/* Unlock of the write lock will happen in sd_zbc_complete() */
+	if (!blk_req_zone_write_trylock(rq))
+		return BLK_STS_ZONE_RESOURCE;
+
+	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
+
+	wp_ofst = sdkp->zones_wp_ofst[zno];
+	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
+		/* Write pointer offset update in progress: ask for a requeue */
+		ret = BLK_STS_RESOURCE;
+		goto err;
+	}
+
+	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
+		/* Invalid write pointer offset: trigger an update from disk */
+		ret = sd_zbc_update_wp_ofst(sdkp, zno);
+		goto err;
+	}
+
+	wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
+	if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
+		ret = BLK_STS_IOERR;
+		goto err;
+	}
+
+	/* Set the LBA for the write command used to emulate zone append */
+	*lba += wp_ofst;
+
+	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
+
+	return BLK_STS_OK;
+
+err:
+	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
+	blk_req_zone_write_unlock(rq);
+	return ret;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -269,16 +430,104 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_OK;
 }
 
+static bool sd_zbc_need_zone_wp_update(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_ZONE_APPEND:
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_RESET:
+	case REQ_OP_ZONE_RESET_ALL:
+		return true;
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+		return blk_rq_zone_is_seq(rq);
+	default:
+		return false;
+	}
+}
+
+/**
+ * sd_zbc_zone_wp_update - Update cached zone write pointer upon cmd completion
+ * @cmd: Completed command
+ * @good_bytes: Command reply bytes
+ *
+ * Called from sd_zbc_complete() to handle the update of the cached zone write
+ * pointer value in case an update is needed.
+ */
+static unsigned int sd_zbc_zone_wp_update(struct scsi_cmnd *cmd,
+					  unsigned int good_bytes)
+{
+	int result = cmd->result;
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	unsigned int zno = blk_rq_zone_no(rq);
+	enum req_opf op = req_op(rq);
+
+	/*
+	 * If we got an error for a command that needs updating the write
+	 * pointer offset cache, we must mark the zone wp offset entry as
+	 * invalid to force an update from disk the next time a zone append
+	 * command is issued.
+	 */
+	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
+
+	if (result && op != REQ_OP_ZONE_RESET_ALL) {
+		if (op == REQ_OP_ZONE_APPEND) {
+			/* Force complete completion (no retry) */
+			good_bytes = 0;
+			scsi_set_resid(cmd, blk_rq_bytes(rq));
+		}
+
+		/*
+		 * Force an update of the zone write pointer offset on
+		 * the next zone append access.
+		 */
+		if (sdkp->zones_wp_ofst[zno] != SD_ZBC_UPDATING_WP_OFST)
+			sdkp->zones_wp_ofst[zno] = SD_ZBC_INVALID_WP_OFST;
+		goto unlock_wp_ofst;
+	}
+
+	switch (op) {
+	case REQ_OP_ZONE_APPEND:
+		rq->__sector += sdkp->zones_wp_ofst[zno];
+		/* fallthrough */
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE:
+		if (sdkp->zones_wp_ofst[zno] < sd_zbc_zone_sectors(sdkp))
+			sdkp->zones_wp_ofst[zno] += good_bytes >> SECTOR_SHIFT;
+		break;
+	case REQ_OP_ZONE_RESET:
+		sdkp->zones_wp_ofst[zno] = 0;
+		break;
+	case REQ_OP_ZONE_FINISH:
+		sdkp->zones_wp_ofst[zno] = sd_zbc_zone_sectors(sdkp);
+		break;
+	case REQ_OP_ZONE_RESET_ALL:
+		memset(sdkp->zones_wp_ofst, 0,
+		       sdkp->nr_zones * sizeof(unsigned int));
+		break;
+	default:
+		break;
+	}
+
+unlock_wp_ofst:
+	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
+
+	return good_bytes;
+}
+
 /**
  * sd_zbc_complete - ZBC command post processing.
  * @cmd: Completed command
  * @good_bytes: Command reply bytes
  * @sshdr: command sense header
  *
- * Called from sd_done(). Process report zones reply and handle reset zone
- * and write commands errors.
+ * Called from sd_done() to handle zone commands errors and updates to the
+ * device queue zone write pointer offset cahce.
  */
-void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr)
 {
 	int result = cmd->result;
@@ -294,7 +543,18 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		 * so be quiet about the error.
 		 */
 		rq->rq_flags |= RQF_QUIET;
+		goto unlock_zone;
 	}
+
+	if (sd_zbc_need_zone_wp_update(rq))
+		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
+
+
+unlock_zone:
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		blk_req_zone_write_unlock(rq);
+
+	return good_bytes;
 }
 
 /**
@@ -396,11 +656,48 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf,
 	return 0;
 }
 
+static void sd_zbc_revalidate_zones_cb(struct gendisk *disk, void *data)
+{
+	struct scsi_disk *sdkp = scsi_disk(disk);
+
+	swap(sdkp->zones_wp_ofst, sdkp->rev_wp_ofst);
+}
+
+static int sd_zbc_revalidate_zones(struct scsi_disk *sdkp,
+				   unsigned int nr_zones)
+{
+	int ret;
+
+	/*
+	 * Make sure revalidate zones are serialized to ensure exclusive
+	 * updates of the temporary array sdkp->rev_wp_ofst.
+	 */
+	mutex_lock(&sdkp->rev_mutex);
+
+	sdkp->rev_wp_ofst = kvcalloc(nr_zones, sizeof(u32), GFP_NOIO);
+	if (!sdkp->rev_wp_ofst) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	ret = __blk_revalidate_disk_zones(sdkp->disk,
+					sd_zbc_revalidate_zones_cb, NULL);
+	kvfree(sdkp->rev_wp_ofst);
+	sdkp->rev_wp_ofst = NULL;
+
+unlock:
+	mutex_unlock(&sdkp->rev_mutex);
+
+	return ret;
+}
+
 int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 {
 	struct gendisk *disk = sdkp->disk;
+	struct request_queue *q = disk->queue;
 	unsigned int nr_zones;
 	u32 zone_blocks = 0;
+	u32 max_append;
 	int ret;
 
 	if (!sd_is_zoned(sdkp))
@@ -420,10 +717,14 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 	if (ret != 0)
 		goto err;
 
+	max_append = min_t(u32, logical_to_sectors(sdkp->device, zone_blocks),
+			   q->limits.max_segments << (PAGE_SHIFT - 9));
+	max_append = min_t(u32, max_append, queue_max_hw_sectors(q));
+
 	/* The drive satisfies the kernel restrictions: set it up */
-	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue);
-	blk_queue_required_elevator_features(sdkp->disk->queue,
-					     ELEVATOR_F_ZBD_SEQ_WRITE);
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
+	blk_queue_max_zone_append_sectors(q, max_append);
 	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
@@ -443,8 +744,8 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 
 	if (sdkp->zone_blocks != zone_blocks ||
 	    sdkp->nr_zones != nr_zones ||
-	    disk->queue->nr_zones != nr_zones) {
-		ret = blk_revalidate_disk_zones(disk);
+	    q->nr_zones != nr_zones) {
+		ret = sd_zbc_revalidate_zones(sdkp, nr_zones);
 		if (ret != 0)
 			goto err;
 		sdkp->zone_blocks = zone_blocks;
@@ -475,3 +776,28 @@ void sd_zbc_print_zones(struct scsi_disk *sdkp)
 			  sdkp->nr_zones,
 			  sdkp->zone_blocks);
 }
+
+int sd_zbc_init_disk(struct scsi_disk *sdkp)
+{
+	if (!sd_is_zoned(sdkp))
+		return 0;
+
+	sdkp->zones_wp_ofst = NULL;
+	spin_lock_init(&sdkp->zones_wp_ofst_lock);
+	sdkp->rev_wp_ofst = NULL;
+	mutex_init(&sdkp->rev_mutex);
+	INIT_WORK(&sdkp->zone_wp_ofst_work, sd_zbc_update_wp_ofst_workfn);
+	sdkp->zone_wp_update_buf = kzalloc(SD_BUF_SIZE, GFP_KERNEL);
+	if (!sdkp->zone_wp_update_buf)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void sd_zbc_release_disk(struct scsi_disk *sdkp)
+{
+	kvfree(sdkp->zones_wp_ofst);
+	sdkp->zones_wp_ofst = NULL;
+	kfree(sdkp->zone_wp_update_buf);
+	sdkp->zone_wp_update_buf = NULL;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 08/10] null_blk: Support REQ_OP_ZONE_APPEND
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
  9 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Damien Le Moal,
	Christoph Hellwig, Johannes Thumshirn

From: Damien Le Moal <damien.lemoal@wdc.com>

Support REQ_OP_ZONE_APPEND requests for null_blk devices with zoned
mode enabled. Use the internally tracked zone write pointer position
as the actual write position and return it using the command request
__sector field in the case of an mq device and using the command BIO
sector in the case of a BIO device.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/block/null_blk_zoned.c | 39 +++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index c60b19432a2e..b664be0bbb5e 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -67,13 +67,22 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 int null_register_zoned_dev(struct nullb *nullb)
 {
+	struct nullb_device *dev = nullb->dev;
 	struct request_queue *q = nullb->q;
 
-	if (queue_is_mq(q))
-		return blk_revalidate_disk_zones(nullb->disk);
+	if (queue_is_mq(q)) {
+		int ret = blk_revalidate_disk_zones(nullb->disk);
+
+		if (ret)
+			return ret;
+	} else {
+		blk_queue_chunk_sectors(q, dev->zone_size_sects);
+		q->nr_zones = blkdev_nr_zones(nullb->disk);
+	}
 
-	blk_queue_chunk_sectors(q, nullb->dev->zone_size_sects);
-	q->nr_zones = blkdev_nr_zones(nullb->disk);
+	blk_queue_max_zone_append_sectors(q,
+			min_t(sector_t, q->limits.max_hw_sectors,
+			      dev->zone_size_sects));
 
 	return 0;
 }
@@ -133,7 +142,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
 }
 
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-		     unsigned int nr_sectors)
+				    unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
@@ -151,9 +160,21 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 	case BLK_ZONE_COND_IMP_OPEN:
 	case BLK_ZONE_COND_EXP_OPEN:
 	case BLK_ZONE_COND_CLOSED:
-		/* Writes must be at the write pointer position */
-		if (sector != zone->wp)
+		/*
+		 * Regular writes must be at the write pointer position.
+		 * Zone append writes are automatically issued at the write
+		 * pointer and the position returned using the request or BIO
+		 * sector.
+		 */
+		if (append) {
+			sector = zone->wp;
+			if (cmd->bio)
+				cmd->bio->bi_iter.bi_sector = sector;
+			else
+				cmd->rq->__sector = sector;
+		} else if (sector != zone->wp) {
 			return BLK_STS_IOERR;
+		}
 
 		if (zone->cond != BLK_ZONE_COND_EXP_OPEN)
 			zone->cond = BLK_ZONE_COND_IMP_OPEN;
@@ -232,7 +253,9 @@ blk_status_t null_handle_zoned(struct nullb_cmd *cmd, enum req_opf op,
 {
 	switch (op) {
 	case REQ_OP_WRITE:
-		return null_zone_write(cmd, sector, nr_sectors);
+		return null_zone_write(cmd, sector, nr_sectors, false);
+	case REQ_OP_ZONE_APPEND:
+		return null_zone_write(cmd, sector, nr_sectors, true);
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_RESET_ALL:
 	case REQ_OP_ZONE_OPEN:
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 09/10] block: export bio_release_pages and bio_iov_iter_get_pages
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  2020-04-09 16:53 ` [PATCH v5 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
  9 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

Export bio_release_pages and bio_iov_iter_get_pages, so they can be used
from modular code.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---
Changes to v3:
- Use EXPORT_SYMBOL_GPL
---
 block/bio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 689f31357d30..4029a48f3828 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -934,6 +934,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 		put_page(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_release_pages);
 
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1061,6 +1062,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		bio_set_flag(bio, BIO_NO_PAGE_REF);
 	return bio->bi_vcnt ? 0 : ret;
 }
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)
 {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO
  2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (8 preceding siblings ...)
  2020-04-09 16:53 ` [PATCH v5 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
@ 2020-04-09 16:53 ` Johannes Thumshirn
  9 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-09 16:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Johannes Thumshirn

Synchronous direct I/O to a sequential write only zone can be issued using
the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
BIOs can potentially result in reordering, we cannot support asynchronous
IO via this interface.

We also can only dispatch up to queue_max_zone_append_sectors() via the
new zone-append method and have to return a short write back to user-space
in case an IO larger than queue_max_zone_append_sectors() has been issued.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/zonefs/super.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 3ce9829a6936..0bf7009f50a2 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/sched/mm.h>
 #include <linux/crc32.h>
+#include <linux/task_io_accounting_ops.h>
 
 #include "zonefs.h"
 
@@ -596,6 +597,61 @@ static const struct iomap_dio_ops zonefs_write_dio_ops = {
 	.end_io			= zonefs_file_write_dio_end_io,
 };
 
+static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	unsigned int max;
+	struct bio *bio;
+	ssize_t size;
+	int nr_pages;
+	ssize_t ret;
+
+	nr_pages = iov_iter_npages(from, BIO_MAX_PAGES);
+	if (!nr_pages)
+		return 0;
+
+	max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
+	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
+	iov_iter_truncate(from, max);
+
+	bio = bio_alloc_bioset(GFP_NOFS, nr_pages, &fs_bio_set);
+	if (!bio)
+		return -ENOMEM;
+
+	bio_set_dev(bio, bdev);
+	bio->bi_iter.bi_sector = zi->i_zsector;
+	bio->bi_write_hint = iocb->ki_hint;
+	bio->bi_ioprio = iocb->ki_ioprio;
+	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
+	if (iocb->ki_flags & IOCB_DSYNC)
+		bio->bi_opf |= REQ_FUA;
+
+	ret = bio_iov_iter_get_pages(bio, from);
+	if (unlikely(ret)) {
+		bio_io_error(bio);
+		return ret;
+	}
+	size = bio->bi_iter.bi_size;
+	task_io_account_write(ret);
+
+	if (iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, iocb);
+
+	ret = submit_bio_wait(bio);
+
+	bio_put(bio);
+
+	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
+	if (ret >= 0) {
+		iocb->ki_pos += size;
+		return size;
+	}
+
+	return ret;
+}
+
 /*
  * Handle direct writes. For sequential zone files, this is the only possible
  * write path. For these files, check that the user is issuing writes
@@ -611,6 +667,8 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
+	bool sync = is_sync_kiocb(iocb);
+	bool append = false;
 	size_t count;
 	ssize_t ret;
 
@@ -619,7 +677,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
 	 * on the inode lock but the second goes through but is now unaligned).
 	 */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) &&
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
 	    (iocb->ki_flags & IOCB_NOWAIT))
 		return -EOPNOTSUPP;
 
@@ -643,16 +701,22 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	/* Enforce sequential writes (append only) in sequential zones */
-	mutex_lock(&zi->i_truncate_mutex);
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && iocb->ki_pos != zi->i_wpoffset) {
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		mutex_lock(&zi->i_truncate_mutex);
+		if (iocb->ki_pos != zi->i_wpoffset) {
+			mutex_unlock(&zi->i_truncate_mutex);
+			ret = -EINVAL;
+			goto inode_unlock;
+		}
 		mutex_unlock(&zi->i_truncate_mutex);
-		ret = -EINVAL;
-		goto inode_unlock;
+		append = sync;
 	}
-	mutex_unlock(&zi->i_truncate_mutex);
 
-	ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-			   &zonefs_write_dio_ops, is_sync_kiocb(iocb));
+	if (append)
+		ret = zonefs_file_dio_append(iocb, from);
+	else
+		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
+				   &zonefs_write_dio_ops, sync);
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 04/10] block: Modify revalidate zones
  2020-04-09 16:53 ` [PATCH v5 04/10] block: Modify revalidate zones Johannes Thumshirn
@ 2020-04-10  0:27   ` Damien Le Moal
  2020-04-10  6:40   ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Damien Le Moal @ 2020-04-10  0:27 UTC (permalink / raw)
  To: Johannes Thumshirn, Jens Axboe
  Cc: hch, linux-block, Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

On 2020/04/10 1:54, Johannes Thumshirn wrote:
> From: Damien Le Moal <damien.lemoal@wdc.com>
> 
> Modify the interface of blk_revalidate_disk_zones() to add an optional
> revalidation callback function that a driver can use to extend checks and
> processing done during zone revalidation. The callback, if defined, is
> executed time after all zones are inspected and with the queue frozen.
> blk_revalidate_disk_zones() is renamed as __blk_revalidate_disk_zones()
> and blk_revalidate_disk_zones() implemented as an inline function calling
> __blk_revalidate_disk_zones() without no revalidation callback specified,
> resulting in an unchanged behavior for all callers of
> blk_revalidate_disk_zones().
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  block/blk-zoned.c      | 19 ++++++++++++++-----
>  include/linux/blkdev.h | 10 +++++++++-
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 00b025b8b7c0..6c37fec6859b 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -437,20 +437,27 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  }
>  
>  /**
> - * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
> - * @disk:	Target disk
> + * __blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
> + * @disk:		Target disk
> + * @revalidate_cb:	LLD callback
> + * @revalidate_data:	LLD callback argument
>   *
>   * Helper function for low-level device drivers to (re) allocate and initialize
>   * a disk request queue zone bitmaps. This functions should normally be called
>   * within the disk ->revalidate method for blk-mq based drivers.  For BIO based
>   * drivers only q->nr_zones needs to be updated so that the sysfs exposed value
>   * is correct.
> + * If the @revalidate_cb callback function is not NULL, the callback will be
> + * executed with the device request queue frozen after all zones have been
> + * checked.
>   */
> -int blk_revalidate_disk_zones(struct gendisk *disk)
> +int __blk_revalidate_disk_zones(struct gendisk *disk,
> +				revalidate_zones_cb revalidate_cb,
> +				void *revalidate_data)
>  {
>  	struct request_queue *q = disk->queue;
>  	struct blk_revalidate_zone_args args = {
> -		.disk		= disk,
> +		.disk			= disk,

Ooops... whitespace change here.


>  	};
>  	unsigned int noio_flag;
>  	int ret;
> @@ -480,6 +487,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
>  		q->nr_zones = args.nr_zones;
>  		swap(q->seq_zones_wlock, args.seq_zones_wlock);
>  		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
> +		if (revalidate_cb)
> +			revalidate_cb(disk, revalidate_data);
>  		ret = 0;
>  	} else {
>  		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
> @@ -491,4 +500,4 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
>  	kfree(args.conv_zones_bitmap);
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
> +EXPORT_SYMBOL_GPL(__blk_revalidate_disk_zones);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index e591b22ace03..a730cacda0f7 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -353,6 +353,8 @@ struct queue_limits {
>  typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
>  			       void *data);
>  
> +typedef void (*revalidate_zones_cb)(struct gendisk *disk, void *data);
> +
>  #ifdef CONFIG_BLK_DEV_ZONED
>  
>  #define BLK_ALL_ZONES  ((unsigned int)-1)
> @@ -362,7 +364,13 @@ unsigned int blkdev_nr_zones(struct gendisk *disk);
>  extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
>  			    sector_t sectors, sector_t nr_sectors,
>  			    gfp_t gfp_mask);
> -extern int blk_revalidate_disk_zones(struct gendisk *disk);
> +int __blk_revalidate_disk_zones(struct gendisk *disk,
> +				revalidate_zones_cb revalidate_cb,
> +				void *revalidate_data);
> +static inline int blk_revalidate_disk_zones(struct gendisk *disk)
> +{
> +	return __blk_revalidate_disk_zones(disk, NULL, NULL);
> +}
>  
>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>  				     unsigned int cmd, unsigned long arg);
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-09 16:53 ` [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
@ 2020-04-10  0:30   ` Damien Le Moal
  2020-04-10  6:18   ` Christoph Hellwig
  2020-04-10  7:23   ` Christoph Hellwig
  2 siblings, 0 replies; 29+ messages in thread
From: Damien Le Moal @ 2020-04-10  0:30 UTC (permalink / raw)
  To: Johannes Thumshirn, Jens Axboe
  Cc: hch, linux-block, Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

On 2020/04/10 1:54, Johannes Thumshirn wrote:
> Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) with a
> start LBA set to the target zone write pointer position.
> 
> In order to always know the write pointer position of a sequential write
> zone, the queue flag QUEUE_FLAG_ZONE_WP_OFST is set to get an

We dropped QUEUE_FLAG_ZONE_WP_OFST. The wp offset initialization uses
serialization of blk_revalidate_disk_zones() and the revalidate callback now.
You need to update the commit message to reflect that change.

> initialized write pointer offset array attached to the device request
> queue. The values of the cache are maintained in sync with the device
> as follows:
> 1) the write pointer offset of a zone is reset to 0 when a
>    REQ_OP_ZONE_RESET command completes.
> 2) the write pointer offset of a zone is set to the zone size when a
>    REQ_OP_ZONE_FINISH command completes.
> 3) the write pointer offset of a zone is incremented by the number of
>    512B sectors written when a write or a zone append command completes
> 4) the write pointer offset of all zones is reset to 0 when a
>    REQ_OP_ZONE_RESET_ALL command completes.
> 
> Since the block layer does not write lock zones for zone append
> commands, to ensure a sequential ordering of the write commands used for
> the emulation, the target zone of a zone append command is locked when
> the function sd_zbc_prepare_zone_append() is called from
> sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
> (e.g. a zone append is in-flight or a regular write has already locked
> the zone), the zone append command dispatching is delayed by returning
> BLK_STS_ZONE_RESOURCE.
> 
> To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL
> requests, use a spinlock to protect accesses and modifications of the zone
> write pointer offsets. This spinlock is initialized from sd_probe() using
> the new function sd_zbc_init().
> 
> Co-developed-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> ---
> Changes to v4:
> - fix brace nit
> - fix write-locking rules
> - revalidate zone reworked
> ---
>  drivers/scsi/sd.c     |  26 +++-
>  drivers/scsi/sd.h     |  40 ++++-
>  drivers/scsi/sd_zbc.c | 344 ++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 391 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 2710a0e5ae6d..05a7abd4295d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1206,6 +1206,14 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>  		}
>  	}
>  
> +	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
> +		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
> +		if (ret) {
> +			scsi_free_sgtables(cmd);
> +			return ret;
> +		}
> +	}
> +
>  	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
>  	dix = scsi_prot_sg_count(cmd);
>  	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
> @@ -1287,6 +1295,7 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
>  		return sd_setup_flush_cmnd(cmd);
>  	case REQ_OP_READ:
>  	case REQ_OP_WRITE:
> +	case REQ_OP_ZONE_APPEND:
>  		return sd_setup_read_write_cmnd(cmd);
>  	case REQ_OP_ZONE_RESET:
>  		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
> @@ -2055,7 +2064,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
>  
>   out:
>  	if (sd_is_zoned(sdkp))
> -		sd_zbc_complete(SCpnt, good_bytes, &sshdr);
> +		good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
>  
>  	SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
>  					   "sd_done: completed %d of %d bytes\n",
> @@ -3371,6 +3380,10 @@ static int sd_probe(struct device *dev)
>  	sdkp->first_scan = 1;
>  	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
>  
> +	error = sd_zbc_init_disk(sdkp);
> +	if (error)
> +		goto out_free_index;
> +
>  	sd_revalidate_disk(gd);
>  
>  	gd->flags = GENHD_FL_EXT_DEVT;
> @@ -3408,6 +3421,7 @@ static int sd_probe(struct device *dev)
>   out_put:
>  	put_disk(gd);
>   out_free:
> +	sd_zbc_release_disk(sdkp);
>  	kfree(sdkp);
>   out:
>  	scsi_autopm_put_device(sdp);
> @@ -3484,6 +3498,8 @@ static void scsi_disk_release(struct device *dev)
>  	put_disk(disk);
>  	put_device(&sdkp->device->sdev_gendev);
>  
> +	sd_zbc_release_disk(sdkp);
> +
>  	kfree(sdkp);
>  }
>  
> @@ -3664,19 +3680,19 @@ static int __init init_sd(void)
>  	if (!sd_page_pool) {
>  		printk(KERN_ERR "sd: can't init discard page pool\n");
>  		err = -ENOMEM;
> -		goto err_out_ppool;
> +		goto err_out_cdb_pool;
>  	}
>  
>  	err = scsi_register_driver(&sd_template.gendrv);
>  	if (err)
> -		goto err_out_driver;
> +		goto err_out_ppool;
>  
>  	return 0;
>  
> -err_out_driver:
> +err_out_ppool:
>  	mempool_destroy(sd_page_pool);
>  
> -err_out_ppool:
> +err_out_cdb_pool:
>  	mempool_destroy(sd_cdb_pool);
>  
>  err_out_cache:
> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
> index 50fff0bf8c8e..f7a4d6fcac6b 100644
> --- a/drivers/scsi/sd.h
> +++ b/drivers/scsi/sd.h
> @@ -79,6 +79,12 @@ struct scsi_disk {
>  	u32		zones_optimal_open;
>  	u32		zones_optimal_nonseq;
>  	u32		zones_max_open;
> +	u32		*zones_wp_ofst;
> +	spinlock_t	zones_wp_ofst_lock;
> +	u32		*rev_wp_ofst;
> +	struct mutex	rev_mutex;
> +	struct work_struct zone_wp_ofst_work;
> +	char		*zone_wp_update_buf;
>  #endif
>  	atomic_t	openers;
>  	sector_t	capacity;	/* size in logical blocks */
> @@ -207,17 +213,32 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
>  
>  #ifdef CONFIG_BLK_DEV_ZONED
>  
> +int sd_zbc_init_disk(struct scsi_disk *sdkp);
> +void sd_zbc_release_disk(struct scsi_disk *sdkp);
>  extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
>  extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
>  blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
>  					 unsigned char op, bool all);
> -extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
> -			    struct scsi_sense_hdr *sshdr);
> +unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
> +			     struct scsi_sense_hdr *sshdr);
>  int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
>  		unsigned int nr_zones, report_zones_cb cb, void *data);
>  
> +blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
> +				        unsigned int nr_blocks);
> +
>  #else /* CONFIG_BLK_DEV_ZONED */
>  
> +static inline int sd_zbc_init(void)
> +{
> +	return 0;
> +}
> +
> +static inline void sd_zbc_exit(void) {}
> +
> +static inline void sd_zbc_init_disk(struct scsi_disk *sdkp) {}
> +static inline void sd_zbc_release_disk(struct scsi_disk *sdkp) {}
> +
>  static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
>  				    unsigned char *buf)
>  {
> @@ -233,9 +254,18 @@ static inline blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
>  	return BLK_STS_TARGET;
>  }
>  
> -static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
> -				   unsigned int good_bytes,
> -				   struct scsi_sense_hdr *sshdr) {}
> +static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
> +			unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
> +{
> +	return 0;
> +}
> +
> +static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
> +						      sector_t *lba,
> +						      unsigned int nr_blocks)
> +{
> +	return BLK_STS_TARGET;
> +}
>  
>  #define sd_zbc_report_zones NULL
>  
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index ee156fbf3780..53cfe998a3f6 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -11,6 +11,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/vmalloc.h>
>  #include <linux/sched/mm.h>
> +#include <linux/mutex.h>
>  
>  #include <asm/unaligned.h>
>  
> @@ -19,11 +20,36 @@
>  
>  #include "sd.h"
>  
> +static unsigned int sd_zbc_get_zone_wp_ofst(struct blk_zone *zone)
> +{
> +	if (zone->type == ZBC_ZONE_TYPE_CONV)
> +		return 0;
> +
> +	switch (zone->cond) {
> +	case BLK_ZONE_COND_IMP_OPEN:
> +	case BLK_ZONE_COND_EXP_OPEN:
> +	case BLK_ZONE_COND_CLOSED:
> +		return zone->wp - zone->start;
> +	case BLK_ZONE_COND_FULL:
> +		return zone->len;
> +	case BLK_ZONE_COND_EMPTY:
> +	case BLK_ZONE_COND_OFFLINE:
> +	case BLK_ZONE_COND_READONLY:
> +	default:
> +		/*
> +		 * Offline and read-only zones do not have a valid
> +		 * write pointer. Use 0 as for an empty zone.
> +		 */
> +		return 0;
> +	}
> +}
> +
>  static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
>  			       unsigned int idx, report_zones_cb cb, void *data)
>  {
>  	struct scsi_device *sdp = sdkp->device;
>  	struct blk_zone zone = { 0 };
> +	int ret;
>  
>  	zone.type = buf[0] & 0x0f;
>  	zone.cond = (buf[1] >> 4) & 0xf;
> @@ -39,7 +65,14 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
>  	    zone.cond == ZBC_ZONE_COND_FULL)
>  		zone.wp = zone.start + zone.len;
>  
> -	return cb(&zone, idx, data);
> +	ret = cb(&zone, idx, data);
> +	if (ret)
> +		return ret;
> +
> +	if (sdkp->rev_wp_ofst)
> +		sdkp->rev_wp_ofst[idx] = sd_zbc_get_zone_wp_ofst(&zone);
> +
> +	return 0;
>  }
>  
>  /**
> @@ -229,6 +262,134 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
>  	return BLK_STS_OK;
>  }
>  
> +#define SD_ZBC_INVALID_WP_OFST	(~0u)
> +#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
> +
> +static int sd_zbc_update_wp_ofst_cb(struct blk_zone *zone, unsigned int idx,
> +				    void *data)
> +{
> +	struct scsi_disk *sdkp = data;
> +
> +	lockdep_assert_held(&sdkp->zones_wp_ofst_lock);
> +
> +	sdkp->zones_wp_ofst[idx] = sd_zbc_get_zone_wp_ofst(zone);
> +
> +	return 0;
> +}
> +
> +static void sd_zbc_update_wp_ofst_workfn(struct work_struct *work)
> +{
> +	struct scsi_disk *sdkp;
> +	unsigned int zno;
> +	int ret;
> +
> +	sdkp = container_of(work, struct scsi_disk, zone_wp_ofst_work);
> +
> +	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
> +	for (zno = 0; zno < sdkp->nr_zones; zno++) {
> +		if (sdkp->zones_wp_ofst[zno] != SD_ZBC_UPDATING_WP_OFST)
> +			continue;
> +
> +		spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
> +		ret = sd_zbc_do_report_zones(sdkp, sdkp->zone_wp_update_buf,
> +					     SD_BUF_SIZE,
> +					     zno * sdkp->zone_blocks, true);
> +		spin_lock_bh(&sdkp->zones_wp_ofst_lock);
> +		if (!ret)
> +			sd_zbc_parse_report(sdkp, sdkp->zone_wp_update_buf + 64,
> +					    zno, sd_zbc_update_wp_ofst_cb,
> +					    sdkp);
> +	}
> +	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
> +
> +	scsi_device_put(sdkp->device);
> +}
> +
> +static blk_status_t sd_zbc_update_wp_ofst(struct scsi_disk *sdkp,
> +					  unsigned int zno)
> +{
> +	/*
> +	 * We are about to schedule work to update a zone write pointer offset,
> +	 * which will cause the zone append command to be requeued. So make
> +	 * sure that the scsi device does not go away while the work is
> +	 * being processed.
> +	 */
> +	if (scsi_device_get(sdkp->device))
> +		return BLK_STS_IOERR;
> +
> +	sdkp->zones_wp_ofst[zno] = SD_ZBC_UPDATING_WP_OFST;
> +
> +	schedule_work(&sdkp->zone_wp_ofst_work);
> +
> +	return BLK_STS_RESOURCE;
> +}
> +
> +/**
> + * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
> + * @cmd: the command to setup
> + * @lba: the LBA to patch
> + * @nr_blocks: the number of LBAs to be written
> + *
> + * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
> + * @sd_zbc_prepare_zone_append() handles the necessary zone wrote locking and
> + * patching of the lba for an emulated ZONE_APPEND command.
> + *
> + * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will
> + * schedule a REPORT ZONES command and return BLK_STS_IOERR.
> + */
> +blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
> +					unsigned int nr_blocks)
> +{
> +	struct request *rq = cmd->request;
> +	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
> +	unsigned int wp_ofst, zno = blk_rq_zone_no(rq);
> +	blk_status_t ret;
> +
> +	ret = sd_zbc_cmnd_checks(cmd);
> +	if (ret != BLK_STS_OK)
> +		return ret;
> +
> +	if (!blk_rq_zone_is_seq(rq))
> +		return BLK_STS_IOERR;
> +
> +	/* Unlock of the write lock will happen in sd_zbc_complete() */
> +	if (!blk_req_zone_write_trylock(rq))
> +		return BLK_STS_ZONE_RESOURCE;
> +
> +	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
> +
> +	wp_ofst = sdkp->zones_wp_ofst[zno];
> +	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
> +		/* Write pointer offset update in progress: ask for a requeue */
> +		ret = BLK_STS_RESOURCE;
> +		goto err;
> +	}
> +
> +	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
> +		/* Invalid write pointer offset: trigger an update from disk */
> +		ret = sd_zbc_update_wp_ofst(sdkp, zno);
> +		goto err;
> +	}
> +
> +	wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
> +	if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
> +		ret = BLK_STS_IOERR;
> +		goto err;
> +	}
> +
> +	/* Set the LBA for the write command used to emulate zone append */
> +	*lba += wp_ofst;
> +
> +	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
> +
> +	return BLK_STS_OK;
> +
> +err:
> +	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
> +	blk_req_zone_write_unlock(rq);
> +	return ret;
> +}
> +
>  /**
>   * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
>   *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
> @@ -269,16 +430,104 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
>  	return BLK_STS_OK;
>  }
>  
> +static bool sd_zbc_need_zone_wp_update(struct request *rq)
> +{
> +	switch (req_op(rq)) {
> +	case REQ_OP_ZONE_APPEND:
> +	case REQ_OP_ZONE_FINISH:
> +	case REQ_OP_ZONE_RESET:
> +	case REQ_OP_ZONE_RESET_ALL:
> +		return true;
> +	case REQ_OP_WRITE:
> +	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_WRITE_SAME:
> +		return blk_rq_zone_is_seq(rq);
> +	default:
> +		return false;
> +	}
> +}
> +
> +/**
> + * sd_zbc_zone_wp_update - Update cached zone write pointer upon cmd completion
> + * @cmd: Completed command
> + * @good_bytes: Command reply bytes
> + *
> + * Called from sd_zbc_complete() to handle the update of the cached zone write
> + * pointer value in case an update is needed.
> + */
> +static unsigned int sd_zbc_zone_wp_update(struct scsi_cmnd *cmd,
> +					  unsigned int good_bytes)
> +{
> +	int result = cmd->result;
> +	struct request *rq = cmd->request;
> +	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
> +	unsigned int zno = blk_rq_zone_no(rq);
> +	enum req_opf op = req_op(rq);
> +
> +	/*
> +	 * If we got an error for a command that needs updating the write
> +	 * pointer offset cache, we must mark the zone wp offset entry as
> +	 * invalid to force an update from disk the next time a zone append
> +	 * command is issued.
> +	 */
> +	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
> +
> +	if (result && op != REQ_OP_ZONE_RESET_ALL) {
> +		if (op == REQ_OP_ZONE_APPEND) {
> +			/* Force complete completion (no retry) */
> +			good_bytes = 0;
> +			scsi_set_resid(cmd, blk_rq_bytes(rq));
> +		}
> +
> +		/*
> +		 * Force an update of the zone write pointer offset on
> +		 * the next zone append access.
> +		 */
> +		if (sdkp->zones_wp_ofst[zno] != SD_ZBC_UPDATING_WP_OFST)
> +			sdkp->zones_wp_ofst[zno] = SD_ZBC_INVALID_WP_OFST;
> +		goto unlock_wp_ofst;
> +	}
> +
> +	switch (op) {
> +	case REQ_OP_ZONE_APPEND:
> +		rq->__sector += sdkp->zones_wp_ofst[zno];
> +		/* fallthrough */
> +	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_WRITE_SAME:
> +	case REQ_OP_WRITE:
> +		if (sdkp->zones_wp_ofst[zno] < sd_zbc_zone_sectors(sdkp))
> +			sdkp->zones_wp_ofst[zno] += good_bytes >> SECTOR_SHIFT;
> +		break;
> +	case REQ_OP_ZONE_RESET:
> +		sdkp->zones_wp_ofst[zno] = 0;
> +		break;
> +	case REQ_OP_ZONE_FINISH:
> +		sdkp->zones_wp_ofst[zno] = sd_zbc_zone_sectors(sdkp);
> +		break;
> +	case REQ_OP_ZONE_RESET_ALL:
> +		memset(sdkp->zones_wp_ofst, 0,
> +		       sdkp->nr_zones * sizeof(unsigned int));
> +		break;
> +	default:
> +		break;
> +	}
> +
> +unlock_wp_ofst:
> +	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
> +
> +	return good_bytes;
> +}
> +
>  /**
>   * sd_zbc_complete - ZBC command post processing.
>   * @cmd: Completed command
>   * @good_bytes: Command reply bytes
>   * @sshdr: command sense header
>   *
> - * Called from sd_done(). Process report zones reply and handle reset zone
> - * and write commands errors.
> + * Called from sd_done() to handle zone commands errors and updates to the
> + * device queue zone write pointer offset cahce.
>   */
> -void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
> +unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
>  		     struct scsi_sense_hdr *sshdr)
>  {
>  	int result = cmd->result;
> @@ -294,7 +543,18 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
>  		 * so be quiet about the error.
>  		 */
>  		rq->rq_flags |= RQF_QUIET;
> +		goto unlock_zone;
>  	}
> +
> +	if (sd_zbc_need_zone_wp_update(rq))
> +		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
> +
> +
> +unlock_zone:
> +	if (req_op(rq) == REQ_OP_ZONE_APPEND)
> +		blk_req_zone_write_unlock(rq);
> +
> +	return good_bytes;
>  }
>  
>  /**
> @@ -396,11 +656,48 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf,
>  	return 0;
>  }
>  
> +static void sd_zbc_revalidate_zones_cb(struct gendisk *disk, void *data)
> +{
> +	struct scsi_disk *sdkp = scsi_disk(disk);
> +
> +	swap(sdkp->zones_wp_ofst, sdkp->rev_wp_ofst);
> +}
> +
> +static int sd_zbc_revalidate_zones(struct scsi_disk *sdkp,
> +				   unsigned int nr_zones)
> +{
> +	int ret;
> +
> +	/*
> +	 * Make sure revalidate zones are serialized to ensure exclusive
> +	 * updates of the temporary array sdkp->rev_wp_ofst.
> +	 */
> +	mutex_lock(&sdkp->rev_mutex);
> +
> +	sdkp->rev_wp_ofst = kvcalloc(nr_zones, sizeof(u32), GFP_NOIO);
> +	if (!sdkp->rev_wp_ofst) {
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +
> +	ret = __blk_revalidate_disk_zones(sdkp->disk,
> +					sd_zbc_revalidate_zones_cb, NULL);
> +	kvfree(sdkp->rev_wp_ofst);
> +	sdkp->rev_wp_ofst = NULL;
> +
> +unlock:
> +	mutex_unlock(&sdkp->rev_mutex);
> +
> +	return ret;
> +}
> +
>  int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
>  {
>  	struct gendisk *disk = sdkp->disk;
> +	struct request_queue *q = disk->queue;
>  	unsigned int nr_zones;
>  	u32 zone_blocks = 0;
> +	u32 max_append;
>  	int ret;
>  
>  	if (!sd_is_zoned(sdkp))
> @@ -420,10 +717,14 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
>  	if (ret != 0)
>  		goto err;
>  
> +	max_append = min_t(u32, logical_to_sectors(sdkp->device, zone_blocks),
> +			   q->limits.max_segments << (PAGE_SHIFT - 9));
> +	max_append = min_t(u32, max_append, queue_max_hw_sectors(q));
> +
>  	/* The drive satisfies the kernel restrictions: set it up */
> -	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue);
> -	blk_queue_required_elevator_features(sdkp->disk->queue,
> -					     ELEVATOR_F_ZBD_SEQ_WRITE);
> +	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> +	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
> +	blk_queue_max_zone_append_sectors(q, max_append);
>  	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
>  
>  	/* READ16/WRITE16 is mandatory for ZBC disks */
> @@ -443,8 +744,8 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
>  
>  	if (sdkp->zone_blocks != zone_blocks ||
>  	    sdkp->nr_zones != nr_zones ||
> -	    disk->queue->nr_zones != nr_zones) {
> -		ret = blk_revalidate_disk_zones(disk);
> +	    q->nr_zones != nr_zones) {
> +		ret = sd_zbc_revalidate_zones(sdkp, nr_zones);
>  		if (ret != 0)
>  			goto err;
>  		sdkp->zone_blocks = zone_blocks;
> @@ -475,3 +776,28 @@ void sd_zbc_print_zones(struct scsi_disk *sdkp)
>  			  sdkp->nr_zones,
>  			  sdkp->zone_blocks);
>  }
> +
> +int sd_zbc_init_disk(struct scsi_disk *sdkp)
> +{
> +	if (!sd_is_zoned(sdkp))
> +		return 0;
> +
> +	sdkp->zones_wp_ofst = NULL;
> +	spin_lock_init(&sdkp->zones_wp_ofst_lock);
> +	sdkp->rev_wp_ofst = NULL;
> +	mutex_init(&sdkp->rev_mutex);
> +	INIT_WORK(&sdkp->zone_wp_ofst_work, sd_zbc_update_wp_ofst_workfn);
> +	sdkp->zone_wp_update_buf = kzalloc(SD_BUF_SIZE, GFP_KERNEL);
> +	if (!sdkp->zone_wp_update_buf)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void sd_zbc_release_disk(struct scsi_disk *sdkp)
> +{
> +	kvfree(sdkp->zones_wp_ofst);
> +	sdkp->zones_wp_ofst = NULL;
> +	kfree(sdkp->zone_wp_update_buf);
> +	sdkp->zone_wp_update_buf = NULL;
> +}
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables
  2020-04-09 16:53 ` [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables Johannes Thumshirn
@ 2020-04-10  5:58   ` Christoph Hellwig
  2020-04-10  7:46     ` Johannes Thumshirn
  2020-04-10 14:22   ` Bart Van Assche
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-10  5:58 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

Looks good, althrough we really don't need the extern for the
prototype in the header (that also applies to a few other patches in
the series):

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-09 16:53 ` [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
  2020-04-10  0:30   ` Damien Le Moal
@ 2020-04-10  6:18   ` Christoph Hellwig
  2020-04-10  6:38     ` Christoph Hellwig
  2020-04-10  7:54     ` Johannes Thumshirn
  2020-04-10  7:23   ` Christoph Hellwig
  2 siblings, 2 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-10  6:18 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

On Fri, Apr 10, 2020 at 01:53:49AM +0900, Johannes Thumshirn wrote:
> +	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
> +		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
> +		if (ret) {
> +			scsi_free_sgtables(cmd);
> +			return ret;
> +		}
> +	}

So actually.

I've been trying to understand the lifetime of the sgtables.  Shouldn't
we free them in the midayer ->init_command returned BLK_STS_*RESOURCE
instead?  It seems like this just catches one particular error instead
of the whole category?  The end of scsi_queue_rq seem like a particular
good place, as that also releases the resources for the "hard" errors.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-10  6:18   ` Christoph Hellwig
@ 2020-04-10  6:38     ` Christoph Hellwig
  2020-04-10  8:01       ` Johannes Thumshirn
  2020-04-14 11:09       ` Johannes Thumshirn
  2020-04-10  7:54     ` Johannes Thumshirn
  1 sibling, 2 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-10  6:38 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

On Thu, Apr 09, 2020 at 11:18:22PM -0700, Christoph Hellwig wrote:
> On Fri, Apr 10, 2020 at 01:53:49AM +0900, Johannes Thumshirn wrote:
> > +	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
> > +		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
> > +		if (ret) {
> > +			scsi_free_sgtables(cmd);
> > +			return ret;
> > +		}
> > +	}
> 
> So actually.
> 
> I've been trying to understand the lifetime of the sgtables.  Shouldn't
> we free them in the midayer ->init_command returned BLK_STS_*RESOURCE
> instead?  It seems like this just catches one particular error instead
> of the whole category?  The end of scsi_queue_rq seem like a particular
> good place, as that also releases the resources for the "hard" errors.

Looking more the situation seems even worse.  If scsi_mq_prep_fn
isn't successfull we never seem to free the sgtables, even for fatal
errors.  So I think we need a real bug fix here in front of the series.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 04/10] block: Modify revalidate zones
  2020-04-09 16:53 ` [PATCH v5 04/10] block: Modify revalidate zones Johannes Thumshirn
  2020-04-10  0:27   ` Damien Le Moal
@ 2020-04-10  6:40   ` Christoph Hellwig
  2020-04-10  6:55     ` Damien Le Moal
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-10  6:40 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

On Fri, Apr 10, 2020 at 01:53:46AM +0900, Johannes Thumshirn wrote:
> From: Damien Le Moal <damien.lemoal@wdc.com>
> 
> Modify the interface of blk_revalidate_disk_zones() to add an optional
> revalidation callback function that a driver can use to extend checks and
> processing done during zone revalidation. The callback, if defined, is
> executed time after all zones are inspected and with the queue frozen.
> blk_revalidate_disk_zones() is renamed as __blk_revalidate_disk_zones()
> and blk_revalidate_disk_zones() implemented as an inline function calling
> __blk_revalidate_disk_zones() without no revalidation callback specified,
> resulting in an unchanged behavior for all callers of
> blk_revalidate_disk_zones().

The data argument to __blk_revalidate_disk_zones and the cllback is now
unused.  I also think we now merge __blk_revalidate_disk_zones and
blk_revalidate_disk_zones instead of having two versions for a grand
total of two callers.  Something like this on top of your whole branch:

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 6c37fec6859b..0e7763a590e5 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -437,23 +437,21 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 }
 
 /**
- * __blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
+ * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
  * @disk:		Target disk
- * @revalidate_cb:	LLD callback
- * @revalidate_data:	LLD callback argument
+ * @update_driver_data:	Callback to update driver data on the frozen disk
  *
  * Helper function for low-level device drivers to (re) allocate and initialize
  * a disk request queue zone bitmaps. This functions should normally be called
  * within the disk ->revalidate method for blk-mq based drivers.  For BIO based
  * drivers only q->nr_zones needs to be updated so that the sysfs exposed value
  * is correct.
- * If the @revalidate_cb callback function is not NULL, the callback will be
- * executed with the device request queue frozen after all zones have been
+ * If the @update_driver_data callback function is not NULL, the callback will
+ * be executed with the device request queue frozen after all zones have been
  * checked.
  */
-int __blk_revalidate_disk_zones(struct gendisk *disk,
-				revalidate_zones_cb revalidate_cb,
-				void *revalidate_data)
+int blk_revalidate_disk_zones(struct gendisk *disk,
+		void (*update_driver_data)(struct gendisk *disk))
 {
 	struct request_queue *q = disk->queue;
 	struct blk_revalidate_zone_args args = {
@@ -487,8 +485,8 @@ int __blk_revalidate_disk_zones(struct gendisk *disk,
 		q->nr_zones = args.nr_zones;
 		swap(q->seq_zones_wlock, args.seq_zones_wlock);
 		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
-		if (revalidate_cb)
-			revalidate_cb(disk, revalidate_data);
+		if (update_driver_data)
+			update_driver_data(disk);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
@@ -500,4 +498,4 @@ int __blk_revalidate_disk_zones(struct gendisk *disk,
 	kfree(args.conv_zones_bitmap);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(__blk_revalidate_disk_zones);
+EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index b664be0bbb5e..f7beb72a321a 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -71,7 +71,7 @@ int null_register_zoned_dev(struct nullb *nullb)
 	struct request_queue *q = nullb->q;
 
 	if (queue_is_mq(q)) {
-		int ret = blk_revalidate_disk_zones(nullb->disk);
+		int ret = blk_revalidate_disk_zones(nullb->disk, NULL);
 
 		if (ret)
 			return ret;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 53cfe998a3f6..893d2e0da255 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -656,7 +656,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf,
 	return 0;
 }
 
-static void sd_zbc_revalidate_zones_cb(struct gendisk *disk, void *data)
+static void sd_zbc_update_zone_data(struct gendisk *disk)
 {
 	struct scsi_disk *sdkp = scsi_disk(disk);
 
@@ -680,8 +680,7 @@ static int sd_zbc_revalidate_zones(struct scsi_disk *sdkp,
 		goto unlock;
 	}
 
-	ret = __blk_revalidate_disk_zones(sdkp->disk,
-					sd_zbc_revalidate_zones_cb, NULL);
+	ret = blk_revalidate_disk_zones(sdkp->disk, sd_zbc_update_zone_data);
 	kvfree(sdkp->rev_wp_ofst);
 	sdkp->rev_wp_ofst = NULL;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a730cacda0f7..d970c36cb8d3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -353,8 +353,6 @@ struct queue_limits {
 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
 			       void *data);
 
-typedef void (*revalidate_zones_cb)(struct gendisk *disk, void *data);
-
 #ifdef CONFIG_BLK_DEV_ZONED
 
 #define BLK_ALL_ZONES  ((unsigned int)-1)
@@ -364,14 +362,8 @@ unsigned int blkdev_nr_zones(struct gendisk *disk);
 extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 			    sector_t sectors, sector_t nr_sectors,
 			    gfp_t gfp_mask);
-int __blk_revalidate_disk_zones(struct gendisk *disk,
-				revalidate_zones_cb revalidate_cb,
-				void *revalidate_data);
-static inline int blk_revalidate_disk_zones(struct gendisk *disk)
-{
-	return __blk_revalidate_disk_zones(disk, NULL, NULL);
-}
-
+int blk_revalidate_disk_zones(struct gendisk *disk,
+		void (*update_driver_data)(struct gendisk *disk));
 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 				     unsigned int cmd, unsigned long arg);
 extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 04/10] block: Modify revalidate zones
  2020-04-10  6:40   ` Christoph Hellwig
@ 2020-04-10  6:55     ` Damien Le Moal
  0 siblings, 0 replies; 29+ messages in thread
From: Damien Le Moal @ 2020-04-10  6:55 UTC (permalink / raw)
  To: hch, Johannes Thumshirn
  Cc: Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 2020/04/10 15:40, Christoph Hellwig wrote:
> On Fri, Apr 10, 2020 at 01:53:46AM +0900, Johannes Thumshirn wrote:
>> From: Damien Le Moal <damien.lemoal@wdc.com>
>>
>> Modify the interface of blk_revalidate_disk_zones() to add an optional
>> revalidation callback function that a driver can use to extend checks and
>> processing done during zone revalidation. The callback, if defined, is
>> executed time after all zones are inspected and with the queue frozen.
>> blk_revalidate_disk_zones() is renamed as __blk_revalidate_disk_zones()
>> and blk_revalidate_disk_zones() implemented as an inline function calling
>> __blk_revalidate_disk_zones() without no revalidation callback specified,
>> resulting in an unchanged behavior for all callers of
>> blk_revalidate_disk_zones().
> 
> The data argument to __blk_revalidate_disk_zones and the cllback is now
> unused.  I also think we now merge __blk_revalidate_disk_zones and
> blk_revalidate_disk_zones instead of having two versions for a grand
> total of two callers.  Something like this on top of your whole branch:

Yes, indeed. Even simpler. Will add this change to v6.

> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 6c37fec6859b..0e7763a590e5 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -437,23 +437,21 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  }
>  
>  /**
> - * __blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
> + * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
>   * @disk:		Target disk
> - * @revalidate_cb:	LLD callback
> - * @revalidate_data:	LLD callback argument
> + * @update_driver_data:	Callback to update driver data on the frozen disk
>   *
>   * Helper function for low-level device drivers to (re) allocate and initialize
>   * a disk request queue zone bitmaps. This functions should normally be called
>   * within the disk ->revalidate method for blk-mq based drivers.  For BIO based
>   * drivers only q->nr_zones needs to be updated so that the sysfs exposed value
>   * is correct.
> - * If the @revalidate_cb callback function is not NULL, the callback will be
> - * executed with the device request queue frozen after all zones have been
> + * If the @update_driver_data callback function is not NULL, the callback will
> + * be executed with the device request queue frozen after all zones have been
>   * checked.
>   */
> -int __blk_revalidate_disk_zones(struct gendisk *disk,
> -				revalidate_zones_cb revalidate_cb,
> -				void *revalidate_data)
> +int blk_revalidate_disk_zones(struct gendisk *disk,
> +		void (*update_driver_data)(struct gendisk *disk))
>  {
>  	struct request_queue *q = disk->queue;
>  	struct blk_revalidate_zone_args args = {
> @@ -487,8 +485,8 @@ int __blk_revalidate_disk_zones(struct gendisk *disk,
>  		q->nr_zones = args.nr_zones;
>  		swap(q->seq_zones_wlock, args.seq_zones_wlock);
>  		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
> -		if (revalidate_cb)
> -			revalidate_cb(disk, revalidate_data);
> +		if (update_driver_data)
> +			update_driver_data(disk);
>  		ret = 0;
>  	} else {
>  		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
> @@ -500,4 +498,4 @@ int __blk_revalidate_disk_zones(struct gendisk *disk,
>  	kfree(args.conv_zones_bitmap);
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(__blk_revalidate_disk_zones);
> +EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
> diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
> index b664be0bbb5e..f7beb72a321a 100644
> --- a/drivers/block/null_blk_zoned.c
> +++ b/drivers/block/null_blk_zoned.c
> @@ -71,7 +71,7 @@ int null_register_zoned_dev(struct nullb *nullb)
>  	struct request_queue *q = nullb->q;
>  
>  	if (queue_is_mq(q)) {
> -		int ret = blk_revalidate_disk_zones(nullb->disk);
> +		int ret = blk_revalidate_disk_zones(nullb->disk, NULL);
>  
>  		if (ret)
>  			return ret;
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 53cfe998a3f6..893d2e0da255 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -656,7 +656,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf,
>  	return 0;
>  }
>  
> -static void sd_zbc_revalidate_zones_cb(struct gendisk *disk, void *data)
> +static void sd_zbc_update_zone_data(struct gendisk *disk)
>  {
>  	struct scsi_disk *sdkp = scsi_disk(disk);
>  
> @@ -680,8 +680,7 @@ static int sd_zbc_revalidate_zones(struct scsi_disk *sdkp,
>  		goto unlock;
>  	}
>  
> -	ret = __blk_revalidate_disk_zones(sdkp->disk,
> -					sd_zbc_revalidate_zones_cb, NULL);
> +	ret = blk_revalidate_disk_zones(sdkp->disk, sd_zbc_update_zone_data);
>  	kvfree(sdkp->rev_wp_ofst);
>  	sdkp->rev_wp_ofst = NULL;
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index a730cacda0f7..d970c36cb8d3 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -353,8 +353,6 @@ struct queue_limits {
>  typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
>  			       void *data);
>  
> -typedef void (*revalidate_zones_cb)(struct gendisk *disk, void *data);
> -
>  #ifdef CONFIG_BLK_DEV_ZONED
>  
>  #define BLK_ALL_ZONES  ((unsigned int)-1)
> @@ -364,14 +362,8 @@ unsigned int blkdev_nr_zones(struct gendisk *disk);
>  extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
>  			    sector_t sectors, sector_t nr_sectors,
>  			    gfp_t gfp_mask);
> -int __blk_revalidate_disk_zones(struct gendisk *disk,
> -				revalidate_zones_cb revalidate_cb,
> -				void *revalidate_data);
> -static inline int blk_revalidate_disk_zones(struct gendisk *disk)
> -{
> -	return __blk_revalidate_disk_zones(disk, NULL, NULL);
> -}
> -
> +int blk_revalidate_disk_zones(struct gendisk *disk,
> +		void (*update_driver_data)(struct gendisk *disk));
>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>  				     unsigned int cmd, unsigned long arg);
>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND
  2020-04-09 16:53 ` [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-04-10  7:10   ` Christoph Hellwig
  2020-04-14  9:43     ` Johannes Thumshirn
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-10  7:10 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

I've just been auditing the bio code for now and have a few suggestions:

 - we really should be reusing the passthrough bio handling for
   zone append instead of reinventing it
 - I think __bio_iov_iter_get_pages should be split into a separate
   append version.  That matches the bvec split (which we fail to
   handle properly for append), avoids a branch for every page in
   the fast path and generall seems to look cleaner.

Patch on top of your whole branch attached:

diff --git a/block/bio.c b/block/bio.c
index 4029a48f3828..dd84bd5adc24 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,54 +679,6 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
-static bool bio_try_merge_zone_append_page(struct bio *bio, struct page *page,
-					   unsigned int len, unsigned int off,
-					   bool *same_page)
-{
-	struct request_queue *q = bio->bi_disk->queue;
-	struct bio_vec *bv;
-	unsigned long mask = queue_segment_boundary(q);
-	phys_addr_t addr1, addr2;
-
-	if (bio->bi_vcnt < 1)
-		return false;
-
-	bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-	addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
-	addr2 = page_to_phys(page) + off + len - 1;
-
-	if ((addr1 | mask) != (addr2 | mask))
-		return false;
-	if (bv->bv_len + len > queue_max_segment_size(q))
-		return false;
-	return __bio_try_merge_page(bio, page, len, off, same_page);
-}
-
-static int bio_add_append_page(struct bio *bio, struct page *page, unsigned len,
-			       size_t offset)
-{
-	struct request_queue *q = bio->bi_disk->queue;
-	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
-	bool same_page = false;
-
-	if (WARN_ON_ONCE(!max_append_sectors))
-		return 0;
-
-	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
-		return 0;
-
-	if (bio_try_merge_zone_append_page(bio, page, len, offset, &same_page))
-		return len;
-
-	if (bio->bi_vcnt >= queue_max_segments(q))
-		return 0;
-
-	__bio_add_page(bio, page, len, offset);
-
-	return len;
-}
-
 static inline bool page_is_mergeable(const struct bio_vec *bv,
 		struct page *page, unsigned int len, unsigned int off,
 		bool *same_page)
@@ -746,9 +698,13 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 	return true;
 }
 
-static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
-		struct page *page, unsigned len, unsigned offset,
-		bool *same_page)
+/*
+ * Try to merge a page into a segment, while obeying the hardware segment
+ * size limit.  This is not for normal read/write bios, but for passthrough
+ * or Zone Append operations that we can't split.
+ */
+static bool bio_try_merge_hw_seg(struct request_queue *q, struct bio *bio,
+		struct page *page, unsigned len, unsigned offset, bool *same_page)
 {
 	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 	unsigned long mask = queue_segment_boundary(q);
@@ -762,39 +718,24 @@ static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
 	return __bio_try_merge_page(bio, page, len, offset, same_page);
 }
 
-/**
- *	__bio_add_pc_page	- attempt to add page to passthrough bio
- *	@q: the target queue
- *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
- *	@same_page: return if the merge happen inside the same page
- *
- *	Attempt to add a page to the bio_vec maplist. This can fail for a
- *	number of reasons, such as the bio being full or target block device
- *	limitations. The target block device must allow bio's up to PAGE_SIZE,
- *	so it is always possible to add a single page to an empty bio.
- *
- *	This should only be used by passthrough bios.
+/*
+ * Add a page to a bio while respecting the hardware max_sectors, max_segment
+ * and gap limitations.
  */
-static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
+static int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
-		bool *same_page)
+		unsigned int max_sectors, bool *same_page)
 {
 	struct bio_vec *bvec;
 
-	/*
-	 * cloned bio must not modify vec list
-	 */
-	if (unlikely(bio_flagged(bio, BIO_CLONED)))
+	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return 0;
 
-	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
 		return 0;
 
 	if (bio->bi_vcnt > 0) {
-		if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page))
+		if (bio_try_merge_hw_seg(q, bio, page, len, offset, same_page))
 			return len;
 
 		/*
@@ -821,11 +762,27 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	return len;
 }
 
+/**
+ * bio_add_pc_page	- attempt to add page to passthrough bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by passthrough bios.
+ */
 int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
-	return __bio_add_pc_page(q, bio, page, len, offset, &same_page);
+	return bio_add_hw_page(q, bio, page, len, offset,
+			queue_max_hw_sectors(q), &same_page);
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
@@ -993,27 +950,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		struct page *page = pages[i];
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			int ret;
-
-			if (bio_try_merge_zone_append_page(bio, page, len,
-							   offset,
-							   &same_page)) {
-				if (same_page)
-					put_page(page);
-			} else {
-				ret = bio_add_append_page(bio, page, len,
-							  offset);
-				if (ret != len)
-					return -EINVAL;
-			}
-		} else if (__bio_try_merge_page(bio, page, len, offset,
-						&same_page)) {
+		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 			if (same_page)
 				put_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len)))
-                                return -EINVAL;
+				return -EINVAL;
 			__bio_add_page(bio, page, len, offset);
 		}
 		offset = 0;
@@ -1023,6 +965,50 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return 0;
 }
 
+static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
+	struct page **pages = (struct page **)bv;
+	ssize_t size, left;
+	unsigned len, i;
+	size_t offset;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	*/
+	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
+
+	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	for (left = size, i = 0; left > 0; left -= len, i++) {
+		struct page *page = pages[i];
+		bool same_page = false;
+
+		len = min_t(size_t, PAGE_SIZE - offset, left);
+		if (bio_add_hw_page(q, bio, page, len, offset,
+				max_append_sectors, &same_page) != len)
+			return -EINVAL;
+		if (same_page)
+			put_page(page);
+		offset = 0;
+	}
+
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1052,10 +1038,16 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return -EINVAL;
 
 	do {
-		if (is_bvec)
-			ret = __bio_iov_bvec_add_pages(bio, iter);
-		else
-			ret = __bio_iov_iter_get_pages(bio, iter);
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			if (WARN_ON_ONCE(is_bvec))
+				return -EINVAL;
+			ret = __bio_iov_append_get_pages(bio, iter);
+		} else {
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_pages(bio, iter);
+			else
+				ret = __bio_iov_iter_get_pages(bio, iter);
+		}
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	if (is_bvec)
@@ -1455,6 +1447,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 			     struct iov_iter *iter,
 			     gfp_t gfp_mask)
 {
+	unsigned int max_sectors = queue_max_hw_sectors(q);
 	int j;
 	struct bio *bio;
 	int ret;
@@ -1492,8 +1485,8 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 				if (n > bytes)
 					n = bytes;
 
-				if (!__bio_add_pc_page(q, bio, page, n, offs,
-						&same_page)) {
+				if (!bio_add_hw_page(q, bio, page, n, offs,
+						max_sectors, &same_page)) {
 					if (same_page)
 						put_page(page);
 					break;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-09 16:53 ` [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
  2020-04-10  0:30   ` Damien Le Moal
  2020-04-10  6:18   ` Christoph Hellwig
@ 2020-04-10  7:23   ` Christoph Hellwig
  2020-04-14 10:18     ` Johannes Thumshirn
  2 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-10  7:23 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org

> +	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
> +
> +	wp_ofst = sdkp->zones_wp_ofst[zno];
> +	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
> +		/* Write pointer offset update in progress: ask for a requeue */
> +		ret = BLK_STS_RESOURCE;
> +		goto err;
> +	}
> +
> +	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
> +		/* Invalid write pointer offset: trigger an update from disk */
> +		ret = sd_zbc_update_wp_ofst(sdkp, zno);
> +		goto err;
> +	}

Maybe I'm a little too clever for my own sake, but what about something
like:

	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
	switch (wp_ofst) {
	case SD_ZBC_INVALID_WP_OFST:
		if (scsi_device_get(sdkp->device)) {
			ret = BLK_STS_IOERR;
			break;
		}
		sdkp->zones_wp_ofst[zno] = SD_ZBC_UPDATING_WP_OFST;
		schedule_work(&sdkp->zone_wp_ofst_work);
		/*FALLTHRU*/
	case SD_ZBC_UPDATING_WP_OFST:
		ret = BLK_STS_DEV_RESOURCE;
		break;
	default:
		wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
		if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
			ret = BLK_STS_IOERR;
			break;
		}

		*lba += wp_ofst;
	}
	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
	if (ret)
		blk_req_zone_write_unlock(rq);
	return ret;
}

>  	int result = cmd->result;
> @@ -294,7 +543,18 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
>  		 * so be quiet about the error.
>  		 */
>  		rq->rq_flags |= RQF_QUIET;
> +		goto unlock_zone;
>  	}
> +
> +	if (sd_zbc_need_zone_wp_update(rq))
> +		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
> +
> +
> +unlock_zone:

why not use a good old "else if" here?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables
  2020-04-10  5:58   ` Christoph Hellwig
@ 2020-04-10  7:46     ` Johannes Thumshirn
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-10  7:46 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 10/04/2020 07:58, Christoph Hellwig wrote:
> Looks good, althrough we really don't need the extern for the
> prototype in the header (that also applies to a few other patches in
> the series):

I know but all prototypes have an 'extern' prefix so I did to match the 
style of the file.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-10  6:18   ` Christoph Hellwig
  2020-04-10  6:38     ` Christoph Hellwig
@ 2020-04-10  7:54     ` Johannes Thumshirn
  1 sibling, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-10  7:54 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 10/04/2020 08:18, Christoph Hellwig wrote:
> On Fri, Apr 10, 2020 at 01:53:49AM +0900, Johannes Thumshirn wrote:
>> +	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
>> +		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
>> +		if (ret) {
>> +			scsi_free_sgtables(cmd);
>> +			return ret;
>> +		}
>> +	}
> 
> So actually.
> 
> I've been trying to understand the lifetime of the sgtables.  Shouldn't
> we free them in the midayer ->init_command returned BLK_STS_*RESOURCE
> instead?  It seems like this just catches one particular error instead
> of the whole category?  The end of scsi_queue_rq seem like a particular
> good place, as that also releases the resources for the "hard" errors.
> 

I did have this in a private version but it leaked the sgtables in case 
we had an inconsistent wp_ofst cache (I injected a zone-reset via 
/dev/sg so the driver didn't see it).

The code that handles freeing of the sgtables in scsi_queue_rq's error 
handling won't get called on BLK_STS_*RESOURCE, which makes perfect 
sense at least for BLK_STS_RESOURCE, as if you can't allocate the memory 
for the sgtables, there's no need to free it.

The only code in SCSI being a special snowflake there is 
sd_zbc_prepare_zone_append(), this is why I ended up freeing the 
sgtables after it returned an error.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-10  6:38     ` Christoph Hellwig
@ 2020-04-10  8:01       ` Johannes Thumshirn
  2020-04-14 11:09       ` Johannes Thumshirn
  1 sibling, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-10  8:01 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 10/04/2020 08:39, Christoph Hellwig wrote:
> Looking more the situation seems even worse.  If scsi_mq_prep_fn
> isn't successfull we never seem to free the sgtables, even for fatal
> errors.  So I think we need a real bug fix here in front of the series.
> 

I'll look at in on Tuesday, Easter holidays here.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables
  2020-04-09 16:53 ` [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables Johannes Thumshirn
  2020-04-10  5:58   ` Christoph Hellwig
@ 2020-04-10 14:22   ` Bart Van Assche
  1 sibling, 0 replies; 29+ messages in thread
From: Bart Van Assche @ 2020-04-10 14:22 UTC (permalink / raw)
  To: Johannes Thumshirn, Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 2020-04-09 09:53, Johannes Thumshirn wrote:
> scsi_mq_free_sgtables is used to free the sg_tables, uninitialize the
> command and delete it from the command list.
> 
> Export this function so it can be used from modular code to free the
> memory allocated by scsi_init_io() if the caller of scsi_init_io() needs
> to do error recovery.
> 
> While we're at it, rename scsi_mq_free_sgtables() to scsi_free_sgtables().

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND
  2020-04-10  7:10   ` Christoph Hellwig
@ 2020-04-14  9:43     ` Johannes Thumshirn
  2020-04-14 11:28       ` hch
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-14  9:43 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 10/04/2020 09:10, Christoph Hellwig wrote:
> -/**
> - *	__bio_add_pc_page	- attempt to add page to passthrough bio
> - *	@q: the target queue
> - *	@bio: destination bio
> - *	@page: page to add
> - *	@len: vec entry length
> - *	@offset: vec entry offset
> - *	@same_page: return if the merge happen inside the same page
> - *
> - *	Attempt to add a page to the bio_vec maplist. This can fail for a
> - *	number of reasons, such as the bio being full or target block device
> - *	limitations. The target block device must allow bio's up to PAGE_SIZE,
> - *	so it is always possible to add a single page to an empty bio.
> - *
> - *	This should only be used by passthrough bios.
> +/*
> + * Add a page to a bio while respecting the hardware max_sectors, max_segment
> + * and gap limitations.
>    */
> -static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
> +static int bio_add_hw_page(struct request_queue *q, struct bio *bio,
>   		struct page *page, unsigned int len, unsigned int offset,
> -		bool *same_page)
> +		unsigned int max_sectors, bool *same_page)

Should I split that rename into a prep patch and if yes add you as the 
author?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-10  7:23   ` Christoph Hellwig
@ 2020-04-14 10:18     ` Johannes Thumshirn
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-14 10:18 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 10/04/2020 09:23, Christoph Hellwig wrote:
>> +	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
>> +
>> +	wp_ofst = sdkp->zones_wp_ofst[zno];
>> +	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
>> +		/* Write pointer offset update in progress: ask for a requeue */
>> +		ret = BLK_STS_RESOURCE;
>> +		goto err;
>> +	}
>> +
>> +	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
>> +		/* Invalid write pointer offset: trigger an update from disk */
>> +		ret = sd_zbc_update_wp_ofst(sdkp, zno);
>> +		goto err;
>> +	}
> 
> Maybe I'm a little too clever for my own sake, but what about something
> like:
> 
> 	spin_lock_bh(&sdkp->zones_wp_ofst_lock);
> 	switch (wp_ofst) {
> 	case SD_ZBC_INVALID_WP_OFST:
> 		if (scsi_device_get(sdkp->device)) {
> 			ret = BLK_STS_IOERR;
> 			break;
> 		}
> 		sdkp->zones_wp_ofst[zno] = SD_ZBC_UPDATING_WP_OFST;
> 		schedule_work(&sdkp->zone_wp_ofst_work);
> 		/*FALLTHRU*/
> 	case SD_ZBC_UPDATING_WP_OFST:
> 		ret = BLK_STS_DEV_RESOURCE;
> 		break;
> 	default:
> 		wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
> 		if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
> 			ret = BLK_STS_IOERR;
> 			break;
> 		}
> 
> 		*lba += wp_ofst;
> 	}
> 	spin_unlock_bh(&sdkp->zones_wp_ofst_lock);
> 	if (ret)
> 		blk_req_zone_write_unlock(rq);
> 	return ret;
> }

This indeed looks cleaner, I'll throw it into testing.

> 
>>   	int result = cmd->result;
>> @@ -294,7 +543,18 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
>>   		 * so be quiet about the error.
>>   		 */
>>   		rq->rq_flags |= RQF_QUIET;
>> +		goto unlock_zone;
>>   	}
>> +
>> +	if (sd_zbc_need_zone_wp_update(rq))
>> +		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
>> +
>> +
>> +unlock_zone:
> 
> why not use a good old "else if" here?
> 

Done

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-10  6:38     ` Christoph Hellwig
  2020-04-10  8:01       ` Johannes Thumshirn
@ 2020-04-14 11:09       ` Johannes Thumshirn
  2020-04-14 11:30         ` hch
  1 sibling, 1 reply; 29+ messages in thread
From: Johannes Thumshirn @ 2020-04-14 11:09 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On 10/04/2020 08:39, Christoph Hellwig wrote:
> Looking more the situation seems even worse.  If scsi_mq_prep_fn
> isn't successfull we never seem to free the sgtables, even for fatal
> errors.  So I think we need a real bug fix here in front of the series

If I'm not missing something all that needs to be done to fix it is:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 4724002627cd..5e6165246f77 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1191,6 +1191,7 @@ static blk_status_t scsi_setup_cmnd(struct 
scsi_device *sdev,
                 struct request *req)
  {
         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+       blk_status_t ret;

         if (!blk_rq_bytes(req))
                 cmd->sc_data_direction = DMA_NONE;
@@ -1200,9 +1201,14 @@ static blk_status_t scsi_setup_cmnd(struct 
scsi_device *sdev,
                 cmd->sc_data_direction = DMA_FROM_DEVICE;

         if (blk_rq_is_scsi(req))
-               return scsi_setup_scsi_cmnd(sdev, req);
+               ret = scsi_setup_scsi_cmnd(sdev, req);
         else
-               return scsi_setup_fs_cmnd(sdev, req);
+               ret = scsi_setup_fs_cmnd(sdev, req);
+
+       if (ret != BLK_STS_OK)
+               scsi_free_sgtables(cmd);
+
+       return ret;
  }

  static blk_status_t


Theoretically it's enough to catch errors from scsi_setup_fs_cmnd() as 
scsi_setup_scsi_cmnd() either fails scsi_init_io() which means no 
sgtables are allocated or returns BLK_STS_OK.

But for the sake of symmetry and defensive programming I think we can 
also check the return of scsi_setup_scsi_cmnd(). I've checked 
scsi_free_sgtables() and __sg_free_table() and they're double-free safe.

Thoughts?

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND
  2020-04-14  9:43     ` Johannes Thumshirn
@ 2020-04-14 11:28       ` hch
  0 siblings, 0 replies; 29+ messages in thread
From: hch @ 2020-04-14 11:28 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: hch, Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On Tue, Apr 14, 2020 at 09:43:24AM +0000, Johannes Thumshirn wrote:
> > -static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
> > +static int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> >   		struct page *page, unsigned int len, unsigned int offset,
> > -		bool *same_page)
> > +		unsigned int max_sectors, bool *same_page)
> 
> Should I split that rename into a prep patch and if yes add you as the 
> author?

It is not just a rename but also passing max_sectors explicitly.  I'm
kinda torn if it is worth a prep patch or not, but then why not..

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-04-14 11:09       ` Johannes Thumshirn
@ 2020-04-14 11:30         ` hch
  0 siblings, 0 replies; 29+ messages in thread
From: hch @ 2020-04-14 11:30 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: hch, Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org

On Tue, Apr 14, 2020 at 11:09:41AM +0000, Johannes Thumshirn wrote:
> On 10/04/2020 08:39, Christoph Hellwig wrote:
> > Looking more the situation seems even worse.  If scsi_mq_prep_fn
> > isn't successfull we never seem to free the sgtables, even for fatal
> > errors.  So I think we need a real bug fix here in front of the series
> 
> If I'm not missing something all that needs to be done to fix it is:

Looks sensible.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-04-14 11:30 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-09 16:53 [PATCH v5 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-04-10  7:10   ` Christoph Hellwig
2020-04-14  9:43     ` Johannes Thumshirn
2020-04-14 11:28       ` hch
2020-04-09 16:53 ` [PATCH v5 03/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 04/10] block: Modify revalidate zones Johannes Thumshirn
2020-04-10  0:27   ` Damien Le Moal
2020-04-10  6:40   ` Christoph Hellwig
2020-04-10  6:55     ` Damien Le Moal
2020-04-09 16:53 ` [PATCH v5 05/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 06/10] scsi: export scsi_mq_free_sgtables Johannes Thumshirn
2020-04-10  5:58   ` Christoph Hellwig
2020-04-10  7:46     ` Johannes Thumshirn
2020-04-10 14:22   ` Bart Van Assche
2020-04-09 16:53 ` [PATCH v5 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-04-10  0:30   ` Damien Le Moal
2020-04-10  6:18   ` Christoph Hellwig
2020-04-10  6:38     ` Christoph Hellwig
2020-04-10  8:01       ` Johannes Thumshirn
2020-04-14 11:09       ` Johannes Thumshirn
2020-04-14 11:30         ` hch
2020-04-10  7:54     ` Johannes Thumshirn
2020-04-10  7:23   ` Christoph Hellwig
2020-04-14 10:18     ` Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
2020-04-09 16:53 ` [PATCH v5 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.