* [PATCH 00/11] Introduce Zone Append for writing to zoned block devices
@ 2020-03-10  9:46 Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 01/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
                   ` (11 more replies)
  0 siblings, 12 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	Johannes Thumshirn

The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When writing to a zoned block device using zone append, the start sector
of the write points to the start LBA of the target zone. Upon completion
the block device responds with the position the data has been placed at in
the zone. From a high-level perspective this resembles a file system's
block allocator, where the user writes to a file and the file system takes
care of the data placement on the device.

In order to fully exploit the new zone append command in file-systems and
other interfaces above the block layer, we choose to emulate zone append
in SCSI and null_blk. This way we can have a single write path for both
file-systems and other interfaces above the block-layer, like io_uring on
zoned block devices, without having to care too much about the underlying
characteristics of the device itself.

The emulation works by maintaining a cache of each zone's write pointer, so
a zone append issued to the disk can be translated into a regular write at
the write pointer position. The start LBA of the zone append is used to
determine the zone number for the lookup in the zone write pointer offset
cache, and the cached offset is then added to the LBA to get the actual
position to write the data. In SCSI we then turn the REQ_OP_ZONE_APPEND
request into a
WRITE(16) command. Upon successful completion of the WRITE(16), the cache
will be updated to the new write pointer location and the written sector
will be noted in the request. On error the cache entry will be marked as
invalid and on the next write an update of the write pointer will be
scheduled, before issuing the actual write.

In order to reduce memory consumption, the only cached item is the offset
of the write pointer from the start of the zone, everything else can be
calculated. On an example drive with 52156 zones, the additional memory
consumption of the cache is thus 52156 * 4 = 208624 Bytes or 51 4k Byte
pages. The performance impact is negligible for a spinning drive.

For null_blk the emulation is much simpler, as null_blk's zoned block
device emulation support already caches the write pointer position, so we
only need to report the position back to the upper layers. Additional
caching is not needed here.

Testing has been conducted by translating RWF_APPEND DIOs into
REQ_OP_ZONE_APPEND commands in the block device's direct I/O function and
injecting errors by bypassing the block layer interface and directly
writing to the disk via the SCSI generic interface.

The whole series is relative to Jens' block-5.6 branch 14afc5936197
("block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()").

Damien Le Moal (2):
  null_blk: Support REQ_OP_ZONE_APPEND
  block: Introduce zone write pointer offset caching

Johannes Thumshirn (8):
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce bio_add_append_page
  block: introduce BLK_STS_ZONE_RESOURCE
  block: introduce blk_req_zone_write_trylock
  block: factor out requeue handling from dispatch code
  block: delay un-dispatchable request
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: sd_zbc: emulate ZONE_APPEND commands

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    |  41 +++-
 block/blk-core.c               |  49 +++++
 block/blk-map.c                |   2 +-
 block/blk-mq.c                 |  54 +++++-
 block/blk-settings.c           |  16 ++
 block/blk-sysfs.c              |  15 +-
 block/blk-zoned.c              |  83 +++++++-
 block/blk.h                    |   4 +-
 drivers/block/null_blk_main.c  |   9 +-
 drivers/block/null_blk_zoned.c |  21 +-
 drivers/scsi/scsi_lib.c        |   1 +
 drivers/scsi/sd.c              |  28 ++-
 drivers/scsi/sd.h              |  35 +++-
 drivers/scsi/sd_zbc.c          | 344 +++++++++++++++++++++++++++++++--
 include/linux/bio.h            |   3 +-
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  42 +++-
 17 files changed, 701 insertions(+), 60 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 01/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 02/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	Johannes Thumshirn

Until now, blk_queue_zone_is_seq() and blk_queue_zone_no() have only been
called with CONFIG_BLK_DEV_ZONED enabled.

The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 include/linux/blkdev.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..25b63f714619 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -729,6 +729,16 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 {
 	return 0;
 }
+static inline bool blk_queue_zone_is_seq(struct request_queue *q,
+					 sector_t sector)
+{
+	return false;
+}
+static inline unsigned int blk_queue_zone_no(struct request_queue *q,
+					     sector_t sector)
+{
+	return 0;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline bool rq_is_sync(struct request *rq)
-- 
2.24.1



* [PATCH 02/11] block: Introduce REQ_OP_ZONE_APPEND
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 01/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 03/11] block: introduce bio_add_append_page Johannes Thumshirn
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen

From: Keith Busch <kbusch@kernel.org>

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* Target a sequential write zone
* Not cross a zone boundary
* Not be split, to ensure that a single range of LBAs is written with a
  single command

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid write append BIO
splitting, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size is always lower than this limit.
Export this new limit through sysfs.
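For reference, the exported limit can then be read back from sysfs (the
device name below is only an example; the attribute reports bytes, i.e. the
sector limit shifted left by 9):

```shell
# 0 is reported for devices that do not support zone append
cat /sys/block/nullb0/queue/zone_append_max_bytes
```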

Finally, make sure that the bio sector position indicates the actual
write position as indicated by the device on completion.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c               |  4 ++++
 block/blk-core.c          | 49 +++++++++++++++++++++++++++++++++++++++
 block/blk-settings.c      | 16 +++++++++++++
 block/blk-sysfs.c         | 13 +++++++++++
 include/linux/bio.h       |  1 +
 include/linux/blk_types.h |  2 ++
 include/linux/blkdev.h    | 11 +++++++++
 7 files changed, 96 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 94d697217887..5bff80fc2ad9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1895,6 +1895,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc9552ef8d..544c5a130ac5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -239,6 +240,16 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 		bio_set_flag(bio, BIO_QUIET);
 
 	bio_advance(bio, nbytes);
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial completions cannot be supported as the BIO
+		 * fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
 
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
@@ -865,6 +876,39 @@ static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -937,6 +981,11 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8eda2e7b91e..1345dbe5a245 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -257,6 +259,18 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	q->limits.max_zone_append_sectors = max_zone_append_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -506,6 +520,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@ static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 853d92ceee64..ef640fd76c23 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -74,6 +74,7 @@ static inline bool bio_no_advance_iter(struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
+	       bio_op(bio) == REQ_OP_ZONE_APPEND ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..ae809f07aa27 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -296,6 +296,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25b63f714619..36111b10d514 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -757,6 +758,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1088,6 +1092,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1301,6 +1307,11 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
-- 
2.24.1



* [PATCH 03/11] block: introduce bio_add_append_page
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 01/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 02/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 16:43   ` Christoph Hellwig
  2020-03-11 18:11   ` Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 04/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	Johannes Thumshirn

For REQ_OP_ZONE_APPEND we cannot add an unlimited number of pages to a bio,
as the bio cannot be split later on.

This is similar to what we have to do for passthrough pages, just with a
different limit.

Introduce bio_add_append_page(), which can be used by file-systems to add
pages to a REQ_OP_ZONE_APPEND bio.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/bio.c         | 37 ++++++++++++++++++++++++++++++-------
 block/blk-map.c     |  2 +-
 include/linux/bio.h |  2 +-
 3 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5bff80fc2ad9..3bd648671a28 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -732,7 +732,7 @@ static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
  */
 static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
-		bool *same_page)
+		bool *same_page, unsigned int max_sectors)
 {
 	struct bio_vec *bvec;
 
@@ -742,7 +742,7 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	if (unlikely(bio_flagged(bio, BIO_CLONED)))
 		return 0;
 
-	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
 		return 0;
 
 	if (bio->bi_vcnt > 0) {
@@ -777,10 +777,20 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
-	return __bio_add_pc_page(q, bio, page, len, offset, &same_page);
+	return __bio_add_pc_page(q, bio, page, len, offset, &same_page,
+				 queue_max_hw_sectors(q));
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
+int bio_add_append_page(struct request_queue *q, struct bio *bio,
+			struct page *page, unsigned int len, unsigned int offset)
+{
+	bool same_page = false;
+	return __bio_add_pc_page(q, bio, page, len, offset, &same_page,
+				 queue_max_zone_append_sectors(q));
+}
+EXPORT_SYMBOL(bio_add_append_page);
+
 /**
  * __bio_try_merge_page - try appending data to an existing bvec.
  * @bio: destination bio
@@ -945,8 +955,15 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
 
-		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
-			if (same_page)
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			size = bio_add_append_page(bio->bi_disk->queue, bio,
+						   page, len, offset);
+
+			if (size != len)
+				return -E2BIG;
+		} else if (__bio_try_merge_page(bio, page, len, offset,
+						&same_page)) {
+			if (same_page)
 				put_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len)))
@@ -1389,11 +1406,12 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
  */
 struct bio *bio_map_user_iov(struct request_queue *q,
 			     struct iov_iter *iter,
-			     gfp_t gfp_mask)
+			     gfp_t gfp_mask, unsigned int op)
 {
 	int j;
 	struct bio *bio;
 	int ret;
+	unsigned int max_sectors;
 
 	if (!iov_iter_count(iter))
 		return ERR_PTR(-EINVAL);
@@ -1402,6 +1420,11 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 	if (!bio)
 		return ERR_PTR(-ENOMEM);
 
+	if (op == REQ_OP_ZONE_APPEND)
+		max_sectors = queue_max_zone_append_sectors(q);
+	else
+		max_sectors = queue_max_hw_sectors(q);
+
 	while (iov_iter_count(iter)) {
 		struct page **pages;
 		ssize_t bytes;
@@ -1429,7 +1452,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 					n = bytes;
 
 				if (!__bio_add_pc_page(q, bio, page, n, offs,
-						&same_page)) {
+						&same_page, max_sectors)) {
 					if (same_page)
 						put_page(page);
 					break;
diff --git a/block/blk-map.c b/block/blk-map.c
index b0790268ed9d..a83ba39251a9 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -72,7 +72,7 @@ static int __blk_rq_map_user_iov(struct request *rq,
 	if (copy)
 		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
 	else
-		bio = bio_map_user_iov(q, iter, gfp_mask);
+		bio = bio_map_user_iov(q, iter, gfp_mask, req_op(rq));
 
 	if (IS_ERR(bio))
 		return PTR_ERR(bio);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ef640fd76c23..ef69e52cc8d9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -444,7 +444,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
 void bio_release_pages(struct bio *bio, bool mark_dirty);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
-				    struct iov_iter *, gfp_t);
+				    struct iov_iter *, gfp_t, unsigned int);
 extern void bio_unmap_user(struct bio *);
 extern struct bio *bio_map_kern(struct request_queue *, void *, unsigned int,
 				gfp_t);
-- 
2.24.1



* [PATCH 04/11] null_blk: Support REQ_OP_ZONE_APPEND
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 03/11] block: introduce bio_add_append_page Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 16:43   ` Christoph Hellwig
  2020-03-10  9:46 ` [PATCH 05/11] block: introduce BLK_STS_ZONE_RESOURCE Johannes Thumshirn
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	Damien Le Moal

From: Damien Le Moal <damien.lemoal@wdc.com>

Support REQ_OP_ZONE_APPEND requests for zone mode null_blk devices.
Use the internally tracked zone write pointer position as the write
position.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/block/null_blk_main.c  |  9 ++++++---
 drivers/block/null_blk_zoned.c | 21 ++++++++++++++++++---
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 133060431dbd..62869431f2cf 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1575,15 +1575,18 @@ static int null_gendisk_register(struct nullb *nullb)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 	if (nullb->dev->zoned) {
-		if (queue_is_mq(nullb->q)) {
+		struct request_queue *q = nullb->q;
+
+		if (queue_is_mq(q)) {
 			int ret = blk_revalidate_disk_zones(disk);
 			if (ret)
 				return ret;
 		} else {
-			blk_queue_chunk_sectors(nullb->q,
+			blk_queue_chunk_sectors(q,
 					nullb->dev->zone_size_sects);
-			nullb->q->nr_zones = blkdev_nr_zones(disk);
+			q->nr_zones = blkdev_nr_zones(disk);
 		}
+		blk_queue_max_zone_append_sectors(q, q->limits.max_hw_sectors);
 	}
 #endif
 
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index ed34785dd64b..ed9c4cde68f3 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -116,7 +116,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
 }
 
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-		     unsigned int nr_sectors)
+		     unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
@@ -131,7 +131,20 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 	case BLK_ZONE_COND_IMP_OPEN:
 	case BLK_ZONE_COND_EXP_OPEN:
 	case BLK_ZONE_COND_CLOSED:
-		/* Writes must be at the write pointer position */
+		/*
+		 * Regular writes must be at the write pointer position.
+		 * Zone append writes are automatically issued at the write
+		 * pointer and the position returned using the request or BIO
+		 * sector.
+		 */
+		if (append) {
+			sector = zone->wp;
+			if (cmd->bio)
+				cmd->bio->bi_iter.bi_sector = sector;
+			else
+				cmd->rq->__sector = sector;
+		}
+
 		if (sector != zone->wp)
 			return BLK_STS_IOERR;
 
@@ -211,7 +224,9 @@ blk_status_t null_handle_zoned(struct nullb_cmd *cmd, enum req_opf op,
 {
 	switch (op) {
 	case REQ_OP_WRITE:
-		return null_zone_write(cmd, sector, nr_sectors);
+		return null_zone_write(cmd, sector, nr_sectors, false);
+	case REQ_OP_ZONE_APPEND:
+		return null_zone_write(cmd, sector, nr_sectors, true);
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_RESET_ALL:
 	case REQ_OP_ZONE_OPEN:
-- 
2.24.1



* [PATCH 05/11] block: introduce BLK_STS_ZONE_RESOURCE
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 04/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 16:43   ` Christoph Hellwig
  2020-03-10  9:46 ` [PATCH 06/11] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	Johannes Thumshirn

BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if
zone related resources are unavailable, but the driver can guarantee the
queue will be rerun in the future once the resources become available
again.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 include/linux/blk_types.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ae809f07aa27..824ec2d89954 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone-specific resource and I/O to a different zone on the same device
+ * could still be served. An example is a zone that is write-locked: further
+ * writes to it must wait, but a read from the same zone can still be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
-- 
2.24.1



* [PATCH 06/11] block: introduce blk_req_zone_write_trylock
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 05/11] block: introduce BLK_STS_ZONE_RESOURCE Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 07/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	Johannes Thumshirn

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c      | 14 ++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 05741c6f618b..00b025b8b7c0 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -50,6 +50,20 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
+bool blk_req_zone_write_trylock(struct request *rq)
+{
+	unsigned int zno = blk_rq_zone_no(rq);
+
+	if (test_and_set_bit(zno, rq->q->seq_zones_wlock))
+		return false;
+
+	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
+	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock);
+
 void __blk_req_zone_write_lock(struct request *rq)
 {
 	if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36111b10d514..e591b22ace03 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1746,6 +1746,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 bool blk_req_needs_zone_write_lock(struct request *rq);
+bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
 void __blk_req_zone_write_unlock(struct request *rq);
 
-- 
2.24.1



* [PATCH 07/11] block: factor out requeue handling from dispatch code
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 06/11] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 16:44   ` Christoph Hellwig
  2020-03-10  9:46 ` [PATCH 08/11] block: delay un-dispatchable request Johannes Thumshirn
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	Johannes Thumshirn

Factor out the requeue handling from the dispatch code. This will make
the subsequent addition of different requeueing schemes easier.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-mq.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..f7ab75ef4d0e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1178,6 +1178,23 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
 
 #define BLK_MQ_RESOURCE_DELAY	3		/* ms units */
 
+static void blk_mq_handle_dev_resource(struct request *rq,
+				       struct list_head *list)
+{
+	struct request *next =
+		list_first_entry_or_null(list, struct request, queuelist);
+
+	/*
+	 * If an I/O scheduler has been configured and we got a driver tag for
+	 * the next request already, free it.
+	 */
+	if (next)
+		blk_mq_put_driver_tag(next);
+
+	list_add(&rq->queuelist, list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1245,17 +1262,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
-			/*
-			 * If an I/O scheduler has been configured and we got a
-			 * driver tag for the next request already, free it
-			 * again.
-			 */
-			if (!list_empty(list)) {
-				nxt = list_first_entry(list, struct request, queuelist);
-				blk_mq_put_driver_tag(nxt);
-			}
-			list_add(&rq->queuelist, list);
-			__blk_mq_requeue_request(rq);
+			blk_mq_handle_dev_resource(rq, list);
 			break;
 		}
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 08/11] block: delay un-dispatchable request
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 07/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 16:45   ` Christoph Hellwig
  2020-03-10  9:46 ` [PATCH 09/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	Johannes Thumshirn

When an LLDD can't dispatch a request to a specific zone, it will return
BLK_STS_ZONE_RESOURCE, indicating that this request needs to be delayed,
e.g. because the zone it will be dispatched to is still write-locked.

If this happens, set the request aside in a local list and continue
dispatching other requests, such as READ requests or WRITE/ZONE_APPEND
requests targeting other zones. This way we can keep a high queue depth
without starving other requests, even if one request can't be served due
to zone write-locking.

All requests put aside in the local list due to BLK_STS_ZONE_RESOURCE
are placed back at the head of the dispatch list for retrying the next
time the device queues are run again.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-mq.c          | 27 +++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c |  1 +
 2 files changed, 28 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7ab75ef4d0e..89eb062825a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1195,6 +1195,19 @@ static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1206,6 +1219,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1264,6 +1278,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_mq_handle_dev_resource(rq, list);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive
+			 * accepts.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1275,6 +1299,9 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..ea327f320b7f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1706,6 +1706,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 09/11] block: Introduce zone write pointer offset caching
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 08/11] block: delay un-dispatchable request Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 16:46   ` Christoph Hellwig
  2020-03-10  9:46 ` [PATCH 10/11] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	Damien Le Moal

From: Damien Le Moal <damien.lemoal@wdc.com>

Not all zoned block devices natively support the zone append command.
E.g. SCSI and ATA disks do not define this command. However, it is
fairly straightforward to emulate this command at the LLD level using
regular write commands if a zone write pointer position is known.
Introducing such emulation enables the use of zone append writes for all
device types, simplifying, for instance, the implementation of zoned
block device support in file systems by avoiding the need for different
write paths depending on the device capabilities.

To allow devices without zone append command support to emulate its
behavior, introduce a zone write pointer cache attached to the device
request_queue, similar to the zone bitmaps. To save memory, this cache
stores write pointer offsets relative to each zone's start sector as
32-bit values rather than the 64-bit absolute sector position of each
zone write pointer.

The allocation and initialization of this cache can be requested by a
device driver using the QUEUE_FLAG_ZONE_WP_OFST queue flag. The
allocation and initialization of the cache is done in the same manner as
for the zone bitmaps, within the report zones callback function used by
blk_revalidate_disk_zones(). In case of changes to the device zone
configuration, the cache is updated under a queue freeze to avoid any
race between the device driver use of the cache and the request queue
update.

Freeing of this new cache is done together with the zone bitmaps from
the function blk_queue_free_zone_bitmaps(), renamed here to
blk_queue_free_zone_resources().

Maintaining the write pointer offset values is the responsibility of
the device LLD. The helper function blk_get_zone_wp_offset() is provided
to simplify this task.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-sysfs.c      |  2 +-
 block/blk-zoned.c      | 69 ++++++++++++++++++++++++++++++++++++++++--
 block/blk.h            |  4 +--
 include/linux/blkdev.h | 20 ++++++++----
 4 files changed, 84 insertions(+), 11 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 02643e149d5e..bd0c9b4c1c5b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -901,7 +901,7 @@ static void __blk_release_queue(struct work_struct *work)
 
 	blk_exit_queue(q);
 
-	blk_queue_free_zone_bitmaps(q);
+	blk_queue_free_zone_resources(q);
 
 	if (queue_is_mq(q))
 		blk_mq_release(q);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 00b025b8b7c0..0f83b8baf607 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -344,18 +344,65 @@ static inline unsigned long *blk_alloc_zone_bitmap(int node,
 			    GFP_NOIO, node);
 }
 
-void blk_queue_free_zone_bitmaps(struct request_queue *q)
+static inline unsigned int *blk_alloc_zone_wp_ofst(unsigned int nr_zones)
+{
+	return kvcalloc(nr_zones, sizeof(unsigned int), GFP_NOIO);
+}
+
+void blk_queue_free_zone_resources(struct request_queue *q)
 {
 	kfree(q->conv_zones_bitmap);
 	q->conv_zones_bitmap = NULL;
 	kfree(q->seq_zones_wlock);
 	q->seq_zones_wlock = NULL;
+	kvfree(q->seq_zones_wp_ofst);
+	q->seq_zones_wp_ofst = NULL;
 }
 
+/**
+ * blk_get_zone_wp_offset - Calculate a zone write pointer offset position
+ * @zone:	Target zone
+ * @wp_ofst:	Calculated write pointer offset
+ *
+ * Helper function for low-level device drivers to obtain a zone write pointer
+ * position relative to the zone start sector (write pointer offset). The write
+ * pointer offset depends on the zone condition. If the zone has an invalid
+ * condition, -ENODEV is returned.
+ */
+int blk_get_zone_wp_offset(struct blk_zone *zone, unsigned int *wp_ofst)
+{
+	switch (zone->cond) {
+	case BLK_ZONE_COND_EMPTY:
+		*wp_ofst = 0;
+		return 0;
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		*wp_ofst = zone->wp - zone->start;
+		return 0;
+	case BLK_ZONE_COND_FULL:
+		*wp_ofst = zone->len;
+		return 0;
+	case BLK_ZONE_COND_NOT_WP:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+		/*
+		 * Conventional, offline and read-only zones do not have a valid
+		 * write pointer. Use 0 as a dummy value.
+		 */
+		*wp_ofst = 0;
+		return 0;
+	default:
+		return -ENODEV;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_get_zone_wp_offset);
+
 struct blk_revalidate_zone_args {
 	struct gendisk	*disk;
 	unsigned long	*conv_zones_bitmap;
 	unsigned long	*seq_zones_wlock;
+	unsigned int	*seq_zones_wp_ofst;
 	unsigned int	nr_zones;
 	sector_t	zone_sectors;
 	sector_t	sector;
@@ -371,6 +418,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	struct gendisk *disk = args->disk;
 	struct request_queue *q = disk->queue;
 	sector_t capacity = get_capacity(disk);
+	int ret;
 
 	/*
 	 * All zones must have the same size, with the exception on an eventual
@@ -406,6 +454,13 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		return -ENODEV;
 	}
 
+	if (blk_queue_zone_wp_ofst(q) && !args->seq_zones_wp_ofst) {
+		args->seq_zones_wp_ofst =
+			blk_alloc_zone_wp_ofst(args->nr_zones);
+		if (!args->seq_zones_wp_ofst)
+			return -ENOMEM;
+	}
+
 	/* Check zone type */
 	switch (zone->type) {
 	case BLK_ZONE_TYPE_CONVENTIONAL:
@@ -432,6 +487,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		return -ENODEV;
 	}
 
+	if (args->seq_zones_wp_ofst) {
+		/* Initialize the zone write pointer offset */
+		ret = blk_get_zone_wp_offset(zone,
+					&args->seq_zones_wp_ofst[idx]);
+		if (ret)
+			return ret;
+	}
+
 	args->sector += zone->len;
 	return 0;
 }
@@ -480,15 +543,17 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		q->nr_zones = args.nr_zones;
 		swap(q->seq_zones_wlock, args.seq_zones_wlock);
 		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
+		swap(q->seq_zones_wp_ofst, args.seq_zones_wp_ofst);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
-		blk_queue_free_zone_bitmaps(q);
+		blk_queue_free_zone_resources(q);
 	}
 	blk_mq_unfreeze_queue(q);
 
 	kfree(args.seq_zones_wlock);
 	kfree(args.conv_zones_bitmap);
+	kvfree(args.seq_zones_wp_ofst);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
diff --git a/block/blk.h b/block/blk.h
index 0b8884353f6b..cc28391bb0b3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -349,9 +349,9 @@ static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
 
 #ifdef CONFIG_BLK_DEV_ZONED
-void blk_queue_free_zone_bitmaps(struct request_queue *q);
+void blk_queue_free_zone_resources(struct request_queue *q);
 #else
-static inline void blk_queue_free_zone_bitmaps(struct request_queue *q) {}
+static inline void blk_queue_free_zone_resources(struct request_queue *q) {}
 #endif
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e591b22ace03..950d3476918c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -363,6 +363,7 @@ extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 			    sector_t sectors, sector_t nr_sectors,
 			    gfp_t gfp_mask);
 extern int blk_revalidate_disk_zones(struct gendisk *disk);
+int blk_get_zone_wp_offset(struct blk_zone *zone, unsigned int *wp_ofst);
 
 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 				     unsigned int cmd, unsigned long arg);
@@ -499,14 +500,17 @@ struct request_queue {
 	/*
 	 * Zoned block device information for request dispatch control.
 	 * nr_zones is the total number of zones of the device. This is always
-	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of nr_zones
-	 * bits which indicates if a zone is conventional (bit set) or
+	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of
+	 * nr_zones bits which indicates if a zone is conventional (bit set) or
 	 * sequential (bit clear). seq_zones_wlock is a bitmap of nr_zones
 	 * bits which indicates if a zone is write locked, that is, if a write
-	 * request targeting the zone was dispatched. All three fields are
-	 * initialized by the low level device driver (e.g. scsi/sd.c).
-	 * Stacking drivers (device mappers) may or may not initialize
-	 * these fields.
+	 * request targeting the zone was dispatched. seq_zones_wp_ofst is an
+	 * array of nr_zones write pointer values relative to the zone start
+	 * sector. This is only initialized for LLDs needing zone append write
+	 * command emulation with regular write. All fields are initialized by
+	 * the blk_revalidate_disk_zones() function when called by the low
+	 * level device driver (e.g. scsi/sd.c). Stacking drivers (device
+	 * mappers) may or may not initialize these fields.
 	 *
 	 * Reads of this information must be protected with blk_queue_enter() /
 	 * blk_queue_exit(). Modifying this information is only allowed while
@@ -516,6 +520,7 @@ struct request_queue {
 	unsigned int		nr_zones;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	unsigned int		*seq_zones_wp_ofst;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 	/*
@@ -613,6 +618,7 @@ struct request_queue {
 #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
 #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
+#define QUEUE_FLAG_ZONE_WP_OFST	28	/* queue needs zone wp offsets */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP))
@@ -647,6 +653,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #else
 #define blk_queue_rq_alloc_time(q)	false
 #endif
+#define blk_queue_zone_wp_ofst(q)	\
+	test_bit(QUEUE_FLAG_ZONE_WP_OFST, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 10/11] scsi: sd_zbc: factor out sanity checks for zoned commands
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (8 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 09/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
  2020-03-10 16:42 ` [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
  11 siblings, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	Johannes Thumshirn

Factor sanity checks for zoned commands from sd_zbc_setup_zone_mgmt_cmnd().

This will help with the introduction of an emulated ZONE_APPEND command.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/scsi/sd_zbc.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index e4282bce5834..5b925f52d1ce 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -204,6 +204,26 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 	return ret;
 }
 
+static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+
+	if (!sd_is_zoned(sdkp))
+		/* Not a zoned device */
+		return BLK_STS_IOERR;
+
+	if (sdkp->device->changed)
+		return BLK_STS_IOERR;
+
+	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
+		/* Unaligned request */
+		return BLK_STS_IOERR;
+
+	return BLK_STS_OK;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -218,20 +238,14 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all)
 {
 	struct request *rq = cmd->request;
-	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t block = sectors_to_logical(sdkp->device, sector);
+	blk_status_t ret;
 
-	if (!sd_is_zoned(sdkp))
-		/* Not a zoned device */
-		return BLK_STS_IOERR;
-
-	if (sdkp->device->changed)
-		return BLK_STS_IOERR;
-
-	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
-		/* Unaligned request */
-		return BLK_STS_IOERR;
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
 
 	cmd->cmd_len = 16;
 	memset(cmd->cmnd, 0, cmd->cmd_len);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (9 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 10/11] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
@ 2020-03-10  9:46 ` Johannes Thumshirn
  2020-03-10 17:43   ` kbuild test robot
                     ` (4 more replies)
  2020-03-10 16:42 ` [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
  11 siblings, 5 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-10  9:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	Johannes Thumshirn

Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) with a
start LBA set to the target zone write pointer position.

In order to always know the write pointer position of a sequential write
zone, the queue flag QUEUE_FLAG_ZONE_WP_OFST is set to get an
initialized write pointer offset array attached to the device request
queue. The values of the cache are maintained in sync with the device
as follows:
1) the write pointer offset of a zone is reset to 0 when a
   REQ_OP_ZONE_RESET command completes.
2) the write pointer offset of a zone is set to the zone size when a
   REQ_OP_ZONE_FINISH command completes.
3) the write pointer offset of a zone is incremented by the number of
   512B sectors written when a write or a zone append command completes.
4) the write pointer offset of all zones is reset to 0 when a
   REQ_OP_ZONE_RESET_ALL command completes.

Since the block layer does not write lock zones for zone append
commands, to ensure a sequential ordering of the write commands used for
the emulation, the target zone of a zone append command is locked when
the function sd_zbc_prepare_zone_append() is called from
sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
(e.g. a zone append is in-flight or a regular write has already locked
the zone), the zone append command dispatching is delayed by returning
BLK_STS_ZONE_RESOURCE.

Since zone reset and finish operations can be issued concurrently with
writes and zone append requests, ensure a coherent update of the zone
write pointer offsets by also write locking the target zones for these
zone management requests.

Finally, to avoid the need for write locking all zones for
REQ_OP_ZONE_RESET_ALL requests, use a spinlock to protect accesses and
modifications of the zone write pointer offsets. This spinlock is
initialized from sd_probe() using the new function sd_zbc_init().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/scsi/sd.c     |  28 +++-
 drivers/scsi/sd.h     |  35 ++++-
 drivers/scsi/sd_zbc.c | 308 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 358 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8ca9299ffd36..02b8f2c6bc52 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1215,6 +1215,12 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	else
 		protect = 0;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
+		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
+		if (ret)
+			return ret;
+	}
+
 	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
 		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua);
@@ -1287,6 +1293,7 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		return sd_setup_flush_cmnd(cmd);
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
+	case REQ_OP_ZONE_APPEND:
 		return sd_setup_read_write_cmnd(cmd);
 	case REQ_OP_ZONE_RESET:
 		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
@@ -2055,7 +2062,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
  out:
 	if (sd_is_zoned(sdkp))
-		sd_zbc_complete(SCpnt, good_bytes, &sshdr);
+		good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
 
 	SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
 					   "sd_done: completed %d of %d bytes\n",
@@ -3369,6 +3376,8 @@ static int sd_probe(struct device *dev)
 	sdkp->first_scan = 1;
 	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
 
+	sd_zbc_init_disk(sdkp);
+
 	sd_revalidate_disk(gd);
 
 	gd->flags = GENHD_FL_EXT_DEVT;
@@ -3662,19 +3671,26 @@ static int __init init_sd(void)
 	if (!sd_page_pool) {
 		printk(KERN_ERR "sd: can't init discard page pool\n");
 		err = -ENOMEM;
-		goto err_out_ppool;
+		goto err_out_cdb_pool;
 	}
 
+	err = sd_zbc_init();
+	if (err)
+		goto err_out_ppool;
+
 	err = scsi_register_driver(&sd_template.gendrv);
 	if (err)
-		goto err_out_driver;
+		goto err_out_zbc;
 
 	return 0;
 
-err_out_driver:
-	mempool_destroy(sd_page_pool);
+err_out_zbc:
+	sd_zbc_exit();
 
 err_out_ppool:
+	mempool_destroy(sd_page_pool);
+
+err_out_cdb_pool:
 	mempool_destroy(sd_cdb_pool);
 
 err_out_cache:
@@ -3704,6 +3720,8 @@ static void __exit exit_sd(void)
 	mempool_destroy(sd_page_pool);
 	kmem_cache_destroy(sd_cdb_cache);
 
+	sd_zbc_exit();
+
 	class_unregister(&sd_disk_class);
 
 	for (i = 0; i < SD_MAJORS; i++) {
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 50fff0bf8c8e..19961cdc5a53 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -79,6 +79,7 @@ struct scsi_disk {
 	u32		zones_optimal_open;
 	u32		zones_optimal_nonseq;
 	u32		zones_max_open;
+	spinlock_t	zone_wp_ofst_lock;
 #endif
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
@@ -207,17 +208,33 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
+int __init sd_zbc_init(void);
+void sd_zbc_exit(void);
+
+void sd_zbc_init_disk(struct scsi_disk *sdkp);
 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
 extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-			    struct scsi_sense_hdr *sshdr);
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+			     struct scsi_sense_hdr *sshdr);
 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+				        unsigned int nr_blocks);
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
+static inline int sd_zbc_init(void)
+{
+	return 0;
+}
+
+static inline void sd_zbc_exit(void) {}
+
+static inline void sd_zbc_init_disk(struct scsi_disk *sdkp) {}
+
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
 				    unsigned char *buf)
 {
@@ -233,9 +250,17 @@ static inline blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_TARGET;
 }
 
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-				   unsigned int good_bytes,
-				   struct scsi_sense_hdr *sshdr) {}
+static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
+			unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
+{
+	return 0;
+}
+
+static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
+						      sector_t *lba,
+						      unsigned int nr_blocks)
+{
+	return BLK_STS_TARGET;
+}
 
 #define sd_zbc_report_zones NULL
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 5b925f52d1ce..7ee5b0259b40 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -19,6 +19,11 @@
 
 #include "sd.h"
 
+static struct kmem_cache *sd_zbc_zone_work_cache;
+static mempool_t *sd_zbc_zone_work_pool;
+
+#define SD_ZBC_ZONE_WORK_MEMPOOL_SIZE	8
+
 static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 			       unsigned int idx, report_zones_cb cb, void *data)
 {
@@ -224,6 +229,152 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
+#define SD_ZBC_INVALID_WP_OFST	~(0u)
+#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
+
+struct sd_zbc_zone_work {
+	struct work_struct work;
+	struct scsi_disk *sdkp;
+	unsigned int zno;
+	char buf[SD_BUF_SIZE];
+};
+
+static int sd_zbc_update_wp_ofst_cb(struct blk_zone *zone, unsigned int idx,
+				    void *data)
+{
+	struct sd_zbc_zone_work *zwork = data;
+	struct scsi_disk *sdkp = zwork->sdkp;
+	struct request_queue *q = sdkp->disk->queue;
+	int ret;
+
+	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
+	ret = blk_get_zone_wp_offset(zone, &q->seq_zones_wp_ofst[zwork->zno]);
+	if (ret)
+		q->seq_zones_wp_ofst[zwork->zno] = SD_ZBC_INVALID_WP_OFST;
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+
+	return ret;
+}
+
+static void sd_zbc_update_wp_ofst_workfn(struct work_struct *work)
+{
+	struct sd_zbc_zone_work *zwork;
+	struct scsi_disk *sdkp;
+	int ret;
+
+	zwork = container_of(work, struct sd_zbc_zone_work, work);
+	sdkp = zwork->sdkp;
+
+	ret = sd_zbc_do_report_zones(sdkp, zwork->buf, SD_BUF_SIZE,
+				     zwork->zno * sdkp->zone_blocks, true);
+	if (!ret)
+		sd_zbc_parse_report(sdkp, zwork->buf + 64, 0,
+				    sd_zbc_update_wp_ofst_cb, zwork);
+
+	mempool_free(zwork, sd_zbc_zone_work_pool);
+	scsi_device_put(sdkp->device);
+}
+
+static blk_status_t sd_zbc_update_wp_ofst(struct scsi_disk *sdkp,
+					  unsigned int zno)
+{
+	struct sd_zbc_zone_work *zwork;
+
+	/*
+	 * We are about to schedule work to update a zone write pointer offset,
+	 * which will cause the zone append command to be requeued. So make
+	 * sure that the scsi device does not go away while the work is
+	 * being processed.
+	 */
+	if (scsi_device_get(sdkp->device))
+		return BLK_STS_IOERR;
+
+	zwork = mempool_alloc(sd_zbc_zone_work_pool, GFP_ATOMIC);
+	if (!zwork) {
+		/* Retry later */
+		scsi_device_put(sdkp->device);
+		return BLK_STS_RESOURCE;
+	}
+
+	memset(zwork, 0, sizeof(struct sd_zbc_zone_work));
+	INIT_WORK(&zwork->work, sd_zbc_update_wp_ofst_workfn);
+	zwork->sdkp = sdkp;
+	zwork->zno = zno;
+
+	sdkp->disk->queue->seq_zones_wp_ofst[zno] = SD_ZBC_UPDATING_WP_OFST;
+
+	schedule_work(&zwork->work);
+
+	return BLK_STS_RESOURCE;
+}
+
+/**
+ * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
+ * @cmd: the command to setup
+ * @lba: the LBA to patch
+ * @nr_blocks: the number of LBAs to be written
+ *
+ * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
+ * sd_zbc_prepare_zone_append() handles the necessary zone write locking and
+ * patching of the LBA for an emulated ZONE_APPEND command.
+ *
+ * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST, this
+ * function schedules a REPORT ZONES command to update the cache and returns
+ * BLK_STS_RESOURCE to request a delayed requeue of the command.
+ */
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+					unsigned int nr_blocks)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	unsigned int wp_ofst, zno = blk_rq_zone_no(rq);
+	blk_status_t ret;
+
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
+
+	if (!blk_rq_zone_is_seq(rq))
+		return BLK_STS_IOERR;
+
+	/* Unlock of the write lock will happen in sd_zbc_complete() */
+	if (!blk_req_zone_write_trylock(rq))
+		return BLK_STS_ZONE_RESOURCE;
+
+	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
+
+	wp_ofst = rq->q->seq_zones_wp_ofst[zno];
+
+	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
+		/* Write pointer offset update in progress: ask for a requeue */
+		ret = BLK_STS_RESOURCE;
+		goto err;
+	}
+
+	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
+		/* Invalid write pointer offset: trigger an update from disk */
+		ret = sd_zbc_update_wp_ofst(sdkp, zno);
+		goto err;
+	}
+
+	wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
+	if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
+		ret = BLK_STS_IOERR;
+		goto err;
+	}
+
+	/* Set the LBA for the write command used to emulate zone append */
+	*lba += wp_ofst;
+
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+
+	return BLK_STS_OK;
+
+err:
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+	blk_req_zone_write_unlock(rq);
+	return ret;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -261,23 +412,75 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	cmd->transfersize = 0;
 	cmd->allowed = 0;
 
+	/* Only zone reset and zone finish need zone write locking */
+	if (op != ZO_RESET_WRITE_POINTER && op != ZO_FINISH_ZONE)
+		return BLK_STS_OK;
+
+	if (all) {
+		/* We do not write lock all zones for an all zone reset */
+		if (op == ZO_RESET_WRITE_POINTER)
+			return BLK_STS_OK;
+
+		/* Finishing all zones is not supported */
+		return BLK_STS_IOERR;
+	}
+
+	if (!blk_rq_zone_is_seq(rq))
+		return BLK_STS_IOERR;
+
+	if (!blk_req_zone_write_trylock(rq))
+		return BLK_STS_ZONE_RESOURCE;
+
 	return BLK_STS_OK;
 }
 
+static inline bool sd_zbc_zone_needs_write_unlock(struct request *rq)
+{
+	/*
+	 * For zone append, the zone was locked in sd_zbc_prepare_zone_append().
+	 * For zone reset and zone finish, the zone was locked in
+	 * sd_zbc_setup_zone_mgmt_cmnd().
+	 * For regular writes, the zone is unlocked by the block layer elevator.
+	 */
+	return req_op(rq) == REQ_OP_ZONE_APPEND ||
+		req_op(rq) == REQ_OP_ZONE_RESET ||
+		req_op(rq) == REQ_OP_ZONE_FINISH;
+}
+
+static bool sd_zbc_need_zone_wp_update(struct request *rq)
+{
+	if (req_op(rq) == REQ_OP_ZONE_RESET_ALL)
+		return true;
+
+	if (!blk_rq_zone_is_seq(rq))
+		return false;
+
+	if (req_op(rq) == REQ_OP_WRITE ||
+	    req_op(rq) == REQ_OP_WRITE_ZEROES ||
+	    req_op(rq) == REQ_OP_WRITE_SAME)
+		return true;
+
+	return sd_zbc_zone_needs_write_unlock(rq);
+}
+
 /**
  * sd_zbc_complete - ZBC command post processing.
  * @cmd: Completed command
  * @good_bytes: Command reply bytes
  * @sshdr: command sense header
  *
- * Called from sd_done(). Process report zones reply and handle reset zone
- * and write commands errors.
+ * Called from sd_done() to handle zone command errors and updates to the
+ * device queue zone write pointer offset cache.
  */
-void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr)
 {
 	int result = cmd->result;
 	struct request *rq = cmd->request;
+	struct request_queue *q = rq->q;
+	struct gendisk *disk = rq->rq_disk;
+	struct scsi_disk *sdkp = scsi_disk(disk);
+	unsigned int zno;
 
 	if (op_is_zone_mgmt(req_op(rq)) &&
 	    result &&
@@ -289,7 +492,67 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		 * so be quiet about the error.
 		 */
 		rq->rq_flags |= RQF_QUIET;
+		goto unlock_zone;
 	}
+
+	if (!sd_zbc_need_zone_wp_update(rq))
+		goto unlock_zone;
+
+	/*
+	 * If we got an error for a command that needs updating the write
+	 * pointer offset cache, we must mark the zone wp offset entry as
+	 * invalid to force an update from disk the next time a zone append
+	 * command is issued.
+	 */
+	zno = blk_rq_zone_no(rq);
+	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
+
+	if (result && req_op(rq) != REQ_OP_ZONE_RESET_ALL) {
+		if (req_op(rq) == REQ_OP_ZONE_APPEND) {
+			/* Force complete completion (no retry) */
+			good_bytes = 0;
+			scsi_set_resid(cmd, blk_rq_bytes(rq));
+		}
+
+		/*
+		 * Force an update of the zone write pointer offset on
+		 * the next zone append access.
+		 */
+		if (q->seq_zones_wp_ofst[zno] != SD_ZBC_UPDATING_WP_OFST)
+			q->seq_zones_wp_ofst[zno] = SD_ZBC_INVALID_WP_OFST;
+		goto unlock_wp_ofst;
+	}
+
+	switch (req_op(rq)) {
+	case REQ_OP_ZONE_APPEND:
+		rq->__sector += q->seq_zones_wp_ofst[zno];
+		/* fallthrough */
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE:
+		if (q->seq_zones_wp_ofst[zno] < sd_zbc_zone_sectors(sdkp))
+			q->seq_zones_wp_ofst[zno] += good_bytes >> SECTOR_SHIFT;
+		break;
+	case REQ_OP_ZONE_RESET:
+		q->seq_zones_wp_ofst[zno] = 0;
+		break;
+	case REQ_OP_ZONE_FINISH:
+		q->seq_zones_wp_ofst[zno] = sd_zbc_zone_sectors(sdkp);
+		break;
+	case REQ_OP_ZONE_RESET_ALL:
+		memset(q->seq_zones_wp_ofst, 0,
+		       sdkp->nr_zones * sizeof(unsigned int));
+		break;
+	}
+
+unlock_wp_ofst:
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+
+unlock_zone:
+	if (sd_zbc_zone_needs_write_unlock(rq))
+		blk_req_zone_write_unlock(rq);
+
+	return good_bytes;
 }
 
 /**
@@ -417,8 +680,11 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 
 	/* The drive satisfies the kernel restrictions: set it up */
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_WP_OFST, sdkp->disk->queue);
 	blk_queue_required_elevator_features(sdkp->disk->queue,
 					     ELEVATOR_F_ZBD_SEQ_WRITE);
+	blk_queue_max_zone_append_sectors(sdkp->disk->queue,
+					  sdkp->disk->queue->limits.max_hw_sectors);
 	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
@@ -470,3 +736,39 @@ void sd_zbc_print_zones(struct scsi_disk *sdkp)
 			  sdkp->nr_zones,
 			  sdkp->zone_blocks);
 }
+
+void sd_zbc_init_disk(struct scsi_disk *sdkp)
+{
+	if (!sd_is_zoned(sdkp))
+		return;
+
+	spin_lock_init(&sdkp->zone_wp_ofst_lock);
+}
+
+int __init sd_zbc_init(void)
+{
+	sd_zbc_zone_work_cache =
+		kmem_cache_create("sd_zbc_zone_work",
+				  sizeof(struct sd_zbc_zone_work),
+				  0, 0, NULL);
+	if (!sd_zbc_zone_work_cache)
+		return -ENOMEM;
+
+	sd_zbc_zone_work_pool =
+		mempool_create_slab_pool(SD_ZBC_ZONE_WORK_MEMPOOL_SIZE,
+					 sd_zbc_zone_work_cache);
+	if (!sd_zbc_zone_work_pool) {
+		kmem_cache_destroy(sd_zbc_zone_work_cache);
+		printk(KERN_ERR "sd_zbc: create zone work pool failed\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void sd_zbc_exit(void)
+{
+	mempool_destroy(sd_zbc_zone_work_pool);
+	kmem_cache_destroy(sd_zbc_zone_work_cache);
+}
+
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Introduce Zone Append for writing to zoned block devices
  2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (10 preceding siblings ...)
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
@ 2020-03-10 16:42 ` Christoph Hellwig
  2020-03-11  0:37   ` Damien Le Moal
  2020-03-11  7:22   ` Johannes Thumshirn
  11 siblings, 2 replies; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:42 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:42PM +0900, Johannes Thumshirn wrote:
> For null_blk the emulation is way simpler, as null_blk's zoned block
> device emulation support already caches the write pointer position, so we
> only need to report the position back to the upper layers. Additional
> caching is not needed here.
> 
> Testing has been conducted by translating RWF_APPEND DIOs into
> REQ_OP_ZONE_APPEND commands in the block device's direct I/O function and
> injecting errors by bypassing the block layer interface and directly
> writing to the disk via the SCSI generic interface.

We really need a user of this to be useful upstream.  Didn't you plan
to look into converting zonefs/iomap to use it?  Without that it is
at best an RFC.  Even better would be converting zonefs and the f2fs
zoned code so that can get rid of the old per-zone serialization in
the I/O scheduler entirely.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] block: introduce bio_add_append_page
  2020-03-10  9:46 ` [PATCH 03/11] block: introduce bio_add_append_page Johannes Thumshirn
@ 2020-03-10 16:43   ` Christoph Hellwig
  2020-03-11 18:11   ` Johannes Thumshirn
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:43 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:45PM +0900, Johannes Thumshirn wrote:
> For REQ_OP_ZONE_APPEND we cannot add unlimited amounts of pages to a bio,
> as the bio cannot be split later on.
> 
> This is similar to what we have to do for passthrough pages as well, just
> with a different limit.
> 
> Introduce bio_add_append_page(), which can be used by file systems to add
> pages for a REQ_OP_ZONE_APPEND bio.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

This should go into the previous patch.
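
The constraint in the quoted description (a zone append bio cannot be split
later, so page additions must stop at the device's append limit) can be
sketched as follows. This is a user-space toy model: the 64-sector cap, the
`toy_bio` struct and the function shape are illustrative assumptions, not
the kernel's actual bio API.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_APPEND_SECTORS 64u	/* hypothetical per-device append limit */
#define SECTORS_PER_PAGE    8u	/* 4 KiB pages, 512 B sectors */

struct toy_bio {
	uint32_t nr_sectors;	/* sectors accumulated so far */
};

/*
 * Returns 1 if the page was added, 0 if adding it would exceed the append
 * limit; in the latter case the caller must submit this bio and start a
 * new one, since a zone append bio cannot be split after submission.
 */
static int toy_bio_add_append_page(struct toy_bio *bio)
{
	if (bio->nr_sectors + SECTORS_PER_PAGE > MAX_APPEND_SECTORS)
		return 0;
	bio->nr_sectors += SECTORS_PER_PAGE;
	return 1;
}
```

With a 64-sector cap and 8-sector pages, exactly eight pages fit before the
function refuses further additions.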

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 04/11] null_blk: Support REQ_OP_ZONE_APPEND
  2020-03-10  9:46 ` [PATCH 04/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-03-10 16:43   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:43 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:46PM +0900, Johannes Thumshirn wrote:
> From: Damien Le Moal <damien.lemoal@wdc.com>
> 
> Support REQ_OP_ZONE_APPEND requests for zone mode null_blk devices.
> Use the internally tracked zone write pointer position as the write
> position.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

This needs to go after all the infrastructure was added.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] block: introduce BLK_STS_ZONE_RESOURCE
  2020-03-10  9:46 ` [PATCH 05/11] block: introduce BLK_STS_ZONE_RESOURCE Johannes Thumshirn
@ 2020-03-10 16:43   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:43 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:47PM +0900, Johannes Thumshirn wrote:
> BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if
> zone related resources are unavailable, but the driver can guarantee the
> queue will be rerun in the future once the resources become available
> again.

This needs to go into the main zone append patch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] block: factor out requeue handling from dispatch code
  2020-03-10  9:46 ` [PATCH 07/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
@ 2020-03-10 16:44   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:44 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:49PM +0900, Johannes Thumshirn wrote:
> Factor out the requeue handling from the dispatch code, this will make
> subsequent addition of different requeueing schemes easier.

This should go first in the series, as it is useful even without the
rest of the series.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 08/11] block: delay un-dispatchable request
  2020-03-10  9:46 ` [PATCH 08/11] block: delay un-dispatchable request Johannes Thumshirn
@ 2020-03-10 16:45   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:45 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:50PM +0900, Johannes Thumshirn wrote:
> When a LLDD can't dispatch a request to a specific zone, it will return
> BLK_STS_ZONE_RESOURCE indicating this request needs to be delayed, e.g.
> because the zone it will be dispatched to is still write-locked.
> 
> If this happens set the request aside in a local list to continue trying
> dispatching requests such as READ requests or a WRITE/ZONE_APPEND
> requests targeting other zones. This way we can still keep a high queue
> depth without starving other requests even if one request can't be
> served due to zone write-locking.
> 
> All requests put aside in the local list due to BLK_STS_ZONE_RESOURCE
> are placed back at the head of the dispatch list for retrying the next
> time the device queues are run again.

This needs to go into the main zone append patch.  Also I think some
of the above explanation would be useful in comments in the code.
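
As a rough illustration of the set-aside scheme described in the quoted
message: blocked requests are collected on a local list while the rest of
the queue keeps being serviced, then re-attached for the next queue run.
All types and names here are simplified stand-ins, not the kernel's
`struct request` machinery.

```c
#include <assert.h>
#include <stddef.h>

enum sts { STS_OK, STS_ZONE_RESOURCE };

struct req {
	int zone;
	struct req *next;
};

/* A per-zone flag stands in for zone write locking in the LLDD. */
static int zone_locked[4];

static enum sts dispatch_one(struct req *rq)
{
	if (zone_locked[rq->zone])
		return STS_ZONE_RESOURCE;	/* cannot dispatch right now */
	return STS_OK;
}

/*
 * Walk the pending list: dispatchable requests are consumed, blocked ones
 * are put aside on a local list instead of stalling the whole queue, and
 * that list is returned so it can be placed back at the head of the
 * dispatch list for the next run.
 */
static struct req *run_queue(struct req *head, int *dispatched)
{
	struct req *zone_list = NULL, *zone_tail = NULL;

	while (head) {
		struct req *rq = head;

		head = head->next;
		rq->next = NULL;

		if (dispatch_one(rq) == STS_ZONE_RESOURCE) {
			/* Set aside; keep servicing other requests. */
			if (zone_tail)
				zone_tail->next = rq;
			else
				zone_list = rq;
			zone_tail = rq;
		} else {
			(*dispatched)++;
		}
	}

	return zone_list;	/* to be requeued at the head */
}
```

A READ to an unlocked zone is dispatched even while a request to a locked
zone waits, which is the starvation-avoidance point of the scheme.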

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/11] block: Introduce zone write pointer offset caching
  2020-03-10  9:46 ` [PATCH 09/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
@ 2020-03-10 16:46   ` Christoph Hellwig
  2020-03-11  0:34     ` Damien Le Moal
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:46 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Tue, Mar 10, 2020 at 06:46:51PM +0900, Johannes Thumshirn wrote:
> From: Damien Le Moal <damien.lemoal@wdc.com>
> 
> Not all zoned block devices natively support the zone append command.
> E.g. SCSI and ATA disks do not define this command. However, it is
> fairly straightforward to emulate this command at the LLD level using
> regular write commands if a zone write pointer position is known.
> Introducing such emulation enables the use of zone append write for all
> device types, therefore simplifying for instance the implementation of
> file systems zoned block device support by avoiding the need for
>> different write paths depending on the device capabilities.

I'd much rather have this in the driver itself than in the block layer.
Especially as sd will hopefully remain the only user.
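
The emulation approach from the quoted commit message can be modeled in a
few lines: translate an append issued at the zone start LBA into a regular
write at start + cached write pointer offset, and advance the cache on
successful completion. The zone size, the flat array and the function names
below are illustrative assumptions, not the kernel's data structures.

```c
#include <assert.h>
#include <stdint.h>

#define ZONE_SECTORS 128u	/* hypothetical zone size in sectors */
#define NR_ZONES       4u

/* Cached write pointer offset of each zone, relative to the zone start. */
static uint32_t wp_ofst[NR_ZONES];

/*
 * Translate a zone append targeting the zone start LBA into the absolute
 * LBA of a regular write. Returns -1 if the write would cross the zone
 * boundary.
 */
static int64_t prepare_zone_append(uint64_t zone_start_lba, uint32_t nr_sectors)
{
	uint32_t zno = (uint32_t)(zone_start_lba / ZONE_SECTORS);

	if (wp_ofst[zno] + nr_sectors > ZONE_SECTORS)
		return -1;

	return (int64_t)(zone_start_lba + wp_ofst[zno]);
}

/* On successful completion, advance the cached write pointer offset. */
static void complete_zone_append(uint64_t zone_start_lba, uint32_t nr_sectors)
{
	wp_ofst[(uint32_t)(zone_start_lba / ZONE_SECTORS)] += nr_sectors;
}
```

Two back-to-back appends to the same zone land at consecutive LBAs, which
is the behavior the real WRITE(16)-based emulation provides.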

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
@ 2020-03-10 17:43   ` kbuild test robot
  2020-03-10 19:38   ` kbuild test robot
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-03-10 17:43 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2307 bytes --]

Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on scsi/for-next mkp-scsi/for-next linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/Introduce-Zone-Append-for-writing-to-zoned-block-devices/20200310-213648
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: alpha-defconfig (attached as .config)
compiler: alpha-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=alpha 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from drivers/scsi/sd.c:72:
>> drivers/scsi/sd.h:253:23: error: expected ';' before 'int'
     253 | static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
         |                       ^~~~
         |                       ;
   drivers/scsi/sd.c: In function 'sd_setup_read_write_cmnd':
>> drivers/scsi/sd.c:1219:9: error: too many arguments to function 'sd_zbc_prepare_zone_append'
    1219 |   ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/scsi/sd.c:72:
   drivers/scsi/sd.h:259:28: note: declared here
     259 | static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
         |                            ^~~~~~~~~~~~~~~~~~~~~~~~~~

vim +253 drivers/scsi/sd.h

   252	
 > 253	static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
   254				unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
   255	{
   256		return 0;
   257	}
   258	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 13396 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
  2020-03-10 17:43   ` kbuild test robot
@ 2020-03-10 19:38   ` kbuild test robot
  2020-03-10 22:43   ` kbuild test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-03-10 19:38 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6591 bytes --]

Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on scsi/for-next mkp-scsi/for-next linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/Introduce-Zone-Append-for-writing-to-zoned-block-devices/20200310-213648
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-e003-20200310 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-5) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from drivers/scsi/sd.c:72:0:
   drivers/scsi/sd.h:253:24: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int'
    static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
                           ^~~
   drivers/scsi/sd.c: In function 'sd_setup_read_write_cmnd':
   drivers/scsi/sd.c:1219:9: error: too many arguments to function 'sd_zbc_prepare_zone_append'
      ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/scsi/sd.c:72:0:
   drivers/scsi/sd.h:259:28: note: declared here
    static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
                               ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/scsi/sd.c: In function 'sd_done':
>> drivers/scsi/sd.c:2065:16: error: implicit declaration of function 'sd_zbc_complete'; did you mean 'dpm_complete'? [-Werror=implicit-function-declaration]
      good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
                   ^~~~~~~~~~~~~~~
                   dpm_complete
   cc1: some warnings being treated as errors

vim +2065 drivers/scsi/sd.c

  1949	
  1950	/**
  1951	 *	sd_done - bottom half handler: called when the lower level
  1952	 *	driver has completed (successfully or otherwise) a scsi command.
  1953	 *	@SCpnt: mid-level's per command structure.
  1954	 *
  1955	 *	Note: potentially run from within an ISR. Must not block.
  1956	 **/
  1957	static int sd_done(struct scsi_cmnd *SCpnt)
  1958	{
  1959		int result = SCpnt->result;
  1960		unsigned int good_bytes = result ? 0 : scsi_bufflen(SCpnt);
  1961		unsigned int sector_size = SCpnt->device->sector_size;
  1962		unsigned int resid;
  1963		struct scsi_sense_hdr sshdr;
  1964		struct scsi_disk *sdkp = scsi_disk(SCpnt->request->rq_disk);
  1965		struct request *req = SCpnt->request;
  1966		int sense_valid = 0;
  1967		int sense_deferred = 0;
  1968	
  1969		switch (req_op(req)) {
  1970		case REQ_OP_DISCARD:
  1971		case REQ_OP_WRITE_ZEROES:
  1972		case REQ_OP_WRITE_SAME:
  1973		case REQ_OP_ZONE_RESET:
  1974		case REQ_OP_ZONE_RESET_ALL:
  1975		case REQ_OP_ZONE_OPEN:
  1976		case REQ_OP_ZONE_CLOSE:
  1977		case REQ_OP_ZONE_FINISH:
  1978			if (!result) {
  1979				good_bytes = blk_rq_bytes(req);
  1980				scsi_set_resid(SCpnt, 0);
  1981			} else {
  1982				good_bytes = 0;
  1983				scsi_set_resid(SCpnt, blk_rq_bytes(req));
  1984			}
  1985			break;
  1986		default:
  1987			/*
  1988			 * In case of bogus fw or device, we could end up having
  1989			 * an unaligned partial completion. Check this here and force
  1990			 * alignment.
  1991			 */
  1992			resid = scsi_get_resid(SCpnt);
  1993			if (resid & (sector_size - 1)) {
  1994				sd_printk(KERN_INFO, sdkp,
  1995					"Unaligned partial completion (resid=%u, sector_sz=%u)\n",
  1996					resid, sector_size);
  1997				scsi_print_command(SCpnt);
  1998				resid = min(scsi_bufflen(SCpnt),
  1999					    round_up(resid, sector_size));
  2000				scsi_set_resid(SCpnt, resid);
  2001			}
  2002		}
  2003	
  2004		if (result) {
  2005			sense_valid = scsi_command_normalize_sense(SCpnt, &sshdr);
  2006			if (sense_valid)
  2007				sense_deferred = scsi_sense_is_deferred(&sshdr);
  2008		}
  2009		sdkp->medium_access_timed_out = 0;
  2010	
  2011		if (driver_byte(result) != DRIVER_SENSE &&
  2012		    (!sense_valid || sense_deferred))
  2013			goto out;
  2014	
  2015		switch (sshdr.sense_key) {
  2016		case HARDWARE_ERROR:
  2017		case MEDIUM_ERROR:
  2018			good_bytes = sd_completed_bytes(SCpnt);
  2019			break;
  2020		case RECOVERED_ERROR:
  2021			good_bytes = scsi_bufflen(SCpnt);
  2022			break;
  2023		case NO_SENSE:
  2024			/* This indicates a false check condition, so ignore it.  An
  2025			 * unknown amount of data was transferred so treat it as an
  2026			 * error.
  2027			 */
  2028			SCpnt->result = 0;
  2029			memset(SCpnt->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
  2030			break;
  2031		case ABORTED_COMMAND:
  2032			if (sshdr.asc == 0x10)  /* DIF: Target detected corruption */
  2033				good_bytes = sd_completed_bytes(SCpnt);
  2034			break;
  2035		case ILLEGAL_REQUEST:
  2036			switch (sshdr.asc) {
  2037			case 0x10:	/* DIX: Host detected corruption */
  2038				good_bytes = sd_completed_bytes(SCpnt);
  2039				break;
  2040			case 0x20:	/* INVALID COMMAND OPCODE */
  2041			case 0x24:	/* INVALID FIELD IN CDB */
  2042				switch (SCpnt->cmnd[0]) {
  2043				case UNMAP:
  2044					sd_config_discard(sdkp, SD_LBP_DISABLE);
  2045					break;
  2046				case WRITE_SAME_16:
  2047				case WRITE_SAME:
  2048					if (SCpnt->cmnd[1] & 8) { /* UNMAP */
  2049						sd_config_discard(sdkp, SD_LBP_DISABLE);
  2050					} else {
  2051						sdkp->device->no_write_same = 1;
  2052						sd_config_write_same(sdkp);
  2053						req->rq_flags |= RQF_QUIET;
  2054					}
  2055					break;
  2056				}
  2057			}
  2058			break;
  2059		default:
  2060			break;
  2061		}
  2062	
  2063	 out:
  2064		if (sd_is_zoned(sdkp))
> 2065			good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
  2066	
  2067		SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
  2068						   "sd_done: completed %d of %d bytes\n",
  2069						   good_bytes, scsi_bufflen(SCpnt)));
  2070	
  2071		return good_bytes;
  2072	}
  2073	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 38936 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
  2020-03-10 17:43   ` kbuild test robot
  2020-03-10 19:38   ` kbuild test robot
@ 2020-03-10 22:43   ` kbuild test robot
  2020-03-10 23:04   ` kbuild test robot
  2020-03-11  0:57   ` kbuild test robot
  4 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-03-10 22:43 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2810 bytes --]

Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on scsi/for-next mkp-scsi/for-next linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/Introduce-Zone-Append-for-writing-to-zoned-block-devices/20200310-213648
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-g001-20200309 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-5) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from drivers/usb/storage/transport.c:49:0:
>> drivers/usb/storage/../../scsi/sd.h:253:24: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int'
    static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
                           ^~~
--
   In file included from drivers/scsi/sd.c:72:0:
>> drivers/scsi/sd.h:253:24: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int'
    static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
                           ^~~
   drivers/scsi/sd.c: In function 'sd_setup_read_write_cmnd':
   drivers/scsi/sd.c:1219:9: error: too many arguments to function 'sd_zbc_prepare_zone_append'
      ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/scsi/sd.c:72:0:
   drivers/scsi/sd.h:259:28: note: declared here
    static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
                               ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/scsi/sd.c: In function 'sd_done':
>> drivers/scsi/sd.c:2065:16: error: implicit declaration of function 'sd_zbc_complete'; did you mean 'sd_zbc_exit'? [-Werror=implicit-function-declaration]
      good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
                   ^~~~~~~~~~~~~~~
                   sd_zbc_exit
   cc1: some warnings being treated as errors

vim +253 drivers/usb/storage/../../scsi/sd.h

   252	
 > 253	static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
   254				unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
   255	{
   256		return 0;
   257	}
   258	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 38150 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
                     ` (2 preceding siblings ...)
  2020-03-10 22:43   ` kbuild test robot
@ 2020-03-10 23:04   ` kbuild test robot
  2020-03-11  0:57   ` kbuild test robot
  4 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-03-10 23:04 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2442 bytes --]

Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on scsi/for-next mkp-scsi/for-next linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/Introduce-Zone-Append-for-writing-to-zoned-block-devices/20200310-213648
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-defconfig (attached as .config)
compiler: clang version 11.0.0 (git://gitmirror/llvm_project 6309334b9574017523f73648da879fa5e6ef017a)
reproduce:
        # FIXME the reproduce steps for clang is not ready yet

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from drivers/scsi/sd.c:72:
>> drivers/scsi/sd.h:253:15: error: unknown type name 'insigned'; did you mean 'unsigned'?
   static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
                 ^~~~~~~~
                 unsigned
>> drivers/scsi/sd.c:1219:47: error: too many arguments to function call, expected 2, have 3
                   ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~            ^~~~~~~~~
   drivers/scsi/sd.h:259:28: note: 'sd_zbc_prepare_zone_append' declared here
   static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
                              ^
   2 errors generated.
--
   In file included from drivers/usb/storage/transport.c:49:
>> drivers/usb/storage/../../scsi/sd.h:253:15: error: unknown type name 'insigned'; did you mean 'unsigned'?
   static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
                 ^~~~~~~~
                 unsigned
   1 error generated.

vim +253 drivers/scsi/sd.h

   252	
 > 253	static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
   254				unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
   255	{
   256		return 0;
   257	}
   258	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35531 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/11] block: Introduce zone write pointer offset caching
  2020-03-10 16:46   ` Christoph Hellwig
@ 2020-03-11  0:34     ` Damien Le Moal
  2020-03-11  6:24       ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2020-03-11  0:34 UTC (permalink / raw)
  To: Christoph Hellwig, Johannes Thumshirn
  Cc: Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen

On 2020/03/11 1:46, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 06:46:51PM +0900, Johannes Thumshirn wrote:
>> From: Damien Le Moal <damien.lemoal@wdc.com>
>>
>> Not all zoned block devices natively support the zone append command.
>> E.g. SCSI and ATA disks do not define this command. However, it is
>> fairly straightforward to emulate this command at the LLD level using
>> regular write commands if a zone write pointer position is known.
>> Introducing such emulation enables the use of zone append write for all
>> device types, therefore simplifying for instance the implementation of
>> file systems zoned block device support by avoiding the need for
>> different write paths depending on the device capabilities.
> 
> I'd much rather have this in the driver itself than in the block layer.
> Especially as sd will hopefully remain the only users.

Yes, I agree with you here. That would be nicer, but early attempts to do so
failed, as we always ended up with potential races on the number of zones/wp
array size in the case of a device change/revalidation. Moving the wp array
allocation and initialization to blk_revalidate_disk_zones() greatly
simplifies the code and removes the races, as all updates to the zone
bitmaps, wp array and nr_zones are done together under a queue freeze.
Moving the wp array only to sd_zbc, even using a queue freeze, leads to
potential out-of-bounds accesses for the wp array.

Another undesirable side effect of moving the wp array initialization to sd_zbc
is that we would need another full drive zone report after
blk_revalidate_disk_zones() own full report. That is costly. On 20TB SMR disks
with more than 75000 zones, the added delay is significant. Doing all
initialization within blk_revalidate_disk_zones() full zone report loop avoids
that added overhead.
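
The emulation scheme described above (look up the zone number from the start
LBA, add the cached write pointer offset, and report the resulting position
back on completion) can be sketched as a small userspace model; all names and
sizes here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative zone size: power of two, so zone number lookup is cheap. */
#define ZONE_SECTORS 0x10000u

/* Per-zone write pointer offset cache, in sectors from the zone start. */
static uint32_t wp_ofst[4];

/*
 * Model of zone append emulation: the caller targets the zone's start LBA,
 * the emulation turns that into a regular write at start + cached offset,
 * advances the cached write pointer, and returns the position the data was
 * written to (what the device reports on completion of the WRITE(16)).
 */
static uint64_t zone_append(uint64_t zone_start_lba, uint32_t nr_sectors)
{
	uint32_t zno = zone_start_lba / ZONE_SECTORS; /* zone number lookup */
	uint64_t wp = zone_start_lba + wp_ofst[zno];  /* actual write LBA */

	wp_ofst[zno] += nr_sectors; /* emulate the wp advancing */
	return wp;
}
```

Successive appends to the same zone then land at monotonically increasing
LBAs, while appends to different zones are independent.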



-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 00/11] Introduce Zone Append for writing to zoned block devices
  2020-03-10 16:42 ` [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
@ 2020-03-11  0:37   ` Damien Le Moal
  2020-03-11  6:24     ` Christoph Hellwig
  2020-03-11  7:22   ` Johannes Thumshirn
  1 sibling, 1 reply; 32+ messages in thread
From: Damien Le Moal @ 2020-03-11  0:37 UTC (permalink / raw)
  To: Christoph Hellwig, Johannes Thumshirn
  Cc: Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen

On 2020/03/11 1:42, Christoph Hellwig wrote:
> On Tue, Mar 10, 2020 at 06:46:42PM +0900, Johannes Thumshirn wrote:
>> For null_blk the emulation is way simpler, as null_blk's zoned block
>> device emulation support already caches the write pointer position, so we
>> only need to report the position back to the upper layers. Additional
>> caching is not needed here.
>>
>> Testing has been conducted by translating RWF_APPEND DIOs into
>> REQ_OP_ZONE_APPEND commands in the block device's direct I/O function and
>> injecting errors by bypassing the block layer interface and directly
>> writing to the disk via the SCSI generic interface.
> 
> We really need a user of this to be useful upstream.  Didn't you plan
> to look into converting zonefs/iomap to use it?  Without that it is
> at best an RFC.  Even better would be converting zonefs and the f2fs
> zoned code so that we can get rid of the old per-zone serialization in
> the I/O scheduler entirely.

I do not think we can get rid of it entirely as it is needed for applications
using regular writes on raw zoned block devices. But the zone write locking will
be completely bypassed for zone append writes issued by file systems.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
                     ` (3 preceding siblings ...)
  2020-03-10 23:04   ` kbuild test robot
@ 2020-03-11  0:57   ` kbuild test robot
  4 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2020-03-11  0:57 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6639 bytes --]

Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on block/for-next]
[also build test ERROR on scsi/for-next mkp-scsi/for-next linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest using the '--base' option to specify
the base tree in git format-patch; please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/Introduce-Zone-Append-for-writing-to-zoned-block-devices/20200310-213648
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: mips-fuloong2e_defconfig (attached as .config)
compiler: mips64el-linux-gcc (GCC) 5.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=5.5.0 make.cross ARCH=mips 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from drivers/scsi/sd.c:72:0:
   drivers/scsi/sd.h:253:24: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int'
    static inline insigned int sd_zbc_complete(struct scsi_cmnd *cmd,
                           ^
   drivers/scsi/sd.c: In function 'sd_setup_read_write_cmnd':
   drivers/scsi/sd.c:1219:9: error: too many arguments to function 'sd_zbc_prepare_zone_append'
      ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
            ^
   In file included from drivers/scsi/sd.c:72:0:
   drivers/scsi/sd.h:259:28: note: declared here
    static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
                               ^
   drivers/scsi/sd.c: In function 'sd_done':
>> drivers/scsi/sd.c:2065:16: error: implicit declaration of function 'sd_zbc_complete' [-Werror=implicit-function-declaration]
      good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
                   ^
   cc1: some warnings being treated as errors

vim +/sd_zbc_complete +2065 drivers/scsi/sd.c

  1949	
  1950	/**
  1951	 *	sd_done - bottom half handler: called when the lower level
  1952	 *	driver has completed (successfully or otherwise) a scsi command.
  1953	 *	@SCpnt: mid-level's per command structure.
  1954	 *
  1955	 *	Note: potentially run from within an ISR. Must not block.
  1956	 **/
  1957	static int sd_done(struct scsi_cmnd *SCpnt)
  1958	{
  1959		int result = SCpnt->result;
  1960		unsigned int good_bytes = result ? 0 : scsi_bufflen(SCpnt);
  1961		unsigned int sector_size = SCpnt->device->sector_size;
  1962		unsigned int resid;
  1963		struct scsi_sense_hdr sshdr;
  1964		struct scsi_disk *sdkp = scsi_disk(SCpnt->request->rq_disk);
  1965		struct request *req = SCpnt->request;
  1966		int sense_valid = 0;
  1967		int sense_deferred = 0;
  1968	
  1969		switch (req_op(req)) {
  1970		case REQ_OP_DISCARD:
  1971		case REQ_OP_WRITE_ZEROES:
  1972		case REQ_OP_WRITE_SAME:
  1973		case REQ_OP_ZONE_RESET:
  1974		case REQ_OP_ZONE_RESET_ALL:
  1975		case REQ_OP_ZONE_OPEN:
  1976		case REQ_OP_ZONE_CLOSE:
  1977		case REQ_OP_ZONE_FINISH:
  1978			if (!result) {
  1979				good_bytes = blk_rq_bytes(req);
  1980				scsi_set_resid(SCpnt, 0);
  1981			} else {
  1982				good_bytes = 0;
  1983				scsi_set_resid(SCpnt, blk_rq_bytes(req));
  1984			}
  1985			break;
  1986		default:
  1987			/*
  1988			 * In case of bogus fw or device, we could end up having
  1989			 * an unaligned partial completion. Check this here and force
  1990			 * alignment.
  1991			 */
  1992			resid = scsi_get_resid(SCpnt);
  1993			if (resid & (sector_size - 1)) {
  1994				sd_printk(KERN_INFO, sdkp,
  1995					"Unaligned partial completion (resid=%u, sector_sz=%u)\n",
  1996					resid, sector_size);
  1997				scsi_print_command(SCpnt);
  1998				resid = min(scsi_bufflen(SCpnt),
  1999					    round_up(resid, sector_size));
  2000				scsi_set_resid(SCpnt, resid);
  2001			}
  2002		}
  2003	
  2004		if (result) {
  2005			sense_valid = scsi_command_normalize_sense(SCpnt, &sshdr);
  2006			if (sense_valid)
  2007				sense_deferred = scsi_sense_is_deferred(&sshdr);
  2008		}
  2009		sdkp->medium_access_timed_out = 0;
  2010	
  2011		if (driver_byte(result) != DRIVER_SENSE &&
  2012		    (!sense_valid || sense_deferred))
  2013			goto out;
  2014	
  2015		switch (sshdr.sense_key) {
  2016		case HARDWARE_ERROR:
  2017		case MEDIUM_ERROR:
  2018			good_bytes = sd_completed_bytes(SCpnt);
  2019			break;
  2020		case RECOVERED_ERROR:
  2021			good_bytes = scsi_bufflen(SCpnt);
  2022			break;
  2023		case NO_SENSE:
  2024			/* This indicates a false check condition, so ignore it.  An
  2025			 * unknown amount of data was transferred so treat it as an
  2026			 * error.
  2027			 */
  2028			SCpnt->result = 0;
  2029			memset(SCpnt->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
  2030			break;
  2031		case ABORTED_COMMAND:
  2032			if (sshdr.asc == 0x10)  /* DIF: Target detected corruption */
  2033				good_bytes = sd_completed_bytes(SCpnt);
  2034			break;
  2035		case ILLEGAL_REQUEST:
  2036			switch (sshdr.asc) {
  2037			case 0x10:	/* DIX: Host detected corruption */
  2038				good_bytes = sd_completed_bytes(SCpnt);
  2039				break;
  2040			case 0x20:	/* INVALID COMMAND OPCODE */
  2041			case 0x24:	/* INVALID FIELD IN CDB */
  2042				switch (SCpnt->cmnd[0]) {
  2043				case UNMAP:
  2044					sd_config_discard(sdkp, SD_LBP_DISABLE);
  2045					break;
  2046				case WRITE_SAME_16:
  2047				case WRITE_SAME:
  2048					if (SCpnt->cmnd[1] & 8) { /* UNMAP */
  2049						sd_config_discard(sdkp, SD_LBP_DISABLE);
  2050					} else {
  2051						sdkp->device->no_write_same = 1;
  2052						sd_config_write_same(sdkp);
  2053						req->rq_flags |= RQF_QUIET;
  2054					}
  2055					break;
  2056				}
  2057			}
  2058			break;
  2059		default:
  2060			break;
  2061		}
  2062	
  2063	 out:
  2064		if (sd_is_zoned(sdkp))
> 2065			good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
  2066	
  2067		SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
  2068						   "sd_done: completed %d of %d bytes\n",
  2069						   good_bytes, scsi_bufflen(SCpnt)));
  2070	
  2071		return good_bytes;
  2072	}
  2073	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 19192 bytes --]


* Re: [PATCH 09/11] block: Introduce zone write pointer offset caching
  2020-03-11  0:34     ` Damien Le Moal
@ 2020-03-11  6:24       ` Christoph Hellwig
  2020-03-11  6:40         ` Damien Le Moal
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-11  6:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Johannes Thumshirn, Jens Axboe, linux-block,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Wed, Mar 11, 2020 at 12:34:33AM +0000, Damien Le Moal wrote:
> Yes, I agree with you here. That would be nicer, but an early attempt to do so
> failed as we always ended up with potential races on the number of zones and
> the wp array size in the case of a device change/revalidation. Moving the wp
> array allocation and initialization to blk_revalidate_disk_zones() greatly
> simplifies the code and removes the races, as all updates to the zone bitmaps,
> wp array and nr_zones are done together under a queue freeze. Moving the wp
> array only into sd_zbc, even with a queue freeze, leads to potential
> out-of-bounds accesses to the wp array.
> 
> Another undesirable side effect of moving the wp array initialization into
> sd_zbc is that we would need another full drive zone report after
> blk_revalidate_disk_zones()'s own full report. That is costly. On 20TB SMR
> disks with more than 75000 zones, the added delay is significant. Doing all
> initialization within blk_revalidate_disk_zones()'s full zone report loop
> avoids that added overhead.

That explanation needs to go into the commit log.


* Re: [PATCH 00/11] Introduce Zone Append for writing to zoned block devices
  2020-03-11  0:37   ` Damien Le Moal
@ 2020-03-11  6:24     ` Christoph Hellwig
  2020-03-11  6:40       ` Damien Le Moal
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2020-03-11  6:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Johannes Thumshirn, Jens Axboe, linux-block,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen

On Wed, Mar 11, 2020 at 12:37:33AM +0000, Damien Le Moal wrote:
> I do not think we can get rid of it entirely as it is needed for applications
> using regular writes on raw zoned block devices. But the zone write locking will
> be completely bypassed for zone append writes issued by file systems.

But applications that are aware of zones should not be sending multiple
write commands to a zone anyway.  We certainly can't use zone write
locking for nvme if we want to be able to use multiple queues.


* Re: [PATCH 00/11] Introduce Zone Append for writing to zoned block devices
  2020-03-11  6:24     ` Christoph Hellwig
@ 2020-03-11  6:40       ` Damien Le Moal
  0 siblings, 0 replies; 32+ messages in thread
From: Damien Le Moal @ 2020-03-11  6:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen

On 2020/03/11 15:25, Christoph Hellwig wrote:
> On Wed, Mar 11, 2020 at 12:37:33AM +0000, Damien Le Moal wrote:
>> I do not think we can get rid of it entirely as it is needed for applications
>> using regular writes on raw zoned block devices. But the zone write locking will
>> be completely bypassed for zone append writes issued by file systems.
> 
> But applications that are aware of zones should not be sending multiple
> write commands to a zone anyway.  We certainly can't use zone write
> locking for nvme if we want to be able to use multiple queues.
> 

True, and that is the main use case I am seeing in the field.

However, even for this to work properly, we will also need a special
bio_add_page() function for regular writes to zones, similar to zone append,
to ensure that a large BIO does not become multiple requests, won't we?
Otherwise, submitting a write bio will generate multiple requests that may get
reordered on dispatch and on requeue (on SAS or on SATA).

Furthermore, we already support aio. Customers in the field use it with fio's
libaio engine to test drives and for application development. So I am afraid
that removing the zone write locking now would break user space, no?

For nvme, we want to allow the "none" elevator as the default rather than
mq-deadline which is now the default for all zoned block devices. This is a very
simple change to the default elevator selection we can add based on the nonrot
queue flag.
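
The per-zone write locking being discussed here can be modeled, very roughly,
as a trylock taken at dispatch and released at completion, so only one regular
write per zone is ever in flight and writes to a sequential zone cannot be
reordered. The names below are illustrative userspace stand-ins, not the
kernel's (the series adds the real entry point blk_req_zone_write_trylock()):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ZONES 4

/* One lock bit per zone; set while a write to that zone is in flight. */
static bool zone_wlocked[NR_ZONES];

/*
 * Dispatch-side check: a write may be issued only if its target zone is
 * not already locked. On failure the caller must delay/requeue the request
 * rather than dispatch it out of order.
 */
static bool zone_write_trylock(unsigned int zno)
{
	if (zone_wlocked[zno])
		return false;
	zone_wlocked[zno] = true;
	return true;
}

/* Completion side: release the zone so the next write can be dispatched. */
static void zone_write_unlock(unsigned int zno)
{
	zone_wlocked[zno] = false;
}
```

Zone append requests bypass this serialization entirely, which is the point of
the series: the device (or its emulation) decides the write position itself.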

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 09/11] block: Introduce zone write pointer offset caching
  2020-03-11  6:24       ` Christoph Hellwig
@ 2020-03-11  6:40         ` Damien Le Moal
  0 siblings, 0 replies; 32+ messages in thread
From: Damien Le Moal @ 2020-03-11  6:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen

On 2020/03/11 15:24, Christoph Hellwig wrote:
> On Wed, Mar 11, 2020 at 12:34:33AM +0000, Damien Le Moal wrote:
>> Yes, I agree with you here. That would be nicer, but an early attempt to do so
>> failed as we always ended up with potential races on the number of zones and
>> the wp array size in the case of a device change/revalidation. Moving the wp
>> array allocation and initialization to blk_revalidate_disk_zones() greatly
>> simplifies the code and removes the races, as all updates to the zone bitmaps,
>> wp array and nr_zones are done together under a queue freeze. Moving the wp
>> array only into sd_zbc, even with a queue freeze, leads to potential
>> out-of-bounds accesses to the wp array.
>>
>> Another undesirable side effect of moving the wp array initialization into
>> sd_zbc is that we would need another full drive zone report after
>> blk_revalidate_disk_zones()'s own full report. That is costly. On 20TB SMR
>> disks with more than 75000 zones, the added delay is significant. Doing all
>> initialization within blk_revalidate_disk_zones()'s full zone report loop
>> avoids that added overhead.
> 
> That explanation needs to go into the commit log.
> 

OK. Will do.

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 00/11] Introduce Zone Append for writing to zoned block devices
  2020-03-10 16:42 ` [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
  2020-03-11  0:37   ` Damien Le Moal
@ 2020-03-11  7:22   ` Johannes Thumshirn
  1 sibling, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-11  7:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen

On 10/03/2020 17:42, Christoph Hellwig wrote:
[...]
> We really need a user of this to be useful upstream.  Didn't you plan
> to look into converting zonefs/iomap to use it?  Without that it is
> at best an RFC.  Even better would be converting zonefs and the f2fs
> zoned code so that we can get rid of the old per-zone serialization in
> the I/O scheduler entirely.

Yes, I'm currently working on iomap/zonefs support for zone append.




* Re: [PATCH 03/11] block: introduce bio_add_append_page
  2020-03-10  9:46 ` [PATCH 03/11] block: introduce bio_add_append_page Johannes Thumshirn
  2020-03-10 16:43   ` Christoph Hellwig
@ 2020-03-11 18:11   ` Johannes Thumshirn
  1 sibling, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2020-03-11 18:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen

On 10/03/2020 10:47, Johannes Thumshirn wrote:
[...]
> @@ -945,8 +955,15 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>   
>   		len = min_t(size_t, PAGE_SIZE - offset, left);
>   
> -		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> -			if (same_page)
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +			size = bio_add_append_page(bio->bi_disk->queue, bio,
> +						   page, len, offset);
> +
> +			if (size != len)
> +				return -E2BIG;


Converting zonefs/iomap to zone append uncovered a bug here; it should have been:

if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
	int ret;

	ret = bio_add_append_page(bio->bi_disk->queue, bio,
				  page, len, offset);
	if (ret != len)
		return -E2BIG;
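
The pattern in the corrected hunk (the helper returns the number of bytes it
could add, and a short add means the bio would have to split into multiple
requests, so the caller bails out with -E2BIG instead) can be modeled in
userspace like this; the cap, names and error value are made up for
illustration:

```c
#include <assert.h>

#define E2BIG_ERR (-7)          /* stand-in for -E2BIG */
#define BIO_APPEND_LIMIT 4096   /* illustrative per-bio cap, in bytes */

/* Bytes accumulated in the current "bio". */
static int bio_bytes;

/*
 * Model of the add-page helper: append up to the per-bio limit (e.g. what
 * max_zone_append_sectors would impose) and return how much was actually
 * added, which may be less than requested.
 */
static int add_append_page(int len)
{
	int room = BIO_APPEND_LIMIT - bio_bytes;
	int added = len < room ? len : room; /* never exceed the cap */

	bio_bytes += added;
	return added;
}

/*
 * Model of the caller: compare the returned length against what was
 * requested. A short add means the bio cannot grow without splitting,
 * which zone append forbids, so fail with -E2BIG.
 */
static int iov_iter_get_pages_step(int len)
{
	int ret = add_append_page(len);

	if (ret != len)
		return E2BIG_ERR;
	return 0;
}
```

The original bug compared an uninitialized `size` instead of the helper's
return value; the fixed hunk checks `ret`, as modeled above.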


end of thread, other threads:[~2020-03-11 18:11 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-10  9:46 [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-03-10  9:46 ` [PATCH 01/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-03-10  9:46 ` [PATCH 02/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-03-10  9:46 ` [PATCH 03/11] block: introduce bio_add_append_page Johannes Thumshirn
2020-03-10 16:43   ` Christoph Hellwig
2020-03-11 18:11   ` Johannes Thumshirn
2020-03-10  9:46 ` [PATCH 04/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-03-10 16:43   ` Christoph Hellwig
2020-03-10  9:46 ` [PATCH 05/11] block: introduce BLK_STS_ZONE_RESOURCE Johannes Thumshirn
2020-03-10 16:43   ` Christoph Hellwig
2020-03-10  9:46 ` [PATCH 06/11] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-03-10  9:46 ` [PATCH 07/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
2020-03-10 16:44   ` Christoph Hellwig
2020-03-10  9:46 ` [PATCH 08/11] block: delay un-dispatchable request Johannes Thumshirn
2020-03-10 16:45   ` Christoph Hellwig
2020-03-10  9:46 ` [PATCH 09/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
2020-03-10 16:46   ` Christoph Hellwig
2020-03-11  0:34     ` Damien Le Moal
2020-03-11  6:24       ` Christoph Hellwig
2020-03-11  6:40         ` Damien Le Moal
2020-03-10  9:46 ` [PATCH 10/11] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-03-10  9:46 ` [PATCH 11/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-03-10 17:43   ` kbuild test robot
2020-03-10 19:38   ` kbuild test robot
2020-03-10 22:43   ` kbuild test robot
2020-03-10 23:04   ` kbuild test robot
2020-03-11  0:57   ` kbuild test robot
2020-03-10 16:42 ` [PATCH 00/11] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
2020-03-11  0:37   ` Damien Le Moal
2020-03-11  6:24     ` Christoph Hellwig
2020-03-11  6:40       ` Damien Le Moal
2020-03-11  7:22   ` Johannes Thumshirn
