* [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices
@ 2020-03-24 15:24 Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 01/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn

The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When writing to a zoned block device using zone append, the start
sector of the write points to the start LBA of the zone to write to.
Upon completion the block device will respond with the position the
data has been placed at in the zone. From a high level perspective this
can be seen like a file system's block allocator, where the user writes
to a file and the file system takes care of the data placement on the
device.

In order to fully exploit the new zone append command in file systems
and other interfaces above the block layer, we chose to emulate zone
append in SCSI and null_blk. This way we can have a single write path
for both file systems and other interfaces above the block layer, like
io_uring on zoned block devices, without having to care too much about
the underlying characteristics of the device itself.

The emulation works by providing a cache of each zone's write pointer,
so a zone append issued to the disk can be translated to a regular
write starting at the zone's current write pointer. The start LBA of
the zone append is used to determine the zone number for the lookup in
the zone write pointer offset cache, and the cached offset is then
added to the LBA to get the actual position to write the data at. In
SCSI we then turn the REQ_OP_ZONE_APPEND request into a WRITE(16)
command. Upon successful completion of the WRITE(16), the cache is
updated to the new write pointer location and the written sector is
noted in the request. On error, the cache entry is marked as invalid
and an update of the write pointer is scheduled before the next write
is issued.
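
As a rough sketch (illustrative only, ignoring locking, LBA/sector unit
conversion and error handling; the real code is in the sd_zbc patch
below), the translation boils down to:

	sector_t sector = bio->bi_iter.bi_sector;	/* zone start */
	unsigned int zno = blk_queue_zone_no(q, sector);
	unsigned int wp_ofst = q->seq_zones_wp_ofst[zno];

	/* issue a regular write at sector + wp_ofst; on success: */
	q->seq_zones_wp_ofst[zno] += bio_sectors(bio);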

In order to reduce memory consumption, the only cached item is the
offset of the write pointer from the start of the zone; everything else
can be calculated. On an example drive with 52156 zones, the additional
memory consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51
4k pages. The performance impact is negligible for a spinning drive.

For null_blk the emulation is much simpler, as null_blk's zoned block
device emulation support already caches the write pointer position, so we
only need to report the position back to the upper layers. Additional
caching is not needed here.

Furthermore, we have converted zonefs to use ZONE_APPEND for
synchronous direct I/O, which also required changes in iomap.

The series is based on Jens' for-next branch.

Changes since v1:
- Too much to mention, treat as a completely new series.

Damien Le Moal (3):
  block: Introduce zone write pointer offset caching
  null_blk: Cleanup zoned device initialization
  null_blk: Support REQ_OP_ZONE_APPEND

Johannes Thumshirn (7):
  block: factor out requeue handling from dispatch code
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce blk_req_zone_write_trylock
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: sd_zbc: emulate ZONE_APPEND commands
  iomap: Add support for zone append writes
  zonefs: use zone-append for sequential zones

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    |  72 ++++++-
 block/blk-core.c               |  52 +++++
 block/blk-map.c                |   2 +-
 block/blk-mq.c                 |  54 ++++-
 block/blk-settings.c           |  19 ++
 block/blk-sysfs.c              |  15 +-
 block/blk-zoned.c              |  83 +++++++-
 block/blk.h                    |   4 +-
 drivers/block/null_blk.h       |  14 +-
 drivers/block/null_blk_main.c  |  35 ++--
 drivers/block/null_blk_zoned.c |  51 ++++-
 drivers/scsi/scsi_lib.c        |   1 +
 drivers/scsi/sd.c              |  28 ++-
 drivers/scsi/sd.h              |  36 +++-
 drivers/scsi/sd_zbc.c          | 352 +++++++++++++++++++++++++++++++--
 fs/iomap/direct-io.c           |  80 ++++++--
 fs/zonefs/super.c              |  15 +-
 include/linux/bio.h            |  22 +--
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  42 +++-
 include/linux/fs.h             |   1 +
 include/linux/iomap.h          |  22 +--
 22 files changed, 881 insertions(+), 133 deletions(-)

-- 
2.24.1



* [PATCH v2 01/11] block: factor out requeue handling from dispatch code
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-25  8:40   ` Christoph Hellwig
  2020-03-24 15:24 ` [PATCH v2 02/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn, Christoph Hellwig

Factor out the requeue handling from the dispatch code. This will make
the subsequent addition of different requeueing schemes easier.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b2e6550e0b6..745ec592a513 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1178,6 +1178,23 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
 
 #define BLK_MQ_RESOURCE_DELAY	3		/* ms units */
 
+static void blk_mq_handle_dev_resource(struct request *rq,
+				       struct list_head *list)
+{
+	struct request *next =
+		list_first_entry_or_null(list, struct request, queuelist);
+
+	/*
+	 * If an I/O scheduler has been configured and we got a driver tag for
+	 * the next request already, free it.
+	 */
+	if (next)
+		blk_mq_put_driver_tag(next);
+
+	list_add(&rq->queuelist, list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1245,17 +1262,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
-			/*
-			 * If an I/O scheduler has been configured and we got a
-			 * driver tag for the next request already, free it
-			 * again.
-			 */
-			if (!list_empty(list)) {
-				nxt = list_first_entry(list, struct request, queuelist);
-				blk_mq_put_driver_tag(nxt);
-			}
-			list_add(&rq->queuelist, list);
-			__blk_mq_requeue_request(rq);
+			blk_mq_handle_dev_resource(rq, list);
 			break;
 		}
 
-- 
2.24.1



* [PATCH v2 02/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 01/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn

blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with
CONFIG_BLK_DEV_ZONED disabled until now.

The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 include/linux/blkdev.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..25b63f714619 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -729,6 +729,16 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 {
 	return 0;
 }
+static inline bool blk_queue_zone_is_seq(struct request_queue *q,
+					 sector_t sector)
+{
+	return false;
+}
+static inline unsigned int blk_queue_zone_no(struct request_queue *q,
+					     sector_t sector)
+{
+	return 0;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline bool rq_is_sync(struct request *rq)
-- 
2.24.1



* [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 01/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 02/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 16:07   ` Johannes Thumshirn
  2020-03-25 16:24   ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 04/11] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn

From: Keith Busch <kbusch@kernel.org>

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* Target a sequential write zone
* Not cross a zone boundary
* Not be split, to ensure that a single range of LBAs is written with a
  single command

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid write append BIO
splitting, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size is always lower than this limit.
Export this new limit through sysfs. Also add bio_add_zone_append_page()
similar to bio_add_pc_page() so producers can build BIOs respecting the
new limit.
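
For reference, a producer would build and submit such a BIO roughly as
follows (an illustrative sketch, not code from this series; allocation
and error handling are simplified):

	struct bio *bio = bio_alloc(GFP_KERNEL, 1);

	bio_set_dev(bio, bdev);
	/* must point at the start sector of the target zone */
	bio->bi_iter.bi_sector = zone_start_sector;
	bio->bi_opf = REQ_OP_ZONE_APPEND;

	if (bio_add_zone_append_page(bio, page, len, 0) != len) {
		/* would exceed max_zone_append_sectors, cannot split */
		bio_put(bio);
		return -EIO;
	}

	submit_bio(bio);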

Also, when an LLDD can't dispatch a request to a specific zone, it will
return BLK_STS_ZONE_RESOURCE, indicating that this request needs to be
delayed, e.g. because the zone it will be dispatched to is still
write-locked. If this happens, set the request aside in a local list to
continue trying to dispatch requests such as READ requests or
WRITE/ZONE_APPEND requests targeting other zones. This way we can still
keep a high queue depth without starving other requests, even if one
request can't be served due to zone write-locking.

Finally, make sure that the bio sector position reflects the actual
write position as reported by the device on completion.
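
On completion, the caller can then read the effective write position
back from the BIO, e.g. in its end_io callback (sketch):

	static void zone_append_end_io(struct bio *bio)
	{
		/*
		 * For a successful REQ_OP_ZONE_APPEND, bi_sector now
		 * holds the sector the device actually wrote the data at.
		 */
		sector_t written = bio->bi_iter.bi_sector;

		pr_debug("zone append completed at sector %llu\n",
			 (unsigned long long)written);
		bio_put(bio);
	}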

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/bio.c               | 72 +++++++++++++++++++++++++++++++++++++--
 block/blk-core.c          | 52 ++++++++++++++++++++++++++++
 block/blk-map.c           |  2 +-
 block/blk-mq.c            | 27 +++++++++++++++
 block/blk-settings.c      | 19 +++++++++++
 block/blk-sysfs.c         | 13 +++++++
 drivers/scsi/scsi_lib.c   |  1 +
 include/linux/bio.h       | 22 ++----------
 include/linux/blk_types.h | 14 ++++++++
 include/linux/blkdev.h    | 11 ++++++
 10 files changed, 210 insertions(+), 23 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0985f3422556..b407716d5734 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -680,6 +680,45 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
+static inline bool bio_can_zone_append(struct bio *bio, unsigned len)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return false;
+
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
+		return false;
+
+	if (bio->bi_vcnt >= q->limits.max_segments)
+		return false;
+
+	return true;
+}
+
+/**
+ * bio_full - check if the bio is full
+ * @bio:	bio to check
+ * @len:	length of one segment to be added
+ *
+ * Return true if @bio is full and one segment with @len bytes can't be
+ * added to the bio, otherwise return false
+ */
+bool bio_full(struct bio *bio, unsigned len)
+{
+	if (bio->bi_vcnt >= bio->bi_max_vecs)
+		return true;
+
+	if (bio->bi_iter.bi_size > UINT_MAX - len)
+		return true;
+
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+		return bio_can_zone_append(bio, len);
+
+	return false;
+}
+
 static inline bool page_is_mergeable(const struct bio_vec *bv,
 		struct page *page, unsigned int len, unsigned int off,
 		bool *same_page)
@@ -782,6 +821,22 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
+static bool bio_try_merge_zone_append_page(struct bio *bio, struct page *page,
+					   unsigned int len, unsigned int off)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+	unsigned long mask = queue_segment_boundary(q);
+	phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
+	phys_addr_t addr2 = page_to_phys(page) + off + len - 1;
+
+	if ((addr1 | mask) != (addr2 | mask))
+		return false;
+	if (bv->bv_len + len > queue_max_segment_size(q))
+		return false;
+	return true;
+}
+
 /**
  * __bio_try_merge_page - try appending data to an existing bvec.
  * @bio: destination bio
@@ -807,6 +862,12 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			if (!bio_try_merge_zone_append_page(bio, page, len,
+							    off))
+				return false;
+		}
+
 		if (page_is_mergeable(bv, page, len, off, same_page)) {
 			if (bio->bi_iter.bi_size > UINT_MAX - len)
 				return false;
@@ -867,6 +928,7 @@ int bio_add_page(struct bio *bio, struct page *page,
 	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 		if (bio_full(bio, len))
 			return 0;
+
 		__bio_add_page(bio, page, len, offset);
 	}
 	return len;
@@ -899,7 +961,7 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 
 	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
 	size = bio_add_page(bio, bv->bv_page, len,
-				bv->bv_offset + iter->iov_offset);
+			    bv->bv_offset + iter->iov_offset);
 	if (unlikely(size != len))
 		return -EINVAL;
 	iov_iter_advance(iter, size);
@@ -1399,7 +1461,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
  */
 struct bio *bio_map_user_iov(struct request_queue *q,
 			     struct iov_iter *iter,
-			     gfp_t gfp_mask)
+			     gfp_t gfp_mask, unsigned int op)
 {
 	int j;
 	struct bio *bio;
@@ -1439,7 +1501,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 					n = bytes;
 
 				if (!__bio_add_pc_page(q, bio, page, n, offs,
-						&same_page)) {
+						       &same_page)) {
 					if (same_page)
 						put_page(page);
 					break;
@@ -1905,6 +1967,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index abfdcf81a228..7f2636fb3b69 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -240,6 +241,17 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	bio_advance(bio, nbytes);
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial zone append completions cannot be supported as the
+		 * BIO fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
+
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
@@ -864,6 +876,41 @@ static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -936,6 +983,11 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-map.c b/block/blk-map.c
index b0790268ed9d..a83ba39251a9 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -72,7 +72,7 @@ static int __blk_rq_map_user_iov(struct request *rq,
 	if (copy)
 		bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
 	else
-		bio = bio_map_user_iov(q, iter, gfp_mask);
+		bio = bio_map_user_iov(q, iter, gfp_mask, req_op(rq));
 
 	if (IS_ERR(bio))
 		return PTR_ERR(bio);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 745ec592a513..c06c796742ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1195,6 +1195,19 @@ static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1206,6 +1219,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1264,6 +1278,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_mq_handle_dev_resource(rq, list);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive
+			 * accepts.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1275,6 +1299,9 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index be1dca0103a4..ac0711803ee7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -257,6 +259,21 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	unsigned int max_sectors;
+
+	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
+	q->limits.max_zone_append_sectors = max_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -506,6 +523,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@ static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..ea327f320b7f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1706,6 +1706,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 853d92ceee64..d69632cbab8d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -102,24 +102,7 @@ static inline void *bio_data(struct bio *bio)
 	return NULL;
 }
 
-/**
- * bio_full - check if the bio is full
- * @bio:	bio to check
- * @len:	length of one segment to be added
- *
- * Return true if @bio is full and one segment with @len bytes can't be
- * added to the bio, otherwise return false
- */
-static inline bool bio_full(struct bio *bio, unsigned len)
-{
-	if (bio->bi_vcnt >= bio->bi_max_vecs)
-		return true;
-
-	if (bio->bi_iter.bi_size > UINT_MAX - len)
-		return true;
-
-	return false;
-}
+bool bio_full(struct bio *bio, unsigned len);
 
 static inline bool bio_next_segment(const struct bio *bio,
 				    struct bvec_iter_all *iter)
@@ -435,6 +418,7 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off, bool *same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
@@ -443,7 +427,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
 void bio_release_pages(struct bio *bio, bool mark_dirty);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
-				    struct iov_iter *, gfp_t);
+				    struct iov_iter *, gfp_t, unsigned int);
 extern void bio_unmap_user(struct bio *);
 extern struct bio *bio_map_kern(struct request_queue *, void *, unsigned int,
 				gfp_t);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..824ec2d89954 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone specific resource and IO to a different zone on the same device could
+ * still be served. Examples of that are zones that are write-locked, but a read
+ * to the same zone could be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -296,6 +308,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25b63f714619..36111b10d514 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -757,6 +758,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1088,6 +1092,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1301,6 +1307,11 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
-- 
2.24.1



* [PATCH v2 04/11] block: introduce blk_req_zone_write_trylock
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 05/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn

Introduce blk_req_zone_write_trylock(), a variant of
blk_req_zone_write_lock() which only tries to take the zone write lock
and returns false if the target zone of the request is already locked.
This will be used by the SCSI disk driver's zone append emulation.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c      | 14 ++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 6b442ae96499..aefd21a5e6ff 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -50,6 +50,20 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
+bool blk_req_zone_write_trylock(struct request *rq)
+{
+	unsigned int zno = blk_rq_zone_no(rq);
+
+	if (test_and_set_bit(zno, rq->q->seq_zones_wlock))
+		return false;
+
+	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
+	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock);
+
 void __blk_req_zone_write_lock(struct request *rq)
 {
 	if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36111b10d514..e591b22ace03 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1746,6 +1746,7 @@ extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 bool blk_req_needs_zone_write_lock(struct request *rq);
+bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
 void __blk_req_zone_write_unlock(struct request *rq);
 
-- 
2.24.1



* [PATCH v2 05/11] block: Introduce zone write pointer offset caching
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 04/11] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 06/11] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Damien Le Moal, Johannes Thumshirn

From: Damien Le Moal <damien.lemoal@wdc.com>

Not all zoned block devices natively support the zone append command.
E.g. SCSI and ATA disks do not define this command. However, it is
possible to emulate this command at the LLD level using regular write
commands if the write pointer position of zones is known. Introducing
such emulation enables the use of zone append writes for all zoned
block device types, thereby simplifying, for instance, the
implementation of file systems' native zoned block device support by
avoiding the need for different write paths depending on the device
capabilities.

To allow devices without zone append command support to emulate its
behavior, introduce a zone write pointer cache attached to the device
request_queue, similar to the zone bitmaps. To save memory, this cache
stores the write pointer offset relative to each zone start sector as a
32-bit value, rather than the 64-bit absolute sector position of each
zone write pointer.

While it would be natural to have each LLD implement zone write pointer
caching as needed, integrating the cache allocation and initialization
within blk_revalidate_disk_zones() greatly simplifies the code and
avoids potential races between the zone wp array size and the device's
known number of zones in case changes are detected during device
revalidation. Furthermore, initializing the zone wp array together with
the device queue zone bitmaps when blk_revalidate_disk_zones() executes
a full device zone report avoids the need for an additional full device
zone report execution in the LLD revalidate method. This can
significantly reduce the overhead of device revalidation, as larger
capacity SMR drives result in very costly full drive report zones
processing. E.g., with a 20TB SMR disk and 256 MB zones, more than
75000 zones need to be reported using multiple report zones commands.
The added delay of an additional full zone report is significant and
can be avoided with an initialization within
blk_revalidate_disk_zones().

By default, blk_revalidate_disk_zones() will not allocate and
initialize a drive zone wp array. The allocation and initialization of
this cache is done only if a device driver requests it with the
QUEUE_FLAG_ZONE_WP_OFST queue flag. The allocation and initialization
of the cache is done in the same manner as for the zone bitmaps, within
the report zones callback function used by blk_revalidate_disk_zones().
In case of changes to the device zone configuration, the cache is
updated under a queue freeze to avoid any race between the device
driver use of the cache and the request queue update.

Freeing of this new cache is done together with the zone bitmaps from
the function blk_queue_free_zone_bitmaps(), renamed here to
blk_queue_free_zone_resources().

Maintaining the write pointer offset values is the responsibility of
the device LLD. The helper function blk_get_zone_wp_offset() is
provided to simplify this task.
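
For example (a minimal sketch, not part of this patch), an LLD's report
zones callback could fill the cache with:

	static int update_wp_ofst_cb(struct blk_zone *zone,
				     unsigned int idx, void *data)
	{
		struct request_queue *q = data;

		return blk_get_zone_wp_offset(zone,
					      &q->seq_zones_wp_ofst[idx]);
	}

The sd driver patch later in this series does essentially this, with
additional locking around the cache update.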

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-sysfs.c      |  2 +-
 block/blk-zoned.c      | 69 ++++++++++++++++++++++++++++++++++++++++--
 block/blk.h            |  4 +--
 include/linux/blkdev.h | 20 ++++++++----
 4 files changed, 84 insertions(+), 11 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 02643e149d5e..bd0c9b4c1c5b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -901,7 +901,7 @@ static void __blk_release_queue(struct work_struct *work)
 
 	blk_exit_queue(q);
 
-	blk_queue_free_zone_bitmaps(q);
+	blk_queue_free_zone_resources(q);
 
 	if (queue_is_mq(q))
 		blk_mq_release(q);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index aefd21a5e6ff..c2172f502969 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -344,18 +344,65 @@ static inline unsigned long *blk_alloc_zone_bitmap(int node,
 			    GFP_NOIO, node);
 }
 
-void blk_queue_free_zone_bitmaps(struct request_queue *q)
+static inline unsigned int *blk_alloc_zone_wp_ofst(unsigned int nr_zones)
+{
+	return kvcalloc(nr_zones, sizeof(unsigned int), GFP_NOIO);
+}
+
+void blk_queue_free_zone_resources(struct request_queue *q)
 {
 	kfree(q->conv_zones_bitmap);
 	q->conv_zones_bitmap = NULL;
 	kfree(q->seq_zones_wlock);
 	q->seq_zones_wlock = NULL;
+	kvfree(q->seq_zones_wp_ofst);
+	q->seq_zones_wp_ofst = NULL;
 }
 
+/**
+ * blk_get_zone_wp_ofst - Calculate a zone write pointer offset position
+ * @zone:	Target zone
+ * @wp_ofst:	Calculated write pointer offset
+ *
+ * Helper function for low-level device drivers to obtain a zone write pointer
+ * position relative to the zone start sector (write pointer offset). The write
+ * pointer offset depends on the zone condition. If the zone has an invalid
+ * condition, -ENODEV is returned.
+ */
+int blk_get_zone_wp_offset(struct blk_zone *zone, unsigned int *wp_ofst)
+{
+	switch (zone->cond) {
+	case BLK_ZONE_COND_EMPTY:
+		*wp_ofst = 0;
+		return 0;
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		*wp_ofst = zone->wp - zone->start;
+		return 0;
+	case BLK_ZONE_COND_FULL:
+		*wp_ofst = zone->len;
+		return 0;
+	case BLK_ZONE_COND_NOT_WP:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+		/*
+		 * Conventional, offline and read-only zones do not have a valid
+		 * write pointer. Use 0 as a dummy value.
+		 */
+		*wp_ofst = 0;
+		return 0;
+	default:
+		return -ENODEV;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_get_zone_wp_offset);
+
 struct blk_revalidate_zone_args {
 	struct gendisk	*disk;
 	unsigned long	*conv_zones_bitmap;
 	unsigned long	*seq_zones_wlock;
+	unsigned int	*seq_zones_wp_ofst;
 	unsigned int	nr_zones;
 	sector_t	zone_sectors;
 	sector_t	sector;
@@ -371,6 +418,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	struct gendisk *disk = args->disk;
 	struct request_queue *q = disk->queue;
 	sector_t capacity = get_capacity(disk);
+	int ret;
 
 	/*
 	 * All zones must have the same size, with the exception on an eventual
@@ -406,6 +454,13 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		return -ENODEV;
 	}
 
+	if (blk_queue_zone_wp_ofst(q) && !args->seq_zones_wp_ofst) {
+		args->seq_zones_wp_ofst =
+			blk_alloc_zone_wp_ofst(args->nr_zones);
+		if (!args->seq_zones_wp_ofst)
+			return -ENOMEM;
+	}
+
 	/* Check zone type */
 	switch (zone->type) {
 	case BLK_ZONE_TYPE_CONVENTIONAL:
@@ -432,6 +487,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		return -ENODEV;
 	}
 
+	if (args->seq_zones_wp_ofst) {
+		/* Initialize the zone write pointer offset */
+		ret = blk_get_zone_wp_offset(zone,
+					&args->seq_zones_wp_ofst[idx]);
+		if (ret)
+			return ret;
+	}
+
 	args->sector += zone->len;
 	return 0;
 }
@@ -480,15 +543,17 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		q->nr_zones = args.nr_zones;
 		swap(q->seq_zones_wlock, args.seq_zones_wlock);
 		swap(q->conv_zones_bitmap, args.conv_zones_bitmap);
+		swap(q->seq_zones_wp_ofst, args.seq_zones_wp_ofst);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
-		blk_queue_free_zone_bitmaps(q);
+		blk_queue_free_zone_resources(q);
 	}
 	blk_mq_unfreeze_queue(q);
 
 	kfree(args.seq_zones_wlock);
 	kfree(args.conv_zones_bitmap);
+	kvfree(args.seq_zones_wp_ofst);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
diff --git a/block/blk.h b/block/blk.h
index 670337b7cfa0..8e6f4db1c7d2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -349,9 +349,9 @@ static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
 
 #ifdef CONFIG_BLK_DEV_ZONED
-void blk_queue_free_zone_bitmaps(struct request_queue *q);
+void blk_queue_free_zone_resources(struct request_queue *q);
 #else
-static inline void blk_queue_free_zone_bitmaps(struct request_queue *q) {}
+static inline void blk_queue_free_zone_resources(struct request_queue *q) {}
 #endif
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e591b22ace03..950d3476918c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -363,6 +363,7 @@ extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 			    sector_t sectors, sector_t nr_sectors,
 			    gfp_t gfp_mask);
 extern int blk_revalidate_disk_zones(struct gendisk *disk);
+int blk_get_zone_wp_offset(struct blk_zone *zone, unsigned int *wp_ofst);
 
 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 				     unsigned int cmd, unsigned long arg);
@@ -499,14 +500,17 @@ struct request_queue {
 	/*
 	 * Zoned block device information for request dispatch control.
 	 * nr_zones is the total number of zones of the device. This is always
-	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of nr_zones
-	 * bits which indicates if a zone is conventional (bit set) or
+	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of
+	 * nr_zones bits which indicates if a zone is conventional (bit set) or
 	 * sequential (bit clear). seq_zones_wlock is a bitmap of nr_zones
 	 * bits which indicates if a zone is write locked, that is, if a write
-	 * request targeting the zone was dispatched. All three fields are
-	 * initialized by the low level device driver (e.g. scsi/sd.c).
-	 * Stacking drivers (device mappers) may or may not initialize
-	 * these fields.
+	 * request targeting the zone was dispatched. seq_zones_wp_ofst is an
+	 * array of nr_zones write pointer values relative to the zone start
+	 * sector. This is only initialized for LLDs needing zone append write
+	 * command emulation with regular write. All fields are initialized by
+	 * the blk_revalidate_disk_zones() function when called by the low
+	 * level device driver (e.g. scsi/sd.c). Stacking drivers (device
+	 * mappers) may or may not initialize these fields.
 	 *
 	 * Reads of this information must be protected with blk_queue_enter() /
 	 * blk_queue_exit(). Modifying this information is only allowed while
@@ -516,6 +520,7 @@ struct request_queue {
 	unsigned int		nr_zones;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	unsigned int		*seq_zones_wp_ofst;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 	/*
@@ -613,6 +618,7 @@ struct request_queue {
 #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
 #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
+#define QUEUE_FLAG_ZONE_WP_OFST	28	/* queue needs zone wp offsets */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP))
@@ -647,6 +653,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #else
 #define blk_queue_rq_alloc_time(q)	false
 #endif
+#define blk_queue_zone_wp_ofst(q)	\
+	test_bit(QUEUE_FLAG_ZONE_WP_OFST, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.24.1



* [PATCH v2 06/11] scsi: sd_zbc: factor out sanity checks for zoned commands
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 05/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 07/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn

Factor out the sanity checks for zoned commands from
sd_zbc_setup_zone_mgmt_cmnd().

This will help with the introduction of an emulated ZONE_APPEND command.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/scsi/sd_zbc.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index f45c22b09726..ee156fbf3780 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -209,6 +209,26 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 	return ret;
 }
 
+static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	sector_t sector = blk_rq_pos(rq);
+
+	if (!sd_is_zoned(sdkp))
+		/* Not a zoned device */
+		return BLK_STS_IOERR;
+
+	if (sdkp->device->changed)
+		return BLK_STS_IOERR;
+
+	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
+		/* Unaligned request */
+		return BLK_STS_IOERR;
+
+	return BLK_STS_OK;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -223,20 +243,14 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all)
 {
 	struct request *rq = cmd->request;
-	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t block = sectors_to_logical(sdkp->device, sector);
+	blk_status_t ret;
 
-	if (!sd_is_zoned(sdkp))
-		/* Not a zoned device */
-		return BLK_STS_IOERR;
-
-	if (sdkp->device->changed)
-		return BLK_STS_IOERR;
-
-	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
-		/* Unaligned request */
-		return BLK_STS_IOERR;
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
 
 	cmd->cmd_len = 16;
 	memset(cmd->cmnd, 0, cmd->cmd_len);
-- 
2.24.1



* [PATCH v2 07/11] scsi: sd_zbc: emulate ZONE_APPEND commands
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 06/11] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 08/11] null_blk: Cleanup zoned device initialization Johannes Thumshirn
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi@vger.kernel.org, Martin K. Petersen,
	linux-fsdevel@vger.kernel.org, Darrick J. Wong,
	Johannes Thumshirn

Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) with a
start LBA set to the target zone write pointer position.

In order to always know the write pointer position of a sequential write
zone, the queue flag QUEUE_FLAG_ZONE_WP_OFST is set to get an
initialized write pointer offset array attached to the device request
queue. The values of the cache are maintained in sync with the device
as follows:
1) the write pointer offset of a zone is reset to 0 when a
   REQ_OP_ZONE_RESET command completes.
2) the write pointer offset of a zone is set to the zone size when a
   REQ_OP_ZONE_FINISH command completes.
3) the write pointer offset of a zone is incremented by the number of
   512B sectors written when a write or a zone append command completes.
4) the write pointer offset of all zones is reset to 0 when a
   REQ_OP_ZONE_RESET_ALL command completes.
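
In pseudo-code, and ignoring the locking described below, this
completion-time bookkeeping amounts to (sketch):

	switch (req_op(rq)) {
	case REQ_OP_WRITE:
	case REQ_OP_ZONE_APPEND:
		q->seq_zones_wp_ofst[zno] += good_sectors;	/* 3) */
		break;
	case REQ_OP_ZONE_RESET:
		q->seq_zones_wp_ofst[zno] = 0;			/* 1) */
		break;
	case REQ_OP_ZONE_FINISH:
		q->seq_zones_wp_ofst[zno] = zone_sectors;	/* 2) */
		break;
	case REQ_OP_ZONE_RESET_ALL:
		memset(q->seq_zones_wp_ofst, 0,			/* 4) */
		       q->nr_zones * sizeof(unsigned int));
		break;
	default:
		break;
	}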

Since the block layer does not write lock zones for zone append
commands, to ensure a sequential ordering of the write commands used for
the emulation, the target zone of a zone append command is locked when
the function sd_zbc_prepare_zone_append() is called from
sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
(e.g. a zone append is in-flight or a regular write has already locked
the zone), the zone append command dispatching is delayed by returning
BLK_STS_ZONE_RESOURCE.

Since zone reset and finish operations can be issued concurrently with
writes and zone append requests, ensure a coherent update of the zone
write pointer offsets by also write locking the target zones for these
zone management requests.

Finally, to avoid the need for write locking all zones for
REQ_OP_ZONE_RESET_ALL requests, use a spinlock to protect accesses and
modifications of the zone write pointer offsets. This spinlock is
initialized from sd_probe() using the new function sd_zbc_init().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/scsi/sd.c     |  28 +++-
 drivers/scsi/sd.h     |  36 ++++-
 drivers/scsi/sd_zbc.c | 316 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 363 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 707f47c0ec98..18584bf01e11 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1215,6 +1215,12 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	else
 		protect = 0;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
+		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
+		if (ret)
+			return ret;
+	}
+
 	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
 		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua);
@@ -1287,6 +1293,7 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		return sd_setup_flush_cmnd(cmd);
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
+	case REQ_OP_ZONE_APPEND:
 		return sd_setup_read_write_cmnd(cmd);
 	case REQ_OP_ZONE_RESET:
 		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
@@ -2055,7 +2062,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
  out:
 	if (sd_is_zoned(sdkp))
-		sd_zbc_complete(SCpnt, good_bytes, &sshdr);
+		good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr);
 
 	SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt,
 					   "sd_done: completed %d of %d bytes\n",
@@ -3370,6 +3377,8 @@ static int sd_probe(struct device *dev)
 	sdkp->first_scan = 1;
 	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
 
+	sd_zbc_init_disk(sdkp);
+
 	sd_revalidate_disk(gd);
 
 	gd->flags = GENHD_FL_EXT_DEVT;
@@ -3663,19 +3672,26 @@ static int __init init_sd(void)
 	if (!sd_page_pool) {
 		printk(KERN_ERR "sd: can't init discard page pool\n");
 		err = -ENOMEM;
-		goto err_out_ppool;
+		goto err_out_cdb_pool;
 	}
 
+	err = sd_zbc_init();
+	if (err)
+		goto err_out_ppool;
+
 	err = scsi_register_driver(&sd_template.gendrv);
 	if (err)
-		goto err_out_driver;
+		goto err_out_zbc;
 
 	return 0;
 
-err_out_driver:
-	mempool_destroy(sd_page_pool);
+err_out_zbc:
+	sd_zbc_exit();
 
 err_out_ppool:
+	mempool_destroy(sd_page_pool);
+
+err_out_cdb_pool:
 	mempool_destroy(sd_cdb_pool);
 
 err_out_cache:
@@ -3705,6 +3721,8 @@ static void __exit exit_sd(void)
 	mempool_destroy(sd_page_pool);
 	kmem_cache_destroy(sd_cdb_cache);
 
+	sd_zbc_exit();
+
 	class_unregister(&sd_disk_class);
 
 	for (i = 0; i < SD_MAJORS; i++) {
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 50fff0bf8c8e..34641be1d434 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -79,6 +79,7 @@ struct scsi_disk {
 	u32		zones_optimal_open;
 	u32		zones_optimal_nonseq;
 	u32		zones_max_open;
+	spinlock_t	zone_wp_ofst_lock;
 #endif
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
@@ -207,17 +208,33 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
+int __init sd_zbc_init(void);
+void sd_zbc_exit(void);
+
+void sd_zbc_init_disk(struct scsi_disk *sdkp);
 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
 extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 					 unsigned char op, bool all);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-			    struct scsi_sense_hdr *sshdr);
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+			     struct scsi_sense_hdr *sshdr);
 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+				        unsigned int nr_blocks);
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
+static inline int sd_zbc_init(void)
+{
+	return 0;
+}
+
+static inline void sd_zbc_exit(void) {}
+
+static inline void sd_zbc_init_disk(struct scsi_disk *sdkp) {}
+
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
 				    unsigned char *buf)
 {
@@ -233,9 +250,18 @@ static inline blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_TARGET;
 }
 
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-				   unsigned int good_bytes,
-				   struct scsi_sense_hdr *sshdr) {}
+static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
+			unsigned int good_bytes, struct scsi_sense_hdr *sshdr)
+{
+	return 0;
+}
+
+static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
+						      sector_t *lba,
+						      unsigned int nr_blocks)
+{
+	return BLK_STS_TARGET;
+}
 
 #define sd_zbc_report_zones NULL
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index ee156fbf3780..17bdc50d29f3 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -19,6 +19,11 @@
 
 #include "sd.h"
 
+static struct kmem_cache *sd_zbc_zone_work_cache;
+static mempool_t *sd_zbc_zone_work_pool;
+
+#define SD_ZBC_ZONE_WORK_MEMPOOL_SIZE	8
+
 static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 			       unsigned int idx, report_zones_cb cb, void *data)
 {
@@ -229,6 +234,152 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
+#define SD_ZBC_INVALID_WP_OFST	~(0u)
+#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
+
+struct sd_zbc_zone_work {
+	struct work_struct work;
+	struct scsi_disk *sdkp;
+	unsigned int zno;
+	char buf[SD_BUF_SIZE];
+};
+
+static int sd_zbc_update_wp_ofst_cb(struct blk_zone *zone, unsigned int idx,
+				    void *data)
+{
+	struct sd_zbc_zone_work *zwork = data;
+	struct scsi_disk *sdkp = zwork->sdkp;
+	struct request_queue *q = sdkp->disk->queue;
+	int ret;
+
+	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
+	ret = blk_get_zone_wp_offset(zone, &q->seq_zones_wp_ofst[zwork->zno]);
+	if (ret)
+		q->seq_zones_wp_ofst[zwork->zno] = SD_ZBC_INVALID_WP_OFST;
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+
+	return ret;
+}
+
+static void sd_zbc_update_wp_ofst_workfn(struct work_struct *work)
+{
+	struct sd_zbc_zone_work *zwork;
+	struct scsi_disk *sdkp;
+	int ret;
+
+	zwork = container_of(work, struct sd_zbc_zone_work, work);
+	sdkp = zwork->sdkp;
+
+	ret = sd_zbc_do_report_zones(sdkp, zwork->buf, SD_BUF_SIZE,
+				     zwork->zno * sdkp->zone_blocks, true);
+	if (!ret)
+		sd_zbc_parse_report(sdkp, zwork->buf + 64, 0,
+				    sd_zbc_update_wp_ofst_cb, zwork);
+
+	mempool_free(zwork, sd_zbc_zone_work_pool);
+	scsi_device_put(sdkp->device);
+}
+
+static blk_status_t sd_zbc_update_wp_ofst(struct scsi_disk *sdkp,
+					  unsigned int zno)
+{
+	struct sd_zbc_zone_work *zwork;
+
+	/*
+	 * We are about to schedule work to update a zone write pointer offset,
+	 * which will cause the zone append command to be requeued. So make
+	 * sure that the scsi device does not go away while the work is
+	 * being processed.
+	 */
+	if (scsi_device_get(sdkp->device))
+		return BLK_STS_IOERR;
+
+	zwork = mempool_alloc(sd_zbc_zone_work_pool, GFP_ATOMIC);
+	if (!zwork) {
+		/* Retry later */
+		scsi_device_put(sdkp->device);
+		return BLK_STS_RESOURCE;
+	}
+
+	memset(zwork, 0, sizeof(struct sd_zbc_zone_work));
+	INIT_WORK(&zwork->work, sd_zbc_update_wp_ofst_workfn);
+	zwork->sdkp = sdkp;
+	zwork->zno = zno;
+
+	sdkp->disk->queue->seq_zones_wp_ofst[zno] = SD_ZBC_UPDATING_WP_OFST;
+
+	schedule_work(&zwork->work);
+
+	return BLK_STS_RESOURCE;
+}
+
+/**
+ * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
+ * @cmd: the command to setup
+ * @lba: the LBA to patch
+ * @nr_blocks: the number of LBAs to be written
+ *
+ * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
+ * sd_zbc_prepare_zone_append() handles the necessary zone write locking and
+ * patching of the LBA for an emulated ZONE_APPEND command.
+ *
+ * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will
+ * schedule a REPORT ZONES command and return %BLK_STS_RESOURCE to requeue.
+ */
+blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
+					unsigned int nr_blocks)
+{
+	struct request *rq = cmd->request;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	unsigned int wp_ofst, zno = blk_rq_zone_no(rq);
+	blk_status_t ret;
+
+	ret = sd_zbc_cmnd_checks(cmd);
+	if (ret != BLK_STS_OK)
+		return ret;
+
+	if (!blk_rq_zone_is_seq(rq))
+		return BLK_STS_IOERR;
+
+	/* Unlock of the write lock will happen in sd_zbc_complete() */
+	if (!blk_req_zone_write_trylock(rq))
+		return BLK_STS_ZONE_RESOURCE;
+
+	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
+
+	wp_ofst = rq->q->seq_zones_wp_ofst[zno];
+
+	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
+		/* Write pointer offset update in progress: ask for a requeue */
+		ret = BLK_STS_RESOURCE;
+		goto err;
+	}
+
+	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
+		/* Invalid write pointer offset: trigger an update from disk */
+		ret = sd_zbc_update_wp_ofst(sdkp, zno);
+		goto err;
+	}
+
+	wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
+	if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
+		ret = BLK_STS_IOERR;
+		goto err;
+	}
+
+	/* Set the LBA for the write command used to emulate zone append */
+	*lba += wp_ofst;
+
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+
+	return BLK_STS_OK;
+
+err:
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+	blk_req_zone_write_unlock(rq);
+	return ret;
+}
+
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -266,25 +417,75 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	cmd->transfersize = 0;
 	cmd->allowed = 0;
 
+	/* Only zone reset and zone finish need zone write locking */
+	if (op != ZO_RESET_WRITE_POINTER && op != ZO_FINISH_ZONE)
+		return BLK_STS_OK;
+
+	if (all) {
+		/* We do not write lock all zones for an all zone reset */
+		if (op == ZO_RESET_WRITE_POINTER)
+			return BLK_STS_OK;
+
+		/* Finishing all zones is not supported */
+		return BLK_STS_IOERR;
+	}
+
+	if (!blk_rq_zone_is_seq(rq))
+		return BLK_STS_IOERR;
+
+	if (!blk_req_zone_write_trylock(rq))
+		return BLK_STS_ZONE_RESOURCE;
+
 	return BLK_STS_OK;
 }
 
+static inline bool sd_zbc_zone_needs_write_unlock(struct request *rq)
+{
+	/*
+	 * For zone append, the zone was locked in sd_zbc_prepare_zone_append().
+	 * For zone reset and zone finish, the zone was locked in
+	 * sd_zbc_setup_zone_mgmt_cmnd().
+	 * For regular writes, the zone is unlocked by the block layer elevator.
+	 */
+	return req_op(rq) == REQ_OP_ZONE_APPEND ||
+		req_op(rq) == REQ_OP_ZONE_RESET ||
+		req_op(rq) == REQ_OP_ZONE_FINISH;
+}
+
+static bool sd_zbc_need_zone_wp_update(struct request *rq)
+{
+	if (req_op(rq) == REQ_OP_WRITE ||
+	    req_op(rq) == REQ_OP_WRITE_ZEROES ||
+	    req_op(rq) == REQ_OP_WRITE_SAME)
+		return blk_rq_zone_is_seq(rq);
+
+	if (req_op(rq) == REQ_OP_ZONE_RESET_ALL)
+		return true;
+
+	return sd_zbc_zone_needs_write_unlock(rq);
+}
+
 /**
  * sd_zbc_complete - ZBC command post processing.
  * @cmd: Completed command
  * @good_bytes: Command reply bytes
  * @sshdr: command sense header
  *
- * Called from sd_done(). Process report zones reply and handle reset zone
- * and write commands errors.
+ * Called from sd_done() to handle zone command errors and updates to the
+ * device queue zone write pointer offset cache.
  */
-void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr)
 {
 	int result = cmd->result;
 	struct request *rq = cmd->request;
+	struct request_queue *q = rq->q;
+	struct gendisk *disk = rq->rq_disk;
+	struct scsi_disk *sdkp = scsi_disk(disk);
+	enum req_opf op = req_op(rq);
+	unsigned int zno;
 
-	if (op_is_zone_mgmt(req_op(rq)) &&
+	if (op_is_zone_mgmt(op) &&
 	    result &&
 	    sshdr->sense_key == ILLEGAL_REQUEST &&
 	    sshdr->asc == 0x24) {
@@ -294,7 +495,69 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		 * so be quiet about the error.
 		 */
 		rq->rq_flags |= RQF_QUIET;
+		goto unlock_zone;
+	}
+
+	if (!sd_zbc_need_zone_wp_update(rq))
+		goto unlock_zone;
+
+	/*
+	 * If we got an error for a command that needs updating the write
+	 * pointer offset cache, we must mark the zone wp offset entry as
+	 * invalid to force an update from disk the next time a zone append
+	 * command is issued.
+	 */
+	zno = blk_rq_zone_no(rq);
+	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
+
+	if (result && op != REQ_OP_ZONE_RESET_ALL) {
+		if (op == REQ_OP_ZONE_APPEND) {
+			/* Force complete completion (no retry) */
+			good_bytes = 0;
+			scsi_set_resid(cmd, blk_rq_bytes(rq));
+		}
+
+		/*
+		 * Force an update of the zone write pointer offset on
+		 * the next zone append access.
+		 */
+		if (q->seq_zones_wp_ofst[zno] != SD_ZBC_UPDATING_WP_OFST)
+			q->seq_zones_wp_ofst[zno] = SD_ZBC_INVALID_WP_OFST;
+		goto unlock_wp_ofst;
 	}
+
+	switch (op) {
+	case REQ_OP_ZONE_APPEND:
+		rq->__sector += q->seq_zones_wp_ofst[zno];
+		/* fallthrough */
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE:
+		if (q->seq_zones_wp_ofst[zno] < sd_zbc_zone_sectors(sdkp))
+			q->seq_zones_wp_ofst[zno] += good_bytes >> SECTOR_SHIFT;
+		break;
+	case REQ_OP_ZONE_RESET:
+		q->seq_zones_wp_ofst[zno] = 0;
+		break;
+	case REQ_OP_ZONE_FINISH:
+		q->seq_zones_wp_ofst[zno] = sd_zbc_zone_sectors(sdkp);
+		break;
+	case REQ_OP_ZONE_RESET_ALL:
+		memset(q->seq_zones_wp_ofst, 0,
+		       sdkp->nr_zones * sizeof(unsigned int));
+		break;
+	default:
+		break;
+	}
+
+unlock_wp_ofst:
+	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
+
+unlock_zone:
+	if (sd_zbc_zone_needs_write_unlock(rq))
+		blk_req_zone_write_unlock(rq);
+
+	return good_bytes;
 }
 
 /**
@@ -399,6 +662,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf,
 int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 {
 	struct gendisk *disk = sdkp->disk;
+	struct request_queue *q = disk->queue;
 	unsigned int nr_zones;
 	u32 zone_blocks = 0;
 	int ret;
@@ -421,9 +685,12 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 		goto err;
 
 	/* The drive satisfies the kernel restrictions: set it up */
-	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue);
-	blk_queue_required_elevator_features(sdkp->disk->queue,
-					     ELEVATOR_F_ZBD_SEQ_WRITE);
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_WP_OFST, q);
+	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
+	blk_queue_max_zone_append_sectors(q,
+		min_t(u32, logical_to_sectors(sdkp->device, zone_blocks),
+		      q->limits.max_segments << (PAGE_SHIFT - SECTOR_SHIFT)));
 	nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
@@ -475,3 +742,38 @@ void sd_zbc_print_zones(struct scsi_disk *sdkp)
 			  sdkp->nr_zones,
 			  sdkp->zone_blocks);
 }
+
+void sd_zbc_init_disk(struct scsi_disk *sdkp)
+{
+	if (!sd_is_zoned(sdkp))
+		return;
+
+	spin_lock_init(&sdkp->zone_wp_ofst_lock);
+}
+
+int __init sd_zbc_init(void)
+{
+	sd_zbc_zone_work_cache =
+		kmem_cache_create("sd_zbc_zone_work",
+				  sizeof(struct sd_zbc_zone_work),
+				  0, 0, NULL);
+	if (!sd_zbc_zone_work_cache)
+		return -ENOMEM;
+
+	sd_zbc_zone_work_pool =
+		mempool_create_slab_pool(SD_ZBC_ZONE_WORK_MEMPOOL_SIZE,
+					 sd_zbc_zone_work_cache);
+	if (!sd_zbc_zone_work_pool) {
+		kmem_cache_destroy(sd_zbc_zone_work_cache);
+		printk(KERN_ERR "sd_zbc: create zone work pool failed\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void sd_zbc_exit(void)
+{
+	mempool_destroy(sd_zbc_zone_work_pool);
+	kmem_cache_destroy(sd_zbc_zone_work_cache);
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 08/11] null_blk: Cleanup zoned device initialization
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 07/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 09/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong,
	Damien Le Moal, Johannes Thumshirn

From: Damien Le Moal <damien.lemoal@wdc.com>

Move all zoned mode related code from null_blk_main.c to
null_blk_zoned.c, avoiding an ugly #ifdef in the process.
Rename null_zone_init() into null_init_zoned_dev(), null_zone_exit()
into null_free_zoned_dev() and add the new function
null_register_zoned_dev() to finalize the zoned dev setup before
add_disk().
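
For reference, the resulting call flow is (a sketch derived from the hunks
below, not additional code):

  null_add_dev()
    -> null_init_zoned_dev()       /* allocate zones, set up the queue */
  null_gendisk_register()
    -> null_register_zoned_dev()   /* revalidate zones / set nr_zones */
  null_free_dev()
    -> null_free_zoned_dev()       /* free the zone array */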

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/block/null_blk.h       | 14 ++++++++++----
 drivers/block/null_blk_main.c  | 26 ++++++--------------------
 drivers/block/null_blk_zoned.c | 21 +++++++++++++++++++--
 3 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h
index 62b660821dbc..2874463f1d42 100644
--- a/drivers/block/null_blk.h
+++ b/drivers/block/null_blk.h
@@ -86,8 +86,9 @@ struct nullb {
 };
 
 #ifdef CONFIG_BLK_DEV_ZONED
-int null_zone_init(struct nullb_device *dev);
-void null_zone_exit(struct nullb_device *dev);
+int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q);
+int null_register_zoned_dev(struct nullb *nullb);
+void null_free_zoned_dev(struct nullb_device *dev);
 int null_report_zones(struct gendisk *disk, sector_t sector,
 		      unsigned int nr_zones, report_zones_cb cb, void *data);
 blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
@@ -96,12 +97,17 @@ blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
 size_t null_zone_valid_read_len(struct nullb *nullb,
 				sector_t sector, unsigned int len);
 #else
-static inline int null_zone_init(struct nullb_device *dev)
+static inline int null_init_zoned_dev(struct nullb_device *dev,
+				      struct request_queue *q)
 {
 	pr_err("CONFIG_BLK_DEV_ZONED not enabled\n");
 	return -EINVAL;
 }
-static inline void null_zone_exit(struct nullb_device *dev) {}
+static inline int null_register_zoned_dev(struct nullb *nullb)
+{
+	return -ENODEV;
+}
+static inline void null_free_zoned_dev(struct nullb_device *dev) {}
 static inline blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
 					     enum req_opf op, sector_t sector,
 					     sector_t nr_sectors)
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index e9d66cc0d6b9..3e45e3640c12 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -580,7 +580,7 @@ static void null_free_dev(struct nullb_device *dev)
 	if (!dev)
 		return;
 
-	null_zone_exit(dev);
+	null_free_zoned_dev(dev);
 	badblocks_exit(&dev->badblocks);
 	kfree(dev);
 }
@@ -1605,19 +1605,11 @@ static int null_gendisk_register(struct nullb *nullb)
 	disk->queue		= nullb->q;
 	strncpy(disk->disk_name, nullb->disk_name, DISK_NAME_LEN);
 
-#ifdef CONFIG_BLK_DEV_ZONED
 	if (nullb->dev->zoned) {
-		if (queue_is_mq(nullb->q)) {
-			int ret = blk_revalidate_disk_zones(disk);
-			if (ret)
-				return ret;
-		} else {
-			blk_queue_chunk_sectors(nullb->q,
-					nullb->dev->zone_size_sects);
-			nullb->q->nr_zones = blkdev_nr_zones(disk);
-		}
+		int ret = null_register_zoned_dev(nullb);
+		if (ret)
+			return ret;
 	}
-#endif
 
 	add_disk(disk);
 	return 0;
@@ -1795,14 +1787,9 @@ static int null_add_dev(struct nullb_device *dev)
 	}
 
 	if (dev->zoned) {
-		rv = null_zone_init(dev);
+		rv = null_init_zoned_dev(dev, nullb->q);
 		if (rv)
 			goto out_cleanup_blk_queue;
-
-		nullb->q->limits.zoned = BLK_ZONED_HM;
-		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, nullb->q);
-		blk_queue_required_elevator_features(nullb->q,
-						ELEVATOR_F_ZBD_SEQ_WRITE);
 	}
 
 	nullb->q->queuedata = nullb;
@@ -1831,8 +1818,7 @@ static int null_add_dev(struct nullb_device *dev)
 
 	return 0;
 out_cleanup_zone:
-	if (dev->zoned)
-		null_zone_exit(dev);
+	null_free_zoned_dev(dev);
 out_cleanup_blk_queue:
 	blk_cleanup_queue(nullb->q);
 out_cleanup_tags:
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index ed34785dd64b..8259f3212a28 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -10,7 +10,7 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 	return sect >> ilog2(dev->zone_size_sects);
 }
 
-int null_zone_init(struct nullb_device *dev)
+int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 {
 	sector_t dev_size = (sector_t)dev->size * 1024 * 1024;
 	sector_t sector = 0;
@@ -58,10 +58,27 @@ int null_zone_init(struct nullb_device *dev)
 		sector += dev->zone_size_sects;
 	}
 
+	q->limits.zoned = BLK_ZONED_HM;
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
+
+	return 0;
+}
+
+int null_register_zoned_dev(struct nullb *nullb)
+{
+	struct request_queue *q = nullb->q;
+
+	if (queue_is_mq(q))
+		return blk_revalidate_disk_zones(nullb->disk);
+
+	blk_queue_chunk_sectors(q, nullb->dev->zone_size_sects);
+	q->nr_zones = blkdev_nr_zones(nullb->disk);
+
 	return 0;
 }
 
-void null_zone_exit(struct nullb_device *dev)
+void null_free_zoned_dev(struct nullb_device *dev)
 {
 	kvfree(dev->zones);
 }
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 09/11] null_blk: Support REQ_OP_ZONE_APPEND
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 08/11] null_blk: Cleanup zoned device initialization Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 10/11] iomap: Add support for zone append writes Johannes Thumshirn
  2020-03-24 15:24 ` [PATCH v2 11/11] zonefs: use zone-append for sequential zones Johannes Thumshirn
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong,
	Damien Le Moal, Johannes Thumshirn

From: Damien Le Moal <damien.lemoal@wdc.com>

Support REQ_OP_ZONE_APPEND requests for zoned mode null_blk devices.
Use the internally tracked zone write pointer position as the actual
write position, which is returned using the command request __sector
field in the case of an mq device and using the command BIO sector in
the case of a BIO device. Since the write position is used for data copy
in the case of a memory backed device, reverse the order in which
null_handle_zoned() and null_handle_memory_backed() are called to ensure
that null_handle_memory_backed() sees the correct write position for
REQ_OP_ZONE_APPEND operations.
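
That is, once such a request completes, a consumer would read the actual
write position back roughly as follows (hypothetical consumer code for
illustration only, not part of this patch):

	/* mq device: */
	sector_t written_pos = blk_rq_pos(rq);

	/* BIO based device: */
	sector_t written_pos = bio->bi_iter.bi_sector;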

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 drivers/block/null_blk_main.c  |  9 +++++---
 drivers/block/null_blk_zoned.c | 38 +++++++++++++++++++++++++++-------
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 3e45e3640c12..5492f1e49eee 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1300,12 +1300,15 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd, sector_t sector,
 			goto out;
 	}
 
+	if (dev->zoned) {
+		cmd->error = null_handle_zoned(cmd, op, sector, nr_sectors);
+		if (cmd->error != BLK_STS_OK)
+			goto out;
+	}
+
 	if (dev->memory_backed)
 		cmd->error = null_handle_memory_backed(cmd, op);
 
-	if (!cmd->error && dev->zoned)
-		cmd->error = null_handle_zoned(cmd, op, sector, nr_sectors);
-
 out:
 	nullb_complete_cmd(cmd);
 	return BLK_STS_OK;
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index 8259f3212a28..f20be7b91b9f 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -67,13 +67,22 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 int null_register_zoned_dev(struct nullb *nullb)
 {
+	struct nullb_device *dev = nullb->dev;
 	struct request_queue *q = nullb->q;
 
-	if (queue_is_mq(q))
-		return blk_revalidate_disk_zones(nullb->disk);
+	if (queue_is_mq(q)) {
+		int ret = blk_revalidate_disk_zones(nullb->disk);
+
+		if (ret)
+			return ret;
+	} else {
+		blk_queue_chunk_sectors(q, dev->zone_size_sects);
+		q->nr_zones = blkdev_nr_zones(nullb->disk);
+	}
 
-	blk_queue_chunk_sectors(q, nullb->dev->zone_size_sects);
-	q->nr_zones = blkdev_nr_zones(nullb->disk);
+	blk_queue_max_zone_append_sectors(q,
+			min_t(sector_t, q->limits.max_hw_sectors,
+			      dev->zone_size_sects));
 
 	return 0;
 }
@@ -133,7 +142,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
 }
 
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-		     unsigned int nr_sectors)
+				    unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
@@ -148,7 +157,20 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 	case BLK_ZONE_COND_IMP_OPEN:
 	case BLK_ZONE_COND_EXP_OPEN:
 	case BLK_ZONE_COND_CLOSED:
-		/* Writes must be at the write pointer position */
+		/*
+		 * Regular writes must be at the write pointer position.
+		 * Zone append writes are automatically issued at the write
+		 * pointer and the position returned using the request or BIO
+		 * sector.
+		 */
+		if (append) {
+			sector = zone->wp;
+			if (cmd->bio)
+				cmd->bio->bi_iter.bi_sector = sector;
+			else
+				cmd->rq->__sector = sector;
+		}
+
 		if (sector != zone->wp)
 			return BLK_STS_IOERR;
 
@@ -228,7 +250,9 @@ blk_status_t null_handle_zoned(struct nullb_cmd *cmd, enum req_opf op,
 {
 	switch (op) {
 	case REQ_OP_WRITE:
-		return null_zone_write(cmd, sector, nr_sectors);
+		return null_zone_write(cmd, sector, nr_sectors, false);
+	case REQ_OP_ZONE_APPEND:
+		return null_zone_write(cmd, sector, nr_sectors, true);
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_RESET_ALL:
 	case REQ_OP_ZONE_OPEN:
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (8 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 09/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  2020-03-24 15:41   ` Christoph Hellwig
  2020-03-24 22:45   ` Dave Chinner
  2020-03-24 15:24 ` [PATCH v2 11/11] zonefs: use zone-append for sequential zones Johannes Thumshirn
  10 siblings, 2 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong,
	Johannes Thumshirn

Use REQ_OP_ZONE_APPEND for direct I/O write BIOs, instead of REQ_OP_WRITE
if the file-system requests it. The file system can make this request
by setting the new flag IOCB_ZONE_APPEND for a direct I/O kiocb before
calling iomap_dio_rw(). Using this information, this function propagates
the zone append flag using IOMAP_ZONE_APPEND to the file system
iomap_begin() method. The BIOs submitted for the zone append DIO will be
set to use the REQ_OP_ZONE_APPEND operation.

Since zone append operations cannot be split, the iomap_apply() and
iomap_dio_rw() internal loops are executed only once, which may result
in short writes.
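
For illustration, a file system would opt in along these lines (a sketch
only; appending_to_sequential_zone() is a made-up helper and the ops
structures are placeholders - see the zonefs patch at the end of this
series for the real user):

	/* in the file system's ->write_iter(), before starting the DIO */
	if (appending_to_sequential_zone(inode) && is_sync_kiocb(iocb))
		iocb->ki_flags |= IOCB_ZONE_APPEND;

	ret = iomap_dio_rw(iocb, from, &fs_iomap_ops, &fs_dio_ops,
			   is_sync_kiocb(iocb));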

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/iomap/direct-io.c  | 80 ++++++++++++++++++++++++++++++++++++-------
 include/linux/fs.h    |  1 +
 include/linux/iomap.h | 22 ++++++------
 3 files changed, 79 insertions(+), 24 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..b3e2aadce72f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -17,6 +17,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_ZONE_APPEND	(1 << 27)
 #define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
@@ -39,6 +40,7 @@ struct iomap_dio {
 			struct task_struct	*waiter;
 			struct request_queue	*last_queue;
 			blk_qc_t		cookie;
+			sector_t		sector;
 		} submit;
 
 		/* used for aio completion: */
@@ -151,6 +153,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
 
+	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
+		dio->submit.sector = bio->bi_iter.bi_sector;
+
 	if (atomic_dec_and_test(&dio->ref)) {
 		if (dio->wait_for_completion) {
 			struct task_struct *waiter = dio->submit.waiter;
@@ -194,6 +199,21 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	iomap_dio_submit_bio(dio, iomap, bio);
 }
 
+static sector_t
+iomap_dio_bio_sector(struct iomap_dio *dio, struct iomap *iomap, loff_t pos)
+{
+	sector_t sector = iomap_sector(iomap, pos);
+
+	/*
+	 * For zone append writes, the BIO needs to point at the start of the
+	 * zone to append to.
+	 */
+	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
+		sector = ALIGN_DOWN(sector, bdev_zone_sectors(iomap->bdev));
+
+	return sector;
+}
+
 static loff_t
 iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap_dio *dio, struct iomap *iomap)
@@ -204,6 +224,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	bool use_fua = false;
+	bool zone_append = false;
 	int nr_pages, ret = 0;
 	size_t copied = 0;
 	size_t orig_count;
@@ -235,6 +256,9 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			use_fua = true;
 	}
 
+	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
+		zone_append = true;
+
 	/*
 	 * Save the original count and trim the iter to just the extent we
 	 * are operating on right now.  The iter will be re-expanded once
@@ -266,12 +290,28 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		bio = bio_alloc(GFP_KERNEL, nr_pages);
 		bio_set_dev(bio, iomap->bdev);
-		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
+		bio->bi_iter.bi_sector = iomap_dio_bio_sector(dio, iomap, pos);
 		bio->bi_write_hint = dio->iocb->ki_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
+		if (dio->flags & IOMAP_DIO_WRITE) {
+			bio->bi_opf = REQ_SYNC | REQ_IDLE;
+			if (zone_append)
+				bio->bi_opf |= REQ_OP_ZONE_APPEND;
+			else
+				bio->bi_opf |= REQ_OP_WRITE;
+			if (use_fua)
+				bio->bi_opf |= REQ_FUA;
+			else
+				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+		} else {
+			bio->bi_opf = REQ_OP_READ;
+		}
+
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
 		if (unlikely(ret)) {
 			/*
@@ -284,19 +324,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			goto zero_tail;
 		}
 
-		n = bio->bi_iter.bi_size;
-		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
-			if (use_fua)
-				bio->bi_opf |= REQ_FUA;
-			else
-				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+		n = bio->bi_iter.bi_size;
+
+		if (dio->flags & IOMAP_DIO_WRITE)
 			task_io_account_write(n);
-		} else {
-			bio->bi_opf = REQ_OP_READ;
-			if (dio->flags & IOMAP_DIO_DIRTY)
-				bio_set_pages_dirty(bio);
-		}
+		else if (dio->flags & IOMAP_DIO_DIRTY)
+			bio_set_pages_dirty(bio);
 
 		dio->size += n;
 		pos += n;
@@ -304,6 +335,15 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
 		iomap_dio_submit_bio(dio, iomap, bio);
+
+		/*
+		 * Issuing multiple BIOs for a large zone append write can
+		 * result in reordering of the write fragments and to data
+		 * corruption. So always stop after the first BIO is issued.
+		 */
+		if (zone_append)
+			break;
+
 	} while (nr_pages);
 
 	/*
@@ -446,6 +486,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
+		if (iocb->ki_flags & IOCB_ZONE_APPEND) {
+			flags |= IOMAP_ZONE_APPEND;
+			dio->flags |= IOMAP_DIO_ZONE_APPEND;
+		}
+
 		/* for data sync or sync, we need sync completion processing */
 		if (iocb->ki_flags & IOCB_DSYNC)
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
@@ -516,6 +561,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iov_iter_revert(iter, pos - dio->i_size);
 			break;
 		}
+
+		/*
+		 * Zone append writes cannot be split and may thus be short. Break
+		 * here to let the user know instead of sending more IOs which
+		 * could get reordered and corrupt the written data.
+		 */
+		if (flags & IOMAP_ZONE_APPEND)
+			break;
+
 	} while ((count = iov_iter_count(iter)) > 0);
 	blk_finish_plug(&plug);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..aa4ad705e549 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_ZONE_APPEND	(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..16c17a79e53d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -68,7 +68,6 @@ struct vm_fault;
  */
 #define IOMAP_F_PRIVATE		0x1000
 
-
 /*
  * Magic value for addr:
  */
@@ -95,6 +94,17 @@ iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
+/*
+ * Flags for iomap_begin / iomap_end.  No flag implies a read.
+ */
+#define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
+#define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
+#define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
+#define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
+#define IOMAP_DIRECT		(1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_ZONE_APPEND	(1 << 6) /* Use zone append for writes */
+
 /*
  * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
  * and page_done will be called for each page written to.  This only applies to
@@ -112,16 +122,6 @@ struct iomap_page_ops {
 			struct page *page, struct iomap *iomap);
 };
 
-/*
- * Flags for iomap_begin / iomap_end.  No flag implies a read.
- */
-#define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
-#define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
-#define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
-#define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
-#define IOMAP_DIRECT		(1 << 4) /* direct I/O */
-#define IOMAP_NOWAIT		(1 << 5) /* do not block */
-
 struct iomap_ops {
 	/*
 	 * Return the existing mapping at pos, or reserve space starting at
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 11/11] zonefs: use zone-append for sequential zones
  2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
                   ` (9 preceding siblings ...)
  2020-03-24 15:24 ` [PATCH v2 10/11] iomap: Add support for zone append writes Johannes Thumshirn
@ 2020-03-24 15:24 ` Johannes Thumshirn
  10 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 15:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong,
	Johannes Thumshirn

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/zonefs/super.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 69aee3dfb660..d08d715c99de 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -628,12 +628,17 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		goto inode_unlock;
 	}
 
-	/* Enforce sequential writes (append only) in sequential zones */
 	mutex_lock(&zi->i_truncate_mutex);
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && iocb->ki_pos != zi->i_wpoffset) {
-		mutex_unlock(&zi->i_truncate_mutex);
-		ret = -EINVAL;
-		goto inode_unlock;
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
+		/* Enforce sequential writes (append only) */
+		if (iocb->ki_pos != zi->i_wpoffset) {
+			mutex_unlock(&zi->i_truncate_mutex);
+			ret = -EINVAL;
+			goto inode_unlock;
+		}
+		/* Use zone append for sync write */
+		if (is_sync_kiocb(iocb))
+			iocb->ki_flags |= IOCB_ZONE_APPEND;
 	}
 	mutex_unlock(&zi->i_truncate_mutex);
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 15:24 ` [PATCH v2 10/11] iomap: Add support for zone append writes Johannes Thumshirn
@ 2020-03-24 15:41   ` Christoph Hellwig
  2020-03-25  5:27     ` Damien Le Moal
  2020-03-25  9:45     ` Johannes Thumshirn
  2020-03-24 22:45   ` Dave Chinner
  1 sibling, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2020-03-24 15:41 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org,
	Darrick J . Wong

On Wed, Mar 25, 2020 at 12:24:53AM +0900, Johannes Thumshirn wrote:
> @@ -39,6 +40,7 @@ struct iomap_dio {
>  			struct task_struct	*waiter;
>  			struct request_queue	*last_queue;
>  			blk_qc_t		cookie;
> +			sector_t		sector;
>  		} submit;
>  
>  		/* used for aio completion: */
> @@ -151,6 +153,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  	if (bio->bi_status)
>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>  
> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
> +		dio->submit.sector = bio->bi_iter.bi_sector;

The submit member in struct iomap_dio is for submit-time information,
while this is used in the completion path..
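
I.e. the field wants to live outside of the submit/aio union, something
like this (just a sketch):

	struct iomap_dio {
		...
		/* filled in at bio completion time for zone append: */
		sector_t		sector;

		union {
			struct { ... } submit;
			struct { ... } aio;
		};
	};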

>  		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
>  		iomap_dio_submit_bio(dio, iomap, bio);
> +
> +		/*
> +		 * Issuing multiple BIOs for a large zone append write can
> +		 * result in reordering of the write fragments and to data
> +		 * corruption. So always stop after the first BIO is issued.
> +		 */
> +		if (zone_append)
> +			break;

At least for a normal file system that is absolutely not true.  If
zonefs is so special it might be better off just using a slightly tweaked
copy of blkdev_direct_IO rather than using iomap.

> @@ -446,6 +486,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_WRITE;
>  		dio->flags |= IOMAP_DIO_WRITE;
>  
> +		if (iocb->ki_flags & IOCB_ZONE_APPEND) {
> +			flags |= IOMAP_ZONE_APPEND;
> +			dio->flags |= IOMAP_DIO_ZONE_APPEND;
> +		}
> +
>  		/* for data sync or sync, we need sync completion processing */
>  		if (iocb->ki_flags & IOCB_DSYNC)
>  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> @@ -516,6 +561,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			iov_iter_revert(iter, pos - dio->i_size);
>  			break;
>  		}
> +
> +		/*
> +		 * Zone append writes cannot be split and may thus be short. Break
> +		 * here to let the user know instead of sending more IOs which
> +		 * could get reordered and corrupt the written data.
> +		 */
> +		if (flags & IOMAP_ZONE_APPEND)
> +			break;

But that isn't what we do here.  You exit after a single apply iteration
which is perfectly fine - at least for a normal file system, zonefs
is rather weird.

> +
>  	} while ((count = iov_iter_count(iter)) > 0);
>  	blk_finish_plug(&plug);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..aa4ad705e549 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -314,6 +314,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> +#define IOCB_ZONE_APPEND	(1 << 8)

I don't think the iocb is the right interface for passing this
kind of information.  We currently pass a bool wait to iomap_dio_rw
which really should be flags.  I have a pending patch for that.
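
Roughly the following, with the names being a sketch of mine and not
necessarily what the pending patch uses:

	ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
			const struct iomap_ops *ops,
			const struct iomap_dio_ops *dops,
			unsigned int dio_flags);

so a zone append request could become another IOMAP_DIO_* flag instead of
an iocb flag.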

> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8b09463dae0d..16c17a79e53d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -68,7 +68,6 @@ struct vm_fault;
>   */
>  #define IOMAP_F_PRIVATE		0x1000
>  
> -

Spurious whitespace change.

>  /*
>   * Magic value for addr:
>   */
> @@ -95,6 +94,17 @@ iomap_sector(struct iomap *iomap, loff_t pos)
>  	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>  }
>  
> +/*
> + * Flags for iomap_begin / iomap_end.  No flag implies a read.
> + */
> +#define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
> +#define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
> +#define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
> +#define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
> +#define IOMAP_DIRECT		(1 << 4) /* direct I/O */
> +#define IOMAP_NOWAIT		(1 << 5) /* do not block */
> +#define IOMAP_ZONE_APPEND	(1 << 6) /* Use zone append for writes */

Why is this moved around?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND
  2020-03-24 15:24 ` [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
@ 2020-03-24 16:07   ` Johannes Thumshirn
  2020-03-25 16:24   ` Johannes Thumshirn
  1 sibling, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-24 16:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 24/03/2020 16:25, Johannes Thumshirn wrote:
> Also add bio_add_zone_append_page()
> similar to bio_add_pc_page() so producers can build BIOs respecting the
> new limit.

Bah, this statement is not true anymore; bio_full() does the necessary
checks now.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 15:24 ` [PATCH v2 10/11] iomap: Add support for zone append writes Johannes Thumshirn
  2020-03-24 15:41   ` Christoph Hellwig
@ 2020-03-24 22:45   ` Dave Chinner
  2020-03-25  5:38     ` Damien Le Moal
  2020-03-25  8:32     ` Johannes Thumshirn
  1 sibling, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2020-03-24 22:45 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Damien Le Moal,
	Keith Busch, linux-scsi @ vger . kernel . org,
	Martin K . Petersen, linux-fsdevel @ vger . kernel . org,
	Darrick J . Wong

On Wed, Mar 25, 2020 at 12:24:53AM +0900, Johannes Thumshirn wrote:
> Use REQ_OP_ZONE_APPEND for direct I/O write BIOs, instead of REQ_OP_WRITE
> if the file-system requests it. The file system can make this request
> by setting the new flag IOCB_ZONE_APPEND for a direct I/O kiocb before
> calling iomap_dio_rw(). Using this information, this function propagates
> the zone append flag using IOMAP_ZONE_APPEND to the file system
> iomap_begin() method. The BIOs submitted for the zone append DIO will be
> set to use the REQ_OP_ZONE_APPEND operation.
> 
> Since zone append operations cannot be split, the iomap_apply() and
> iomap_dio_rw() internal loops are executed only once, which may result
> in short writes.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/iomap/direct-io.c  | 80 ++++++++++++++++++++++++++++++++++++-------
>  include/linux/fs.h    |  1 +
>  include/linux/iomap.h | 22 ++++++------
>  3 files changed, 79 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23837926c0c5..b3e2aadce72f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -17,6 +17,7 @@
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_ZONE_APPEND	(1 << 27)
>  #define IOMAP_DIO_WRITE_FUA	(1 << 28)
>  #define IOMAP_DIO_NEED_SYNC	(1 << 29)
>  #define IOMAP_DIO_WRITE		(1 << 30)
> @@ -39,6 +40,7 @@ struct iomap_dio {
>  			struct task_struct	*waiter;
>  			struct request_queue	*last_queue;
>  			blk_qc_t		cookie;
> +			sector_t		sector;
>  		} submit;
>  
>  		/* used for aio completion: */
> @@ -151,6 +153,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  	if (bio->bi_status)
>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>  
> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
> +		dio->submit.sector = bio->bi_iter.bi_sector;
> +
>  	if (atomic_dec_and_test(&dio->ref)) {
>  		if (dio->wait_for_completion) {
>  			struct task_struct *waiter = dio->submit.waiter;
> @@ -194,6 +199,21 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio);
>  }
>  
> +static sector_t
> +iomap_dio_bio_sector(struct iomap_dio *dio, struct iomap *iomap, loff_t pos)
> +{
> +	sector_t sector = iomap_sector(iomap, pos);
> +
> +	/*
> +	 * For zone append writes, the BIO needs to point at the start of the
> +	 * zone to append to.
> +	 */
> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
> +		sector = ALIGN_DOWN(sector, bdev_zone_sectors(iomap->bdev));
> +
> +	return sector;
> +}

This seems to me like it should be done by the ->iomap_begin
implementation when mapping the IO. I don't see why this needs to be
specially handled by the iomap dio code.

> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -204,6 +224,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct bio *bio;
>  	bool need_zeroout = false;
>  	bool use_fua = false;
> +	bool zone_append = false;
>  	int nr_pages, ret = 0;
>  	size_t copied = 0;
>  	size_t orig_count;
> @@ -235,6 +256,9 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  			use_fua = true;
>  	}
>  
> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
> +		zone_append = true;
> +
>  	/*
>  	 * Save the original count and trim the iter to just the extent we
>  	 * are operating on right now.  The iter will be re-expanded once
> @@ -266,12 +290,28 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		bio = bio_alloc(GFP_KERNEL, nr_pages);
>  		bio_set_dev(bio, iomap->bdev);
> -		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> +		bio->bi_iter.bi_sector = iomap_dio_bio_sector(dio, iomap, pos);
>  		bio->bi_write_hint = dio->iocb->ki_hint;
>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> +		if (dio->flags & IOMAP_DIO_WRITE) {
> +			bio->bi_opf = REQ_SYNC | REQ_IDLE;
> +			if (zone_append)
> +				bio->bi_opf |= REQ_OP_ZONE_APPEND;
> +			else
> +				bio->bi_opf |= REQ_OP_WRITE;
> +			if (use_fua)
> +				bio->bi_opf |= REQ_FUA;
> +			else
> +				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +		} else {
> +			bio->bi_opf = REQ_OP_READ;
> +		}

Why move all this code? If it's needed, please split it into a
separate patch to separate it from the new functionality...

> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -284,19 +324,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  			goto zero_tail;
>  		}
>  
> -		n = bio->bi_iter.bi_size;
> -		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +		n = bio->bi_iter.bi_size;
> +
> +		if (dio->flags & IOMAP_DIO_WRITE)
>  			task_io_account_write(n);
> -		} else {
> -			bio->bi_opf = REQ_OP_READ;
> -			if (dio->flags & IOMAP_DIO_DIRTY)
> -				bio_set_pages_dirty(bio);
> -		}
> +		else if (dio->flags & IOMAP_DIO_DIRTY)
> +			bio_set_pages_dirty(bio);
>  
>  		dio->size += n;
>  		pos += n;
> @@ -304,6 +335,15 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
>  		iomap_dio_submit_bio(dio, iomap, bio);
> +
> +		/*
> +		 * Issuing multiple BIOs for a large zone append write can
> +		 * result in reordering of the write fragments and to data
> +		 * corruption. So always stop after the first BIO is issued.
> +		 */
> +		if (zone_append)
> +			break;

I don't think this sort of functionality should be tied to "zone
append". If there is a need for "issue a single (short) bio only" it
should be a flag to iomap_dio_rw() set by the filesystem, which can
then handle the short read/write that is returned.

> +		/*
> +		 * Zone append writes cannot be split and may thus be short. Break
> +		 * here to let the user know instead of sending more IOs which
> +		 * could get reordered and corrupt the written data.
> +		 */
> +		if (flags & IOMAP_ZONE_APPEND)
> +			break;

ditto.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 15:41   ` Christoph Hellwig
@ 2020-03-25  5:27     ` Damien Le Moal
  2020-03-25  6:25       ` hch
  2020-03-25  9:45     ` Johannes Thumshirn
  1 sibling, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2020-03-25  5:27 UTC (permalink / raw)
  To: hch, Johannes Thumshirn
  Cc: Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 2020/03/25 0:41, Christoph Hellwig wrote:
> On Wed, Mar 25, 2020 at 12:24:53AM +0900, Johannes Thumshirn wrote:
>> @@ -39,6 +40,7 @@ struct iomap_dio {
>>  			struct task_struct	*waiter;
>>  			struct request_queue	*last_queue;
>>  			blk_qc_t		cookie;
>> +			sector_t		sector;
>>  		} submit;
>>  
>>  		/* used for aio completion: */
>> @@ -151,6 +153,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>>  	if (bio->bi_status)
>>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>>  
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		dio->submit.sector = bio->bi_iter.bi_sector;
> 
> The submit member in struct iomap_dio is for submit-time information,
> while this is used in the completion path..
> 
>>  		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
>>  		iomap_dio_submit_bio(dio, iomap, bio);
>> +
>> +		/*
>> +		 * Issuing multiple BIOs for a large zone append write can
>> +		 * result in reordering of the write fragments and to data
>> +		 * corruption. So always stop after the first BIO is issued.
>> +		 */
>> +		if (zone_append)
>> +			break;
> 
> At least for a normal file system that is absolutely not true.  If
> zonefs is so special it might be better off just using a slightly tweaked
> copy of blkdev_direct_IO rather than using iomap.

It would be very nice to not have to add this direct BIO use case in zonefs
since that would be only for writes to sequential zones while all other
operations use iomap. So instead of this, what about using a flag as Dave
suggested (see below comment too)?

> 
>> @@ -446,6 +486,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>  		flags |= IOMAP_WRITE;
>>  		dio->flags |= IOMAP_DIO_WRITE;
>>  
>> +		if (iocb->ki_flags & IOCB_ZONE_APPEND) {
>> +			flags |= IOMAP_ZONE_APPEND;
>> +			dio->flags |= IOMAP_DIO_ZONE_APPEND;
>> +		}
>> +
>>  		/* for data sync or sync, we need sync completion processing */
>>  		if (iocb->ki_flags & IOCB_DSYNC)
>>  			dio->flags |= IOMAP_DIO_NEED_SYNC;
>> @@ -516,6 +561,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>  			iov_iter_revert(iter, pos - dio->i_size);
>>  			break;
>>  		}
>> +
>> +		/*
>> +		 * Zone append writes cannot be split and may thus be short. Break
>> +		 * here to let the user know instead of sending more IOs which
>> +		 * could get reordered and corrupt the written data.
>> +		 */
>> +		if (flags & IOMAP_ZONE_APPEND)
>> +			break;
> 
> But that isn't what we do here.  You exit after a single apply iteration
> which is perfectly fine - at least for a normal file system, zonefs
> is rather weird.

The comment is indeed not clear. For the short write, as Dave suggested, we
should have a special flag for it. So would you be OK if we replace this with
something like

		if (flags & IOMAP_SHORT_WRITE)
			break;

Normal file systems with real block mapping metadata would not need this as they
can perfectly handle non-sequential zone append fragments of a large DIO. But
zonefs will need the short writes since it lacks file block mapping metadata.

> 
>> +
>>  	} while ((count = iov_iter_count(iter)) > 0);
>>  	blk_finish_plug(&plug);
>>  
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 3cd4fe6b845e..aa4ad705e549 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -314,6 +314,7 @@ enum rw_hint {
>>  #define IOCB_SYNC		(1 << 5)
>>  #define IOCB_WRITE		(1 << 6)
>>  #define IOCB_NOWAIT		(1 << 7)
>> +#define IOCB_ZONE_APPEND	(1 << 8)
> 
> I don't think the iocb is the right interface for passing this
> kind of information.  We currently pass a bool wait to iomap_dio_rw
> which really should be flags.  I have a pending patch for that.

Is that patch queued in the iomap or xfs tree? Could you point us to it, please?

> 
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 8b09463dae0d..16c17a79e53d 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -68,7 +68,6 @@ struct vm_fault;
>>   */
>>  #define IOMAP_F_PRIVATE		0x1000
>>  
>> -
> 
> Spurious whitespace change.
> 
>>  /*
>>   * Magic value for addr:
>>   */
>> @@ -95,6 +94,17 @@ iomap_sector(struct iomap *iomap, loff_t pos)
>>  	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>>  }
>>  
>> +/*
>> + * Flags for iomap_begin / iomap_end.  No flag implies a read.
>> + */
>> +#define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
>> +#define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
>> +#define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
>> +#define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
>> +#define IOMAP_DIRECT		(1 << 4) /* direct I/O */
>> +#define IOMAP_NOWAIT		(1 << 5) /* do not block */
>> +#define IOMAP_ZONE_APPEND	(1 << 6) /* Use zone append for writes */
> 
> Why is this moved around?
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 22:45   ` Dave Chinner
@ 2020-03-25  5:38     ` Damien Le Moal
  2020-03-25  8:32     ` Johannes Thumshirn
  1 sibling, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2020-03-25  5:38 UTC (permalink / raw)
  To: Dave Chinner, Johannes Thumshirn
  Cc: Jens Axboe, hch, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 2020/03/25 7:46, Dave Chinner wrote:
> On Wed, Mar 25, 2020 at 12:24:53AM +0900, Johannes Thumshirn wrote:
>> Use REQ_OP_ZONE_APPEND for direct I/O write BIOs, instead of REQ_OP_WRITE
>> if the file-system requests it. The file system can make this request
>> by setting the new flag IOCB_ZONE_APPEND for a direct I/O kiocb before
>> calling iomap_dio_rw(). Using this information, this function propagates
>> the zone append flag using IOMAP_ZONE_APPEND to the file system
>> iomap_begin() method. The BIOs submitted for the zone append DIO will be
>> set to use the REQ_OP_ZONE_APPEND operation.
>>
>> Since zone append operations cannot be split, the iomap_apply() and
>> iomap_dio_rw() internal loops are executed only once, which may result
>> in short writes.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>  fs/iomap/direct-io.c  | 80 ++++++++++++++++++++++++++++++++++++-------
>>  include/linux/fs.h    |  1 +
>>  include/linux/iomap.h | 22 ++++++------
>>  3 files changed, 79 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 23837926c0c5..b3e2aadce72f 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -17,6 +17,7 @@
>>   * Private flags for iomap_dio, must not overlap with the public ones in
>>   * iomap.h:
>>   */
>> +#define IOMAP_DIO_ZONE_APPEND	(1 << 27)
>>  #define IOMAP_DIO_WRITE_FUA	(1 << 28)
>>  #define IOMAP_DIO_NEED_SYNC	(1 << 29)
>>  #define IOMAP_DIO_WRITE		(1 << 30)
>> @@ -39,6 +40,7 @@ struct iomap_dio {
>>  			struct task_struct	*waiter;
>>  			struct request_queue	*last_queue;
>>  			blk_qc_t		cookie;
>> +			sector_t		sector;
>>  		} submit;
>>  
>>  		/* used for aio completion: */
>> @@ -151,6 +153,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>>  	if (bio->bi_status)
>>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>>  
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		dio->submit.sector = bio->bi_iter.bi_sector;
>> +
>>  	if (atomic_dec_and_test(&dio->ref)) {
>>  		if (dio->wait_for_completion) {
>>  			struct task_struct *waiter = dio->submit.waiter;
>> @@ -194,6 +199,21 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>>  	iomap_dio_submit_bio(dio, iomap, bio);
>>  }
>>  
>> +static sector_t
>> +iomap_dio_bio_sector(struct iomap_dio *dio, struct iomap *iomap, loff_t pos)
>> +{
>> +	sector_t sector = iomap_sector(iomap, pos);
>> +
>> +	/*
>> +	 * For zone append writes, the BIO needs to point at the start of the
>> +	 * zone to append to.
>> +	 */
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		sector = ALIGN_DOWN(sector, bdev_zone_sectors(iomap->bdev));
>> +
>> +	return sector;
>> +}
> 
> This seems to me like it should be done by the ->iomap_begin
> implementation when mapping the IO. I don't see why this needs to be
> specially handled by the iomap dio code.

Fair point. However, iomap_sector() does not simply return iomap->addr >> 9. So
that means that in iomap_begin, the mapping address and offset returned need to
account for the iomap_sector() calculation, i.e.

(iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT

which does not result in very obvious values for the iomap address and offset.
Well I guess for zone append we simply need to set
iomap->offset = pos;
and
iomap->addr = zone start sector;
and that would then work. This however looks fragile to me.
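
For zonefs that would be something like this in zonefs_iomap_begin()
(untested sketch, using the zone start sector zonefs already tracks in
zi->i_zsector):

	if (flags & IOMAP_ZONE_APPEND) {
		/*
		 * Point the mapping at the zone start so that
		 * iomap_sector() resolves to the zone start sector.
		 */
		iomap->offset = pos;
		iomap->addr = (u64)zi->i_zsector << SECTOR_SHIFT;
	}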

> 
>> +
>>  static loff_t
>>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  		struct iomap_dio *dio, struct iomap *iomap)
>> @@ -204,6 +224,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  	struct bio *bio;
>>  	bool need_zeroout = false;
>>  	bool use_fua = false;
>> +	bool zone_append = false;
>>  	int nr_pages, ret = 0;
>>  	size_t copied = 0;
>>  	size_t orig_count;
>> @@ -235,6 +256,9 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  			use_fua = true;
>>  	}
>>  
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		zone_append = true;
>> +
>>  	/*
>>  	 * Save the original count and trim the iter to just the extent we
>>  	 * are operating on right now.  The iter will be re-expanded once
>> @@ -266,12 +290,28 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  
>>  		bio = bio_alloc(GFP_KERNEL, nr_pages);
>>  		bio_set_dev(bio, iomap->bdev);
>> -		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>> +		bio->bi_iter.bi_sector = iomap_dio_bio_sector(dio, iomap, pos);
>>  		bio->bi_write_hint = dio->iocb->ki_hint;
>>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
>>  		bio->bi_private = dio;
>>  		bio->bi_end_io = iomap_dio_bio_end_io;
>>  
>> +		if (dio->flags & IOMAP_DIO_WRITE) {
>> +			bio->bi_opf = REQ_SYNC | REQ_IDLE;
>> +			if (zone_append)
>> +				bio->bi_opf |= REQ_OP_ZONE_APPEND;
>> +			else
>> +				bio->bi_opf |= REQ_OP_WRITE;
>> +			if (use_fua)
>> +				bio->bi_opf |= REQ_FUA;
>> +			else
>> +				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>> +		} else {
>> +			bio->bi_opf = REQ_OP_READ;
>> +		}
> 
> Why move all this code? If it's needed, please split it into a
> separate patch to separate it from the new functionality...

The BIO page addition in bio_iov_iter_get_pages() needs to know that the BIO
is a zone append to stop adding pages before the BIO grows too large and
results in a split when it is submitted. So bio->bi_opf needs to be set
before calling
bio_iov_iter_get_pages(). So this change is related to the new functionality.
But I guess it is OK to do it regardless of the BIO operation.

> 
>> +
>>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>>  		if (unlikely(ret)) {
>>  			/*
>> @@ -284,19 +324,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  			goto zero_tail;
>>  		}
>>  
>> -		n = bio->bi_iter.bi_size;
>> -		if (dio->flags & IOMAP_DIO_WRITE) {
>> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
>> -			if (use_fua)
>> -				bio->bi_opf |= REQ_FUA;
>> -			else
>> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>> +		n = bio->bi_iter.bi_size;
>> +
>> +		if (dio->flags & IOMAP_DIO_WRITE)
>>  			task_io_account_write(n);
>> -		} else {
>> -			bio->bi_opf = REQ_OP_READ;
>> -			if (dio->flags & IOMAP_DIO_DIRTY)
>> -				bio_set_pages_dirty(bio);
>> -		}
>> +		else if (dio->flags & IOMAP_DIO_DIRTY)
>> +			bio_set_pages_dirty(bio);
>>  
>>  		dio->size += n;
>>  		pos += n;
>> @@ -304,6 +335,15 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  
>>  		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
>>  		iomap_dio_submit_bio(dio, iomap, bio);
>> +
>> +		/*
>> +		 * Issuing multiple BIOs for a large zone append write can
>> +		 * result in reordering of the write fragments and to data
>> +		 * corruption. So always stop after the first BIO is issued.
>> +		 */
>> +		if (zone_append)
>> +			break;
> 
> I don't think this sort of functionality should be tied to "zone
> append". If there is a need for "issue a single (short) bio only" it
> should be a flag to iomap_dio_rw() set by the filesystem, which can
> then handle the short read/write that is returned.

Yes, that would be cleaner.

> 
>> +		/*
>> +		 * Zone append writes cannot be split and may thus be short. Break
>> +		 * here to let the user know instead of sending more IOs which
>> +		 * could get reordered and corrupt the written data.
>> +		 */
>> +		if (flags & IOMAP_ZONE_APPEND)
>> +			break;
> 
> ditto.
> 
> Cheers,
> 
> Dave.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-25  5:27     ` Damien Le Moal
@ 2020-03-25  6:25       ` hch
  0 siblings, 0 replies; 28+ messages in thread
From: hch @ 2020-03-25  6:25 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: hch, Johannes Thumshirn, Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On Wed, Mar 25, 2020 at 05:27:38AM +0000, Damien Le Moal wrote:
> > At least for a normal file system that is absolutely not true.  If
> > zonefs is so special it might be better off just using a slightly tweaked
> > copy of blkdev_direct_IO rather than using iomap.
> 
> It would be very nice to not have to add this direct BIO use case in zonefs
> since that would be only for writes to sequential zones while all other
> operations use iomap. So instead of this, what about using a flag as Dave
> suggested (see below comment too)?

Given how special the use case is I'm not sure overloading iomap
is a good idea.  Think of how a "normal" zone aware file system would
use iomap and not all of this will apply.  OTOH the "simple" single bio
code in __blkdev_direct_IO_simple is less than 100 lines of code.  I
think having a specialized code base for a specialized use case
might be better than overloading generic code with tons of flags.
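
E.g. something like this for the synchronous case (entirely untested
sketch, the zonefs function name and how the zone start sector gets
here are made up):

static ssize_t zonefs_zone_append_sync(struct kiocb *iocb,
				       struct iov_iter *from,
				       struct block_device *bdev,
				       sector_t zone_start_sector)
{
	struct bio *bio;
	ssize_t size;
	int ret;

	bio = bio_alloc(GFP_KERNEL, iov_iter_npages(from, BIO_MAX_PAGES));
	bio_set_dev(bio, bdev);
	/* start LBA of the zone, the device picks the actual position */
	bio->bi_iter.bi_sector = zone_start_sector;
	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;

	ret = bio_iov_iter_get_pages(bio, from);
	if (unlikely(ret)) {
		bio_put(bio);
		return ret;
	}
	size = bio->bi_iter.bi_size;

	ret = submit_bio_wait(bio);
	/* on success the written position is reported back in bi_sector */
	bio_put(bio);
	return ret ? ret : size;
}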

> > I don't think the iocb is the right interface for passing this
> > kind of information.  We currently pass a bool wait to iomap_dio_rw
> > which really should be flags.  I have a pending patch for that.
> 
> Is that patch queued in the iomap or xfs tree? Could you point us to it, please?

It isn't queued up anywhere yet.

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 22:45   ` Dave Chinner
  2020-03-25  5:38     ` Damien Le Moal
@ 2020-03-25  8:32     ` Johannes Thumshirn
  1 sibling, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-25  8:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, hch, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 24/03/2020 23:46, Dave Chinner wrote:
>> @@ -266,12 +290,28 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>   
>>   		bio = bio_alloc(GFP_KERNEL, nr_pages);
>>   		bio_set_dev(bio, iomap->bdev);
>> -		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>> +		bio->bi_iter.bi_sector = iomap_dio_bio_sector(dio, iomap, pos);
>>   		bio->bi_write_hint = dio->iocb->ki_hint;
>>   		bio->bi_ioprio = dio->iocb->ki_ioprio;
>>   		bio->bi_private = dio;
>>   		bio->bi_end_io = iomap_dio_bio_end_io;
>>   
>> +		if (dio->flags & IOMAP_DIO_WRITE) {
>> +			bio->bi_opf = REQ_SYNC | REQ_IDLE;
>> +			if (zone_append)
>> +				bio->bi_opf |= REQ_OP_ZONE_APPEND;
>> +			else
>> +				bio->bi_opf |= REQ_OP_WRITE;
>> +			if (use_fua)
>> +				bio->bi_opf |= REQ_FUA;
>> +			else
>> +				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>> +		} else {
>> +			bio->bi_opf = REQ_OP_READ;
>> +			if (dio->flags & IOMAP_DIO_DIRTY)
>> +				bio_set_pages_dirty(bio);
>> +		}
> Why move all this code? If it's needed, please split it into a
> separate patch to separate it from the new functionality...
> 

The code is moved because bio_iov_iter_get_pages() needs the correct 
bi_opf set for zone append in order to check the limits (see patch 
03/11, block: Introduce REQ_OP_ZONE_APPEND, in this series).

The call chain is:
bio_iov_iter_get_pages()
`-> __bio_iov_iter_get_pages()
     `-> bio_full()
         `-> bio_can_zone_append()

I'm not sure separating this code movement into its own patch would 
make it any clearer, apart from having a dedicated commit message?
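
To illustrate, with patch 03/11 applied bio_full() ends up looking 
roughly like this (simplified, not the exact hunk from the patch):

static inline bool bio_full(struct bio *bio, unsigned len)
{
	if (bio->bi_vcnt >= bio->bi_max_vecs)
		return true;

	if (bio->bi_iter.bi_size > UINT_MAX - len)
		return true;

	/* zone append BIOs must never be split, so stop filling the BIO
	 * once the queue's zone append limit would be exceeded */
	if (bio_op(bio) == REQ_OP_ZONE_APPEND)
		return !bio_can_zone_append(bio, len);

	return false;
}

Without bi_opf set beforehand, bio_op() would not report 
REQ_OP_ZONE_APPEND here and the limit check would never trigger.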

* Re: [PATCH v2 01/11] block: factor out requeue handling from dispatch code
  2020-03-24 15:24 ` [PATCH v2 01/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
@ 2020-03-25  8:40   ` Christoph Hellwig
  2020-03-25 15:40     ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2020-03-25  8:40 UTC (permalink / raw)
  To: Johannes Thumshirn, Jens Axboe
  Cc: linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong,
	Christoph Hellwig

On Wed, Mar 25, 2020 at 12:24:44AM +0900, Johannes Thumshirn wrote:
> Factor out the requeue handling from the dispatch code; this will make
> subsequent addition of different requeueing schemes easier.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Jens, can you pick this up?  I think it already is a nice improvement
even without the rest of the series.

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-24 15:41   ` Christoph Hellwig
  2020-03-25  5:27     ` Damien Le Moal
@ 2020-03-25  9:45     ` Johannes Thumshirn
  2020-03-25  9:48       ` hch
  1 sibling, 1 reply; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-25  9:45 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 24/03/2020 16:41, Christoph Hellwig wrote:
>> +
>> +		/*
>> +		 * Issuing multiple BIOs for a large zone append write can
>> +		 * result in reordering of the write fragments and lead to
>> +		 * data corruption. So always stop after the first BIO is
>> +		 * issued.
>> +		 */
>> +		if (zone_append)
>> +			break;
> At least for a normal file system that is absolutely not true.  If
> zonefs is so special it might be better off just using a slightly tweaked
> copy of blkdev_direct_IO rather than using iomap.
> 

Can you please elaborate on that? Why doesn't this hold true for a 
normal file system? If we split the DIO write into multiple BIOs with 
zone-append, there is nothing which guarantees the order of the written 
data (at least as far as I can see).

So if we have this DIO write:
|AAAA|BBBB|CCCC|DDDD|
and we have to split it for whatever reason, what guards against it 
ending up on disk like this:
|CCCC|DDDD|AAAA|BBBB|

This is essentially the same reason we can't split zone-append BIOs, or 
am I totally off track now?

Thanks,
	Johannes

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-25  9:45     ` Johannes Thumshirn
@ 2020-03-25  9:48       ` hch
  2020-03-25  9:54         ` Johannes Thumshirn
  2020-03-25  9:59         ` Damien Le Moal
  0 siblings, 2 replies; 28+ messages in thread
From: hch @ 2020-03-25  9:48 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: hch, Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On Wed, Mar 25, 2020 at 09:45:39AM +0000, Johannes Thumshirn wrote:
> 
> Can you please elaborate on that? Why doesn't this hold true for a 
> normal file system? If we split the DIO write into multiple BIOs with 
> zone-append, there is nothing which guarantees the order of the written 
> data (at least as far as I can see).

Of course nothing guarantees the order.  But the whole point is that the
order does not matter.  

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-25  9:48       ` hch
@ 2020-03-25  9:54         ` Johannes Thumshirn
  2020-03-25  9:59         ` Damien Le Moal
  1 sibling, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-25  9:54 UTC (permalink / raw)
  To: hch
  Cc: Jens Axboe, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 25/03/2020 10:48, hch@infradead.org wrote:
> On Wed, Mar 25, 2020 at 09:45:39AM +0000, Johannes Thumshirn wrote:
>>
>> Can you please elaborate on that? Why doesn't this hold true for a
>> normal file system? If we split the DIO write into multiple BIOs with
>> zone-append, there is nothing which guarantees the order of the written
>> data (at least as far as I can see).
> 
> Of course nothing guarantees the order.  But the whole point is that the
> order does not matter.
> 

OK, now I'm totally lost. The order of data inside an extent does matter, 
otherwise we have data corruption.

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-25  9:48       ` hch
  2020-03-25  9:54         ` Johannes Thumshirn
@ 2020-03-25  9:59         ` Damien Le Moal
  2020-03-25 10:01           ` hch
  1 sibling, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2020-03-25  9:59 UTC (permalink / raw)
  To: hch, Johannes Thumshirn
  Cc: Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 2020/03/25 18:48, hch@infradead.org wrote:
> On Wed, Mar 25, 2020 at 09:45:39AM +0000, Johannes Thumshirn wrote:
>>
>> Can you please elaborate on that? Why doesn't this hold true for a 
>> normal file system? If we split the DIO write into multiple BIOs with 
>> zone-append, there is nothing which guarantees the order of the written 
>> data (at least as far as I can see).
> 
> > Of course nothing guarantees the order.  But the whole point is that the
> order does not matter.  
> 

The order does not matter at the DIO level since the iomap dio end callback
allows the FS to add an extent mapping the written data at the drive-indicated
write location. But that callback is for the entire DIO, not for each BIO
fragment of the DIO. So if the BIO fragments of a large DIO get reordered, as
Johannes said, we will get data corruption in the FS extent. No?

-- 
Damien Le Moal
Western Digital Research

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-25  9:59         ` Damien Le Moal
@ 2020-03-25 10:01           ` hch
  2020-03-25 10:15             ` Johannes Thumshirn
  0 siblings, 1 reply; 28+ messages in thread
From: hch @ 2020-03-25 10:01 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: hch, Johannes Thumshirn, Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On Wed, Mar 25, 2020 at 09:59:19AM +0000, Damien Le Moal wrote:
> On 2020/03/25 18:48, hch@infradead.org wrote:
> > On Wed, Mar 25, 2020 at 09:45:39AM +0000, Johannes Thumshirn wrote:
> >>
> >> Can you please elaborate on that? Why doesn't this hold true for a 
> >> normal file system? If we split the DIO write into multiple BIOs with 
> >> zone-append, there is nothing which guarantees the order of the written 
> >> data (at least as far as I can see).
> > 
> > Of course nothing guarantees the order.  But the whole point is that the
> > order does not matter.  
> > 
> 
> The order does not matter at the DIO level since the iomap dio end callback
> allows the FS to add an extent mapping the written data at the drive-indicated
> write location. But that callback is for the entire DIO, not for each BIO
> fragment of the DIO. So if the BIO fragments of a large DIO get reordered, as
> Johannes said, we will get data corruption in the FS extent. No?

I thought of recording the location in ->iomap_end (and in fact
had a prototype for that), but that is not going to work for AIO of
course.  So yes, we'll need some way to have per-extent completion
callbacks.
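
I.e. something like this in the dio BIO completion path (hand-waving
sketch, the ->end_extent hook below is made up and does not exist
anywhere yet):

static void iomap_dio_bio_end_io(struct bio *bio)
{
	struct iomap_dio *dio = bio->bi_private;

	/*
	 * For zone append the driver updates bi_sector on completion to
	 * the position the data actually landed at, so hand that to the
	 * file system while we still have the BIO.
	 */
	if (bio_op(bio) == REQ_OP_ZONE_APPEND && dio->dops &&
	    dio->dops->end_extent)
		dio->dops->end_extent(dio->iocb, bio->bi_iter.bi_sector);

	/* ... existing completion and reference counting ... */
}

That way the file system can insert each extent as its fragment
completes instead of only once for the whole DIO.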

* Re: [PATCH v2 10/11] iomap: Add support for zone append writes
  2020-03-25 10:01           ` hch
@ 2020-03-25 10:15             ` Johannes Thumshirn
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-25 10:15 UTC (permalink / raw)
  To: hch, Damien Le Moal
  Cc: Jens Axboe, linux-block, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 25/03/2020 11:01, hch@infradead.org wrote:
> I thought of recording the location in ->iomap_end (and in fact
> had a prototype for that), but that is not going to work for AIO of
> course.  So yes, we'll need some way to have per-extent completion
> callbacks.

OK, let's postpone iomap then and handle zone-append directly in zonefs 
for now.


* Re: [PATCH v2 01/11] block: factor out requeue handling from dispatch code
  2020-03-25  8:40   ` Christoph Hellwig
@ 2020-03-25 15:40     ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2020-03-25 15:40 UTC (permalink / raw)
  To: Christoph Hellwig, Johannes Thumshirn
  Cc: linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong,
	Christoph Hellwig

On 3/25/20 2:40 AM, Christoph Hellwig wrote:
> On Wed, Mar 25, 2020 at 12:24:44AM +0900, Johannes Thumshirn wrote:
>> Factor out the requeue handling from the dispatch code; this will make
>> subsequent addition of different requeueing schemes easier.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Jens, can you pick this up?  I think it already is a nice improvement
> even without the rest of the series.

Sure, applied for 5.7.

-- 
Jens Axboe


* Re: [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND
  2020-03-24 15:24 ` [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
  2020-03-24 16:07   ` Johannes Thumshirn
@ 2020-03-25 16:24   ` Johannes Thumshirn
  1 sibling, 0 replies; 28+ messages in thread
From: Johannes Thumshirn @ 2020-03-25 16:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, linux-block, Damien Le Moal, Keith Busch,
	linux-scsi @ vger . kernel . org, Martin K . Petersen,
	linux-fsdevel @ vger . kernel . org, Darrick J . Wong

On 24/03/2020 16:25, Johannes Thumshirn wrote:
> +static inline bool bio_can_zone_append(struct bio *bio, unsigned len)
> +{
> +	struct request_queue *q = bio->bi_disk->queue;
> +	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
> +
> +	if (WARN_ON_ONCE(!max_append_sectors))
> +		return false;
> +
> +	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
> +		return false;
> +
> +	if (bio->bi_vcnt >= q->limits.max_segments)
> +		return false;
> +
> +	return true;
> +}

The return values need to be reversed as well...
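
I.e. something like this for v3 (sketch; renaming the helper so that,
like bio_full(), true means "stop adding pages"):

static inline bool bio_zone_append_full(struct bio *bio, unsigned len)
{
	struct request_queue *q = bio->bi_disk->queue;
	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);

	if (WARN_ON_ONCE(!max_append_sectors))
		return true;

	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
		return true;

	if (bio->bi_vcnt >= q->limits.max_segments)
		return true;

	return false;
}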

Thread overview: 28+ messages
2020-03-24 15:24 [PATCH v2 00/11] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 01/11] block: factor out requeue handling from dispatch code Johannes Thumshirn
2020-03-25  8:40   ` Christoph Hellwig
2020-03-25 15:40     ` Jens Axboe
2020-03-24 15:24 ` [PATCH v2 02/11] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 03/11] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-03-24 16:07   ` Johannes Thumshirn
2020-03-25 16:24   ` Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 04/11] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 05/11] block: Introduce zone write pointer offset caching Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 06/11] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 07/11] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 08/11] null_blk: Cleanup zoned device initialization Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 09/11] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 10/11] iomap: Add support for zone append writes Johannes Thumshirn
2020-03-24 15:41   ` Christoph Hellwig
2020-03-25  5:27     ` Damien Le Moal
2020-03-25  6:25       ` hch
2020-03-25  9:45     ` Johannes Thumshirn
2020-03-25  9:48       ` hch
2020-03-25  9:54         ` Johannes Thumshirn
2020-03-25  9:59         ` Damien Le Moal
2020-03-25 10:01           ` hch
2020-03-25 10:15             ` Johannes Thumshirn
2020-03-24 22:45   ` Dave Chinner
2020-03-25  5:38     ` Damien Le Moal
2020-03-25  8:32     ` Johannes Thumshirn
2020-03-24 15:24 ` [PATCH v2 11/11] zonefs: use zone-append for sequential zones Johannes Thumshirn
