* [PATCH 0/4] v2 Patchset: Zone Append command support for NVMe Zoned Namespaces (ZNS)
       [not found] <CGME20200625174119epcas5p122a2197102fe336aa35fdea1273fd1b0@epcas5p1.samsung.com>
@ 2020-06-25 17:38 ` Krishna Kanth Reddy
       [not found]   ` <CGME20200625174124epcas5p18e4fbc502c9cf1fef7e84ba5cefba945@epcas5p1.samsung.com>
                     ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-06-25 17:38 UTC (permalink / raw)
  To: axboe; +Cc: fio, Krishna Kanth Reddy

This is the v2 patchset for the Zone Append command support for NVMe Zoned Namespaces (ZNS).
The v1 Pull Request is https://github.com/axboe/fio/pull/1021

Review comments on the v1 pull request have been addressed in this v2 patchset.
The new opcodes added for libaio and io_uring have been removed, per the review comments on the Linux kernel patches on LKML.
Linux kernel v1 patchset for the Zone Append command support:
	https://lkml.org/lkml/2020/6/17/801
	https://lkml.org/lkml/2020/6/17/802
	https://lkml.org/lkml/2020/6/17/803
	https://lkml.org/lkml/2020/6/17/804

The RWF_ZONE_APPEND flag is passed in iocb->aio_rw_flags for the libaio IO engine and in sqe->rw_flags for the io_uring IO engine.

Add Zone Append command support for NVMe Zoned Namespaces (ZNS) defined in NVM Express TP4053.

1. Added a new FIO option, zone_append.
	When the zone_append option is enabled, the existing write path will send a Zone Append command with the LBA offset set to the start of the zone.
2. libaio: support for Zone Append command in libaio IO engine
3. iouring: support for Zone Append command in io_uring IO engine
4. t/zbd: Add support to verify the Zone Append (ZNS) command with libaio and io_uring IO engine tests

This is an RFC; the Linux kernel interface is still under submission and review.
Feedback and comments will help in improving this further.
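
For reference, a minimal job exercising the new option could look like the
following (the device path is illustrative):

    ; hypothetical ZNS namespace exposed as a zoned block device
    [zns-zone-append]
    filename=/dev/nvme0n1
    zonemode=zbd
    zone_append=1
    ioengine=io_uring
    direct=1
    rw=write
    bs=64k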

Ankit Kumar (3):
  Add Zone Append command support for NVMe Zoned Namespaces (ZNS)
  libaio: support for Zone Append command
  iouring: support for Zone Append command

Krishna Kanth Reddy (1):
  t/zbd: Add support to verify Zone Append command with libaio, io_uring
        IO engine tests

 HOWTO                  |  7 ++++
 engines/io_uring.c     | 23 ++++++++++++-
 engines/libaio.c       | 20 ++++++++++-
 fio.1                  |  7 ++++
 io_u.c                 |  4 +--
 io_u.h                 | 10 ++++--
 ioengines.c            |  4 +--
 options.c              | 10 ++++++
 t/zbd/test-zbd-support | 48 +++++++++++++++++++++++++++
 thread_options.h       |  2 ++
 zbd.c                  | 90 ++++++++++++++++++++++++++++++++++++++++++++++----
 zbd.h                  | 13 +++++---
 12 files changed, 220 insertions(+), 18 deletions(-)

-- 
2.7.4




* [PATCH 1/4] Add Zone Append command support for NVMe Zoned Namespaces (ZNS)
       [not found]   ` <CGME20200625174124epcas5p18e4fbc502c9cf1fef7e84ba5cefba945@epcas5p1.samsung.com>
@ 2020-06-25 17:38     ` Krishna Kanth Reddy
  2020-06-26  5:33       ` Damien Le Moal
  2020-06-26  5:50       ` Damien Le Moal
  0 siblings, 2 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-06-25 17:38 UTC (permalink / raw)
  To: axboe; +Cc: fio, Ankit Kumar, Krishna Kanth Reddy

From: Ankit Kumar <ankit.kumar@samsung.com>

defined in NVM Express TP4053. Added a new FIO option zone_append.
When the zone_append option is enabled, the existing write path will
send a Zone Append command with the offset set to the start of the zone.

Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
---
 HOWTO            |  7 +++++
 fio.1            |  7 +++++
 io_u.c           |  4 +--
 io_u.h           | 10 +++++--
 ioengines.c      |  4 +--
 options.c        | 10 +++++++
 thread_options.h |  2 ++
 zbd.c            | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 zbd.h            | 13 +++++---
 9 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/HOWTO b/HOWTO
index 8cf8d65..62b5ac8 100644
--- a/HOWTO
+++ b/HOWTO
@@ -1010,6 +1010,13 @@ Target file/device
 	:option:`zonesize` bytes of data have been transferred. This parameter
 	must be zero for :option:`zonemode` =zbd.
 
+.. option:: zone_append=bool
+
+	For :option:`zonemode` =zbd and for :option:`rw` =write or :option:
+	`rw` =randwrite, if zone_append is enabled, the the io_u points to the
+	starting offset of a zone. On successful completion the multiple of
+	sectors relative to the zone's starting offset is returned.
+
 .. option:: read_beyond_wp=bool
 
 	This parameter applies to :option:`zonemode` =zbd only.
diff --git a/fio.1 b/fio.1
index f134e0b..09add8f 100644
--- a/fio.1
+++ b/fio.1
@@ -782,6 +782,13 @@ sequential workloads and ignored for random workloads. For read workloads,
 see also \fBread_beyond_wp\fR.
 
 .TP
+.BI zone_append
+For \fBzonemode\fR =zbd and for \fBrw\fR=write or \fBrw\fR=randwrite, if
+zone_append is enabled, the io_u points to the starting offset of a zone. On
+successful completion the multiple of sectors relative to the zone's starting
+offset is returned.
+
+.TP
 .BI read_beyond_wp \fR=\fPbool
 This parameter applies to \fBzonemode=zbd\fR only.
 
diff --git a/io_u.c b/io_u.c
index 7f50906..b891a9b 100644
--- a/io_u.c
+++ b/io_u.c
@@ -778,7 +778,7 @@ void put_io_u(struct thread_data *td, struct io_u *io_u)
 {
 	const bool needs_lock = td_async_processing(td);
 
-	zbd_put_io_u(io_u);
+	zbd_put_io_u(td, io_u);
 
 	if (td->parent)
 		td = td->parent;
@@ -1342,7 +1342,7 @@ static long set_io_u_file(struct thread_data *td, struct io_u *io_u)
 		if (!fill_io_u(td, io_u))
 			break;
 
-		zbd_put_io_u(io_u);
+		zbd_put_io_u(td, io_u);
 
 		put_file_log(td, f);
 		td_io_close_file(td, f);
diff --git a/io_u.h b/io_u.h
index 87c2920..f5b24fd 100644
--- a/io_u.h
+++ b/io_u.h
@@ -94,19 +94,25 @@ struct io_u {
 	};
 
 	/*
+	 * for zone append this is the start offset of the zone.
+	 */
+	unsigned long long zone_start_offset;
+
+	/*
 	 * ZBD mode zbd_queue_io callback: called after engine->queue operation
 	 * to advance a zone write pointer and eventually unlock the I/O zone.
 	 * @q indicates the I/O queue status (busy, queued or completed).
 	 * @success == true means that the I/O operation has been queued or
 	 * completed successfully.
 	 */
-	void (*zbd_queue_io)(struct io_u *, int q, bool success);
+	void (*zbd_queue_io)(struct thread_data *, struct io_u *, int q,
+			     bool success);
 
 	/*
 	 * ZBD mode zbd_put_io callback: called in after completion of an I/O
 	 * or commit of an async I/O to unlock the I/O target zone.
 	 */
-	void (*zbd_put_io)(const struct io_u *);
+	void (*zbd_put_io)(struct thread_data *, const struct io_u *);
 
 	/*
 	 * Callback for io completion
diff --git a/ioengines.c b/ioengines.c
index 2c7a0df..81ac846 100644
--- a/ioengines.c
+++ b/ioengines.c
@@ -328,7 +328,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
 	}
 
 	ret = td->io_ops->queue(td, io_u);
-	zbd_queue_io_u(io_u, ret);
+	zbd_queue_io_u(td, io_u, ret);
 
 	unlock_file(td, io_u->file);
 
@@ -370,7 +370,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
 	if (!td->io_ops->commit) {
 		io_u_mark_submit(td, 1);
 		io_u_mark_complete(td, 1);
-		zbd_put_io_u(io_u);
+		zbd_put_io_u(td, io_u);
 	}
 
 	if (ret == FIO_Q_COMPLETED) {
diff --git a/options.c b/options.c
index 85a0f49..d54da81 100644
--- a/options.c
+++ b/options.c
@@ -3317,6 +3317,16 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
 		},
 	},
 	{
+		.name	= "zone_append",
+		.lname	= "zone_append",
+		.type	= FIO_OPT_BOOL,
+		.off1	= offsetof(struct thread_options, zone_append),
+		.help	= "Use Zone Append for Zone block device",
+		.def	= "0",
+		.category = FIO_OPT_C_IO,
+		.group	= FIO_OPT_G_ZONE,
+	},
+	{
 		.name	= "zonesize",
 		.lname	= "Zone size",
 		.type	= FIO_OPT_STR_VAL,
diff --git a/thread_options.h b/thread_options.h
index 968ea0a..45c5ef8 100644
--- a/thread_options.h
+++ b/thread_options.h
@@ -195,6 +195,7 @@ struct thread_options {
 	unsigned long long zone_size;
 	unsigned long long zone_skip;
 	enum fio_zone_mode zone_mode;
+	unsigned int zone_append;
 	unsigned long long lockmem;
 	enum fio_memtype mem_type;
 	unsigned int mem_align;
@@ -631,6 +632,7 @@ struct thread_options_pack {
 	uint32_t allow_mounted_write;
 
 	uint32_t zone_mode;
+	uint32_t zone_append;
 } __attribute__((packed));
 
 extern void convert_thread_options_to_cpu(struct thread_options *o, struct thread_options_pack *top);
diff --git a/zbd.c b/zbd.c
index 8cf8f81..ffdb766 100644
--- a/zbd.c
+++ b/zbd.c
@@ -455,6 +455,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 		for (i = 0; i < nrz; i++, j++, z++, p++) {
 			mutex_init_pshared_with_type(&p->mutex,
 						     PTHREAD_MUTEX_RECURSIVE);
+			cond_init_pshared(&p->reset_cond);
 			p->start = z->start;
 			switch (z->cond) {
 			case ZBD_ZONE_COND_NOT_WP:
@@ -469,6 +470,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 			}
 			p->type = z->type;
 			p->cond = z->cond;
+			p->pending_ios = 0;
 			if (j > 0 && p->start != p[-1].start + zone_size) {
 				log_info("%s: invalid zone data\n",
 					 f->file_name);
@@ -1196,20 +1198,24 @@ zbd_find_zone(struct thread_data *td, struct io_u *io_u,
 
 /**
  * zbd_queue_io - update the write pointer of a sequential zone
+ * @td: fio thread data.
  * @io_u: I/O unit
  * @success: Whether or not the I/O unit has been queued successfully
  * @q: queueing status (busy, completed or queued).
  *
  * For write and trim operations, update the write pointer of the I/O unit
  * target zone.
+ * For zone append operation, release the zone mutex
  */
-static void zbd_queue_io(struct io_u *io_u, int q, bool success)
+static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
+			 bool success)
 {
 	const struct fio_file *f = io_u->file;
 	struct zoned_block_device_info *zbd_info = f->zbd_info;
 	struct fio_zone_info *z;
 	uint32_t zone_idx;
 	uint64_t zone_end;
+	int ret;
 
 	if (!zbd_info)
 		return;
@@ -1241,6 +1247,8 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
 			zbd_info->sectors_with_data += zone_end - z->wp;
 		pthread_mutex_unlock(&zbd_info->mutex);
 		z->wp = zone_end;
+		if (td->o.zone_append)
+			z->pending_ios++;
 		break;
 	case DDIR_TRIM:
 		assert(z->wp == z->start);
@@ -1250,18 +1258,22 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
 	}
 
 unlock:
-	if (!success || q != FIO_Q_QUEUED) {
+	if (!success || q != FIO_Q_QUEUED || td->o.zone_append) {
 		/* BUSY or COMPLETED: unlock the zone */
-		pthread_mutex_unlock(&z->mutex);
-		io_u->zbd_put_io = NULL;
+		ret = pthread_mutex_unlock(&z->mutex);
+		assert(ret == 0);
+		if (!success || q != FIO_Q_QUEUED)
+			io_u->zbd_put_io = NULL;
 	}
 }
 
 /**
  * zbd_put_io - Unlock an I/O unit target zone lock
+ * For zone append operation we don't hold zone lock
+ * @td: fio thread data.
  * @io_u: I/O unit
  */
-static void zbd_put_io(const struct io_u *io_u)
+static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
 {
 	const struct fio_file *f = io_u->file;
 	struct zoned_block_device_info *zbd_info = f->zbd_info;
@@ -1283,6 +1295,19 @@ static void zbd_put_io(const struct io_u *io_u)
 	       "%s: terminate I/O (%lld, %llu) for zone %u\n",
 	       f->file_name, io_u->offset, io_u->buflen, zone_idx);
 
+	if (td->o.zone_append) {
+		pthread_mutex_lock(&z->mutex);
+		if (z->pending_ios > 0) {
+			z->pending_ios--;
+			/*
+			 * Other threads may be waiting for pending I/O's to
+			 * complete for this zone. Notify them.
+			 */
+			if (!z->pending_ios)
+				pthread_cond_broadcast(&z->reset_cond);
+		}
+	}
+
 	ret = pthread_mutex_unlock(&z->mutex);
 	assert(ret == 0);
 	zbd_check_swd(f);
@@ -1524,16 +1549,69 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
 			 * asynchronously and since we will submit the zone
 			 * reset synchronously, wait until previously submitted
 			 * write requests have completed before issuing a
-			 * zone reset.
+			 * zone reset. For append request release the zone lock
+			 * as other threads will acquire it at the time of
+			 * zbd_put_io.
 			 */
+reset:
+			if (td->o.zone_append)
+				pthread_mutex_unlock(&zb->mutex);
 			io_u_quiesce(td);
+			if (td->o.zone_append)
+				pthread_mutex_lock(&zb->mutex);
+
 			zb->reset_zone = 0;
+			if (td->o.zone_append) {
+				/*
+				 * While processing the current thread queued
+				 * requests the other thread may have already
+				 * done zone reset so need to check zone full
+				 * condition again.
+				 */
+				if (!zbd_zone_full(f, zb, min_bs))
+					goto proceed;
+				/*
+				 * Wait for the pending requests to be completed
+				 * else we are ok to reset this zone.
+				 */
+				if (zb->pending_ios) {
+					pthread_cond_wait(&zb->reset_cond, &zb->mutex);
+					goto proceed;
+				}
+			}
+
 			if (zbd_reset_zone(td, f, zb) < 0)
 				goto eof;
+
+			/* Notify other threads waiting for zone mutex */
+			if (td->o.zone_append)
+				pthread_cond_broadcast(&zb->reset_cond);
+		}
+proceed:
+		/*
+		 * Check for zone full condition again. For zone append request
+		 * the zone may already be reset, written and full while we
+		 * were waiting for our turn.
+		 */
+		if (zbd_zone_full(f, zb, min_bs)) {
+			goto reset;
 		}
+
 		/* Make writes occur at the write pointer */
 		assert(!zbd_zone_full(f, zb, min_bs));
 		io_u->offset = zb->wp;
+
+		/*
+		 * Support zone append for both regular and zoned block
+		 * device.
+		 */
+		if (td->o.zone_append) {
+			if (f->zbd_info->model == ZBD_NONE)
+				io_u->zone_start_offset = zb->wp;
+			else
+				io_u->zone_start_offset = zb->start;
+		}
+
 		if (!is_valid_offset(f, io_u->offset)) {
 			dprint(FD_ZBD, "Dropped request with offset %llu\n",
 			       io_u->offset);
diff --git a/zbd.h b/zbd.h
index e942a7f..eac42f7 100644
--- a/zbd.h
+++ b/zbd.h
@@ -23,8 +23,10 @@ enum io_u_action {
  * struct fio_zone_info - information about a single ZBD zone
  * @start: zone start location (bytes)
  * @wp: zone write pointer location (bytes)
+ * @pending_ios: Number of IO's pending in this zone
  * @verify_block: number of blocks that have been verified for this zone
  * @mutex: protects the modifiable members in this structure
+ * @reset_cond: zone reset check condition. only relevant for zone_append.
  * @type: zone type (BLK_ZONE_TYPE_*)
  * @cond: zone state (BLK_ZONE_COND_*)
  * @open: whether or not this zone is currently open. Only relevant if
@@ -33,8 +35,10 @@ enum io_u_action {
  */
 struct fio_zone_info {
 	pthread_mutex_t		mutex;
+	pthread_cond_t		reset_cond;
 	uint64_t		start;
 	uint64_t		wp;
+	uint32_t		pending_ios;
 	uint32_t		verify_block;
 	enum zbd_zone_type	type:2;
 	enum zbd_zone_cond	cond:4;
@@ -96,18 +100,19 @@ static inline void zbd_close_file(struct fio_file *f)
 		zbd_free_zone_info(f);
 }
 
-static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
+static inline void zbd_queue_io_u(struct thread_data *td, struct io_u *io_u,
+				  enum fio_q_status status)
 {
 	if (io_u->zbd_queue_io) {
-		io_u->zbd_queue_io(io_u, status, io_u->error == 0);
+		io_u->zbd_queue_io(td, io_u, status, io_u->error == 0);
 		io_u->zbd_queue_io = NULL;
 	}
 }
 
-static inline void zbd_put_io_u(struct io_u *io_u)
+static inline void zbd_put_io_u(struct thread_data *td, struct io_u *io_u)
 {
 	if (io_u->zbd_put_io) {
-		io_u->zbd_put_io(io_u);
+		io_u->zbd_put_io(td, io_u);
 		io_u->zbd_queue_io = NULL;
 		io_u->zbd_put_io = NULL;
 	}
-- 
2.7.4




* [PATCH 2/4] libaio: support for Zone Append command
       [not found]   ` <CGME20200625174129epcas5p304bf58bb381b4b0c39e0ff91b50a23a9@epcas5p3.samsung.com>
@ 2020-06-25 17:38     ` Krishna Kanth Reddy
  2020-06-26  5:38       ` Damien Le Moal
  0 siblings, 1 reply; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-06-25 17:38 UTC (permalink / raw)
  To: axboe; +Cc: fio, Ankit Kumar, Krishna Kanth Reddy

From: Ankit Kumar <ankit.kumar@samsung.com>

The zone append command uses the write path with the offset set to the
start of the zone. Upon successful completion, the device returns the
location where the data has been placed, as a number of sectors relative
to the zone's start offset.

Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
---
 engines/libaio.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/engines/libaio.c b/engines/libaio.c
index 398fdf9..4ffe831 100644
--- a/engines/libaio.c
+++ b/engines/libaio.c
@@ -123,7 +123,13 @@ static int fio_libaio_prep(struct thread_data *td, struct io_u *io_u)
 		if (o->nowait)
 			iocb->aio_rw_flags |= RWF_NOWAIT;
 	} else if (io_u->ddir == DDIR_WRITE) {
-		io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->offset);
+		if ((td->o.zone_mode == ZONE_MODE_ZBD) && td->o.zone_append) {
+			io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->zone_start_offset);
+		}
+		else
+			io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->offset);
+		if (td->o.zone_append)
+			iocb->aio_rw_flags |= RWF_ZONE_APPEND;
 		if (o->nowait)
 			iocb->aio_rw_flags |= RWF_NOWAIT;
 	} else if (ddir_sync(io_u->ddir))
@@ -228,6 +234,18 @@ static int fio_libaio_getevents(struct thread_data *td, unsigned int min,
 		} else {
 			r = io_getevents(ld->aio_ctx, actual_min,
 				max, ld->aio_events + events, lt);
+			if (td->o.zone_mode == ZONE_MODE_ZBD) {
+				struct io_event *ev;
+				struct io_u *io_u;
+				for (unsigned event = 0; event < r; event++) {
+					ev = ld->aio_events + event;
+					io_u = container_of(ev->obj, struct io_u, iocb);
+					if (td->o.zone_append
+					    && td->o.do_verify && td->o.verify
+					    && (io_u->ddir == DDIR_WRITE))
+						io_u->ipo->offset = io_u->zone_start_offset + (ev->res2 << 9);
+				}
+			}
 		}
 		if (r > 0)
 			events += r;
-- 
2.7.4




* [PATCH 3/4] iouring: support for Zone Append command
       [not found]   ` <CGME20200625174131epcas5p36cf7cd413dcb698f117474df71e5648b@epcas5p3.samsung.com>
@ 2020-06-25 17:38     ` Krishna Kanth Reddy
  2020-06-26  5:43       ` Damien Le Moal
  0 siblings, 1 reply; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-06-25 17:38 UTC (permalink / raw)
  To: axboe; +Cc: fio, Ankit Kumar, Krishna Kanth Reddy

From: Ankit Kumar <ankit.kumar@samsung.com>

The zone append command uses the write path with the offset set to the
start of the zone. Upon successful completion, the device returns the
location where the data has been placed, as a number of sectors relative
to the zone's start offset.

Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
---
 engines/io_uring.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/engines/io_uring.c b/engines/io_uring.c
index cd0810f..cb7c5ba 100644
--- a/engines/io_uring.c
+++ b/engines/io_uring.c
@@ -251,7 +251,13 @@ static int fio_ioring_prep(struct thread_data *td, struct io_u *io_u)
 			sqe->ioprio = td->o.ioprio_class << 13;
 		if (ld->ioprio_set)
 			sqe->ioprio |= td->o.ioprio;
-		sqe->off = io_u->offset;
+		if (td->o.zone_append && io_u->ddir == DDIR_WRITE)
+			sqe->rw_flags |= RWF_ZONE_APPEND;
+		if ((td->o.zone_mode == ZONE_MODE_ZBD)
+		     && td->o.zone_append && io_u->ddir == DDIR_WRITE) {
+			sqe->off = io_u->zone_start_offset;
+		} else
+			sqe->off = io_u->offset;
 	} else if (ddir_sync(io_u->ddir)) {
 		if (io_u->ddir == DDIR_SYNC_FILE_RANGE) {
 			sqe->off = f->first_write;
@@ -324,6 +330,21 @@ static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
 	ld->cq_ring_off = *ring->head;
 	do {
 		r = fio_ioring_cqring_reap(td, events, max);
+		if (td->o.zone_mode == ZONE_MODE_ZBD) {
+			struct io_uring_cqe *cqe;
+			struct io_u *io_u;
+			unsigned index;
+			for (unsigned event = 0; event < r; event++) {
+				index = (event + ld->cq_ring_off) & ld->cq_ring_mask;
+
+				cqe = &ld->cq_ring.cqes[index];
+				io_u = (struct io_u *) (uintptr_t) cqe->user_data;
+
+				if (td->o.zone_append && td->o.do_verify
+				    && td->o.verify && (io_u->ddir == DDIR_WRITE))
+					io_u->ipo->offset = io_u->zone_start_offset + (cqe->flags << 9);
+			}
+		}
 		if (r) {
 			events += r;
 			if (actual_min != 0)
-- 
2.7.4




* [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests
       [not found]   ` <CGME20200625174133epcas5p1eace8f03319bee805b93c50fe6c690c7@epcas5p1.samsung.com>
@ 2020-06-25 17:38     ` Krishna Kanth Reddy
  2020-06-25 18:44       ` Dmitry Fomichev
  2020-06-26  5:45       ` Damien Le Moal
  0 siblings, 2 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-06-25 17:38 UTC (permalink / raw)
  To: axboe; +Cc: fio, Krishna Kanth Reddy, Ankit Kumar

Modify the test-zbd-support script to verify the Zone Append command
for NVMe Zoned Namespaces (ZNS) defined in NVM Express TP4053.
Added a new FIO option zone_append.
When the zone_append option is enabled, the existing write path will
send a Zone Append command with the LBA offset set to the start of the zone.

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
---
 t/zbd/test-zbd-support | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index 4001be3..ddade22 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -801,6 +801,54 @@ test48() {
 	    >> "${logfile}.${test_number}" 2>&1 || return $?
 }
 
+# Zone append to sequential zones, libaio, 1 job, queue depth 1
+test49() {
+    local i size
+
+    size=$((4 * zone_size))
+    run_fio_on_seq --ioengine=libaio --iodepth=1 --rw=write --zone_append=1 \
+                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
+                   --do_verify=1 --verify=md5                           \
+                   >>"${logfile}.${test_number}" 2>&1 || return $?
+    check_written $size || return $?
+    check_read $size || return $?
+}
+
+# Random zone append to sequential zones, libaio, 8 jobs, queue depth 64 per job
+test50() {
+    local size
+
+    size=$((4 * zone_size))
+    run_fio_on_seq --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4K \
+                   --group_reporting=1 --numjobs=8 --zone_append=1 \
+                   >> "${logfile}.${test_number}" 2>&1 || return $?
+    check_written $((size * 8)) || return $?
+}
+
+# Zone append to sequential zones, io_uring, 1 job, queue depth 1
+test51() {
+    local i size
+
+    size=$((4 * zone_size))
+    run_fio_on_seq --ioengine=io_uring --iodepth=1 --rw=write --zone_append=1 \
+                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
+                   --do_verify=1 --verify=md5                           \
+                   >>"${logfile}.${test_number}" 2>&1 || return $?
+    check_written $size || return $?
+    check_read $size || return $?
+}
+
+# Random zone append to sequential zones, io_uring, 8 jobs, queue depth 64 per job
+test52() {
+    local size
+
+    size=$((4 * zone_size))
+    run_fio_on_seq --ioengine=io_uring --iodepth=64 --rw=randwrite --bs=4K \
+                   --group_reporting=1 --numjobs=8 --zone_append=1 \
+                   >> "${logfile}.${test_number}" 2>&1 || return $?
+    check_written $((size * 8)) || return $?
+}
+
 tests=()
 dynamic_analyzer=()
 reset_all_zones=
-- 
2.7.4




* RE: [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests
  2020-06-25 17:38     ` [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests Krishna Kanth Reddy
@ 2020-06-25 18:44       ` Dmitry Fomichev
  2020-07-03  8:46         ` Krishna Kanth Reddy
  2020-06-26  5:45       ` Damien Le Moal
  1 sibling, 1 reply; 17+ messages in thread
From: Dmitry Fomichev @ 2020-06-25 18:44 UTC (permalink / raw)
  To: Krishna Kanth Reddy, axboe
  Cc: fio, Ankit Kumar, Damien Le Moal, Shinichiro Kawasaki

Will these tests succeed if the t/zbd/test-zbd-support script is run against an SMR HDD?
Since zoned HDDs don't support Zone Append, I would expect the I/O to fail.
I think you need to check whether the device is an NVMe drive and expect the I/O to fail
in the tests below if it is not.

More inline...

> -----Original Message-----
> From: fio-owner@vger.kernel.org <fio-owner@vger.kernel.org> On Behalf
> Of Krishna Kanth Reddy
> Sent: Thursday, June 25, 2020 1:39 PM
> To: axboe@kernel.dk
> Cc: fio@vger.kernel.org; Krishna Kanth Reddy <krish.reddy@samsung.com>;
> Ankit Kumar <ankit.kumar@samsung.com>
> Subject: [PATCH 4/4] t/zbd: Add support to verify Zone Append command
> with libaio, io_uring IO engine tests
> 
> Modify the test-zbd-support script to verify the Zone Append command
> for NVMe Zoned Namespaces (ZNS) defined in NVM Express TP4053.
> Added a new FIO option zone_append.
> When the zone_append option is enabled, the existing write path will
> send a Zone Append command with the LBA offset set to the start of the zone.
> 
> Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
> ---
>  t/zbd/test-zbd-support | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
> index 4001be3..ddade22 100755
> --- a/t/zbd/test-zbd-support
> +++ b/t/zbd/test-zbd-support
> @@ -801,6 +801,54 @@ test48() {
>  	    >> "${logfile}.${test_number}" 2>&1 || return $?
>  }
> 
> +# Zone append to sequential zones, libaio, 1 job, queue depth 1
> +test49() {
> +    local i size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=libaio --iodepth=1 --rw=write --zone_append=1 \
> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
> +                   --do_verify=1 --verify=md5                           \
> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $size || return $?
> +    check_read $size || return $?
> +}
> +
> +# Random zone append to sequential zones, libaio, 8 jobs, queue depth 64 per job
> +test50() {
> +    local size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4K \
> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $((size * 8)) || return $?
> +}
> +
> +# Zone append to sequential zones, io_uring, 1 job, queue depth 1
> +test51() {
> +    local i size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=io_uring --iodepth=1 --rw=write --zone_append=1 \
> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
> +                   --do_verify=1 --verify=md5                           \
> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $size || return $?
> +    check_read $size || return $?
> +}
> +
> +# Random zone append to sequential zones, io_uring, 8 jobs, queue depth 64 per job
> +test52() {
> +    local size
> +
> +    size=$((4 * zone_size))

Maybe try a different size? It is the same in all tests.

> +    run_fio_on_seq --ioengine=io_uring --iodepth=64 --rw=randwrite --bs=4K \

All tests do 4K I/O, but maybe run some with a different block size?
It could be a good idea to add a test that writes with bs=ZASL (or MDTS).
Yet another test issuing I/O with a bs exceeding the maximum I/O size would
be very useful.

> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $((size * 8)) || return $?
> +}
> +
>  tests=()
>  dynamic_analyzer=()
>  reset_all_zones=
> --
> 2.7.4



* Re: [PATCH 1/4] Add Zone Append command support for NVMe Zoned Namespaces (ZNS)
  2020-06-25 17:38     ` [PATCH 1/4] Add " Krishna Kanth Reddy
@ 2020-06-26  5:33       ` Damien Le Moal
  2020-07-03 17:17         ` Krishna Kanth Reddy
  2020-06-26  5:50       ` Damien Le Moal
  1 sibling, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2020-06-26  5:33 UTC (permalink / raw)
  To: Krishna Kanth Reddy, axboe; +Cc: fio, Ankit Kumar

On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
> From: Ankit Kumar <ankit.kumar@samsung.com>
> 
> defined in NVM Express TP4053. Added a new FIO option zone_append.
> When the zone_append option is enabled, the existing write path will
> send a Zone Append command with the offset set to the start of the zone.

Zone append is supported by SCSI disks too, through emulation, and will be for
nullblk as well, so please generalize this patch title and commit message. This
is not for NVMe ZNS devices only.

> 
> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
> ---
>  HOWTO            |  7 +++++
>  fio.1            |  7 +++++
>  io_u.c           |  4 +--
>  io_u.h           | 10 +++++--
>  ioengines.c      |  4 +--
>  options.c        | 10 +++++++
>  thread_options.h |  2 ++
>  zbd.c            | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  zbd.h            | 13 +++++---
>  9 files changed, 131 insertions(+), 16 deletions(-)
> 
> diff --git a/HOWTO b/HOWTO
> index 8cf8d65..62b5ac8 100644
> --- a/HOWTO
> +++ b/HOWTO
> @@ -1010,6 +1010,13 @@ Target file/device
>  	:option:`zonesize` bytes of data have been transferred. This parameter
>  	must be zero for :option:`zonemode` =zbd.
>  
> +.. option:: zone_append=bool
> +
> +	For :option:`zonemode` =zbd and for :option:`rw` =write or :option:
> +	`rw` =randwrite, if zone_append is enabled, the the io_u points to the

s/the the io_u/the io_u

> +	starting offset of a zone. On successful completion the multiple of
> +	sectors relative to the zone's starting offset is returned.

You are describing this option's effect from the code perspective. That is not
helpful for a user who does not know fio internals. Please describe this from
the user's perspective, at a high level: what the option does is use the zone
append write command instead of the regular write command.
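
Something along these lines would be closer to that (wording is only a
suggestion):

	For :option:`zonemode` =zbd, write the zones of a zoned block device
	using the zone append command instead of the regular write command.
	On completion, the device reports the offset within the zone at which
	the data was actually written.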

> +
>  .. option:: read_beyond_wp=bool
>  
>  	This parameter applies to :option:`zonemode` =zbd only.
> diff --git a/fio.1 b/fio.1
> index f134e0b..09add8f 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -782,6 +782,13 @@ sequential workloads and ignored for random workloads. For read workloads,
>  see also \fBread_beyond_wp\fR.
>  
>  .TP
> +.BI zone_append
> +For \fBzonemode\fR =zbd and for \fBrw\fR=write or \fBrw\fR=randwrite, if
> +zone_append is enabled, the io_u points to the starting offset of a zone. On
> +successful completion the multiple of sectors relative to the zone's starting
> +offset is returned.

Same comment here.

> +
> +.TP
>  .BI read_beyond_wp \fR=\fPbool
>  This parameter applies to \fBzonemode=zbd\fR only.
>  
> diff --git a/io_u.c b/io_u.c
> index 7f50906..b891a9b 100644
> --- a/io_u.c
> +++ b/io_u.c
> @@ -778,7 +778,7 @@ void put_io_u(struct thread_data *td, struct io_u *io_u)
>  {
>  	const bool needs_lock = td_async_processing(td);
>  
> -	zbd_put_io_u(io_u);
> +	zbd_put_io_u(td, io_u);
>  
>  	if (td->parent)
>  		td = td->parent;
> @@ -1342,7 +1342,7 @@ static long set_io_u_file(struct thread_data *td, struct io_u *io_u)
>  		if (!fill_io_u(td, io_u))
>  			break;
>  
> -		zbd_put_io_u(io_u);
> +		zbd_put_io_u(td, io_u);
>  
>  		put_file_log(td, f);
>  		td_io_close_file(td, f);
> diff --git a/io_u.h b/io_u.h
> index 87c2920..f5b24fd 100644
> --- a/io_u.h
> +++ b/io_u.h
> @@ -94,19 +94,25 @@ struct io_u {
>  	};
>  
>  	/*
> +	 * for zone append this is the start offset of the zone.
> +	 */
> +	unsigned long long zone_start_offset;

This is not necessary. This value can be calculated from any io_u offset.

> +
> +	/*
>  	 * ZBD mode zbd_queue_io callback: called after engine->queue operation
>  	 * to advance a zone write pointer and eventually unlock the I/O zone.
>  	 * @q indicates the I/O queue status (busy, queued or completed).
>  	 * @success == true means that the I/O operation has been queued or
>  	 * completed successfully.
>  	 */
> -	void (*zbd_queue_io)(struct io_u *, int q, bool success);
> +	void (*zbd_queue_io)(struct thread_data *, struct io_u *, int q,
> +			     bool success);
>  
>  	/*
>  	 * ZBD mode zbd_put_io callback: called in after completion of an I/O
>  	 * or commit of an async I/O to unlock the I/O target zone.
>  	 */
> -	void (*zbd_put_io)(const struct io_u *);
> +	void (*zbd_put_io)(struct thread_data *, const struct io_u *);
>  
>  	/*
>  	 * Callback for io completion
> diff --git a/ioengines.c b/ioengines.c
> index 2c7a0df..81ac846 100644
> --- a/ioengines.c
> +++ b/ioengines.c
> @@ -328,7 +328,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>  	}
>  
>  	ret = td->io_ops->queue(td, io_u);
> -	zbd_queue_io_u(io_u, ret);
> +	zbd_queue_io_u(td, io_u, ret);
>  
>  	unlock_file(td, io_u->file);
>  
> @@ -370,7 +370,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>  	if (!td->io_ops->commit) {
>  		io_u_mark_submit(td, 1);
>  		io_u_mark_complete(td, 1);
> -		zbd_put_io_u(io_u);
> +		zbd_put_io_u(td, io_u);
>  	}
>  
>  	if (ret == FIO_Q_COMPLETED) {
> diff --git a/options.c b/options.c
> index 85a0f49..d54da81 100644
> --- a/options.c
> +++ b/options.c
> @@ -3317,6 +3317,16 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
>  		},
>  	},
>  	{
> +		.name	= "zone_append",
> +		.lname	= "zone_append",
> +		.type	= FIO_OPT_BOOL,
> +		.off1	= offsetof(struct thread_options, zone_append),
> +		.help	= "Use Zone Append for Zone block device",

"Use zone append for writing zones of a zoned block device"

may be better as it points out that this has an effect on writes only.

> +		.def	= "0",
> +		.category = FIO_OPT_C_IO,
> +		.group	= FIO_OPT_G_ZONE,
> +	},
> +	{
>  		.name	= "zonesize",
>  		.lname	= "Zone size",
>  		.type	= FIO_OPT_STR_VAL,
> diff --git a/thread_options.h b/thread_options.h
> index 968ea0a..45c5ef8 100644
> --- a/thread_options.h
> +++ b/thread_options.h
> @@ -195,6 +195,7 @@ struct thread_options {
>  	unsigned long long zone_size;
>  	unsigned long long zone_skip;
>  	enum fio_zone_mode zone_mode;
> +	unsigned int zone_append;
>  	unsigned long long lockmem;
>  	enum fio_memtype mem_type;
>  	unsigned int mem_align;
> @@ -631,6 +632,7 @@ struct thread_options_pack {
>  	uint32_t allow_mounted_write;
>  
>  	uint32_t zone_mode;
> +	uint32_t zone_append;
>  } __attribute__((packed));
>  
>  extern void convert_thread_options_to_cpu(struct thread_options *o, struct thread_options_pack *top);
> diff --git a/zbd.c b/zbd.c
> index 8cf8f81..ffdb766 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -455,6 +455,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  		for (i = 0; i < nrz; i++, j++, z++, p++) {
>  			mutex_init_pshared_with_type(&p->mutex,
>  						     PTHREAD_MUTEX_RECURSIVE);
> +			cond_init_pshared(&p->reset_cond);

It would be nice if the commit message explained the changes in this patch and
the role of this condition variable. It is not clear to me at all.

>  			p->start = z->start;
>  			switch (z->cond) {
>  			case ZBD_ZONE_COND_NOT_WP:
> @@ -469,6 +470,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  			}
>  			p->type = z->type;
>  			p->cond = z->cond;
> +			p->pending_ios = 0;

Same here. This is not explained, so one has to figure it out from the code
alone. That is not easy, as what I think is needed may not be what *you* had in
mind in the first place.

>  			if (j > 0 && p->start != p[-1].start + zone_size) {
>  				log_info("%s: invalid zone data\n",
>  					 f->file_name);
> @@ -1196,20 +1198,24 @@ zbd_find_zone(struct thread_data *td, struct io_u *io_u,
>  
>  /**
>   * zbd_queue_io - update the write pointer of a sequential zone
> + * @td: fio thread data.
>   * @io_u: I/O unit
>   * @success: Whether or not the I/O unit has been queued successfully
>   * @q: queueing status (busy, completed or queued).
>   *
>   * For write and trim operations, update the write pointer of the I/O unit
>   * target zone.
> + * For zone append operation, release the zone mutex
>   */
> -static void zbd_queue_io(struct io_u *io_u, int q, bool success)
> +static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
> +			 bool success)
>  {
>  	const struct fio_file *f = io_u->file;
>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
>  	struct fio_zone_info *z;
>  	uint32_t zone_idx;
>  	uint64_t zone_end;
> +	int ret;
>  
>  	if (!zbd_info)
>  		return;
> @@ -1241,6 +1247,8 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>  			zbd_info->sectors_with_data += zone_end - z->wp;
>  		pthread_mutex_unlock(&zbd_info->mutex);
>  		z->wp = zone_end;
> +		if (td->o.zone_append)
> +			z->pending_ios++;

The name is not very good. Shouldn't this be queued_ios? And since fio
differentiates queued and in-flight, but this counter does not and is for zone
append commands only, it may be even better to call it something like
in_flight_za_ios to clarify the use and avoid confusion.

>  		break;
>  	case DDIR_TRIM:
>  		assert(z->wp == z->start);
> @@ -1250,18 +1258,22 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>  	}
>  
>  unlock:
> -	if (!success || q != FIO_Q_QUEUED) {
> +	if (!success || q != FIO_Q_QUEUED || td->o.zone_append) {
>  		/* BUSY or COMPLETED: unlock the zone */
> -		pthread_mutex_unlock(&z->mutex);
> -		io_u->zbd_put_io = NULL;
> +		ret = pthread_mutex_unlock(&z->mutex);
> +		assert(ret == 0);
> +		if (!success || q != FIO_Q_QUEUED)
> +			io_u->zbd_put_io = NULL;
>  	}
>  }
>  
>  /**
>   * zbd_put_io - Unlock an I/O unit target zone lock
> + * For zone append operation we don't hold zone lock
> + * @td: fio thread data.
>   * @io_u: I/O unit
>   */
> -static void zbd_put_io(const struct io_u *io_u)
> +static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
>  {
>  	const struct fio_file *f = io_u->file;
>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
> @@ -1283,6 +1295,19 @@ static void zbd_put_io(const struct io_u *io_u)
>  	       "%s: terminate I/O (%lld, %llu) for zone %u\n",
>  	       f->file_name, io_u->offset, io_u->buflen, zone_idx);
>  
> +	if (td->o.zone_append) {
> +		pthread_mutex_lock(&z->mutex);
> +		if (z->pending_ios > 0) {
> +			z->pending_ios--;
> +			/*
> +			 * Other threads may be waiting for pending I/O's to
> +			 * complete for this zone. Notify them.
> +			 */

Please move this comment inside the if, or even better, to the beginning of this
code block. And also explain why the zone lock needs to be taken here for zone
append only.
> +			if (!z->pending_ios)
> +				pthread_cond_broadcast(&z->reset_cond);
> +		}
> +	}
> +
>  	ret = pthread_mutex_unlock(&z->mutex);
>  	assert(ret == 0);
>  	zbd_check_swd(f);
> @@ -1524,16 +1549,69 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>  			 * asynchronously and since we will submit the zone
>  			 * reset synchronously, wait until previously submitted
>  			 * write requests have completed before issuing a
> -			 * zone reset.
> +			 * zone reset. For append request release the zone lock
> +			 * as other threads will acquire it at the time of
> +			 * zbd_put_io.
>  			 */
> +reset:
> +			if (td->o.zone_append)
> +				pthread_mutex_unlock(&zb->mutex);
>  			io_u_quiesce(td);
> +			if (td->o.zone_append)
> +				pthread_mutex_lock(&zb->mutex);
> +
>  			zb->reset_zone = 0;
> +			if (td->o.zone_append) {
> +				/*
> +				 * While processing the current thread queued
> +				 * requests the other thread may have already
> +				 * done zone reset so need to check zone full
> +				 * condition again.
> +				 */
> +				if (!zbd_zone_full(f, zb, min_bs))
> +					goto proceed;
> +				/*
> +				 * Wait for the pending requests to be completed
> +				 * else we are ok to reset this zone.
> +				 */
> +				if (zb->pending_ios) {
> +					pthread_cond_wait(&zb->reset_cond, &zb->mutex);
> +					goto proceed;

If you are OK to reset the zone after the cond wait, why the jump to proceed?
That will skip the reset...

> +				}
> +			}
> +
>  			if (zbd_reset_zone(td, f, zb) < 0)
>  				goto eof;
> +
> +			/* Notify other threads waiting for zone mutex */
> +			if (td->o.zone_append)
> +				pthread_cond_broadcast(&zb->reset_cond);

Why? Isn't this cond variable used to signal pending_ios going to 0?

> +		}
> +proceed:
> +		/*
> +		 * Check for zone full condition again. For zone append request
> +		 * the zone may already be reset, written and full while we
> +		 * were waiting for our turn.
> +		 */
> +		if (zbd_zone_full(f, zb, min_bs)) {
> +			goto reset;
>  		}

No need for the curly braces. But the main problem here is that a job is not
guaranteed to ever be able to issue a zone append. A job may end up looping
indefinitely between here and the reset label.

This does not look necessary at all to me. The zone is locked and it was already
determined that a reset is not needed, or the zone was already reset a few lines
above after waiting for all pending I/Os, all of that with the zone locked. So
how can it become full again at this point?

> +
>  		/* Make writes occur at the write pointer */
>  		assert(!zbd_zone_full(f, zb, min_bs));
>  		io_u->offset = zb->wp;
> +
> +		/*
> +		 * Support zone append for both regular and zoned block
> +		 * device.
> +		 */
> +		if (td->o.zone_append) {
> +			if (f->zbd_info->model == ZBD_NONE)
> +				io_u->zone_start_offset = zb->wp;
> +			else
> +				io_u->zone_start_offset = zb->start;
> +		}

This hunk is not needed. zone start offset is:

io_u->offset & ~(zone_size - 1)

with zone_size being either td->o.zone_size or f->zbd_info->zone_size. So you
can completely drop this zone_start_offset field.
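
For instance, a trivial helper would do (a sketch, with an illustrative name;
the mask form assumes a power-of-two zone size, otherwise use a modulo):

	static unsigned long long zone_start_of(const struct io_u *io_u,
						unsigned long long zone_size)
	{
		return io_u->offset & ~(zone_size - 1);
	}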

> +
>  		if (!is_valid_offset(f, io_u->offset)) {
>  			dprint(FD_ZBD, "Dropped request with offset %llu\n",
>  			       io_u->offset);
> diff --git a/zbd.h b/zbd.h
> index e942a7f..eac42f7 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -23,8 +23,10 @@ enum io_u_action {
>   * struct fio_zone_info - information about a single ZBD zone
>   * @start: zone start location (bytes)
>   * @wp: zone write pointer location (bytes)
> + * @pending_ios: Number of IO's pending in this zone

Misleading. This is the number of zone append I/Os. Only zone append. You are
not counting reads or writes with this.

>   * @verify_block: number of blocks that have been verified for this zone
>   * @mutex: protects the modifiable members in this structure
> + * @reset_cond: zone reset check condition. only relevant for zone_append.

You are using it to signal pending_ios going to 0 too.

>   * @type: zone type (BLK_ZONE_TYPE_*)
>   * @cond: zone state (BLK_ZONE_COND_*)
>   * @open: whether or not this zone is currently open. Only relevant if
> @@ -33,8 +35,10 @@ enum io_u_action {
>   */
>  struct fio_zone_info {
>  	pthread_mutex_t		mutex;
> +	pthread_cond_t		reset_cond;
>  	uint64_t		start;
>  	uint64_t		wp;
> +	uint32_t		pending_ios;
>  	uint32_t		verify_block;
>  	enum zbd_zone_type	type:2;
>  	enum zbd_zone_cond	cond:4;
> @@ -96,18 +100,19 @@ static inline void zbd_close_file(struct fio_file *f)
>  		zbd_free_zone_info(f);
>  }
>  
> -static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
> +static inline void zbd_queue_io_u(struct thread_data *td, struct io_u *io_u,
> +				  enum fio_q_status status)
>  {
>  	if (io_u->zbd_queue_io) {
> -		io_u->zbd_queue_io(io_u, status, io_u->error == 0);
> +		io_u->zbd_queue_io(td, io_u, status, io_u->error == 0);
>  		io_u->zbd_queue_io = NULL;
>  	}
>  }
>  
> -static inline void zbd_put_io_u(struct io_u *io_u)
> +static inline void zbd_put_io_u(struct thread_data *td, struct io_u *io_u)
>  {
>  	if (io_u->zbd_put_io) {
> -		io_u->zbd_put_io(io_u);
> +		io_u->zbd_put_io(td, io_u);
>  		io_u->zbd_queue_io = NULL;
>  		io_u->zbd_put_io = NULL;
>  	}
> 

So a few fixes are needed. However, my biggest concern is the change to zone
locking just for zone append... I kind of like the idea of not holding the zone
lock for the entire duration of I/Os, and the condition variable with an I/O
counter seems to be a reasonable way to do that. Have you tried extending this
locking change to all I/Os (read and write)? A prep patch could do that, and
zone append support could come on top. I am however worried that all of this may
not be that easy due to the various inspection loops for choosing a zone to read
or write, finding an open zone, or finding a zone to close & open. This needs
checking, or at least explanations in the commit message that none of it is
affected by your change.
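
As a sketch of the reset side of that pattern, reusing the pending_ios counter
and reset_cond from this patch:

	pthread_mutex_lock(&zb->mutex);
	/* Re-check in a loop: pthread_cond_wait() can wake up spuriously. */
	while (zb->pending_ios)
		pthread_cond_wait(&zb->reset_cond, &zb->mutex);
	/* All in-flight zone appends have completed; safe to reset. */
	if (zbd_reset_zone(td, f, zb) < 0)
		goto eof;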

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH 2/4] libaio: support for Zone Append command
  2020-06-25 17:38     ` [PATCH 2/4] libaio: support for Zone Append command Krishna Kanth Reddy
@ 2020-06-26  5:38       ` Damien Le Moal
  2020-07-03 10:47         ` Krishna Kanth Reddy
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2020-06-26  5:38 UTC (permalink / raw)
  To: Krishna Kanth Reddy, axboe; +Cc: fio, Ankit Kumar

On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
> From: Ankit Kumar <ankit.kumar@samsung.com>
> 
> The zone append command uses the write path with the offset set to the
> start of the zone. Upon successful completion, the device returns the
> location where the data has been placed, as a number of sectors relative
> to the zone's start offset.
> 
> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
> ---
>  engines/libaio.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/engines/libaio.c b/engines/libaio.c
> index 398fdf9..4ffe831 100644
> --- a/engines/libaio.c
> +++ b/engines/libaio.c
> @@ -123,7 +123,13 @@ static int fio_libaio_prep(struct thread_data *td, struct io_u *io_u)
>  		if (o->nowait)
>  			iocb->aio_rw_flags |= RWF_NOWAIT;
>  	} else if (io_u->ddir == DDIR_WRITE) {
> -		io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->offset);
> +		if ((td->o.zone_mode == ZONE_MODE_ZBD) && td->o.zone_append) {
> +			io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->zone_start_offset);
> +		}

No need for the added braces. And see my comment in the previous patch about
zone_start_offset. You can calculate it from io_u->offset.

> +		else
> +			io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->offset);
> +		if (td->o.zone_append)
> +			iocb->aio_rw_flags |= RWF_ZONE_APPEND;
>  		if (o->nowait)
>  			iocb->aio_rw_flags |= RWF_NOWAIT;
>  	} else if (ddir_sync(io_u->ddir))
> @@ -228,6 +234,18 @@ static int fio_libaio_getevents(struct thread_data *td, unsigned int min,
>  		} else {
>  			r = io_getevents(ld->aio_ctx, actual_min,
>  				max, ld->aio_events + events, lt);
> +			if (td->o.zone_mode == ZONE_MODE_ZBD) {
> +				struct io_event *ev;
> +				struct io_u *io_u;

Add a blank line after declarations please.

> +				for (unsigned event = 0; event < r; event++) {

Better declare event above instead of here. I know C allows it, but this looks
like C++ to me :)

> +					ev = ld->aio_events + event;
> +					io_u = container_of(ev->obj, struct io_u, iocb);
> +					if (td->o.zone_append
> +					    && td->o.do_verify && td->o.verify
> +					    && (io_u->ddir == DDIR_WRITE))

So this is done only if verify is on? Then why do the loop at all if verify is
off? Move the "td->o.zone_append && td->o.do_verify && td->o.verify" condition
outside the loop.
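
Something like this (a sketch; the zone start is computed from io_u->offset as
suggested for patch 1, assuming td->o.zone_size is set and a power of two):

	if (r > 0 && td->o.zone_mode == ZONE_MODE_ZBD && td->o.zone_append &&
	    td->o.do_verify && td->o.verify) {
		struct io_event *ev;
		struct io_u *io_u;
		unsigned int i;

		/* Walk only the events reaped by this io_getevents() call. */
		for (i = 0; i < r; i++) {
			ev = ld->aio_events + events + i;
			io_u = container_of(ev->obj, struct io_u, iocb);
			if (io_u->ddir == DDIR_WRITE)
				io_u->ipo->offset =
					(io_u->offset & ~(td->o.zone_size - 1)) +
					(ev->res2 << 9);
		}
	}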

> +						io_u->ipo->offset = io_u->zone_start_offset + (ev->res2 << 9);

Same here about the start offset.

> +				}
> +			}
>  		}
>  		if (r > 0)
>  			events += r;
> 


-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH 3/4] iouring: support for Zone Append command
  2020-06-25 17:38     ` [PATCH 3/4] iouring: " Krishna Kanth Reddy
@ 2020-06-26  5:43       ` Damien Le Moal
  2020-07-03 10:37         ` Krishna Kanth Reddy
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2020-06-26  5:43 UTC (permalink / raw)
  To: Krishna Kanth Reddy, axboe; +Cc: fio, Ankit Kumar

On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
> From: Ankit Kumar <ankit.kumar@samsung.com>
> 
> The zone append command uses the write path with the offset set to the
> start of the zone. Upon successful completion, the device returns the
> location where the data has been placed, as a number of sectors relative
> to the zone's start offset.
> 
> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
> ---
>  engines/io_uring.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/engines/io_uring.c b/engines/io_uring.c
> index cd0810f..cb7c5ba 100644
> --- a/engines/io_uring.c
> +++ b/engines/io_uring.c
> @@ -251,7 +251,13 @@ static int fio_ioring_prep(struct thread_data *td, struct io_u *io_u)
>  			sqe->ioprio = td->o.ioprio_class << 13;
>  		if (ld->ioprio_set)
>  			sqe->ioprio |= td->o.ioprio;
> -		sqe->off = io_u->offset;
> +		if (td->o.zone_append && io_u->ddir == DDIR_WRITE)
> +			sqe->rw_flags |= RWF_ZONE_APPEND;
> +		if ((td->o.zone_mode == ZONE_MODE_ZBD)
> +		     && td->o.zone_append && io_u->ddir == DDIR_WRITE) {
> +			sqe->off = io_u->zone_start_offset;
> +		} else
> +			sqe->off = io_u->offset;

Why test "td->o.zone_append && io_u->ddir == DDIR_WRITE" twice? This can be
rewritten to test it only once:

		if (td->o.zone_append && io_u->ddir == DDIR_WRITE) {
			sqe->rw_flags |= RWF_ZONE_APPEND;
			if (td->o.zone_mode == ZONE_MODE_ZBD)
				sqe->off = io_u->zone_start_offset;
			else
				sqe->off = io_u->offset;
		}

And fix the start offset as already commented, using io_u->offset and
td->o.zone_size.

>  	} else if (ddir_sync(io_u->ddir)) {
>  		if (io_u->ddir == DDIR_SYNC_FILE_RANGE) {
>  			sqe->off = f->first_write;
> @@ -324,6 +330,21 @@ static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
>  	ld->cq_ring_off = *ring->head;
>  	do {
>  		r = fio_ioring_cqring_reap(td, events, max);
> +		if (td->o.zone_mode == ZONE_MODE_ZBD) {
> +			struct io_uring_cqe *cqe;
> +			struct io_u *io_u;
> +			unsigned index;

Need a blank line here.

> +			for (unsigned event = 0; event < r; event++) {
> +				index = (event + ld->cq_ring_off) & ld->cq_ring_mask;
> +
> +				cqe = &ld->cq_ring.cqes[index];
> +				io_u = (struct io_u *) (uintptr_t) cqe->user_data;
> +
> +				if (td->o.zone_append && td->o.do_verify
> +				    && td->o.verify && (io_u->ddir == DDIR_WRITE))
> +					io_u->ipo->offset = io_u->zone_start_offset + (cqe->flags << 9);

cqe->flags needs to be cast to 64 bits or the shift could give the wrong result.
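
That is, something like (again computing the zone start from io_u->offset):

	io_u->ipo->offset = (io_u->offset & ~(td->o.zone_size - 1)) +
			    ((uint64_t)cqe->flags << 9);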

Same comments as for libaio regarding structure.

> +			}
> +		}
>  		if (r) {
>  			events += r;
>  			if (actual_min != 0)
> 


-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests
  2020-06-25 17:38     ` [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests Krishna Kanth Reddy
  2020-06-25 18:44       ` Dmitry Fomichev
@ 2020-06-26  5:45       ` Damien Le Moal
  2020-07-03  9:09         ` Krishna Kanth Reddy
  1 sibling, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2020-06-26  5:45 UTC (permalink / raw)
  To: Krishna Kanth Reddy, axboe; +Cc: fio, Ankit Kumar

On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
> Modify the test-zbd-support script to verify the Zone Append command
> for NVMe Zoned Namespaces (ZNS) defined in NVM Express TP4053.
> Added a new FIO option zone_append.
> When the zone_append option is enabled, the existing write path will
> send a Zone Append command with the LBA offset set to the start of the zone.
> 
> Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
> ---
>  t/zbd/test-zbd-support | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
> index 4001be3..ddade22 100755
> --- a/t/zbd/test-zbd-support
> +++ b/t/zbd/test-zbd-support
> @@ -801,6 +801,54 @@ test48() {
>  	    >> "${logfile}.${test_number}" 2>&1 || return $?
>  }
>  
> +# Zone append to sequential zones, libaio, 1 job, queue depth 1
> +test49() {
> +    local i size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=libaio --iodepth=1 --rw=write --zone_append=1 \
> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
> +                   --do_verify=1 --verify=md5                           \
> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $size || return $?
> +    check_read $size || return $?
> +}
> +
> +# Random zone append to sequential zones, libaio, 8 jobs, queue depth 64 per job
> +test50() {
> +    local size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4K \
> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $((size * 8)) || return $?
> +}
> +
> +# Zone append to sequential zones, io_uring, 1 job, queue depth 1
> +test51() {
> +    local i size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=io_uring --iodepth=1 --rw=write --zone_append=1 \
> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
> +                   --do_verify=1 --verify=md5                           \
> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $size || return $?
> +    check_read $size || return $?
> +}
> +
> +# Random zone append to sequential zones, io_uring, 8 jobs, queue depth 64 per job
> +test52() {
> +    local size
> +
> +    size=$((4 * zone_size))
> +    run_fio_on_seq --ioengine=io_uring --iodepth=64 --rw=randwrite --bs=4K \
> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
> +    check_written $((size * 8)) || return $?
> +}
> +
>  tests=()
>  dynamic_analyzer=()
>  reset_all_zones=
> 

If the test script is called with an SG node and the -l option for libzbc tests,
these tests will fail. Please add conditions so they are not run when they cannot
succeed.


-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH 1/4] Add Zone Append command support for NVMe Zoned Namespaces (ZNS)
  2020-06-25 17:38     ` [PATCH 1/4] Add " Krishna Kanth Reddy
  2020-06-26  5:33       ` Damien Le Moal
@ 2020-06-26  5:50       ` Damien Le Moal
  2020-07-03 16:50         ` Krishna Kanth Reddy
  1 sibling, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2020-06-26  5:50 UTC (permalink / raw)
  To: Krishna Kanth Reddy, axboe; +Cc: fio, Ankit Kumar

On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
> From: Ankit Kumar <ankit.kumar@samsung.com>
> 
> defined in NVM Express TP4053. Added a new FIO option zone_append.
> When the zone_append option is enabled, the existing write path will
> send a Zone Append command with the offset set to the start of the zone.
> 
> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
> ---
>  HOWTO            |  7 +++++
>  fio.1            |  7 +++++
>  io_u.c           |  4 +--
>  io_u.h           | 10 +++++--
>  ioengines.c      |  4 +--
>  options.c        | 10 +++++++
>  thread_options.h |  2 ++
>  zbd.c            | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  zbd.h            | 13 +++++---
>  9 files changed, 131 insertions(+), 16 deletions(-)
> 
> diff --git a/HOWTO b/HOWTO
> index 8cf8d65..62b5ac8 100644
> --- a/HOWTO
> +++ b/HOWTO
> @@ -1010,6 +1010,13 @@ Target file/device
>  	:option:`zonesize` bytes of data have been transferred. This parameter
>  	must be zero for :option:`zonemode` =zbd.
>  
> +.. option:: zone_append=bool
> +
> +	For :option:`zonemode` =zbd and for :option:`rw` =write or :option:
> +	`rw` =randwrite, if zone_append is enabled, the the io_u points to the
> +	starting offset of a zone. On successful completion the multiple of
> +	sectors relative to the zone's starting offset is returned.
> +
>  .. option:: read_beyond_wp=bool
>  
>  	This parameter applies to :option:`zonemode` =zbd only.
> diff --git a/fio.1 b/fio.1
> index f134e0b..09add8f 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -782,6 +782,13 @@ sequential workloads and ignored for random workloads. For read workloads,
>  see also \fBread_beyond_wp\fR.
>  
>  .TP
> +.BI zone_append
> +For \fBzonemode\fR =zbd and for \fBrw\fR=write or \fBrw\fR=randwrite, if
> +zone_append is enabled, the io_u points to the starting offset of a zone. On
> +successful completion the multiple of sectors relative to the zone's starting
> +offset is returned.
> +
> +.TP
>  .BI read_beyond_wp \fR=\fPbool
>  This parameter applies to \fBzonemode=zbd\fR only.
>  
> diff --git a/io_u.c b/io_u.c
> index 7f50906..b891a9b 100644
> --- a/io_u.c
> +++ b/io_u.c
> @@ -778,7 +778,7 @@ void put_io_u(struct thread_data *td, struct io_u *io_u)
>  {
>  	const bool needs_lock = td_async_processing(td);
>  
> -	zbd_put_io_u(io_u);
> +	zbd_put_io_u(td, io_u);
>  
>  	if (td->parent)
>  		td = td->parent;
> @@ -1342,7 +1342,7 @@ static long set_io_u_file(struct thread_data *td, struct io_u *io_u)
>  		if (!fill_io_u(td, io_u))
>  			break;
>  
> -		zbd_put_io_u(io_u);
> +		zbd_put_io_u(td, io_u);
>  
>  		put_file_log(td, f);
>  		td_io_close_file(td, f);
> diff --git a/io_u.h b/io_u.h
> index 87c2920..f5b24fd 100644
> --- a/io_u.h
> +++ b/io_u.h
> @@ -94,19 +94,25 @@ struct io_u {
>  	};
>  
>  	/*
> +	 * for zone append this is the start offset of the zone.
> +	 */
> +	unsigned long long zone_start_offset;
> +
> +	/*
>  	 * ZBD mode zbd_queue_io callback: called after engine->queue operation
>  	 * to advance a zone write pointer and eventually unlock the I/O zone.
>  	 * @q indicates the I/O queue status (busy, queued or completed).
>  	 * @success == true means that the I/O operation has been queued or
>  	 * completed successfully.
>  	 */
> -	void (*zbd_queue_io)(struct io_u *, int q, bool success);
> +	void (*zbd_queue_io)(struct thread_data *, struct io_u *, int q,
> +			     bool success);
>  
>  	/*
>  	 * ZBD mode zbd_put_io callback: called in after completion of an I/O
>  	 * or commit of an async I/O to unlock the I/O target zone.
>  	 */
> -	void (*zbd_put_io)(const struct io_u *);
> +	void (*zbd_put_io)(struct thread_data *, const struct io_u *);
>  
>  	/*
>  	 * Callback for io completion
> diff --git a/ioengines.c b/ioengines.c
> index 2c7a0df..81ac846 100644
> --- a/ioengines.c
> +++ b/ioengines.c
> @@ -328,7 +328,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>  	}
>  
>  	ret = td->io_ops->queue(td, io_u);
> -	zbd_queue_io_u(io_u, ret);
> +	zbd_queue_io_u(td, io_u, ret);
>  
>  	unlock_file(td, io_u->file);
>  
> @@ -370,7 +370,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>  	if (!td->io_ops->commit) {
>  		io_u_mark_submit(td, 1);
>  		io_u_mark_complete(td, 1);
> -		zbd_put_io_u(io_u);
> +		zbd_put_io_u(td, io_u);
>  	}
>  
>  	if (ret == FIO_Q_COMPLETED) {
> diff --git a/options.c b/options.c
> index 85a0f49..d54da81 100644
> --- a/options.c
> +++ b/options.c
> @@ -3317,6 +3317,16 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
>  		},
>  	},
>  	{
> +		.name	= "zone_append",
> +		.lname	= "zone_append",
> +		.type	= FIO_OPT_BOOL,
> +		.off1	= offsetof(struct thread_options, zone_append),
> +		.help	= "Use Zone Append for Zone block device",
> +		.def	= "0",
> +		.category = FIO_OPT_C_IO,
> +		.group	= FIO_OPT_G_ZONE,
> +	},
> +	{
>  		.name	= "zonesize",
>  		.lname	= "Zone size",
>  		.type	= FIO_OPT_STR_VAL,
> diff --git a/thread_options.h b/thread_options.h
> index 968ea0a..45c5ef8 100644
> --- a/thread_options.h
> +++ b/thread_options.h
> @@ -195,6 +195,7 @@ struct thread_options {
>  	unsigned long long zone_size;
>  	unsigned long long zone_skip;
>  	enum fio_zone_mode zone_mode;
> +	unsigned int zone_append;
>  	unsigned long long lockmem;
>  	enum fio_memtype mem_type;
>  	unsigned int mem_align;
> @@ -631,6 +632,7 @@ struct thread_options_pack {
>  	uint32_t allow_mounted_write;
>  
>  	uint32_t zone_mode;
> +	uint32_t zone_append;
>  } __attribute__((packed));
>  
>  extern void convert_thread_options_to_cpu(struct thread_options *o, struct thread_options_pack *top);
> diff --git a/zbd.c b/zbd.c
> index 8cf8f81..ffdb766 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -455,6 +455,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  		for (i = 0; i < nrz; i++, j++, z++, p++) {
>  			mutex_init_pshared_with_type(&p->mutex,
>  						     PTHREAD_MUTEX_RECURSIVE);
> +			cond_init_pshared(&p->reset_cond);
>  			p->start = z->start;
>  			switch (z->cond) {
>  			case ZBD_ZONE_COND_NOT_WP:
> @@ -469,6 +470,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  			}
>  			p->type = z->type;
>  			p->cond = z->cond;
> +			p->pending_ios = 0;
>  			if (j > 0 && p->start != p[-1].start + zone_size) {
>  				log_info("%s: invalid zone data\n",
>  					 f->file_name);
> @@ -1196,20 +1198,24 @@ zbd_find_zone(struct thread_data *td, struct io_u *io_u,
>  
>  /**
>   * zbd_queue_io - update the write pointer of a sequential zone
> + * @td: fio thread data.
>   * @io_u: I/O unit
>   * @success: Whether or not the I/O unit has been queued successfully
>   * @q: queueing status (busy, completed or queued).
>   *
>   * For write and trim operations, update the write pointer of the I/O unit
>   * target zone.
> + * For zone append operation, release the zone mutex
>   */
> -static void zbd_queue_io(struct io_u *io_u, int q, bool success)
> +static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
> +			 bool success)
>  {
>  	const struct fio_file *f = io_u->file;
>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
>  	struct fio_zone_info *z;
>  	uint32_t zone_idx;
>  	uint64_t zone_end;
> +	int ret;
>  
>  	if (!zbd_info)
>  		return;
> @@ -1241,6 +1247,8 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>  			zbd_info->sectors_with_data += zone_end - z->wp;
>  		pthread_mutex_unlock(&zbd_info->mutex);
>  		z->wp = zone_end;
> +		if (td->o.zone_append)
> +			z->pending_ios++;
>  		break;
>  	case DDIR_TRIM:
>  		assert(z->wp == z->start);
> @@ -1250,18 +1258,22 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>  	}
>  
>  unlock:
> -	if (!success || q != FIO_Q_QUEUED) {
> +	if (!success || q != FIO_Q_QUEUED || td->o.zone_append) {
>  		/* BUSY or COMPLETED: unlock the zone */
> -		pthread_mutex_unlock(&z->mutex);
> -		io_u->zbd_put_io = NULL;
> +		ret = pthread_mutex_unlock(&z->mutex);
> +		assert(ret == 0);
> +		if (!success || q != FIO_Q_QUEUED)
> +			io_u->zbd_put_io = NULL;
>  	}
>  }
>  
>  /**
>   * zbd_put_io - Unlock an I/O unit target zone lock
> + * For zone append operation we don't hold zone lock
> + * @td: fio thread data.
>   * @io_u: I/O unit
>   */
> -static void zbd_put_io(const struct io_u *io_u)
> +static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
>  {
>  	const struct fio_file *f = io_u->file;
>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
> @@ -1283,6 +1295,19 @@ static void zbd_put_io(const struct io_u *io_u)
>  	       "%s: terminate I/O (%lld, %llu) for zone %u\n",
>  	       f->file_name, io_u->offset, io_u->buflen, zone_idx);
>  
> +	if (td->o.zone_append) {
> +		pthread_mutex_lock(&z->mutex);
> +		if (z->pending_ios > 0) {
> +			z->pending_ios--;
> +			/*
> +			 * Other threads may be waiting for pending I/O's to
> +			 * complete for this zone. Notify them.
> +			 */
> +			if (!z->pending_ios)
> +				pthread_cond_broadcast(&z->reset_cond);
> +		}
> +	}
> +
>  	ret = pthread_mutex_unlock(&z->mutex);
>  	assert(ret == 0);
>  	zbd_check_swd(f);
> @@ -1524,16 +1549,69 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>  			 * asynchronously and since we will submit the zone
>  			 * reset synchronously, wait until previously submitted
>  			 * write requests have completed before issuing a
> -			 * zone reset.
> +			 * zone reset. For append request release the zone lock
> +			 * as other threads will acquire it at the time of
> +			 * zbd_put_io.
>  			 */
> +reset:
> +			if (td->o.zone_append)
> +				pthread_mutex_unlock(&zb->mutex);
>  			io_u_quiesce(td);
> +			if (td->o.zone_append)
> +				pthread_mutex_lock(&zb->mutex);
> +
>  			zb->reset_zone = 0;
> +			if (td->o.zone_append) {
> +				/*
> +				 * While processing the current thread queued
> +				 * requests the other thread may have already
> +				 * done zone reset so need to check zone full
> +				 * condition again.
> +				 */
> +				if (!zbd_zone_full(f, zb, min_bs))
> +					goto proceed;
> +				/*
> +				 * Wait for the pending requests to be completed
> +				 * else we are ok to reset this zone.
> +				 */
> +				if (zb->pending_ios) {
> +					pthread_cond_wait(&zb->reset_cond, &zb->mutex);
> +					goto proceed;
> +				}
> +			}
> +
>  			if (zbd_reset_zone(td, f, zb) < 0)
>  				goto eof;
> +
> +			/* Notify other threads waiting for zone mutex */
> +			if (td->o.zone_append)
> +				pthread_cond_broadcast(&zb->reset_cond);
> +		}
> +proceed:
> +		/*
> +		 * Check for zone full condition again. For zone append request
> +		 * the zone may already be reset, written and full while we
> +		 * were waiting for our turn.
> +		 */
> +		if (zbd_zone_full(f, zb, min_bs)) {
> +			goto reset;
>  		}
> +
>  		/* Make writes occur at the write pointer */
>  		assert(!zbd_zone_full(f, zb, min_bs));
>  		io_u->offset = zb->wp;
> +
> +		/*
> +		 * Support zone append for both regular and zoned block
> +		 * device.
> +		 */
> +		if (td->o.zone_append) {
> +			if (f->zbd_info->model == ZBD_NONE)
> +				io_u->zone_start_offset = zb->wp;
> +			else
> +				io_u->zone_start_offset = zb->start;
> +		}
> +
>  		if (!is_valid_offset(f, io_u->offset)) {
>  			dprint(FD_ZBD, "Dropped request with offset %llu\n",
>  			       io_u->offset);
> diff --git a/zbd.h b/zbd.h
> index e942a7f..eac42f7 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -23,8 +23,10 @@ enum io_u_action {
>   * struct fio_zone_info - information about a single ZBD zone
>   * @start: zone start location (bytes)
>   * @wp: zone write pointer location (bytes)
> + * @pending_ios: Number of IO's pending in this zone
>   * @verify_block: number of blocks that have been verified for this zone
>   * @mutex: protects the modifiable members in this structure
> + * @reset_cond: zone reset check condition. only relevant for zone_append.
>   * @type: zone type (BLK_ZONE_TYPE_*)
>   * @cond: zone state (BLK_ZONE_COND_*)
>   * @open: whether or not this zone is currently open. Only relevant if
> @@ -33,8 +35,10 @@ enum io_u_action {
>   */
>  struct fio_zone_info {
>  	pthread_mutex_t		mutex;
> +	pthread_cond_t		reset_cond;
>  	uint64_t		start;
>  	uint64_t		wp;
> +	uint32_t		pending_ios;
>  	uint32_t		verify_block;
>  	enum zbd_zone_type	type:2;
>  	enum zbd_zone_cond	cond:4;
> @@ -96,18 +100,19 @@ static inline void zbd_close_file(struct fio_file *f)
>  		zbd_free_zone_info(f);
>  }
>  
> -static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
> +static inline void zbd_queue_io_u(struct thread_data *td, struct io_u *io_u,
> +				  enum fio_q_status status)
>  {
>  	if (io_u->zbd_queue_io) {
> -		io_u->zbd_queue_io(io_u, status, io_u->error == 0);
> +		io_u->zbd_queue_io(td, io_u, status, io_u->error == 0);
>  		io_u->zbd_queue_io = NULL;
>  	}
>  }
>  
> -static inline void zbd_put_io_u(struct io_u *io_u)
> +static inline void zbd_put_io_u(struct thread_data *td, struct io_u *io_u)
>  {
>  	if (io_u->zbd_put_io) {
> -		io_u->zbd_put_io(io_u);
> +		io_u->zbd_put_io(td, io_u);
>  		io_u->zbd_queue_io = NULL;
>  		io_u->zbd_put_io = NULL;
>  	}
> 

Another thing: IO engines such as libzbc will not support any of this, nor will
engines such as psync or SG and many, many others. So it may be good to define a
new io engine flag such as FIO_ZONE_APPEND that can be set in the struct
ioengine_ops flags field in the engine declaration to indicate that the engine
supports zone append. That can be checked at initialization, instead of having
no check at all and seeing lots of weird things happen with engines that do not
support zone append.
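
A minimal sketch of what such a flag could look like (the bit value, the
init-time check and its placement are assumptions, not code from this thread):

	/* ioengines.h: a new capability bit next to the existing FIO_* flags */
	FIO_ZONE_APPEND	= 1 << 16,	/* engine can issue zone append writes */

	/* engines/io_uring.c: advertise support in the engine declaration */
	static struct ioengine_ops ioengine = {
		.name		= "io_uring",
		.version	= FIO_IOOPS_VERSION,
		.flags		= FIO_ZONE_APPEND,	/* ORed with the engine's other flags */
		/* ... queue/getevents/etc. callbacks ... */
	};

	/* at job init: fail early instead of misbehaving at I/O time */
	if (td->o.zone_append && !td_ioengine_flagged(td, FIO_ZONE_APPEND)) {
		log_err("fio: ioengine %s does not support zone_append\n",
			td->io_ops->name);
		return 1;
	}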


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests
  2020-06-25 18:44       ` Dmitry Fomichev
@ 2020-07-03  8:46         ` Krishna Kanth Reddy
  0 siblings, 0 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-07-03  8:46 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: axboe, fio, Ankit Kumar, Damien Le Moal, Shinichiro Kawasaki

On Thu, Jun 25, 2020 at 06:44:19PM +0000, Dmitry Fomichev wrote:
>Will these tests succeed if t/zbd/test-zbd-support script is run against an SMR HDD?
>Since zoned HDDs don't support Zone Append, I would expect the i/o to fail.
>I think you need to check if this device is an NVMe drive and expect the i/o failure in
>the tests below if this is not the case.
>
>More inline...
>
No, these tests fail on SMR HDDs.
We will modify the script to run these tests only on NVMe drives.
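
A guard along these lines could work (the helper name is made up; it assumes
the script exposes the target device path in $dev):

    # True only when the test target is an NVMe device node.
    is_nvme() {
        case "$(basename "$dev")" in
            nvme*) return 0 ;;
            *) return 1 ;;
        esac
    }

    # First line of each zone-append test:
    is_nvme || return 0    # skip on SMR HDDs instead of failing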

>> -----Original Message-----
>> From: fio-owner@vger.kernel.org <fio-owner@vger.kernel.org> On Behalf
>> Of Krishna Kanth Reddy
>> Sent: Thursday, June 25, 2020 1:39 PM
>> To: axboe@kernel.dk
>> Cc: fio@vger.kernel.org; Krishna Kanth Reddy <krish.reddy@samsung.com>;
>> Ankit Kumar <ankit.kumar@samsung.com>
>> Subject: [PATCH 4/4] t/zbd: Add support to verify Zone Append command
>> with libaio, io_uring IO engine tests
>>
>> Modify the test-zbd-support script to verify the Zone Append command
>> for NVMe Zoned Namespaces (ZNS) defined in NVM Express TP4053.
>> Added a new FIO option zone_append.
>> When zone_append option is enabled, the existing write path will
>> send Zone Append command with LBA offset as start of the Zone.
>>
>> Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
>> ---
>>  t/zbd/test-zbd-support | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 48 insertions(+)
>>
>> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
>> index 4001be3..ddade22 100755
>> --- a/t/zbd/test-zbd-support
>> +++ b/t/zbd/test-zbd-support
>> @@ -801,6 +801,54 @@ test48() {
>>  	    >> "${logfile}.${test_number}" 2>&1 || return $?
>>  }
>>
>> +# Zone append to sequential zones, libaio, 1 job, queue depth 1
>> +test49() {
>> +    local i size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=libaio --iodepth=1 --rw=write --zone_append=1 \
>> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
>> +                   --do_verify=1 --verify=md5                           \
>> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $size || return $?
>> +    check_read $size || return $?
>> +}
>> +
>> +# Random zone append to sequential zones, libaio, 8 jobs, queue depth 64 per job
>> +test50() {
>> +    local size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4K \
>> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
>> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $((size * 8)) || return $?
>> +}
>> +
>> +# Zone append to sequential zones, io_uring, 1 job, queue depth 1
>> +test51() {
>> +    local i size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=io_uring --iodepth=1 --rw=write --zone_append=1 \
>> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
>> +                   --do_verify=1 --verify=md5                           \
>> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $size || return $?
>> +    check_read $size || return $?
>> +}
>> +
>> +# Random zone append to sequential zones, io_uring, 8 jobs, queue depth 64 per job
>> +test52() {
>> +    local size
>> +
>> +    size=$((4 * zone_size))
>
>Maybe try some different size? It is the same in all tests.
>
Sure, we will add different sizes for the tests.

>> +    run_fio_on_seq --ioengine=io_uring --iodepth=64 --rw=randwrite --bs=4K \
>
>All tests do 4K i/o, but maybe try to run with a different block size?
>It could be a good idea to add a test that will write with bs=ZASL(or MDTS).
>Yet another test issuing i/o with bs exceeding the maximum i/o size would
>be very useful.
>
Ok, this input is very helpful. We will implement the tests as per your
review comments.
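
A large-block variant could follow the same pattern as the existing tests
(testNN and max_append_bytes are placeholders; the latter would have to be
read from the device, e.g. the ZASL field):

    # Zone append with the largest block size the device accepts.
    testNN() {
        local size
        size=$((4 * zone_size))
        run_fio_on_seq --ioengine=io_uring --iodepth=1 --rw=write \
                       --zone_append=1 --bs="$max_append_bytes" \
                       >> "${logfile}.${test_number}" 2>&1 || return $?
        check_written $size || return $?
    }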

>> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
>> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $((size * 8)) || return $?
>> +}
>> +
>>  tests=()
>>  dynamic_analyzer=()
>>  reset_all_zones=
>> --
>> 2.7.4
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests
  2020-06-26  5:45       ` Damien Le Moal
@ 2020-07-03  9:09         ` Krishna Kanth Reddy
  0 siblings, 0 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-07-03  9:09 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio, Ankit Kumar

On Fri, Jun 26, 2020 at 05:45:49AM +0000, Damien Le Moal wrote:
>On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
>> Modify the test-zbd-support script to verify the Zone Append command
>> for NVMe Zoned Namespaces (ZNS) defined in NVM Express TP4053.
>> Added a new FIO option zone_append.
>> When zone_append option is enabled, the existing write path will
>> send Zone Append command with LBA offset as start of the Zone.
>>
>> Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
>> ---
>>  t/zbd/test-zbd-support | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 48 insertions(+)
>>
>> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
>> index 4001be3..ddade22 100755
>> --- a/t/zbd/test-zbd-support
>> +++ b/t/zbd/test-zbd-support
>> @@ -801,6 +801,54 @@ test48() {
>>  	    >> "${logfile}.${test_number}" 2>&1 || return $?
>>  }
>>
>> +# Zone append to sequential zones, libaio, 1 job, queue depth 1
>> +test49() {
>> +    local i size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=libaio --iodepth=1 --rw=write --zone_append=1 \
>> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
>> +                   --do_verify=1 --verify=md5                           \
>> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $size || return $?
>> +    check_read $size || return $?
>> +}
>> +
>> +# Random zone append to sequential zones, libaio, 8 jobs, queue depth 64 per job
>> +test50() {
>> +    local size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4K \
>> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
>> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $((size * 8)) || return $?
>> +}
>> +
>> +# Zone append to sequential zones, io_uring, 1 job, queue depth 1
>> +test51() {
>> +    local i size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=io_uring --iodepth=1 --rw=write --zone_append=1 \
>> +                   --bs="$(max $((zone_size / 64)) "$logical_block_size")"\
>> +                   --do_verify=1 --verify=md5                           \
>> +                   >>"${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $size || return $?
>> +    check_read $size || return $?
>> +}
>> +
>> +# Random zone append to sequential zones, io_uring, 8 jobs, queue depth 64 per job
>> +test52() {
>> +    local size
>> +
>> +    size=$((4 * zone_size))
>> +    run_fio_on_seq --ioengine=io_uring --iodepth=64 --rw=randwrite --bs=4K \
>> +                   --group_reporting=1 --numjobs=8 --zone_append=1 \
>> +                   >> "${logfile}.${test_number}" 2>&1 || return $?
>> +    check_written $((size * 8)) || return $?
>> +}
>> +
>>  tests=()
>>  dynamic_analyzer=()
>>  reset_all_zones=
>>
>
>If the test script was called using an SG node with the -l option for libzbc tests,
>these tests will fail. Please add conditions to skip these tests when they cannot run.
>
Sure, we will add conditions so that these tests are skipped in such cases.
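
One possible shape for that condition (assuming the script records the -l
mode in a use_libzbc variable, which is an assumption here):

    # First line of each zone-append test: libzbc/SG cannot issue zone
    # append, so skip rather than fail.
    [ -z "$use_libzbc" ] || return 0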

>
>-- 
>Damien Le Moal
>Western Digital Research
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] iouring: support for Zone Append command
  2020-06-26  5:43       ` Damien Le Moal
@ 2020-07-03 10:37         ` Krishna Kanth Reddy
  0 siblings, 0 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-07-03 10:37 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio, Ankit Kumar

On Fri, Jun 26, 2020 at 05:43:27AM +0000, Damien Le Moal wrote:
>On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
>> From: Ankit Kumar <ankit.kumar@samsung.com>
>>
>> The zone append command uses the write path with offset as
>> start of the zone. Upon successful completion, multiple of
>> sectors relative to the zone's start offset, where the data
>> has been placed is returned.
>>
>> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
>> ---
>>  engines/io_uring.c | 23 ++++++++++++++++++++++-
>>  1 file changed, 22 insertions(+), 1 deletion(-)
>>
>> diff --git a/engines/io_uring.c b/engines/io_uring.c
>> index cd0810f..cb7c5ba 100644
>> --- a/engines/io_uring.c
>> +++ b/engines/io_uring.c
>> @@ -251,7 +251,13 @@ static int fio_ioring_prep(struct thread_data *td, struct io_u *io_u)
>>  			sqe->ioprio = td->o.ioprio_class << 13;
>>  		if (ld->ioprio_set)
>>  			sqe->ioprio |= td->o.ioprio;
>> -		sqe->off = io_u->offset;
>> +		if (td->o.zone_append && io_u->ddir == DDIR_WRITE)
>> +			sqe->rw_flags |= RWF_ZONE_APPEND;
>> +		if ((td->o.zone_mode == ZONE_MODE_ZBD)
>> +		     && td->o.zone_append && io_u->ddir == DDIR_WRITE) {
>> +			sqe->off = io_u->zone_start_offset;
>> +		} else
>> +			sqe->off = io_u->offset;
>
>Why test "td->o.zone_append && io_u->ddir == DDIR_WRITE" twice? This can be
>rewritten so the condition is only tested once:
>
>		if (td->o.zone_append && io_u->ddir == DDIR_WRITE) {
>			sqe->rw_flags |= RWF_ZONE_APPEND;
>			if (td->o.zone_mode == ZONE_MODE_ZBD)
>				sqe->off = io_u->zone_start_offset;
>			else
>				sqe->off = io_u->offset;
>		}
>
>And fix the start offset as already commented, using io_u->offset and
>td->o.zone_size.
>
Sure, we will remove the duplicate check and fix the start offset as per your
comments.
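
Put together, the fix might read as follows (a sketch only; the mask assumes a
power-of-two zone size, and whether zone_size comes from td->o.zone_size or
f->zbd_info->zone_size is discussed elsewhere in the thread):

	if (td->o.zone_append && io_u->ddir == DDIR_WRITE) {
		sqe->rw_flags |= RWF_ZONE_APPEND;
		if (td->o.zone_mode == ZONE_MODE_ZBD)
			/* zone start derived from the offset, no extra field */
			sqe->off = io_u->offset & ~(td->o.zone_size - 1);
		else
			sqe->off = io_u->offset;
	}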

>>  	} else if (ddir_sync(io_u->ddir)) {
>>  		if (io_u->ddir == DDIR_SYNC_FILE_RANGE) {
>>  			sqe->off = f->first_write;
>> @@ -324,6 +330,21 @@ static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
>>  	ld->cq_ring_off = *ring->head;
>>  	do {
>>  		r = fio_ioring_cqring_reap(td, events, max);
>> +		if (td->o.zone_mode == ZONE_MODE_ZBD) {
>> +			struct io_uring_cqe *cqe;
>> +			struct io_u *io_u;
>> +			unsigned index;
>
>need blank line.
>
Ok

>> +			for (unsigned event = 0; event < r; event++) {
>> +				index = (event + ld->cq_ring_off) & ld->cq_ring_mask;
>> +
>> +				cqe = &ld->cq_ring.cqes[index];
>> +				io_u = (struct io_u *) (uintptr_t) cqe->user_data;
>> +
>> +				if (td->o.zone_append && td->o.do_verify
>> +				    && td->o.verify && (io_u->ddir == DDIR_WRITE))
>> +					io_u->ipo->offset = io_u->zone_start_offset + (cqe->flags << 9);
>
>cqe->flags needs to be cast to 64bit or the shift could give the wrong result.
>
>Same comments as for libaio regarding structure.
>
Ok, sure, we will add the cast, as there is a chance of the shift overflowing.
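
That is, widening before the shift so large completion results cannot wrap
(zone start shown with the mask form suggested in the review):

	io_u->ipo->offset = (io_u->offset & ~(td->o.zone_size - 1)) +
			    ((uint64_t)cqe->flags << 9);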

>> +			}
>> +		}
>>  		if (r) {
>>  			events += r;
>>  			if (actual_min != 0)
>>
>
>
>-- 
>Damien Le Moal
>Western Digital Research
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] libaio: support for Zone Append command
  2020-06-26  5:38       ` Damien Le Moal
@ 2020-07-03 10:47         ` Krishna Kanth Reddy
  0 siblings, 0 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-07-03 10:47 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio, Ankit Kumar

On Fri, Jun 26, 2020 at 05:38:53AM +0000, Damien Le Moal wrote:
>On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
>> From: Ankit Kumar <ankit.kumar@samsung.com>
>>
>> The zone append command uses the write path with offset as
>> start of the zone. Upon successful completion, multiple of
>> sectors relative to the zone's start offset, where the data
>> has been placed is returned.
>>
>> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
>> ---
>>  engines/libaio.c | 20 +++++++++++++++++++-
>>  1 file changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/engines/libaio.c b/engines/libaio.c
>> index 398fdf9..4ffe831 100644
>> --- a/engines/libaio.c
>> +++ b/engines/libaio.c
>> @@ -123,7 +123,13 @@ static int fio_libaio_prep(struct thread_data *td, struct io_u *io_u)
>>  		if (o->nowait)
>>  			iocb->aio_rw_flags |= RWF_NOWAIT;
>>  	} else if (io_u->ddir == DDIR_WRITE) {
>> -		io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->offset);
>> +		if ((td->o.zone_mode == ZONE_MODE_ZBD) && td->o.zone_append) {
>> +			io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->zone_start_offset);
>> +		}
>
>No need for the added braces. And see my comment in the previous patch about
>zone_start_offset. You can calculate it from io_u->offset.
>
Sure, will remove the braces and zone_start_offset. Offset will be
calculated as per your comments.
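
That is, something like this sketch (same power-of-two zone size assumption as
above):

	} else if (io_u->ddir == DDIR_WRITE) {
		unsigned long long off = io_u->offset;

		/* for zone append, point the iocb at the start of the zone */
		if (td->o.zone_mode == ZONE_MODE_ZBD && td->o.zone_append)
			off &= ~(td->o.zone_size - 1);
		io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, off);
		if (td->o.zone_append)
			iocb->aio_rw_flags |= RWF_ZONE_APPEND;
		if (o->nowait)
			iocb->aio_rw_flags |= RWF_NOWAIT;
	}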

>> +		else
>> +			io_prep_pwrite(iocb, f->fd, io_u->xfer_buf, io_u->xfer_buflen, io_u->offset);
>> +		if (td->o.zone_append)
>> +			iocb->aio_rw_flags |= RWF_ZONE_APPEND;
>>  		if (o->nowait)
>>  			iocb->aio_rw_flags |= RWF_NOWAIT;
>>  	} else if (ddir_sync(io_u->ddir))
>> @@ -228,6 +234,18 @@ static int fio_libaio_getevents(struct thread_data *td, unsigned int min,
>>  		} else {
>>  			r = io_getevents(ld->aio_ctx, actual_min,
>>  				max, ld->aio_events + events, lt);
>> +			if (td->o.zone_mode == ZONE_MODE_ZBD) {
>> +				struct io_event *ev;
>> +				struct io_u *io_u;
>
>Add a blank line after declarations please.
>
Sure.

>> +				for (unsigned event = 0; event < r; event++) {
>
>Better declare event above instead of here. I know C allows it, but this looks
>like C++ to me :)
>
Yeah sure :) Will modify this.

>> +					ev = ld->aio_events + event;
>> +					io_u = container_of(ev->obj, struct io_u, iocb);
>> +					if (td->o.zone_append
>> +					    && td->o.do_verify && td->o.verify
>> +					    && (io_u->ddir == DDIR_WRITE))
>
>So this is done only if verify is on ? Then why do the loop at all if verify is
>off ? Move the "td->o.zone_append && td->o.do_verify && td->o.verify" condition
>outside the loop.
>
Yes, right. Will move this outside the loop.
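
Restructured along those lines, the reaping path might look like this sketch
(again assuming a power-of-two zone size for the mask):

	if (r > 0 && td->o.zone_mode == ZONE_MODE_ZBD && td->o.zone_append &&
	    td->o.do_verify && td->o.verify) {
		struct io_event *ev;
		struct io_u *io_u;
		unsigned int event;

		/* patch write-completion offsets only when verify needs them */
		for (event = 0; event < (unsigned int)r; event++) {
			ev = ld->aio_events + event;
			io_u = container_of(ev->obj, struct io_u, iocb);
			if (io_u->ddir == DDIR_WRITE)
				io_u->ipo->offset =
					(io_u->offset & ~(td->o.zone_size - 1)) +
					((uint64_t)ev->res2 << 9);
		}
	}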

>> +						io_u->ipo->offset = io_u->zone_start_offset + (ev->res2 << 9);
>
>same here about start offset.
>
Ok, sure.

>> +				}
>> +			}
>>  		}
>>  		if (r > 0)
>>  			events += r;
>>
>
>
>-- 
>Damien Le Moal
>Western Digital Research
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] Add Zone Append command support for NVMe Zoned Namespaces (ZNS)
  2020-06-26  5:50       ` Damien Le Moal
@ 2020-07-03 16:50         ` Krishna Kanth Reddy
  0 siblings, 0 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-07-03 16:50 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio, Ankit Kumar

On Fri, Jun 26, 2020 at 05:50:10AM +0000, Damien Le Moal wrote:
>On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
>> From: Ankit Kumar <ankit.kumar@samsung.com>
>>
>> defined in NVM Express TP4053. Added a new FIO option zone_append.
>> When zone_append option is enabled, the existing write path will
>> send Zone Append command with offset as start of the Zone.
>>
>> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
>> ---
>>  HOWTO            |  7 +++++
>>  fio.1            |  7 +++++
>>  io_u.c           |  4 +--
>>  io_u.h           | 10 +++++--
>>  ioengines.c      |  4 +--
>>  options.c        | 10 +++++++
>>  thread_options.h |  2 ++
>>  zbd.c            | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  zbd.h            | 13 +++++---
>>  9 files changed, 131 insertions(+), 16 deletions(-)
>>
>> diff --git a/HOWTO b/HOWTO
>> index 8cf8d65..62b5ac8 100644
>> --- a/HOWTO
>> +++ b/HOWTO
>> @@ -1010,6 +1010,13 @@ Target file/device
>>  	:option:`zonesize` bytes of data have been transferred. This parameter
>>  	must be zero for :option:`zonemode` =zbd.
>>
>> +.. option:: zone_append=bool
>> +
>> +	For :option:`zonemode` =zbd and for :option:`rw` =write or :option:
>> +	`rw` =randwrite, if zone_append is enabled, the the io_u points to the
>> +	starting offset of a zone. On successful completion the multiple of
>> +	sectors relative to the zone's starting offset is returned.
>> +
>>  .. option:: read_beyond_wp=bool
>>
>>  	This parameter applies to :option:`zonemode` =zbd only.
>> diff --git a/fio.1 b/fio.1
>> index f134e0b..09add8f 100644
>> --- a/fio.1
>> +++ b/fio.1
>> @@ -782,6 +782,13 @@ sequential workloads and ignored for random workloads. For read workloads,
>>  see also \fBread_beyond_wp\fR.
>>
>>  .TP
>> +.BI zone_append
>> +For \fBzonemode\fR =zbd and for \fBrw\fR=write or \fBrw\fR=randwrite, if
>> +zone_append is enabled, the io_u points to the starting offset of a zone. On
>> +successful completion the multiple of sectors relative to the zone's starting
>> +offset is returned.
>> +
>> +.TP
>>  .BI read_beyond_wp \fR=\fPbool
>>  This parameter applies to \fBzonemode=zbd\fR only.
>>
>> diff --git a/io_u.c b/io_u.c
>> index 7f50906..b891a9b 100644
>> --- a/io_u.c
>> +++ b/io_u.c
>> @@ -778,7 +778,7 @@ void put_io_u(struct thread_data *td, struct io_u *io_u)
>>  {
>>  	const bool needs_lock = td_async_processing(td);
>>
>> -	zbd_put_io_u(io_u);
>> +	zbd_put_io_u(td, io_u);
>>
>>  	if (td->parent)
>>  		td = td->parent;
>> @@ -1342,7 +1342,7 @@ static long set_io_u_file(struct thread_data *td, struct io_u *io_u)
>>  		if (!fill_io_u(td, io_u))
>>  			break;
>>
>> -		zbd_put_io_u(io_u);
>> +		zbd_put_io_u(td, io_u);
>>
>>  		put_file_log(td, f);
>>  		td_io_close_file(td, f);
>> diff --git a/io_u.h b/io_u.h
>> index 87c2920..f5b24fd 100644
>> --- a/io_u.h
>> +++ b/io_u.h
>> @@ -94,19 +94,25 @@ struct io_u {
>>  	};
>>
>>  	/*
>> +	 * for zone append this is the start offset of the zone.
>> +	 */
>> +	unsigned long long zone_start_offset;
>> +
>> +	/*
>>  	 * ZBD mode zbd_queue_io callback: called after engine->queue operation
>>  	 * to advance a zone write pointer and eventually unlock the I/O zone.
>>  	 * @q indicates the I/O queue status (busy, queued or completed).
>>  	 * @success == true means that the I/O operation has been queued or
>>  	 * completed successfully.
>>  	 */
>> -	void (*zbd_queue_io)(struct io_u *, int q, bool success);
>> +	void (*zbd_queue_io)(struct thread_data *, struct io_u *, int q,
>> +			     bool success);
>>
>>  	/*
>>  	 * ZBD mode zbd_put_io callback: called in after completion of an I/O
>>  	 * or commit of an async I/O to unlock the I/O target zone.
>>  	 */
>> -	void (*zbd_put_io)(const struct io_u *);
>> +	void (*zbd_put_io)(struct thread_data *, const struct io_u *);
>>
>>  	/*
>>  	 * Callback for io completion
>> diff --git a/ioengines.c b/ioengines.c
>> index 2c7a0df..81ac846 100644
>> --- a/ioengines.c
>> +++ b/ioengines.c
>> @@ -328,7 +328,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>>  	}
>>
>>  	ret = td->io_ops->queue(td, io_u);
>> -	zbd_queue_io_u(io_u, ret);
>> +	zbd_queue_io_u(td, io_u, ret);
>>
>>  	unlock_file(td, io_u->file);
>>
>> @@ -370,7 +370,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>>  	if (!td->io_ops->commit) {
>>  		io_u_mark_submit(td, 1);
>>  		io_u_mark_complete(td, 1);
>> -		zbd_put_io_u(io_u);
>> +		zbd_put_io_u(td, io_u);
>>  	}
>>
>>  	if (ret == FIO_Q_COMPLETED) {
>> diff --git a/options.c b/options.c
>> index 85a0f49..d54da81 100644
>> --- a/options.c
>> +++ b/options.c
>> @@ -3317,6 +3317,16 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
>>  		},
>>  	},
>>  	{
>> +		.name	= "zone_append",
>> +		.lname	= "zone_append",
>> +		.type	= FIO_OPT_BOOL,
>> +		.off1	= offsetof(struct thread_options, zone_append),
>> +		.help	= "Use Zone Append for Zone block device",
>> +		.def	= "0",
>> +		.category = FIO_OPT_C_IO,
>> +		.group	= FIO_OPT_G_ZONE,
>> +	},
>> +	{
>>  		.name	= "zonesize",
>>  		.lname	= "Zone size",
>>  		.type	= FIO_OPT_STR_VAL,
>> diff --git a/thread_options.h b/thread_options.h
>> index 968ea0a..45c5ef8 100644
>> --- a/thread_options.h
>> +++ b/thread_options.h
>> @@ -195,6 +195,7 @@ struct thread_options {
>>  	unsigned long long zone_size;
>>  	unsigned long long zone_skip;
>>  	enum fio_zone_mode zone_mode;
>> +	unsigned int zone_append;
>>  	unsigned long long lockmem;
>>  	enum fio_memtype mem_type;
>>  	unsigned int mem_align;
>> @@ -631,6 +632,7 @@ struct thread_options_pack {
>>  	uint32_t allow_mounted_write;
>>
>>  	uint32_t zone_mode;
>> +	uint32_t zone_append;
>>  } __attribute__((packed));
>>
>>  extern void convert_thread_options_to_cpu(struct thread_options *o, struct thread_options_pack *top);
>> diff --git a/zbd.c b/zbd.c
>> index 8cf8f81..ffdb766 100644
>> --- a/zbd.c
>> +++ b/zbd.c
>> @@ -455,6 +455,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>>  		for (i = 0; i < nrz; i++, j++, z++, p++) {
>>  			mutex_init_pshared_with_type(&p->mutex,
>>  						     PTHREAD_MUTEX_RECURSIVE);
>> +			cond_init_pshared(&p->reset_cond);
>>  			p->start = z->start;
>>  			switch (z->cond) {
>>  			case ZBD_ZONE_COND_NOT_WP:
>> @@ -469,6 +470,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>>  			}
>>  			p->type = z->type;
>>  			p->cond = z->cond;
>> +			p->pending_ios = 0;
>>  			if (j > 0 && p->start != p[-1].start + zone_size) {
>>  				log_info("%s: invalid zone data\n",
>>  					 f->file_name);
>> @@ -1196,20 +1198,24 @@ zbd_find_zone(struct thread_data *td, struct io_u *io_u,
>>
>>  /**
>>   * zbd_queue_io - update the write pointer of a sequential zone
>> + * @td: fio thread data.
>>   * @io_u: I/O unit
>>   * @success: Whether or not the I/O unit has been queued successfully
>>   * @q: queueing status (busy, completed or queued).
>>   *
>>   * For write and trim operations, update the write pointer of the I/O unit
>>   * target zone.
>> + * For zone append operation, release the zone mutex
>>   */
>> -static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>> +static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
>> +			 bool success)
>>  {
>>  	const struct fio_file *f = io_u->file;
>>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
>>  	struct fio_zone_info *z;
>>  	uint32_t zone_idx;
>>  	uint64_t zone_end;
>> +	int ret;
>>
>>  	if (!zbd_info)
>>  		return;
>> @@ -1241,6 +1247,8 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>>  			zbd_info->sectors_with_data += zone_end - z->wp;
>>  		pthread_mutex_unlock(&zbd_info->mutex);
>>  		z->wp = zone_end;
>> +		if (td->o.zone_append)
>> +			z->pending_ios++;
>>  		break;
>>  	case DDIR_TRIM:
>>  		assert(z->wp == z->start);
>> @@ -1250,18 +1258,22 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>>  	}
>>
>>  unlock:
>> -	if (!success || q != FIO_Q_QUEUED) {
>> +	if (!success || q != FIO_Q_QUEUED || td->o.zone_append) {
>>  		/* BUSY or COMPLETED: unlock the zone */
>> -		pthread_mutex_unlock(&z->mutex);
>> -		io_u->zbd_put_io = NULL;
>> +		ret = pthread_mutex_unlock(&z->mutex);
>> +		assert(ret == 0);
>> +		if (!success || q != FIO_Q_QUEUED)
>> +			io_u->zbd_put_io = NULL;
>>  	}
>>  }
>>
>>  /**
>>   * zbd_put_io - Unlock an I/O unit target zone lock
>> + * For zone append operation we don't hold zone lock
>> + * @td: fio thread data.
>>   * @io_u: I/O unit
>>   */
>> -static void zbd_put_io(const struct io_u *io_u)
>> +static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
>>  {
>>  	const struct fio_file *f = io_u->file;
>>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
>> @@ -1283,6 +1295,19 @@ static void zbd_put_io(const struct io_u *io_u)
>>  	       "%s: terminate I/O (%lld, %llu) for zone %u\n",
>>  	       f->file_name, io_u->offset, io_u->buflen, zone_idx);
>>
>> +	if (td->o.zone_append) {
>> +		pthread_mutex_lock(&z->mutex);
>> +		if (z->pending_ios > 0) {
>> +			z->pending_ios--;
>> +			/*
>> +			 * Other threads may be waiting for pending I/O's to
>> +			 * complete for this zone. Notify them.
>> +			 */
>> +			if (!z->pending_ios)
>> +				pthread_cond_broadcast(&z->reset_cond);
>> +		}
>> +	}
>> +
>>  	ret = pthread_mutex_unlock(&z->mutex);
>>  	assert(ret == 0);
>>  	zbd_check_swd(f);
>> @@ -1524,16 +1549,69 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>>  			 * asynchronously and since we will submit the zone
>>  			 * reset synchronously, wait until previously submitted
>>  			 * write requests have completed before issuing a
>> -			 * zone reset.
>> +			 * zone reset. For append request release the zone lock
>> +			 * as other threads will acquire it at the time of
>> +			 * zbd_put_io.
>>  			 */
>> +reset:
>> +			if (td->o.zone_append)
>> +				pthread_mutex_unlock(&zb->mutex);
>>  			io_u_quiesce(td);
>> +			if (td->o.zone_append)
>> +				pthread_mutex_lock(&zb->mutex);
>> +
>>  			zb->reset_zone = 0;
>> +			if (td->o.zone_append) {
>> +				/*
>> +				 * While processing the current thread queued
>> +				 * requests the other thread may have already
>> +				 * done zone reset so need to check zone full
>> +				 * condition again.
>> +				 */
>> +				if (!zbd_zone_full(f, zb, min_bs))
>> +					goto proceed;
>> +				/*
>> +				 * Wait for the pending requests to be completed
>> +				 * else we are ok to reset this zone.
>> +				 */
>> +				if (zb->pending_ios) {
>> +					pthread_cond_wait(&zb->reset_cond, &zb->mutex);
>> +					goto proceed;
>> +				}
>> +			}
>> +
>>  			if (zbd_reset_zone(td, f, zb) < 0)
>>  				goto eof;
>> +
>> +			/* Notify other threads waiting for zone mutex */
>> +			if (td->o.zone_append)
>> +				pthread_cond_broadcast(&zb->reset_cond);
>> +		}
>> +proceed:
>> +		/*
>> +		 * Check for zone full condition again. For zone append request
>> +		 * the zone may already be reset, written and full while we
>> +		 * were waiting for our turn.
>> +		 */
>> +		if (zbd_zone_full(f, zb, min_bs)) {
>> +			goto reset;
>>  		}
>> +
>>  		/* Make writes occur at the write pointer */
>>  		assert(!zbd_zone_full(f, zb, min_bs));
>>  		io_u->offset = zb->wp;
>> +
>> +		/*
>> +		 * Support zone append for both regular and zoned block
>> +		 * device.
>> +		 */
>> +		if (td->o.zone_append) {
>> +			if (f->zbd_info->model == ZBD_NONE)
>> +				io_u->zone_start_offset = zb->wp;
>> +			else
>> +				io_u->zone_start_offset = zb->start;
>> +		}
>> +
>>  		if (!is_valid_offset(f, io_u->offset)) {
>>  			dprint(FD_ZBD, "Dropped request with offset %llu\n",
>>  			       io_u->offset);
>> diff --git a/zbd.h b/zbd.h
>> index e942a7f..eac42f7 100644
>> --- a/zbd.h
>> +++ b/zbd.h
>> @@ -23,8 +23,10 @@ enum io_u_action {
>>   * struct fio_zone_info - information about a single ZBD zone
>>   * @start: zone start location (bytes)
>>   * @wp: zone write pointer location (bytes)
>> + * @pending_ios: Number of IO's pending in this zone
>>   * @verify_block: number of blocks that have been verified for this zone
>>   * @mutex: protects the modifiable members in this structure
>> + * @reset_cond: zone reset check condition. only relevant for zone_append.
>>   * @type: zone type (BLK_ZONE_TYPE_*)
>>   * @cond: zone state (BLK_ZONE_COND_*)
>>   * @open: whether or not this zone is currently open. Only relevant if
>> @@ -33,8 +35,10 @@ enum io_u_action {
>>   */
>>  struct fio_zone_info {
>>  	pthread_mutex_t		mutex;
>> +	pthread_cond_t		reset_cond;
>>  	uint64_t		start;
>>  	uint64_t		wp;
>> +	uint32_t		pending_ios;
>>  	uint32_t		verify_block;
>>  	enum zbd_zone_type	type:2;
>>  	enum zbd_zone_cond	cond:4;
>> @@ -96,18 +100,19 @@ static inline void zbd_close_file(struct fio_file *f)
>>  		zbd_free_zone_info(f);
>>  }
>>
>> -static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
>> +static inline void zbd_queue_io_u(struct thread_data *td, struct io_u *io_u,
>> +				  enum fio_q_status status)
>>  {
>>  	if (io_u->zbd_queue_io) {
>> -		io_u->zbd_queue_io(io_u, status, io_u->error == 0);
>> +		io_u->zbd_queue_io(td, io_u, status, io_u->error == 0);
>>  		io_u->zbd_queue_io = NULL;
>>  	}
>>  }
>>
>> -static inline void zbd_put_io_u(struct io_u *io_u)
>> +static inline void zbd_put_io_u(struct thread_data *td, struct io_u *io_u)
>>  {
>>  	if (io_u->zbd_put_io) {
>> -		io_u->zbd_put_io(io_u);
>> +		io_u->zbd_put_io(td, io_u);
>>  		io_u->zbd_queue_io = NULL;
>>  		io_u->zbd_put_io = NULL;
>>  	}
>>
>
>Another thing: IO engines such as libzbc will not support any of this, nor will
>engines such as psync or SG and many, many others. So it may be good to define a
>new io engine flag such as FIO_ZONE_APPEND that can be set in the struct
>ioengine_ops flags field in the engine declaration to indicate that the engine
>supports zone append. That can be checked at initialization, instead of having
>no check at all and seeing lots of weird things happen with engines that do not
>support zone append.
>
We can add a new IO engine flag to indicate that the engine supports
Zone Append.
As part of this initial version, we plan to introduce Zone Append as an
extension of Write. This will remove all the zbd.c and zbd.h related
code changes (pending_ios, reset_cond).
I wonder, though: since there are no zbd-related code changes left and
zone_start_offset is being removed, the other IO engines will simply
issue a regular write even when zone_append is enabled. So do we still
need this new IO engine flag?
>
>-- 
>Damien Le Moal
>Western Digital Research
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] Add Zone Append command support for NVMe Zoned Namespaces (ZNS)
  2020-06-26  5:33       ` Damien Le Moal
@ 2020-07-03 17:17         ` Krishna Kanth Reddy
  0 siblings, 0 replies; 17+ messages in thread
From: Krishna Kanth Reddy @ 2020-07-03 17:17 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio, Ankit Kumar

On Fri, Jun 26, 2020 at 05:33:01AM +0000, Damien Le Moal wrote:
>On 2020/06/26 2:41, Krishna Kanth Reddy wrote:
>> From: Ankit Kumar <ankit.kumar@samsung.com>
>>
>> defined in NVM Express TP4053. Added a new FIO option zone_append.
>> When zone_append option is enabled, the existing write path will
>> send Zone Append command with offset as start of the Zone.
>
>Zone append is supported by SCSI disks too through emulation and will be for
>nullblk too, so please generalize this patch title and commit message. This is
>not for NVMe ZNS devices only.
>
Sure, we will take care of this in future commit messages and titles.

>>
>> Signed-off-by: Krishna Kanth Reddy <krish.reddy@samsung.com>
>> ---
>>  HOWTO            |  7 +++++
>>  fio.1            |  7 +++++
>>  io_u.c           |  4 +--
>>  io_u.h           | 10 +++++--
>>  ioengines.c      |  4 +--
>>  options.c        | 10 +++++++
>>  thread_options.h |  2 ++
>>  zbd.c            | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  zbd.h            | 13 +++++---
>>  9 files changed, 131 insertions(+), 16 deletions(-)
>>
>> diff --git a/HOWTO b/HOWTO
>> index 8cf8d65..62b5ac8 100644
>> --- a/HOWTO
>> +++ b/HOWTO
>> @@ -1010,6 +1010,13 @@ Target file/device
>>  	:option:`zonesize` bytes of data have been transferred. This parameter
>>  	must be zero for :option:`zonemode` =zbd.
>>
>> +.. option:: zone_append=bool
>> +
>> +	For :option:`zonemode` =zbd and for :option:`rw` =write or :option:
>> +	`rw` =randwrite, if zone_append is enabled, the the io_u points to the
>
>s/the the io_u/the io_u
Ok.

>
>> +	starting offset of a zone. On successful completion the multiple of
>> +	sectors relative to the zone's starting offset is returned.
>
>You are describing this option's effect from the code perspective. That is not
>helpful for a user who does not know fio internals. Please describe this from
>the user perspective, at a high level: what the option does, which is to use
>the zone append write command instead of a regular write command.
>
Ok, will describe this option from the user perspective.
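
For instance, a user-facing wording could be (a suggestion only, not the text
that was eventually merged):

	.. option:: zone_append=bool

		For :option:`zonemode` =zbd, issue Zone Append commands instead
		of regular writes for write and randwrite workloads. The device
		chooses where inside the target zone the data is written and
		reports that location on completion.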

>> +
>>  .. option:: read_beyond_wp=bool
>>
>>  	This parameter applies to :option:`zonemode` =zbd only.
>> diff --git a/fio.1 b/fio.1
>> index f134e0b..09add8f 100644
>> --- a/fio.1
>> +++ b/fio.1
>> @@ -782,6 +782,13 @@ sequential workloads and ignored for random workloads. For read workloads,
>>  see also \fBread_beyond_wp\fR.
>>
>>  .TP
>> +.BI zone_append
>> +For \fBzonemode\fR =zbd and for \fBrw\fR=write or \fBrw\fR=randwrite, if
>> +zone_append is enabled, the io_u points to the starting offset of a zone. On
>> +successful completion the multiple of sectors relative to the zone's starting
>> +offset is returned.
>
>Same comment here.
>
Ok.

>> +
>> +.TP
>>  .BI read_beyond_wp \fR=\fPbool
>>  This parameter applies to \fBzonemode=zbd\fR only.
>>
>> diff --git a/io_u.c b/io_u.c
>> index 7f50906..b891a9b 100644
>> --- a/io_u.c
>> +++ b/io_u.c
>> @@ -778,7 +778,7 @@ void put_io_u(struct thread_data *td, struct io_u *io_u)
>>  {
>>  	const bool needs_lock = td_async_processing(td);
>>
>> -	zbd_put_io_u(io_u);
>> +	zbd_put_io_u(td, io_u);
>>
>>  	if (td->parent)
>>  		td = td->parent;
>> @@ -1342,7 +1342,7 @@ static long set_io_u_file(struct thread_data *td, struct io_u *io_u)
>>  		if (!fill_io_u(td, io_u))
>>  			break;
>>
>> -		zbd_put_io_u(io_u);
>> +		zbd_put_io_u(td, io_u);
>>
>>  		put_file_log(td, f);
>>  		td_io_close_file(td, f);
>> diff --git a/io_u.h b/io_u.h
>> index 87c2920..f5b24fd 100644
>> --- a/io_u.h
>> +++ b/io_u.h
>> @@ -94,19 +94,25 @@ struct io_u {
>>  	};
>>
>>  	/*
>> +	 * for zone append this is the start offset of the zone.
>> +	 */
>> +	unsigned long long zone_start_offset;
>
>This is not necessary. This value can be calculated from any io_u offset.
>
This is a great suggestion. Will implement the code changes
accordingly.

>> +
>> +	/*
>>  	 * ZBD mode zbd_queue_io callback: called after engine->queue operation
>>  	 * to advance a zone write pointer and eventually unlock the I/O zone.
>>  	 * @q indicates the I/O queue status (busy, queued or completed).
>>  	 * @success == true means that the I/O operation has been queued or
>>  	 * completed successfully.
>>  	 */
>> -	void (*zbd_queue_io)(struct io_u *, int q, bool success);
>> +	void (*zbd_queue_io)(struct thread_data *, struct io_u *, int q,
>> +			     bool success);
>>
>>  	/*
>>  	 * ZBD mode zbd_put_io callback: called in after completion of an I/O
>>  	 * or commit of an async I/O to unlock the I/O target zone.
>>  	 */
>> -	void (*zbd_put_io)(const struct io_u *);
>> +	void (*zbd_put_io)(struct thread_data *, const struct io_u *);
>>
>>  	/*
>>  	 * Callback for io completion
>> diff --git a/ioengines.c b/ioengines.c
>> index 2c7a0df..81ac846 100644
>> --- a/ioengines.c
>> +++ b/ioengines.c
>> @@ -328,7 +328,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>>  	}
>>
>>  	ret = td->io_ops->queue(td, io_u);
>> -	zbd_queue_io_u(io_u, ret);
>> +	zbd_queue_io_u(td, io_u, ret);
>>
>>  	unlock_file(td, io_u->file);
>>
>> @@ -370,7 +370,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
>>  	if (!td->io_ops->commit) {
>>  		io_u_mark_submit(td, 1);
>>  		io_u_mark_complete(td, 1);
>> -		zbd_put_io_u(io_u);
>> +		zbd_put_io_u(td, io_u);
>>  	}
>>
>>  	if (ret == FIO_Q_COMPLETED) {
>> diff --git a/options.c b/options.c
>> index 85a0f49..d54da81 100644
>> --- a/options.c
>> +++ b/options.c
>> @@ -3317,6 +3317,16 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
>>  		},
>>  	},
>>  	{
>> +		.name	= "zone_append",
>> +		.lname	= "zone_append",
>> +		.type	= FIO_OPT_BOOL,
>> +		.off1	= offsetof(struct thread_options, zone_append),
>> +		.help	= "Use Zone Append for Zone block device",
>
>"Use zone append for writing zones of a zoned block device"
>
>may be better as it points out that this has an effect on writes only.
>
Sure. Will update the help message.

>> +		.def	= "0",
>> +		.category = FIO_OPT_C_IO,
>> +		.group	= FIO_OPT_G_ZONE,
>> +	},
>> +	{
>>  		.name	= "zonesize",
>>  		.lname	= "Zone size",
>>  		.type	= FIO_OPT_STR_VAL,
>> diff --git a/thread_options.h b/thread_options.h
>> index 968ea0a..45c5ef8 100644
>> --- a/thread_options.h
>> +++ b/thread_options.h
>> @@ -195,6 +195,7 @@ struct thread_options {
>>  	unsigned long long zone_size;
>>  	unsigned long long zone_skip;
>>  	enum fio_zone_mode zone_mode;
>> +	unsigned int zone_append;
>>  	unsigned long long lockmem;
>>  	enum fio_memtype mem_type;
>>  	unsigned int mem_align;
>> @@ -631,6 +632,7 @@ struct thread_options_pack {
>>  	uint32_t allow_mounted_write;
>>
>>  	uint32_t zone_mode;
>> +	uint32_t zone_append;
>>  } __attribute__((packed));
>>
>>  extern void convert_thread_options_to_cpu(struct thread_options *o, struct thread_options_pack *top);
>> diff --git a/zbd.c b/zbd.c
>> index 8cf8f81..ffdb766 100644
>> --- a/zbd.c
>> +++ b/zbd.c
>> @@ -455,6 +455,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>>  		for (i = 0; i < nrz; i++, j++, z++, p++) {
>>  			mutex_init_pshared_with_type(&p->mutex,
>>  						     PTHREAD_MUTEX_RECURSIVE);
>> +			cond_init_pshared(&p->reset_cond);
>
>It would be nice if the commit message explained the changes in this patch and
>explained the role of this condition variable. It is not clear to me at all.
>
As part of this initial version, we plan to introduce Zone Append as an
extension of Write. This will remove all the zbd.c and zbd.h related
code changes (pending_ios, reset_cond).

>>  			p->start = z->start;
>>  			switch (z->cond) {
>>  			case ZBD_ZONE_COND_NOT_WP:
>> @@ -469,6 +470,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>>  			}
>>  			p->type = z->type;
>>  			p->cond = z->cond;
>> +			p->pending_ios = 0;
>
>Same here. This is not explained so one has to figure it out from the code only.
>Not easy to do as what I think is needed may not be what *you* thought in the
>first place.
>
Will be removed as part of the initial version.

>>  			if (j > 0 && p->start != p[-1].start + zone_size) {
>>  				log_info("%s: invalid zone data\n",
>>  					 f->file_name);
>> @@ -1196,20 +1198,24 @@ zbd_find_zone(struct thread_data *td, struct io_u *io_u,
>>
>>  /**
>>   * zbd_queue_io - update the write pointer of a sequential zone
>> + * @td: fio thread data.
>>   * @io_u: I/O unit
>>   * @success: Whether or not the I/O unit has been queued successfully
>>   * @q: queueing status (busy, completed or queued).
>>   *
>>   * For write and trim operations, update the write pointer of the I/O unit
>>   * target zone.
>> + * For zone append operation, release the zone mutex
>>   */
>> -static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>> +static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
>> +			 bool success)
>>  {
>>  	const struct fio_file *f = io_u->file;
>>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
>>  	struct fio_zone_info *z;
>>  	uint32_t zone_idx;
>>  	uint64_t zone_end;
>> +	int ret;
>>
>>  	if (!zbd_info)
>>  		return;
>> @@ -1241,6 +1247,8 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>>  			zbd_info->sectors_with_data += zone_end - z->wp;
>>  		pthread_mutex_unlock(&zbd_info->mutex);
>>  		z->wp = zone_end;
>> +		if (td->o.zone_append)
>> +			z->pending_ios++;
>
>The name is not very good. Shouldn't this be queued_ios ? And since fio
>differentiates queued and in-flight, but this counter does not and is for zone
>append commands only, it may be even better to call this something like
>in_flight_za_ios or something like that to clarify the use and avoid confusion.
>
Will be removed as part of the initial version. Naming of the variable
will be done properly to avoid confusion in the later commits.

>>  		break;
>>  	case DDIR_TRIM:
>>  		assert(z->wp == z->start);
>> @@ -1250,18 +1258,22 @@ static void zbd_queue_io(struct io_u *io_u, int q, bool success)
>>  	}
>>
>>  unlock:
>> -	if (!success || q != FIO_Q_QUEUED) {
>> +	if (!success || q != FIO_Q_QUEUED || td->o.zone_append) {
>>  		/* BUSY or COMPLETED: unlock the zone */
>> -		pthread_mutex_unlock(&z->mutex);
>> -		io_u->zbd_put_io = NULL;
>> +		ret = pthread_mutex_unlock(&z->mutex);
>> +		assert(ret == 0);
>> +		if (!success || q != FIO_Q_QUEUED)
>> +			io_u->zbd_put_io = NULL;
>>  	}
>>  }
>>
>>  /**
>>   * zbd_put_io - Unlock an I/O unit target zone lock
>> + * For zone append operation we don't hold zone lock
>> + * @td: fio thread data.
>>   * @io_u: I/O unit
>>   */
>> -static void zbd_put_io(const struct io_u *io_u)
>> +static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
>>  {
>>  	const struct fio_file *f = io_u->file;
>>  	struct zoned_block_device_info *zbd_info = f->zbd_info;
>> @@ -1283,6 +1295,19 @@ static void zbd_put_io(const struct io_u *io_u)
>>  	       "%s: terminate I/O (%lld, %llu) for zone %u\n",
>>  	       f->file_name, io_u->offset, io_u->buflen, zone_idx);
>>
>> +	if (td->o.zone_append) {
>> +		pthread_mutex_lock(&z->mutex);
>> +		if (z->pending_ios > 0) {
>> +			z->pending_ios--;
>> +			/*
>> +			 * Other threads may be waiting for pending I/O's to
>> +			 * complete for this zone. Notify them.
>> +			 */
>
>Please move this comment inside the if, or even better, at the beginning of this
>code block. And also explain why the zone lock needs to be taken here for zone
>append only.
Will be removed as part of the initial version. Sure, we will explain the
need for the zone lock at the beginning of the code block in future
commits.

>> +			if (!z->pending_ios)
>> +				pthread_cond_broadcast(&z->reset_cond);
>> +		}
>> +	}
>> +
>>  	ret = pthread_mutex_unlock(&z->mutex);
>>  	assert(ret == 0);
>>  	zbd_check_swd(f);
>> @@ -1524,16 +1549,69 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>>  			 * asynchronously and since we will submit the zone
>>  			 * reset synchronously, wait until previously submitted
>>  			 * write requests have completed before issuing a
>> -			 * zone reset.
>> +			 * zone reset. For append request release the zone lock
>> +			 * as other threads will acquire it at the time of
>> +			 * zbd_put_io.
>>  			 */
>> +reset:
>> +			if (td->o.zone_append)
>> +				pthread_mutex_unlock(&zb->mutex);
>>  			io_u_quiesce(td);
>> +			if (td->o.zone_append)
>> +				pthread_mutex_lock(&zb->mutex);
>> +
>>  			zb->reset_zone = 0;
>> +			if (td->o.zone_append) {
>> +				/*
>> +				 * While processing the current thread queued
>> +				 * requests the other thread may have already
>> +				 * done zone reset so need to check zone full
>> +				 * condition again.
>> +				 */
>> +				if (!zbd_zone_full(f, zb, min_bs))
>> +					goto proceed;
>> +				/*
>> +				 * Wait for the pending requests to be completed
>> +				 * else we are ok to reset this zone.
>> +				 */
>> +				if (zb->pending_ios) {
>> +					pthread_cond_wait(&zb->reset_cond, &zb->mutex);
>> +					goto proceed;
>
>If you are OK to reset the zone after the cond wait, why the jump to proceed ?
>That will skip the reset...
>
Will be removed as part of the initial version.

>> +				}
>> +			}
>> +
>>  			if (zbd_reset_zone(td, f, zb) < 0)
>>  				goto eof;
>> +
>> +			/* Notify other threads waiting for zone mutex */
>> +			if (td->o.zone_append)
>> +				pthread_cond_broadcast(&zb->reset_cond);
>
>Why ? isn't this cond variable used to signal pending_ios going to 0 ?
>
Will be removed as part of the initial version.

>> +		}
>> +proceed:
>> +		/*
>> +		 * Check for zone full condition again. For zone append request
>> +		 * the zone may already be reset, written and full while we
>> +		 * were waiting for our turn.
>> +		 */
>> +		if (zbd_zone_full(f, zb, min_bs)) {
>> +			goto reset;
>>  		}
>
>No need for the curly braces. But the main problem here is that a job is not
>guaranteed to ever be able to issue a zone append. A job may end up looping
>indefinitely between here and the reset label.
>
>This does not look necessary at all to me. The zone is locked, and it was already
>determined that a reset is not needed, or the zone was already reset a few lines
>above after waiting for all pending IOs, all of that with the zone locked. So
>how can it become full again at this point?
>
Will be removed as part of the initial version, as we are planning to
introduce Zone Append as an extension of Zone Write.

>> +
>>  		/* Make writes occur at the write pointer */
>>  		assert(!zbd_zone_full(f, zb, min_bs));
>>  		io_u->offset = zb->wp;
>> +
>> +		/*
>> +		 * Support zone append for both regular and zoned block
>> +		 * device.
>> +		 */
>> +		if (td->o.zone_append) {
>> +			if (f->zbd_info->model == ZBD_NONE)
>> +				io_u->zone_start_offset = zb->wp;
>> +			else
>> +				io_u->zone_start_offset = zb->start;
>> +		}
>
>This hunk is not needed. zone start offset is:
>
>io_u->offset & ~(zone_size - 1)
>
>with zone_size being either td->o.zone_size or f->zbd_info->zone_size. So you
>can completely drop this zone_start_offset field.
>
Thank you for this great suggestion. zone_start_offset will be removed
in our next version of the patches.
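Concretely, the field could be replaced by a small helper along these
lines (helper name is ours; the mask trick assumes a power-of-two zone
size):

	static inline uint64_t zone_start_offset(const struct thread_data *td,
						 const struct fio_file *f,
						 uint64_t offset)
	{
		uint64_t zone_size = f->zbd_info ? f->zbd_info->zone_size
						 : td->o.zone_size;

		return offset & ~(zone_size - 1);
	}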

>> +
>>  		if (!is_valid_offset(f, io_u->offset)) {
>>  			dprint(FD_ZBD, "Dropped request with offset %llu\n",
>>  			       io_u->offset);
>> diff --git a/zbd.h b/zbd.h
>> index e942a7f..eac42f7 100644
>> --- a/zbd.h
>> +++ b/zbd.h
>> @@ -23,8 +23,10 @@ enum io_u_action {
>>   * struct fio_zone_info - information about a single ZBD zone
>>   * @start: zone start location (bytes)
>>   * @wp: zone write pointer location (bytes)
>> + * @pending_ios: Number of IO's pending in this zone
>
>Misleading. This is the number of zone append IOs. Only zone append. You are
>not counting reads or writes with this.
>
Will be removed as part of the initial version.

>>   * @verify_block: number of blocks that have been verified for this zone
>>   * @mutex: protects the modifiable members in this structure
>> + * @reset_cond: zone reset check condition. only relevant for zone_append.
>
>You are using it to signal pending_ios going to 0 too.
>
Will be removed as part of the initial version.

>>   * @type: zone type (BLK_ZONE_TYPE_*)
>>   * @cond: zone state (BLK_ZONE_COND_*)
>>   * @open: whether or not this zone is currently open. Only relevant if
>> @@ -33,8 +35,10 @@ enum io_u_action {
>>   */
>>  struct fio_zone_info {
>>  	pthread_mutex_t		mutex;
>> +	pthread_cond_t		reset_cond;
>>  	uint64_t		start;
>>  	uint64_t		wp;
>> +	uint32_t		pending_ios;
>>  	uint32_t		verify_block;
>>  	enum zbd_zone_type	type:2;
>>  	enum zbd_zone_cond	cond:4;
>> @@ -96,18 +100,19 @@ static inline void zbd_close_file(struct fio_file *f)
>>  		zbd_free_zone_info(f);
>>  }
>>
>> -static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
>> +static inline void zbd_queue_io_u(struct thread_data *td, struct io_u *io_u,
>> +				  enum fio_q_status status)
>>  {
>>  	if (io_u->zbd_queue_io) {
>> -		io_u->zbd_queue_io(io_u, status, io_u->error == 0);
>> +		io_u->zbd_queue_io(td, io_u, status, io_u->error == 0);
>>  		io_u->zbd_queue_io = NULL;
>>  	}
>>  }
>>
>> -static inline void zbd_put_io_u(struct io_u *io_u)
>> +static inline void zbd_put_io_u(struct thread_data *td, struct io_u *io_u)
>>  {
>>  	if (io_u->zbd_put_io) {
>> -		io_u->zbd_put_io(io_u);
>> +		io_u->zbd_put_io(td, io_u);
>>  		io_u->zbd_queue_io = NULL;
>>  		io_u->zbd_put_io = NULL;
>>  	}
>>
>
>So a few fixes are needed. However, my biggest concern is the change to zone
>locking just for zone append... I kind of like the idea of not holding the zone
>lock for the entire duration of IOs, and the condition variable with an IO counter
>seems to be a reasonable way to do that. Have you tried extending this locking
>change to all IOs (read and write)? A prep patch could do that and zone append
>support can come on top. I am however worried that all of this may not be that
>easy due to the various inspection loops for choosing a zone to read or write,
>finding an open zone, or finding a zone to close & open. This needs checking, or
>at least explanations in the commit message that none of it is affected by
>your change.
>
No, we haven't tried extending this locking change to all IOs.
In case of multiple jobs, Zone Writes will have ordering issues.
If the first job releases the lock after queuing a Write request,
other jobs can acquire the lock. The other jobs can then prepare, queue
and commit their Write requests even before the first job has committed
its Write request.
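To illustrate the hazard with a hypothetical interleaving (not a trace
from an actual run):

	Job A                           Job B
	lock(zone)
	prepare + queue write @ wp
	unlock(zone)
	                                lock(zone)
	                                prepare + queue write @ wp'
	                                commit -> reaches the device first
	                                unlock(zone)
	commit -> arrives after B's write, out of write-pointer order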
We will definitely try to address your comments in future patches.
Thank you for your suggestions.

>-- 
>Damien Le Moal
>Western Digital Research
>




^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-07-03 17:17 UTC | newest]

Thread overview: 17+ messages
     [not found] <CGME20200625174119epcas5p122a2197102fe336aa35fdea1273fd1b0@epcas5p1.samsung.com>
2020-06-25 17:38 ` [PATCH 0/4] v2 Patchset : Zone Append command support for NVMe Zoned Namespaces (ZNS) Krishna Kanth Reddy
     [not found]   ` <CGME20200625174124epcas5p18e4fbc502c9cf1fef7e84ba5cefba945@epcas5p1.samsung.com>
2020-06-25 17:38     ` [PATCH 1/4] Add " Krishna Kanth Reddy
2020-06-26  5:33       ` Damien Le Moal
2020-07-03 17:17         ` Krishna Kanth Reddy
2020-06-26  5:50       ` Damien Le Moal
2020-07-03 16:50         ` Krishna Kanth Reddy
     [not found]   ` <CGME20200625174129epcas5p304bf58bb381b4b0c39e0ff91b50a23a9@epcas5p3.samsung.com>
2020-06-25 17:38     ` [PATCH 2/4] libaio: support for Zone Append command Krishna Kanth Reddy
2020-06-26  5:38       ` Damien Le Moal
2020-07-03 10:47         ` Krishna Kanth Reddy
     [not found]   ` <CGME20200625174131epcas5p36cf7cd413dcb698f117474df71e5648b@epcas5p3.samsung.com>
2020-06-25 17:38     ` [PATCH 3/4] iouring: " Krishna Kanth Reddy
2020-06-26  5:43       ` Damien Le Moal
2020-07-03 10:37         ` Krishna Kanth Reddy
     [not found]   ` <CGME20200625174133epcas5p1eace8f03319bee805b93c50fe6c690c7@epcas5p1.samsung.com>
2020-06-25 17:38     ` [PATCH 4/4] t/zbd: Add support to verify Zone Append command with libaio, io_uring IO engine tests Krishna Kanth Reddy
2020-06-25 18:44       ` Dmitry Fomichev
2020-07-03  8:46         ` Krishna Kanth Reddy
2020-06-26  5:45       ` Damien Le Moal
2020-07-03  9:09         ` Krishna Kanth Reddy
