fio.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues
@ 2023-02-07  6:37 Shin'ichiro Kawasaki
  2023-02-07  6:37 ` [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting Shin'ichiro Kawasaki
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

When zonemode=zbd is specified, fio does 'sectors with data' accounting to
record the total number of sectors that have been written on a zoned block
device. This accounting has issues as follows:

Issue 1)
   The name 'sectors with data' indicates that the accounting uses 'sector' as
   the unit. However, it is implemented using 'byte' as the unit.

Issue 2)
   The accounting does not work correctly when multi-jobs with different IO
   ranges due to two reasons. One reason is the accounting field initialization.
   The accounting field is shared across all jobs, but it is initialized
   repeatedly by all jobs. This results in wrong behaviors except the last job.
   The second reason is definition difference between the zone_reset_threshold
   option and the accounting. The option is defined as the ratio of valid
   written data to the IO range of each job. However, the accounting is done
   per device and shared across all jobs. This coverage gap between each job's
   IO range and the accounting range causes unexpected zone reset.

Issue 3)
   Counting the total number of written sectors requires taking the zone lock of
   all zones in a job IO range. For a multi-job workload with overlapping IO
   ranges, this often leads to significant zone lock contention, resulting in
   some jobs starting IOs only after other jobs have completed their work
   (which looks like an apparent deadlock on startup).

This series addresses the issues with solutions as follows:

For Issue 1)
   Rephrase variables, comments and man pages to indicate that the accounting
   unit is not sector. -> 3rd and 4th patches

For Issue 2)
   Limit the condition of the accounting to ensure correct accounting. Do the
   accounting only when the zone_reset_threshold option is specified and all
   write jobs have the same IO range. Initialize the accounting field only once
   for the 1st job. -> 5th and 6th patches

For Issue 3)
   Move the total valid bytes counting code from "file reset" after job start
   to "file set up" before job start. This allows to count without zone locks,
   then avoids the lock contention. -> 7th patch

The first two patches are preparation patches to reduce references to the
'sectors with data' accounting field. The last 8th patch adds test cases for the
zone_reset_threshold option.

Changes from v1:
* Reworked not to change the definition of the zone_reset_threshold option
* Separated the patch to remove CHECK_SWD (or CHECK_VDB) to clarify the removal

Shin'ichiro Kawasaki (8):
  zbd: refer file->last_start[] instead of sectors with data accounting
  zbd: remove CHECK_SWD feature
  zbd: rename the accounting 'sectors with data' to 'valid data bytes'
  doc: fix unit of zone_reset_threshold and relation to other option
  zbd: account valid data bytes only for zone_reset_threshold option
  zbd: check write ranges for zone_reset_threshold option
  zbd: initialize valid data bytes accounting at file set up
  t/zbd: add test cases for zone_reset_threshold option

 HOWTO.rst              |   8 ++-
 fio.1                  |   8 ++-
 t/zbd/test-zbd-support |  60 +++++++++++++++++-
 zbd.c                  | 139 ++++++++++++++++++-----------------------
 zbd.h                  |  11 ++--
 5 files changed, 135 insertions(+), 91 deletions(-)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:05   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 2/8] zbd: remove CHECK_SWD feature Shin'ichiro Kawasaki
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

To decide the first IO direction of randrw workload, the function
zbd_adjust_ddir() refers to the zbd_info->sectors_with_data value which
indicates the number of bytes written to the zoned block devices being
accessed. However, this accounting has two issues. The first issue is
wrong accounting for multiple jobs with different write ranges. The
second issue is job start up failure due to zone lock contention.

Avoid using zbd_info->sectors_with_data and simply refer to file->
last_start[DDIR_WRITE] instead. It is initialized with -1ULL for each
job. After any write operation is done by the job, it keeps valid
offset. If it has valid offset, written data is expected and the first
IO direction can be read.

Also remove zbd_info->sectors_with_data, which is no longer used. Keep
the field zbd_info->wp_sectors_with_data since it is still used for
zones with write pointers.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 14 +++-----------
 zbd.h |  2 --
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/zbd.c b/zbd.c
index d1e469f6..f5e76c40 100644
--- a/zbd.c
+++ b/zbd.c
@@ -286,7 +286,6 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
 	}
 
 	pthread_mutex_lock(&f->zbd_info->mutex);
-	f->zbd_info->sectors_with_data -= data_in_zone;
 	f->zbd_info->wp_sectors_with_data -= data_in_zone;
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 
@@ -1201,7 +1200,6 @@ static uint64_t zbd_process_swd(struct thread_data *td,
 				const struct fio_file *f, enum swd_action a)
 {
 	struct fio_zone_info *zb, *ze, *z;
-	uint64_t swd = 0;
 	uint64_t wp_swd = 0;
 
 	zb = zbd_get_zone(f, f->min_zone);
@@ -1211,17 +1209,14 @@ static uint64_t zbd_process_swd(struct thread_data *td,
 			zone_lock(td, f, z);
 			wp_swd += z->wp - z->start;
 		}
-		swd += z->wp - z->start;
 	}
 
 	pthread_mutex_lock(&f->zbd_info->mutex);
 	switch (a) {
 	case CHECK_SWD:
-		assert(f->zbd_info->sectors_with_data == swd);
 		assert(f->zbd_info->wp_sectors_with_data == wp_swd);
 		break;
 	case SET_SWD:
-		f->zbd_info->sectors_with_data = swd;
 		f->zbd_info->wp_sectors_with_data = wp_swd;
 		break;
 	}
@@ -1231,7 +1226,7 @@ static uint64_t zbd_process_swd(struct thread_data *td,
 		if (z->has_wp)
 			zone_unlock(z);
 
-	return swd;
+	return wp_swd;
 }
 
 /*
@@ -1640,10 +1635,8 @@ static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
 		 * have occurred.
 		 */
 		pthread_mutex_lock(&zbd_info->mutex);
-		if (z->wp <= zone_end) {
-			zbd_info->sectors_with_data += zone_end - z->wp;
+		if (z->wp <= zone_end)
 			zbd_info->wp_sectors_with_data += zone_end - z->wp;
-		}
 		pthread_mutex_unlock(&zbd_info->mutex);
 		z->wp = zone_end;
 		break;
@@ -1801,8 +1794,7 @@ enum fio_ddir zbd_adjust_ddir(struct thread_data *td, struct io_u *io_u,
 	if (ddir != DDIR_READ || !td_rw(td))
 		return ddir;
 
-	if (io_u->file->zbd_info->sectors_with_data ||
-	    td->o.read_beyond_wp)
+	if (io_u->file->last_start[DDIR_WRITE] != -1ULL || td->o.read_beyond_wp)
 		return DDIR_READ;
 
 	return DDIR_WRITE;
diff --git a/zbd.h b/zbd.h
index d425707e..9ab25c47 100644
--- a/zbd.h
+++ b/zbd.h
@@ -54,7 +54,6 @@ struct fio_zone_info {
  * @mutex: Protects the modifiable members in this structure (refcount and
  *		num_open_zones).
  * @zone_size: size of a single zone in bytes.
- * @sectors_with_data: total size of data in all zones in units of 512 bytes
  * @wp_sectors_with_data: total size of data in zones with write pointers in
  *                        units of 512 bytes
  * @zone_size_log2: log2 of the zone size in bytes if it is a power of 2 or 0
@@ -76,7 +75,6 @@ struct zoned_block_device_info {
 	uint32_t		max_open_zones;
 	pthread_mutex_t		mutex;
 	uint64_t		zone_size;
-	uint64_t		sectors_with_data;
 	uint64_t		wp_sectors_with_data;
 	uint32_t		zone_size_log2;
 	uint32_t		nr_zones;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/8] zbd: remove CHECK_SWD feature
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
  2023-02-07  6:37 ` [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:05   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 3/8] zbd: rename the accounting 'sectors with data' to 'valid data bytes' Shin'ichiro Kawasaki
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The 'sectors with data' accounting had been used for CHECK_SWD debug
feature. It compared expected written data size and actually written
data size for zonemode=zbd. However, this feature has been disabled for
a while and not actively used. Also, the sector with data accounting has
two issues. The first issue is wrong accounting for multiple jobs with
different write ranges. The second issue is job start up failure due to
zone lock contention.

Avoid using the accounting by removing the CHECK_SWD feature and related
code. Also rename the function zbd_process_swd() to zbd_set_swd() to
clarify that it no longer works for CHECK_SWD.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/zbd.c b/zbd.c
index f5e76c40..b6cf2a93 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1190,14 +1190,7 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
 	return write_cnt == 0;
 }
 
-enum swd_action {
-	CHECK_SWD,
-	SET_SWD,
-};
-
-/* Calculate the number of sectors with data (swd) and perform action 'a' */
-static uint64_t zbd_process_swd(struct thread_data *td,
-				const struct fio_file *f, enum swd_action a)
+static uint64_t zbd_set_swd(struct thread_data *td, const struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze, *z;
 	uint64_t wp_swd = 0;
@@ -1212,14 +1205,7 @@ static uint64_t zbd_process_swd(struct thread_data *td,
 	}
 
 	pthread_mutex_lock(&f->zbd_info->mutex);
-	switch (a) {
-	case CHECK_SWD:
-		assert(f->zbd_info->wp_sectors_with_data == wp_swd);
-		break;
-	case SET_SWD:
-		f->zbd_info->wp_sectors_with_data = wp_swd;
-		break;
-	}
+	f->zbd_info->wp_sectors_with_data = wp_swd;
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 
 	for (z = zb; z < ze; z++)
@@ -1229,21 +1215,6 @@ static uint64_t zbd_process_swd(struct thread_data *td,
 	return wp_swd;
 }
 
-/*
- * The swd check is useful for debugging but takes too much time to leave
- * it enabled all the time. Hence it is disabled by default.
- */
-static const bool enable_check_swd = false;
-
-/* Check whether the values of zbd_info.*sectors_with_data are correct. */
-static void zbd_check_swd(struct thread_data *td, const struct fio_file *f)
-{
-	if (!enable_check_swd)
-		return;
-
-	zbd_process_swd(td, f, CHECK_SWD);
-}
-
 void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze;
@@ -1255,7 +1226,7 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 
 	zb = zbd_get_zone(f, f->min_zone);
 	ze = zbd_get_zone(f, f->max_zone);
-	swd = zbd_process_swd(td, f, SET_SWD);
+	swd = zbd_set_swd(td, f);
 
 	dprint(FD_ZBD, "%s(%s): swd = %" PRIu64 "\n",
 	       __func__, f->file_name, swd);
@@ -1677,7 +1648,6 @@ static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
 	zbd_end_zone_io(td, io_u, z);
 
 	zone_unlock(z);
-	zbd_check_swd(td, f);
 }
 
 /*
@@ -1866,8 +1836,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
 	    io_u->ddir == DDIR_READ && td->o.read_beyond_wp)
 		return io_u_accept;
 
-	zbd_check_swd(td, f);
-
 	zone_lock(td, f, zb);
 
 	switch (io_u->ddir) {
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/8] zbd: rename the accounting 'sectors with data' to 'valid data bytes'
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
  2023-02-07  6:37 ` [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting Shin'ichiro Kawasaki
  2023-02-07  6:37 ` [PATCH v2 2/8] zbd: remove CHECK_SWD feature Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:06   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 4/8] doc: fix unit of zone_reset_threshold and relation to other option Shin'ichiro Kawasaki
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The 'sectors with data' accounting was designed to have 'sector' as its
unit. Then related variables have the word 'sector' in their names. Also
related code comments use the words 'sector' or 'logical blocks'.
However, actually it was implemented to have 'byte' as unit. Rename
related variables and comments to indicate the byte unit. Also replace
the abbreviation swd to vdb.

Fixes: a7c2b6fc2959 ("Add support for resetting zones periodically")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 t/zbd/test-zbd-support |  4 ++--
 zbd.c                  | 25 +++++++++++++------------
 zbd.h                  |  5 ++---
 3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index 4091d9ac..c32953c4 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -1110,8 +1110,8 @@ test51() {
 	run_fio "${opts[@]}" >> "${logfile}.${test_number}" 2>&1 || return $?
 }
 
-# Verify that zone_reset_threshold only takes logical blocks from seq
-# zones into account, and logical blocks of conv zones are not counted.
+# Verify that zone_reset_threshold only accounts written bytes in seq
+# zones, and written data bytes of conv zones are not counted.
 test52() {
 	local off io_size
 
diff --git a/zbd.c b/zbd.c
index b6cf2a93..455dad53 100644
--- a/zbd.c
+++ b/zbd.c
@@ -286,7 +286,7 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
 	}
 
 	pthread_mutex_lock(&f->zbd_info->mutex);
-	f->zbd_info->wp_sectors_with_data -= data_in_zone;
+	f->zbd_info->wp_valid_data_bytes -= data_in_zone;
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 
 	z->wp = z->start;
@@ -1190,35 +1190,35 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
 	return write_cnt == 0;
 }
 
-static uint64_t zbd_set_swd(struct thread_data *td, const struct fio_file *f)
+static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze, *z;
-	uint64_t wp_swd = 0;
+	uint64_t wp_vdb = 0;
 
 	zb = zbd_get_zone(f, f->min_zone);
 	ze = zbd_get_zone(f, f->max_zone);
 	for (z = zb; z < ze; z++) {
 		if (z->has_wp) {
 			zone_lock(td, f, z);
-			wp_swd += z->wp - z->start;
+			wp_vdb += z->wp - z->start;
 		}
 	}
 
 	pthread_mutex_lock(&f->zbd_info->mutex);
-	f->zbd_info->wp_sectors_with_data = wp_swd;
+	f->zbd_info->wp_valid_data_bytes = wp_vdb;
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 
 	for (z = zb; z < ze; z++)
 		if (z->has_wp)
 			zone_unlock(z);
 
-	return wp_swd;
+	return wp_vdb;
 }
 
 void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze;
-	uint64_t swd;
+	uint64_t vdb;
 	bool verify_data_left = false;
 
 	if (!f->zbd_info || !td_write(td))
@@ -1226,10 +1226,10 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 
 	zb = zbd_get_zone(f, f->min_zone);
 	ze = zbd_get_zone(f, f->max_zone);
-	swd = zbd_set_swd(td, f);
+	vdb = zbd_set_vdb(td, f);
 
-	dprint(FD_ZBD, "%s(%s): swd = %" PRIu64 "\n",
-	       __func__, f->file_name, swd);
+	dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
+	       __func__, f->file_name, vdb);
 
 	/*
 	 * If data verification is enabled reset the affected zones before
@@ -1607,7 +1607,7 @@ static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
 		 */
 		pthread_mutex_lock(&zbd_info->mutex);
 		if (z->wp <= zone_end)
-			zbd_info->wp_sectors_with_data += zone_end - z->wp;
+			zbd_info->wp_valid_data_bytes += zone_end - z->wp;
 		pthread_mutex_unlock(&zbd_info->mutex);
 		z->wp = zone_end;
 		break;
@@ -1960,7 +1960,8 @@ retry:
 
 		/* Check whether the zone reset threshold has been exceeded */
 		if (td->o.zrf.u.f) {
-			if (zbdi->wp_sectors_with_data >= f->io_size * td->o.zrt.u.f &&
+			if (zbdi->wp_valid_data_bytes >=
+			    f->io_size * td->o.zrt.u.f &&
 			    zbd_dec_and_reset_write_cnt(td, f))
 				zb->reset_zone = 1;
 		}
diff --git a/zbd.h b/zbd.h
index 9ab25c47..20b2fe17 100644
--- a/zbd.h
+++ b/zbd.h
@@ -54,8 +54,7 @@ struct fio_zone_info {
  * @mutex: Protects the modifiable members in this structure (refcount and
  *		num_open_zones).
  * @zone_size: size of a single zone in bytes.
- * @wp_sectors_with_data: total size of data in zones with write pointers in
- *                        units of 512 bytes
+ * @wp_valid_data_bytes: total size of data in zones with write pointers
  * @zone_size_log2: log2 of the zone size in bytes if it is a power of 2 or 0
  *		if the zone size is not a power of 2.
  * @nr_zones: number of zones
@@ -75,7 +74,7 @@ struct zoned_block_device_info {
 	uint32_t		max_open_zones;
 	pthread_mutex_t		mutex;
 	uint64_t		zone_size;
-	uint64_t		wp_sectors_with_data;
+	uint64_t		wp_valid_data_bytes;
 	uint32_t		zone_size_log2;
 	uint32_t		nr_zones;
 	uint32_t		refcount;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 4/8] doc: fix unit of zone_reset_threshold and relation to other option
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
                   ` (2 preceding siblings ...)
  2023-02-07  6:37 ` [PATCH v2 3/8] zbd: rename the accounting 'sectors with data' to 'valid data bytes' Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:06   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 5/8] zbd: account valid data bytes only for zone_reset_threshold option Shin'ichiro Kawasaki
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The zone_reset_threshold option uses the 'sectors with data' accounting
then it was described to have 'logical block' as its unit. However, the
accounting was implement with 'byte' unit. Fix the description of the
option.

Also, the zone_reset_threshold option works together with the
zone_reset_frequency option. Describe this relation also.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 HOWTO.rst | 7 ++++---
 fio.1     | 7 ++++---
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/HOWTO.rst b/HOWTO.rst
index 17caaf5d..d08f8a18 100644
--- a/HOWTO.rst
+++ b/HOWTO.rst
@@ -1085,9 +1085,10 @@ Target file/device
 
 .. option:: zone_reset_threshold=float
 
-	A number between zero and one that indicates the ratio of logical
-	blocks with data to the total number of logical blocks in the test
-	above which zones should be reset periodically.
+	A number between zero and one that indicates the ratio of written bytes
+	to the total size of the zones with write pointers in the IO range. When
+	current ratio is above this ratio, zones are reset periodically as
+	:option:`zone_reset_frequency` specifies.
 
 .. option:: zone_reset_frequency=float
 
diff --git a/fio.1 b/fio.1
index 527b3d46..54d2c403 100644
--- a/fio.1
+++ b/fio.1
@@ -854,9 +854,10 @@ of the zoned block device in use, thus allowing the option \fBmax_open_zones\fR
 value to be larger than the device reported limit. Default: false.
 .TP
 .BI zone_reset_threshold \fR=\fPfloat
-A number between zero and one that indicates the ratio of logical blocks with
-data to the total number of logical blocks in the test above which zones
-should be reset periodically.
+A number between zero and one that indicates the ratio of written bytes to the
+total size of the zones with write pointers in the IO range. When current ratio
+is above this ratio, zones are reset periodically as \fBzone_reset_frequency\fR
+specifies.
 .TP
 .BI zone_reset_frequency \fR=\fPfloat
 A number between zero and one that indicates how often a zone reset should be
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 5/8] zbd: account valid data bytes only for zone_reset_threshold option
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
                   ` (3 preceding siblings ...)
  2023-02-07  6:37 ` [PATCH v2 4/8] doc: fix unit of zone_reset_threshold and relation to other option Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:06   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 6/8] zbd: check write ranges " Shin'ichiro Kawasaki
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The valid data bytes accounting is used only for zone_reset_threshold
option. Avoid the accounting when the option is not specified.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/zbd.c b/zbd.c
index 455dad53..6783acf9 100644
--- a/zbd.c
+++ b/zbd.c
@@ -147,6 +147,11 @@ zbd_offset_to_zone(const struct fio_file *f,  uint64_t offset)
 	return zbd_get_zone(f, zbd_offset_to_zone_idx(f, offset));
 }
 
+static bool accounting_vdb(struct thread_data *td, const struct fio_file *f)
+{
+	return td->o.zrt.u.f && td_write(td);
+}
+
 /**
  * zbd_get_zoned_model - Get a device zoned model
  * @td: FIO thread data
@@ -285,9 +290,11 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
 		break;
 	}
 
-	pthread_mutex_lock(&f->zbd_info->mutex);
-	f->zbd_info->wp_valid_data_bytes -= data_in_zone;
-	pthread_mutex_unlock(&f->zbd_info->mutex);
+	if (accounting_vdb(td, f)) {
+		pthread_mutex_lock(&f->zbd_info->mutex);
+		f->zbd_info->wp_valid_data_bytes -= data_in_zone;
+		pthread_mutex_unlock(&f->zbd_info->mutex);
+	}
 
 	z->wp = z->start;
 
@@ -1195,6 +1202,9 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
 	struct fio_zone_info *zb, *ze, *z;
 	uint64_t wp_vdb = 0;
 
+	if (!accounting_vdb(td, f))
+		return 0;
+
 	zb = zbd_get_zone(f, f->min_zone);
 	ze = zbd_get_zone(f, f->max_zone);
 	for (z = zb; z < ze; z++) {
@@ -1605,10 +1615,11 @@ static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
 		 * z->wp > zone_end means that one or more I/O errors
 		 * have occurred.
 		 */
-		pthread_mutex_lock(&zbd_info->mutex);
-		if (z->wp <= zone_end)
+		if (accounting_vdb(td, f) && z->wp <= zone_end) {
+			pthread_mutex_lock(&zbd_info->mutex);
 			zbd_info->wp_valid_data_bytes += zone_end - z->wp;
-		pthread_mutex_unlock(&zbd_info->mutex);
+			pthread_mutex_unlock(&zbd_info->mutex);
+		}
 		z->wp = zone_end;
 		break;
 	default:
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 6/8] zbd: check write ranges for zone_reset_threshold option
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
                   ` (4 preceding siblings ...)
  2023-02-07  6:37 ` [PATCH v2 5/8] zbd: account valid data bytes only for zone_reset_threshold option Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:06   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up Shin'ichiro Kawasaki
  2023-02-07  6:37 ` [PATCH v2 8/8] t/zbd: add test cases for zone_reset_threshold option Shin'ichiro Kawasaki
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The valid data bytes accounting is used for zone_reset_threshold option.
This accounting usage has two issues. The first issue is unexpected
zone reset due to different IO ranges. The valid data bytes accounting
is done for all IO ranges per device, and shared by all jobs. On the
other hand, the zone_reset_threshold option is defined as the ratio to
each job's IO range. When a job refers to the accounting value, it
includes writes to IO ranges out of the job's IO range. Then zone reset
is triggered earlier than expected.

The second issue is accounting value initialization. The initialization
of the accounting field is repeated for each job, then the value
initialized by the first job is overwritten by other jobs. This works as
expected for single job or multiple jobs with same write range. However,
when multiple jobs have different write ranges, the overwritten value is
wrong except for the last job.

To ensure that the accounting works as expected for the option, check
that write ranges of all jobs are same. If jobs have different write
ranges, report it as an error. Initialize the accounting field only once
for the first job. All jobs have same write range, then one time
initialization is enough. Update man page to clarify this limitation of
the option.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 HOWTO.rst |  3 ++-
 fio.1     |  3 ++-
 zbd.c     | 22 +++++++++++++++++++---
 zbd.h     |  4 ++++
 4 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/HOWTO.rst b/HOWTO.rst
index d08f8a18..c71c3949 100644
--- a/HOWTO.rst
+++ b/HOWTO.rst
@@ -1088,7 +1088,8 @@ Target file/device
 	A number between zero and one that indicates the ratio of written bytes
 	to the total size of the zones with write pointers in the IO range. When
 	current ratio is above this ratio, zones are reset periodically as
-	:option:`zone_reset_frequency` specifies.
+	:option:`zone_reset_frequency` specifies. If there are multiple jobs when
+	using this option, the IO range for all write jobs shall be the same.
 
 .. option:: zone_reset_frequency=float
 
diff --git a/fio.1 b/fio.1
index 54d2c403..80cd23b5 100644
--- a/fio.1
+++ b/fio.1
@@ -857,7 +857,8 @@ value to be larger than the device reported limit. Default: false.
 A number between zero and one that indicates the ratio of written bytes to the
 total size of the zones with write pointers in the IO range. When current ratio
 is above this ratio, zones are reset periodically as \fBzone_reset_frequency\fR
-specifies.
+specifies. If there are multiple jobs when using this option, the IO range for
+all write jobs shall be the same.
 .TP
 .BI zone_reset_frequency \fR=\fPfloat
 A number between zero and one that indicates how often a zone reset should be
diff --git a/zbd.c b/zbd.c
index 6783acf9..ca6816b9 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1201,10 +1201,26 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze, *z;
 	uint64_t wp_vdb = 0;
+	struct zoned_block_device_info *zbdi = f->zbd_info;
 
 	if (!accounting_vdb(td, f))
 		return 0;
 
+	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
+		if (zbdi->wp_write_min_zone != f->min_zone ||
+		    zbdi->wp_write_max_zone != f->max_zone) {
+			td_verror(td, EINVAL,
+				  "multi-jobs with different write ranges are "
+				  "not supported with zone_reset_threshold");
+			log_err("multi-jobs with different write ranges are "
+				"not supported with zone_reset_threshold\n");
+		}
+		return 0;
+	}
+
+	zbdi->wp_write_min_zone = f->min_zone;
+	zbdi->wp_write_max_zone = f->max_zone;
+
 	zb = zbd_get_zone(f, f->min_zone);
 	ze = zbd_get_zone(f, f->max_zone);
 	for (z = zb; z < ze; z++) {
@@ -1214,9 +1230,9 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
 		}
 	}
 
-	pthread_mutex_lock(&f->zbd_info->mutex);
-	f->zbd_info->wp_valid_data_bytes = wp_vdb;
-	pthread_mutex_unlock(&f->zbd_info->mutex);
+	pthread_mutex_lock(&zbdi->mutex);
+	zbdi->wp_valid_data_bytes = wp_vdb;
+	pthread_mutex_unlock(&zbdi->mutex);
 
 	for (z = zb; z < ze; z++)
 		if (z->has_wp)
diff --git a/zbd.h b/zbd.h
index 20b2fe17..d4f81325 100644
--- a/zbd.h
+++ b/zbd.h
@@ -55,6 +55,8 @@ struct fio_zone_info {
  *		num_open_zones).
  * @zone_size: size of a single zone in bytes.
  * @wp_valid_data_bytes: total size of data in zones with write pointers
+ * @wp_write_min_zone: Minimum zone index of all job's write ranges. Inclusive.
+ * @wp_write_max_zone: Maximum zone index of all job's write ranges. Exclusive.
  * @zone_size_log2: log2 of the zone size in bytes if it is a power of 2 or 0
  *		if the zone size is not a power of 2.
  * @nr_zones: number of zones
@@ -75,6 +77,8 @@ struct zoned_block_device_info {
 	pthread_mutex_t		mutex;
 	uint64_t		zone_size;
 	uint64_t		wp_valid_data_bytes;
+	uint32_t		wp_write_min_zone;
+	uint32_t		wp_write_max_zone;
 	uint32_t		zone_size_log2;
 	uint32_t		nr_zones;
 	uint32_t		refcount;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
                   ` (5 preceding siblings ...)
  2023-02-07  6:37 ` [PATCH v2 6/8] zbd: check write ranges " Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:06   ` Niklas Cassel
  2023-02-07  6:37 ` [PATCH v2 8/8] t/zbd: add test cases for zone_reset_threshold option Shin'ichiro Kawasaki
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The valid data bytes accounting field is initialized at file reset,
after each job started. Each job locks zones to check write pointer
positions of its write target zones. This can cause zone lock contention
with write by other jobs.

To avoid the zone lock contention, move the initialization from file
reset to file set up before job start. It allows to access the write
pointers and the accounting field without locks. Remove the lock and
unlock codes which are no longer required. Ensure the locks are not
required by checking run-state in the struct thread_data.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 93 ++++++++++++++++++++++++++++-------------------------------
 1 file changed, 44 insertions(+), 49 deletions(-)

diff --git a/zbd.c b/zbd.c
index ca6816b9..a464d6df 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1074,6 +1074,44 @@ void zbd_recalc_options_with_zone_granularity(struct thread_data *td)
 	}
 }
 
+static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
+{
+	struct fio_zone_info *zb, *ze, *z;
+	uint64_t wp_vdb = 0;
+	struct zoned_block_device_info *zbdi = f->zbd_info;
+
+	assert(td->runstate < TD_RUNNING);
+	assert(zbdi);
+
+	if (!accounting_vdb(td, f))
+		return 0;
+
+	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
+		if (zbdi->wp_write_min_zone != f->min_zone ||
+		    zbdi->wp_write_max_zone != f->max_zone) {
+			td_verror(td, EINVAL,
+				  "multi-jobs with different write ranges are "
+				  "not supported with zone_reset_threshold");
+			log_err("multi-jobs with different write ranges are "
+				"not supported with zone_reset_threshold\n");
+		}
+		return 0;
+	}
+
+	zbdi->wp_write_min_zone = f->min_zone;
+	zbdi->wp_write_max_zone = f->max_zone;
+
+	zb = zbd_get_zone(f, f->min_zone);
+	ze = zbd_get_zone(f, f->max_zone);
+	for (z = zb; z < ze; z++)
+		if (z->has_wp)
+			wp_vdb += z->wp - z->start;
+
+	zbdi->wp_valid_data_bytes = wp_vdb;
+
+	return wp_vdb;
+}
+
 int zbd_setup_files(struct thread_data *td)
 {
 	struct fio_file *f;
@@ -1099,6 +1137,7 @@ int zbd_setup_files(struct thread_data *td)
 		struct zoned_block_device_info *zbd = f->zbd_info;
 		struct fio_zone_info *z;
 		int zi;
+		uint64_t vdb;
 
 		assert(zbd);
 
@@ -1106,6 +1145,11 @@ int zbd_setup_files(struct thread_data *td)
 		f->max_zone =
 			zbd_offset_to_zone_idx(f, f->file_offset + f->io_size);
 
+		vdb = zbd_set_vdb(td, f);
+
+		dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
+		       __func__, f->file_name, vdb);
+
 		/*
 		 * When all zones in the I/O range are conventional, io_size
 		 * can be smaller than zone size, making min_zone the same
@@ -1197,54 +1241,9 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
 	return write_cnt == 0;
 }
 
-static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
-{
-	struct fio_zone_info *zb, *ze, *z;
-	uint64_t wp_vdb = 0;
-	struct zoned_block_device_info *zbdi = f->zbd_info;
-
-	if (!accounting_vdb(td, f))
-		return 0;
-
-	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
-		if (zbdi->wp_write_min_zone != f->min_zone ||
-		    zbdi->wp_write_max_zone != f->max_zone) {
-			td_verror(td, EINVAL,
-				  "multi-jobs with different write ranges are "
-				  "not supported with zone_reset_threshold");
-			log_err("multi-jobs with different write ranges are "
-				"not supported with zone_reset_threshold\n");
-		}
-		return 0;
-	}
-
-	zbdi->wp_write_min_zone = f->min_zone;
-	zbdi->wp_write_max_zone = f->max_zone;
-
-	zb = zbd_get_zone(f, f->min_zone);
-	ze = zbd_get_zone(f, f->max_zone);
-	for (z = zb; z < ze; z++) {
-		if (z->has_wp) {
-			zone_lock(td, f, z);
-			wp_vdb += z->wp - z->start;
-		}
-	}
-
-	pthread_mutex_lock(&zbdi->mutex);
-	zbdi->wp_valid_data_bytes = wp_vdb;
-	pthread_mutex_unlock(&zbdi->mutex);
-
-	for (z = zb; z < ze; z++)
-		if (z->has_wp)
-			zone_unlock(z);
-
-	return wp_vdb;
-}
-
 void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze;
-	uint64_t vdb;
 	bool verify_data_left = false;
 
 	if (!f->zbd_info || !td_write(td))
@@ -1252,10 +1251,6 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 
 	zb = zbd_get_zone(f, f->min_zone);
 	ze = zbd_get_zone(f, f->max_zone);
-	vdb = zbd_set_vdb(td, f);
-
-	dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
-	       __func__, f->file_name, vdb);
 
 	/*
 	 * If data verification is enabled reset the affected zones before
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 8/8] t/zbd: add test cases for zone_reset_threshold option
  2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
                   ` (6 preceding siblings ...)
  2023-02-07  6:37 ` [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up Shin'ichiro Kawasaki
@ 2023-02-07  6:37 ` Shin'ichiro Kawasaki
  2023-02-07 14:06   ` Niklas Cassel
  7 siblings, 1 reply; 20+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-02-07  6:37 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu
  Cc: Damien Le Moal, Niklas Cassel, Dmitry Fomichev, Shin'ichiro Kawasaki

The zone_reset_threshold option works for multiple jobs only when the
jobs have same write range. Add three test cases to confirm that the
option works for multiple jobs as expected. The first test case checks
that different write ranges are reported as an error. The second test
case checks that multiple write jobs work when they have same write
range. The third test case checks that a read job and a write job work
when they have different IO ranges.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 t/zbd/test-zbd-support | 56 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index c32953c4..893aff3c 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -1305,6 +1305,62 @@ test60() {
 	grep -q 'not support experimental verify' "${logfile}.${test_number}"
 }
 
+# Test fio errors out zone_reset_threshold option for multiple jobs with
+# different write ranges.
+test61() {
+	run_fio_on_seq "$(ioengine "psync")" --rw=write --size="$zone_size" \
+		       --numjobs=2 --offset_increment="$zone_size" \
+		       --zone_reset_threshold=0.1 --zone_reset_frequency=1 \
+		       --exitall_on_error=1 \
+		       >> "${logfile}.${test_number}" 2>&1 && return 1
+	grep -q 'different write ranges' "${logfile}.${test_number}"
+}
+
+# Test zone_reset_threshold option works for multiple jobs with same write
+# range.
+test62() {
+	local bs loops=2 size=$((zone_size))
+
+	[ -n "$is_zbd" ] && reset_zone "$dev" -1
+
+	# Two jobs write to single zone twice. Reset zone happens at next write
+	# after half of the zone gets filled. So 2 * 2 * 2 - 1 = 7 times zone
+	# resets are expected.
+	bs=$(min $((256*1024)) $((zone_size / 4)))
+	run_fio_on_seq "$(ioengine "psync")" --rw=write --bs="$bs" \
+		       --size=$size --loops=$loops --numjobs=2 \
+		       --zone_reset_frequency=1 --zone_reset_threshold=.5 \
+		       --group_reporting=1 \
+		       >> "${logfile}.${test_number}" 2>&1 || return $?
+	check_written $((size * loops * 2)) || return $?
+	check_reset_count -eq 7 || return $?
+}
+
+# Test zone_reset_threshold option works for a read job and a write job with
+# different IO range.
+test63() {
+	local bs loops=2 size=$((zone_size)) off1 off2
+
+	[ -n "$is_zbd" ] && reset_zone "$dev" -1
+
+	off1=$((first_sequential_zone_sector * 512))
+	off2=$((off1 + zone_size))
+	bs=$(min $((256*1024)) $((zone_size / 4)))
+
+	# One job writes to single zone twice. Reset zone happens at next write
+	# after half of the zone gets filled. So 2 * 2 - 1 = 3 times zone resets
+	# are expected.
+	run_fio "$(ioengine "psync")" --bs="$bs" --size=$size --loops=$loops \
+		--filename="$dev" --group_reporting=1 \
+		--zonemode=zbd --zonesize="$zone_size" --direct=1 \
+		--zone_reset_frequency=1 --zone_reset_threshold=.5 \
+		--name=r --rw=read --offset=$off1 "${job_var_opts[@]}" \
+		--name=w --rw=write --offset=$off2 "${job_var_opts[@]}" \
+		       >> "${logfile}.${test_number}" 2>&1 || return $?
+	check_written $((size * loops)) || return $?
+	check_reset_count -eq 3 || return $?
+}
+
 SECONDS=0
 tests=()
 dynamic_analyzer=()
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting
  2023-02-07  6:37 ` [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting Shin'ichiro Kawasaki
@ 2023-02-07 14:05   ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:05 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:32PM +0900, Shin'ichiro Kawasaki wrote:
> To decide the first IO direction of randrw workload, the function
> zbd_adjust_ddir() refers to the zbd_info->sectors_with_data value which
> indicates the number of bytes written to the zoned block devices being
> accessed. However, this accounting has two issues. The first issue is
> wrong accounting for multiple jobs with different write ranges. The
> second issue is job start up failure due to zone lock contention.
> 
> Avoid using zbd_info->sectors_with_data and simply refer to file->
> last_start[DDIR_WRITE] instead. It is initialized with -1ULL for each
> job. After any write operation is done by the job, it keeps valid
> offset. If it has valid offset, written data is expected and the first
> IO direction can be read.
> 
> Also remove zbd_info->sectors_with_data, which is no longer used. Keep
> the field zbd_info->wp_sectors_with_data since it is still used for
> zones with write pointers.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  zbd.c | 14 +++-----------
>  zbd.h |  2 --
>  2 files changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/zbd.c b/zbd.c
> index d1e469f6..f5e76c40 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -286,7 +286,6 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
>  	}
>  
>  	pthread_mutex_lock(&f->zbd_info->mutex);
> -	f->zbd_info->sectors_with_data -= data_in_zone;
>  	f->zbd_info->wp_sectors_with_data -= data_in_zone;
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  
> @@ -1201,7 +1200,6 @@ static uint64_t zbd_process_swd(struct thread_data *td,
>  				const struct fio_file *f, enum swd_action a)
>  {
>  	struct fio_zone_info *zb, *ze, *z;
> -	uint64_t swd = 0;
>  	uint64_t wp_swd = 0;
>  
>  	zb = zbd_get_zone(f, f->min_zone);
> @@ -1211,17 +1209,14 @@ static uint64_t zbd_process_swd(struct thread_data *td,
>  			zone_lock(td, f, z);
>  			wp_swd += z->wp - z->start;
>  		}
> -		swd += z->wp - z->start;
>  	}
>  
>  	pthread_mutex_lock(&f->zbd_info->mutex);
>  	switch (a) {
>  	case CHECK_SWD:
> -		assert(f->zbd_info->sectors_with_data == swd);
>  		assert(f->zbd_info->wp_sectors_with_data == wp_swd);
>  		break;
>  	case SET_SWD:
> -		f->zbd_info->sectors_with_data = swd;
>  		f->zbd_info->wp_sectors_with_data = wp_swd;
>  		break;
>  	}
> @@ -1231,7 +1226,7 @@ static uint64_t zbd_process_swd(struct thread_data *td,
>  		if (z->has_wp)
>  			zone_unlock(z);
>  
> -	return swd;
> +	return wp_swd;
>  }
>  
>  /*
> @@ -1640,10 +1635,8 @@ static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
>  		 * have occurred.
>  		 */
>  		pthread_mutex_lock(&zbd_info->mutex);
> -		if (z->wp <= zone_end) {
> -			zbd_info->sectors_with_data += zone_end - z->wp;
> +		if (z->wp <= zone_end)
>  			zbd_info->wp_sectors_with_data += zone_end - z->wp;
> -		}
>  		pthread_mutex_unlock(&zbd_info->mutex);
>  		z->wp = zone_end;
>  		break;
> @@ -1801,8 +1794,7 @@ enum fio_ddir zbd_adjust_ddir(struct thread_data *td, struct io_u *io_u,
>  	if (ddir != DDIR_READ || !td_rw(td))
>  		return ddir;
>  
> -	if (io_u->file->zbd_info->sectors_with_data ||
> -	    td->o.read_beyond_wp)
> +	if (io_u->file->last_start[DDIR_WRITE] != -1ULL || td->o.read_beyond_wp)
>  		return DDIR_READ;
>  
>  	return DDIR_WRITE;
> diff --git a/zbd.h b/zbd.h
> index d425707e..9ab25c47 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -54,7 +54,6 @@ struct fio_zone_info {
>   * @mutex: Protects the modifiable members in this structure (refcount and
>   *		num_open_zones).
>   * @zone_size: size of a single zone in bytes.
> - * @sectors_with_data: total size of data in all zones in units of 512 bytes
>   * @wp_sectors_with_data: total size of data in zones with write pointers in
>   *                        units of 512 bytes
>   * @zone_size_log2: log2 of the zone size in bytes if it is a power of 2 or 0
> @@ -76,7 +75,6 @@ struct zoned_block_device_info {
>  	uint32_t		max_open_zones;
>  	pthread_mutex_t		mutex;
>  	uint64_t		zone_size;
> -	uint64_t		sectors_with_data;
>  	uint64_t		wp_sectors_with_data;
>  	uint32_t		zone_size_log2;
>  	uint32_t		nr_zones;
> -- 
> 2.38.1
> 

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/8] zbd: remove CHECK_SWD feature
  2023-02-07  6:37 ` [PATCH v2 2/8] zbd: remove CHECK_SWD feature Shin'ichiro Kawasaki
@ 2023-02-07 14:05   ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:05 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:33PM +0900, Shin'ichiro Kawasaki wrote:
> The 'sectors with data' accounting had been used for CHECK_SWD debug
> feature. It compared expected written data size and actually written
> data size for zonemode=zbd. However, this feature has been disabled for
> a while and not actively used. Also, the sector with data accounting has
> two issues. The first issue is wrong accounting for multiple jobs with
> different write ranges. The second issue is job start up failure due to
> zone lock contention.
> 
> Avoid using the accounting by removing the CHECK_SWD feature and related
> code. Also rename the function zbd_process_swd() to zbd_set_swd() to
> clarify that it no longer works for CHECK_SWD.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  zbd.c | 38 +++-----------------------------------
>  1 file changed, 3 insertions(+), 35 deletions(-)
> 
> diff --git a/zbd.c b/zbd.c
> index f5e76c40..b6cf2a93 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -1190,14 +1190,7 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
>  	return write_cnt == 0;
>  }
>  
> -enum swd_action {
> -	CHECK_SWD,
> -	SET_SWD,
> -};
> -
> -/* Calculate the number of sectors with data (swd) and perform action 'a' */
> -static uint64_t zbd_process_swd(struct thread_data *td,
> -				const struct fio_file *f, enum swd_action a)
> +static uint64_t zbd_set_swd(struct thread_data *td, const struct fio_file *f)
>  {
>  	struct fio_zone_info *zb, *ze, *z;
>  	uint64_t wp_swd = 0;
> @@ -1212,14 +1205,7 @@ static uint64_t zbd_process_swd(struct thread_data *td,
>  	}
>  
>  	pthread_mutex_lock(&f->zbd_info->mutex);
> -	switch (a) {
> -	case CHECK_SWD:
> -		assert(f->zbd_info->wp_sectors_with_data == wp_swd);
> -		break;
> -	case SET_SWD:
> -		f->zbd_info->wp_sectors_with_data = wp_swd;
> -		break;
> -	}
> +	f->zbd_info->wp_sectors_with_data = wp_swd;
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  
>  	for (z = zb; z < ze; z++)
> @@ -1229,21 +1215,6 @@ static uint64_t zbd_process_swd(struct thread_data *td,
>  	return wp_swd;
>  }
>  
> -/*
> - * The swd check is useful for debugging but takes too much time to leave
> - * it enabled all the time. Hence it is disabled by default.
> - */
> -static const bool enable_check_swd = false;
> -
> -/* Check whether the values of zbd_info.*sectors_with_data are correct. */
> -static void zbd_check_swd(struct thread_data *td, const struct fio_file *f)
> -{
> -	if (!enable_check_swd)
> -		return;
> -
> -	zbd_process_swd(td, f, CHECK_SWD);
> -}
> -
>  void zbd_file_reset(struct thread_data *td, struct fio_file *f)
>  {
>  	struct fio_zone_info *zb, *ze;
> @@ -1255,7 +1226,7 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
>  
>  	zb = zbd_get_zone(f, f->min_zone);
>  	ze = zbd_get_zone(f, f->max_zone);
> -	swd = zbd_process_swd(td, f, SET_SWD);
> +	swd = zbd_set_swd(td, f);
>  
>  	dprint(FD_ZBD, "%s(%s): swd = %" PRIu64 "\n",
>  	       __func__, f->file_name, swd);
> @@ -1677,7 +1648,6 @@ static void zbd_put_io(struct thread_data *td, const struct io_u *io_u)
>  	zbd_end_zone_io(td, io_u, z);
>  
>  	zone_unlock(z);
> -	zbd_check_swd(td, f);
>  }
>  
>  /*
> @@ -1866,8 +1836,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>  	    io_u->ddir == DDIR_READ && td->o.read_beyond_wp)
>  		return io_u_accept;
>  
> -	zbd_check_swd(td, f);
> -
>  	zone_lock(td, f, zb);
>  
>  	switch (io_u->ddir) {
> -- 
> 2.38.1
> 

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/8] zbd: rename the accounting 'sectors with data' to 'valid data bytes'
  2023-02-07  6:37 ` [PATCH v2 3/8] zbd: rename the accounting 'sectors with data' to 'valid data bytes' Shin'ichiro Kawasaki
@ 2023-02-07 14:06   ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:34PM +0900, Shin'ichiro Kawasaki wrote:
> The 'sectors with data' accounting was designed to have 'sector' as its
> unit. Then related variables have the word 'sector' in their names. Also
> related code comments use the words 'sector' or 'logical blocks'.
> However, actually it was implemented to have 'byte' as unit. Rename
> related variables and comments to indicate the byte unit. Also replace
> the abbreviation swd to vdb.
> 
> Fixes: a7c2b6fc2959 ("Add support for resetting zones periodically")
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  t/zbd/test-zbd-support |  4 ++--
>  zbd.c                  | 25 +++++++++++++------------
>  zbd.h                  |  5 ++---
>  3 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
> index 4091d9ac..c32953c4 100755
> --- a/t/zbd/test-zbd-support
> +++ b/t/zbd/test-zbd-support
> @@ -1110,8 +1110,8 @@ test51() {
>  	run_fio "${opts[@]}" >> "${logfile}.${test_number}" 2>&1 || return $?
>  }
>  
> -# Verify that zone_reset_threshold only takes logical blocks from seq
> -# zones into account, and logical blocks of conv zones are not counted.
> +# Verify that zone_reset_threshold only accounts written bytes in seq
> +# zones, and written data bytes of conv zones are not counted.
>  test52() {
>  	local off io_size
>  
> diff --git a/zbd.c b/zbd.c
> index b6cf2a93..455dad53 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -286,7 +286,7 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
>  	}
>  
>  	pthread_mutex_lock(&f->zbd_info->mutex);
> -	f->zbd_info->wp_sectors_with_data -= data_in_zone;
> +	f->zbd_info->wp_valid_data_bytes -= data_in_zone;
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  
>  	z->wp = z->start;
> @@ -1190,35 +1190,35 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
>  	return write_cnt == 0;
>  }
>  
> -static uint64_t zbd_set_swd(struct thread_data *td, const struct fio_file *f)
> +static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
>  {
>  	struct fio_zone_info *zb, *ze, *z;
> -	uint64_t wp_swd = 0;
> +	uint64_t wp_vdb = 0;
>  
>  	zb = zbd_get_zone(f, f->min_zone);
>  	ze = zbd_get_zone(f, f->max_zone);
>  	for (z = zb; z < ze; z++) {
>  		if (z->has_wp) {
>  			zone_lock(td, f, z);
> -			wp_swd += z->wp - z->start;
> +			wp_vdb += z->wp - z->start;
>  		}
>  	}
>  
>  	pthread_mutex_lock(&f->zbd_info->mutex);
> -	f->zbd_info->wp_sectors_with_data = wp_swd;
> +	f->zbd_info->wp_valid_data_bytes = wp_vdb;
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  
>  	for (z = zb; z < ze; z++)
>  		if (z->has_wp)
>  			zone_unlock(z);
>  
> -	return wp_swd;
> +	return wp_vdb;
>  }
>  
>  void zbd_file_reset(struct thread_data *td, struct fio_file *f)
>  {
>  	struct fio_zone_info *zb, *ze;
> -	uint64_t swd;
> +	uint64_t vdb;
>  	bool verify_data_left = false;
>  
>  	if (!f->zbd_info || !td_write(td))
> @@ -1226,10 +1226,10 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
>  
>  	zb = zbd_get_zone(f, f->min_zone);
>  	ze = zbd_get_zone(f, f->max_zone);
> -	swd = zbd_set_swd(td, f);
> +	vdb = zbd_set_vdb(td, f);
>  
> -	dprint(FD_ZBD, "%s(%s): swd = %" PRIu64 "\n",
> -	       __func__, f->file_name, swd);
> +	dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
> +	       __func__, f->file_name, vdb);
>  
>  	/*
>  	 * If data verification is enabled reset the affected zones before
> @@ -1607,7 +1607,7 @@ static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
>  		 */
>  		pthread_mutex_lock(&zbd_info->mutex);
>  		if (z->wp <= zone_end)
> -			zbd_info->wp_sectors_with_data += zone_end - z->wp;
> +			zbd_info->wp_valid_data_bytes += zone_end - z->wp;
>  		pthread_mutex_unlock(&zbd_info->mutex);
>  		z->wp = zone_end;
>  		break;
> @@ -1960,7 +1960,8 @@ retry:
>  
>  		/* Check whether the zone reset threshold has been exceeded */
>  		if (td->o.zrf.u.f) {
> -			if (zbdi->wp_sectors_with_data >= f->io_size * td->o.zrt.u.f &&
> +			if (zbdi->wp_valid_data_bytes >=
> +			    f->io_size * td->o.zrt.u.f &&
>  			    zbd_dec_and_reset_write_cnt(td, f))
>  				zb->reset_zone = 1;
>  		}
> diff --git a/zbd.h b/zbd.h
> index 9ab25c47..20b2fe17 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -54,8 +54,7 @@ struct fio_zone_info {
>   * @mutex: Protects the modifiable members in this structure (refcount and
>   *		num_open_zones).
>   * @zone_size: size of a single zone in bytes.
> - * @wp_sectors_with_data: total size of data in zones with write pointers in
> - *                        units of 512 bytes
> + * @wp_valid_data_bytes: total size of data in zones with write pointers
>   * @zone_size_log2: log2 of the zone size in bytes if it is a power of 2 or 0
>   *		if the zone size is not a power of 2.
>   * @nr_zones: number of zones
> @@ -75,7 +74,7 @@ struct zoned_block_device_info {
>  	uint32_t		max_open_zones;
>  	pthread_mutex_t		mutex;
>  	uint64_t		zone_size;
> -	uint64_t		wp_sectors_with_data;
> +	uint64_t		wp_valid_data_bytes;
>  	uint32_t		zone_size_log2;
>  	uint32_t		nr_zones;
>  	uint32_t		refcount;
> -- 
> 2.38.1
> 

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/8] doc: fix unit of zone_reset_threshold and relation to other option
  2023-02-07  6:37 ` [PATCH v2 4/8] doc: fix unit of zone_reset_threshold and relation to other option Shin'ichiro Kawasaki
@ 2023-02-07 14:06   ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:35PM +0900, Shin'ichiro Kawasaki wrote:
> The zone_reset_threshold option uses the 'sectors with data' accounting
> then it was described to have 'logical block' as its unit. However, the
> accounting was implement with 'byte' unit. Fix the description of the

nit: s/implement/implemented/

> option.
> 
> Also, the zone_reset_threshold option works together with the
> zone_reset_frequency option. Describe this relation also.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  HOWTO.rst | 7 ++++---
>  fio.1     | 7 ++++---
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/HOWTO.rst b/HOWTO.rst
> index 17caaf5d..d08f8a18 100644
> --- a/HOWTO.rst
> +++ b/HOWTO.rst
> @@ -1085,9 +1085,10 @@ Target file/device
>  
>  .. option:: zone_reset_threshold=float
>  
> -	A number between zero and one that indicates the ratio of logical
> -	blocks with data to the total number of logical blocks in the test
> -	above which zones should be reset periodically.
> +	A number between zero and one that indicates the ratio of written bytes
> +	to the total size of the zones with write pointers in the IO range. When
> +	current ratio is above this ratio, zones are reset periodically as
> +	:option:`zone_reset_frequency` specifies.
>  
>  .. option:: zone_reset_frequency=float
>  
> diff --git a/fio.1 b/fio.1
> index 527b3d46..54d2c403 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -854,9 +854,10 @@ of the zoned block device in use, thus allowing the option \fBmax_open_zones\fR
>  value to be larger than the device reported limit. Default: false.
>  .TP
>  .BI zone_reset_threshold \fR=\fPfloat
> -A number between zero and one that indicates the ratio of logical blocks with
> -data to the total number of logical blocks in the test above which zones
> -should be reset periodically.
> +A number between zero and one that indicates the ratio of written bytes to the
> +total size of the zones with write pointers in the IO range. When current ratio
> +is above this ratio, zones are reset periodically as \fBzone_reset_frequency\fR
> +specifies.
>  .TP
>  .BI zone_reset_frequency \fR=\fPfloat
>  A number between zero and one that indicates how often a zone reset should be
> -- 
> 2.38.1
> 

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/8] zbd: account valid data bytes only for zone_reset_threshold option
  2023-02-07  6:37 ` [PATCH v2 5/8] zbd: account valid data bytes only for zone_reset_threshold option Shin'ichiro Kawasaki
@ 2023-02-07 14:06   ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:36PM +0900, Shin'ichiro Kawasaki wrote:
> The valid data bytes accounting is used only for zone_reset_threshold
> option. Avoid the accounting when the option is not specified.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  zbd.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/zbd.c b/zbd.c
> index 455dad53..6783acf9 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -147,6 +147,11 @@ zbd_offset_to_zone(const struct fio_file *f,  uint64_t offset)
>  	return zbd_get_zone(f, zbd_offset_to_zone_idx(f, offset));
>  }
>  
> +static bool accounting_vdb(struct thread_data *td, const struct fio_file *f)
> +{
> +	return td->o.zrt.u.f && td_write(td);
> +}
> +
>  /**
>   * zbd_get_zoned_model - Get a device zoned model
>   * @td: FIO thread data
> @@ -285,9 +290,11 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
>  		break;
>  	}
>  
> -	pthread_mutex_lock(&f->zbd_info->mutex);
> -	f->zbd_info->wp_valid_data_bytes -= data_in_zone;
> -	pthread_mutex_unlock(&f->zbd_info->mutex);
> +	if (accounting_vdb(td, f)) {
> +		pthread_mutex_lock(&f->zbd_info->mutex);
> +		f->zbd_info->wp_valid_data_bytes -= data_in_zone;
> +		pthread_mutex_unlock(&f->zbd_info->mutex);
> +	}
>  
>  	z->wp = z->start;
>  
> @@ -1195,6 +1202,9 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
>  	struct fio_zone_info *zb, *ze, *z;
>  	uint64_t wp_vdb = 0;
>  
> +	if (!accounting_vdb(td, f))
> +		return 0;
> +
>  	zb = zbd_get_zone(f, f->min_zone);
>  	ze = zbd_get_zone(f, f->max_zone);
>  	for (z = zb; z < ze; z++) {
> @@ -1605,10 +1615,11 @@ static void zbd_queue_io(struct thread_data *td, struct io_u *io_u, int q,
>  		 * z->wp > zone_end means that one or more I/O errors
>  		 * have occurred.
>  		 */
> -		pthread_mutex_lock(&zbd_info->mutex);
> -		if (z->wp <= zone_end)
> +		if (accounting_vdb(td, f) && z->wp <= zone_end) {
> +			pthread_mutex_lock(&zbd_info->mutex);
>  			zbd_info->wp_valid_data_bytes += zone_end - z->wp;
> -		pthread_mutex_unlock(&zbd_info->mutex);
> +			pthread_mutex_unlock(&zbd_info->mutex);
> +		}
>  		z->wp = zone_end;
>  		break;
>  	default:
> -- 
> 2.38.1
> 

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/8] zbd: check write ranges for zone_reset_threshold option
  2023-02-07  6:37 ` [PATCH v2 6/8] zbd: check write ranges " Shin'ichiro Kawasaki
@ 2023-02-07 14:06   ` Niklas Cassel
  2023-02-08  4:59     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:37PM +0900, Shin'ichiro Kawasaki wrote:
> The valid data bytes accounting is used for zone_reset_threshold option.
> This accounting usage has two issues. The first issue is unexpected
> zone reset due to different IO ranges. The valid data bytes accounting
> is done for all IO ranges per device, and shared by all jobs. On the
> other hand, the zone_reset_threshold option is defined as the ratio to
> each job's IO range. When a job refers to the accounting value, it
> includes writes to IO ranges out of the job's IO range. Then zone reset
> is triggered earlier than expected.
> 
> The second issue is accounting value initialization. The initialization
> of the accounting field is repeated for each job, then the value
> initialized by the first job is overwritten by other jobs. This works as
> expected for single job or multiple jobs with same write range. However,
> when multiple jobs have different write ranges, the overwritten value is
> wrong except for the last job.
> 
> To ensure that the accounting works as expected for the option, check
> that write ranges of all jobs are same. If jobs have different write
> ranges, report it as an error. Initialize the accounting field only once
> for the first job. All jobs have same write range, then one time
> initialization is enough. Update man page to clarify this limitation of
> the option.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  HOWTO.rst |  3 ++-
>  fio.1     |  3 ++-
>  zbd.c     | 22 +++++++++++++++++++---
>  zbd.h     |  4 ++++
>  4 files changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/HOWTO.rst b/HOWTO.rst
> index d08f8a18..c71c3949 100644
> --- a/HOWTO.rst
> +++ b/HOWTO.rst
> @@ -1088,7 +1088,8 @@ Target file/device
>  	A number between zero and one that indicates the ratio of written bytes
>  	to the total size of the zones with write pointers in the IO range. When
>  	current ratio is above this ratio, zones are reset periodically as
> -	:option:`zone_reset_frequency` specifies.
> +	:option:`zone_reset_frequency` specifies. If there are multiple jobs when
> +	using this option, the IO range for all write jobs shall be the same.

nit: perhaps replace "shall be the same" with "has to be the same"

>  
>  .. option:: zone_reset_frequency=float
>  
> diff --git a/fio.1 b/fio.1
> index 54d2c403..80cd23b5 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -857,7 +857,8 @@ value to be larger than the device reported limit. Default: false.
>  A number between zero and one that indicates the ratio of written bytes to the
>  total size of the zones with write pointers in the IO range. When current ratio
>  is above this ratio, zones are reset periodically as \fBzone_reset_frequency\fR
> -specifies.
> +specifies. If there are multiple jobs when using this option, the IO range for
> +all write jobs shall be the same.
>  .TP
>  .BI zone_reset_frequency \fR=\fPfloat
>  A number between zero and one that indicates how often a zone reset should be
> diff --git a/zbd.c b/zbd.c
> index 6783acf9..ca6816b9 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -1201,10 +1201,26 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
>  {
>  	struct fio_zone_info *zb, *ze, *z;
>  	uint64_t wp_vdb = 0;
> +	struct zoned_block_device_info *zbdi = f->zbd_info;
>  
>  	if (!accounting_vdb(td, f))
>  		return 0;
>  
> +	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
> +		if (zbdi->wp_write_min_zone != f->min_zone ||
> +		    zbdi->wp_write_max_zone != f->max_zone) {
> +			td_verror(td, EINVAL,
> +				  "multi-jobs with different write ranges are "
> +				  "not supported with zone_reset_threshold");
> +			log_err("multi-jobs with different write ranges are "
> +				"not supported with zone_reset_threshold\n");
> +		}
> +		return 0;
> +	}

This check looks a bit weird.
What if the first job specifies a f->file_offset and an f->io_size that is
within the same zone? Then f->min_zone and f->max_zone will have the same value.

Which should cause wp_write_min_zone and wp_write_max_zone to have the same
value.

The check will thus not be performed for the second job, after wp_write_min_zone
and wp_write_min_zone have been initialized.


> +
> +	zbdi->wp_write_min_zone = f->min_zone;
> +	zbdi->wp_write_max_zone = f->max_zone;
> +
>  	zb = zbd_get_zone(f, f->min_zone);
>  	ze = zbd_get_zone(f, f->max_zone);
>  	for (z = zb; z < ze; z++) {
> @@ -1214,9 +1230,9 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
>  		}
>  	}
>  
> -	pthread_mutex_lock(&f->zbd_info->mutex);
> -	f->zbd_info->wp_valid_data_bytes = wp_vdb;
> -	pthread_mutex_unlock(&f->zbd_info->mutex);
> +	pthread_mutex_lock(&zbdi->mutex);
> +	zbdi->wp_valid_data_bytes = wp_vdb;
> +	pthread_mutex_unlock(&zbdi->mutex);
>  
>  	for (z = zb; z < ze; z++)
>  		if (z->has_wp)
> diff --git a/zbd.h b/zbd.h
> index 20b2fe17..d4f81325 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -55,6 +55,8 @@ struct fio_zone_info {
>   *		num_open_zones).
>   * @zone_size: size of a single zone in bytes.
>   * @wp_valid_data_bytes: total size of data in zones with write pointers
> + * @wp_write_min_zone: Minimum zone index of all job's write ranges. Inclusive.
> + * @wp_write_max_zone: Maximum zone index of all job's write ranges. Exclusive.
>   * @zone_size_log2: log2 of the zone size in bytes if it is a power of 2 or 0
>   *		if the zone size is not a power of 2.
>   * @nr_zones: number of zones
> @@ -75,6 +77,8 @@ struct zoned_block_device_info {
>  	pthread_mutex_t		mutex;
>  	uint64_t		zone_size;
>  	uint64_t		wp_valid_data_bytes;
> +	uint32_t		wp_write_min_zone;
> +	uint32_t		wp_write_max_zone;
>  	uint32_t		zone_size_log2;
>  	uint32_t		nr_zones;
>  	uint32_t		refcount;
> -- 
> 2.38.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up
  2023-02-07  6:37 ` [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up Shin'ichiro Kawasaki
@ 2023-02-07 14:06   ` Niklas Cassel
  2023-02-08  5:03     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:38PM +0900, Shin'ichiro Kawasaki wrote:
> The valid data bytes accounting field is initialized at file reset,
> after each job started. Each job locks zones to check write pointer
> positions of its write target zones. This can cause zone lock contention
> with write by other jobs.
> 
> To avoid the zone lock contention, move the initialization from file
> reset to file set up before job start. It allows to access the write
> pointers and the accounting field without locks. Remove the lock and
> unlock codes which are no longer required. Ensure the locks are not
> required by checking run-state in the struct thread_data.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  zbd.c | 93 ++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 44 insertions(+), 49 deletions(-)
> 
> diff --git a/zbd.c b/zbd.c
> index ca6816b9..a464d6df 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -1074,6 +1074,44 @@ void zbd_recalc_options_with_zone_granularity(struct thread_data *td)
>  	}
>  }
>  
> +static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
> +{
> +	struct fio_zone_info *zb, *ze, *z;
> +	uint64_t wp_vdb = 0;
> +	struct zoned_block_device_info *zbdi = f->zbd_info;
> +
> +	assert(td->runstate < TD_RUNNING);
> +	assert(zbdi);
> +
> +	if (!accounting_vdb(td, f))
> +		return 0;

zbd_set_vdb() is also called from zbd_file_reset().

The code from here:

> +	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
> +		if (zbdi->wp_write_min_zone != f->min_zone ||
> +		    zbdi->wp_write_max_zone != f->max_zone) {
> +			td_verror(td, EINVAL,
> +				  "multi-jobs with different write ranges are "
> +				  "not supported with zone_reset_threshold");
> +			log_err("multi-jobs with different write ranges are "
> +				"not supported with zone_reset_threshold\n");
> +		}
> +		return 0;
> +	}
> +
> +	zbdi->wp_write_min_zone = f->min_zone;
> +	zbdi->wp_write_max_zone = f->max_zone;

Up until here. (Basically the code added in patch 6/8.)
Should probably only be called by zbd_setup_files().

Or do we need to perform the verification for every single file reset?

> +
> +	zb = zbd_get_zone(f, f->min_zone);
> +	ze = zbd_get_zone(f, f->max_zone);
> +	for (z = zb; z < ze; z++)
> +		if (z->has_wp)
> +			wp_vdb += z->wp - z->start;
> +
> +	zbdi->wp_valid_data_bytes = wp_vdb;
> +
> +	return wp_vdb;
> +}
> +
>  int zbd_setup_files(struct thread_data *td)
>  {
>  	struct fio_file *f;
> @@ -1099,6 +1137,7 @@ int zbd_setup_files(struct thread_data *td)
>  		struct zoned_block_device_info *zbd = f->zbd_info;
>  		struct fio_zone_info *z;
>  		int zi;
> +		uint64_t vdb;
>  
>  		assert(zbd);
>  
> @@ -1106,6 +1145,11 @@ int zbd_setup_files(struct thread_data *td)
>  		f->max_zone =
>  			zbd_offset_to_zone_idx(f, f->file_offset + f->io_size);
>  
> +		vdb = zbd_set_vdb(td, f);
> +
> +		dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
> +		       __func__, f->file_name, vdb);
> +
>  		/*
>  		 * When all zones in the I/O range are conventional, io_size
>  		 * can be smaller than zone size, making min_zone the same
> @@ -1197,54 +1241,9 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
>  	return write_cnt == 0;
>  }
>  
> -static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
> -{
> -	struct fio_zone_info *zb, *ze, *z;
> -	uint64_t wp_vdb = 0;
> -	struct zoned_block_device_info *zbdi = f->zbd_info;
> -
> -	if (!accounting_vdb(td, f))
> -		return 0;
> -
> -	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
> -		if (zbdi->wp_write_min_zone != f->min_zone ||
> -		    zbdi->wp_write_max_zone != f->max_zone) {
> -			td_verror(td, EINVAL,
> -				  "multi-jobs with different write ranges are "
> -				  "not supported with zone_reset_threshold");
> -			log_err("multi-jobs with different write ranges are "
> -				"not supported with zone_reset_threshold\n");
> -		}
> -		return 0;
> -	}
> -
> -	zbdi->wp_write_min_zone = f->min_zone;
> -	zbdi->wp_write_max_zone = f->max_zone;
> -
> -	zb = zbd_get_zone(f, f->min_zone);
> -	ze = zbd_get_zone(f, f->max_zone);
> -	for (z = zb; z < ze; z++) {
> -		if (z->has_wp) {
> -			zone_lock(td, f, z);
> -			wp_vdb += z->wp - z->start;
> -		}
> -	}
> -
> -	pthread_mutex_lock(&zbdi->mutex);
> -	zbdi->wp_valid_data_bytes = wp_vdb;
> -	pthread_mutex_unlock(&zbdi->mutex);
> -
> -	for (z = zb; z < ze; z++)
> -		if (z->has_wp)
> -			zone_unlock(z);
> -
> -	return wp_vdb;
> -}
> -
>  void zbd_file_reset(struct thread_data *td, struct fio_file *f)
>  {
>  	struct fio_zone_info *zb, *ze;
> -	uint64_t vdb;
>  	bool verify_data_left = false;
>  
>  	if (!f->zbd_info || !td_write(td))
> @@ -1252,10 +1251,6 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
>  
>  	zb = zbd_get_zone(f, f->min_zone);
>  	ze = zbd_get_zone(f, f->max_zone);
> -	vdb = zbd_set_vdb(td, f);
> -
> -	dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
> -	       __func__, f->file_name, vdb);
>  
>  	/*
>  	 * If data verification is enabled reset the affected zones before
> -- 
> 2.38.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 8/8] t/zbd: add test cases for zone_reset_threshold option
  2023-02-07  6:37 ` [PATCH v2 8/8] t/zbd: add test cases for zone_reset_threshold option Shin'ichiro Kawasaki
@ 2023-02-07 14:06   ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-07 14:06 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Tue, Feb 07, 2023 at 03:37:39PM +0900, Shin'ichiro Kawasaki wrote:
> The zone_reset_threshold option works for multiple jobs only when the
> jobs have same write range. Add three test cases to confirm that the
> option works for multiple jobs as expected. The first test case checks
> that different write ranges are reported as an error. The second test
> case checks that multiple write jobs work when they have same write
> range. The third test case checks that a read job and a write job work
> when they have different IO ranges.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  t/zbd/test-zbd-support | 56 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
> index c32953c4..893aff3c 100755
> --- a/t/zbd/test-zbd-support
> +++ b/t/zbd/test-zbd-support
> @@ -1305,6 +1305,62 @@ test60() {
>  	grep -q 'not support experimental verify' "${logfile}.${test_number}"
>  }
>  
> +# Test fio errors out zone_reset_threshold option for multiple jobs with
> +# different write ranges.
> +test61() {
> +	run_fio_on_seq "$(ioengine "psync")" --rw=write --size="$zone_size" \
> +		       --numjobs=2 --offset_increment="$zone_size" \
> +		       --zone_reset_threshold=0.1 --zone_reset_frequency=1 \
> +		       --exitall_on_error=1 \
> +		       >> "${logfile}.${test_number}" 2>&1 && return 1
> +	grep -q 'different write ranges' "${logfile}.${test_number}"
> +}
> +
> +# Test zone_reset_threshold option works for multiple jobs with same write
> +# range.
> +test62() {
> +	local bs loops=2 size=$((zone_size))
> +
> +	[ -n "$is_zbd" ] && reset_zone "$dev" -1
> +
> +	# Two jobs write to single zone twice. Reset zone happens at next write
> +	# after half of the zone gets filled. So 2 * 2 * 2 - 1 = 7 times zone
> +	# resets are expected.
> +	bs=$(min $((256*1024)) $((zone_size / 4)))
> +	run_fio_on_seq "$(ioengine "psync")" --rw=write --bs="$bs" \
> +		       --size=$size --loops=$loops --numjobs=2 \
> +		       --zone_reset_frequency=1 --zone_reset_threshold=.5 \
> +		       --group_reporting=1 \
> +		       >> "${logfile}.${test_number}" 2>&1 || return $?
> +	check_written $((size * loops * 2)) || return $?
> +	check_reset_count -eq 7 || return $?
> +}
> +
> +# Test zone_reset_threshold option works for a read job and a write job with
> +# different IO range.
> +test63() {
> +	local bs loops=2 size=$((zone_size)) off1 off2
> +
> +	[ -n "$is_zbd" ] && reset_zone "$dev" -1
> +
> +	off1=$((first_sequential_zone_sector * 512))
> +	off2=$((off1 + zone_size))
> +	bs=$(min $((256*1024)) $((zone_size / 4)))
> +
> +	# One job writes to single zone twice. Reset zone happens at next write
> +	# after half of the zone gets filled. So 2 * 2 - 1 = 3 times zone resets
> +	# are expected.
> +	run_fio "$(ioengine "psync")" --bs="$bs" --size=$size --loops=$loops \
> +		--filename="$dev" --group_reporting=1 \
> +		--zonemode=zbd --zonesize="$zone_size" --direct=1 \
> +		--zone_reset_frequency=1 --zone_reset_threshold=.5 \
> +		--name=r --rw=read --offset=$off1 "${job_var_opts[@]}" \
> +		--name=w --rw=write --offset=$off2 "${job_var_opts[@]}" \
> +		       >> "${logfile}.${test_number}" 2>&1 || return $?
> +	check_written $((size * loops)) || return $?
> +	check_reset_count -eq 3 || return $?
> +}
> +
>  SECONDS=0
>  tests=()
>  dynamic_analyzer=()
> -- 
> 2.38.1
> 

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/8] zbd: check write ranges for zone_reset_threshold option
  2023-02-07 14:06   ` Niklas Cassel
@ 2023-02-08  4:59     ` Shinichiro Kawasaki
  0 siblings, 0 replies; 20+ messages in thread
From: Shinichiro Kawasaki @ 2023-02-08  4:59 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Feb 07, 2023 / 14:06, Niklas Cassel wrote:
> On Tue, Feb 07, 2023 at 03:37:37PM +0900, Shin'ichiro Kawasaki wrote:
> > The valid data bytes accounting is used for zone_reset_threshold option.
> > This accounting usage has two issues. The first issue is unexpected
> > zone reset due to different IO ranges. The valid data bytes accounting
> > is done for all IO ranges per device, and shared by all jobs. On the
> > other hand, the zone_reset_threshold option is defined as the ratio to
> > each job's IO range. When a job refers to the accounting value, it
> > includes writes to IO ranges out of the job's IO range. Then zone reset
> > is triggered earlier than expected.
> > 
> > The second issue is accounting value initialization. The initialization
> > of the accounting field is repeated for each job, then the value
> > initialized by the first job is overwritten by other jobs. This works as
> > expected for single job or multiple jobs with same write range. However,
> > when multiple jobs have different write ranges, the overwritten value is
> > wrong except for the last job.
> > 
> > To ensure that the accounting works as expected for the option, check
> > that write ranges of all jobs are same. If jobs have different write
> > ranges, report it as an error. Initialize the accounting field only once
> > for the first job. All jobs have same write range, then one time
> > initialization is enough. Update man page to clarify this limitation of
> > the option.
> > 
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > ---
> >  HOWTO.rst |  3 ++-
> >  fio.1     |  3 ++-
> >  zbd.c     | 22 +++++++++++++++++++---
> >  zbd.h     |  4 ++++
> >  4 files changed, 27 insertions(+), 5 deletions(-)
> > 
> > diff --git a/HOWTO.rst b/HOWTO.rst
> > index d08f8a18..c71c3949 100644
> > --- a/HOWTO.rst
> > +++ b/HOWTO.rst
> > @@ -1088,7 +1088,8 @@ Target file/device
> >  	A number between zero and one that indicates the ratio of written bytes
> >  	to the total size of the zones with write pointers in the IO range. When
> >  	current ratio is above this ratio, zones are reset periodically as
> > -	:option:`zone_reset_frequency` specifies.
> > +	:option:`zone_reset_frequency` specifies. If there are multiple jobs when
> > +	using this option, the IO range for all write jobs shall be the same.
> 
> nit: perhaps replace "shall be the same" with "has to be the same"

Will replace. I find "has to" is more consitent in the document than "shall".

> 
> >  
> >  .. option:: zone_reset_frequency=float
> >  
> > diff --git a/fio.1 b/fio.1
> > index 54d2c403..80cd23b5 100644
> > --- a/fio.1
> > +++ b/fio.1
> > @@ -857,7 +857,8 @@ value to be larger than the device reported limit. Default: false.
> >  A number between zero and one that indicates the ratio of written bytes to the
> >  total size of the zones with write pointers in the IO range. When current ratio
> >  is above this ratio, zones are reset periodically as \fBzone_reset_frequency\fR
> > -specifies.
> > +specifies. If there are multiple jobs when using this option, the IO range for
> > +all write jobs shall be the same.
> >  .TP
> >  .BI zone_reset_frequency \fR=\fPfloat
> >  A number between zero and one that indicates how often a zone reset should be
> > diff --git a/zbd.c b/zbd.c
> > index 6783acf9..ca6816b9 100644
> > --- a/zbd.c
> > +++ b/zbd.c
> > @@ -1201,10 +1201,26 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
> >  {
> >  	struct fio_zone_info *zb, *ze, *z;
> >  	uint64_t wp_vdb = 0;
> > +	struct zoned_block_device_info *zbdi = f->zbd_info;
> >  
> >  	if (!accounting_vdb(td, f))
> >  		return 0;
> >  
> > +	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
> > +		if (zbdi->wp_write_min_zone != f->min_zone ||
> > +		    zbdi->wp_write_max_zone != f->max_zone) {
> > +			td_verror(td, EINVAL,
> > +				  "multi-jobs with different write ranges are "
> > +				  "not supported with zone_reset_threshold");
> > +			log_err("multi-jobs with different write ranges are "
> > +				"not supported with zone_reset_threshold\n");
> > +		}
> > +		return 0;
> > +	}
> 
> This check looks a bit weird.
> What if the first job specifies a f->file_offset and an f->io_size that is
> within the same zone? Then f->min_zone and f->max_zone will have the same value.
> 
> Which should cause wp_write_min_zone and wp_write_max_zone to have the same
> value.
> 
> The check will thus not be performed for the second job, after wp_write_min_zone
> and wp_write_min_zone have been initialized.

Thanks for catching this. In most cases, f->min_zone and f->max_zone have
different values, since f->io_size must be equal to or larger than zone size.
However, still f->min_zone and f->max_zone can be same when the IO range fits
within a conventional zone. The commit f952800a0b1a is a reference.

I suggest following 3 more changes below:

1) Before IO range check in zbd_set_vdb(), ensure that the IO range includes
   zones with write pointer. This also ensures that f->min_zone and f->max_zone
   are different. The code diff will be as follows:

diff --git a/zbd.c b/zbd.c
index a464d6df..f0a0856b 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1086,6 +1086,13 @@ static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
        if (!accounting_vdb(td, f))
                return 0;

+       /*
+        * Ensure that all zones in the I/O range are not conventional so that
+        * f->min_zone and f->max_zone have different values.
+        */
+       if (!zbd_is_seq_job(f))
+               return 0;
+
        if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
                if (zbdi->wp_write_min_zone != f->min_zone ||
                    zbdi->wp_write_max_zone != f->max_zone) {

2) The names of new fields wp_write_min_zone and wp_write_max_zone are not
   accurate, since the IO range may cover both coventional zones and zones
   with write pointers. I will rename them to write_min_zone and write_max_zone
   without the wp_ prefix.

3) The 4th patch needs modification also. The description of the
   zone_reset_threshold option should cover the case that IO range has both
   conventional zones and zones with write pointers. It will be like this:

  .BI zone_reset_threshold \fR=\fPfloat
  A number between zero and one that indicates the ratio of written bytes in the
  zones with write pointers in the IO range to the size of the IO range.

-- 
Shin'ichiro Kawasaki

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up
  2023-02-07 14:06   ` Niklas Cassel
@ 2023-02-08  5:03     ` Shinichiro Kawasaki
  2023-02-08  8:30       ` Niklas Cassel
  0 siblings, 1 reply; 20+ messages in thread
From: Shinichiro Kawasaki @ 2023-02-08  5:03 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Feb 07, 2023 / 14:06, Niklas Cassel wrote:
> On Tue, Feb 07, 2023 at 03:37:38PM +0900, Shin'ichiro Kawasaki wrote:
> > The valid data bytes accounting field is initialized at file reset,
> > after each job started. Each job locks zones to check write pointer
> > positions of its write target zones. This can cause zone lock contention
> > with write by other jobs.
> > 
> > To avoid the zone lock contention, move the initialization from file
> > reset to file set up before job start. It allows to access the write
> > pointers and the accounting field without locks. Remove the lock and
> > unlock codes which are no longer required. Ensure the locks are not
> > required by checking run-state in the struct thread_data.
> > 
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > ---
> >  zbd.c | 93 ++++++++++++++++++++++++++++-------------------------------
> >  1 file changed, 44 insertions(+), 49 deletions(-)
> > 
> > diff --git a/zbd.c b/zbd.c
> > index ca6816b9..a464d6df 100644
> > --- a/zbd.c
> > +++ b/zbd.c
> > @@ -1074,6 +1074,44 @@ void zbd_recalc_options_with_zone_granularity(struct thread_data *td)
> >  	}
> >  }
> >  
> > +static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
> > +{
> > +	struct fio_zone_info *zb, *ze, *z;
> > +	uint64_t wp_vdb = 0;
> > +	struct zoned_block_device_info *zbdi = f->zbd_info;
> > +
> > +	assert(td->runstate < TD_RUNNING);
> > +	assert(zbdi);
> > +
> > +	if (!accounting_vdb(td, f))
> > +		return 0;
> 
> zbd_set_vdb() is also called from zbd_file_reset().

Not really, please find my comment below.

> 
> The code from here:
> 
> > +	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
> > +		if (zbdi->wp_write_min_zone != f->min_zone ||
> > +		    zbdi->wp_write_max_zone != f->max_zone) {
> > +			td_verror(td, EINVAL,
> > +				  "multi-jobs with different write ranges are "
> > +				  "not supported with zone_reset_threshold");
> > +			log_err("multi-jobs with different write ranges are "
> > +				"not supported with zone_reset_threshold\n");
> > +		}
> > +		return 0;
> > +	}
> > +
> > +	zbdi->wp_write_min_zone = f->min_zone;
> > +	zbdi->wp_write_max_zone = f->max_zone;
> 
> Up until here. (Basically the code added in patch 6/8.)
> Should probably only be called by zbd_setup_files().
> 
> Or do we need to perform the verification for every single file reset?
> 
> > +
> > +	zb = zbd_get_zone(f, f->min_zone);
> > +	ze = zbd_get_zone(f, f->max_zone);
> > +	for (z = zb; z < ze; z++)
> > +		if (z->has_wp)
> > +			wp_vdb += z->wp - z->start;
> > +
> > +	zbdi->wp_valid_data_bytes = wp_vdb;
> > +
> > +	return wp_vdb;
> > +}
> > +
> >  int zbd_setup_files(struct thread_data *td)
> >  {
> >  	struct fio_file *f;
> > @@ -1099,6 +1137,7 @@ int zbd_setup_files(struct thread_data *td)
> >  		struct zoned_block_device_info *zbd = f->zbd_info;
> >  		struct fio_zone_info *z;
> >  		int zi;
> > +		uint64_t vdb;
> >  
> >  		assert(zbd);
> >  
> > @@ -1106,6 +1145,11 @@ int zbd_setup_files(struct thread_data *td)
> >  		f->max_zone =
> >  			zbd_offset_to_zone_idx(f, f->file_offset + f->io_size);
> >  
> > +		vdb = zbd_set_vdb(td, f);
> > +
> > +		dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
> > +		       __func__, f->file_name, vdb);
> > +
> >  		/*
> >  		 * When all zones in the I/O range are conventional, io_size
> >  		 * can be smaller than zone size, making min_zone the same
> > @@ -1197,54 +1241,9 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
> >  	return write_cnt == 0;
> >  }
> >  
> > -static uint64_t zbd_set_vdb(struct thread_data *td, const struct fio_file *f)
> > -{
> > -	struct fio_zone_info *zb, *ze, *z;
> > -	uint64_t wp_vdb = 0;
> > -	struct zoned_block_device_info *zbdi = f->zbd_info;
> > -
> > -	if (!accounting_vdb(td, f))
> > -		return 0;
> > -
> > -	if (zbdi->wp_write_min_zone != zbdi->wp_write_max_zone) {
> > -		if (zbdi->wp_write_min_zone != f->min_zone ||
> > -		    zbdi->wp_write_max_zone != f->max_zone) {
> > -			td_verror(td, EINVAL,
> > -				  "multi-jobs with different write ranges are "
> > -				  "not supported with zone_reset_threshold");
> > -			log_err("multi-jobs with different write ranges are "
> > -				"not supported with zone_reset_threshold\n");
> > -		}
> > -		return 0;
> > -	}
> > -
> > -	zbdi->wp_write_min_zone = f->min_zone;
> > -	zbdi->wp_write_max_zone = f->max_zone;
> > -
> > -	zb = zbd_get_zone(f, f->min_zone);
> > -	ze = zbd_get_zone(f, f->max_zone);
> > -	for (z = zb; z < ze; z++) {
> > -		if (z->has_wp) {
> > -			zone_lock(td, f, z);
> > -			wp_vdb += z->wp - z->start;
> > -		}
> > -	}
> > -
> > -	pthread_mutex_lock(&zbdi->mutex);
> > -	zbdi->wp_valid_data_bytes = wp_vdb;
> > -	pthread_mutex_unlock(&zbdi->mutex);
> > -
> > -	for (z = zb; z < ze; z++)
> > -		if (z->has_wp)
> > -			zone_unlock(z);
> > -
> > -	return wp_vdb;
> > -}
> > -
> >  void zbd_file_reset(struct thread_data *td, struct fio_file *f)
> >  {
> >  	struct fio_zone_info *zb, *ze;
> > -	uint64_t vdb;
> >  	bool verify_data_left = false;
> >  
> >  	if (!f->zbd_info || !td_write(td))
> > @@ -1252,10 +1251,6 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
> >  
> >  	zb = zbd_get_zone(f, f->min_zone);
> >  	ze = zbd_get_zone(f, f->max_zone);
> > -	vdb = zbd_set_vdb(td, f);

The zbd_set_vbd() call in zbd_file_reset() is removed here. Then the check is
done only at zbd_setup_files() as you expect.

> > -
> > -	dprint(FD_ZBD, "%s(%s): valid data bytes = %" PRIu64 "\n",
> > -	       __func__, f->file_name, vdb);
> >  
> >  	/*
> >  	 * If data verification is enabled reset the affected zones before
> > -- 
> > 2.38.1
> > 

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up
  2023-02-08  5:03     ` Shinichiro Kawasaki
@ 2023-02-08  8:30       ` Niklas Cassel
  0 siblings, 0 replies; 20+ messages in thread
From: Niklas Cassel @ 2023-02-08  8:30 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: fio, Jens Axboe, Vincent Fu, Damien Le Moal, Dmitry Fomichev

On Wed, Feb 08, 2023 at 05:03:30AM +0000, Shinichiro Kawasaki wrote:
> On Feb 07, 2023 / 14:06, Niklas Cassel wrote:
> > On Tue, Feb 07, 2023 at 03:37:38PM +0900, Shin'ichiro Kawasaki wrote:
> > > The valid data bytes accounting field is initialized at file reset,
> > > after each job started. Each job locks zones to check write pointer
> > > positions of its write target zones. This can cause zone lock contention
> > > with write by other jobs.
> > >
> > > To avoid the zone lock contention, move the initialization from file
> > > reset to file set up before job start. It allows to access the write
> > > pointers and the accounting field without locks. Remove the lock and
> > > unlock codes which are no longer required. Ensure the locks are not
> > > required by checking run-state in the struct thread_data.
> > >
> > > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > > ---

(snip)

> > > @@ -1252,10 +1251,6 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f)
> > >
> > >	zb = zbd_get_zone(f, f->min_zone);
> > >	ze = zbd_get_zone(f, f->max_zone);
> > > -	vdb = zbd_set_vdb(td, f);
>
> The zbd_set_vbd() call in zbd_file_reset() is removed here. Then the check is
> done only at zbd_setup_files() as you expect.

Sorry, I missed that you also removed the only other place that called this
function.

zbd_setup_files() calls functions that are named e.g.:
	if (!zbd_verify_sizes())
		return 1;

	if (!zbd_verify_bs())
		return 1;

Perhaps consider renaming this function to something like
zbd_verify_and_set_vbd(), since this function is now only
called from zbd_setup_files().


Regardless if you rename or not:
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-02-08  8:31 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-07  6:37 [PATCH v2 0/8] zbd: fix 'sectors with data' and zone_reset_threshold accounting issues Shin'ichiro Kawasaki
2023-02-07  6:37 ` [PATCH v2 1/8] zbd: refer file->last_start[] instead of sectors with data accounting Shin'ichiro Kawasaki
2023-02-07 14:05   ` Niklas Cassel
2023-02-07  6:37 ` [PATCH v2 2/8] zbd: remove CHECK_SWD feature Shin'ichiro Kawasaki
2023-02-07 14:05   ` Niklas Cassel
2023-02-07  6:37 ` [PATCH v2 3/8] zbd: rename the accounting 'sectors with data' to 'valid data bytes' Shin'ichiro Kawasaki
2023-02-07 14:06   ` Niklas Cassel
2023-02-07  6:37 ` [PATCH v2 4/8] doc: fix unit of zone_reset_threshold and relation to other option Shin'ichiro Kawasaki
2023-02-07 14:06   ` Niklas Cassel
2023-02-07  6:37 ` [PATCH v2 5/8] zbd: account valid data bytes only for zone_reset_threshold option Shin'ichiro Kawasaki
2023-02-07 14:06   ` Niklas Cassel
2023-02-07  6:37 ` [PATCH v2 6/8] zbd: check write ranges " Shin'ichiro Kawasaki
2023-02-07 14:06   ` Niklas Cassel
2023-02-08  4:59     ` Shinichiro Kawasaki
2023-02-07  6:37 ` [PATCH v2 7/8] zbd: initialize valid data bytes accounting at file set up Shin'ichiro Kawasaki
2023-02-07 14:06   ` Niklas Cassel
2023-02-08  5:03     ` Shinichiro Kawasaki
2023-02-08  8:30       ` Niklas Cassel
2023-02-07  6:37 ` [PATCH v2 8/8] t/zbd: add test cases for zone_reset_threshold option Shin'ichiro Kawasaki
2023-02-07 14:06   ` Niklas Cassel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).