* [PATCH 1/9] verify: decouple seed generation from buffer fill
@ 2020-05-05 17:56 Alexey Dobriyan
  2020-05-05 17:56 ` [PATCH 2/9] zbd: bump ZBD_MAX_OPEN_ZONES Alexey Dobriyan
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

It is nicer this way, and there will be more code in this area
with ZBD verification.
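
For context, here is the seed-generation idiom being moved, as a minimal
standalone sketch. rand32() is a hypothetical stand-in for fio's
__rand(); the sizeof comparison is how fio distinguishes 64-bit builds:

	#include <stdint.h>
	#include <stdlib.h>

	/* Hypothetical stand-in for fio's __rand(). */
	static uint32_t rand32(void)
	{
		return (uint32_t)rand();
	}

	static uint64_t gen_verify_seed(void)
	{
		uint64_t seed = rand32();

		/*
		 * On 64-bit builds (pointers wider than int), fold in a
		 * second 32-bit draw so all 64 seed bits are random.
		 */
		if (sizeof(int) != sizeof(long *))
			seed *= (uint64_t)rand32();
		return seed;
	}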

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
---
 backend.c |  6 ------
 verify.c  | 20 +++++++-------------
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/backend.c b/backend.c
index feb34e51..452975cf 100644
--- a/backend.c
+++ b/backend.c
@@ -1006,12 +1006,6 @@ static void do_io(struct thread_data *td, uint64_t *bytes_done)
 		if (td->o.verify != VERIFY_NONE && io_u->ddir == DDIR_READ &&
 		    ((io_u->flags & IO_U_F_VER_LIST) || !td_rw(td))) {
 
-			if (!td->o.verify_pattern_bytes) {
-				io_u->rand_seed = __rand(&td->verify_state);
-				if (sizeof(int) != sizeof(long *))
-					io_u->rand_seed *= __rand(&td->verify_state);
-			}
-
 			if (verify_state_should_stop(td, io_u)) {
 				put_io_u(td, io_u);
 				break;
diff --git a/verify.c b/verify.c
index cf299ebf..b7fa6693 100644
--- a/verify.c
+++ b/verify.c
@@ -46,15 +46,6 @@ static void __fill_buffer(struct thread_options *o, uint64_t seed, void *p,
 	__fill_random_buf_percentage(seed, p, o->compress_percentage, len, len, o->buffer_pattern, o->buffer_pattern_bytes);
 }
 
-static uint64_t fill_buffer(struct thread_data *td, void *p,
-			    unsigned int len)
-{
-	struct frand_state *fs = &td->verify_state;
-	struct thread_options *o = &td->o;
-
-	return fill_random_buf_percentage(fs, p, o->compress_percentage, len, len, o->buffer_pattern, o->buffer_pattern_bytes);
-}
-
 void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
 			 struct io_u *io_u, uint64_t seed, int use_seed)
 {
@@ -63,10 +54,13 @@ void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
 	if (!o->verify_pattern_bytes) {
 		dprint(FD_VERIFY, "fill random bytes len=%u\n", len);
 
-		if (use_seed)
-			__fill_buffer(o, seed, p, len);
-		else
-			io_u->rand_seed = fill_buffer(td, p, len);
+		if (!use_seed) {
+			seed = __rand(&td->verify_state);
+			if (sizeof(int) != sizeof(long *))
+				seed *= (unsigned long)__rand(&td->verify_state);
+		}
+		io_u->rand_seed = seed;
+		__fill_buffer(o, seed, p, len);
 		return;
 	}
 
-- 
2.26.2




* [PATCH 2/9] zbd: bump ZBD_MAX_OPEN_ZONES
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-05 17:56 ` [PATCH 3/9] zbd: don't lock zones outside working area Alexey Dobriyan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

128 open zones is not enough for us!

4096 open zones is enough for 64-job, iodepth=64 stress testing.

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 zbd_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zbd_types.h b/zbd_types.h
index 2f2f1324..d63c0d0a 100644
--- a/zbd_types.h
+++ b/zbd_types.h
@@ -8,7 +8,7 @@
 
 #include <inttypes.h>
 
-#define ZBD_MAX_OPEN_ZONES	128
+#define ZBD_MAX_OPEN_ZONES	4096
 
 /*
  * Zoned block device models.
-- 
2.26.2




* [PATCH 3/9] zbd: don't lock zones outside working area
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
  2020-05-05 17:56 ` [PATCH 2/9] zbd: bump ZBD_MAX_OPEN_ZONES Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-05 17:56 ` [PATCH 4/9] zbd: introduce per job maximum open zones limit Alexey Dobriyan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

Currently, threads lock each other's zones even if their working areas,
as defined by [f->file_offset, f->file_offset + f->io_size), don't
intersect. This leads to unnecessary quiescing.

This patch clamps every zone index to [->min_zone, ->max_zone) when
searching for an open zone and, more importantly, adds an assert so that
any unnecessary zone locking becomes immediately visible.
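
A minimal sketch of the clamping added to zbd_convert_to_open_zone();
clamp_zone_idx() is an illustrative helper name, not a function from
the patch:

	#include <assert.h>
	#include <stdint.h>

	/*
	 * Keep a candidate zone index inside the job's working area
	 * [min_zone, max_zone). The assert mirrors the new invariant
	 * that a thread never locks zones outside this range.
	 */
	static uint32_t clamp_zone_idx(uint32_t zone_idx, uint32_t min_zone,
				       uint32_t max_zone)
	{
		assert(min_zone < max_zone);
		if (zone_idx < min_zone)
			return min_zone;
		if (zone_idx >= max_zone)
			return max_zone - 1;
		return zone_idx;
	}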

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 file.h |  3 +++
 zbd.c  | 40 ++++++++++++++++++++++++++--------------
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/file.h b/file.h
index ae0e6fc8..375bbfd3 100644
--- a/file.h
+++ b/file.h
@@ -104,6 +104,9 @@ struct fio_file {
 	 * Zoned block device information. See also zonemode=zbd.
 	 */
 	struct zoned_block_device_info *zbd_info;
+	/* zonemode=zbd working area */
+	uint32_t min_zone;	/* inclusive */
+	uint32_t max_zone;	/* exclusive */
 
 	/*
 	 * Track last end and last start of IO for a given data direction
diff --git a/zbd.c b/zbd.c
index 8dc3c397..76771f2e 100644
--- a/zbd.c
+++ b/zbd.c
@@ -156,8 +156,14 @@ static bool zbd_zone_full(const struct fio_file *f, struct fio_zone_info *z,
 		z->wp + required > z->start + f->zbd_info->zone_size;
 }
 
-static void zone_lock(struct thread_data *td, struct fio_zone_info *z)
+static void zone_lock(struct thread_data *td, struct fio_file *f, struct fio_zone_info *z)
 {
+	struct zoned_block_device_info *zbd = f->zbd_info;
+	uint32_t nz = z - zbd->zone_info;
+
+	/* A thread should never lock zones outside its working area. */
+	assert(f->min_zone <= nz && nz < f->max_zone);
+
 	/*
 	 * Lock the io_u target zone. The zone will be unlocked if io_u offset
 	 * is changed or when io_u completes and zbd_put_io() executed.
@@ -289,6 +295,9 @@ static bool zbd_verify_sizes(void)
 					 (unsigned long long) new_end - f->file_offset);
 				f->io_size = new_end - f->file_offset;
 			}
+
+			f->min_zone = zbd_zone_idx(f, f->file_offset);
+			f->max_zone = zbd_zone_idx(f, f->file_offset + f->io_size);
 		}
 	}
 
@@ -730,7 +739,7 @@ static int zbd_reset_zones(struct thread_data *td, struct fio_file *f,
 
 		if (!zbd_zone_swr(z))
 			continue;
-		zone_lock(td, z);
+		zone_lock(td, f, z);
 		if (all_zones) {
 			unsigned int i;
 
@@ -854,14 +863,12 @@ static void zbd_init_swd(struct fio_file *f)
 void zbd_file_reset(struct thread_data *td, struct fio_file *f)
 {
 	struct fio_zone_info *zb, *ze;
-	uint32_t zone_idx_e;
 
 	if (!f->zbd_info)
 		return;
 
-	zb = &f->zbd_info->zone_info[zbd_zone_idx(f, f->file_offset)];
-	zone_idx_e = zbd_zone_idx(f, f->file_offset + f->io_size);
-	ze = &f->zbd_info->zone_info[zone_idx_e];
+	zb = &f->zbd_info->zone_info[f->min_zone];
+	ze = &f->zbd_info->zone_info[f->max_zone];
 	zbd_init_swd(f);
 	/*
 	 * If data verification is enabled reset the affected zones before
@@ -953,7 +960,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
 						      struct io_u *io_u)
 {
 	const uint32_t min_bs = td->o.min_bs[io_u->ddir];
-	const struct fio_file *f = io_u->file;
+	struct fio_file *f = io_u->file;
 	struct fio_zone_info *z;
 	unsigned int open_zone_idx = -1;
 	uint32_t zone_idx, new_zone_idx;
@@ -970,6 +977,10 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
 	} else {
 		zone_idx = zbd_zone_idx(f, io_u->offset);
 	}
+	if (zone_idx < f->min_zone)
+		zone_idx = f->min_zone;
+	else if (zone_idx >= f->max_zone)
+		zone_idx = f->max_zone - 1;
 	dprint(FD_ZBD, "%s(%s): starting from zone %d (offset %lld, buflen %lld)\n",
 	       __func__, f->file_name, zone_idx, io_u->offset, io_u->buflen);
 
@@ -984,7 +995,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
 
 		z = &f->zbd_info->zone_info[zone_idx];
 
-		zone_lock(td, z);
+		zone_lock(td, f, z);
 		pthread_mutex_lock(&f->zbd_info->mutex);
 		if (td->o.max_open_zones == 0)
 			goto examine_zone;
@@ -1010,8 +1021,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
 			if (tmp_idx >= f->zbd_info->num_open_zones)
 				tmp_idx = 0;
 			tmpz = f->zbd_info->open_zones[tmp_idx];
-
-			if (is_valid_offset(f, f->zbd_info->zone_info[tmpz].start)) {
+			if (f->min_zone <= tmpz && tmpz < f->max_zone) {
 				open_zone_idx = tmp_idx;
 				goto found_candidate_zone;
 			}
@@ -1056,11 +1066,11 @@ examine_zone:
 		z++;
 		if (!is_valid_offset(f, z->start)) {
 			/* Wrap-around. */
-			zone_idx = zbd_zone_idx(f, f->file_offset);
+			zone_idx = f->min_zone;
 			z = &f->zbd_info->zone_info[zone_idx];
 		}
 		assert(is_valid_offset(f, z->start));
-		zone_lock(td, z);
+		zone_lock(td, f, z);
 		if (z->open)
 			continue;
 		if (zbd_open_zone(td, io_u, zone_idx))
@@ -1073,12 +1083,14 @@ examine_zone:
 	pthread_mutex_lock(&f->zbd_info->mutex);
 	for (i = 0; i < f->zbd_info->num_open_zones; i++) {
 		zone_idx = f->zbd_info->open_zones[i];
+		if (zone_idx < f->min_zone || zone_idx >= f->max_zone)
+			continue;
 		pthread_mutex_unlock(&f->zbd_info->mutex);
 		pthread_mutex_unlock(&z->mutex);
 
 		z = &f->zbd_info->zone_info[zone_idx];
 
-		zone_lock(td, z);
+		zone_lock(td, f, z);
 		if (z->wp + min_bs <= (z+1)->start)
 			goto out;
 		pthread_mutex_lock(&f->zbd_info->mutex);
@@ -1404,7 +1416,7 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
 
 	zbd_check_swd(f);
 
-	zone_lock(td, zb);
+	zone_lock(td, f, zb);
 
 	switch (io_u->ddir) {
 	case DDIR_READ:
-- 
2.26.2




* [PATCH 4/9] zbd: introduce per job maximum open zones limit
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
  2020-05-05 17:56 ` [PATCH 2/9] zbd: bump ZBD_MAX_OPEN_ZONES Alexey Dobriyan
  2020-05-05 17:56 ` [PATCH 3/9] zbd: don't lock zones outside working area Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-11  1:09   ` Damien Le Moal
  2020-05-05 17:56 ` [PATCH 5/9] zbd: make zbd_info->mutex non-recursive Alexey Dobriyan
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

It is not possible to maintain a sustained per-thread iodepth in ZBD
mode. The way the code is written, "max_open_zones" acts as a global
limit, and once one or a few threads open all "max_open_zones" zones,
other threads can't open anything and _exit_ prematurely.

The following config is now guaranteed to produce an equal number of
zone resets and I/Os per thread: each thread generates an identical
pattern and doesn't intersect with the others:

	zonemode=zbd
	zonesize=...
	rw=write

	numjobs=N
	offset_increment=M*zonesize

	[j]
	size=M*zonesize

This patch introduces "job_max_open_zones", a per-thread/process limit.
"max_open_zones" remains the per-file/device limit. Both limits are
checked on every zone open, so one thread can't kick out the others.
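
The two-level check can be sketched as follows; the struct and field
names here are illustrative, not fio's exact data structures:

	#include <stdbool.h>

	struct open_zone_limits {
		unsigned int td_open;	/* zones this thread has open */
		unsigned int td_max;	/* job_max_open_zones, 0 = unlimited */
		unsigned int f_open;	/* zones open on this file/device */
		unsigned int f_max;	/* max_open_zones */
	};

	static bool may_open_zone(const struct open_zone_limits *l)
	{
		/* Per-thread limit; zero means no limit. */
		if (l->td_max > 0 && l->td_open >= l->td_max)
			return false;
		/* Per-file/device limit. */
		return l->f_open < l->f_max;
	}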

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
---
 fio.1            |  6 +++++-
 fio.h            |  1 +
 options.c        | 16 +++++++++++++---
 thread_options.h |  1 +
 zbd.c            | 49 +++++++++++++++++++++++++++++++++++++-----------
 zbd.h            |  3 +++
 6 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/fio.1 b/fio.1
index a2379f98..c0f2e7cf 100644
--- a/fio.1
+++ b/fio.1
@@ -804,7 +804,11 @@ so. Default: false.
 When running a random write test across an entire drive many more zones will be
 open than in a typical application workload. Hence this command line option
 that allows to limit the number of open zones. The number of open zones is
-defined as the number of zones to which write commands are issued.
+defined as the number of zones to which write commands are issued by all
+threads/processes.
+.TP
+.BI job_max_open_zones \fR=\fPint
+Limit on the number of simultaneously opened zones per single thread/process.
 .TP
 .BI zone_reset_threshold \fR=\fPfloat
 A number between zero and one that indicates the ratio of logical blocks with
diff --git a/fio.h b/fio.h
index bbf057c1..20ca80e2 100644
--- a/fio.h
+++ b/fio.h
@@ -260,6 +260,7 @@ struct thread_data {
 	struct frand_state prio_state;
 
 	struct zone_split_index **zone_state_index;
+	unsigned int num_open_zones;
 
 	unsigned int verify_batch;
 	unsigned int trim_batch;
diff --git a/options.c b/options.c
index 2372c042..47b3b765 100644
--- a/options.c
+++ b/options.c
@@ -3360,12 +3360,22 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
 	},
 	{
 		.name	= "max_open_zones",
-		.lname	= "Maximum number of open zones",
+		.lname	= "Global maximum number of open zones",
 		.type	= FIO_OPT_INT,
 		.off1	= offsetof(struct thread_options, max_open_zones),
 		.maxval	= ZBD_MAX_OPEN_ZONES,
-		.help	= "Limit random writes to SMR drives to the specified"
-			  " number of sequential zones",
+		.help	= "Limit on the number of simultaneously opened sequential write zones in SMR/ZNS drives",
+		.def	= "0",
+		.category = FIO_OPT_C_IO,
+		.group	= FIO_OPT_G_INVALID,
+	},
+	{
+		.name	= "job_max_open_zones",
+		.lname	= "Job maximum number of open zones",
+		.type	= FIO_OPT_INT,
+		.off1	= offsetof(struct thread_options, job_max_open_zones),
+		.maxval	= ZBD_MAX_OPEN_ZONES,
+		.help	= "Limit on the number of simultaneously opened sequential write zones in SMR/ZNS drives by one thread/process",
 		.def	= "0",
 		.category = FIO_OPT_C_IO,
 		.group	= FIO_OPT_G_INVALID,
diff --git a/thread_options.h b/thread_options.h
index c78ed43d..b0b493e4 100644
--- a/thread_options.h
+++ b/thread_options.h
@@ -342,6 +342,7 @@ struct thread_options {
 	/* Parameters that affect zonemode=zbd */
 	unsigned int read_beyond_wp;
 	int max_open_zones;
+	unsigned int job_max_open_zones;
 	fio_fp64_t zrt;
 	fio_fp64_t zrf;
 };
diff --git a/zbd.c b/zbd.c
index 76771f2e..b34ceb34 100644
--- a/zbd.c
+++ b/zbd.c
@@ -546,8 +546,10 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
 		return -EINVAL;
 	}
 
-	if (ret == 0)
+	if (ret == 0) {
 		f->zbd_info->model = zbd_model;
+		f->zbd_info->max_open_zones = td->o.max_open_zones;
+	}
 	return ret;
 }
 
@@ -622,6 +624,27 @@ int zbd_init(struct thread_data *td)
 	if (!zbd_verify_bs())
 		return 1;
 
+	for_each_file(td, f, i) {
+		struct zoned_block_device_info *zbd = f->zbd_info;
+
+		if (!zbd)
+			continue;
+
+		if (zbd->max_open_zones == 0) {
+			zbd->max_open_zones = ZBD_MAX_OPEN_ZONES;
+		}
+
+		if (td->o.max_open_zones > 0 &&
+		    zbd->max_open_zones != td->o.max_open_zones) {
+			log_err("Different 'max_open_zones' values\n");
+			return 1;
+		}
+		if (zbd->max_open_zones > ZBD_MAX_OPEN_ZONES) {
+			log_err("'max_open_zones' value is limited by %u\n", ZBD_MAX_OPEN_ZONES);
+			return 1;
+		}
+	}
+
 	return 0;
 }
 
@@ -709,6 +732,7 @@ static void zbd_close_zone(struct thread_data *td, const struct fio_file *f,
 		(ZBD_MAX_OPEN_ZONES - (open_zone_idx + 1)) *
 		sizeof(f->zbd_info->open_zones[0]));
 	f->zbd_info->num_open_zones--;
+	td->num_open_zones--;
 	f->zbd_info->zone_info[zone_idx].open = 0;
 }
 
@@ -888,8 +912,10 @@ static bool is_zone_open(const struct thread_data *td, const struct fio_file *f,
 	struct zoned_block_device_info *zbdi = f->zbd_info;
 	int i;
 
-	assert(td->o.max_open_zones <= ARRAY_SIZE(zbdi->open_zones));
-	assert(zbdi->num_open_zones <= td->o.max_open_zones);
+	assert(td->o.job_max_open_zones == 0 || td->num_open_zones <= td->o.job_max_open_zones);
+
+	assert(td->o.job_max_open_zones <= zbdi->max_open_zones);
+	assert(zbdi->num_open_zones <= zbdi->max_open_zones);
 
 	for (i = 0; i < zbdi->num_open_zones; i++)
 		if (zbdi->open_zones[i] == zone_idx)
@@ -922,18 +948,19 @@ static bool zbd_open_zone(struct thread_data *td, const struct io_u *io_u,
 	if (td->o.verify != VERIFY_NONE && zbd_zone_full(f, z, min_bs))
 		return false;
 
-	/* Zero means no limit */
-	if (!td->o.max_open_zones)
-		return true;
-
 	pthread_mutex_lock(&f->zbd_info->mutex);
 	if (is_zone_open(td, f, zone_idx))
 		goto out;
 	res = false;
-	if (f->zbd_info->num_open_zones >= td->o.max_open_zones)
+	/* Zero means no limit */
+	if (td->o.job_max_open_zones > 0 &&
+	    td->num_open_zones >= td->o.job_max_open_zones)
+		goto out;
+	if (f->zbd_info->num_open_zones >= f->zbd_info->max_open_zones)
 		goto out;
 	dprint(FD_ZBD, "%s: opening zone %d\n", f->file_name, zone_idx);
 	f->zbd_info->open_zones[f->zbd_info->num_open_zones++] = zone_idx;
+	td->num_open_zones++;
 	z->open = 1;
 	res = true;
 
@@ -968,7 +995,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
 
 	assert(is_valid_offset(f, io_u->offset));
 
-	if (td->o.max_open_zones) {
+	if (td->o.job_max_open_zones) {
 		/*
 		 * This statement accesses f->zbd_info->open_zones[] on purpose
 		 * without locking.
@@ -997,7 +1024,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
 
 		zone_lock(td, f, z);
 		pthread_mutex_lock(&f->zbd_info->mutex);
-		if (td->o.max_open_zones == 0)
+		if (td->o.job_max_open_zones == 0)
 			goto examine_zone;
 		if (f->zbd_info->num_open_zones == 0) {
 			pthread_mutex_unlock(&f->zbd_info->mutex);
@@ -1053,7 +1080,7 @@ examine_zone:
 	}
 	dprint(FD_ZBD, "%s(%s): closing zone %d\n", __func__, f->file_name,
 	       zone_idx);
-	if (td->o.max_open_zones)
+	if (td->o.job_max_open_zones)
 		zbd_close_zone(td, f, open_zone_idx);
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 
diff --git a/zbd.h b/zbd.h
index 5a660399..fb39fb82 100644
--- a/zbd.h
+++ b/zbd.h
@@ -45,6 +45,8 @@ struct fio_zone_info {
 /**
  * zoned_block_device_info - zoned block device characteristics
  * @model: Device model.
+ * @max_open_zones: global limit on the number of simultaneously opened
+ *	sequential write zones.
  * @mutex: Protects the modifiable members in this structure (refcount and
  *		num_open_zones).
  * @zone_size: size of a single zone in units of 512 bytes
@@ -65,6 +67,7 @@ struct fio_zone_info {
  */
 struct zoned_block_device_info {
 	enum zbd_zoned_model	model;
+	uint32_t		max_open_zones;
 	pthread_mutex_t		mutex;
 	uint64_t		zone_size;
 	uint64_t		sectors_with_data;
-- 
2.26.2




* [PATCH 5/9] zbd: make zbd_info->mutex non-recursive
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
                   ` (2 preceding siblings ...)
  2020-05-05 17:56 ` [PATCH 4/9] zbd: introduce per job maximum open zones limit Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-11  1:11   ` Damien Le Moal
  2020-05-05 17:56 ` [PATCH 6/9] zbd: consolidate zone mutex initialisation Alexey Dobriyan
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

There is no reason for it to be recursive. Recursiveness leaked in
from the struct fio_zone_info::mutex initialisation.
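
The ordering that matters is sketched below, assuming the standard
pthread API: the attribute is consumed for zbd_info->mutex while still
of the default (non-recursive) type, and only then switched to
recursive for the per-zone mutexes:

	#include <pthread.h>

	static void init_zbd_mutexes(pthread_mutex_t *info_mutex,
				     pthread_mutex_t *zone_mutex)
	{
		pthread_mutexattr_t attr;

		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
		pthread_mutex_init(info_mutex, &attr);	/* non-recursive */
		pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
		pthread_mutex_init(zone_mutex, &attr);	/* recursive */
		pthread_mutexattr_destroy(&attr);
	}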

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
---
 zbd.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/zbd.c b/zbd.c
index b34ceb34..2926bae9 100644
--- a/zbd.c
+++ b/zbd.c
@@ -373,9 +373,9 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
 		return -ENOMEM;
 
 	pthread_mutexattr_init(&attr);
-	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	pthread_mutexattr_setpshared(&attr, true);
 	pthread_mutex_init(&zbd_info->mutex, &attr);
+	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	zbd_info->refcount = 1;
 	p = &zbd_info->zone_info[0];
 	for (i = 0; i < nr_zones; i++, p++) {
@@ -417,7 +417,6 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 	int i, j, ret = 0;
 
 	pthread_mutexattr_init(&attr);
-	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	pthread_mutexattr_setpshared(&attr, true);
 
 	zones = calloc(ZBD_REPORT_MAX_ZONES, sizeof(struct zbd_zone));
@@ -454,6 +453,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 	if (!zbd_info)
 		goto out;
 	pthread_mutex_init(&zbd_info->mutex, &attr);
+	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	zbd_info->refcount = 1;
 	p = &zbd_info->zone_info[0];
 	for (offset = 0, j = 0; j < nr_zones;) {
@@ -796,14 +796,20 @@ static int zbd_reset_zones(struct thread_data *td, struct fio_file *f,
  * Reset zbd_info.write_cnt, the counter that counts down towards the next
  * zone reset.
  */
-static void zbd_reset_write_cnt(const struct thread_data *td,
-				const struct fio_file *f)
+static void _zbd_reset_write_cnt(const struct thread_data *td,
+				 const struct fio_file *f)
 {
 	assert(0 <= td->o.zrf.u.f && td->o.zrf.u.f <= 1);
 
-	pthread_mutex_lock(&f->zbd_info->mutex);
 	f->zbd_info->write_cnt = td->o.zrf.u.f ?
 		min(1.0 / td->o.zrf.u.f, 0.0 + UINT_MAX) : UINT_MAX;
+}
+
+static void zbd_reset_write_cnt(const struct thread_data *td,
+				const struct fio_file *f)
+{
+	pthread_mutex_lock(&f->zbd_info->mutex);
+	_zbd_reset_write_cnt(td, f);
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 }
 
@@ -817,7 +823,7 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
 	if (f->zbd_info->write_cnt)
 		write_cnt = --f->zbd_info->write_cnt;
 	if (write_cnt == 0)
-		zbd_reset_write_cnt(td, f);
+		_zbd_reset_write_cnt(td, f);
 	pthread_mutex_unlock(&f->zbd_info->mutex);
 
 	return write_cnt == 0;
-- 
2.26.2




* [PATCH 6/9] zbd: consolidate zone mutex initialisation
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
                   ` (3 preceding siblings ...)
  2020-05-05 17:56 ` [PATCH 5/9] zbd: make zbd_info->mutex non-recursive Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-11  1:16   ` Damien Le Moal
  2020-05-05 17:56 ` [PATCH 7/9] fio: parse "io_size=1%" Alexey Dobriyan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

Initialise everything that is not write-pointer state (which comes from
the device or is faked) separately.

More state (ZBD-mode verification data and ->write_mutex) will be added
here later.

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
---
 zbd.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/zbd.c b/zbd.c
index 2926bae9..df46da42 100644
--- a/zbd.c
+++ b/zbd.c
@@ -351,7 +351,6 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
 	struct fio_zone_info *p;
 	uint64_t zone_size = td->o.zone_size;
 	struct zoned_block_device_info *zbd_info = NULL;
-	pthread_mutexattr_t attr;
 	int i;
 
 	if (zone_size == 0) {
@@ -372,14 +371,9 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
 	if (!zbd_info)
 		return -ENOMEM;
 
-	pthread_mutexattr_init(&attr);
-	pthread_mutexattr_setpshared(&attr, true);
-	pthread_mutex_init(&zbd_info->mutex, &attr);
-	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	zbd_info->refcount = 1;
 	p = &zbd_info->zone_info[0];
 	for (i = 0; i < nr_zones; i++, p++) {
-		pthread_mutex_init(&p->mutex, &attr);
 		p->start = i * zone_size;
 		p->wp = p->start + zone_size;
 		p->type = ZBD_ZONE_TYPE_SWR;
@@ -393,7 +387,6 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
 	f->zbd_info->zone_size_log2 = is_power_of_2(zone_size) ?
 		ilog2(zone_size) : 0;
 	f->zbd_info->nr_zones = nr_zones;
-	pthread_mutexattr_destroy(&attr);
 	return 0;
 }
 
@@ -413,12 +406,8 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 	struct fio_zone_info *p;
 	uint64_t zone_size, offset;
 	struct zoned_block_device_info *zbd_info = NULL;
-	pthread_mutexattr_t attr;
 	int i, j, ret = 0;
 
-	pthread_mutexattr_init(&attr);
-	pthread_mutexattr_setpshared(&attr, true);
-
 	zones = calloc(ZBD_REPORT_MAX_ZONES, sizeof(struct zbd_zone));
 	if (!zones)
 		goto out;
@@ -452,14 +441,11 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 	ret = -ENOMEM;
 	if (!zbd_info)
 		goto out;
-	pthread_mutex_init(&zbd_info->mutex, &attr);
-	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	zbd_info->refcount = 1;
 	p = &zbd_info->zone_info[0];
 	for (offset = 0, j = 0; j < nr_zones;) {
 		z = &zones[0];
 		for (i = 0; i < nrz; i++, j++, z++, p++) {
-			pthread_mutex_init(&p->mutex, &attr);
 			p->start = z->start;
 			switch (z->cond) {
 			case ZBD_ZONE_COND_NOT_WP:
@@ -510,7 +496,6 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
 out:
 	sfree(zbd_info);
 	free(zones);
-	pthread_mutexattr_destroy(&attr);
 	return ret;
 }
 
@@ -521,7 +506,9 @@ out:
  */
 static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
 {
+	struct zoned_block_device_info *zbd;
 	enum zbd_zoned_model zbd_model;
+	pthread_mutexattr_t attr;
 	int ret;
 
 	assert(td->o.zone_mode == ZONE_MODE_ZBD);
@@ -546,11 +533,25 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
 		return -EINVAL;
 	}
 
-	if (ret == 0) {
-		f->zbd_info->model = zbd_model;
-		f->zbd_info->max_open_zones = td->o.max_open_zones;
+	if (ret)
+		return ret;
+
+	zbd = f->zbd_info;
+	zbd->model = zbd_model;
+	zbd->max_open_zones = td->o.max_open_zones;
+
+	pthread_mutexattr_init(&attr);
+	pthread_mutexattr_setpshared(&attr, true);
+	pthread_mutex_init(&zbd->mutex, &attr);
+	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
+	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
+		struct fio_zone_info *zi = &zbd->zone_info[z];
+
+		pthread_mutex_init(&zi->mutex, &attr);
 	}
-	return ret;
+	pthread_mutexattr_destroy(&attr);
+
+	return 0;
 }
 
 void zbd_free_zone_info(struct fio_file *f)
-- 
2.26.2




* [PATCH 7/9] fio: parse "io_size=1%"
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
                   ` (4 preceding siblings ...)
  2020-05-05 17:56 ` [PATCH 6/9] zbd: consolidate zone mutex initialisation Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-11  1:20   ` Damien Le Moal
  2020-05-05 17:56 ` [PATCH 8/9] zbd: support verification Alexey Dobriyan
  2020-05-11  0:56 ` [PATCH 1/9] verify: decouple seed generation from buffer fill Damien Le Moal
  7 siblings, 1 reply; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

The following config:

	size=1%
	io_size=1%

will generate an essentially infinite loop, because io_size is set to
(2^64-1)-1.

Parse io_size as a percentage to avoid the bug.

Note: if the size and io_size percentages differ, io_size is ignored,
but that is a separate bug.
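
For reference, fio represents "N%" values of size-style options as huge
integers near -1ULL so they cannot collide with real byte counts. A
simplified model of that encoding, assuming percentages never exceed
100:

	#include <stdint.h>

	static uint64_t encode_percent(unsigned int pct)
	{
		return -1ULL - pct;	/* "1%" -> (2^64-1)-1 */
	}

	static int is_percent(uint64_t v)
	{
		return v >= -1ULL - 100;
	}

	static unsigned int decode_percent(uint64_t v)
	{
		return (unsigned int)(-1ULL - v);
	}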

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
---
 cconv.c          |  2 ++
 options.c        | 17 +++++++++++++++++
 server.h         |  2 +-
 thread_options.h |  4 ++++
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/cconv.c b/cconv.c
index 48218dc4..48b0db9e 100644
--- a/cconv.c
+++ b/cconv.c
@@ -102,6 +102,7 @@ void convert_thread_options_to_cpu(struct thread_options *o,
 	o->size = le64_to_cpu(top->size);
 	o->io_size = le64_to_cpu(top->io_size);
 	o->size_percent = le32_to_cpu(top->size_percent);
+	o->io_size_percent = le32_to_cpu(top->io_size_percent);
 	o->fill_device = le32_to_cpu(top->fill_device);
 	o->file_append = le32_to_cpu(top->file_append);
 	o->file_size_low = le64_to_cpu(top->file_size_low);
@@ -366,6 +367,7 @@ void convert_thread_options_to_net(struct thread_options_pack *top,
 	top->iodepth_batch_complete_max = cpu_to_le32(o->iodepth_batch_complete_max);
 	top->serialize_overlap = cpu_to_le32(o->serialize_overlap);
 	top->size_percent = cpu_to_le32(o->size_percent);
+	top->io_size_percent = cpu_to_le32(o->io_size_percent);
 	top->fill_device = cpu_to_le32(o->fill_device);
 	top->file_append = cpu_to_le32(o->file_append);
 	top->ratecycle = cpu_to_le32(o->ratecycle);
diff --git a/options.c b/options.c
index 47b3b765..3d0c3a84 100644
--- a/options.c
+++ b/options.c
@@ -1475,6 +1475,22 @@ static int str_size_cb(void *data, unsigned long long *__val)
 	return 0;
 }
 
+static int str_io_size_cb(void *data, unsigned long long *__val)
+{
+	struct thread_data *td = cb_data_to_td(data);
+	unsigned long long v = *__val;
+
+	if (parse_is_percent(v)) {
+		td->o.io_size = 0;
+		td->o.io_size_percent = -1ULL - v;
+		dprint(FD_PARSE, "SET io_size_percent %d\n",
+					td->o.io_size_percent);
+	} else
+		td->o.io_size = v;
+
+	return 0;
+}
+
 static int str_write_bw_log_cb(void *data, const char *str)
 {
 	struct thread_data *td = cb_data_to_td(data);
@@ -2042,6 +2058,7 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
 		.alias	= "io_limit",
 		.lname	= "IO Size",
 		.type	= FIO_OPT_STR_VAL,
+		.cb	= str_io_size_cb,
 		.off1	= offsetof(struct thread_options, io_size),
 		.help	= "Total size of I/O to be performed",
 		.interval = 1024 * 1024,
diff --git a/server.h b/server.h
index 279b6917..de01a5c8 100644
--- a/server.h
+++ b/server.h
@@ -48,7 +48,7 @@ struct fio_net_cmd_reply {
 };
 
 enum {
-	FIO_SERVER_VER			= 82,
+	FIO_SERVER_VER			= 83,
 
 	FIO_SERVER_MAX_FRAGMENT_PDU	= 1024,
 	FIO_SERVER_MAX_CMD_MB		= 2048,
diff --git a/thread_options.h b/thread_options.h
index b0b493e4..4979ed79 100644
--- a/thread_options.h
+++ b/thread_options.h
@@ -83,6 +83,7 @@ struct thread_options {
 	unsigned long long size;
 	unsigned long long io_size;
 	unsigned int size_percent;
+	unsigned int io_size_percent;
 	unsigned int fill_device;
 	unsigned int file_append;
 	unsigned long long file_size_low;
@@ -377,6 +378,7 @@ struct thread_options_pack {
 	uint64_t size;
 	uint64_t io_size;
 	uint32_t size_percent;
+	uint32_t io_size_percent;
 	uint32_t fill_device;
 	uint32_t file_append;
 	uint32_t unique_filename;
@@ -456,6 +458,8 @@ struct thread_options_pack {
 	struct zone_split zone_split[DDIR_RWDIR_CNT][ZONESPLIT_MAX];
 	uint32_t zone_split_nr[DDIR_RWDIR_CNT];
 
+	uint8_t pad1[4];
+
 	fio_fp64_t zipf_theta;
 	fio_fp64_t pareto_h;
 	fio_fp64_t gauss_dev;
-- 
2.26.2




* [PATCH 8/9] zbd: support verification
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
                   ` (5 preceding siblings ...)
  2020-05-05 17:56 ` [PATCH 7/9] fio: parse "io_size=1%" Alexey Dobriyan
@ 2020-05-05 17:56 ` Alexey Dobriyan
  2020-05-05 17:59   ` Alexey Dobriyan
  2020-05-11  1:59   ` Damien Le Moal
  2020-05-11  0:56 ` [PATCH 1/9] verify: decouple seed generation from buffer fill Damien Le Moal
  7 siblings, 2 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:56 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal, Alexey Dobriyan

Verification is not supported in ZBD mode, which is codified in this
assert:

	if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
		assert(td->o.verify == VERIFY_NONE);

However, the sequential-write property is excellent for actually doing
verification if the write list and seeds are maintained per zone.
It will work automatically with
* overlapping jobs
* zone resets in the middle of a job
* different write block sizes

This also paves the way for "verify before zone reset" and
"verify zeroes after wp" features.

Introduce
*) per-zone seed,
	incremented with each reset, so that patterns differ
*) per-zone random generator state,
	reseeded with each zone reset
*) per-zone write I/O list
	a linked list is natural for sequential writes
*) per-zone verify_mutex
	This will definitely be used for verify-zeroes-after-wp.
	Currently it is more of a peace-of-mind member; otherwise list
	corruption asserts trigger, IIRC.
	TODO: explain why it is needed.

Delete ->verify_block -- it is now obsolete.

There are also some funny things going on with flushing and resetting
files in the middle of I/O, but the non-overlapping case more or less
works.
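
The per-zone seed lifecycle can be sketched as follows; this is a
trimmed-down model of the new fields, with init_rand_state() standing
in for fio's init_rand_seed():

	#include <stdint.h>

	struct zone_verify {		/* subset of struct fio_zone_info */
		uint64_t seed;		/* bumped on every zone reset */
		uint64_t rand_state;	/* models struct frand_state */
	};

	static void init_rand_state(uint64_t *state, uint64_t seed)
	{
		*state = seed;	/* the real code seeds a Tausworthe PRNG */
	}

	/*
	 * On zone reset the old data is gone, so the next write pass must
	 * use a different pattern: bump the seed and reseed the generator.
	 * The zone's pending write list is drained as "lost" data.
	 */
	static void zone_reset_verify(struct zone_verify *z)
	{
		z->seed++;
		init_rand_state(&z->rand_state, z->seed);
	}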

Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
---
 backend.c | 44 +++++++++++++++++++++++++++++---
 fio.h     |  2 ++
 init.c    |  1 +
 io_u.c    |  2 +-
 iolog.c   | 24 ++++++++++++------
 iolog.h   |  1 +
 verify.c  | 45 ++++++++++++++++++++++++++-------
 verify.h  |  4 ++-
 zbd.c     | 75 ++++++++++++++++++++++++++++---------------------------
 zbd.h     | 20 +++++++++++++--
 10 files changed, 158 insertions(+), 60 deletions(-)

diff --git a/backend.c b/backend.c
index 452975cf..05ca5dc1 100644
--- a/backend.c
+++ b/backend.c
@@ -48,6 +48,7 @@
 #include "rate-submit.h"
 #include "helper_thread.h"
 #include "pshared.h"
+#include "zbd.h"
 #include "zone-dist.h"
 
 static struct fio_sem *startup_sem;
@@ -615,7 +616,7 @@ static enum fio_q_status io_u_submit(struct thread_data *td, struct io_u *io_u)
  * The main verify engine. Runs over the writes we previously submitted,
  * reads the blocks back in, and checks the crc/md5 of the data.
  */
-static void do_verify(struct thread_data *td, uint64_t verify_bytes)
+static void do_verify(struct thread_data *td, uint64_t verify_bytes, struct fio_file *td_f, struct fio_zone_info *zi, bool sync)
 {
 	struct fio_file *f;
 	struct io_u *io_u;
@@ -629,8 +630,12 @@ static void do_verify(struct thread_data *td, uint64_t verify_bytes)
 	 * read from disk.
 	 */
 	for_each_file(td, f, i) {
+		if (td_f && f != td_f)
+			continue;
 		if (!fio_file_open(f))
 			continue;
+		if (!sync)
+			continue;
 		if (fio_io_sync(td, f))
 			break;
 		if (file_invalidate_cache(td, f))
@@ -677,7 +682,7 @@ static void do_verify(struct thread_data *td, uint64_t verify_bytes)
 			if (!io_u)
 				break;
 
-			if (get_next_verify(td, io_u)) {
+			if (get_next_verify(td, io_u, td_f, zi)) {
 				put_io_u(td, io_u);
 				break;
 			}
@@ -1516,6 +1521,35 @@ static uint64_t do_dry_run(struct thread_data *td)
 	return td->bytes_done[DDIR_WRITE] + td->bytes_done[DDIR_TRIM];
 }
 
+static void do_verify_zbd(struct thread_data *td, uint64_t verify_bytes)
+{
+	struct fio_file *f;
+	unsigned int i;
+
+	for_each_file(td, f, i) {
+		struct zoned_block_device_info *zbd = f->zbd_info;
+		bool sync = true;
+
+		if (!zbd)
+			continue;
+
+		for (uint32_t z = f->min_zone; z < f->max_zone; z++) {
+			struct fio_zone_info *zi = &zbd->zone_info[z];
+
+			if (!zbd_zone_swr(zi))
+				continue;
+
+			if (pthread_mutex_trylock(&zi->verify_mutex) != 0) {
+				/* Someone else is verifying this zone. */
+				continue;
+			}
+			do_verify(td, verify_bytes, f, zi, sync);
+			pthread_mutex_unlock(&zi->verify_mutex);
+			sync = false;
+		}
+	}
+}
+
 struct fork_data {
 	struct thread_data *td;
 	struct sk_out *sk_out;
@@ -1839,7 +1873,11 @@ static void *thread_main(void *data)
 
 		fio_gettime(&td->start, NULL);
 
-		do_verify(td, verify_bytes);
+		if (td->o.zone_mode == ZONE_MODE_ZBD) {
+			do_verify_zbd(td, verify_bytes);
+		} else {
+			do_verify(td, verify_bytes, NULL, NULL, true);
+		}
 
 		/*
 		 * See comment further up for why this is done here.
diff --git a/fio.h b/fio.h
index 20ca80e2..42df7a50 100644
--- a/fio.h
+++ b/fio.h
@@ -140,6 +140,7 @@ enum {
 	FIO_RAND_POISSON2_OFF,
 	FIO_RAND_POISSON3_OFF,
 	FIO_RAND_PRIO_CMDS,
+	FIO_RAND_ZBD,
 	FIO_RAND_NR_OFFS,
 };
 
@@ -256,6 +257,7 @@ struct thread_data {
 	struct frand_state buf_state;
 	struct frand_state buf_state_prev;
 	struct frand_state dedupe_state;
+	struct frand_state zbd_state;
 	struct frand_state zone_state;
 	struct frand_state prio_state;
 
diff --git a/init.c b/init.c
index b5315334..d41a23ff 100644
--- a/init.c
+++ b/init.c
@@ -1029,6 +1029,7 @@ static void td_fill_rand_seeds_internal(struct thread_data *td, bool use64)
 	init_rand_seed(&td->poisson_state[1], td->rand_seeds[FIO_RAND_POISSON2_OFF], 0);
 	init_rand_seed(&td->poisson_state[2], td->rand_seeds[FIO_RAND_POISSON3_OFF], 0);
 	init_rand_seed(&td->dedupe_state, td->rand_seeds[FIO_DEDUPE_OFF], false);
+	init_rand_seed(&td->zbd_state, td->rand_seeds[FIO_RAND_ZBD], use64);
 	init_rand_seed(&td->zone_state, td->rand_seeds[FIO_RAND_ZONE_OFF], false);
 	init_rand_seed(&td->prio_state, td->rand_seeds[FIO_RAND_PRIO_CMDS], false);
 
diff --git a/io_u.c b/io_u.c
index 18e94617..3cd7fb71 100644
--- a/io_u.c
+++ b/io_u.c
@@ -1610,7 +1610,7 @@ static bool check_get_verify(struct thread_data *td, struct io_u *io_u)
 			get_verify = 1;
 		}
 
-		if (get_verify && !get_next_verify(td, io_u)) {
+		if (get_verify && !get_next_verify(td, io_u, NULL, NULL)) {
 			td->verify_batch--;
 			return true;
 		}
diff --git a/iolog.c b/iolog.c
index 917a446c..732861a8 100644
--- a/iolog.c
+++ b/iolog.c
@@ -19,6 +19,7 @@
 #include "smalloc.h"
 #include "blktrace.h"
 #include "pshared.h"
+#include "zbd.h"
 
 #include <netinet/in.h>
 #include <netinet/tcp.h>
@@ -231,6 +232,7 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
 	ipo->file = io_u->file;
 	ipo->offset = io_u->offset;
 	ipo->len = io_u->buflen;
+	ipo->seed = io_u->rand_seed;
 	ipo->numberio = io_u->numberio;
 	ipo->flags = IP_F_IN_FLIGHT;
 
@@ -241,12 +243,20 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
 		td->trim_entries++;
 	}
 
-	/*
-	 * Only sort writes if we don't have a random map in which case we need
-	 * to check for duplicate blocks and drop the old one, which we rely on
-	 * the rb insert/lookup for handling.
-	 */
-	if (file_randommap(td, ipo->file)) {
+	if (td->o.zone_mode == ZONE_MODE_ZBD) {
+		struct fio_file *f = ipo->file;
+		uint32_t z = zbd_zone_idx(f, ipo->offset);
+		struct fio_zone_info *zi = &f->zbd_info->zone_info[z];
+
+		flist_add_tail(&ipo->list, &zi->write_list);
+		ipo->flags |= IP_F_ONLIST;
+		return;
+	} else if (file_randommap(td, ipo->file)) {
+		/*
+		 * Only sort writes if we don't have a random map in which case
+		 * we need to check for duplicate blocks and drop the old one,
+		 * which we rely on the rb insert/lookup for handling.
+		 */
 		INIT_FLIST_HEAD(&ipo->list);
 		flist_add_tail(&ipo->list, &td->io_hist_list);
 		ipo->flags |= IP_F_ONLIST;
@@ -322,7 +332,7 @@ void unlog_io_piece(struct thread_data *td, struct io_u *io_u)
 
 	if (ipo->flags & IP_F_ONRB)
 		rb_erase(&ipo->rb_node, &td->io_hist_tree);
-	else if (ipo->flags & IP_F_ONLIST)
+	else
 		flist_del(&ipo->list);
 
 	free(ipo);
diff --git a/iolog.h b/iolog.h
index 981081f9..7eddb8e0 100644
--- a/iolog.h
+++ b/iolog.h
@@ -211,6 +211,7 @@ struct io_piece {
 		struct fio_file *file;
 	};
 	unsigned long long offset;
+	uint64_t seed;
 	unsigned short numberio;
 	unsigned long len;
 	unsigned int flags;
diff --git a/verify.c b/verify.c
index b7fa6693..025e3eb0 100644
--- a/verify.c
+++ b/verify.c
@@ -11,6 +11,7 @@
 #include "fio.h"
 #include "verify.h"
 #include "trim.h"
+#include "zbd.h"
 #include "lib/rand.h"
 #include "lib/hweight.h"
 #include "lib/pattern.h"
@@ -54,7 +55,16 @@ void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
 	if (!o->verify_pattern_bytes) {
 		dprint(FD_VERIFY, "fill random bytes len=%u\n", len);
 
-		if (!use_seed) {
+		if (use_seed) {
+		} else if (td->o.zone_mode == ZONE_MODE_ZBD) {
+			struct fio_file *f = io_u->file;
+			uint32_t z = zbd_zone_idx(f, io_u->offset);
+			struct fio_zone_info *zi = &f->zbd_info->zone_info[z];
+
+			seed = __rand(&zi->rand_state);
+			if (sizeof(int) != sizeof(long *))
+				seed *= __rand(&zi->rand_state);
+		} else {
 			seed = __rand(&td->verify_state);
 			if (sizeof(int) != sizeof(long *))
 				seed *= (unsigned long)__rand(&td->verify_state);
@@ -1291,7 +1301,7 @@ void populate_verify_io_u(struct thread_data *td, struct io_u *io_u)
 	fill_pattern_headers(td, io_u, 0, 0);
 }
 
-int get_next_verify(struct thread_data *td, struct io_u *io_u)
+int get_next_verify(struct thread_data *td, struct io_u *io_u, struct fio_file *td_f, struct fio_zone_info *zi)
 {
 	struct io_piece *ipo = NULL;
 
@@ -1301,7 +1311,26 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
 	if (io_u->file)
 		return 0;
 
-	if (!RB_EMPTY_ROOT(&td->io_hist_tree)) {
+	if (zi) {
+		pthread_mutex_lock(&zi->mutex);
+		if (!flist_empty(&zi->write_list)) {
+			ipo = flist_first_entry(&zi->write_list, struct io_piece, list);
+
+			/*
+			 * Ensure that the associated IO has completed
+			 */
+			read_barrier();
+			if (ipo->flags & IP_F_IN_FLIGHT) {
+				pthread_mutex_unlock(&zi->mutex);
+				goto nothing;
+			}
+
+			flist_del(&ipo->list);
+			assert(ipo->flags & IP_F_ONLIST);
+			ipo->flags &= ~IP_F_ONLIST;
+		}
+		pthread_mutex_unlock(&zi->mutex);
+	} else if (!RB_EMPTY_ROOT(&td->io_hist_tree)) {
 		struct fio_rb_node *n = rb_first(&td->io_hist_tree);
 
 		ipo = rb_entry(n, struct io_piece, rb_node);
@@ -1332,10 +1361,13 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
 	}
 
 	if (ipo) {
-		td->io_hist_len--;
+		if (!zi) {
+			td->io_hist_len--;
+		}
 
 		io_u->offset = ipo->offset;
 		io_u->buflen = ipo->len;
+		io_u->rand_seed = ipo->seed;
 		io_u->numberio = ipo->numberio;
 		io_u->file = ipo->file;
 		io_u_set(td, io_u, IO_U_F_VER_LIST);
@@ -1363,11 +1395,6 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
 		free(ipo);
 		dprint(FD_VERIFY, "get_next_verify: ret io_u %p\n", io_u);
 
-		if (!td->o.verify_pattern_bytes) {
-			io_u->rand_seed = __rand(&td->verify_state);
-			if (sizeof(int) != sizeof(long *))
-				io_u->rand_seed *= __rand(&td->verify_state);
-		}
 		return 0;
 	}
 
diff --git a/verify.h b/verify.h
index 539e6f6c..f046d05b 100644
--- a/verify.h
+++ b/verify.h
@@ -7,6 +7,8 @@
 
 #define FIO_HDR_MAGIC	0xacca
 
+struct fio_zone_info;
+
 enum {
 	VERIFY_NONE = 0,		/* no verification */
 	VERIFY_HDR_ONLY,		/* verify header only, kept for sake of
@@ -94,7 +96,7 @@ struct vhdr_xxhash {
  * Verify helpers
  */
 extern void populate_verify_io_u(struct thread_data *, struct io_u *);
-extern int __must_check get_next_verify(struct thread_data *td, struct io_u *);
+extern int __must_check get_next_verify(struct thread_data *td, struct io_u *, struct fio_file *, struct fio_zone_info *);
 extern int __must_check verify_io_u(struct thread_data *, struct io_u **);
 extern int verify_io_u_async(struct thread_data *, struct io_u **);
 extern void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len, struct io_u *io_u, uint64_t seed, int use_seed);
diff --git a/zbd.c b/zbd.c
index df46da42..c926df15 100644
--- a/zbd.c
+++ b/zbd.c
@@ -118,7 +118,7 @@ int zbd_reset_wp(struct thread_data *td, struct fio_file *f,
  * @offset: offset in bytes. If this offset is in the first zone_size bytes
  *	    past the disk size then the index of the sentinel is returned.
  */
-static uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
+uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
 {
 	uint32_t zone_idx;
 
@@ -130,15 +130,6 @@ static uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
 	return min(zone_idx, f->zbd_info->nr_zones);
 }
 
-/**
- * zbd_zone_swr - Test whether a zone requires sequential writes
- * @z: zone info pointer.
- */
-static inline bool zbd_zone_swr(struct fio_zone_info *z)
-{
-	return z->type == ZBD_ZONE_TYPE_SWR;
-}
-
 /**
  * zbd_zone_full - verify whether a minimum number of bytes remain in a zone
  * @f: file pointer.
@@ -499,6 +490,11 @@ out:
 	return ret;
 }
 
+static inline bool td_use64(const struct thread_data *td)
+{
+	return td->o.random_generator == FIO_RAND_GEN_TAUSWORTHE64;
+}
+
 /*
  * Allocate zone information and store it into f->zbd_info if zonemode=zbd.
  *
@@ -509,6 +505,7 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
 	struct zoned_block_device_info *zbd;
 	enum zbd_zoned_model zbd_model;
 	pthread_mutexattr_t attr;
+	uint64_t diff_seed;
 	int ret;
 
 	assert(td->o.zone_mode == ZONE_MODE_ZBD);
@@ -543,6 +540,23 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
 	pthread_mutexattr_init(&attr);
 	pthread_mutexattr_setpshared(&attr, true);
 	pthread_mutex_init(&zbd->mutex, &attr);
+
+	diff_seed = td_use64(td)
+		? ~(uint64_t)0 / zbd->nr_zones
+		: ~(uint32_t)0 / zbd->nr_zones;
+	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
+		struct fio_zone_info *zi = &zbd->zone_info[z];
+
+		/*
+		 * Spread zone seeds a bit, they will be incremented
+		 * with each reset and better stay unique.
+		 */
+		zi->seed = __rand(&td->zbd_state) + z * diff_seed;
+		init_rand_seed(&zi->rand_state, zi->seed, td_use64(td));
+		INIT_FLIST_HEAD(&zi->write_list);
+		pthread_mutex_init(&zi->verify_mutex, &attr);
+	}
+
 	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
 	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
 		struct fio_zone_info *zi = &zbd->zone_info[z];
@@ -683,13 +697,26 @@ static int zbd_reset_range(struct thread_data *td, struct fio_file *f,
 	zone_idx_e = zbd_zone_idx(f, offset + length);
 	ze = &f->zbd_info->zone_info[zone_idx_e];
 	for (z = zb; z < ze; z++) {
+		FLIST_HEAD(write_list);
+
 		pthread_mutex_lock(&z->mutex);
 		pthread_mutex_lock(&f->zbd_info->mutex);
 		f->zbd_info->sectors_with_data -= z->wp - z->start;
 		pthread_mutex_unlock(&f->zbd_info->mutex);
 		z->wp = z->start;
-		z->verify_block = 0;
+		z->seed++;
+		init_rand_seed(&z->rand_state, z->seed, td_use64(td));
+		flist_splice_init(&z->write_list, &write_list);
 		pthread_mutex_unlock(&z->mutex);
+
+		while (!flist_empty(&write_list)) {
+			struct io_piece *ipo = flist_first_entry(&write_list, struct io_piece, list);
+
+			/* Data "loss"... */
+			flist_del(&ipo->list);
+			assert(ipo->flags & IP_F_ONLIST);
+			free(ipo);
+		}
 	}
 
 	td->ts.nr_zone_resets += ze - zb;
@@ -1142,27 +1169,6 @@ out:
 	return z;
 }
 
-/* The caller must hold z->mutex. */
-static struct fio_zone_info *zbd_replay_write_order(struct thread_data *td,
-						    struct io_u *io_u,
-						    struct fio_zone_info *z)
-{
-	const struct fio_file *f = io_u->file;
-	const uint32_t min_bs = td->o.min_bs[DDIR_WRITE];
-
-	if (!zbd_open_zone(td, io_u, z - f->zbd_info->zone_info)) {
-		pthread_mutex_unlock(&z->mutex);
-		z = zbd_convert_to_open_zone(td, io_u);
-		assert(z);
-	}
-
-	if (z->verify_block * min_bs >= f->zbd_info->zone_size)
-		log_err("%s: %d * %d >= %llu\n", f->file_name, z->verify_block,
-			min_bs, (unsigned long long) f->zbd_info->zone_size);
-	io_u->offset = z->start + z->verify_block++ * min_bs;
-	return z;
-}
-
 /*
  * Find another zone for which @io_u fits below the write pointer. Start
  * searching in zones @zb + 1 .. @zl and continue searching in zones
@@ -1454,10 +1460,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
 
 	switch (io_u->ddir) {
 	case DDIR_READ:
-		if (td->runstate == TD_VERIFYING) {
-			zb = zbd_replay_write_order(td, io_u, zb);
-			goto accept;
-		}
 		/*
 		 * Check that there is enough written data in the zone to do an
 		 * I/O of at least min_bs B. If there isn't, find a new zone for
@@ -1532,7 +1534,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
 		}
 		/* Reset the zone pointer if necessary */
 		if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
-			assert(td->o.verify == VERIFY_NONE);
 			/*
 			 * Since previous write requests may have been submitted
 			 * asynchronously and since we will submit the zone
diff --git a/zbd.h b/zbd.h
index fb39fb82..013c08c9 100644
--- a/zbd.h
+++ b/zbd.h
@@ -23,23 +23,29 @@ enum io_u_action {
  * struct fio_zone_info - information about a single ZBD zone
  * @start: zone start location (bytes)
  * @wp: zone write pointer location (bytes)
- * @verify_block: number of blocks that have been verified for this zone
  * @mutex: protects the modifiable members in this structure
  * @type: zone type (BLK_ZONE_TYPE_*)
  * @cond: zone state (BLK_ZONE_COND_*)
  * @open: whether or not this zone is currently open. Only relevant if
  *		max_open_zones > 0.
  * @reset_zone: whether or not this zone should be reset before writing to it
+ * @seed:
+ * @rand_state:
+ * @write_list:
+ * @verify_mutex:
  */
 struct fio_zone_info {
 	pthread_mutex_t		mutex;
 	uint64_t		start;
 	uint64_t		wp;
-	uint32_t		verify_block;
 	enum zbd_zone_type	type:2;
 	enum zbd_zone_cond	cond:4;
 	unsigned int		open:1;
 	unsigned int		reset_zone:1;
+	uint64_t		seed;
+	struct frand_state	rand_state;
+	struct flist_head	write_list;
+	pthread_mutex_t		verify_mutex;
 };
 
 /**
@@ -89,6 +95,7 @@ enum fio_ddir zbd_adjust_ddir(struct thread_data *td, struct io_u *io_u,
 			      enum fio_ddir ddir);
 enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u);
 char *zbd_write_status(const struct thread_stat *ts);
+uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset);
 
 static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
 {
@@ -107,4 +114,13 @@ static inline void zbd_put_io_u(struct io_u *io_u)
 	}
 }
 
+/**
+ * zbd_zone_swr - Test whether a zone requires sequential writes
+ * @z: zone info pointer.
+ */
+static inline bool zbd_zone_swr(struct fio_zone_info *z)
+{
+	return z->type == ZBD_ZONE_TYPE_SWR;
+}
+
 #endif /* FIO_ZBD_H */
-- 
2.26.2




* Re: [PATCH 8/9] zbd: support verification
  2020-05-05 17:56 ` [PATCH 8/9] zbd: support verification Alexey Dobriyan
@ 2020-05-05 17:59   ` Alexey Dobriyan
  2020-05-11  2:01     ` Damien Le Moal
  2020-05-11  1:59   ` Damien Le Moal
  1 sibling, 1 reply; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-05 17:59 UTC (permalink / raw)
  To: axboe; +Cc: fio, damien.lemoal

On Tue, May 05, 2020 at 08:56:34PM +0300, Alexey Dobriyan wrote:
> Verification is not supported in ZBD mode which is codified in this
> assert:
> 
> 	if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
> 		assert(td->o.verify == VERIFY_NONE);

This one is an RFC for now, not for merging; the rest are.

And no, patches haven't been lost; 9/9 is debugging as usual...



* Re: [PATCH 1/9] verify: decouple seed generation from buffer fill
  2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
                   ` (6 preceding siblings ...)
  2020-05-05 17:56 ` [PATCH 8/9] zbd: support verification Alexey Dobriyan
@ 2020-05-11  0:56 ` Damien Le Moal
  7 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  0:56 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:56, Alexey Dobriyan wrote:
> It is nicer this way and there will be more code in this area
> with ZBD verification.
> 
> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> ---
>  backend.c |  6 ------
>  verify.c  | 20 +++++++-------------
>  2 files changed, 7 insertions(+), 19 deletions(-)
> 
> diff --git a/backend.c b/backend.c
> index feb34e51..452975cf 100644
> --- a/backend.c
> +++ b/backend.c
> @@ -1006,12 +1006,6 @@ static void do_io(struct thread_data *td, uint64_t *bytes_done)
>  		if (td->o.verify != VERIFY_NONE && io_u->ddir == DDIR_READ &&
>  		    ((io_u->flags & IO_U_F_VER_LIST) || !td_rw(td))) {
>  
> -			if (!td->o.verify_pattern_bytes) {
> -				io_u->rand_seed = __rand(&td->verify_state);
> -				if (sizeof(int) != sizeof(long *))
> -					io_u->rand_seed *= __rand(&td->verify_state);
> -			}
> -
>  			if (verify_state_should_stop(td, io_u)) {
>  				put_io_u(td, io_u);
>  				break;
> diff --git a/verify.c b/verify.c
> index cf299ebf..b7fa6693 100644
> --- a/verify.c
> +++ b/verify.c
> @@ -46,15 +46,6 @@ static void __fill_buffer(struct thread_options *o, uint64_t seed, void *p,
>  	__fill_random_buf_percentage(seed, p, o->compress_percentage, len, len, o->buffer_pattern, o->buffer_pattern_bytes);
>  }
>  
> -static uint64_t fill_buffer(struct thread_data *td, void *p,
> -			    unsigned int len)
> -{
> -	struct frand_state *fs = &td->verify_state;
> -	struct thread_options *o = &td->o;
> -
> -	return fill_random_buf_percentage(fs, p, o->compress_percentage, len, len, o->buffer_pattern, o->buffer_pattern_bytes);
> -}
> -
>  void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
>  			 struct io_u *io_u, uint64_t seed, int use_seed)
>  {
> @@ -63,10 +54,13 @@ void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
>  	if (!o->verify_pattern_bytes) {
>  		dprint(FD_VERIFY, "fill random bytes len=%u\n", len);
>  
> -		if (use_seed)
> -			__fill_buffer(o, seed, p, len);
> -		else
> -			io_u->rand_seed = fill_buffer(td, p, len);
> +		if (!use_seed) {
> +			seed = __rand(&td->verify_state);
> +			if (sizeof(int) != sizeof(long *))
> +				seed *= (unsigned long)__rand(&td->verify_state);
> +		}
> +		io_u->rand_seed = seed;
> +		__fill_buffer(o, seed, p, len);
>  		return;
>  	}
>  
> 

Looks OK to me.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>


-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH 4/9] zbd: introduce per job maximum open zones limit
  2020-05-05 17:56 ` [PATCH 4/9] zbd: introduce per job maximum open zones limit Alexey Dobriyan
@ 2020-05-11  1:09   ` Damien Le Moal
  0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  1:09 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:57, Alexey Dobriyan wrote:
> It is not possible to maintain sustained per-thread iodepth in ZBD mode.
> The way code is written, "max_open_zones" acts as a global limit, and
> once one or few threads open all "max_open_zones" zones, other threads
> can't open anything and _exit_ prematurely.
> 
> This config is guaranteed to make equal number of zone resets/IO now:
> each thread generates identical pattern and doesn't intersect with other
> threads:
> 
> 	zonemode=zbd
> 	zonesize=...
> 	rw=write
> 
> 	numjobs=N
> 	offset_increment=M*zonesize
> 
> 	[j]
> 	size=M*zonesize
> 
> Patch introduces "job_max_open_zones" which is per-thread/process limit.
> "max_open_zones" remains per file/device limit. Both limits are checked
> for each open zone so one thread can't kick out others.
> 
> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> ---
>  fio.1            |  6 +++++-
>  fio.h            |  1 +
>  options.c        | 16 +++++++++++++---
>  thread_options.h |  1 +
>  zbd.c            | 49 +++++++++++++++++++++++++++++++++++++-----------
>  zbd.h            |  3 +++
>  6 files changed, 61 insertions(+), 15 deletions(-)
> 
> diff --git a/fio.1 b/fio.1
> index a2379f98..c0f2e7cf 100644
> --- a/fio.1
> +++ b/fio.1
> @@ -804,7 +804,11 @@ so. Default: false.
>  When running a random write test across an entire drive many more zones will be
>  open than in a typical application workload. Hence this command line option
>  that allows to limit the number of open zones. The number of open zones is
> -defined as the number of zones to which write commands are issued.
> +defined as the number of zones to which write commands are issued by all
> +threads/processes.
> +.TP
> +.BI job_max_open_zones \fR=\fPint
> +Limit on the number of simultaneously opened zones per single thread/process.
>  .TP
>  .BI zone_reset_threshold \fR=\fPfloat
>  A number between zero and one that indicates the ratio of logical blocks with
> diff --git a/fio.h b/fio.h
> index bbf057c1..20ca80e2 100644
> --- a/fio.h
> +++ b/fio.h
> @@ -260,6 +260,7 @@ struct thread_data {
>  	struct frand_state prio_state;
>  
>  	struct zone_split_index **zone_state_index;
> +	unsigned int num_open_zones;
>  
>  	unsigned int verify_batch;
>  	unsigned int trim_batch;
> diff --git a/options.c b/options.c
> index 2372c042..47b3b765 100644
> --- a/options.c
> +++ b/options.c
> @@ -3360,12 +3360,22 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
>  	},
>  	{
>  		.name	= "max_open_zones",
> -		.lname	= "Maximum number of open zones",
> +		.lname	= "Global maximum number of open zones",

Instead of "Global", maybe use "Per device file"? Jobs operating on
different files can each define a different value for max_open_zones,
which, in that use case, really makes it a per-device-file limit.

>  		.type	= FIO_OPT_INT,
>  		.off1	= offsetof(struct thread_options, max_open_zones),
>  		.maxval	= ZBD_MAX_OPEN_ZONES,
> -		.help	= "Limit random writes to SMR drives to the specified"
> -			  " number of sequential zones",
> +		.help	= "Limit on the number of simultaneously opened sequential write zones in SMR/ZNS drives",

Please change "in SMR/ZNS drives" to "with zonemode=zbd" since we can use
regular disks with emulated zones too.

> +		.def	= "0",
> +		.category = FIO_OPT_C_IO,
> +		.group	= FIO_OPT_G_INVALID,
> +	},
> +	{
> +		.name	= "job_max_open_zones",
> +		.lname	= "Job maximum number of open zones",
> +		.type	= FIO_OPT_INT,
> +		.off1	= offsetof(struct thread_options, job_max_open_zones),
> +		.maxval	= ZBD_MAX_OPEN_ZONES,
> +		.help	= "Limit on the number of simultaneously opened sequential write zones in SMR/ZNS drives by one thread/process",

Same here.

>  		.def	= "0",
>  		.category = FIO_OPT_C_IO,
>  		.group	= FIO_OPT_G_INVALID,
> diff --git a/thread_options.h b/thread_options.h
> index c78ed43d..b0b493e4 100644
> --- a/thread_options.h
> +++ b/thread_options.h
> @@ -342,6 +342,7 @@ struct thread_options {
>  	/* Parameters that affect zonemode=zbd */
>  	unsigned int read_beyond_wp;
>  	int max_open_zones;
> +	unsigned int job_max_open_zones;
>  	fio_fp64_t zrt;
>  	fio_fp64_t zrf;
>  };
> diff --git a/zbd.c b/zbd.c
> index 76771f2e..b34ceb34 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -546,8 +546,10 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
>  		return -EINVAL;
>  	}
>  
> -	if (ret == 0)
> +	if (ret == 0) {
>  		f->zbd_info->model = zbd_model;
> +		f->zbd_info->max_open_zones = td->o.max_open_zones;
> +	}
>  	return ret;
>  }
>  
> @@ -622,6 +624,27 @@ int zbd_init(struct thread_data *td)
>  	if (!zbd_verify_bs())
>  		return 1;
>  
> +	for_each_file(td, f, i) {
> +		struct zoned_block_device_info *zbd = f->zbd_info;
> +
> +		if (!zbd)
> +			continue;
> +
> +		if (zbd->max_open_zones == 0) {
> +			zbd->max_open_zones = ZBD_MAX_OPEN_ZONES;
> +		}

No need for the "{}" brackets here.

> +
> +		if (td->o.max_open_zones > 0 &&
> +		    zbd->max_open_zones != td->o.max_open_zones) {
> +			log_err("Different 'max_open_zones' values\n");
> +			return 1;
> +		}
> +		if (zbd->max_open_zones > ZBD_MAX_OPEN_ZONES) {
> +			log_err("'max_open_zones' value is limited by %u\n", ZBD_MAX_OPEN_ZONES);
> +			return 1;
> +		}
> +	}
> +
>  	return 0;
>  }
>  
> @@ -709,6 +732,7 @@ static void zbd_close_zone(struct thread_data *td, const struct fio_file *f,
>  		(ZBD_MAX_OPEN_ZONES - (open_zone_idx + 1)) *
>  		sizeof(f->zbd_info->open_zones[0]));
>  	f->zbd_info->num_open_zones--;
> +	td->num_open_zones--;
>  	f->zbd_info->zone_info[zone_idx].open = 0;
>  }
>  
> @@ -888,8 +912,10 @@ static bool is_zone_open(const struct thread_data *td, const struct fio_file *f,
>  	struct zoned_block_device_info *zbdi = f->zbd_info;
>  	int i;
>  
> -	assert(td->o.max_open_zones <= ARRAY_SIZE(zbdi->open_zones));
> -	assert(zbdi->num_open_zones <= td->o.max_open_zones);
> +	assert(td->o.job_max_open_zones == 0 || td->num_open_zones <= td->o.job_max_open_zones);
> +

I do not think that this blank line is needed.

> +	assert(td->o.job_max_open_zones <= zbdi->max_open_zones);
> +	assert(zbdi->num_open_zones <= zbdi->max_open_zones);
>  
>  	for (i = 0; i < zbdi->num_open_zones; i++)
>  		if (zbdi->open_zones[i] == zone_idx)
> @@ -922,18 +948,19 @@ static bool zbd_open_zone(struct thread_data *td, const struct io_u *io_u,
>  	if (td->o.verify != VERIFY_NONE && zbd_zone_full(f, z, min_bs))
>  		return false;
>  
> -	/* Zero means no limit */
> -	if (!td->o.max_open_zones)
> -		return true;
> -
>  	pthread_mutex_lock(&f->zbd_info->mutex);
>  	if (is_zone_open(td, f, zone_idx))
>  		goto out;
>  	res = false;
> -	if (f->zbd_info->num_open_zones >= td->o.max_open_zones)
> +	/* Zero means no limit */
> +	if (td->o.job_max_open_zones > 0 &&
> +	    td->num_open_zones >= td->o.job_max_open_zones)
> +		goto out;
> +	if (f->zbd_info->num_open_zones >= f->zbd_info->max_open_zones)
>  		goto out;
>  	dprint(FD_ZBD, "%s: opening zone %d\n", f->file_name, zone_idx);
>  	f->zbd_info->open_zones[f->zbd_info->num_open_zones++] = zone_idx;
> +	td->num_open_zones++;
>  	z->open = 1;
>  	res = true;
>  
> @@ -968,7 +995,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
>  
>  	assert(is_valid_offset(f, io_u->offset));
>  
> -	if (td->o.max_open_zones) {
> +	if (td->o.job_max_open_zones) {
>  		/*
>  		 * This statement accesses f->zbd_info->open_zones[] on purpose
>  		 * without locking.
> @@ -997,7 +1024,7 @@ static struct fio_zone_info *zbd_convert_to_open_zone(struct thread_data *td,
>  
>  		zone_lock(td, f, z);
>  		pthread_mutex_lock(&f->zbd_info->mutex);
> -		if (td->o.max_open_zones == 0)
> +		if (td->o.job_max_open_zones == 0)
>  			goto examine_zone;
>  		if (f->zbd_info->num_open_zones == 0) {
>  			pthread_mutex_unlock(&f->zbd_info->mutex);
> @@ -1053,7 +1080,7 @@ examine_zone:
>  	}
>  	dprint(FD_ZBD, "%s(%s): closing zone %d\n", __func__, f->file_name,
>  	       zone_idx);
> -	if (td->o.max_open_zones)
> +	if (td->o.job_max_open_zones)
>  		zbd_close_zone(td, f, open_zone_idx);
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  
> diff --git a/zbd.h b/zbd.h
> index 5a660399..fb39fb82 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -45,6 +45,8 @@ struct fio_zone_info {
>  /**
>   * zoned_block_device_info - zoned block device characteristics
>   * @model: Device model.
> + * @max_open_zones: global limit on the number of simultaneously opened
> + *	sequential write zones.
>   * @mutex: Protects the modifiable members in this structure (refcount and
>   *		num_open_zones).
>   * @zone_size: size of a single zone in units of 512 bytes
> @@ -65,6 +67,7 @@ struct fio_zone_info {
>   */
>  struct zoned_block_device_info {
>  	enum zbd_zoned_model	model;
> +	uint32_t		max_open_zones;
>  	pthread_mutex_t		mutex;
>  	uint64_t		zone_size;
>  	uint64_t		sectors_with_data;
> 

Apart from the nits above, it looks good.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/9] zbd: make zbd_info->mutex non-recursive
  2020-05-05 17:56 ` [PATCH 5/9] zbd: make zbd_info->mutex non-recursive Alexey Dobriyan
@ 2020-05-11  1:11   ` Damien Le Moal
  0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  1:11 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:56, Alexey Dobriyan wrote:
> There is no reason for it to be recursive. Recursiveness leaked
> from struct fio_zone_info::mutex initialisation.
> 
> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> ---
>  zbd.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/zbd.c b/zbd.c
> index b34ceb34..2926bae9 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -373,9 +373,9 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
>  		return -ENOMEM;
>  
>  	pthread_mutexattr_init(&attr);
> -	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	pthread_mutexattr_setpshared(&attr, true);
>  	pthread_mutex_init(&zbd_info->mutex, &attr);
> +	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	zbd_info->refcount = 1;
>  	p = &zbd_info->zone_info[0];
>  	for (i = 0; i < nr_zones; i++, p++) {
> @@ -417,7 +417,6 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  	int i, j, ret = 0;
>  
>  	pthread_mutexattr_init(&attr);
> -	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	pthread_mutexattr_setpshared(&attr, true);
>  
>  	zones = calloc(ZBD_REPORT_MAX_ZONES, sizeof(struct zbd_zone));
> @@ -454,6 +453,7 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  	if (!zbd_info)
>  		goto out;
>  	pthread_mutex_init(&zbd_info->mutex, &attr);
> +	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	zbd_info->refcount = 1;
>  	p = &zbd_info->zone_info[0];
>  	for (offset = 0, j = 0; j < nr_zones;) {
> @@ -796,14 +796,20 @@ static int zbd_reset_zones(struct thread_data *td, struct fio_file *f,
>   * Reset zbd_info.write_cnt, the counter that counts down towards the next
>   * zone reset.
>   */
> -static void zbd_reset_write_cnt(const struct thread_data *td,
> -				const struct fio_file *f)
> +static void _zbd_reset_write_cnt(const struct thread_data *td,
> +				 const struct fio_file *f)
>  {
>  	assert(0 <= td->o.zrf.u.f && td->o.zrf.u.f <= 1);
>  
> -	pthread_mutex_lock(&f->zbd_info->mutex);
>  	f->zbd_info->write_cnt = td->o.zrf.u.f ?
>  		min(1.0 / td->o.zrf.u.f, 0.0 + UINT_MAX) : UINT_MAX;
> +}
> +
> +static void zbd_reset_write_cnt(const struct thread_data *td,
> +				const struct fio_file *f)
> +{
> +	pthread_mutex_lock(&f->zbd_info->mutex);
> +	_zbd_reset_write_cnt(td, f);
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  }
>  
> @@ -817,7 +823,7 @@ static bool zbd_dec_and_reset_write_cnt(const struct thread_data *td,
>  	if (f->zbd_info->write_cnt)
>  		write_cnt = --f->zbd_info->write_cnt;
>  	if (write_cnt == 0)
> -		zbd_reset_write_cnt(td, f);
> +		_zbd_reset_write_cnt(td, f);
>  	pthread_mutex_unlock(&f->zbd_info->mutex);
>  
>  	return write_cnt == 0;
> 

Looks good to me.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/9] zbd: consolidate zone mutex initialisation
  2020-05-05 17:56 ` [PATCH 6/9] zbd: consolidate zone mutex initialisation Alexey Dobriyan
@ 2020-05-11  1:16   ` Damien Le Moal
  2020-05-21 21:44     ` Alexey Dobriyan
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  1:16 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:56, Alexey Dobriyan wrote:
> Initialise everything that is not write pointers coming from device or
> faked separatedly.

s/Initialise/Initialize
s/separatedly/separately

> 
> More stuff (verification in ZBD mode and ->write_mutex) will be added.
> 
> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> ---
>  zbd.c | 39 ++++++++++++++++++++-------------------
>  1 file changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/zbd.c b/zbd.c
> index 2926bae9..df46da42 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -351,7 +351,6 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
>  	struct fio_zone_info *p;
>  	uint64_t zone_size = td->o.zone_size;
>  	struct zoned_block_device_info *zbd_info = NULL;
> -	pthread_mutexattr_t attr;
>  	int i;
>  
>  	if (zone_size == 0) {
> @@ -372,14 +371,9 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
>  	if (!zbd_info)
>  		return -ENOMEM;
>  
> -	pthread_mutexattr_init(&attr);
> -	pthread_mutexattr_setpshared(&attr, true);
> -	pthread_mutex_init(&zbd_info->mutex, &attr);
> -	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	zbd_info->refcount = 1;
>  	p = &zbd_info->zone_info[0];
>  	for (i = 0; i < nr_zones; i++, p++) {
> -		pthread_mutex_init(&p->mutex, &attr);
>  		p->start = i * zone_size;
>  		p->wp = p->start + zone_size;
>  		p->type = ZBD_ZONE_TYPE_SWR;
> @@ -393,7 +387,6 @@ static int init_zone_info(struct thread_data *td, struct fio_file *f)
>  	f->zbd_info->zone_size_log2 = is_power_of_2(zone_size) ?
>  		ilog2(zone_size) : 0;
>  	f->zbd_info->nr_zones = nr_zones;
> -	pthread_mutexattr_destroy(&attr);
>  	return 0;
>  }
>  
> @@ -413,12 +406,8 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  	struct fio_zone_info *p;
>  	uint64_t zone_size, offset;
>  	struct zoned_block_device_info *zbd_info = NULL;
> -	pthread_mutexattr_t attr;
>  	int i, j, ret = 0;
>  
> -	pthread_mutexattr_init(&attr);
> -	pthread_mutexattr_setpshared(&attr, true);
> -
>  	zones = calloc(ZBD_REPORT_MAX_ZONES, sizeof(struct zbd_zone));
>  	if (!zones)
>  		goto out;
> @@ -452,14 +441,11 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  	ret = -ENOMEM;
>  	if (!zbd_info)
>  		goto out;
> -	pthread_mutex_init(&zbd_info->mutex, &attr);
> -	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	zbd_info->refcount = 1;
>  	p = &zbd_info->zone_info[0];
>  	for (offset = 0, j = 0; j < nr_zones;) {
>  		z = &zones[0];
>  		for (i = 0; i < nrz; i++, j++, z++, p++) {
> -			pthread_mutex_init(&p->mutex, &attr);
>  			p->start = z->start;
>  			switch (z->cond) {
>  			case ZBD_ZONE_COND_NOT_WP:
> @@ -510,7 +496,6 @@ static int parse_zone_info(struct thread_data *td, struct fio_file *f)
>  out:
>  	sfree(zbd_info);
>  	free(zones);
> -	pthread_mutexattr_destroy(&attr);
>  	return ret;
>  }
>  
> @@ -521,7 +506,9 @@ out:
>   */
>  static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
>  {
> +	struct zoned_block_device_info *zbd;
>  	enum zbd_zoned_model zbd_model;
> +	pthread_mutexattr_t attr;
>  	int ret;
>  
>  	assert(td->o.zone_mode == ZONE_MODE_ZBD);
> @@ -546,11 +533,25 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
>  		return -EINVAL;
>  	}
>  
> -	if (ret == 0) {
> -		f->zbd_info->model = zbd_model;
> -		f->zbd_info->max_open_zones = td->o.max_open_zones;
> +	if (ret)
> +		return ret;
> +
> +	zbd = f->zbd_info;
> +	zbd->model = zbd_model;
> +	zbd->max_open_zones = td->o.max_open_zones;
> +
> +	pthread_mutexattr_init(&attr);
> +	pthread_mutexattr_setpshared(&attr, true);
> +	pthread_mutex_init(&zbd->mutex, &attr);
> +	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
> +	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
> +		struct fio_zone_info *zi = &zbd->zone_info[z];
> +
> +		pthread_mutex_init(&zi->mutex, &attr);
>  	}
> -	return ret;
> +	pthread_mutexattr_destroy(&attr);
> +
> +	return 0;
>  }
>  
>  void zbd_free_zone_info(struct fio_file *f)
> 

On a very large drive (e.g. 20TB SMR), we have in excess of 70000 zones. Having
to loop over all of them a second time will start to be painful and will likely
slow down startup (a little). And going forward, with ever-increasing disk
capacity, this will only get worse. So I am not sure this patch makes much sense.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/9] fio: parse "io_size=1%"
  2020-05-05 17:56 ` [PATCH 7/9] fio: parse "io_size=1%" Alexey Dobriyan
@ 2020-05-11  1:20   ` Damien Le Moal
  2020-05-21 22:13     ` Alexey Dobriyan
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  1:20 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:57, Alexey Dobriyan wrote:
> The following config:
> 
> 	size=1%
> 	io_size=1%
> 
> will generate an essentially infinite loop, because io_size is set to (2^64-1)-1.
> 
> Parse io_size as a percentage to avoid the bug.
> 
> Note: if size% != io_size%, io_size is ignored, but that is a separate bug.
> 
> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> ---
>  cconv.c          |  2 ++
>  options.c        | 17 +++++++++++++++++
>  server.h         |  2 +-
>  thread_options.h |  4 ++++
>  4 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/cconv.c b/cconv.c
> index 48218dc4..48b0db9e 100644
> --- a/cconv.c
> +++ b/cconv.c
> @@ -102,6 +102,7 @@ void convert_thread_options_to_cpu(struct thread_options *o,
>  	o->size = le64_to_cpu(top->size);
>  	o->io_size = le64_to_cpu(top->io_size);
>  	o->size_percent = le32_to_cpu(top->size_percent);
> +	o->io_size_percent = le32_to_cpu(top->io_size_percent);
>  	o->fill_device = le32_to_cpu(top->fill_device);
>  	o->file_append = le32_to_cpu(top->file_append);
>  	o->file_size_low = le64_to_cpu(top->file_size_low);
> @@ -366,6 +367,7 @@ void convert_thread_options_to_net(struct thread_options_pack *top,
>  	top->iodepth_batch_complete_max = cpu_to_le32(o->iodepth_batch_complete_max);
>  	top->serialize_overlap = cpu_to_le32(o->serialize_overlap);
>  	top->size_percent = cpu_to_le32(o->size_percent);
> +	top->io_size_percent = cpu_to_le32(o->io_size_percent);
>  	top->fill_device = cpu_to_le32(o->fill_device);
>  	top->file_append = cpu_to_le32(o->file_append);
>  	top->ratecycle = cpu_to_le32(o->ratecycle);
> diff --git a/options.c b/options.c
> index 47b3b765..3d0c3a84 100644
> --- a/options.c
> +++ b/options.c
> @@ -1475,6 +1475,22 @@ static int str_size_cb(void *data, unsigned long long *__val)
>  	return 0;
>  }
>  
> +static int str_io_size_cb(void *data, unsigned long long *__val)
> +{
> +	struct thread_data *td = cb_data_to_td(data);
> +	unsigned long long v = *__val;
> +
> +	if (parse_is_percent(v)) {
> +		td->o.io_size = 0;
> +		td->o.io_size_percent = -1ULL - v;
> +		dprint(FD_PARSE, "SET io_size_percent %d\n",
> +					td->o.io_size_percent);

I do not see where f->io_size is set using td->o.io_size_percent. Doesn't it
need to be set somewhere in filesetup.c ?
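
Something along these lines, I would think (rough, untested sketch -- the
exact spot and names in filesetup.c may differ):

	/* after f->real_file_size is known, mirror what size_percent does */
	if (td->o.io_size_percent) {
		td->o.io_size = f->real_file_size *
				td->o.io_size_percent / 100;
	}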

> +	} else
> +		td->o.io_size = v;
> +
> +	return 0;
> +}
> +
>  static int str_write_bw_log_cb(void *data, const char *str)
>  {
>  	struct thread_data *td = cb_data_to_td(data);
> @@ -2042,6 +2058,7 @@ struct fio_option fio_options[FIO_MAX_OPTS] = {
>  		.alias	= "io_limit",
>  		.lname	= "IO Size",
>  		.type	= FIO_OPT_STR_VAL,
> +		.cb	= str_io_size_cb,
>  		.off1	= offsetof(struct thread_options, io_size),
>  		.help	= "Total size of I/O to be performed",
>  		.interval = 1024 * 1024,
> diff --git a/server.h b/server.h
> index 279b6917..de01a5c8 100644
> --- a/server.h
> +++ b/server.h
> @@ -48,7 +48,7 @@ struct fio_net_cmd_reply {
>  };
>  
>  enum {
> -	FIO_SERVER_VER			= 82,
> +	FIO_SERVER_VER			= 83,
>  
>  	FIO_SERVER_MAX_FRAGMENT_PDU	= 1024,
>  	FIO_SERVER_MAX_CMD_MB		= 2048,
> diff --git a/thread_options.h b/thread_options.h
> index b0b493e4..4979ed79 100644
> --- a/thread_options.h
> +++ b/thread_options.h
> @@ -83,6 +83,7 @@ struct thread_options {
>  	unsigned long long size;
>  	unsigned long long io_size;
>  	unsigned int size_percent;
> +	unsigned int io_size_percent;
>  	unsigned int fill_device;
>  	unsigned int file_append;
>  	unsigned long long file_size_low;
> @@ -377,6 +378,7 @@ struct thread_options_pack {
>  	uint64_t size;
>  	uint64_t io_size;
>  	uint32_t size_percent;
> +	uint32_t io_size_percent;
>  	uint32_t fill_device;
>  	uint32_t file_append;
>  	uint32_t unique_filename;
> @@ -456,6 +458,8 @@ struct thread_options_pack {
>  	struct zone_split zone_split[DDIR_RWDIR_CNT][ZONESPLIT_MAX];
>  	uint32_t zone_split_nr[DDIR_RWDIR_CNT];
>  
> +	uint8_t pad1[4];
> +
>  	fio_fp64_t zipf_theta;
>  	fio_fp64_t pareto_h;
>  	fio_fp64_t gauss_dev;
> 


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/9] zbd: support verification
  2020-05-05 17:56 ` [PATCH 8/9] zbd: support verification Alexey Dobriyan
  2020-05-05 17:59   ` Alexey Dobriyan
@ 2020-05-11  1:59   ` Damien Le Moal
  2020-05-11 16:00     ` Alexey Dobriyan
  1 sibling, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  1:59 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:57, Alexey Dobriyan wrote:
> Verification is not supported in ZBD mode which is codified in this
> assert:
> 
> 	if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
> 		assert(td->o.verify == VERIFY_NONE);

This is not correct. Verification works just fine. The above assert is there
because if verify is enabled, zone reset is not done and full zones are
ignored:
static bool zbd_open_zone(struct thread_data *td, const struct io_u *io_u,
                          uint32_t zone_idx)
{
	...

        /*
         * Skip full zones with data verification enabled because resetting a
         * zone causes data loss and hence causes verification to fail.
         */
        if (td->o.verify != VERIFY_NONE && zbd_zone_full(f, z, min_bs))
                return false;
	...

There are many test cases in t/zbd/test-zbd-support that have the verify option
to test it. This support is likely not perfect and may have problems, but saying
that it is not supported is an overstatement.

> However, sequential write property is excellent for actually doing
> verification if write_list and seeds are maintained per-zone.
> It will work automatically with
> * overlapping jobs
> * zone resets in the middle of job
> * different write block sizes
> 
> This also paves the way to "verify before zone reset" and
> "verify zeroes after wp" features.
> 
> Introduce
> *) per-zone seed,
> 	incremented with each reset, so that patterns differ
> *) per-zone random generator state,
> 	reseeded with each zone reset
> *) per-zone write I/O list
> 	linked list is natural for sequential writes

What about conventional zones ? They can be randomly written.

> 
> *) per-zone verify_mutex
> 	This will be used for verify-zeroes-after-wp, definitely.

Zeroes-after-wp is not a given. SMR drives likely report that by default but if
a "format pattern" is set on SAS drives, that is the pattern that will be
returned. So checking for zeroes will not always work. We should always assume
"garbage" after the wp.

> 	Currently it is more a peace of mind member otherwise list
> 	corruption asserts trigger IIRC.
> 	TODO: explain why it is needed.
> 
> Delete ->verify_block -- obsoleted.
> 
> There are also some funny things going on with flushing and resetting
> files in the middle of I/O but non-overlapping case more or less works.

"more or less" ? Is it working or is it not ? Please clarify the exact
improvements and what remains to be fixed.

> 
> Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> ---
>  backend.c | 44 +++++++++++++++++++++++++++++---
>  fio.h     |  2 ++
>  init.c    |  1 +
>  io_u.c    |  2 +-
>  iolog.c   | 24 ++++++++++++------
>  iolog.h   |  1 +
>  verify.c  | 45 ++++++++++++++++++++++++++-------
>  verify.h  |  4 ++-
>  zbd.c     | 75 ++++++++++++++++++++++++++++---------------------------
>  zbd.h     | 20 +++++++++++++--
>  10 files changed, 158 insertions(+), 60 deletions(-)
> 
> diff --git a/backend.c b/backend.c
> index 452975cf..05ca5dc1 100644
> --- a/backend.c
> +++ b/backend.c
> @@ -48,6 +48,7 @@
>  #include "rate-submit.h"
>  #include "helper_thread.h"
>  #include "pshared.h"
> +#include "zbd.h"
>  #include "zone-dist.h"
>  
>  static struct fio_sem *startup_sem;
> @@ -615,7 +616,7 @@ static enum fio_q_status io_u_submit(struct thread_data *td, struct io_u *io_u)
>   * The main verify engine. Runs over the writes we previously submitted,
>   * reads the blocks back in, and checks the crc/md5 of the data.
>   */
> -static void do_verify(struct thread_data *td, uint64_t verify_bytes)
> +static void do_verify(struct thread_data *td, uint64_t verify_bytes, struct fio_file *td_f, struct fio_zone_info *zi, bool sync)
>  {
>  	struct fio_file *f;
>  	struct io_u *io_u;
> @@ -629,8 +630,12 @@ static void do_verify(struct thread_data *td, uint64_t verify_bytes)
>  	 * read from disk.
>  	 */
>  	for_each_file(td, f, i) {
> +		if (td_f && f != td_f)
> +			continue;
>  		if (!fio_file_open(f))
>  			continue;
> +		if (!sync)
> +			continue;

Why are these changes needed ?

>  		if (fio_io_sync(td, f))
>  			break;
>  		if (file_invalidate_cache(td, f))
> @@ -677,7 +682,7 @@ static void do_verify(struct thread_data *td, uint64_t verify_bytes)
>  			if (!io_u)
>  				break;
>  
> -			if (get_next_verify(td, io_u)) {
> +			if (get_next_verify(td, io_u, td_f, zi)) {
>  				put_io_u(td, io_u);
>  				break;
>  			}
> @@ -1516,6 +1521,35 @@ static uint64_t do_dry_run(struct thread_data *td)
>  	return td->bytes_done[DDIR_WRITE] + td->bytes_done[DDIR_TRIM];
>  }
>  
> +static void do_verify_zbd(struct thread_data *td, uint64_t verify_bytes)
> +{
> +	struct fio_file *f;
> +	unsigned int i;
> +
> +	for_each_file(td, f, i) {
> +		struct zoned_block_device_info *zbd = f->zbd_info;
> +		bool sync = true;
> +
> +		if (!zbd)
> +			continue;
> +
> +		for (uint32_t z = f->min_zone; z < f->max_zone; z++) {
> +			struct fio_zone_info *zi = &zbd->zone_info[z];
> +
> +			if (!zbd_zone_swr(zi))
> +				continue;
> +
> +			if (pthread_mutex_trylock(&zi->verify_mutex) != 0) {
> +				/* Someone else is verifying this zone. */
> +				continue;
> +			}
> +			do_verify(td, verify_bytes, f, zi, sync);
> +			pthread_mutex_unlock(&zi->verify_mutex);
> +			sync = false;
> +		}
> +	}
> +}

Can't we move this to zbd.c ?

> +
>  struct fork_data {
>  	struct thread_data *td;
>  	struct sk_out *sk_out;
> @@ -1839,7 +1873,11 @@ static void *thread_main(void *data)
>  
>  		fio_gettime(&td->start, NULL);
>  
> -		do_verify(td, verify_bytes);
> +		if (td->o.zone_mode == ZONE_MODE_ZBD) {
> +			do_verify_zbd(td, verify_bytes);
> +		} else {
> +			do_verify(td, verify_bytes, NULL, NULL, true);
> +		}

No need for the "{}" brackets.

>  
>  		/*
>  		 * See comment further up for why this is done here.
> diff --git a/fio.h b/fio.h
> index 20ca80e2..42df7a50 100644
> --- a/fio.h
> +++ b/fio.h
> @@ -140,6 +140,7 @@ enum {
>  	FIO_RAND_POISSON2_OFF,
>  	FIO_RAND_POISSON3_OFF,
>  	FIO_RAND_PRIO_CMDS,
> +	FIO_RAND_ZBD,

FIO_RAND_ZBD_VERIFY ?

>  	FIO_RAND_NR_OFFS,
>  };
>  
> @@ -256,6 +257,7 @@ struct thread_data {
>  	struct frand_state buf_state;
>  	struct frand_state buf_state_prev;
>  	struct frand_state dedupe_state;
> +	struct frand_state zbd_state;

A more explicit name may be better. Something like "zone_verify_state" or
"zbd_verify_state" ?

>  	struct frand_state zone_state;
>  	struct frand_state prio_state;
>  
> diff --git a/init.c b/init.c
> index b5315334..d41a23ff 100644
> --- a/init.c
> +++ b/init.c
> @@ -1029,6 +1029,7 @@ static void td_fill_rand_seeds_internal(struct thread_data *td, bool use64)
>  	init_rand_seed(&td->poisson_state[1], td->rand_seeds[FIO_RAND_POISSON2_OFF], 0);
>  	init_rand_seed(&td->poisson_state[2], td->rand_seeds[FIO_RAND_POISSON3_OFF], 0);
>  	init_rand_seed(&td->dedupe_state, td->rand_seeds[FIO_DEDUPE_OFF], false);
> +	init_rand_seed(&td->zbd_state, td->rand_seeds[FIO_RAND_ZBD], use64);
>  	init_rand_seed(&td->zone_state, td->rand_seeds[FIO_RAND_ZONE_OFF], false);
>  	init_rand_seed(&td->prio_state, td->rand_seeds[FIO_RAND_PRIO_CMDS], false);
>  
> diff --git a/io_u.c b/io_u.c
> index 18e94617..3cd7fb71 100644
> --- a/io_u.c
> +++ b/io_u.c
> @@ -1610,7 +1610,7 @@ static bool check_get_verify(struct thread_data *td, struct io_u *io_u)
>  			get_verify = 1;
>  		}
>  
> -		if (get_verify && !get_next_verify(td, io_u)) {
> +		if (get_verify && !get_next_verify(td, io_u, NULL, NULL)) {
>  			td->verify_batch--;
>  			return true;
>  		}
> diff --git a/iolog.c b/iolog.c
> index 917a446c..732861a8 100644
> --- a/iolog.c
> +++ b/iolog.c
> @@ -19,6 +19,7 @@
>  #include "smalloc.h"
>  #include "blktrace.h"
>  #include "pshared.h"
> +#include "zbd.h"
>  
>  #include <netinet/in.h>
>  #include <netinet/tcp.h>
> @@ -231,6 +232,7 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
>  	ipo->file = io_u->file;
>  	ipo->offset = io_u->offset;
>  	ipo->len = io_u->buflen;
> +	ipo->seed = io_u->rand_seed;
>  	ipo->numberio = io_u->numberio;
>  	ipo->flags = IP_F_IN_FLIGHT;
>  
> @@ -241,12 +243,20 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
>  		td->trim_entries++;
>  	}
>  
> -	/*
> -	 * Only sort writes if we don't have a random map in which case we need
> -	 * to check for duplicate blocks and drop the old one, which we rely on
> -	 * the rb insert/lookup for handling.
> -	 */
> -	if (file_randommap(td, ipo->file)) {
> +	if (td->o.zone_mode == ZONE_MODE_ZBD) {
> +		struct fio_file *f = ipo->file;
> +		uint32_t z = zbd_zone_idx(f, ipo->offset);
> +		struct fio_zone_info *zi = &f->zbd_info->zone_info[z];

It looks like these 2 lines are now being repeated a lot. Maybe introduce an
inline helper "inline struct fio_zone_info *get_zone(f, offset)" ? That would
cleanup the code here.
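
E.g. (minimal sketch, assuming the existing zbd_zone_idx() helper):

	static inline struct fio_zone_info *get_zone(const struct fio_file *f,
						     uint64_t offset)
	{
		return &f->zbd_info->zone_info[zbd_zone_idx(f, offset)];
	}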

> +
> +		flist_add_tail(&ipo->list, &zi->write_list);
> +		ipo->flags |= IP_F_ONLIST;
> +		return;

This change may not be necessary at all. You could force the rbtree use for
zonemode=zbd, which will get your the sorting, so an easy access to all IOs for
a particular zone by searching from the zone start offset up to zone start+zone
size. That will remove a lot of added "if (td->o.zone_mode == ZONE_MODE_ZBD) {"
and remove the need for per zone verify list & mutex.
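
I.e. something like this in log_io_piece() (untested sketch):

	/* untested: always take the sorted rbtree path for zonemode=zbd */
	if (td->o.zone_mode != ZONE_MODE_ZBD && file_randommap(td, ipo->file)) {
		INIT_FLIST_HEAD(&ipo->list);
		flist_add_tail(&ipo->list, &td->io_hist_list);
		ipo->flags |= IP_F_ONLIST;
		return;
	}
	/* otherwise fall through to the sorted rbtree insert */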

> +	} else if (file_randommap(td, ipo->file)) {
> +		/*
> +		 * Only sort writes if we don't have a random map in which case
> +		 * we need to check for duplicate blocks and drop the old one,
> +		 * which we rely on the rb insert/lookup for handling.
> +		 */
>  		INIT_FLIST_HEAD(&ipo->list);
>  		flist_add_tail(&ipo->list, &td->io_hist_list);
>  		ipo->flags |= IP_F_ONLIST;
> @@ -322,7 +332,7 @@ void unlog_io_piece(struct thread_data *td, struct io_u *io_u)
>  
>  	if (ipo->flags & IP_F_ONRB)
>  		rb_erase(&ipo->rb_node, &td->io_hist_tree);
> -	else if (ipo->flags & IP_F_ONLIST)
> +	else
>  		flist_del(&ipo->list);

With your above change, both cases set IP_F_ONLIST, so why this change ?

>  
>  	free(ipo);
> diff --git a/iolog.h b/iolog.h
> index 981081f9..7eddb8e0 100644
> --- a/iolog.h
> +++ b/iolog.h
> @@ -211,6 +211,7 @@ struct io_piece {
>  		struct fio_file *file;
>  	};
>  	unsigned long long offset;
> +	uint64_t seed;
>  	unsigned short numberio;
>  	unsigned long len;
>  	unsigned int flags;
> diff --git a/verify.c b/verify.c
> index b7fa6693..025e3eb0 100644
> --- a/verify.c
> +++ b/verify.c
> @@ -11,6 +11,7 @@
>  #include "fio.h"
>  #include "verify.h"
>  #include "trim.h"
> +#include "zbd.h"
>  #include "lib/rand.h"
>  #include "lib/hweight.h"
>  #include "lib/pattern.h"
> @@ -54,7 +55,16 @@ void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
>  	if (!o->verify_pattern_bytes) {
>  		dprint(FD_VERIFY, "fill random bytes len=%u\n", len);
>  
> -		if (!use_seed) {
> +		if (use_seed) {
> +		} else if (td->o.zone_mode == ZONE_MODE_ZBD) {
> +			struct fio_file *f = io_u->file;
> +			uint32_t z = zbd_zone_idx(f, io_u->offset);
> +			struct fio_zone_info *zi = &f->zbd_info->zone_info[z];
> +
> +			seed = __rand(&zi->rand_state);
> +			if (sizeof(int) != sizeof(long *))
> +				seed *= __rand(&zi->rand_state);
> +		} else {
>  			seed = __rand(&td->verify_state);
>  			if (sizeof(int) != sizeof(long *))
>  				seed *= (unsigned long)__rand(&td->verify_state);
> @@ -1291,7 +1301,7 @@ void populate_verify_io_u(struct thread_data *td, struct io_u *io_u)
>  	fill_pattern_headers(td, io_u, 0, 0);
>  }
>  
> -int get_next_verify(struct thread_data *td, struct io_u *io_u)
> +int get_next_verify(struct thread_data *td, struct io_u *io_u, struct fio_file *td_f, struct fio_zone_info *zi)
>  {
>  	struct io_piece *ipo = NULL;
>  
> @@ -1301,7 +1311,26 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
>  	if (io_u->file)
>  		return 0;
>  
> -	if (!RB_EMPTY_ROOT(&td->io_hist_tree)) {
> +	if (zi) {
> +		pthread_mutex_lock(&zi->mutex);
> +		if (!flist_empty(&zi->write_list)) {
> +			ipo = flist_first_entry(&zi->write_list, struct io_piece, list);
> +
> +			/*
> +			 * Ensure that the associated IO has completed
> +			 */
> +			read_barrier();
> +			if (ipo->flags & IP_F_IN_FLIGHT) {
> +				pthread_mutex_unlock(&zi->mutex);
> +				goto nothing;
> +			}
> +
> +			flist_del(&ipo->list);
> +			assert(ipo->flags & IP_F_ONLIST);
> +			ipo->flags &= ~IP_F_ONLIST;
> +		}
> +		pthread_mutex_unlock(&zi->mutex);

can we move this to zbd.c as a helper function get_next_verify_zbd() ?
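
E.g. (sketch, assuming the list handling above moves over verbatim):

	/* zbd.c */
	struct io_piece *get_next_verify_zbd(struct fio_zone_info *zi)
	{
		struct io_piece *ipo = NULL;

		pthread_mutex_lock(&zi->mutex);
		if (!flist_empty(&zi->write_list)) {
			ipo = flist_first_entry(&zi->write_list,
						struct io_piece, list);
			/* ensure that the associated IO has completed */
			read_barrier();
			if (ipo->flags & IP_F_IN_FLIGHT) {
				ipo = NULL;
			} else {
				flist_del(&ipo->list);
				ipo->flags &= ~IP_F_ONLIST;
			}
		}
		pthread_mutex_unlock(&zi->mutex);

		return ipo;
	}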

> +	} else if (!RB_EMPTY_ROOT(&td->io_hist_tree)) {
>  		struct fio_rb_node *n = rb_first(&td->io_hist_tree);
>  
>  		ipo = rb_entry(n, struct io_piece, rb_node);
> @@ -1332,10 +1361,13 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
>  	}
>  
>  	if (ipo) {
> -		td->io_hist_len--;
> +		if (!zi) {
> +			td->io_hist_len--;
> +		}

No need for the curly brackets.

>  
>  		io_u->offset = ipo->offset;
>  		io_u->buflen = ipo->len;
> +		io_u->rand_seed = ipo->seed;
>  		io_u->numberio = ipo->numberio;
>  		io_u->file = ipo->file;
>  		io_u_set(td, io_u, IO_U_F_VER_LIST);
> @@ -1363,11 +1395,6 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
>  		free(ipo);
>  		dprint(FD_VERIFY, "get_next_verify: ret io_u %p\n", io_u);
>  
> -		if (!td->o.verify_pattern_bytes) {
> -			io_u->rand_seed = __rand(&td->verify_state);
> -			if (sizeof(int) != sizeof(long *))
> -				io_u->rand_seed *= __rand(&td->verify_state);
> -		}
>  		return 0;
>  	}
>  
> diff --git a/verify.h b/verify.h
> index 539e6f6c..f046d05b 100644
> --- a/verify.h
> +++ b/verify.h
> @@ -7,6 +7,8 @@
>  
>  #define FIO_HDR_MAGIC	0xacca
>  
> +struct fio_zone_info;
> +
>  enum {
>  	VERIFY_NONE = 0,		/* no verification */
>  	VERIFY_HDR_ONLY,		/* verify header only, kept for sake of
> @@ -94,7 +96,7 @@ struct vhdr_xxhash {
>   * Verify helpers
>   */
>  extern void populate_verify_io_u(struct thread_data *, struct io_u *);
> -extern int __must_check get_next_verify(struct thread_data *td, struct io_u *);
> +extern int __must_check get_next_verify(struct thread_data *td, struct io_u *, struct fio_file *, struct fio_zone_info *);
>  extern int __must_check verify_io_u(struct thread_data *, struct io_u **);
>  extern int verify_io_u_async(struct thread_data *, struct io_u **);
>  extern void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len, struct io_u *io_u, uint64_t seed, int use_seed);
> diff --git a/zbd.c b/zbd.c
> index df46da42..c926df15 100644
> --- a/zbd.c
> +++ b/zbd.c
> @@ -118,7 +118,7 @@ int zbd_reset_wp(struct thread_data *td, struct fio_file *f,
>   * @offset: offset in bytes. If this offset is in the first zone_size bytes
>   *	    past the disk size then the index of the sentinel is returned.
>   */
> -static uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
> +uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
>  {
>  	uint32_t zone_idx;
>  
> @@ -130,15 +130,6 @@ static uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
>  	return min(zone_idx, f->zbd_info->nr_zones);
>  }
>  
> -/**
> - * zbd_zone_swr - Test whether a zone requires sequential writes
> - * @z: zone info pointer.
> - */
> -static inline bool zbd_zone_swr(struct fio_zone_info *z)
> -{
> -	return z->type == ZBD_ZONE_TYPE_SWR;
> -}
> -
>  /**
>   * zbd_zone_full - verify whether a minimum number of bytes remain in a zone
>   * @f: file pointer.
> @@ -499,6 +490,11 @@ out:
>  	return ret;
>  }
>  
> +static inline bool td_use64(const struct thread_data *td)
> +{
> +	return td->o.random_generator == FIO_RAND_GEN_TAUSWORTHE64;
> +}
> +
>  /*
>   * Allocate zone information and store it into f->zbd_info if zonemode=zbd.
>   *
> @@ -509,6 +505,7 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
>  	struct zoned_block_device_info *zbd;
>  	enum zbd_zoned_model zbd_model;
>  	pthread_mutexattr_t attr;
> +	uint64_t diff_seed;
>  	int ret;
>  
>  	assert(td->o.zone_mode == ZONE_MODE_ZBD);
> @@ -543,6 +540,23 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
>  	pthread_mutexattr_init(&attr);
>  	pthread_mutexattr_setpshared(&attr, true);
>  	pthread_mutex_init(&zbd->mutex, &attr);
> +
> +	diff_seed = td_use64(td)
> +		? ~(uint64_t)0 / zbd->nr_zones
> +		: ~(uint32_t)0 / zbd->nr_zones;
> +	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
> +		struct fio_zone_info *zi = &zbd->zone_info[z];
> +
> +		/*
> +		 * Spread zone seeds a bit, they will be incremented
> +		 * with each reset and better stay unique.
> +		 */
> +		zi->seed = __rand(&td->zbd_state) + z * diff_seed;
> +		init_rand_seed(&zi->rand_state, zi->seed, td_use64(td));
> +		INIT_FLIST_HEAD(&zi->write_list);
> +		pthread_mutex_init(&zi->verify_mutex, &attr);
> +	}
> +
>  	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>  	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
>  		struct fio_zone_info *zi = &zbd->zone_info[z];
> @@ -683,13 +697,26 @@ static int zbd_reset_range(struct thread_data *td, struct fio_file *f,
>  	zone_idx_e = zbd_zone_idx(f, offset + length);
>  	ze = &f->zbd_info->zone_info[zone_idx_e];
>  	for (z = zb; z < ze; z++) {
> +		FLIST_HEAD(write_list);
> +
>  		pthread_mutex_lock(&z->mutex);
>  		pthread_mutex_lock(&f->zbd_info->mutex);
>  		f->zbd_info->sectors_with_data -= z->wp - z->start;
>  		pthread_mutex_unlock(&f->zbd_info->mutex);
>  		z->wp = z->start;
> -		z->verify_block = 0;
> +		z->seed++;
> +		init_rand_seed(&z->rand_state, z->seed, td_use64(td));
> +		flist_splice_init(&z->write_list, &write_list);
>  		pthread_mutex_unlock(&z->mutex);
> +
> +		while (!flist_empty(&write_list)) {
> +			struct io_piece *ipo = flist_first_entry(&write_list, struct io_piece, list);
> +
> +			/* Data "loss"... */

Why not force verify to run here to avoid the loss and verify failure ?

> +			flist_del(&ipo->list);
> +			assert(ipo->flags & IP_F_ONLIST);
> +			free(ipo);
> +		}
>  	}
>  
>  	td->ts.nr_zone_resets += ze - zb;
> @@ -1142,27 +1169,6 @@ out:
>  	return z;
>  }
>  
> -/* The caller must hold z->mutex. */
> -static struct fio_zone_info *zbd_replay_write_order(struct thread_data *td,
> -						    struct io_u *io_u,
> -						    struct fio_zone_info *z)
> -{
> -	const struct fio_file *f = io_u->file;
> -	const uint32_t min_bs = td->o.min_bs[DDIR_WRITE];
> -
> -	if (!zbd_open_zone(td, io_u, z - f->zbd_info->zone_info)) {
> -		pthread_mutex_unlock(&z->mutex);
> -		z = zbd_convert_to_open_zone(td, io_u);
> -		assert(z);
> -	}
> -
> -	if (z->verify_block * min_bs >= f->zbd_info->zone_size)
> -		log_err("%s: %d * %d >= %llu\n", f->file_name, z->verify_block,
> -			min_bs, (unsigned long long) f->zbd_info->zone_size);
> -	io_u->offset = z->start + z->verify_block++ * min_bs;
> -	return z;
> -}
> -
>  /*
>   * Find another zone for which @io_u fits below the write pointer. Start
>   * searching in zones @zb + 1 .. @zl and continue searching in zones
> @@ -1454,10 +1460,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>  
>  	switch (io_u->ddir) {
>  	case DDIR_READ:
> -		if (td->runstate == TD_VERIFYING) {
> -			zb = zbd_replay_write_order(td, io_u, zb);
> -			goto accept;
> -		}
>  		/*
>  		 * Check that there is enough written data in the zone to do an
>  		 * I/O of at least min_bs B. If there isn't, find a new zone for
> @@ -1532,7 +1534,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
>  		}
>  		/* Reset the zone pointer if necessary */
>  		if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
> -			assert(td->o.verify == VERIFY_NONE);
>  			/*
>  			 * Since previous write requests may have been submitted
>  			 * asynchronously and since we will submit the zone
> diff --git a/zbd.h b/zbd.h
> index fb39fb82..013c08c9 100644
> --- a/zbd.h
> +++ b/zbd.h
> @@ -23,23 +23,29 @@ enum io_u_action {
>   * struct fio_zone_info - information about a single ZBD zone
>   * @start: zone start location (bytes)
>   * @wp: zone write pointer location (bytes)
> - * @verify_block: number of blocks that have been verified for this zone
>   * @mutex: protects the modifiable members in this structure
>   * @type: zone type (BLK_ZONE_TYPE_*)
>   * @cond: zone state (BLK_ZONE_COND_*)
>   * @open: whether or not this zone is currently open. Only relevant if
>   *		max_open_zones > 0.
>   * @reset_zone: whether or not this zone should be reset before writing to it
> + * @seed:
> + * @rand_state:
> + * @write_list:
> + * @verify_mutex:
>   */
>  struct fio_zone_info {
>  	pthread_mutex_t		mutex;
>  	uint64_t		start;
>  	uint64_t		wp;
> -	uint32_t		verify_block;
>  	enum zbd_zone_type	type:2;
>  	enum zbd_zone_cond	cond:4;
>  	unsigned int		open:1;
>  	unsigned int		reset_zone:1;
> +	uint64_t		seed;
> +	struct frand_state	rand_state;
> +	struct flist_head	write_list;
> +	pthread_mutex_t		verify_mutex;
>  };
>  
>  /**
> @@ -89,6 +95,7 @@ enum fio_ddir zbd_adjust_ddir(struct thread_data *td, struct io_u *io_u,
>  			      enum fio_ddir ddir);
>  enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u);
>  char *zbd_write_status(const struct thread_stat *ts);
> +uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset);
>  
>  static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
>  {
> @@ -107,4 +114,13 @@ static inline void zbd_put_io_u(struct io_u *io_u)
>  	}
>  }
>  
> +/**
> + * zbd_zone_swr - Test whether a zone requires sequential writes
> + * @z: zone info pointer.
> + */
> +static inline bool zbd_zone_swr(struct fio_zone_info *z)
> +{
> +	return z->type == ZBD_ZONE_TYPE_SWR;
> +}
> +

This change is unrelated to this patch's goal. Please make it a separate patch.

>  #endif /* FIO_ZBD_H */
> 


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/9] zbd: support verification
  2020-05-05 17:59   ` Alexey Dobriyan
@ 2020-05-11  2:01     ` Damien Le Moal
  0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2020-05-11  2:01 UTC (permalink / raw)
  To: Alexey Dobriyan, axboe; +Cc: fio

On 2020/05/06 2:59, Alexey Dobriyan wrote:
> On Tue, May 05, 2020 at 08:56:34PM +0300, Alexey Dobriyan wrote:
>> Verification is not supported in ZBD mode which is codified in this
>> assert:
>>
>> 	if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
>> 		assert(td->o.verify == VERIFY_NONE);
> 
> This is RFC for now, not for merging, there rest are.

Then send it separately please.

> 
> And no, patches haven't been lost, 9/9 is debugging as usual...

Then please do not include it in the series numbering.

Your patch series is still missing a cover letter too.

Thanks.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/9] zbd: support verification
  2020-05-11  1:59   ` Damien Le Moal
@ 2020-05-11 16:00     ` Alexey Dobriyan
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-11 16:00 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio

On Mon, May 11, 2020 at 01:59:44AM +0000, Damien Le Moal wrote:
> On 2020/05/06 2:57, Alexey Dobriyan wrote:
> > Verification is not supported in ZBD mode which is codified in this
> > assert:
> > 
> > 	if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
> > 		assert(td->o.verify == VERIFY_NONE);
> 
> This is not correct. Verification works just fine. The above assert is there
> because if verify is enabled, zone reset is not done and full zone are ignored:

OK, it was too harsh.

One can't verify that the last generation of data is there in the case of
multiple zone resets: data is either not written or written exactly once.
In this sense zonemode=zbd + verify= has a large gap in functionality.

> static bool zbd_open_zone(struct thread_data *td, const struct io_u *io_u,
>                           uint32_t zone_idx)
> {
> 	...
> 
>         /*
>          * Skip full zones with data verification enabled because resetting a
>          * zone causes data loss and hence causes verification to fail.
>          */
>         if (td->o.verify != VERIFY_NONE && zbd_zone_full(f, z, min_bs))
>                 return false;
> 	...
> 
> There are many test cases in t/zbd/test-zbd-support that have the verify option
> to test it. This support is likely not perfect and may have problems, but saying
> that it is not supported is an overstatement.
> 
> > However, sequential write property is excellent for actually doing
> > verification if write_list and seeds are maintained per-zone.
> > It will work automatically with
> > * overlapping jobs
> > * zone resets in the middle of job
> > * different write block sizes
> > 
> > This also paves the way to "verify before zone reset" and
> > "verify zeroes after wp" features.
> > 
> > Introduce
> > *) per-zone seed,
> > 	incremented with each reset, so that patterns differ
> > *) per-zone random generator state,
> > 	reseeded with each zone reset
> > *) per-zone write I/O list
> > 	linked list is natural for sequential writes
> 
> What about conventional zones ? They can be randomly writen.

They can, but zbd_adjust_block() and everything below treats
conventional zones as a random-access block device, so verification
goes through the RB tree or the global write list.

The per-zone ->write_list is used only for sequential zones.

> > *) per-zone verify_mutex
> > 	This will be used for verify-zeroes-after-wp, definitely.
> 
> Zeroes-after-wp is not a given. SMR drives likely report that by default but if
> a "format pattern" is set on SAS drives, that is the pattern that will be
> returned. So checking for zeroes will not always work. We should always assume
> "garbage" after the wp.

In general it is not a given, but a concrete piece of hardware can guarantee
zeroes (simply to avoid the risk of exposing old data), and of course
developers would want to test that if the hardware does it.

> > 	Currently it is more a peace of mind member otherwise list
> > 	corruption asserts trigger IIRC.
> > 	TODO: explain why it is needed.
> > 
> > Delete ->verify_block -- obsoleted.
> > 
> > There are also some funny things going on with flushing and resetting
> > files in the middle of I/O but non-overlapping case more or less works.
> 
> "more or less" ? Is it working or is it not ? Please clarify the exact
> improvements and what remains to be fixed.

It asserts in the multithreaded case with lines like "f->fd == -1".
I don't claim to have figured out this whole open/reopen/close file area.

> > @@ -629,8 +630,12 @@ static void do_verify(struct thread_data *td, uint64_t verify_bytes)
> >  	 * read from disk.
> >  	 */
> >  	for_each_file(td, f, i) {
> > +		if (td_f && f != td_f)
> > +			continue;
> >  		if (!fio_file_open(f))
> >  			continue;
> > +		if (!sync)
> > +			continue;
> 
> Why are these changes needed ?

To avoid thousands of cache flushes: with tens of thousands of zones per
drive, syncing once per zone adds up quickly, so do_verify() now flushes
only for the first zone of each file. See the "sync = false" line below.

> > @@ -677,7 +682,7 @@ static void do_verify(struct thread_data *td, uint64_t verify_bytes)
> >  			if (!io_u)
> >  				break;
> >  
> > -			if (get_next_verify(td, io_u)) {
> > +			if (get_next_verify(td, io_u, td_f, zi)) {
> >  				put_io_u(td, io_u);
> >  				break;
> >  			}
> > @@ -1516,6 +1521,35 @@ static uint64_t do_dry_run(struct thread_data *td)
> >  	return td->bytes_done[DDIR_WRITE] + td->bytes_done[DDIR_TRIM];
> >  }
> >  
> > +static void do_verify_zbd(struct thread_data *td, uint64_t verify_bytes)
> > +{
> > +	struct fio_file *f;
> > +	unsigned int i;
> > +
> > +	for_each_file(td, f, i) {
> > +		struct zoned_block_device_info *zbd = f->zbd_info;
> > +		bool sync = true;
> > +
> > +		if (!zbd)
> > +			continue;
> > +
> > +		for (uint32_t z = f->min_zone; z < f->max_zone; z++) {
> > +			struct fio_zone_info *zi = &zbd->zone_info[z];
> > +
> > +			if (!zbd_zone_swr(zi))
> > +				continue;
> > +
> > +			if (pthread_mutex_trylock(&zi->verify_mutex) != 0) {
> > +				/* Someone else is verifying this zone. */
> > +				continue;
> > +			}
> > +			do_verify(td, verify_bytes, f, zi, sync);
> > +			pthread_mutex_unlock(&zi->verify_mutex);
> > +			sync = false;
> > +		}
> > +	}
> > +}
> 
> Can't we move this to zbd.c ?

Yeah, why not.

> >  struct fork_data {
> >  	struct thread_data *td;
> >  	struct sk_out *sk_out;
> > @@ -1839,7 +1873,11 @@ static void *thread_main(void *data)
> >  
> >  		fio_gettime(&td->start, NULL);
> >  
> > -		do_verify(td, verify_bytes);
> > +		if (td->o.zone_mode == ZONE_MODE_ZBD) {
> > +			do_verify_zbd(td, verify_bytes);
> > +		} else {
> > +			do_verify(td, verify_bytes, NULL, NULL, true);
> > +		}
> 
> No need for the "{}" brackets.

Argh.

> > --- a/fio.h
> > +++ b/fio.h
> > @@ -140,6 +140,7 @@ enum {
> >  	FIO_RAND_POISSON2_OFF,
> >  	FIO_RAND_POISSON3_OFF,
> >  	FIO_RAND_PRIO_CMDS,
> > +	FIO_RAND_ZBD,
> 
> FIO_RAND_ZBD_VERIFY ?

Eeeh...

> > @@ -256,6 +257,7 @@ struct thread_data {
> >  	struct frand_state buf_state;
> >  	struct frand_state buf_state_prev;
> >  	struct frand_state dedupe_state;
> > +	struct frand_state zbd_state;
> 
> A more explicit name may be better. Something like "zone_verify_state" or
> "zbd_verify_state" ?
> 
> >  	struct frand_state zone_state;
> >  	struct frand_state prio_state;
> >  
> > diff --git a/init.c b/init.c
> > index b5315334..d41a23ff 100644
> > --- a/init.c
> > +++ b/init.c
> > @@ -1029,6 +1029,7 @@ static void td_fill_rand_seeds_internal(struct thread_data *td, bool use64)
> >  	init_rand_seed(&td->poisson_state[1], td->rand_seeds[FIO_RAND_POISSON2_OFF], 0);
> >  	init_rand_seed(&td->poisson_state[2], td->rand_seeds[FIO_RAND_POISSON3_OFF], 0);
> >  	init_rand_seed(&td->dedupe_state, td->rand_seeds[FIO_DEDUPE_OFF], false);
> > +	init_rand_seed(&td->zbd_state, td->rand_seeds[FIO_RAND_ZBD], use64);
> >  	init_rand_seed(&td->zone_state, td->rand_seeds[FIO_RAND_ZONE_OFF], false);
> >  	init_rand_seed(&td->prio_state, td->rand_seeds[FIO_RAND_PRIO_CMDS], false);
> >  
> > diff --git a/io_u.c b/io_u.c
> > index 18e94617..3cd7fb71 100644
> > --- a/io_u.c
> > +++ b/io_u.c
> > @@ -1610,7 +1610,7 @@ static bool check_get_verify(struct thread_data *td, struct io_u *io_u)
> >  			get_verify = 1;
> >  		}
> >  
> > -		if (get_verify && !get_next_verify(td, io_u)) {
> > +		if (get_verify && !get_next_verify(td, io_u, NULL, NULL)) {
> >  			td->verify_batch--;
> >  			return true;
> >  		}
> > diff --git a/iolog.c b/iolog.c
> > index 917a446c..732861a8 100644
> > --- a/iolog.c
> > +++ b/iolog.c
> > @@ -19,6 +19,7 @@
> >  #include "smalloc.h"
> >  #include "blktrace.h"
> >  #include "pshared.h"
> > +#include "zbd.h"
> >  
> >  #include <netinet/in.h>
> >  #include <netinet/tcp.h>
> > @@ -231,6 +232,7 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
> >  	ipo->file = io_u->file;
> >  	ipo->offset = io_u->offset;
> >  	ipo->len = io_u->buflen;
> > +	ipo->seed = io_u->rand_seed;
> >  	ipo->numberio = io_u->numberio;
> >  	ipo->flags = IP_F_IN_FLIGHT;
> >  
> > @@ -241,12 +243,20 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
> >  		td->trim_entries++;
> >  	}
> >  
> > -	/*
> > -	 * Only sort writes if we don't have a random map in which case we need
> > -	 * to check for duplicate blocks and drop the old one, which we rely on
> > -	 * the rb insert/lookup for handling.
> > -	 */
> > -	if (file_randommap(td, ipo->file)) {
> > +	if (td->o.zone_mode == ZONE_MODE_ZBD) {
> > +		struct fio_file *f = ipo->file;
> > +		uint32_t z = zbd_zone_idx(f, ipo->offset);
> > +		struct fio_zone_info *zi = &f->zbd_info->zone_info[z];
> 
> It looks like these 2 lines are now being repeated a lot. Maybe introduce an
> inline helper "inline struct fio_zone_info *get_zone(f, offset)" ? That would
> cleanup the code here.

Indeed. I want to do s/zbd_info/zbd/g as well. Typing it becomes boring
really fast.

> > +		flist_add_tail(&ipo->list, &zi->write_list);
> > +		ipo->flags |= IP_F_ONLIST;
> > +		return;
> 
> This change may not be necessary at all. You could force the rbtree use for
> zonemode=zbd, which will get your the sorting, so an easy access to all IOs for
> a particular zone by searching from the zone start offset up to zone start+zone
> size. That will remove a lot of added "if (td->o.zone_mode == ZONE_MODE_ZBD) {"
> and remove the need for per zone verify list & mutex.

The rbtree path has this scary code for the overwriting case:

	 if (overlap) {
                        dprint(FD_IO, "iolog: overlap %llu/%lu, %llu/%lu\n",
                                __ipo->offset, __ipo->len,
                                ipo->offset, ipo->len);

Correct me, but if a small write overwrites a big write, then the whole big
write is tossed from the rbtree and the difference is not verified (for no
reason, because the data is still there).

> > +	} else if (file_randommap(td, ipo->file)) {
> > +		/*
> > +		 * Only sort writes if we don't have a random map in which case
> > +		 * we need to check for duplicate blocks and drop the old one,
> > +		 * which we rely on the rb insert/lookup for handling.
> > +		 */
> >  		INIT_FLIST_HEAD(&ipo->list);
> >  		flist_add_tail(&ipo->list, &td->io_hist_list);
> >  		ipo->flags |= IP_F_ONLIST;
> > @@ -322,7 +332,7 @@ void unlog_io_piece(struct thread_data *td, struct io_u *io_u)
> >  
> >  	if (ipo->flags & IP_F_ONRB)
> >  		rb_erase(&ipo->rb_node, &td->io_hist_tree);
> > -	else if (ipo->flags & IP_F_ONLIST)
> > +	else
> >  		flist_del(&ipo->list);
> 
> With your above change, both cases set IP_F_ONLIST, so why this change ?

Well, an io_piece is either on the rbtree or on a write list (global or
per-zone), so it is either rb_erase or flist_del.

What I want to emphasize is that writes to each zone are naturally
ordered, so using a linked list is very natural. Zone resets simply toss
the list, and verification walks it regardless of a) overlapping threads,
b) multiple block sizes, c) partial zone writes.

> > --- a/iolog.h
> > +++ b/iolog.h
> > @@ -211,6 +211,7 @@ struct io_piece {
> >  		struct fio_file *file;
> >  	};
> >  	unsigned long long offset;
> > +	uint64_t seed;
> >  	unsigned short numberio;
> >  	unsigned long len;
> >  	unsigned int flags;
> > diff --git a/verify.c b/verify.c
> > index b7fa6693..025e3eb0 100644
> > --- a/verify.c
> > +++ b/verify.c
> > @@ -11,6 +11,7 @@
> >  #include "fio.h"
> >  #include "verify.h"
> >  #include "trim.h"
> > +#include "zbd.h"
> >  #include "lib/rand.h"
> >  #include "lib/hweight.h"
> >  #include "lib/pattern.h"
> > @@ -54,7 +55,16 @@ void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len,
> >  	if (!o->verify_pattern_bytes) {
> >  		dprint(FD_VERIFY, "fill random bytes len=%u\n", len);
> >  
> > -		if (!use_seed) {
> > +		if (use_seed) {
> > +		} else if (td->o.zone_mode == ZONE_MODE_ZBD) {
> > +			struct fio_file *f = io_u->file;
> > +			uint32_t z = zbd_zone_idx(f, io_u->offset);
> > +			struct fio_zone_info *zi = &f->zbd_info->zone_info[z];
> > +
> > +			seed = __rand(&zi->rand_state);
> > +			if (sizeof(int) != sizeof(long *))
> > +				seed *= __rand(&zi->rand_state);
> > +		} else {
> >  			seed = __rand(&td->verify_state);
> >  			if (sizeof(int) != sizeof(long *))
> >  				seed *= (unsigned long)__rand(&td->verify_state);
> > @@ -1291,7 +1301,7 @@ void populate_verify_io_u(struct thread_data *td, struct io_u *io_u)
> >  	fill_pattern_headers(td, io_u, 0, 0);
> >  }
> >  
> > -int get_next_verify(struct thread_data *td, struct io_u *io_u)
> > +int get_next_verify(struct thread_data *td, struct io_u *io_u, struct fio_file *td_f, struct fio_zone_info *zi)
> >  {
> >  	struct io_piece *ipo = NULL;
> >  
> > @@ -1301,7 +1311,26 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
> >  	if (io_u->file)
> >  		return 0;
> >  
> > -	if (!RB_EMPTY_ROOT(&td->io_hist_tree)) {
> > +	if (zi) {
> > +		pthread_mutex_lock(&zi->mutex);
> > +		if (!flist_empty(&zi->write_list)) {
> > +			ipo = flist_first_entry(&zi->write_list, struct io_piece, list);
> > +
> > +			/*
> > +			 * Ensure that the associated IO has completed
> > +			 */
> > +			read_barrier();
> > +			if (ipo->flags & IP_F_IN_FLIGHT) {
> > +				pthread_mutex_unlock(&zi->mutex);
> > +				goto nothing;
> > +			}
> > +
> > +			flist_del(&ipo->list);
> > +			assert(ipo->flags & IP_F_ONLIST);
> > +			ipo->flags &= ~IP_F_ONLIST;
> > +		}
> > +		pthread_mutex_unlock(&zi->mutex);
> 
> can we move this to zbd.c as a helper function get_next_verify_zbd() ?
> 
> > +	} else if (!RB_EMPTY_ROOT(&td->io_hist_tree)) {
> >  		struct fio_rb_node *n = rb_first(&td->io_hist_tree);
> >  
> >  		ipo = rb_entry(n, struct io_piece, rb_node);
> > @@ -1332,10 +1361,13 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
> >  	}
> >  
> >  	if (ipo) {
> > -		td->io_hist_len--;
> > +		if (!zi) {
> > +			td->io_hist_len--;
> > +		}
> 
> No need for the curly brackets.
> 
> >  
> >  		io_u->offset = ipo->offset;
> >  		io_u->buflen = ipo->len;
> > +		io_u->rand_seed = ipo->seed;
> >  		io_u->numberio = ipo->numberio;
> >  		io_u->file = ipo->file;
> >  		io_u_set(td, io_u, IO_U_F_VER_LIST);
> > @@ -1363,11 +1395,6 @@ int get_next_verify(struct thread_data *td, struct io_u *io_u)
> >  		free(ipo);
> >  		dprint(FD_VERIFY, "get_next_verify: ret io_u %p\n", io_u);
> >  
> > -		if (!td->o.verify_pattern_bytes) {
> > -			io_u->rand_seed = __rand(&td->verify_state);
> > -			if (sizeof(int) != sizeof(long *))
> > -				io_u->rand_seed *= __rand(&td->verify_state);
> > -		}
> >  		return 0;
> >  	}
> >  
> > diff --git a/verify.h b/verify.h
> > index 539e6f6c..f046d05b 100644
> > --- a/verify.h
> > +++ b/verify.h
> > @@ -7,6 +7,8 @@
> >  
> >  #define FIO_HDR_MAGIC	0xacca
> >  
> > +struct fio_zone_info;
> > +
> >  enum {
> >  	VERIFY_NONE = 0,		/* no verification */
> >  	VERIFY_HDR_ONLY,		/* verify header only, kept for sake of
> > @@ -94,7 +96,7 @@ struct vhdr_xxhash {
> >   * Verify helpers
> >   */
> >  extern void populate_verify_io_u(struct thread_data *, struct io_u *);
> > -extern int __must_check get_next_verify(struct thread_data *td, struct io_u *);
> > +extern int __must_check get_next_verify(struct thread_data *td, struct io_u *, struct fio_file *, struct fio_zone_info *);
> >  extern int __must_check verify_io_u(struct thread_data *, struct io_u **);
> >  extern int verify_io_u_async(struct thread_data *, struct io_u **);
> >  extern void fill_verify_pattern(struct thread_data *td, void *p, unsigned int len, struct io_u *io_u, uint64_t seed, int use_seed);
> > diff --git a/zbd.c b/zbd.c
> > index df46da42..c926df15 100644
> > --- a/zbd.c
> > +++ b/zbd.c
> > @@ -118,7 +118,7 @@ int zbd_reset_wp(struct thread_data *td, struct fio_file *f,
> >   * @offset: offset in bytes. If this offset is in the first zone_size bytes
> >   *	    past the disk size then the index of the sentinel is returned.
> >   */
> > -static uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
> > +uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
> >  {
> >  	uint32_t zone_idx;
> >  
> > @@ -130,15 +130,6 @@ static uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset)
> >  	return min(zone_idx, f->zbd_info->nr_zones);
> >  }
> >  
> > -/**
> > - * zbd_zone_swr - Test whether a zone requires sequential writes
> > - * @z: zone info pointer.
> > - */
> > -static inline bool zbd_zone_swr(struct fio_zone_info *z)
> > -{
> > -	return z->type == ZBD_ZONE_TYPE_SWR;
> > -}
> > -
> >  /**
> >   * zbd_zone_full - verify whether a minimum number of bytes remain in a zone
> >   * @f: file pointer.
> > @@ -499,6 +490,11 @@ out:
> >  	return ret;
> >  }
> >  
> > +static inline bool td_use64(const struct thread_data *td)
> > +{
> > +	return td->o.random_generator == FIO_RAND_GEN_TAUSWORTHE64;
> > +}
> > +
> >  /*
> >   * Allocate zone information and store it into f->zbd_info if zonemode=zbd.
> >   *
> > @@ -509,6 +505,7 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
> >  	struct zoned_block_device_info *zbd;
> >  	enum zbd_zoned_model zbd_model;
> >  	pthread_mutexattr_t attr;
> > +	uint64_t diff_seed;
> >  	int ret;
> >  
> >  	assert(td->o.zone_mode == ZONE_MODE_ZBD);
> > @@ -543,6 +540,23 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
> >  	pthread_mutexattr_init(&attr);
> >  	pthread_mutexattr_setpshared(&attr, true);
> >  	pthread_mutex_init(&zbd->mutex, &attr);
> > +
> > +	diff_seed = td_use64(td)
> > +		? ~(uint64_t)0 / zbd->nr_zones
> > +		: ~(uint32_t)0 / zbd->nr_zones;
> > +	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
> > +		struct fio_zone_info *zi = &zbd->zone_info[z];
> > +
> > +		/*
> > +		 * Spread zone seeds a bit, they will be incremented
> > +		 * with each reset and better stay unique.
> > +		 */
> > +		zi->seed = __rand(&td->zbd_state) + z * diff_seed;
> > +		init_rand_seed(&zi->rand_state, zi->seed, td_use64(td));
> > +		INIT_FLIST_HEAD(&zi->write_list);
> > +		pthread_mutex_init(&zi->verify_mutex, &attr);
> > +	}
> > +
> >  	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
> >  	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
> >  		struct fio_zone_info *zi = &zbd->zone_info[z];
> > @@ -683,13 +697,26 @@ static int zbd_reset_range(struct thread_data *td, struct fio_file *f,
> >  	zone_idx_e = zbd_zone_idx(f, offset + length);
> >  	ze = &f->zbd_info->zone_info[zone_idx_e];
> >  	for (z = zb; z < ze; z++) {
> > +		FLIST_HEAD(write_list);
> > +
> >  		pthread_mutex_lock(&z->mutex);
> >  		pthread_mutex_lock(&f->zbd_info->mutex);
> >  		f->zbd_info->sectors_with_data -= z->wp - z->start;
> >  		pthread_mutex_unlock(&f->zbd_info->mutex);
> >  		z->wp = z->start;
> > -		z->verify_block = 0;
> > +		z->seed++;
> > +		init_rand_seed(&z->rand_state, z->seed, td_use64(td));
> > +		flist_splice_init(&z->write_list, &write_list);
> >  		pthread_mutex_unlock(&z->mutex);
> > +
> > +		while (!flist_empty(&write_list)) {
> > +			struct io_piece *ipo = flist_first_entry(&write_list, struct io_piece, list);
> > +
> > +			/* Data "loss"... */
> 
> Why not force verify to run here to avoid the loss and verify failure?

It is a separate feature: verify before reset. We have this patch in-house.
Currently the write list is simply erased so that verification works correctly
in the presence of zone resets and overwrites.

> > +			flist_del(&ipo->list);
> > +			assert(ipo->flags & IP_F_ONLIST);
> > +			free(ipo);
> > +		}
> >  	}
> >  
> >  	td->ts.nr_zone_resets += ze - zb;
> > @@ -1142,27 +1169,6 @@ out:
> >  	return z;
> >  }
> >  
> > -/* The caller must hold z->mutex. */
> > -static struct fio_zone_info *zbd_replay_write_order(struct thread_data *td,
> > -						    struct io_u *io_u,
> > -						    struct fio_zone_info *z)
> > -{
> > -	const struct fio_file *f = io_u->file;
> > -	const uint32_t min_bs = td->o.min_bs[DDIR_WRITE];
> > -
> > -	if (!zbd_open_zone(td, io_u, z - f->zbd_info->zone_info)) {
> > -		pthread_mutex_unlock(&z->mutex);
> > -		z = zbd_convert_to_open_zone(td, io_u);
> > -		assert(z);
> > -	}
> > -
> > -	if (z->verify_block * min_bs >= f->zbd_info->zone_size)
> > -		log_err("%s: %d * %d >= %llu\n", f->file_name, z->verify_block,
> > -			min_bs, (unsigned long long) f->zbd_info->zone_size);
> > -	io_u->offset = z->start + z->verify_block++ * min_bs;
> > -	return z;
> > -}
> > -
> >  /*
> >   * Find another zone for which @io_u fits below the write pointer. Start
> >   * searching in zones @zb + 1 .. @zl and continue searching in zones
> > @@ -1454,10 +1460,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
> >  
> >  	switch (io_u->ddir) {
> >  	case DDIR_READ:
> > -		if (td->runstate == TD_VERIFYING) {
> > -			zb = zbd_replay_write_order(td, io_u, zb);
> > -			goto accept;
> > -		}
> >  		/*
> >  		 * Check that there is enough written data in the zone to do an
> >  		 * I/O of at least min_bs B. If there isn't, find a new zone for
> > @@ -1532,7 +1534,6 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
> >  		}
> >  		/* Reset the zone pointer if necessary */
> >  		if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
> > -			assert(td->o.verify == VERIFY_NONE);
> >  			/*
> >  			 * Since previous write requests may have been submitted
> >  			 * asynchronously and since we will submit the zone
> > diff --git a/zbd.h b/zbd.h
> > index fb39fb82..013c08c9 100644
> > --- a/zbd.h
> > +++ b/zbd.h
> > @@ -23,23 +23,29 @@ enum io_u_action {
> >   * struct fio_zone_info - information about a single ZBD zone
> >   * @start: zone start location (bytes)
> >   * @wp: zone write pointer location (bytes)
> > - * @verify_block: number of blocks that have been verified for this zone
> >   * @mutex: protects the modifiable members in this structure
> >   * @type: zone type (BLK_ZONE_TYPE_*)
> >   * @cond: zone state (BLK_ZONE_COND_*)
> >   * @open: whether or not this zone is currently open. Only relevant if
> >   *		max_open_zones > 0.
> >   * @reset_zone: whether or not this zone should be reset before writing to it
> > + * @seed: per-zone random seed, bumped on every zone reset
> > + * @rand_state: random generator state, (re)seeded from @seed
> > + * @write_list: completed writes queued up for verification
> > + * @verify_mutex: serializes verification of this zone
> >   */
> >  struct fio_zone_info {
> >  	pthread_mutex_t		mutex;
> >  	uint64_t		start;
> >  	uint64_t		wp;
> > -	uint32_t		verify_block;
> >  	enum zbd_zone_type	type:2;
> >  	enum zbd_zone_cond	cond:4;
> >  	unsigned int		open:1;
> >  	unsigned int		reset_zone:1;
> > +	uint64_t		seed;
> > +	struct frand_state	rand_state;
> > +	struct flist_head	write_list;
> > +	pthread_mutex_t		verify_mutex;
> >  };
> >  
> >  /**
> > @@ -89,6 +95,7 @@ enum fio_ddir zbd_adjust_ddir(struct thread_data *td, struct io_u *io_u,
> >  			      enum fio_ddir ddir);
> >  enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u);
> >  char *zbd_write_status(const struct thread_stat *ts);
> > +uint32_t zbd_zone_idx(const struct fio_file *f, uint64_t offset);
> >  
> >  static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
> >  {
> > @@ -107,4 +114,13 @@ static inline void zbd_put_io_u(struct io_u *io_u)
> >  	}
> >  }
> >  
> > +/**
> > + * zbd_zone_swr - Test whether a zone requires sequential writes
> > + * @z: zone info pointer.
> > + */
> > +static inline bool zbd_zone_swr(struct fio_zone_info *z)
> > +{
> > +	return z->type == ZBD_ZONE_TYPE_SWR;
> > +}
> > +
> 
> This change is unrelated to this patch's goal. Please make it a separate patch.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/9] zbd: consolidate zone mutex initialisation
  2020-05-11  1:16   ` Damien Le Moal
@ 2020-05-21 21:44     ` Alexey Dobriyan
  2020-05-21 22:02       ` Damien Le Moal
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-21 21:44 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio

On Mon, May 11, 2020 at 01:16:04AM +0000, Damien Le Moal wrote:
> On 2020/05/06 2:56, Alexey Dobriyan wrote:
> > Initialise everything that is not write pointers coming from device or
> > faked separatedly.
> 
> s/Initialise/Initialize
> s/separatedly/separately

>  	assert(td->o.zone_mode == ZONE_MODE_ZBD);
> > @@ -546,11 +533,25 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
> >  		return -EINVAL;
> >  	}
> >  
> > -	if (ret == 0) {
> > -		f->zbd_info->model = zbd_model;
> > -		f->zbd_info->max_open_zones = td->o.max_open_zones;
> > +	if (ret)
> > +		return ret;
> > +
> > +	zbd = f->zbd_info;
> > +	zbd->model = zbd_model;
> > +	zbd->max_open_zones = td->o.max_open_zones;
> > +
> > +	pthread_mutexattr_init(&attr);
> > +	pthread_mutexattr_setpshared(&attr, true);
> > +	pthread_mutex_init(&zbd->mutex, &attr);
> > +	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
> > +	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
> > +		struct fio_zone_info *zi = &zbd->zone_info[z];
> > +
> > +		pthread_mutex_init(&zi->mutex, &attr);
> >  	}
> > -	return ret;
> > +	pthread_mutexattr_destroy(&attr);
> > +
> > +	return 0;
> >  }
> >  
> >  void zbd_free_zone_info(struct fio_file *f)
> > 
> 
> On a very large drive (e.g. 20TB SMR), we have in excess of 70000 zones. Having
> to loop over all of them again will start to be painful and likely slow down (a
> little) startup. And going forward with ever increasing disk capacity, this will
> get even worse. So not sure if this patch makes much sense.

It separates hw-specific code (reading write pointers) from fio-specific
code.

Iterating can be made faster by going over zones in [->min_zone, ->max_zone)
only, but I didn't look into it.
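
Roughly (a sketch; this assumes ->min_zone/->max_zone are valid per-file
bounds of the working area by the time the mutexes are initialised):

	for (uint32_t z = f->min_zone; z < f->max_zone; z++) {
		struct fio_zone_info *zi = &zbd->zone_info[z];

		pthread_mutex_init(&zi->mutex, &attr);
	}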


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/9] zbd: consolidate zone mutex initialisation
  2020-05-21 21:44     ` Alexey Dobriyan
@ 2020-05-21 22:02       ` Damien Le Moal
  2020-05-21 22:27         ` Alexey Dobriyan
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2020-05-21 22:02 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: axboe, fio

On 2020/05/22 6:44, Alexey Dobriyan wrote:
> On Mon, May 11, 2020 at 01:16:04AM +0000, Damien Le Moal wrote:
>> On 2020/05/06 2:56, Alexey Dobriyan wrote:
>>> Initialise everything that is not write pointers coming from device or
>>> faked separatedly.
>>
>> s/Initialise/Initialize
>> s/separatedly/separately
> 
>>  	assert(td->o.zone_mode == ZONE_MODE_ZBD);
>>> @@ -546,11 +533,25 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
>>>  		return -EINVAL;
>>>  	}
>>>  
>>> -	if (ret == 0) {
>>> -		f->zbd_info->model = zbd_model;
>>> -		f->zbd_info->max_open_zones = td->o.max_open_zones;
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	zbd = f->zbd_info;
>>> +	zbd->model = zbd_model;
>>> +	zbd->max_open_zones = td->o.max_open_zones;
>>> +
>>> +	pthread_mutexattr_init(&attr);
>>> +	pthread_mutexattr_setpshared(&attr, true);
>>> +	pthread_mutex_init(&zbd->mutex, &attr);
>>> +	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>>> +	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
>>> +		struct fio_zone_info *zi = &zbd->zone_info[z];
>>> +
>>> +		pthread_mutex_init(&zi->mutex, &attr);
>>>  	}
>>> -	return ret;
>>> +	pthread_mutexattr_destroy(&attr);
>>> +
>>> +	return 0;
>>>  }
>>>  
>>>  void zbd_free_zone_info(struct fio_file *f)
>>>
>>
>> On a very large drive (e.g. 20TB SMR), we have in excess of 70000 zones. Having
>> to loop over all of them again will start to be painful and likely slow down (a
>> little) startup. And going forward with ever increasing disk capacity, this will
>> get even worse. So not sure if this patch makes much sense.
> 
> It separates hw-specific code (reading write pointers) from fio-specific
> code.

I appreciate the cleanup the patch does, but that does not address my concern
about startup time.

> Iterating can be made faster by going over zones in [->min_zone, ->max_zone)
> only, but I didn't look into it.

That would make things very complicated, as the zbd_info zone array would end
up being sparse, or not all zones in it would be initialized...

What about replacing the mutex initialization helper that loops over all
zones again with one that initializes a single zone's mutex:

init_zone_mutex(zone, &attr)
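
i.e. something like this (sketch only; the exact signature is a guess):

	static void init_zone_mutex(struct fio_zone_info *zi,
				    pthread_mutexattr_t *attr)
	{
		pthread_mutex_init(&zi->mutex, attr);
	}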

That would nicely clean up init_zone_info() & parse_zone_info() too.



-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/9] fio: parse "io_size=1%"
  2020-05-11  1:20   ` Damien Le Moal
@ 2020-05-21 22:13     ` Alexey Dobriyan
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-21 22:13 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio

On Mon, May 11, 2020 at 01:20:32AM +0000, Damien Le Moal wrote:
> On 2020/05/06 2:57, Alexey Dobriyan wrote:
> > The following config:
> > 
> > 	size=1%
> > 	io_size=1%
> > 
> > will generate an essentially infinite loop, because io_size is set to (2^64-1)-1.
> > 
> > Parse io_size as percentage to avoid the bug.
> > 
> > Note: if size% != io_size%, io_size is ignored, but that is a separate bug.
> > 
> > Signed-off-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
> > ---
> >  cconv.c          |  2 ++
> >  options.c        | 17 +++++++++++++++++
> >  server.h         |  2 +-
> >  thread_options.h |  4 ++++
> >  4 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/cconv.c b/cconv.c
> > index 48218dc4..48b0db9e 100644
> > --- a/cconv.c
> > +++ b/cconv.c
> > @@ -102,6 +102,7 @@ void convert_thread_options_to_cpu(struct thread_options *o,
> >  	o->size = le64_to_cpu(top->size);
> >  	o->io_size = le64_to_cpu(top->io_size);
> >  	o->size_percent = le32_to_cpu(top->size_percent);
> > +	o->io_size_percent = le32_to_cpu(top->io_size_percent);
> >  	o->fill_device = le32_to_cpu(top->fill_device);
> >  	o->file_append = le32_to_cpu(top->file_append);
> >  	o->file_size_low = le64_to_cpu(top->file_size_low);
> > @@ -366,6 +367,7 @@ void convert_thread_options_to_net(struct thread_options_pack *top,
> >  	top->iodepth_batch_complete_max = cpu_to_le32(o->iodepth_batch_complete_max);
> >  	top->serialize_overlap = cpu_to_le32(o->serialize_overlap);
> >  	top->size_percent = cpu_to_le32(o->size_percent);
> > +	top->io_size_percent = cpu_to_le32(o->io_size_percent);
> >  	top->fill_device = cpu_to_le32(o->fill_device);
> >  	top->file_append = cpu_to_le32(o->file_append);
> >  	top->ratecycle = cpu_to_le32(o->ratecycle);
> > diff --git a/options.c b/options.c
> > index 47b3b765..3d0c3a84 100644
> > --- a/options.c
> > +++ b/options.c
> > @@ -1475,6 +1475,22 @@ static int str_size_cb(void *data, unsigned long long *__val)
> >  	return 0;
> >  }
> >  
> > +static int str_io_size_cb(void *data, unsigned long long *__val)
> > +{
> > +	struct thread_data *td = cb_data_to_td(data);
> > +	unsigned long long v = *__val;
> > +
> > +	if (parse_is_percent(v)) {
> > +		td->o.io_size = 0;
> > +		td->o.io_size_percent = -1ULL - v;
> > +		dprint(FD_PARSE, "SET io_size_percent %d\n",
> > +					td->o.io_size_percent);
> 
> I do not see where f->io_size is set using td->o.io_size_percent. Isn't it
> needed to set it somewhere in filesetup.c?

Now I recall what bugs me about this code.

It would be correct to use io_size_percent and write this:

	        if (f->io_size == -1ULL)
                        total_size = -1ULL;
                else {
                        if (o->io_size_percent && o->io_size_percent != 100) {
                                uint64_t file_size;

                                file_size = f->io_size + f->file_offset;
                                f->io_size = (file_size *
                                              o->io_size_percent) / 100;
                                if (f->io_size > (file_size - f->file_offset))
                                        f->io_size = file_size - f->file_offset;

                                f->io_size -= (f->io_size % td_min_bs(td));
                        }
                        total_size += f->io_size;
                }

f->io_size is 1M, and a total of 40KB worth of I/O is emitted with this
config, which is arguably not correct. It should be 20MB.

	[global]
	filename=/dev/loop0	# 1GB
	direct=1
	thread
	rw=write
	bs=4096
	ioengine=psync
	[j]
	offset=1M
	size=1M
	io_size=2%

	WRITE: bw=39.1MiB/s (40.0MB/s), 39.1MiB/s-39.1MiB/s (40.0MB/s-40.0MB/s), io=40.0KiB (40.0kB)
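
(Presumably this is what the snippet above computes: file_size = f->io_size +
f->file_offset = 1M + 1M = 2M, and 2% of 2M rounded down to the 4096B block
size is 40KB, which matches the io=40.0KiB figure, whereas 2% of the 1GB
device would be roughly 20MB.)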


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/9] zbd: consolidate zone mutex initialisation
  2020-05-21 22:02       ` Damien Le Moal
@ 2020-05-21 22:27         ` Alexey Dobriyan
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2020-05-21 22:27 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: axboe, fio

On Thu, May 21, 2020 at 10:02:09PM +0000, Damien Le Moal wrote:
> On 2020/05/22 6:44, Alexey Dobriyan wrote:
> > On Mon, May 11, 2020 at 01:16:04AM +0000, Damien Le Moal wrote:
> >> On 2020/05/06 2:56, Alexey Dobriyan wrote:
> >>> Initialise everything that is not write pointers coming from device or
> >>> faked separatedly.
> >>
> >> s/Initialise/Initialize
> >> s/separatedly/separately
> > 
> >>  	assert(td->o.zone_mode == ZONE_MODE_ZBD);
> >>> @@ -546,11 +533,25 @@ static int zbd_create_zone_info(struct thread_data *td, struct fio_file *f)
> >>>  		return -EINVAL;
> >>>  	}
> >>>  
> >>> -	if (ret == 0) {
> >>> -		f->zbd_info->model = zbd_model;
> >>> -		f->zbd_info->max_open_zones = td->o.max_open_zones;
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	zbd = f->zbd_info;
> >>> +	zbd->model = zbd_model;
> >>> +	zbd->max_open_zones = td->o.max_open_zones;
> >>> +
> >>> +	pthread_mutexattr_init(&attr);
> >>> +	pthread_mutexattr_setpshared(&attr, true);
> >>> +	pthread_mutex_init(&zbd->mutex, &attr);
> >>> +	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
> >>> +	for (uint32_t z = 0; z < zbd->nr_zones; z++) {
> >>> +		struct fio_zone_info *zi = &zbd->zone_info[z];
> >>> +
> >>> +		pthread_mutex_init(&zi->mutex, &attr);
> >>>  	}
> >>> -	return ret;
> >>> +	pthread_mutexattr_destroy(&attr);
> >>> +
> >>> +	return 0;
> >>>  }
> >>>  
> >>>  void zbd_free_zone_info(struct fio_file *f)
> >>>
> >>
> >> On a very large drive (e.g. 20TB SMR), we have in excess of 70000 zones. Having
> >> to loop over all of them again will start to be painful and likely slow down (a
> >> little) startup. And going forward with ever increasing disk capacity, this will
> >> get even worse. So not sure if this patch makes much sense.
> > 
> > It separates hw-specific code (reading write pointers) from fio-specific
> > code.
> 
> I appreciate the cleanup the patch does, but that does not address my concern
> about startup time.
> 
> > Iterating can be made faster by going over zones in [->min_zone, ->max_zone)
> > only, but I didn't look into it.
> 
> That would make things very complicated, as the zbd_info zone array would end
> up being sparse, or not all zones in it would be initialized...

There is more: we then don't need to allocate memory for the unused trailing
zones at all. :^)

In the future _this_ loop will do more work:

	->mutex
	->write_mutex (divide and conquer writes and appends)
	second ->wp
	per zone seed
	per zone generator state
	write list

> What about replacing the mutex initialization helper that loops over all
> zones again with one that initializes a single zone's mutex:
> 
> init_zone_mutex(zone, &attr)
> 
> That would nicely clean up init_zone_info() & parse_zone_info() too.

OK, it is too late here to think straight.
I'll send whatever doesn't raise questions in time for the release.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-05-21 22:27 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-05 17:56 [PATCH 1/9] verify: decouple seed generation from buffer fill Alexey Dobriyan
2020-05-05 17:56 ` [PATCH 2/9] zbd: bump ZBD_MAX_OPEN_ZONES Alexey Dobriyan
2020-05-05 17:56 ` [PATCH 3/9] zbd: don't lock zones outside working area Alexey Dobriyan
2020-05-05 17:56 ` [PATCH 4/9] zbd: introduce per job maximum open zones limit Alexey Dobriyan
2020-05-11  1:09   ` Damien Le Moal
2020-05-05 17:56 ` [PATCH 5/9] zbd: make zbd_info->mutex non-recursive Alexey Dobriyan
2020-05-11  1:11   ` Damien Le Moal
2020-05-05 17:56 ` [PATCH 6/9] zbd: consolidate zone mutex initialisation Alexey Dobriyan
2020-05-11  1:16   ` Damien Le Moal
2020-05-21 21:44     ` Alexey Dobriyan
2020-05-21 22:02       ` Damien Le Moal
2020-05-21 22:27         ` Alexey Dobriyan
2020-05-05 17:56 ` [PATCH 7/9] fio: parse "io_size=1%" Alexey Dobriyan
2020-05-11  1:20   ` Damien Le Moal
2020-05-21 22:13     ` Alexey Dobriyan
2020-05-05 17:56 ` [PATCH 8/9] zbd: support verification Alexey Dobriyan
2020-05-05 17:59   ` Alexey Dobriyan
2020-05-11  2:01     ` Damien Le Moal
2020-05-11  1:59   ` Damien Le Moal
2020-05-11 16:00     ` Alexey Dobriyan
2020-05-11  0:56 ` [PATCH 1/9] verify: decouple seed generation from buffer fill Damien Le Moal
