* [PATCH v15 0/8] Add support for zoned device
@ 2023-01-29 10:28 Sam Li
  2023-01-29 10:28 ` [PATCH v15 1/8] include: add zoned device structs Sam Li
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
that are larger than the LBA size. Zones can only be written sequentially, which
reduces write amplification in SSDs, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage
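
As a rough illustration of the sequential-write constraint described above, the
following minimal C sketch models a single zone with a write pointer. The
DemoZone type and the demo_* helpers are hypothetical names made up for this
example; they are not part of QEMU or of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one sequential-write-required zone, in bytes. */
typedef struct {
    uint64_t start; /* first byte of the zone */
    uint64_t cap;   /* number of writable bytes in the zone */
    uint64_t wp;    /* write pointer: the next byte that may be written */
} DemoZone;

/* Accept a write only if it starts at the write pointer and fits. */
static bool demo_zone_write(DemoZone *z, uint64_t offset, uint64_t len)
{
    assert(z != NULL);
    if (offset != z->wp || len > z->start + z->cap - z->wp) {
        return false;
    }
    z->wp += len;
    return true;
}

/* Resetting a zone rewinds the write pointer to the zone start. */
static void demo_zone_reset(DemoZone *z)
{
    assert(z != NULL);
    z->wp = z->start;
}
```

A write that does not begin at the write pointer is rejected, which is why a
guest that is unaware of zones cannot treat such a device as a regular drive
for arbitrary write patterns.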

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code. In its
current state, the virtio-blk device is not aware of ZBDs, so the guest sees
host-managed drives as regular drives that run correctly under the most
common write workloads.

This patch series extends the block layer APIs with the minimum set of zoned
commands necessary to support zoned devices: Report Zones, four zone
management operations, and Zone Append.

There has been a debate on whether to introduce a new zoned_host_device
BlockDriver specifically for zoned devices. In the end, it was decided to stick
with the existing host_device BlockDriver interface and only add new zoned
operations to it. The benefit is that applications such as libvirt that use
QEMU zoned emulation need no further changes, for example to command line
syntax.

The series can be tested on a null_blk device using qemu-io or qemu-iotests.
For example, to test zone report using qemu-io:
$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 \
    -c "zrp offset nr_zones"

v15:
- drop zoned_host_device BlockDriver
- add zoned device option to host_device driver instead of introducing a new
  zoned_host_device BlockDriver [Stefan]

v14:
- address Stefan's comments on probing block sizes

v13:
- add some tracing points for new zone APIs [Dmitry]
- change error handling in zone_mgmt [Damien, Stefan]

v12:
- address review comments
  * drop BLK_ZO_RESET_ALL bit [Damien]
  * fix error messages, style, and typos [Damien, Hannes]

v11:
- address review comments
  * fix possible BLKZONED config compiling warnings [Stefan]
  * fix capacity field compiling warnings on older kernel [Stefan, Damien]

v10:
- address review comments
  * deal with the last small zone case in zone_mgmt operations [Damien]
  * handle the capacity field missing in old kernels (before 5.9) [Damien]
  * use byte unit in block layer to be consistent with QEMU [Eric]
  * fix coding style related problems [Stefan]

v9:
- address review comments
  * specify units of zone commands requests [Stefan]
  * fix some error handling in file-posix [Stefan]
  * introduce zoned_host_device in the commit message [Markus]

v8:
- address review comments
  * solve patch conflicts and merge sysfs helper functions into one patch
  * add cache.direct=on check in config

v7:
- address review comments
  * modify sysfs attribute helper functions
  * move the input validation and error checking into raw_co_zone_* function
  * fix checks in config

v6:
- drop virtio-blk emulation changes
- address Stefan's review comments
  * fix CONFIG_BLKZONED configs in related functions
  * replace reading fd by g_file_get_contents() in get_sysfs_str_val()
  * rewrite documentation for zoned storage

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZonedBlockDevice ioctls

Sam Li (8):
  include: add zoned device structs
  file-posix: introduce helper functions for sysfs attributes
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  qemu-iotests: test new zone operations
  block: add some trace events for new block layer APIs
  docs/zoned-storage: add zoned device documentation

 block.c                                |  19 +
 block/block-backend.c                  | 147 ++++++++
 block/file-posix.c                     | 460 +++++++++++++++++++++++--
 block/io.c                             |  41 +++
 block/raw-format.c                     |  14 +
 block/trace-events                     |   2 +
 docs/devel/zoned-storage.rst           |  43 +++
 docs/system/qemu-block-drivers.rst.inc |   6 +
 include/block/block-common.h           |  43 +++
 include/block/block-io.h               |   7 +
 include/block/block_int-common.h       |  29 ++
 include/block/raw-aio.h                |   6 +-
 include/sysemu/block-backend-io.h      |  18 +
 meson.build                            |   4 +
 qemu-io-cmds.c                         | 149 ++++++++
 tests/qemu-iotests/tests/zoned.out     |  53 +++
 tests/qemu-iotests/tests/zoned.sh      |  86 +++++
 17 files changed, 1092 insertions(+), 35 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

-- 
2.38.1




* [PATCH v15 1/8] include: add zoned device structs
  2023-01-29 10:28 [PATCH v15 0/8] Add support for zoned device Sam Li
@ 2023-01-29 10:28 ` Sam Li
  2023-01-29 10:28 ` [PATCH v15 2/8] file-posix: introduce helper functions for sysfs attributes Sam Li
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 41686810de..211fbc0847 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -58,6 +58,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+    BLK_ZO_OPEN,
+    BLK_ZO_CLOSE,
+    BLK_ZO_FINISH,
+    BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+    BLK_Z_NONE = 0x0, /* Regular block device */
+    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneState {
+    BLK_ZS_NOT_WP = 0x0,
+    BLK_ZS_EMPTY = 0x1,
+    BLK_ZS_IOPEN = 0x2,
+    BLK_ZS_EOPEN = 0x3,
+    BLK_ZS_CLOSED = 0x4,
+    BLK_ZS_RDONLY = 0xD,
+    BLK_ZS_FULL = 0xE,
+    BLK_ZS_OFFLINE = 0xF,
+} BlockZoneState;
+
+typedef enum BlockZoneType {
+    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+    BLK_ZT_SWR = 0x2, /* Sequential writes required */
+    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+    uint64_t start;
+    uint64_t length;
+    uint64_t cap;
+    uint64_t wp;
+    BlockZoneType type;
+    BlockZoneState state;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
     /* in bytes, 0 if irrelevant */
     int cluster_size;
-- 
2.38.1




* [PATCH v15 2/8] file-posix: introduce helper functions for sysfs attributes
  2023-01-29 10:28 [PATCH v15 0/8] Add support for zoned device Sam Li
  2023-01-29 10:28 ` [PATCH v15 1/8] include: add zoned device structs Sam Li
@ 2023-01-29 10:28 ` Sam Li
  2023-01-29 10:28 ` [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Use get_sysfs_str_val() to read a sysfs attribute of the device as a
string, such as its zoned model. get_sysfs_zoned_model() then converts
that string to QEMU's BlockZoneModel type.

Use get_sysfs_long_val() to read zoned device information that is
exposed as a long integer.
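
The string-to-model conversion described above can be sketched in standalone C
as follows. This is only an illustration: DemoZoneModel, demo_chomp() and
demo_parse_zoned_model() are hypothetical names that mirror, but are not, the
helpers added by this patch, and the sketch takes the attribute text as an
argument instead of reading sysfs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's BlockZoneModel. */
typedef enum {
    DEMO_Z_NONE = 0, /* regular block device */
    DEMO_Z_HM = 1,   /* host-managed zoned block device */
    DEMO_Z_HA = 2,   /* host-aware zoned block device */
} DemoZoneModel;

/* Strip one trailing newline in place; sysfs attributes end with '\n'. */
static void demo_chomp(char *s)
{
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    if (len > 0 && s[len - 1] == '\n') {
        s[len - 1] = '\0';
    }
}

/* Return 0 and store the model, or -1 if the string is not recognized. */
static int demo_parse_zoned_model(char *val, DemoZoneModel *model)
{
    assert(model != NULL);
    demo_chomp(val);
    if (strcmp(val, "host-managed") == 0) {
        *model = DEMO_Z_HM;
    } else if (strcmp(val, "host-aware") == 0) {
        *model = DEMO_Z_HA;
    } else if (strcmp(val, "none") == 0) {
        *model = DEMO_Z_NONE;
    } else {
        return -1;
    }
    return 0;
}
```

An unrecognized string is reported as an error rather than mapped to a default,
leaving the fallback policy to the caller.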

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block/file-posix.c               | 122 ++++++++++++++++++++++---------
 include/block/block_int-common.h |   3 +
 2 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index fa227d9d14..43c59c6d56 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1202,64 +1202,112 @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get a sysfs attribute value as character string.
+ */
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+                             char **val) {
 #ifdef CONFIG_LINUX
-    char buf[32];
-    const char *end;
-    char *sysfspath = NULL;
+    g_autofree char *sysfspath = NULL;
     int ret;
-    int sysfd = -1;
-    long max_segments;
+    size_t len;
 
-    if (S_ISCHR(st->st_mode)) {
-        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
-            return ret;
-        }
+    if (!S_ISBLK(st->st_mode)) {
         return -ENOTSUP;
     }
 
-    if (!S_ISBLK(st->st_mode)) {
-        return -ENOTSUP;
+    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+                                major(st->st_rdev), minor(st->st_rdev),
+                                attribute);
+    ret = g_file_get_contents(sysfspath, val, &len, NULL);
+    if (ret == -1) {
+        return -ENOENT;
     }
 
-    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-                                major(st->st_rdev), minor(st->st_rdev));
-    sysfd = open(sysfspath, O_RDONLY);
-    if (sysfd == -1) {
-        ret = -errno;
-        goto out;
+    /* The file is ended with '\n' */
+    char *p;
+    p = *val;
+    if (*(p + len - 1) == '\n') {
+        *(p + len - 1) = '\0';
     }
-    ret = RETRY_ON_EINTR(read(sysfd, buf, sizeof(buf) - 1));
+    return ret;
+#else
+    return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
+{
+    g_autofree char *val = NULL;
+    int ret;
+
+    ret = get_sysfs_str_val(st, "zoned", &val);
     if (ret < 0) {
-        ret = -errno;
-        goto out;
-    } else if (ret == 0) {
-        ret = -EIO;
-        goto out;
+        return ret;
     }
-    buf[ret] = 0;
-    /* The file is ended with '\n', pass 'end' to accept that. */
-    ret = qemu_strtol(buf, &end, 10, &max_segments);
-    if (ret == 0 && end && *end == '\n') {
-        ret = max_segments;
+
+    if (strcmp(val, "host-managed") == 0) {
+        *zoned = BLK_Z_HM;
+    } else if (strcmp(val, "host-aware") == 0) {
+        *zoned = BLK_Z_HA;
+    } else if (strcmp(val, "none") == 0) {
+        *zoned = BLK_Z_NONE;
+    } else {
+        return -ENOTSUP;
+    }
+    return 0;
+}
+
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
+static long get_sysfs_long_val(struct stat *st, const char *attribute)
+{
+#ifdef CONFIG_LINUX
+    g_autofree char *str = NULL;
+    const char *end;
+    long val;
+    int ret;
+
+    ret = get_sysfs_str_val(st, attribute, &str);
+    if (ret < 0) {
+        return ret;
     }
 
-out:
-    if (sysfd != -1) {
-        close(sysfd);
+    /* The trailing '\n' is already stripped, so 'end' must point at '\0'. */
+    ret = qemu_strtol(str, &end, 10, &val);
+    if (ret == 0 && end && *end == '\0') {
+        ret = val;
     }
-    g_free(sysfspath);
     return ret;
 #else
     return -ENOTSUP;
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st)
+{
+#ifdef CONFIG_LINUX
+    int ret;
+
+    if (S_ISCHR(st->st_mode)) {
+        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+            return ret;
+        }
+        return -ENOTSUP;
+    }
+    return get_sysfs_long_val(st, "max_segments");
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
     struct stat st;
+    int ret;
+    BlockZoneModel zoned;
 
     s->needs_alignment = raw_needs_alignment(bs);
     raw_probe_alignment(bs, s->fd, errp);
@@ -1297,6 +1345,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
             bs->bl.max_hw_iov = ret;
         }
     }
+
+    ret = get_sysfs_zoned_model(&st, &zoned);
+    if (ret < 0) {
+        zoned = BLK_Z_NONE;
+    }
+    bs->bl.zoned = zoned;
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 887ace7dbd..57f0612f5e 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -836,6 +836,9 @@ typedef struct BlockLimits {
 
     /* maximum number of iovec elements */
     int max_iov;
+
+    /* device zone model */
+    BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
-- 
2.38.1




* [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-01-29 10:28 [PATCH v15 0/8] Add support for zoned device Sam Li
  2023-01-29 10:28 ` [PATCH v15 1/8] include: add zoned device structs Sam Li
  2023-01-29 10:28 ` [PATCH v15 2/8] file-posix: introduce helper functions for sysfs attributes Sam Li
@ 2023-01-29 10:28 ` Sam Li
  2023-02-06 12:04   ` Stefan Hajnoczi
  2023-02-27 18:20   ` Kevin Wolf
  2023-01-29 10:28 ` [PATCH v15 4/8] raw-format: add zone operations to pass through requests Sam Li
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 17+ messages in thread
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Add zoned device option to host_device BlockDriver. It will be presented only
for zoned host block devices. By adding zone management operations to the
host_device BlockDriver, users can use the new block layer APIs
including Report Zone and four zone management operations
(open, close, finish, reset).

qemu-io uses the new APIs to perform zoned storage commands on the device:
zone_report (zrp), zone_open (zo), zone_close (zc), zone_reset (zrs),
zone_finish (zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 \
    -c "zrp offset nr_zones"
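
Zone management requests take a byte offset and length, and raw_co_zone_mgmt()
in this patch rejects ranges that are not zone-aligned before issuing the
ioctl. The standalone C sketch below shows one way to express that check. The
name demo_zone_range_ok() is hypothetical, and the sketch assumes a
power-of-two zone size so that alignment can be tested with a mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range check for a zone management request, in bytes.
 * The offset must be zone-aligned. The length must be zone-aligned too,
 * unless the range ends exactly at the device capacity, which allows a
 * smaller last zone.
 */
static bool demo_zone_range_ok(int64_t offset, int64_t len,
                               int64_t zone_size, int64_t capacity)
{
    /* Assumption of this sketch: zone_size is a power of two. */
    assert(zone_size > 0 && (zone_size & (zone_size - 1)) == 0);
    int64_t mask = zone_size - 1;

    if (offset < 0 || len < 0 || offset > capacity) {
        return false;
    }
    if (len > capacity - offset) {
        return false; /* range runs past the end of the device */
    }
    if (offset & mask) {
        return false; /* start is not on a zone boundary */
    }
    if (offset + len < capacity && (len & mask)) {
        return false; /* partial zone that is not the last zone */
    }
    return true;
}
```

Comparing len against capacity - offset, rather than computing offset + len
first, keeps the sketch free of signed overflow for large inputs.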

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/block-backend.c             | 147 ++++++++++++++
 block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
 block/io.c                        |  41 ++++
 include/block/block-io.h          |   7 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h           |   6 +-
 include/sysemu/block-backend-io.h |  18 ++
 meson.build                       |   4 +
 qemu-io-cmds.c                    | 149 ++++++++++++++
 9 files changed, 715 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ba7bf1d6bc..a4847b9131 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
     void *iobuf;
     int ret;
     BdrvRequestFlags flags;
+    union {
+        struct {
+            unsigned int *nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report;
+        struct {
+            unsigned long op;
+        } zone_mgmt;
+    };
 } BlkRwCo;
 
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
@@ -1795,6 +1804,144 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
     return ret;
 }
 
+static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
+{
+    BlkAioEmAIOCB *acb = opaque;
+    BlkRwCo *rwco = &acb->rwco;
+
+    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+                                   rwco->zone_report.nr_zones,
+                                   rwco->zone_report.zones);
+    blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+                                unsigned int *nr_zones,
+                                BlockZoneDescriptor  *zones,
+                                BlockCompletionFunc *cb, void *opaque)
+{
+    BlkAioEmAIOCB *acb;
+    Coroutine *co;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+    acb->rwco = (BlkRwCo) {
+        .blk    = blk,
+        .offset = offset,
+        .ret    = NOT_DONE,
+        .zone_report = {
+            .zones = zones,
+            .nr_zones = nr_zones,
+        },
+    };
+    acb->has_returned = false;
+
+    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+    bdrv_coroutine_enter(blk_bs(blk), co);
+
+    acb->has_returned = true;
+    if (acb->rwco.ret != NOT_DONE) {
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
+    }
+
+    return &acb->common;
+}
+
+static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
+{
+    BlkAioEmAIOCB *acb = opaque;
+    BlkRwCo *rwco = &acb->rwco;
+
+    rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
+                                 rwco->offset, acb->bytes);
+    blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                              int64_t offset, int64_t len,
+                              BlockCompletionFunc *cb, void *opaque) {
+    BlkAioEmAIOCB *acb;
+    Coroutine *co;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+    acb->rwco = (BlkRwCo) {
+        .blk    = blk,
+        .offset = offset,
+        .ret    = NOT_DONE,
+        .zone_mgmt = {
+            .op = op,
+        },
+    };
+    acb->bytes = len;
+    acb->has_returned = false;
+
+    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+    bdrv_coroutine_enter(blk_bs(blk), co);
+
+    acb->has_returned = true;
+    if (acb->rwco.ret != NOT_DONE) {
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
+    }
+
+    return &acb->common;
+}
+
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+                                    unsigned int *nr_zones,
+                                    BlockZoneDescriptor *zones)
+{
+    int ret;
+    IO_CODE();
+
+    blk_inc_in_flight(blk); /* increase before waiting */
+    blk_wait_while_drained(blk);
+    if (!blk_is_available(blk)) {
+        blk_dec_in_flight(blk);
+        return -ENOMEDIUM;
+    }
+    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+    blk_dec_in_flight(blk);
+    return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * op is the zone operation;
+ * offset is the byte offset from the start of the zoned device;
+ * len is the maximum number of bytes the command should operate on. It
+ * should be aligned with the device zone size.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+        int64_t offset, int64_t len)
+{
+    int ret;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    blk_wait_while_drained(blk);
+
+    ret = blk_check_byte_request(blk, offset, len);
+    if (ret < 0) {
+        blk_dec_in_flight(blk);
+        return ret;
+    }
+
+    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
+    blk_dec_in_flight(blk);
+    return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
diff --git a/block/file-posix.c b/block/file-posix.c
index 43c59c6d56..b6d88db208 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -68,6 +68,9 @@
 #include <sys/param.h>
 #include <sys/syscall.h>
 #include <sys/vfs.h>
+#if defined(CONFIG_BLKZONED)
+#include <linux/blkzoned.h>
+#endif
 #include <linux/cdrom.h>
 #include <linux/fd.h>
 #include <linux/fs.h>
@@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
             PreallocMode prealloc;
             Error **errp;
         } truncate;
+        struct {
+            unsigned int *nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report;
+        struct {
+            unsigned long op;
+        } zone_mgmt;
     };
 } RawPosixAIOData;
 
@@ -1351,6 +1361,50 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
         zoned = BLK_Z_NONE;
     }
     bs->bl.zoned = zoned;
+    if (zoned != BLK_Z_NONE) {
+        /*
+         * The zoned device must at least have zone size and nr_zones fields.
+         */
+        ret = get_sysfs_long_val(&st, "chunk_sectors");
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Unable to read chunk_sectors "
+                                         "sysfs attribute");
+            goto out;
+        } else if (!ret) {
+            error_setg(errp, "Read 0 from chunk_sectors sysfs attribute");
+            goto out;
+        }
+        bs->bl.zone_size = ret << BDRV_SECTOR_BITS;
+
+        ret = get_sysfs_long_val(&st, "nr_zones");
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Unable to read nr_zones "
+                                         "sysfs attribute");
+            goto out;
+        } else if (!ret) {
+            error_setg(errp, "Read 0 from nr_zones sysfs attribute");
+            goto out;
+        }
+        bs->bl.nr_zones = ret;
+
+        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
+        if (ret > 0) {
+            bs->bl.max_append_sectors = ret >> BDRV_SECTOR_BITS;
+        }
+
+        ret = get_sysfs_long_val(&st, "max_open_zones");
+        if (ret >= 0) {
+            bs->bl.max_open_zones = ret;
+        }
+
+        ret = get_sysfs_long_val(&st, "max_active_zones");
+        if (ret >= 0) {
+            bs->bl.max_active_zones = ret;
+        }
+        return;
+    }
+out:
+    bs->bl.zoned = BLK_Z_NONE;
 }
 
 static int check_for_dasd(int fd)
@@ -1364,6 +1418,23 @@ static int check_for_dasd(int fd)
 #endif
 }
 
+#if defined(CONFIG_BLKZONED)
+/**
+ * Zoned storage needs to be virtualized with the correct physical block size
+ * and logical block size.
+ */
+static int hdev_probe_zoned_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
+{
+    BDRVRawState *s = bs->opaque;
+    int ret;
+
+    ret = probe_logical_blocksize(s->fd, &bsz->log);
+    if (ret < 0) {
+        return ret;
+    }
+    return probe_physical_blocksize(s->fd, &bsz->phys);
+}
+#else
 /**
  * Try to get @bs's logical and physical block size.
  * On success, store them in @bsz and return zero.
@@ -1384,6 +1455,7 @@ static int hdev_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
     }
     return probe_physical_blocksize(s->fd, &bsz->phys);
 }
+#endif
 
 /**
  * Try to get @bs's geometry: cyls, heads, sectors.
@@ -1844,6 +1916,146 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
 }
 #endif
 
+/*
+ * parse_zone - Fill a zone descriptor
+ */
+#if defined(CONFIG_BLKZONED)
+static inline int parse_zone(struct BlockZoneDescriptor *zone,
+                              const struct blk_zone *blkz) {
+    zone->start = blkz->start << BDRV_SECTOR_BITS;
+    zone->length = blkz->len << BDRV_SECTOR_BITS;
+    zone->wp = blkz->wp << BDRV_SECTOR_BITS;
+
+#ifdef HAVE_BLK_ZONE_REP_CAPACITY
+    zone->cap = blkz->capacity << BDRV_SECTOR_BITS;
+#else
+    zone->cap = blkz->len << BDRV_SECTOR_BITS;
+#endif
+
+    switch (blkz->type) {
+    case BLK_ZONE_TYPE_SEQWRITE_REQ:
+        zone->type = BLK_ZT_SWR;
+        break;
+    case BLK_ZONE_TYPE_SEQWRITE_PREF:
+        zone->type = BLK_ZT_SWP;
+        break;
+    case BLK_ZONE_TYPE_CONVENTIONAL:
+        zone->type = BLK_ZT_CONV;
+        break;
+    default:
+        error_report("Unsupported zone type: 0x%x", blkz->type);
+        return -ENOTSUP;
+    }
+
+    switch (blkz->cond) {
+    case BLK_ZONE_COND_NOT_WP:
+        zone->state = BLK_ZS_NOT_WP;
+        break;
+    case BLK_ZONE_COND_EMPTY:
+        zone->state = BLK_ZS_EMPTY;
+        break;
+    case BLK_ZONE_COND_IMP_OPEN:
+        zone->state = BLK_ZS_IOPEN;
+        break;
+    case BLK_ZONE_COND_EXP_OPEN:
+        zone->state = BLK_ZS_EOPEN;
+        break;
+    case BLK_ZONE_COND_CLOSED:
+        zone->state = BLK_ZS_CLOSED;
+        break;
+    case BLK_ZONE_COND_READONLY:
+        zone->state = BLK_ZS_RDONLY;
+        break;
+    case BLK_ZONE_COND_FULL:
+        zone->state = BLK_ZS_FULL;
+        break;
+    case BLK_ZONE_COND_OFFLINE:
+        zone->state = BLK_ZS_OFFLINE;
+        break;
+    default:
+        error_report("Unsupported zone state: 0x%x", blkz->cond);
+        return -ENOTSUP;
+    }
+    return 0;
+}
+#endif
+
+#if defined(CONFIG_BLKZONED)
+static int handle_aiocb_zone_report(void *opaque)
+{
+    RawPosixAIOData *aiocb = opaque;
+    int fd = aiocb->aio_fildes;
+    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
+    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
+    /* zoned block devices use 512-byte sectors */
+    uint64_t sector = aiocb->aio_offset / 512;
+
+    struct blk_zone *blkz;
+    size_t rep_size;
+    unsigned int nrz;
+    int ret, n = 0, i = 0;
+
+    nrz = *nr_zones;
+    rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+    g_autofree struct blk_zone_report *rep = NULL;
+    rep = g_malloc(rep_size);
+
+    blkz = (struct blk_zone *)(rep + 1);
+    while (n < nrz) {
+        memset(rep, 0, rep_size);
+        rep->sector = sector;
+        rep->nr_zones = nrz - n;
+
+        do {
+            ret = ioctl(fd, BLKREPORTZONE, rep);
+        } while (ret != 0 && errno == EINTR);
+        if (ret != 0) {
+            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+                         fd, sector, errno);
+            return -errno;
+        }
+
+        if (!rep->nr_zones) {
+            break;
+        }
+
+        for (i = 0; i < rep->nr_zones; i++, n++) {
+            ret = parse_zone(&zones[n], &blkz[i]);
+            if (ret != 0) {
+                return ret;
+            }
+
+            /* The next report should start after the last zone reported */
+            sector = blkz[i].start + blkz[i].len;
+        }
+    }
+
+    *nr_zones = n;
+    return 0;
+}
+#endif
+
+#if defined(CONFIG_BLKZONED)
+static int handle_aiocb_zone_mgmt(void *opaque)
+{
+    RawPosixAIOData *aiocb = opaque;
+    int fd = aiocb->aio_fildes;
+    uint64_t sector = aiocb->aio_offset / 512;
+    int64_t nr_sectors = aiocb->aio_nbytes / 512;
+    struct blk_zone_range range;
+    int ret;
+
+    /* Execute the operation */
+    range.sector = sector;
+    range.nr_sectors = nr_sectors;
+    do {
+        ret = ioctl(fd, aiocb->zone_mgmt.op, &range);
+    } while (ret != 0 && errno == EINTR);
+
+    return ret;
+}
+#endif
+
 static int handle_aiocb_copy_range(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
@@ -3035,6 +3247,107 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
     }
 }
 
+/*
+ * zone report - Get a zoned block device's information in the form
+ * of an array of zone descriptors.
+ * zones is an array of zone descriptors to hold zone information on reply;
+ * offset can be any byte within the entire size of the device;
+ * nr_zones is the maximum number of zones the command should report.
+ */
+#if defined(CONFIG_BLKZONED)
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                           unsigned int *nr_zones,
+                                           BlockZoneDescriptor *zones) {
+    BDRVRawState *s = bs->opaque;
+    RawPosixAIOData acb;
+
+    acb = (RawPosixAIOData) {
+        .bs         = bs,
+        .aio_fildes = s->fd,
+        .aio_type   = QEMU_AIO_ZONE_REPORT,
+        .aio_offset = offset,
+        .zone_report    = {
+            .nr_zones       = nr_zones,
+            .zones          = zones,
+        },
+    };
+
+    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
+}
+#endif
+
+/*
+ * zone management operations - Execute an operation on a zone
+ */
+#if defined(CONFIG_BLKZONED)
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+        int64_t offset, int64_t len) {
+    BDRVRawState *s = bs->opaque;
+    RawPosixAIOData acb;
+    int64_t zone_size, zone_size_mask;
+    const char *op_name;
+    unsigned long zo;
+    int ret;
+    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
+
+    zone_size = bs->bl.zone_size;
+    zone_size_mask = zone_size - 1;
+    if (offset & zone_size_mask) {
+        error_report("sector offset %" PRId64 " is not aligned to zone size "
+                     "%" PRId64 "", offset / 512, zone_size / 512);
+        return -EINVAL;
+    }
+
+    if (((offset + len) < capacity && len & zone_size_mask) ||
+        offset + len > capacity) {
+        error_report("number of sectors %" PRId64 " is not aligned to zone size"
+                      " %" PRId64 "", len / 512, zone_size / 512);
+        return -EINVAL;
+    }
+
+    switch (op) {
+    case BLK_ZO_OPEN:
+        op_name = "BLKOPENZONE";
+        zo = BLKOPENZONE;
+        break;
+    case BLK_ZO_CLOSE:
+        op_name = "BLKCLOSEZONE";
+        zo = BLKCLOSEZONE;
+        break;
+    case BLK_ZO_FINISH:
+        op_name = "BLKFINISHZONE";
+        zo = BLKFINISHZONE;
+        break;
+    case BLK_ZO_RESET:
+        op_name = "BLKRESETZONE";
+        zo = BLKRESETZONE;
+        break;
+    default:
+        error_report("Unsupported zone op: 0x%x", op);
+        return -ENOTSUP;
+    }
+
+    acb = (RawPosixAIOData) {
+        .bs             = bs,
+        .aio_fildes     = s->fd,
+        .aio_type       = QEMU_AIO_ZONE_MGMT,
+        .aio_offset     = offset,
+        .aio_nbytes     = len,
+        .zone_mgmt  = {
+            .op = zo,
+        },
+    };
+
+    ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
+    if (ret != 0) {
+        ret = -errno;
+        error_report("ioctl %s failed %d", op_name, ret);
+    }
+
+    return ret;
+}
+#endif
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
                 bool blkdev)
@@ -3756,13 +4069,23 @@ static BlockDriver bdrv_host_device = {
     .bdrv_check_perm = raw_check_perm,
     .bdrv_set_perm   = raw_set_perm,
     .bdrv_abort_perm_update = raw_abort_perm_update,
+#ifndef CONFIG_BLKZONED
     .bdrv_probe_blocksizes = hdev_probe_blocksizes,
+#endif
     .bdrv_probe_geometry = hdev_probe_geometry,
 
     /* generic scsi device */
 #ifdef __linux__
     .bdrv_co_ioctl          = hdev_co_ioctl,
 #endif
+
+    /* zoned device */
+#if defined(CONFIG_BLKZONED)
+    /* zone management operations */
+    .bdrv_probe_blocksizes = hdev_probe_zoned_blocksizes,
+    .bdrv_co_zone_report = raw_co_zone_report,
+    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
+#endif
 };
 
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
diff --git a/block/io.c b/block/io.c
index a09a19f7a7..1586e42ab9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3099,6 +3099,47 @@ out:
     return co.ret;
 }
 
+int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
+                        unsigned int *nr_zones,
+                        BlockZoneDescriptor *zones)
+{
+    BlockDriver *drv = bs->drv;
+    CoroutineIOCompletion co = {
+            .coroutine = qemu_coroutine_self(),
+    };
+    IO_CODE();
+
+    bdrv_inc_in_flight(bs);
+    if (!drv || !drv->bdrv_co_zone_report) {
+        co.ret = -ENOTSUP;
+        goto out;
+    }
+    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
+out:
+    bdrv_dec_in_flight(bs);
+    return co.ret;
+}
+
+int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+        int64_t offset, int64_t len)
+{
+    BlockDriver *drv = bs->drv;
+    CoroutineIOCompletion co = {
+            .coroutine = qemu_coroutine_self(),
+    };
+    IO_CODE();
+
+    bdrv_inc_in_flight(bs);
+    if (!drv || !drv->bdrv_co_zone_mgmt) {
+        co.ret = -ENOTSUP;
+        goto out;
+    }
+    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
+out:
+    bdrv_dec_in_flight(bs);
+    return co.ret;
+}
+
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
     IO_CODE();
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 3398351596..10ff212036 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -98,6 +98,13 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
 
 int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
                                   int64_t bytes);
+/* Report zone information of a zoned block device. */
+int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                     unsigned int *nr_zones,
+                                     BlockZoneDescriptor *zones);
+int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+                                   int64_t offset, int64_t len);
+
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
                       int64_t bytes, int64_t *pnum, int64_t *map,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 57f0612f5e..565228d8dd 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -703,6 +703,12 @@ struct BlockDriver {
     int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_load_vmstate)(
         BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
+    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
+            int64_t offset, unsigned int *nr_zones,
+            BlockZoneDescriptor *zones);
+    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
+            int64_t offset, int64_t len);
+
     /* removable device specific */
     bool (*bdrv_is_inserted)(BlockDriverState *bs);
     void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
@@ -839,6 +845,21 @@ typedef struct BlockLimits {
 
     /* device zone model */
     BlockZoneModel zoned;
+
+    /* zone size expressed in bytes */
+    uint32_t zone_size;
+
+    /* total number of zones */
+    uint32_t nr_zones;
+
+    /* maximum sectors of a zone append write operation */
+    int64_t max_append_sectors;
+
+    /* maximum number of open zones */
+    int64_t max_open_zones;
+
+    /* maximum number of active zones */
+    int64_t max_active_zones;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index f8cda9df91..eda6a7a253 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -28,6 +28,8 @@
 #define QEMU_AIO_WRITE_ZEROES 0x0020
 #define QEMU_AIO_COPY_RANGE   0x0040
 #define QEMU_AIO_TRUNCATE     0x0080
+#define QEMU_AIO_ZONE_REPORT  0x0100
+#define QEMU_AIO_ZONE_MGMT    0x0200
 #define QEMU_AIO_TYPE_MASK \
         (QEMU_AIO_READ | \
          QEMU_AIO_WRITE | \
@@ -36,7 +38,9 @@
          QEMU_AIO_DISCARD | \
          QEMU_AIO_WRITE_ZEROES | \
          QEMU_AIO_COPY_RANGE | \
-         QEMU_AIO_TRUNCATE)
+         QEMU_AIO_TRUNCATE | \
+         QEMU_AIO_ZONE_REPORT | \
+         QEMU_AIO_ZONE_MGMT)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 031a27ba10..dc8a4368f0 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -46,6 +46,13 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
                             BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+                                unsigned int *nr_zones,
+                                BlockZoneDescriptor *zones,
+                                BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                              int64_t offset, int64_t len,
+                              BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
                              BlockCompletionFunc *cb, void *opaque);
 void blk_aio_cancel_async(BlockAIOCB *acb);
@@ -166,6 +173,17 @@ int co_wrapper_mixed blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                       int64_t bytes, BdrvRequestFlags flags);
 
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+                                    unsigned int *nr_zones,
+                                    BlockZoneDescriptor *zones);
+int co_wrapper_mixed blk_zone_report(BlockBackend *blk, int64_t offset,
+                                         unsigned int *nr_zones,
+                                         BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                  int64_t offset, int64_t len);
+int co_wrapper_mixed blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                       int64_t offset, int64_t len);
+
 int co_wrapper_mixed blk_pdiscard(BlockBackend *blk, int64_t offset,
                                   int64_t bytes);
 int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
diff --git a/meson.build b/meson.build
index 6d3b665629..a267f74536 100644
--- a/meson.build
+++ b/meson.build
@@ -1962,6 +1962,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('replication').allowed())
 # has_header
 config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
 config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
+config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
 config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
 config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
 config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
@@ -2056,6 +2057,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
 config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM',
                      cc.has_member('struct stat', 'st_atim',
                                    prefix: '#include <sys/stat.h>'))
+config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY',
+                     cc.has_member('struct blk_zone', 'capacity',
+                                   prefix: '#include <linux/blkzoned.h>'))
 
 # has_type
 config_host_data.set('CONFIG_IOVEC',
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 952dc940f1..3a3bad77c3 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1712,6 +1712,150 @@ static const cmdinfo_t flush_cmd = {
     .oneline    = "flush all in-core file state to disk",
 };
 
+static inline int64_t tosector(int64_t bytes)
+{
+    return bytes >> BDRV_SECTOR_BITS;
+}
+
+static int zone_report_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset;
+    unsigned int nr_zones;
+
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    nr_zones = cvtnum(argv[optind]);
+
+    g_autofree BlockZoneDescriptor *zones = NULL;
+    zones = g_new(BlockZoneDescriptor, nr_zones);
+    ret = blk_zone_report(blk, offset, &nr_zones, zones);
+    if (ret < 0) {
+        printf("zone report failed: %s\n", strerror(-ret));
+    } else {
+        for (int i = 0; i < nr_zones; ++i) {
+            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
+                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
+                   "zcond:%u, [type: %u]\n",
+                    tosector(zones[i].start), tosector(zones[i].length),
+                    tosector(zones[i].cap), tosector(zones[i].wp),
+                    zones[i].state, zones[i].type);
+        }
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_report_cmd = {
+    .name = "zone_report",
+    .altname = "zrp",
+    .cfunc = zone_report_f,
+    .argmin = 2,
+    .argmax = 2,
+    .args = "offset number",
+    .oneline = "report zone information",
+};
+
+static int zone_open_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
+    if (ret < 0) {
+        printf("zone open failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_open_cmd = {
+    .name = "zone_open",
+    .altname = "zo",
+    .cfunc = zone_open_f,
+    .argmin = 2,
+    .argmax = 2,
+    .args = "offset len",
+    .oneline = "explicitly open a range of zones in a zoned block device",
+};
+
+static int zone_close_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
+    if (ret < 0) {
+        printf("zone close failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_close_cmd = {
+    .name = "zone_close",
+    .altname = "zc",
+    .cfunc = zone_close_f,
+    .argmin = 2,
+    .argmax = 2,
+    .args = "offset len",
+    .oneline = "close a range of zones in a zoned block device",
+};
+
+static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
+    if (ret < 0) {
+        printf("zone finish failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_finish_cmd = {
+    .name = "zone_finish",
+    .altname = "zf",
+    .cfunc = zone_finish_f,
+    .argmin = 2,
+    .argmax = 2,
+    .args = "offset len",
+    .oneline = "finish a range of zones in a zoned block device",
+};
+
+static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
+    if (ret < 0) {
+        printf("zone reset failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_reset_cmd = {
+    .name = "zone_reset",
+    .altname = "zrs",
+    .cfunc = zone_reset_f,
+    .argmin = 2,
+    .argmax = 2,
+    .args = "offset len",
+    .oneline = "reset a zone write pointer in a zoned block device",
+};
+
 static int truncate_f(BlockBackend *blk, int argc, char **argv);
 static const cmdinfo_t truncate_cmd = {
     .name       = "truncate",
@@ -2504,6 +2648,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
     qemuio_add_command(&aio_write_cmd);
     qemuio_add_command(&aio_flush_cmd);
     qemuio_add_command(&flush_cmd);
+    qemuio_add_command(&zone_report_cmd);
+    qemuio_add_command(&zone_open_cmd);
+    qemuio_add_command(&zone_close_cmd);
+    qemuio_add_command(&zone_finish_cmd);
+    qemuio_add_command(&zone_reset_cmd);
     qemuio_add_command(&truncate_cmd);
     qemuio_add_command(&length_cmd);
     qemuio_add_command(&info_cmd);
-- 
2.38.1
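
As a reading aid for the range checks in raw_co_zone_mgmt() above, the
following standalone sketch restates the same conditions outside QEMU. The
helper name zone_range_check() and its parameters are hypothetical and not
part of the patch. It assumes, as the mask arithmetic in the patch does, that
the zone size is a power of two.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the checks in raw_co_zone_mgmt().
 * All values are in bytes. Returns 0 if the range is acceptable,
 * -EINVAL otherwise.
 */
int zone_range_check(int64_t offset, int64_t len,
                     int64_t zone_size, int64_t capacity)
{
    /* The mask trick is only valid for power-of-two zone sizes. */
    assert(zone_size > 0 && (zone_size & (zone_size - 1)) == 0);
    int64_t zone_size_mask = zone_size - 1;

    /* The range must start on a zone boundary. */
    if (offset & zone_size_mask) {
        return -EINVAL;
    }
    /* The range must not run past the end of the device. */
    if (offset + len > capacity) {
        return -EINVAL;
    }
    /*
     * The length must be a whole number of zones, unless the range ends
     * exactly at the device capacity (the last zone may be smaller).
     */
    if (offset + len < capacity && (len & zone_size_mask)) {
        return -EINVAL;
    }
    return 0;
}
```

With 256 MiB zones, zone_range_check(0, 268435456, 268435456, capacity)
accepts the first zone, while an offset of 512 or a length of 4096 in the
middle of the device is rejected.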




* [PATCH v15 4/8] raw-format: add zone operations to pass through requests
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

The raw-format driver usually sits on top of the file-posix driver, so it
needs to pass zone command requests through to its child.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block/raw-format.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index b6a0ce58f4..dbbb8f3859 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -317,6 +317,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState *bs,
     return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                           unsigned int *nr_zones,
+                                           BlockZoneDescriptor *zones) {
+    return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+                                         int64_t offset, int64_t len) {
+    return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
     int64_t len;
@@ -618,6 +629,8 @@ BlockDriver bdrv_raw = {
     .bdrv_co_pwritev      = &raw_co_pwritev,
     .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
     .bdrv_co_pdiscard     = &raw_co_pdiscard,
+    .bdrv_co_zone_report  = &raw_co_zone_report,
+    .bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
     .bdrv_co_block_status = &raw_co_block_status,
     .bdrv_co_copy_range_from = &raw_co_copy_range_from,
     .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.38.1
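
The pass-through in this patch relies on the generic entry points from patch
3/8, which return -ENOTSUP when a driver has no zone callback. The sketch
below models that dispatch with simplified, hypothetical stand-in types
(Driver, Node, node_zone_mgmt and the example_* nodes are not QEMU names), so
it only illustrates the control flow, not the real block layer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for BlockDriver and BlockDriverState. */
typedef struct Node Node;

typedef struct Driver {
    /* Optional callback; NULL means the driver has no zone support. */
    int (*zone_mgmt)(Node *n, int op, int64_t offset, int64_t len);
} Driver;

struct Node {
    const Driver *drv;
    Node *file;     /* child node, NULL for a leaf */
    int last_op;    /* last operation that reached this node, -1 if none */
};

/* Generic entry point: fail with -ENOTSUP when the callback is missing. */
int node_zone_mgmt(Node *n, int op, int64_t offset, int64_t len)
{
    assert(n != NULL);
    if (!n->drv || !n->drv->zone_mgmt) {
        return -ENOTSUP;
    }
    return n->drv->zone_mgmt(n, op, offset, len);
}

/* Leaf driver: records the operation instead of issuing an ioctl. */
int leaf_zone_mgmt(Node *n, int op, int64_t offset, int64_t len)
{
    (void)offset;
    (void)len;
    n->last_op = op;
    return 0;
}

/* Format driver: forwards the request unchanged to its child. */
int passthrough_zone_mgmt(Node *n, int op, int64_t offset, int64_t len)
{
    return node_zone_mgmt(n->file, op, offset, len);
}

const Driver leaf_driver = { .zone_mgmt = leaf_zone_mgmt };
const Driver passthrough_driver = { .zone_mgmt = passthrough_zone_mgmt };
const Driver no_zone_driver = { .zone_mgmt = NULL };

/* Example graphs: a pass-through node and a zone-unaware node over one leaf. */
Node example_leaf = { .drv = &leaf_driver, .file = NULL, .last_op = -1 };
Node example_raw = { .drv = &passthrough_driver, .file = &example_leaf,
                     .last_op = -1 };
Node example_other = { .drv = &no_zone_driver, .file = &example_leaf,
                       .last_op = -1 };
```

A request issued on example_raw reaches example_leaf unchanged, while the same
request on example_other stops at the entry point with -ENOTSUP. This is why a
format driver has to provide the two forwarding callbacks explicitly.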




* [PATCH v15 5/8] config: add check to block layer
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Stacking a block driver that does not support zoned children on top of a
host-managed zoned device is not allowed.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block.c                          | 19 +++++++++++++++++++
 block/file-posix.c               | 12 ++++++++++++
 block/raw-format.c               |  1 +
 include/block/block_int-common.h |  5 +++++
 4 files changed, 37 insertions(+)

diff --git a/block.c b/block.c
index b4a89207ad..5ab0b26510 100644
--- a/block.c
+++ b/block.c
@@ -7913,6 +7913,25 @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
         return;
     }
 
+    /*
+     * Non-zoned block drivers do not follow zoned storage constraints
+     * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+     * drivers in a graph.
+     */
+    if (!parent_bs->drv->supports_zoned_children &&
+        child_bs->bl.zoned == BLK_Z_HM) {
+        /*
+         * The host-aware model allows zoned storage constraints and random
+         * write. Allow mixing host-aware and non-zoned drivers. Using
+         * host-aware device as a regular device.
+         */
+        error_setg(errp, "Cannot add a %s child to a %s parent",
+                   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+                   parent_bs->drv->supports_zoned_children ?
+                   "support zoned children" : "not support zoned children");
+        return;
+    }
+
     if (!QLIST_EMPTY(&child_bs->parents)) {
         error_setg(errp, "The node %s already has a parent",
                    child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index b6d88db208..f661f202a1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -776,6 +776,18 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
             goto fail;
         }
     }
+#ifdef CONFIG_BLKZONED
+    /*
+     * The kernel page cache does not reliably work for writes to SWR zones
+     * of a zoned block device because it cannot guarantee the order of writes.
+     */
+    if ((strcmp(bs->drv->format_name, "zoned_host_device") == 0) &&
+        (!(s->open_flags & O_DIRECT))) {
+        error_setg(errp, "driver=zoned_host_device was specified, but it "
+                   "requires cache.direct=on, which was not specified.");
+        return -EINVAL; /* No host kernel page cache */
+    }
+#endif
 
     if (S_ISBLK(st.st_mode)) {
 #ifdef __linux__
diff --git a/block/raw-format.c b/block/raw-format.c
index dbbb8f3859..772ce777ff 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -618,6 +618,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
 BlockDriver bdrv_raw = {
     .format_name          = "raw",
     .instance_size        = sizeof(BDRVRawState),
+    .supports_zoned_children = true,
     .bdrv_probe           = &raw_probe,
     .bdrv_reopen_prepare  = &raw_reopen_prepare,
     .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 565228d8dd..cd631f94ed 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -137,6 +137,11 @@ struct BlockDriver {
      */
     bool is_format;
 
+    /*
+     * Set to true if the BlockDriver supports zoned children.
+     */
+    bool supports_zoned_children;
+
     /*
      * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
      * this field set to true, except ones that are defined only by their
-- 
2.38.1
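
The condition added to bdrv_add_child() above can be read as a small
predicate: only a host-managed child needs a parent that declares
supports_zoned_children. The sketch below is a hypothetical restatement of
that rule. The ZoneModel enum and zoned_child_allowed() are illustrative
names, and the enum values are not taken from QEMU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the zone models named in the series. */
typedef enum { Z_NONE, Z_HM, Z_HA } ZoneModel;

/*
 * Returns true if a child with the given zone model may be attached to a
 * parent, following the condition in bdrv_add_child(): only host-managed
 * children require a parent that supports zoned children.
 */
bool zoned_child_allowed(bool parent_supports_zoned, ZoneModel child_model)
{
    assert(child_model == Z_NONE || child_model == Z_HM ||
           child_model == Z_HA);
    return parent_supports_zoned || child_model != Z_HM;
}
```

Host-aware and non-zoned children are accepted by any parent under this rule,
which matches the comment in the patch about using a host-aware device as a
regular device.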




* [PATCH v15 6/8] qemu-iotests: test new zone operations
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

New block layer APIs for zoned block devices have been added. Test them by
running each zone operation on a newly created null_blk device and checking
that the output shows the correct zone information:
$ ./tests/qemu-iotests/tests/zoned.sh

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/qemu-iotests/tests/zoned.out | 53 ++++++++++++++++++
 tests/qemu-iotests/tests/zoned.sh  | 86 ++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 0000000000..0c8f96deb9
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+Simple cases: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+start: 0x100000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:1, [type: 2]
+start: 0x180000, len 0x80000, cap 0x80000, wptr 0x180000, zcond:1, [type: 2]
+start: 0x200000, len 0x80000, cap 0x80000, wptr 0x200000, zcond:1, [type: 2]
+start: 0x280000, len 0x80000, cap 0x80000, wptr 0x280000, zcond:1, [type: 2]
+start: 0x300000, len 0x80000, cap 0x80000, wptr 0x300000, zcond:1, [type: 2]
+start: 0x380000, len 0x80000, cap 0x80000, wptr 0x380000, zcond:1, [type: 2]
+start: 0x400000, len 0x80000, cap 0x80000, wptr 0x400000, zcond:1, [type: 2]
+start: 0x480000, len 0x80000, cap 0x80000, wptr 0x480000, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:3, [type: 2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 0000000000..9d7c15dde6
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2" # 0x3e70000000 / 512 = 0x1f380000
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 268435456 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x3e70000000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x3e70000000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(4) finishing the second zone"
+sudo $QEMU_IO $IMG -c "zf 268435456 268435456"
+echo "After finishing a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(5) resetting the second zone"
+sudo $QEMU_IO $IMG -c "zrs 268435456 268435456"
+echo "After resetting a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
-- 
2.38.1
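
The byte offsets passed to qemu-io in zoned.sh and the sector values printed
in zoned.out are related by a shift of 9 bits (512-byte sectors), as done by
tosector() in patch 3/8. The sketch below checks that arithmetic. The function
names are hypothetical, and zone index 999 for the last zone is inferred from
the expected output under the assumption of equally sized 256 MiB zones. It is
not stated in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9   /* 512-byte sectors, the value of BDRV_SECTOR_BITS */

/* Convert a byte count to 512-byte sectors (same shift as tosector()). */
int64_t bytes_to_sectors(int64_t bytes)
{
    assert(bytes >= 0);
    return bytes >> SECTOR_BITS;
}

/* Byte offset of the start of zone number `index`, for fixed-size zones. */
int64_t zone_start_bytes(int64_t index, int64_t zone_size_bytes)
{
    assert(index >= 0 && zone_size_bytes > 0);
    return index * zone_size_bytes;
}
```

For example, a zone length of 268435456 bytes corresponds to 0x80000 sectors,
and the byte offset 0x3e70000000 used for the last zone corresponds to sector
0x1f380000, matching the values in zoned.out.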




* [PATCH v15 7/8] block: add some trace events for new block layer APIs
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index f661f202a1..5cf92608db 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3272,6 +3272,7 @@ static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
                                            BlockZoneDescriptor *zones) {
     BDRVRawState *s = bs->opaque;
     RawPosixAIOData acb;
+    trace_zbd_zone_report(bs, *nr_zones, offset >> BDRV_SECTOR_BITS);
 
     acb = (RawPosixAIOData) {
         .bs         = bs,
@@ -3350,6 +3351,8 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
         },
     };
 
+    trace_zbd_zone_mgmt(bs, op_name, offset >> BDRV_SECTOR_BITS,
+                        len >> BDRV_SECTOR_BITS);
     ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
     if (ret != 0) {
         ret = -errno;
diff --git a/block/trace-events b/block/trace-events
index 48dbf10c66..3f4e1d088a 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -209,6 +209,8 @@ file_FindEjectableOpticalMedia(const char *media) "Matching using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
 file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
+zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report %u zones starting at sector offset 0x%" PRIx64 ""
+zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs %p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " sectors"
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.38.1




* [PATCH v15 8/8] docs/zoned-storage: add zoned device documentation
From: Sam Li @ 2023-01-29 10:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé,
	Sam Li

Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 docs/devel/zoned-storage.rst           | 43 ++++++++++++++++++++++++++
 docs/system/qemu-block-drivers.rst.inc |  6 ++++
 2 files changed, 49 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 0000000000..03e52efe2e
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,43 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. They only allow sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-------------------------------------
+The QEMU block layer supports three zoned storage models:
+
+- BLK_Z_HM: The host-managed zoned model only allows sequential write access
+  to zones. It supports ZBD-specific I/O commands that can be used by a host to
+  manage the zones of a device.
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
+  zones, making it backward compatible with regular block devices.
+- BLK_Z_NONE: The non-zoned model has no zone support. It includes both regular
+  and drive-managed ZBD devices. ZBD-specific I/O commands are not supported.
+
+The block device information resides inside BlockDriverState. QEMU uses the
+BlockLimits struct (BlockDriverState::bl), which the block layer continuously
+accesses while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is set
+to BLK_Z_HM, then block drivers will declare support for zoned host devices.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--------------------------------------
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io:
+$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
+-c "zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..0b97227fd9 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated storage
+  controller supports zoned storage. Use
+  ``--blockdev host_device,node-name=drive0,filename=/dev/nullb0`` to pass
+  through ``/dev/nullb0`` as ``drive0``.
+
 Windows
 ^^^^^^^
 
-- 
2.38.1




* Re: [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-01-29 10:28 ` [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
@ 2023-02-06 12:04   ` Stefan Hajnoczi
  2023-02-06 12:12     ` Sam Li
  2023-02-27 18:20   ` Kevin Wolf
  1 sibling, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2023-02-06 12:04 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Kevin Wolf,
	Paolo Bonzini, Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé

On Sun, 29 Jan 2023 at 05:30, Sam Li <faithilikerun@gmail.com> wrote:
>
> Add zoned device option to the host_device BlockDriver. It will be presented
> only for zoned host block devices. By adding zone management operations to the
> host_device BlockDriver, users can use the new block layer APIs, including
> Report Zones and four zone management operations
> (open, close, finish, reset).
>
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
>
> For example, to test zone_report, use the following command:
> $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> -c "zrp offset nr_zones"
>
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block/block-backend.c             | 147 ++++++++++++++
>  block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
>  block/io.c                        |  41 ++++
>  include/block/block-io.h          |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h           |   6 +-
>  include/sysemu/block-backend-io.h |  18 ++
>  meson.build                       |   4 +
>  qemu-io-cmds.c                    | 149 ++++++++++++++
>  9 files changed, 715 insertions(+), 1 deletion(-)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index ba7bf1d6bc..a4847b9131 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
>      void *iobuf;
>      int ret;
>      BdrvRequestFlags flags;
> +    union {
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            unsigned long op;
> +        } zone_mgmt;
> +    };
>  } BlkRwCo;
>
>  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> @@ -1795,6 +1804,144 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
>      return ret;
>  }
>
> +static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
> +{
> +    BlkAioEmAIOCB *acb = opaque;
> +    BlkRwCo *rwco = &acb->rwco;
> +
> +    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> +                                   rwco->zone_report.nr_zones,
> +                                   rwco->zone_report.zones);
> +    blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones,
> +                                BlockZoneDescriptor  *zones,
> +                                BlockCompletionFunc *cb, void *opaque)
> +{
> +    BlkAioEmAIOCB *acb;
> +    Coroutine *co;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +    acb->rwco = (BlkRwCo) {
> +        .blk    = blk,
> +        .offset = offset,
> +        .ret    = NOT_DONE,
> +        .zone_report = {
> +            .zones = zones,
> +            .nr_zones = nr_zones,
> +        },
> +    };
> +    acb->has_returned = false;
> +
> +    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> +    bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +    acb->has_returned = true;
> +    if (acb->rwco.ret != NOT_DONE) {
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
> +    }
> +
> +    return &acb->common;
> +}
> +
> +static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
> +{
> +    BlkAioEmAIOCB *acb = opaque;
> +    BlkRwCo *rwco = &acb->rwco;
> +
> +    rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> +                                 rwco->offset, acb->bytes);
> +    blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque) {
> +    BlkAioEmAIOCB *acb;
> +    Coroutine *co;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +    acb->rwco = (BlkRwCo) {
> +        .blk    = blk,
> +        .offset = offset,
> +        .ret    = NOT_DONE,
> +        .zone_mgmt = {
> +            .op = op,
> +        },
> +    };
> +    acb->bytes = len;
> +    acb->has_returned = false;
> +
> +    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
> +    bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +    acb->has_returned = true;
> +    if (acb->rwco.ret != NOT_DONE) {
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
> +    }
> +
> +    return &acb->common;
> +}
> +
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk); /* increase before waiting */
> +    blk_wait_while_drained(blk);
> +    if (!blk_is_available(blk)) {
> +        blk_dec_in_flight(blk);
> +        return -ENOMEDIUM;
> +    }
> +    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * op is the zone operation;
> + * offset is the byte offset from the start of the zoned device;
> + * len is the maximum number of bytes the command should operate on. It
> + * should be aligned with the device zone size.
> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    blk_wait_while_drained(blk);
> +
> +    ret = blk_check_byte_request(blk, offset, len);
> +    if (ret < 0) {
> +        blk_dec_in_flight(blk);
> +        return ret;
> +    }
> +
> +    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>      BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 43c59c6d56..b6d88db208 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -68,6 +68,9 @@
>  #include <sys/param.h>
>  #include <sys/syscall.h>
>  #include <sys/vfs.h>
> +#if defined(CONFIG_BLKZONED)
> +#include <linux/blkzoned.h>
> +#endif
>  #include <linux/cdrom.h>
>  #include <linux/fd.h>
>  #include <linux/fs.h>
> @@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
>              PreallocMode prealloc;
>              Error **errp;
>          } truncate;
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            unsigned long op;
> +        } zone_mgmt;
>      };
>  } RawPosixAIOData;
>
> @@ -1351,6 +1361,50 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>          zoned = BLK_Z_NONE;
>      }
>      bs->bl.zoned = zoned;
> +    if (zoned != BLK_Z_NONE) {
> +        /*
> +         * The zoned device must at least have zone size and nr_zones fields.
> +         */
> +        ret = get_sysfs_long_val(&st, "chunk_sectors");
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Unable to read chunk_sectors "
> +                                         "sysfs attribute");
> +            goto out;
> +        } else if (!ret) {
> +            error_setg(errp, "Read 0 from chunk_sectors sysfs attribute");
> +            goto out;
> +        }
> +        bs->bl.zone_size = ret << BDRV_SECTOR_BITS;
> +
> +        ret = get_sysfs_long_val(&st, "nr_zones");
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Unable to read nr_zones "
> +                                         "sysfs attribute");
> +            goto out;
> +        } else if (!ret) {
> +            error_setg(errp, "Read 0 from nr_zones sysfs attribute");
> +            goto out;
> +        }
> +        bs->bl.nr_zones = ret;
> +
> +        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> +        if (ret > 0) {
> +            bs->bl.max_append_sectors = ret >> BDRV_SECTOR_BITS;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_open_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_open_zones = ret;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_active_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_active_zones = ret;
> +        }
> +        return;
> +    }
> +out:
> +    bs->bl.zoned = BLK_Z_NONE;
>  }
>
>  static int check_for_dasd(int fd)
> @@ -1364,6 +1418,23 @@ static int check_for_dasd(int fd)
>  #endif
>  }
>
> +#if defined(CONFIG_BLKZONED)
> +/**
> + * Zoned storage needs to be virtualized with the correct physical block size
> + * and logical block size.
> + */
> +static int hdev_probe_zoned_blocksizes(BlockDriverState *bs, BlockSizes *bsz)

The #ifdef approach in this patch won't work because the same
BlockDriver now handles both zoned and non-zoned devices at runtime.
This function needs to be unified with hdev_probe_blocksizes():

  if (check_for_dasd(s->fd) < 0 || bs->bl.zoned == BLK_Z_NONE) {
      return -ENOTSUP;
  }

  ...probe block sizes...
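
A minimal standalone sketch of the unified runtime check, not the actual
file-posix.c code. FakeDev, its fields, and probe_blocksizes() are hypothetical
stand-ins for BDRVRawState, check_for_dasd(), bs->bl.zoned and
hdev_probe_blocksizes(). Read literally, the "||" condition above would also
return -ENOTSUP for a zoned device that is not a DASD, so the sketch assumes
the intent is to skip probing only when the device is neither:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU's types, for illustration only. */
typedef enum { BLK_Z_NONE, BLK_Z_HM, BLK_Z_HA } BlockZoneModel;

typedef struct {
    bool is_dasd;          /* stands in for check_for_dasd(s->fd) >= 0 */
    BlockZoneModel zoned;  /* stands in for bs->bl.zoned */
    unsigned log, phys;    /* sizes the real code would read via ioctl */
} FakeDev;

typedef struct {
    unsigned log, phys;
} BlockSizes;

/* One function for both cases, with the decision made at runtime. */
static int probe_blocksizes(const FakeDev *dev, BlockSizes *bsz)
{
    /* Skip probing only if the device is neither a DASD nor zoned. */
    if (!dev->is_dasd && dev->zoned == BLK_Z_NONE) {
        return -ENOTSUP;
    }
    bsz->log = dev->log;
    bsz->phys = dev->phys;
    return 0;
}
```

With a single function there is no need for an #ifdef around the
.bdrv_probe_blocksizes initializer in bdrv_host_device.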

> +{
> +    BDRVRawState *s = bs->opaque;
> +    int ret;
> +
> +    ret = probe_logical_blocksize(s->fd, &bsz->log);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    return probe_physical_blocksize(s->fd, &bsz->phys);
> +}
> +#else
>  /**
>   * Try to get @bs's logical and physical block size.
>   * On success, store them in @bsz and return zero.
> @@ -1384,6 +1455,7 @@ static int hdev_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
>      }
>      return probe_physical_blocksize(s->fd, &bsz->phys);
>  }
> +#endif
>
>  /**
>   * Try to get @bs's geometry: cyls, heads, sectors.
> @@ -1844,6 +1916,146 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>  }
>  #endif
>
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +#if defined(CONFIG_BLKZONED)
> +static inline int parse_zone(struct BlockZoneDescriptor *zone,
> +                              const struct blk_zone *blkz) {
> +    zone->start = blkz->start << BDRV_SECTOR_BITS;
> +    zone->length = blkz->len << BDRV_SECTOR_BITS;
> +    zone->wp = blkz->wp << BDRV_SECTOR_BITS;
> +
> +#ifdef HAVE_BLK_ZONE_REP_CAPACITY
> +    zone->cap = blkz->capacity << BDRV_SECTOR_BITS;
> +#else
> +    zone->cap = blkz->len << BDRV_SECTOR_BITS;
> +#endif
> +
> +    switch (blkz->type) {
> +    case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +        zone->type = BLK_ZT_SWR;
> +        break;
> +    case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +        zone->type = BLK_ZT_SWP;
> +        break;
> +    case BLK_ZONE_TYPE_CONVENTIONAL:
> +        zone->type = BLK_ZT_CONV;
> +        break;
> +    default:
> +        error_report("Unsupported zone type: 0x%x", blkz->type);
> +        return -ENOTSUP;
> +    }
> +
> +    switch (blkz->cond) {
> +    case BLK_ZONE_COND_NOT_WP:
> +        zone->state = BLK_ZS_NOT_WP;
> +        break;
> +    case BLK_ZONE_COND_EMPTY:
> +        zone->state = BLK_ZS_EMPTY;
> +        break;
> +    case BLK_ZONE_COND_IMP_OPEN:
> +        zone->state = BLK_ZS_IOPEN;
> +        break;
> +    case BLK_ZONE_COND_EXP_OPEN:
> +        zone->state = BLK_ZS_EOPEN;
> +        break;
> +    case BLK_ZONE_COND_CLOSED:
> +        zone->state = BLK_ZS_CLOSED;
> +        break;
> +    case BLK_ZONE_COND_READONLY:
> +        zone->state = BLK_ZS_RDONLY;
> +        break;
> +    case BLK_ZONE_COND_FULL:
> +        zone->state = BLK_ZS_FULL;
> +        break;
> +    case BLK_ZONE_COND_OFFLINE:
> +        zone->state = BLK_ZS_OFFLINE;
> +        break;
> +    default:
> +        error_report("Unsupported zone state: 0x%x", blkz->cond);
> +        return -ENOTSUP;
> +    }
> +    return 0;
> +}
> +#endif
> +
> +#if defined(CONFIG_BLKZONED)
> +static int handle_aiocb_zone_report(void *opaque)
> +{
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
> +    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> +    /* zoned block devices use 512-byte sectors */
> +    uint64_t sector = aiocb->aio_offset / 512;
> +
> +    struct blk_zone *blkz;
> +    size_t rep_size;
> +    unsigned int nrz;
> +    int ret, n = 0, i = 0;
> +
> +    nrz = *nr_zones;
> +    rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> +    g_autofree struct blk_zone_report *rep = NULL;
> +    rep = g_malloc(rep_size);
> +
> +    blkz = (struct blk_zone *)(rep + 1);
> +    while (n < nrz) {
> +        memset(rep, 0, rep_size);
> +        rep->sector = sector;
> +        rep->nr_zones = nrz - n;
> +
> +        do {
> +            ret = ioctl(fd, BLKREPORTZONE, rep);
> +        } while (ret != 0 && errno == EINTR);
> +        if (ret != 0) {
> +            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> +                         fd, sector, errno);
> +            return -errno;
> +        }
> +
> +        if (!rep->nr_zones) {
> +            break;
> +        }
> +
> +        for (i = 0; i < rep->nr_zones; i++, n++) {
> +            ret = parse_zone(&zones[n], &blkz[i]);
> +            if (ret != 0) {
> +                return ret;
> +            }
> +
> +            /* The next report should start after the last zone reported */
> +            sector = blkz[i].start + blkz[i].len;
> +        }
> +    }
> +
> +    *nr_zones = n;
> +    return 0;
> +}
> +#endif
> +
> +#if defined(CONFIG_BLKZONED)
> +static int handle_aiocb_zone_mgmt(void *opaque)
> +{
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    uint64_t sector = aiocb->aio_offset / 512;
> +    int64_t nr_sectors = aiocb->aio_nbytes / 512;
> +    struct blk_zone_range range;
> +    int ret;
> +
> +    /* Execute the operation */
> +    range.sector = sector;
> +    range.nr_sectors = nr_sectors;
> +    do {
> +        ret = ioctl(fd, aiocb->zone_mgmt.op, &range);
> +    } while (ret != 0 && errno == EINTR);
> +
> +    return ret;
> +}
> +#endif
> +
>  static int handle_aiocb_copy_range(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -3035,6 +3247,107 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
>      }
>  }
>
> +/*
> + * zone report - Get a zone block device's information in the form
> + * of an array of zone descriptors.
> + * zones is an array of zone descriptors to hold zone information on reply;
> + * offset can be any byte within the entire size of the device;
> + * nr_zones is the maximum number of zones the command should report.
> + */
> +#if defined(CONFIG_BLKZONED)
> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                           unsigned int *nr_zones,
> +                                           BlockZoneDescriptor *zones) {
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +
> +    acb = (RawPosixAIOData) {
> +        .bs         = bs,
> +        .aio_fildes = s->fd,
> +        .aio_type   = QEMU_AIO_ZONE_REPORT,
> +        .aio_offset = offset,
> +        .zone_report    = {
> +            .nr_zones       = nr_zones,
> +            .zones          = zones,
> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
> +}
> +#endif
> +
> +/*
> + * zone management operations - Execute an operation on a zone
> + */
> +#if defined(CONFIG_BLKZONED)
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len) {
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +    int64_t zone_size, zone_size_mask;
> +    const char *op_name;
> +    unsigned long zo;
> +    int ret;
> +    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
> +
> +    zone_size = bs->bl.zone_size;
> +    zone_size_mask = zone_size - 1;
> +    if (offset & zone_size_mask) {
> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> +                     "%" PRId64 "", offset / 512, zone_size / 512);
> +        return -EINVAL;
> +    }
> +
> +    if (((offset + len) < capacity && len & zone_size_mask) ||
> +        offset + len > capacity) {
> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> +                      " %" PRId64 "", len / 512, zone_size / 512);
> +        return -EINVAL;
> +    }
> +
> +    switch (op) {
> +    case BLK_ZO_OPEN:
> +        op_name = "BLKOPENZONE";
> +        zo = BLKOPENZONE;
> +        break;
> +    case BLK_ZO_CLOSE:
> +        op_name = "BLKCLOSEZONE";
> +        zo = BLKCLOSEZONE;
> +        break;
> +    case BLK_ZO_FINISH:
> +        op_name = "BLKFINISHZONE";
> +        zo = BLKFINISHZONE;
> +        break;
> +    case BLK_ZO_RESET:
> +        op_name = "BLKRESETZONE";
> +        zo = BLKRESETZONE;
> +        break;
> +    default:
> +        error_report("Unsupported zone op: 0x%x", op);
> +        return -ENOTSUP;
> +    }
> +
> +    acb = (RawPosixAIOData) {
> +        .bs             = bs,
> +        .aio_fildes     = s->fd,
> +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> +        .aio_offset     = offset,
> +        .aio_nbytes     = len,
> +        .zone_mgmt  = {
> +            .op = zo,
> +        },
> +    };
> +
> +    ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> +    if (ret != 0) {
> +        ret = -errno;
> +        error_report("ioctl %s failed %d", op_name, ret);
> +    }
> +
> +    return ret;
> +}
> +#endif
> +
>  static coroutine_fn int
>  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
>                  bool blkdev)
> @@ -3756,13 +4069,23 @@ static BlockDriver bdrv_host_device = {
>      .bdrv_check_perm = raw_check_perm,
>      .bdrv_set_perm   = raw_set_perm,
>      .bdrv_abort_perm_update = raw_abort_perm_update,
> +#ifndef CONFIG_BLKZONED
>      .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> +#endif
>      .bdrv_probe_geometry = hdev_probe_geometry,
>
>      /* generic scsi device */
>  #ifdef __linux__
>      .bdrv_co_ioctl          = hdev_co_ioctl,
>  #endif
> +
> +    /* zoned device */
> +#if defined(CONFIG_BLKZONED)
> +    /* zone management operations */
> +    .bdrv_probe_blocksizes = hdev_probe_zoned_blocksizes,
> +    .bdrv_co_zone_report = raw_co_zone_report,
> +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> +#endif
>  };
>
>  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> diff --git a/block/io.c b/block/io.c
> index a09a19f7a7..1586e42ab9 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3099,6 +3099,47 @@ out:
>      return co.ret;
>  }
>
> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                        unsigned int *nr_zones,
> +                        BlockZoneDescriptor *zones)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_report) {

Now that zoned device support is determined at runtime instead of at
compile-time, checking for drv->bdrv_co_zone_report isn't enough. The
BlockDriverState might have bs->bl.zoned == BLK_Z_NONE.

Please add || bs->bl.zoned == BLK_Z_NONE to this if statement to
prevent calls when the device is not zoned.

The same applies to bdrv_co_zone_mgmt().
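
A simplified, standalone sketch of the requested guard, assuming stripped-down
stand-ins for BlockDriver and BlockDriverState (the real structs carry far more
state, and the real function is a coroutine that also tracks in-flight
requests). The names fake_zone_report, fake_host_device and zone_report are
hypothetical:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for QEMU's types. */
typedef enum { BLK_Z_NONE, BLK_Z_HM, BLK_Z_HA } BlockZoneModel;

typedef struct BlockDriverState BlockDriverState;

typedef struct {
    int (*bdrv_co_zone_report)(BlockDriverState *bs);
} BlockDriver;

struct BlockDriverState {
    const BlockDriver *drv;
    BlockZoneModel zoned;   /* stands in for bs->bl.zoned */
};

/* Driver callback that always succeeds, for illustration. */
static int fake_zone_report(BlockDriverState *bs)
{
    (void)bs;
    return 0;
}

/* The same driver serves zoned and non-zoned devices. */
static const BlockDriver fake_host_device = {
    .bdrv_co_zone_report = fake_zone_report,
};

static int zone_report(BlockDriverState *bs)
{
    const BlockDriver *drv = bs->drv;

    /* The callback existing is not enough: the device must be zoned too. */
    if (!drv || !drv->bdrv_co_zone_report || bs->zoned == BLK_Z_NONE) {
        return -ENOTSUP;
    }
    return drv->bdrv_co_zone_report(bs);
}
```

Because the callback is now always set for host_device, the zone model check is
the only thing that keeps a regular block device from reaching the ioctl path.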

> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_mgmt) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>  {
>      IO_CODE();
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index 3398351596..10ff212036 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -98,6 +98,13 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
>
>  int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>                                    int64_t bytes);
> +/* Report zone information of zone block device. */
> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                     unsigned int *nr_zones,
> +                                     BlockZoneDescriptor *zones);
> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +                                   int64_t offset, int64_t len);
> +
>  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
>  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
>                        int64_t bytes, int64_t *pnum, int64_t *map,
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 57f0612f5e..565228d8dd 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -703,6 +703,12 @@ struct BlockDriver {
>      int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_load_vmstate)(
>          BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
>
> +    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
> +            int64_t offset, unsigned int *nr_zones,
> +            BlockZoneDescriptor *zones);
> +    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
> +            int64_t offset, int64_t len);
> +
>      /* removable device specific */
>      bool (*bdrv_is_inserted)(BlockDriverState *bs);
>      void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> @@ -839,6 +845,21 @@ typedef struct BlockLimits {
>
>      /* device zone model */
>      BlockZoneModel zoned;
> +
> +    /* zone size expressed in bytes */
> +    uint32_t zone_size;
> +
> +    /* total number of zones */
> +    uint32_t nr_zones;
> +
> +    /* maximum sectors of a zone append write operation */
> +    int64_t max_append_sectors;
> +
> +    /* maximum number of open zones */
> +    int64_t max_open_zones;
> +
> +    /* maximum number of active zones */
> +    int64_t max_active_zones;
>  } BlockLimits;
>
>  typedef struct BdrvOpBlocker BdrvOpBlocker;
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index f8cda9df91..eda6a7a253 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -28,6 +28,8 @@
>  #define QEMU_AIO_WRITE_ZEROES 0x0020
>  #define QEMU_AIO_COPY_RANGE   0x0040
>  #define QEMU_AIO_TRUNCATE     0x0080
> +#define QEMU_AIO_ZONE_REPORT  0x0100
> +#define QEMU_AIO_ZONE_MGMT    0x0200
>  #define QEMU_AIO_TYPE_MASK \
>          (QEMU_AIO_READ | \
>           QEMU_AIO_WRITE | \
> @@ -36,7 +38,9 @@
>           QEMU_AIO_DISCARD | \
>           QEMU_AIO_WRITE_ZEROES | \
>           QEMU_AIO_COPY_RANGE | \
> -         QEMU_AIO_TRUNCATE)
> +         QEMU_AIO_TRUNCATE | \
> +         QEMU_AIO_ZONE_REPORT | \
> +         QEMU_AIO_ZONE_MGMT)
>
>  /* AIO flags */
>  #define QEMU_AIO_MISALIGNED   0x1000
> diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> index 031a27ba10..dc8a4368f0 100644
> --- a/include/sysemu/block-backend-io.h
> +++ b/include/sysemu/block-backend-io.h
> @@ -46,6 +46,13 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
>                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
>                            BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones,
> +                                BlockZoneDescriptor *zones,
> +                                BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
>                               BlockCompletionFunc *cb, void *opaque);
>  void blk_aio_cancel_async(BlockAIOCB *acb);
> @@ -166,6 +173,17 @@ int co_wrapper_mixed blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>  int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes, BdrvRequestFlags flags);
>
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones);
> +int co_wrapper_mixed blk_zone_report(BlockBackend *blk, int64_t offset,
> +                                         unsigned int *nr_zones,
> +                                         BlockZoneDescriptor *zones);
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                  int64_t offset, int64_t len);
> +int co_wrapper_mixed blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                       int64_t offset, int64_t len);
> +
>  int co_wrapper_mixed blk_pdiscard(BlockBackend *blk, int64_t offset,
>                                    int64_t bytes);
>  int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> diff --git a/meson.build b/meson.build
> index 6d3b665629..a267f74536 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1962,6 +1962,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('replication').allowed())
>  # has_header
>  config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
>  config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
> +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
>  config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
>  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
>  config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
> @@ -2056,6 +2057,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
>  config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM',
>                       cc.has_member('struct stat', 'st_atim',
>                                     prefix: '#include <sys/stat.h>'))
> +config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY',
> +                     cc.has_member('struct blk_zone', 'capacity',
> +                                   prefix: '#include <linux/blkzoned.h>'))
>
>  # has_type
>  config_host_data.set('CONFIG_IOVEC',
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 952dc940f1..3a3bad77c3 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -1712,6 +1712,150 @@ static const cmdinfo_t flush_cmd = {
>      .oneline    = "flush all in-core file state to disk",
>  };
>
> +static inline int64_t tosector(int64_t bytes)
> +{
> +    return bytes >> BDRV_SECTOR_BITS;
> +}
> +
> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset;
> +    unsigned int nr_zones;
> +
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    nr_zones = cvtnum(argv[optind]);
> +
> +    g_autofree BlockZoneDescriptor *zones = NULL;
> +    zones = g_new(BlockZoneDescriptor, nr_zones);
> +    ret = blk_zone_report(blk, offset, &nr_zones, zones);
> +    if (ret < 0) {
> +        printf("zone report failed: %s\n", strerror(-ret));
> +    } else {
> +        for (int i = 0; i < nr_zones; ++i) {
> +            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
> +                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
> +                   "zcond:%u, [type: %u]\n",
> +                    tosector(zones[i].start), tosector(zones[i].length),
> +                    tosector(zones[i].cap), tosector(zones[i].wp),
> +                    zones[i].state, zones[i].type);
> +        }
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_report_cmd = {
> +    .name = "zone_report",
> +    .altname = "zrp",
> +    .cfunc = zone_report_f,
> +    .argmin = 2,
> +    .argmax = 2,
> +    .args = "offset number",
> +    .oneline = "report zone information",
> +};
> +
> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
> +    if (ret < 0) {
> +        printf("zone open failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_open_cmd = {
> +    .name = "zone_open",
> +    .altname = "zo",
> +    .cfunc = zone_open_f,
> +    .argmin = 2,
> +    .argmax = 2,
> +    .args = "offset len",
> +    .oneline = "explicitly open a range of zones in zone block device",
> +};
> +
> +static int zone_close_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
> +    if (ret < 0) {
> +        printf("zone close failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_close_cmd = {
> +    .name = "zone_close",
> +    .altname = "zc",
> +    .cfunc = zone_close_f,
> +    .argmin = 2,
> +    .argmax = 2,
> +    .args = "offset len",
> +    .oneline = "close a range of zones in zone block device",
> +};
> +
> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
> +    if (ret < 0) {
> +        printf("zone finish failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_finish_cmd = {
> +    .name = "zone_finish",
> +    .altname = "zf",
> +    .cfunc = zone_finish_f,
> +    .argmin = 2,
> +    .argmax = 2,
> +    .args = "offset len",
> +    .oneline = "finish a range of zones in zone block device",
> +};
> +
> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
> +    if (ret < 0) {
> +        printf("zone reset failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_reset_cmd = {
> +    .name = "zone_reset",
> +    .altname = "zrs",
> +    .cfunc = zone_reset_f,
> +    .argmin = 2,
> +    .argmax = 2,
> +    .args = "offset len",
> +    .oneline = "reset a zone write pointer in zone block device",
> +};
> +
>  static int truncate_f(BlockBackend *blk, int argc, char **argv);
>  static const cmdinfo_t truncate_cmd = {
>      .name       = "truncate",
> @@ -2504,6 +2648,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
>      qemuio_add_command(&aio_write_cmd);
>      qemuio_add_command(&aio_flush_cmd);
>      qemuio_add_command(&flush_cmd);
> +    qemuio_add_command(&zone_report_cmd);
> +    qemuio_add_command(&zone_open_cmd);
> +    qemuio_add_command(&zone_close_cmd);
> +    qemuio_add_command(&zone_finish_cmd);
> +    qemuio_add_command(&zone_reset_cmd);
>      qemuio_add_command(&truncate_cmd);
>      qemuio_add_command(&length_cmd);
>      qemuio_add_command(&info_cmd);
> --
> 2.38.1
>
>



* Re: [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-02-06 12:04   ` Stefan Hajnoczi
@ 2023-02-06 12:12     ` Sam Li
  0 siblings, 0 replies; 17+ messages in thread
From: Sam Li @ 2023-02-06 12:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Kevin Wolf,
	Paolo Bonzini, Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé

On Mon, Feb 6, 2023 at 20:04, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Sun, 29 Jan 2023 at 05:30, Sam Li <faithilikerun@gmail.com> wrote:
> >
> > Add zoned device option to host_device BlockDriver. It will be presented only
> > for zoned host block devices. By adding zone management operations to the
> > host_block_device BlockDriver, users can use the new block layer APIs
> > including Report Zone and four zone management operations
> > (open, close, finish, reset, reset_all).
> >
> > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use the following command:
> > $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> >
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  block/block-backend.c             | 147 ++++++++++++++
> >  block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
> >  block/io.c                        |  41 ++++
> >  include/block/block-io.h          |   7 +
> >  include/block/block_int-common.h  |  21 ++
> >  include/block/raw-aio.h           |   6 +-
> >  include/sysemu/block-backend-io.h |  18 ++
> >  meson.build                       |   4 +
> >  qemu-io-cmds.c                    | 149 ++++++++++++++
> >  9 files changed, 715 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index ba7bf1d6bc..a4847b9131 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
> >      void *iobuf;
> >      int ret;
> >      BdrvRequestFlags flags;
> > +    union {
> > +        struct {
> > +            unsigned int *nr_zones;
> > +            BlockZoneDescriptor *zones;
> > +        } zone_report;
> > +        struct {
> > +            unsigned long op;
> > +        } zone_mgmt;
> > +    };
> >  } BlkRwCo;
> >
> >  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> > @@ -1795,6 +1804,144 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
> >      return ret;
> >  }
> >
> > +static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
> > +{
> > +    BlkAioEmAIOCB *acb = opaque;
> > +    BlkRwCo *rwco = &acb->rwco;
> > +
> > +    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> > +                                   rwco->zone_report.nr_zones,
> > +                                   rwco->zone_report.zones);
> > +    blk_aio_complete(acb);
> > +}
> > +
> > +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> > +                                unsigned int *nr_zones,
> > +                                BlockZoneDescriptor  *zones,
> > +                                BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    BlkAioEmAIOCB *acb;
> > +    Coroutine *co;
> > +    IO_CODE();
> > +
> > +    blk_inc_in_flight(blk);
> > +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> > +    acb->rwco = (BlkRwCo) {
> > +        .blk    = blk,
> > +        .offset = offset,
> > +        .ret    = NOT_DONE,
> > +        .zone_report = {
> > +            .zones = zones,
> > +            .nr_zones = nr_zones,
> > +        },
> > +    };
> > +    acb->has_returned = false;
> > +
> > +    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> > +    bdrv_coroutine_enter(blk_bs(blk), co);
> > +
> > +    acb->has_returned = true;
> > +    if (acb->rwco.ret != NOT_DONE) {
> > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > +                                         blk_aio_complete_bh, acb);
> > +    }
> > +
> > +    return &acb->common;
> > +}
> > +
> > +static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
> > +{
> > +    BlkAioEmAIOCB *acb = opaque;
> > +    BlkRwCo *rwco = &acb->rwco;
> > +
> > +    rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> > +                                 rwco->offset, acb->bytes);
> > +    blk_aio_complete(acb);
> > +}
> > +
> > +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                              int64_t offset, int64_t len,
> > +                              BlockCompletionFunc *cb, void *opaque) {
> > +    BlkAioEmAIOCB *acb;
> > +    Coroutine *co;
> > +    IO_CODE();
> > +
> > +    blk_inc_in_flight(blk);
> > +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> > +    acb->rwco = (BlkRwCo) {
> > +        .blk    = blk,
> > +        .offset = offset,
> > +        .ret    = NOT_DONE,
> > +        .zone_mgmt = {
> > +            .op = op,
> > +        },
> > +    };
> > +    acb->bytes = len;
> > +    acb->has_returned = false;
> > +
> > +    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
> > +    bdrv_coroutine_enter(blk_bs(blk), co);
> > +
> > +    acb->has_returned = true;
> > +    if (acb->rwco.ret != NOT_DONE) {
> > +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > +                                         blk_aio_complete_bh, acb);
> > +    }
> > +
> > +    return &acb->common;
> > +}
> > +
> > +/*
> > + * Send a zone_report command.
> > + * offset is a byte offset from the start of the device. No alignment
> > + * required for offset.
> > + * nr_zones represents IN maximum and OUT actual.
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +                                    unsigned int *nr_zones,
> > +                                    BlockZoneDescriptor *zones)
> > +{
> > +    int ret;
> > +    IO_CODE();
> > +
> > +    blk_inc_in_flight(blk); /* increase before waiting */
> > +    blk_wait_while_drained(blk);
> > +    if (!blk_is_available(blk)) {
> > +        blk_dec_in_flight(blk);
> > +        return -ENOMEDIUM;
> > +    }
> > +    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> > +    blk_dec_in_flight(blk);
> > +    return ret;
> > +}
> > +
> > +/*
> > + * Send a zone_management command.
> > + * op is the zone operation;
> > + * offset is the byte offset from the start of the zoned device;
> > + * len is the maximum number of bytes the command should operate on. It
> > + * should be aligned with the device zone size.
> > + */
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +        int64_t offset, int64_t len)
> > +{
> > +    int ret;
> > +    IO_CODE();
> > +
> > +    blk_inc_in_flight(blk);
> > +    blk_wait_while_drained(blk);
> > +
> > +    ret = blk_check_byte_request(blk, offset, len);
> > +    if (ret < 0) {
> > +        blk_dec_in_flight(blk);
> > +        return ret;
> > +    }
> > +
> > +    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> > +    blk_dec_in_flight(blk);
> > +    return ret;
> > +}
> > +
> >  void blk_drain(BlockBackend *blk)
> >  {
> >      BlockDriverState *bs = blk_bs(blk);
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 43c59c6d56..b6d88db208 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -68,6 +68,9 @@
> >  #include <sys/param.h>
> >  #include <sys/syscall.h>
> >  #include <sys/vfs.h>
> > +#if defined(CONFIG_BLKZONED)
> > +#include <linux/blkzoned.h>
> > +#endif
> >  #include <linux/cdrom.h>
> >  #include <linux/fd.h>
> >  #include <linux/fs.h>
> > @@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
> >              PreallocMode prealloc;
> >              Error **errp;
> >          } truncate;
> > +        struct {
> > +            unsigned int *nr_zones;
> > +            BlockZoneDescriptor *zones;
> > +        } zone_report;
> > +        struct {
> > +            unsigned long op;
> > +        } zone_mgmt;
> >      };
> >  } RawPosixAIOData;
> >
> > @@ -1351,6 +1361,50 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >          zoned = BLK_Z_NONE;
> >      }
> >      bs->bl.zoned = zoned;
> > +    if (zoned != BLK_Z_NONE) {
> > +        /*
> > +         * The zoned device must at least have zone size and nr_zones fields.
> > +         */
> > +        ret = get_sysfs_long_val(&st, "chunk_sectors");
> > +        if (ret < 0) {
> > +            error_setg_errno(errp, -ret, "Unable to read chunk_sectors "
> > +                                         "sysfs attribute");
> > +            goto out;
> > +        } else if (!ret) {
> > +            error_setg(errp, "Read 0 from chunk_sectors sysfs attribute");
> > +            goto out;
> > +        }
> > +        bs->bl.zone_size = ret << BDRV_SECTOR_BITS;
> > +
> > +        ret = get_sysfs_long_val(&st, "nr_zones");
> > +        if (ret < 0) {
> > +            error_setg_errno(errp, -ret, "Unable to read nr_zones "
> > +                                         "sysfs attribute");
> > +            goto out;
> > +        } else if (!ret) {
> > +            error_setg(errp, "Read 0 from nr_zones sysfs attribute");
> > +            goto out;
> > +        }
> > +        bs->bl.nr_zones = ret;
> > +
> > +        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> > +        if (ret > 0) {
> > +            bs->bl.max_append_sectors = ret >> BDRV_SECTOR_BITS;
> > +        }
> > +
> > +        ret = get_sysfs_long_val(&st, "max_open_zones");
> > +        if (ret >= 0) {
> > +            bs->bl.max_open_zones = ret;
> > +        }
> > +
> > +        ret = get_sysfs_long_val(&st, "max_active_zones");
> > +        if (ret >= 0) {
> > +            bs->bl.max_active_zones = ret;
> > +        }
> > +        return;
> > +    }
> > +out:
> > +    bs->bl.zoned = BLK_Z_NONE;
> >  }
> >
> >  static int check_for_dasd(int fd)
> > @@ -1364,6 +1418,23 @@ static int check_for_dasd(int fd)
> >  #endif
> >  }
> >
> > +#if defined(CONFIG_BLKZONED)
> > +/**
> > + * Zoned storage needs to be virtualized with the correct physical block size
> > + * and logical block size.
> > + */
> > +static int hdev_probe_zoned_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
>
> The #ifdef approach in this patch won't work because the same
> BlockDriver now handles both zoned and non-zoned devices at runtime.
> This function needs to be unified with hdev_probe_blocksizes():
>
>   if (check_for_dasd(s->fd) < 0 || bs->bl.zoned == BLK_Z_NONE) {
>       return -ENOTSUP;
>   }
>
>   ...probe block sizes...
>
> > +{
> > +    BDRVRawState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = probe_logical_blocksize(s->fd, &bsz->log);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +    return probe_physical_blocksize(s->fd, &bsz->phys);
> > +}
> > +#else
> >  /**
> >   * Try to get @bs's logical and physical block size.
> >   * On success, store them in @bsz and return zero.
> > @@ -1384,6 +1455,7 @@ static int hdev_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
> >      }
> >      return probe_physical_blocksize(s->fd, &bsz->phys);
> >  }
> > +#endif
> >
> >  /**
> >   * Try to get @bs's geometry: cyls, heads, sectors.
> > @@ -1844,6 +1916,146 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
> >  }
> >  #endif
> >
> > +/*
> > + * parse_zone - Fill a zone descriptor
> > + */
> > +#if defined(CONFIG_BLKZONED)
> > +static inline int parse_zone(struct BlockZoneDescriptor *zone,
> > +                              const struct blk_zone *blkz) {
> > +    zone->start = blkz->start << BDRV_SECTOR_BITS;
> > +    zone->length = blkz->len << BDRV_SECTOR_BITS;
> > +    zone->wp = blkz->wp << BDRV_SECTOR_BITS;
> > +
> > +#ifdef HAVE_BLK_ZONE_REP_CAPACITY
> > +    zone->cap = blkz->capacity << BDRV_SECTOR_BITS;
> > +#else
> > +    zone->cap = blkz->len << BDRV_SECTOR_BITS;
> > +#endif
> > +
> > +    switch (blkz->type) {
> > +    case BLK_ZONE_TYPE_SEQWRITE_REQ:
> > +        zone->type = BLK_ZT_SWR;
> > +        break;
> > +    case BLK_ZONE_TYPE_SEQWRITE_PREF:
> > +        zone->type = BLK_ZT_SWP;
> > +        break;
> > +    case BLK_ZONE_TYPE_CONVENTIONAL:
> > +        zone->type = BLK_ZT_CONV;
> > +        break;
> > +    default:
> > +        error_report("Unsupported zone type: 0x%x", blkz->type);
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    switch (blkz->cond) {
> > +    case BLK_ZONE_COND_NOT_WP:
> > +        zone->state = BLK_ZS_NOT_WP;
> > +        break;
> > +    case BLK_ZONE_COND_EMPTY:
> > +        zone->state = BLK_ZS_EMPTY;
> > +        break;
> > +    case BLK_ZONE_COND_IMP_OPEN:
> > +        zone->state = BLK_ZS_IOPEN;
> > +        break;
> > +    case BLK_ZONE_COND_EXP_OPEN:
> > +        zone->state = BLK_ZS_EOPEN;
> > +        break;
> > +    case BLK_ZONE_COND_CLOSED:
> > +        zone->state = BLK_ZS_CLOSED;
> > +        break;
> > +    case BLK_ZONE_COND_READONLY:
> > +        zone->state = BLK_ZS_RDONLY;
> > +        break;
> > +    case BLK_ZONE_COND_FULL:
> > +        zone->state = BLK_ZS_FULL;
> > +        break;
> > +    case BLK_ZONE_COND_OFFLINE:
> > +        zone->state = BLK_ZS_OFFLINE;
> > +        break;
> > +    default:
> > +        error_report("Unsupported zone state: 0x%x", blkz->cond);
> > +        return -ENOTSUP;
> > +    }
> > +    return 0;
> > +}
> > +#endif
> > +
> > +#if defined(CONFIG_BLKZONED)
> > +static int handle_aiocb_zone_report(void *opaque)
> > +{
> > +    RawPosixAIOData *aiocb = opaque;
> > +    int fd = aiocb->aio_fildes;
> > +    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
> > +    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> > +    /* zoned block devices use 512-byte sectors */
> > +    uint64_t sector = aiocb->aio_offset / 512;
> > +
> > +    struct blk_zone *blkz;
> > +    size_t rep_size;
> > +    unsigned int nrz;
> > +    int ret, n = 0, i = 0;
> > +
> > +    nrz = *nr_zones;
> > +    rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> > +    g_autofree struct blk_zone_report *rep = NULL;
> > +    rep = g_malloc(rep_size);
> > +
> > +    blkz = (struct blk_zone *)(rep + 1);
> > +    while (n < nrz) {
> > +        memset(rep, 0, rep_size);
> > +        rep->sector = sector;
> > +        rep->nr_zones = nrz - n;
> > +
> > +        do {
> > +            ret = ioctl(fd, BLKREPORTZONE, rep);
> > +        } while (ret != 0 && errno == EINTR);
> > +        if (ret != 0) {
> > +            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> > +                         fd, sector, errno);
> > +            return -errno;
> > +        }
> > +
> > +        if (!rep->nr_zones) {
> > +            break;
> > +        }
> > +
> > +        for (i = 0; i < rep->nr_zones; i++, n++) {
> > +            ret = parse_zone(&zones[n], &blkz[i]);
> > +            if (ret != 0) {
> > +                return ret;
> > +            }
> > +
> > +            /* The next report should start after the last zone reported */
> > +            sector = blkz[i].start + blkz[i].len;
> > +        }
> > +    }
> > +
> > +    *nr_zones = n;
> > +    return 0;
> > +}
> > +#endif
> > +
> > +#if defined(CONFIG_BLKZONED)
> > +static int handle_aiocb_zone_mgmt(void *opaque)
> > +{
> > +    RawPosixAIOData *aiocb = opaque;
> > +    int fd = aiocb->aio_fildes;
> > +    uint64_t sector = aiocb->aio_offset / 512;
> > +    int64_t nr_sectors = aiocb->aio_nbytes / 512;
> > +    struct blk_zone_range range;
> > +    int ret;
> > +
> > +    /* Execute the operation */
> > +    range.sector = sector;
> > +    range.nr_sectors = nr_sectors;
> > +    do {
> > +        ret = ioctl(fd, aiocb->zone_mgmt.op, &range);
> > +    } while (ret != 0 && errno == EINTR);
> > +
> > +    return ret;
> > +}
> > +#endif
> > +
> >  static int handle_aiocb_copy_range(void *opaque)
> >  {
> >      RawPosixAIOData *aiocb = opaque;
> > @@ -3035,6 +3247,107 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
> >      }
> >  }
> >
> > +/*
> > + * zone report - Get a zone block device's information in the form
> > + * of an array of zone descriptors.
> > + * zones is an array of zone descriptors to hold zone information on reply;
> > + * offset can be any byte within the entire size of the device;
> > + * nr_zones is the maximum number of zones the command should report.
> > + */
> > +#if defined(CONFIG_BLKZONED)
> > +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +                                           unsigned int *nr_zones,
> > +                                           BlockZoneDescriptor *zones) {
> > +    BDRVRawState *s = bs->opaque;
> > +    RawPosixAIOData acb;
> > +
> > +    acb = (RawPosixAIOData) {
> > +        .bs         = bs,
> > +        .aio_fildes = s->fd,
> > +        .aio_type   = QEMU_AIO_ZONE_REPORT,
> > +        .aio_offset = offset,
> > +        .zone_report    = {
> > +            .nr_zones       = nr_zones,
> > +            .zones          = zones,
> > +        },
> > +    };
> > +
> > +    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
> > +}
> > +#endif
> > +
> > +/*
> > + * zone management operations - Execute an operation on a zone
> > + */
> > +#if defined(CONFIG_BLKZONED)
> > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +        int64_t offset, int64_t len) {
> > +    BDRVRawState *s = bs->opaque;
> > +    RawPosixAIOData acb;
> > +    int64_t zone_size, zone_size_mask;
> > +    const char *op_name;
> > +    unsigned long zo;
> > +    int ret;
> > +    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
> > +
> > +    zone_size = bs->bl.zone_size;
> > +    zone_size_mask = zone_size - 1;
> > +    if (offset & zone_size_mask) {
> > +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> > +                     "%" PRId64 "", offset / 512, zone_size / 512);
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (((offset + len) < capacity && len & zone_size_mask) ||
> > +        offset + len > capacity) {
> > +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> > +                      " %" PRId64 "", len / 512, zone_size / 512);
> > +        return -EINVAL;
> > +    }
> > +
> > +    switch (op) {
> > +    case BLK_ZO_OPEN:
> > +        op_name = "BLKOPENZONE";
> > +        zo = BLKOPENZONE;
> > +        break;
> > +    case BLK_ZO_CLOSE:
> > +        op_name = "BLKCLOSEZONE";
> > +        zo = BLKCLOSEZONE;
> > +        break;
> > +    case BLK_ZO_FINISH:
> > +        op_name = "BLKFINISHZONE";
> > +        zo = BLKFINISHZONE;
> > +        break;
> > +    case BLK_ZO_RESET:
> > +        op_name = "BLKRESETZONE";
> > +        zo = BLKRESETZONE;
> > +        break;
> > +    default:
> > +        error_report("Unsupported zone op: 0x%x", op);
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    acb = (RawPosixAIOData) {
> > +        .bs             = bs,
> > +        .aio_fildes     = s->fd,
> > +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> > +        .aio_offset     = offset,
> > +        .aio_nbytes     = len,
> > +        .zone_mgmt  = {
> > +            .op = zo,
> > +        },
> > +    };
> > +
> > +    ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> > +    if (ret != 0) {
> > +        ret = -errno;
> > +        error_report("ioctl %s failed %d", op_name, ret);
> > +    }
> > +
> > +    return ret;
> > +}
> > +#endif
> > +
> >  static coroutine_fn int
> >  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
> >                  bool blkdev)
> > @@ -3756,13 +4069,23 @@ static BlockDriver bdrv_host_device = {
> >      .bdrv_check_perm = raw_check_perm,
> >      .bdrv_set_perm   = raw_set_perm,
> >      .bdrv_abort_perm_update = raw_abort_perm_update,
> > +#ifndef CONFIG_BLKZONED
> >      .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> > +#endif
> >      .bdrv_probe_geometry = hdev_probe_geometry,
> >
> >      /* generic scsi device */
> >  #ifdef __linux__
> >      .bdrv_co_ioctl          = hdev_co_ioctl,
> >  #endif
> > +
> > +    /* zoned device */
> > +#if defined(CONFIG_BLKZONED)
> > +    /* zone management operations */
> > +    .bdrv_probe_blocksizes = hdev_probe_zoned_blocksizes,
> > +    .bdrv_co_zone_report = raw_co_zone_report,
> > +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> > +#endif
> >  };
> >
> >  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> > diff --git a/block/io.c b/block/io.c
> > index a09a19f7a7..1586e42ab9 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -3099,6 +3099,47 @@ out:
> >      return co.ret;
> >  }
> >
> > +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +                        unsigned int *nr_zones,
> > +                        BlockZoneDescriptor *zones)
> > +{
> > +    BlockDriver *drv = bs->drv;
> > +    CoroutineIOCompletion co = {
> > +            .coroutine = qemu_coroutine_self(),
> > +    };
> > +    IO_CODE();
> > +
> > +    bdrv_inc_in_flight(bs);
> > +    if (!drv || !drv->bdrv_co_zone_report) {
>
> Now that zoned device support is determined at runtime instead of at
> compile-time, checking for drv->bdrv_co_zone_report isn't enough. The
> BlockDriverState might have bs->bl.zoned == BLK_Z_NONE.
>
> Please add || bs->bl.zoned == BLK_Z_NONE to this if statement to
> prevent calls when the device is not zoned.
>
> The same applies to bdrv_co_zone_mgmt().

I see. Thanks!
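
For reference, here is a minimal standalone sketch of the guard described
above, so the runtime check can be tried outside the QEMU tree. The types are
simplified stand-ins and not QEMU's real definitions (the enum values, the
reduced BlockDriver and BlockDriverState layouts are assumptions), and
zone_report_guarded() and fake_zone_report() are hypothetical names used only
for illustration. The same condition would presumably apply to the
bdrv_co_zone_mgmt() path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU types; only what the guard needs. */
typedef enum {
    BLK_Z_NONE, /* regular (non-zoned) block device */
    BLK_Z_HM,   /* host-managed zoned device */
    BLK_Z_HA    /* host-aware zoned device */
} BlockZoneModel;

typedef struct BlockDriverState BlockDriverState;

typedef struct BlockDriver {
    /* Hypothetical reduced signature; the real callback takes more args. */
    int (*bdrv_co_zone_report)(BlockDriverState *bs);
} BlockDriver;

struct BlockDriverState {
    BlockDriver *drv;
    struct {
        BlockZoneModel zoned;
    } bl;
};

/* Hypothetical driver callback that always succeeds. */
static int fake_zone_report(BlockDriverState *bs)
{
    (void)bs;
    return 0;
}

/*
 * The driver callback existing is not sufficient: one driver serves both
 * zoned and non-zoned devices, so the zone model is checked at runtime too.
 */
static int zone_report_guarded(BlockDriverState *bs)
{
    assert(bs != NULL);
    BlockDriver *drv = bs->drv;

    if (!drv || !drv->bdrv_co_zone_report || bs->bl.zoned == BLK_Z_NONE) {
        return -ENOTSUP;
    }
    return drv->bdrv_co_zone_report(bs);
}
```

With this shape, a non-zoned device opened through a driver that implements
the callback is rejected with -ENOTSUP before the driver is ever called.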

>
> > +        co.ret = -ENOTSUP;
> > +        goto out;
> > +    }
> > +    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
> > +out:
> > +    bdrv_dec_in_flight(bs);
> > +    return co.ret;
> > +}
> > +
> > +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +        int64_t offset, int64_t len)
> > +{
> > +    BlockDriver *drv = bs->drv;
> > +    CoroutineIOCompletion co = {
> > +            .coroutine = qemu_coroutine_self(),
> > +    };
> > +    IO_CODE();
> > +
> > +    bdrv_inc_in_flight(bs);
> > +    if (!drv || !drv->bdrv_co_zone_mgmt) {
> > +        co.ret = -ENOTSUP;
> > +        goto out;
> > +    }
> > +    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> > +out:
> > +    bdrv_dec_in_flight(bs);
> > +    return co.ret;
> > +}
> > +
> >  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> >  {
> >      IO_CODE();
> > diff --git a/include/block/block-io.h b/include/block/block-io.h
> > index 3398351596..10ff212036 100644
> > --- a/include/block/block-io.h
> > +++ b/include/block/block-io.h
> > @@ -98,6 +98,13 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
> >
> >  int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
> >                                    int64_t bytes);
> > +/* Report zone information of zone block device. */
> > +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +                                     unsigned int *nr_zones,
> > +                                     BlockZoneDescriptor *zones);
> > +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +                                   int64_t offset, int64_t len);
> > +
> >  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
> >  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> >                        int64_t bytes, int64_t *pnum, int64_t *map,
> > diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> > index 57f0612f5e..565228d8dd 100644
> > --- a/include/block/block_int-common.h
> > +++ b/include/block/block_int-common.h
> > @@ -703,6 +703,12 @@ struct BlockDriver {
> >      int coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_load_vmstate)(
> >          BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
> >
> > +    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
> > +            int64_t offset, unsigned int *nr_zones,
> > +            BlockZoneDescriptor *zones);
> > +    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
> > +            int64_t offset, int64_t len);
> > +
> >      /* removable device specific */
> >      bool (*bdrv_is_inserted)(BlockDriverState *bs);
> >      void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> > @@ -839,6 +845,21 @@ typedef struct BlockLimits {
> >
> >      /* device zone model */
> >      BlockZoneModel zoned;
> > +
> > +    /* zone size expressed in bytes */
> > +    uint32_t zone_size;
> > +
> > +    /* total number of zones */
> > +    uint32_t nr_zones;
> > +
> > +    /* maximum sectors of a zone append write operation */
> > +    int64_t max_append_sectors;
> > +
> > +    /* maximum number of open zones */
> > +    int64_t max_open_zones;
> > +
> > +    /* maximum number of active zones */
> > +    int64_t max_active_zones;
> >  } BlockLimits;
> >
> >  typedef struct BdrvOpBlocker BdrvOpBlocker;
> > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > index f8cda9df91..eda6a7a253 100644
> > --- a/include/block/raw-aio.h
> > +++ b/include/block/raw-aio.h
> > @@ -28,6 +28,8 @@
> >  #define QEMU_AIO_WRITE_ZEROES 0x0020
> >  #define QEMU_AIO_COPY_RANGE   0x0040
> >  #define QEMU_AIO_TRUNCATE     0x0080
> > +#define QEMU_AIO_ZONE_REPORT  0x0100
> > +#define QEMU_AIO_ZONE_MGMT    0x0200
> >  #define QEMU_AIO_TYPE_MASK \
> >          (QEMU_AIO_READ | \
> >           QEMU_AIO_WRITE | \
> > @@ -36,7 +38,9 @@
> >           QEMU_AIO_DISCARD | \
> >           QEMU_AIO_WRITE_ZEROES | \
> >           QEMU_AIO_COPY_RANGE | \
> > -         QEMU_AIO_TRUNCATE)
> > +         QEMU_AIO_TRUNCATE | \
> > +         QEMU_AIO_ZONE_REPORT | \
> > +         QEMU_AIO_ZONE_MGMT)
> >
> >  /* AIO flags */
> >  #define QEMU_AIO_MISALIGNED   0x1000
> > diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> > index 031a27ba10..dc8a4368f0 100644
> > --- a/include/sysemu/block-backend-io.h
> > +++ b/include/sysemu/block-backend-io.h
> > @@ -46,6 +46,13 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
> >                              BlockCompletionFunc *cb, void *opaque);
> >  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
> >                            BlockCompletionFunc *cb, void *opaque);
> > +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> > +                                unsigned int *nr_zones,
> > +                                BlockZoneDescriptor *zones,
> > +                                BlockCompletionFunc *cb, void *opaque);
> > +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                              int64_t offset, int64_t len,
> > +                              BlockCompletionFunc *cb, void *opaque);
> >  BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
> >                               BlockCompletionFunc *cb, void *opaque);
> >  void blk_aio_cancel_async(BlockAIOCB *acb);
> > @@ -166,6 +173,17 @@ int co_wrapper_mixed blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> >  int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> >                                        int64_t bytes, BdrvRequestFlags flags);
> >
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +                                    unsigned int *nr_zones,
> > +                                    BlockZoneDescriptor *zones);
> > +int co_wrapper_mixed blk_zone_report(BlockBackend *blk, int64_t offset,
> > +                                         unsigned int *nr_zones,
> > +                                         BlockZoneDescriptor *zones);
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                                  int64_t offset, int64_t len);
> > +int co_wrapper_mixed blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                                       int64_t offset, int64_t len);
> > +
> >  int co_wrapper_mixed blk_pdiscard(BlockBackend *blk, int64_t offset,
> >                                    int64_t bytes);
> >  int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> > diff --git a/meson.build b/meson.build
> > index 6d3b665629..a267f74536 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -1962,6 +1962,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('replication').allowed())
> >  # has_header
> >  config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
> >  config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
> > +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
> >  config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
> >  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
> >  config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
> > @@ -2056,6 +2057,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
> >  config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM',
> >                       cc.has_member('struct stat', 'st_atim',
> >                                     prefix: '#include <sys/stat.h>'))
> > +config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY',
> > +                     cc.has_member('struct blk_zone', 'capacity',
> > +                                   prefix: '#include <linux/blkzoned.h>'))
> >
> >  # has_type
> >  config_host_data.set('CONFIG_IOVEC',
> > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> > index 952dc940f1..3a3bad77c3 100644
> > --- a/qemu-io-cmds.c
> > +++ b/qemu-io-cmds.c
> > @@ -1712,6 +1712,150 @@ static const cmdinfo_t flush_cmd = {
> >      .oneline    = "flush all in-core file state to disk",
> >  };
> >
> > +static inline int64_t tosector(int64_t bytes)
> > +{
> > +    return bytes >> BDRV_SECTOR_BITS;
> > +}
> > +
> > +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset;
> > +    unsigned int nr_zones;
> > +
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    nr_zones = cvtnum(argv[optind]);
> > +
> > +    g_autofree BlockZoneDescriptor *zones = NULL;
> > +    zones = g_new(BlockZoneDescriptor, nr_zones);
> > +    ret = blk_zone_report(blk, offset, &nr_zones, zones);
> > +    if (ret < 0) {
> > +        printf("zone report failed: %s\n", strerror(-ret));
> > +    } else {
> > +        for (int i = 0; i < nr_zones; ++i) {
> > +            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
> > +                   "cap 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
> > +                   "zcond:%u, [type: %u]\n",
> > +                    tosector(zones[i].start), tosector(zones[i].length),
> > +                    tosector(zones[i].cap), tosector(zones[i].wp),
> > +                    zones[i].state, zones[i].type);
> > +        }
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_report_cmd = {
> > +    .name = "zone_report",
> > +    .altname = "zrp",
> > +    .cfunc = zone_report_f,
> > +    .argmin = 2,
> > +    .argmax = 2,
> > +    .args = "offset number",
> > +    .oneline = "report zone information",
> > +};
> > +
> > +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone open failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_open_cmd = {
> > +    .name = "zone_open",
> > +    .altname = "zo",
> > +    .cfunc = zone_open_f,
> > +    .argmin = 2,
> > +    .argmax = 2,
> > +    .args = "offset len",
> > +    .oneline = "explicitly open a range of zones in zone block device",
> > +};
> > +
> > +static int zone_close_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone close failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_close_cmd = {
> > +    .name = "zone_close",
> > +    .altname = "zc",
> > +    .cfunc = zone_close_f,
> > +    .argmin = 2,
> > +    .argmax = 2,
> > +    .args = "offset len",
> > +    .oneline = "close a range of zones in zone block device",
> > +};
> > +
> > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone finish failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_finish_cmd = {
> > +    .name = "zone_finish",
> > +    .altname = "zf",
> > +    .cfunc = zone_finish_f,
> > +    .argmin = 2,
> > +    .argmax = 2,
> > +    .args = "offset len",
> > +    .oneline = "finish a range of zones in zone block device",
> > +};
> > +
> > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone reset failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_reset_cmd = {
> > +    .name = "zone_reset",
> > +    .altname = "zrs",
> > +    .cfunc = zone_reset_f,
> > +    .argmin = 2,
> > +    .argmax = 2,
> > +    .args = "offset len",
> > +    .oneline = "reset a zone write pointer in zone block device",
> > +};
> > +
> >  static int truncate_f(BlockBackend *blk, int argc, char **argv);
> >  static const cmdinfo_t truncate_cmd = {
> >      .name       = "truncate",
> > @@ -2504,6 +2648,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
> >      qemuio_add_command(&aio_write_cmd);
> >      qemuio_add_command(&aio_flush_cmd);
> >      qemuio_add_command(&flush_cmd);
> > +    qemuio_add_command(&zone_report_cmd);
> > +    qemuio_add_command(&zone_open_cmd);
> > +    qemuio_add_command(&zone_close_cmd);
> > +    qemuio_add_command(&zone_finish_cmd);
> > +    qemuio_add_command(&zone_reset_cmd);
> >      qemuio_add_command(&truncate_cmd);
> >      qemuio_add_command(&length_cmd);
> >      qemuio_add_command(&info_cmd);
> > --
> > 2.38.1
> >
> >
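
The zone_report command quoted above prints zone fields in sectors rather
than bytes. A minimal sketch of that conversion, assuming BDRV_SECTOR_BITS
is 9 (512-byte sectors) as in QEMU's block headers; zone_start_bytes() and
the uniform zone size are hypothetical and only serve the illustration:

```c
#include <stdint.h>

/* Assumption: 512-byte sectors, matching QEMU's BDRV_SECTOR_BITS. */
#define BDRV_SECTOR_BITS 9

/* Same helper as in the quoted patch: convert a byte count to sectors. */
static inline int64_t tosector(int64_t bytes)
{
    return bytes >> BDRV_SECTOR_BITS;
}

/*
 * Hypothetical helper: byte offset of zone 'idx' on a device whose zones
 * all have the same size. Real devices report this via zone report.
 */
static inline int64_t zone_start_bytes(int64_t zone_size, unsigned int idx)
{
    return zone_size * (int64_t)idx;
}
```

With 256 MiB zones, for instance, the second zone would be expected to
show up with a start of 0x80000 sectors in the report.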



* Re: [PATCH v15 8/8] docs/zoned-storage: add zoned device documentation
  2023-01-29 10:28 ` [PATCH v15 8/8] docs/zoned-storage: add zoned device documentation Sam Li
@ 2023-02-06 12:16   ` Stefan Hajnoczi
  2023-02-06 12:18     ` Sam Li
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2023-02-06 12:16 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Kevin Wolf,
	Paolo Bonzini, Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé

On Sun, 29 Jan 2023 at 05:31, Sam Li <faithilikerun@gmail.com> wrote:
>
> Add the documentation about the zoned device support to virtio-blk
> emulation.
>
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  docs/devel/zoned-storage.rst           | 43 ++++++++++++++++++++++++++
>  docs/system/qemu-block-drivers.rst.inc |  6 ++++
>  2 files changed, 49 insertions(+)
>  create mode 100644 docs/devel/zoned-storage.rst

This patch uses the old "zoned_host_device" BlockDriver name. Please
update it to "host_device".

It's probably a good idea to search the patches for zoned_host_device
to find any other places that need to be updated. That can be done
with git log -p master.. | grep zoned_host_device.

> diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
> new file mode 100644
> index 0000000000..03e52efe2e
> --- /dev/null
> +++ b/docs/devel/zoned-storage.rst
> @@ -0,0 +1,43 @@
> +=============
> +zoned-storage
> +=============
> +
> +Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
> +that are larger than the LBA size. They can only allow sequential writes, which
> +can reduce write amplification in SSDs, and potentially lead to higher
> +throughput and increased capacity. More details about ZBDs can be found at:
> +
> +https://zonedstorage.io/docs/introduction/zoned-storage
> +
> +1. Block layer APIs for zoned storage
> +-------------------------------------
> +QEMU block layer supports three zoned storage models:
> +- BLK_Z_HM: The host-managed zoned model only allows sequential write access
> +to zones. It supports ZBD-specific I/O commands that can be used by a host to
> +manage the zones of a device.
> +- BLK_Z_HA: The host-aware zoned model allows random write operations in
> +zones, making it backward compatible with regular block devices.
> +- BLK_Z_NONE: The non-zoned model has no zones support. It includes both
> +regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
> +supported.
> +
> +The block device information resides inside BlockDriverState. QEMU uses the
> +BlockLimits struct (BlockDriverState::bl), which is continuously accessed by the
> +block layer while processing I/O requests. A BlockBackend has a root pointer to
> +a BlockDriverState graph (for example, raw format on top of file-posix). The
> +zoned storage information can be propagated from the leaf BlockDriverState all
> +the way up to the BlockBackend. If the zoned storage model in file-posix is
> +set to BLK_Z_HM, then block drivers will declare support for zoned host device.
> +
> +The block layer APIs support commands needed for zoned storage devices,
> +including report zones, four zone operations, and zone append.
> +
> +2. Emulating zoned storage controllers
> +--------------------------------------
> +When the BlockBackend's BlockLimits model reports a zoned storage device, users
> +like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
> +APIs for zoned storage emulation or testing.
> +
> +For example, to test zone_report on a null_blk device using qemu-io:
> +$ path/to/qemu-io --image-opts -n driver=zoned_host_device,filename=/dev/nullb0
> +-c "zrp offset nr_zones"
> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index dfe5d2293d..0b97227fd9 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -430,6 +430,12 @@ Hard disks
>    you may corrupt your host data (use the ``-snapshot`` command
>    line option or modify the device permissions accordingly).
>
> +Zoned block devices
> +  Zoned block devices can be passed through to the guest if the emulated storage
> +  controller supports zoned storage. Use ``--blockdev zoned_host_device,
> +  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
> +  as ``drive0``.
> +
>  Windows
>  ^^^^^^^
>
> --
> 2.38.1
>
>



* Re: [PATCH v15 8/8] docs/zoned-storage: add zoned device documentation
  2023-02-06 12:16   ` Stefan Hajnoczi
@ 2023-02-06 12:18     ` Sam Li
  0 siblings, 0 replies; 17+ messages in thread
From: Sam Li @ 2023-02-06 12:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Kevin Wolf,
	Paolo Bonzini, Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé

On Mon, Feb 6, 2023 at 20:16, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Sun, 29 Jan 2023 at 05:31, Sam Li <faithilikerun@gmail.com> wrote:
> >
> > Add the documentation about the zoned device support to virtio-blk
> > emulation.
> >
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  docs/devel/zoned-storage.rst           | 43 ++++++++++++++++++++++++++
> >  docs/system/qemu-block-drivers.rst.inc |  6 ++++
> >  2 files changed, 49 insertions(+)
> >  create mode 100644 docs/devel/zoned-storage.rst
>
> This patch uses the old "zoned_host_device" BlockDriver name. Please
> update it to "host_device".
>
> It's probably a good idea to search the patches for zoned_host_device
> to find any other places that need to be updated. That can be done
> with git log -p master.. | grep zoned_host_device.

Sorry for missing that. Will do.

>
> > diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
> > new file mode 100644
> > index 0000000000..03e52efe2e
> > --- /dev/null
> > +++ b/docs/devel/zoned-storage.rst
> > @@ -0,0 +1,43 @@
> > +=============
> > +zoned-storage
> > +=============
> > +
> > +Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
> > +that are larger than the LBA size. They can only allow sequential writes, which
> > +can reduce write amplification in SSDs, and potentially lead to higher
> > +throughput and increased capacity. More details about ZBDs can be found at:
> > +
> > +https://zonedstorage.io/docs/introduction/zoned-storage
> > +
> > +1. Block layer APIs for zoned storage
> > +-------------------------------------
> > +QEMU block layer supports three zoned storage models:
> > +- BLK_Z_HM: The host-managed zoned model only allows sequential write access
> > +to zones. It supports ZBD-specific I/O commands that can be used by a host to
> > +manage the zones of a device.
> > +- BLK_Z_HA: The host-aware zoned model allows random write operations in
> > +zones, making it backward compatible with regular block devices.
> > +- BLK_Z_NONE: The non-zoned model has no zones support. It includes both
> > +regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
> > +supported.
> > +
> > +The block device information resides inside BlockDriverState. QEMU uses the
> > +BlockLimits struct (BlockDriverState::bl), which is continuously accessed by the
> > +block layer while processing I/O requests. A BlockBackend has a root pointer to
> > +a BlockDriverState graph (for example, raw format on top of file-posix). The
> > +zoned storage information can be propagated from the leaf BlockDriverState all
> > +the way up to the BlockBackend. If the zoned storage model in file-posix is
> > +set to BLK_Z_HM, then block drivers will declare support for zoned host device.
> > +
> > +The block layer APIs support commands needed for zoned storage devices,
> > +including report zones, four zone operations, and zone append.
> > +
> > +2. Emulating zoned storage controllers
> > +--------------------------------------
> > +When the BlockBackend's BlockLimits model reports a zoned storage device, users
> > +like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
> > +APIs for zoned storage emulation or testing.
> > +
> > +For example, to test zone_report on a null_blk device using qemu-io:
> > +$ path/to/qemu-io --image-opts -n driver=zoned_host_device,filename=/dev/nullb0
> > +-c "zrp offset nr_zones"
> > diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> > index dfe5d2293d..0b97227fd9 100644
> > --- a/docs/system/qemu-block-drivers.rst.inc
> > +++ b/docs/system/qemu-block-drivers.rst.inc
> > @@ -430,6 +430,12 @@ Hard disks
> >    you may corrupt your host data (use the ``-snapshot`` command
> >    line option or modify the device permissions accordingly).
> >
> > +Zoned block devices
> > +  Zoned block devices can be passed through to the guest if the emulated storage
> > +  controller supports zoned storage. Use ``--blockdev zoned_host_device,
> > +  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
> > +  as ``drive0``.
> > +
> >  Windows
> >  ^^^^^^^
> >
> > --
> > 2.38.1
> >
> >



* Re: [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-01-29 10:28 ` [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
  2023-02-06 12:04   ` Stefan Hajnoczi
@ 2023-02-27 18:20   ` Kevin Wolf
  2023-02-27 19:14     ` Stefan Hajnoczi
  1 sibling, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2023-02-27 18:20 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, qemu-block, Stefan Hajnoczi, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé

On 29 Jan 2023 at 11:28, Sam Li wrote:
> Add a zoned device option to the host_device BlockDriver. It will be presented
> only for zoned host block devices. By adding zone management operations to the
> host_device BlockDriver, users can use the new block layer APIs, including
> Report Zones and the zone management operations
> (open, close, finish, reset, reset_all).
>
> qemu-io uses the new APIs to issue zoned storage commands to the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
>
> For example, to test zone_report, use the following command:
> $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block/block-backend.c             | 147 ++++++++++++++
>  block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
>  block/io.c                        |  41 ++++
>  include/block/block-io.h          |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h           |   6 +-
>  include/sysemu/block-backend-io.h |  18 ++
>  meson.build                       |   4 +
>  qemu-io-cmds.c                    | 149 ++++++++++++++
>  9 files changed, 715 insertions(+), 1 deletion(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index ba7bf1d6bc..a4847b9131 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
>      void *iobuf;
>      int ret;
>      BdrvRequestFlags flags;
> +    union {
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            unsigned long op;
> +        } zone_mgmt;
> +    };
>  } BlkRwCo;

Should we use a different struct for blk_aio_zone_*() so that we don't
need to touch the one for the normal I/O path? My concern is that
increasing the size of the struct (currently 32 bytes) might negatively
impact the performance even of non-zoned devices. Maybe it turns out
that it wasn't really necessary in the end (have we done any
benchmarks?), but I don't think it can hurt anyway.
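
To put rough numbers on the size concern, here is a minimal sketch that
compares the struct with and without the union. All types are stand-ins
(the real QEMU definitions are not reproduced here), and the names
BlkRwCoBefore, BlkRwCoAfter and blk_rwco_growth() are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; assumptions, not the real QEMU declarations. */
typedef struct BlockBackend BlockBackend;
typedef struct BlockZoneDescriptor BlockZoneDescriptor;
typedef unsigned int BdrvRequestFlags; /* an enum in QEMU; int-sized here */

/* BlkRwCo as quoted, before the patch. */
typedef struct BlkRwCoBefore {
    BlockBackend *blk;
    int64_t offset;
    void *iobuf;
    int ret;
    BdrvRequestFlags flags;
} BlkRwCoBefore;

/* BlkRwCo with the union that the patch adds. */
typedef struct BlkRwCoAfter {
    BlockBackend *blk;
    int64_t offset;
    void *iobuf;
    int ret;
    BdrvRequestFlags flags;
    union {
        struct {
            unsigned int *nr_zones;
            BlockZoneDescriptor *zones;
        } zone_report;
        struct {
            unsigned long op;
        } zone_mgmt;
    };
} BlkRwCoAfter;

/* Bytes that the union adds to every request, zoned or not. */
static size_t blk_rwco_growth(void)
{
    return sizeof(BlkRwCoAfter) - sizeof(BlkRwCoBefore);
}
```

On a typical 64-bit Linux ABI the first struct should come out at 32 bytes
and the union should add the size of two pointers; other ABIs may differ.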

With this changed, you can add to the series:
Acked-by: Kevin Wolf <kwolf@redhat.com>

Kevin




* Re: [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-02-27 18:20   ` Kevin Wolf
@ 2023-02-27 19:14     ` Stefan Hajnoczi
  2023-02-28 11:54       ` Kevin Wolf
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2023-02-27 19:14 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Sam Li, qemu-devel, qemu-block, Paolo Bonzini, Hanna Reitz,
	dmitry.fomichev, hare, damien.lemoal, Marc-André Lureau,
	Fam Zheng, Thomas Huth, Daniel P. Berrangé,
	Philippe Mathieu-Daudé


On Mon, Feb 27, 2023 at 07:20:14PM +0100, Kevin Wolf wrote:
> On 29 Jan 2023 at 11:28, Sam Li wrote:
> > Add a zoned device option to the host_device BlockDriver. It will be presented
> > only for zoned host block devices. By adding zone management operations to the
> > host_device BlockDriver, users can use the new block layer APIs, including
> > Report Zones and the zone management operations
> > (open, close, finish, reset, reset_all).
> >
> > qemu-io uses the new APIs to issue zoned storage commands to the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use the following command:
> > $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> > 
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  block/block-backend.c             | 147 ++++++++++++++
> >  block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
> >  block/io.c                        |  41 ++++
> >  include/block/block-io.h          |   7 +
> >  include/block/block_int-common.h  |  21 ++
> >  include/block/raw-aio.h           |   6 +-
> >  include/sysemu/block-backend-io.h |  18 ++
> >  meson.build                       |   4 +
> >  qemu-io-cmds.c                    | 149 ++++++++++++++
> >  9 files changed, 715 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index ba7bf1d6bc..a4847b9131 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
> >      void *iobuf;
> >      int ret;
> >      BdrvRequestFlags flags;
> > +    union {
> > +        struct {
> > +            unsigned int *nr_zones;
> > +            BlockZoneDescriptor *zones;
> > +        } zone_report;
> > +        struct {
> > +            unsigned long op;
> > +        } zone_mgmt;
> > +    };
> >  } BlkRwCo;
> 
> Should we use a different struct for blk_aio_zone_*() so that we don't
> need to touch the one for the normal I/O path? My concern is that
> increasing the size of the struct (currently 32 bytes) might negatively
> impact the performance even of non-zoned devices. Maybe it turns out
> that it wasn't really necessary in the end (have we done any
> benchmarks?), but I don't think it can hurt anyway.
> 
> With this changed, you can add to the series:
> Acked-by: Kevin Wolf <kwolf@redhat.com>

There are unused fields in BlkRwCo and BlkAioEmAIOCB, so changing the
size of the struct isn't necessary. ioctl/flush/pdiscard already use
BlkAioEmAIOCB/BlkRwCo for non-read/write operations, including using the
iobuf field for different types, so it wouldn't be weird:

  typedef struct BlkRwCo {
      BlockBackend *blk;
      int64_t offset;
      void *iobuf;
            ^^^^^ used for preadv/pwritev qiov, ioctl buf, and NULL for
                  other request types. zone_report could put the
                  BlockZoneDescriptor pointer here. zone_mgmt could put
                  op here.
      int ret;
      BdrvRequestFlags flags;
  } BlkRwCo;

  typedef struct BlkAioEmAIOCB {
      BlockAIOCB common;
      BlkRwCo rwco;
      int64_t bytes;
              ^^^^^ zone_report could put the nr_zones pointer here
      bool has_returned;
  } BlkAioEmAIOCB;
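
A minimal sketch of that reuse, with stand-in types rather than the real
QEMU definitions. The pack_*/unpack_* helpers are hypothetical names used
only to show the casts; the BlockAIOCB member is left out for brevity:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; assumptions, not the real QEMU declarations. */
typedef struct BlockBackend BlockBackend;
typedef struct BlockZoneDescriptor BlockZoneDescriptor;

typedef struct BlkRwCo {
    BlockBackend *blk;
    int64_t offset;
    void *iobuf;
    int ret;
    unsigned int flags;
} BlkRwCo;

typedef struct BlkAioEmAIOCB {
    /* BlockAIOCB common; omitted in this sketch */
    BlkRwCo rwco;
    int64_t bytes;
    bool has_returned;
} BlkAioEmAIOCB;

/* zone_report: descriptor array goes in iobuf, nr_zones pointer in bytes. */
static void pack_zone_report(BlkAioEmAIOCB *acb, unsigned int *nr_zones,
                             BlockZoneDescriptor *zones)
{
    acb->rwco.iobuf = zones;
    acb->bytes = (int64_t)(intptr_t)nr_zones;
}

/* Recover the nr_zones pointer that was stored in the bytes field. */
static unsigned int *unpack_nr_zones(const BlkAioEmAIOCB *acb)
{
    return (unsigned int *)(intptr_t)acb->bytes;
}

/* zone_mgmt: the operation code is stored in iobuf as an integer value. */
static void pack_zone_mgmt(BlkAioEmAIOCB *acb, unsigned long op)
{
    acb->rwco.iobuf = (void *)(uintptr_t)op;
}

/* Recover the operation code from iobuf. */
static unsigned long unpack_zone_op(const BlkAioEmAIOCB *acb)
{
    return (unsigned long)(uintptr_t)acb->rwco.iobuf;
}
```

The casts go through intptr_t/uintptr_t so that the pointer-to-integer
round trip stays well defined on both 32-bit and 64-bit hosts.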

Does that sound okay?

Stefan



* Re: [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-02-27 19:14     ` Stefan Hajnoczi
@ 2023-02-28 11:54       ` Kevin Wolf
  2023-02-28 12:00         ` Sam Li
  0 siblings, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2023-02-28 11:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Sam Li, qemu-devel, qemu-block, Paolo Bonzini, Hanna Reitz,
	dmitry.fomichev, hare, damien.lemoal, Marc-André Lureau,
	Fam Zheng, Thomas Huth, Daniel P. Berrangé,
	Philippe Mathieu-Daudé


On 27 Feb 2023 at 20:14, Stefan Hajnoczi wrote:
> On Mon, Feb 27, 2023 at 07:20:14PM +0100, Kevin Wolf wrote:
> > On 29 Jan 2023 at 11:28, Sam Li wrote:
> > > Add a zoned device option to the host_device BlockDriver. It will be presented
> > > only for zoned host block devices. By adding zone management operations to the
> > > host_device BlockDriver, users can use the new block layer APIs, including
> > > Report Zones and the zone management operations
> > > (open, close, finish, reset, reset_all).
> > >
> > > qemu-io uses the new APIs to issue zoned storage commands to the device:
> > > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > > zone_finish(zf).
> > >
> > > For example, to test zone_report, use the following command:
> > > $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> > > -c "zrp offset nr_zones"
> > > 
> > > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >  block/block-backend.c             | 147 ++++++++++++++
> > >  block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
> > >  block/io.c                        |  41 ++++
> > >  include/block/block-io.h          |   7 +
> > >  include/block/block_int-common.h  |  21 ++
> > >  include/block/raw-aio.h           |   6 +-
> > >  include/sysemu/block-backend-io.h |  18 ++
> > >  meson.build                       |   4 +
> > >  qemu-io-cmds.c                    | 149 ++++++++++++++
> > >  9 files changed, 715 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/block-backend.c b/block/block-backend.c
> > > index ba7bf1d6bc..a4847b9131 100644
> > > --- a/block/block-backend.c
> > > +++ b/block/block-backend.c
> > > @@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
> > >      void *iobuf;
> > >      int ret;
> > >      BdrvRequestFlags flags;
> > > +    union {
> > > +        struct {
> > > +            unsigned int *nr_zones;
> > > +            BlockZoneDescriptor *zones;
> > > +        } zone_report;
> > > +        struct {
> > > +            unsigned long op;
> > > +        } zone_mgmt;
> > > +    };
> > >  } BlkRwCo;
> > 
> > Should we use a different struct for blk_aio_zone_*() so that we don't
> > need to touch the one for the normal I/O path? My concern is that
> > increasing the size of the struct (currently 32 bytes) might negatively
> > impact the performance even of non-zoned devices. Maybe it turns out
> > that it wasn't really necessary in the end (have we done any
> > benchmarks?), but I don't think it can hurt anyway.
> > 
> > With this changed, you can add to the series:
> > Acked-by: Kevin Wolf <kwolf@redhat.com>
> 
> There are unused fields in BlkRwCo and BlkAioEmAIOCB, so changing the
> size of the struct isn't necessary. ioctl/flush/pdiscard already use
> BlkAioEmAIOCB/BlkRwCo for non-read/write operations, including using the
> iobuf field for different types, so it wouldn't be weird:
> 
>   typedef struct BlkRwCo {
>       BlockBackend *blk;
>       int64_t offset;
>       void *iobuf;
>             ^^^^^ used for preadv/pwritev qiov, ioctl buf, and NULL for
>                   other request types. zone_report could put the
>                   BlockZoneDescriptor pointer here. zone_mgmt could put
>                   op here.
>       int ret;
>       BdrvRequestFlags flags;
>   } BlkRwCo;
> 
>   typedef struct BlkAioEmAIOCB {
>       BlockAIOCB common;
>       BlkRwCo rwco;
>       int64_t bytes;
>               ^^^^^ zone_report could put the nr_zones pointer here
>       bool has_returned;
>   } BlkAioEmAIOCB;
> 
> Does that sound okay?

Might not be great for readability, but good enough for me.

Kevin



* Re: [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2023-02-28 11:54       ` Kevin Wolf
@ 2023-02-28 12:00         ` Sam Li
  0 siblings, 0 replies; 17+ messages in thread
From: Sam Li @ 2023-02-28 12:00 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block, Paolo Bonzini,
	Hanna Reitz, dmitry.fomichev, hare, damien.lemoal,
	Marc-André Lureau, Fam Zheng, Thomas Huth,
	Daniel P. Berrangé, Philippe Mathieu-Daudé

On Tue, Feb 28, 2023 at 19:54, Kevin Wolf <kwolf@redhat.com> wrote:
>
> On 27 Feb 2023 at 20:14, Stefan Hajnoczi wrote:
> > On Mon, Feb 27, 2023 at 07:20:14PM +0100, Kevin Wolf wrote:
> > > On 29 Jan 2023 at 11:28, Sam Li wrote:
> > > > Add a zoned device option to the host_device BlockDriver. It will be presented
> > > > only for zoned host block devices. By adding zone management operations to the
> > > > host_device BlockDriver, users can use the new block layer APIs, including
> > > > Report Zones and the zone management operations
> > > > (open, close, finish, reset, reset_all).
> > > >
> > > > qemu-io uses the new APIs to issue zoned storage commands to the device:
> > > > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > > > zone_finish(zf).
> > > >
> > > > For example, to test zone_report, use the following command:
> > > > $ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
> > > > -c "zrp offset nr_zones"
> > > >
> > > > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > > > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > >  block/block-backend.c             | 147 ++++++++++++++
> > > >  block/file-posix.c                | 323 ++++++++++++++++++++++++++++++
> > > >  block/io.c                        |  41 ++++
> > > >  include/block/block-io.h          |   7 +
> > > >  include/block/block_int-common.h  |  21 ++
> > > >  include/block/raw-aio.h           |   6 +-
> > > >  include/sysemu/block-backend-io.h |  18 ++
> > > >  meson.build                       |   4 +
> > > >  qemu-io-cmds.c                    | 149 ++++++++++++++
> > > >  9 files changed, 715 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/block/block-backend.c b/block/block-backend.c
> > > > index ba7bf1d6bc..a4847b9131 100644
> > > > --- a/block/block-backend.c
> > > > +++ b/block/block-backend.c
> > > > @@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
> > > >      void *iobuf;
> > > >      int ret;
> > > >      BdrvRequestFlags flags;
> > > > +    union {
> > > > +        struct {
> > > > +            unsigned int *nr_zones;
> > > > +            BlockZoneDescriptor *zones;
> > > > +        } zone_report;
> > > > +        struct {
> > > > +            unsigned long op;
> > > > +        } zone_mgmt;
> > > > +    };
> > > >  } BlkRwCo;
> > >
> > > Should we use a different struct for blk_aio_zone_*() so that we don't
> > > need to touch the one for the normal I/O path? My concern is that
> > > increasing the size of the struct (currently 32 bytes) might negatively
> > > impact the performance even of non-zoned devices. Maybe it turns out
> > > that it wasn't really necessary in the end (have we done any
> > > benchmarks?), but I don't think it can hurt anyway.
> > >
> > > With this changed, you can add to the series:
> > > Acked-by: Kevin Wolf <kwolf@redhat.com>
> >
> > There are unused fields in BlkRwCo and BlkAioEmAIOCB, so changing the
> > size of the struct isn't necessary. ioctl/flush/pdiscard already use
> > BlkAioEmAIOCB/BlkRwCo for non-read/write operations, including using the
> > iobuf field for different types, so it wouldn't be weird:
> >
> >   typedef struct BlkRwCo {
> >       BlockBackend *blk;
> >       int64_t offset;
> >       void *iobuf;
> >             ^^^^^ used for preadv/pwritev qiov, ioctl buf, and NULL for
> >                   other request types. zone_report could put the
> >                   BlockZoneDescriptor pointer here. zone_mgmt could put
> >                   op here.
> >       int ret;
> >       BdrvRequestFlags flags;
> >   } BlkRwCo;
> >
> >   typedef struct BlkAioEmAIOCB {
> >       BlockAIOCB common;
> >       BlkRwCo rwco;
> >       int64_t bytes;
> >               ^^^^^ zone_report could put the nr_zones pointer here
> >       bool has_returned;
> >   } BlkAioEmAIOCB;
> >
> > Does that sound okay?
>
> Might not be great for readability, but good enough for me.
>
> Kevin

I see. Will change it accordingly. Thanks!
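
For reference, a minimal sketch of the field reuse Stefan describes above, so
the struct size stays unchanged. The stand-in typedefs and the helper names
(zone_report_pack, zone_report_nr_zones, zone_mgmt_pack, zone_mgmt_op) are
assumptions made for illustration and are not part of the patch or of QEMU;
BlockAIOCB is left out so the example compiles outside the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch compiles outside the QEMU tree. */
typedef struct BlockBackend BlockBackend;
typedef struct BlockZoneDescriptor BlockZoneDescriptor;
typedef int BdrvRequestFlags;

/* Same fields as the existing struct; no union is added. */
typedef struct BlkRwCo {
    BlockBackend *blk;
    int64_t offset;
    void *iobuf;            /* zone_report: zones pointer; zone_mgmt: op */
    int ret;
    BdrvRequestFlags flags;
} BlkRwCo;

typedef struct BlkAioEmAIOCB {
    /* BlockAIOCB common; omitted in this sketch */
    BlkRwCo rwco;
    int64_t bytes;          /* zone_report: nr_zones pointer */
    bool has_returned;
} BlkAioEmAIOCB;

/* The reuse only works if these values fit in the borrowed fields. */
_Static_assert(sizeof(uintptr_t) <= sizeof(int64_t),
               "a pointer must fit in the bytes field");
_Static_assert(sizeof(unsigned long) <= sizeof(void *),
               "op must fit in the iobuf field");

/* Hypothetical helper: stash the zone report arguments in existing fields. */
static void zone_report_pack(BlkAioEmAIOCB *acb, int64_t offset,
                             unsigned int *nr_zones,
                             BlockZoneDescriptor *zones)
{
    assert(nr_zones != NULL && zones != NULL);
    acb->rwco.offset = offset;
    acb->rwco.iobuf = zones;
    acb->bytes = (int64_t)(uintptr_t)nr_zones;
}

/* Hypothetical helper: recover the nr_zones pointer from the bytes field. */
static unsigned int *zone_report_nr_zones(const BlkAioEmAIOCB *acb)
{
    return (unsigned int *)(uintptr_t)acb->bytes;
}

/* Hypothetical helper: stash the zone management op in iobuf. */
static void zone_mgmt_pack(BlkAioEmAIOCB *acb, int64_t offset,
                           unsigned long op)
{
    acb->rwco.offset = offset;
    acb->rwco.iobuf = (void *)(uintptr_t)op;
}

/* Hypothetical helper: recover the op from the iobuf field. */
static unsigned long zone_mgmt_op(const BlkAioEmAIOCB *acb)
{
    return (unsigned long)(uintptr_t)acb->rwco.iobuf;
}
```

The casts go through uintptr_t so the pointer/integer round trip is well
defined. As Kevin notes, the cost is readability: each request type has to
know what the shared field means.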

Sam




Thread overview: 17+ messages
2023-01-29 10:28 [PATCH v15 0/8] Add support for zoned device Sam Li
2023-01-29 10:28 ` [PATCH v15 1/8] include: add zoned device structs Sam Li
2023-01-29 10:28 ` [PATCH v15 2/8] file-posix: introduce helper functions for sysfs attributes Sam Li
2023-01-29 10:28 ` [PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
2023-02-06 12:04   ` Stefan Hajnoczi
2023-02-06 12:12     ` Sam Li
2023-02-27 18:20   ` Kevin Wolf
2023-02-27 19:14     ` Stefan Hajnoczi
2023-02-28 11:54       ` Kevin Wolf
2023-02-28 12:00         ` Sam Li
2023-01-29 10:28 ` [PATCH v15 4/8] raw-format: add zone operations to pass through requests Sam Li
2023-01-29 10:28 ` [PATCH v15 5/8] config: add check to block layer Sam Li
2023-01-29 10:28 ` [PATCH v15 6/8] qemu-iotests: test new zone operations Sam Li
2023-01-29 10:28 ` [PATCH v15 7/8] block: add some trace events for new block layer APIs Sam Li
2023-01-29 10:28 ` [PATCH v15 8/8] docs/zoned-storage: add zoned device documentation Sam Li
2023-02-06 12:16   ` Stefan Hajnoczi
2023-02-06 12:18     ` Sam Li
