All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v9 0/7] Add support for zoned device
@ 2022-09-10  5:27 Sam Li
  2022-09-10  5:27 ` [PATCH v9 1/7] include: add zoned device structs Sam Li
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

Zoned Block Devices (ZBDs) devide the LBA space to block regions called zones
that are larger than the LBA size. It can only allow sequential writes, which
reduces write amplification in SSD, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code.  In its
current status, the virtio-blk device is not aware of ZBDs but the guest sees
host-managed drives as regular drive that will runs correctly under the most
common write workloads.

This patch series extend the block layer APIs with the minimum set of zoned
commands that are necessary to support zoned devices. The commands are - Report
Zones, four zone operations and Zone Append (developing).

It can be tested on a null_blk device using qemu-io or qemu-iotests. For
example, to test zone report using qemu-io:
$ path/to/qemu-io --image-opts -n driver=zoned_host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

v9:
- address review comments
  * specify units of zone commands requests [Stefan]
  * fix some error handling in file-posix [Stefan]
  * introduce zoned_host_devcie in the commit message [Markus]

v8:
- address review comments
  * solve patch conflicts and merge sysfs helper funcations into one patch
  * add cache.direct=on check in config

v7:
- address review comments
  * modify sysfs attribute helper funcations
  * move the input validation and error checking into raw_co_zone_* function
  * fix checks in config

v6:
- drop virtio-blk emulation changes
- address Stefan's review comments
  * fix CONFIG_BLKZONED configs in related functions
  * replace reading fd by g_file_get_contents() in get_sysfs_str_val()
  * rewrite documentation for zoned storage

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZoneBlockDevice ioctls

Sam Li (7):
  include: add zoned device structs
  file-posix: introduce helper functions for sysfs attributes
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  qemu-iotests: test new zone operations
  docs/zoned-storage: add zoned device documentation

 block.c                                |  14 +
 block/block-backend.c                  | 145 ++++++++
 block/file-posix.c                     | 458 +++++++++++++++++++++++--
 block/io.c                             |  41 +++
 block/raw-format.c                     |  14 +
 docs/devel/zoned-storage.rst           |  41 +++
 docs/system/qemu-block-drivers.rst.inc |   6 +
 include/block/block-common.h           |  43 +++
 include/block/block-io.h               |   7 +
 include/block/block_int-common.h       |  29 ++
 include/block/raw-aio.h                |   6 +-
 include/sysemu/block-backend-io.h      |  17 +
 meson.build                            |   1 +
 qapi/block-core.json                   |   8 +-
 qemu-io-cmds.c                         | 143 ++++++++
 tests/qemu-iotests/tests/zoned.out     |  53 +++
 tests/qemu-iotests/tests/zoned.sh      |  85 +++++
 17 files changed, 1071 insertions(+), 40 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

-- 
2.37.3



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v9 1/7] include: add zoned device structs
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-15  8:05   ` Eric Blake
  2022-09-10  5:27 ` [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes Sam Li
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index fdb7306e78..36bd0e480e 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+    BLK_ZO_OPEN,
+    BLK_ZO_CLOSE,
+    BLK_ZO_FINISH,
+    BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+    BLK_Z_NONE = 0x0, /* Regular block device */
+    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneCondition {
+    BLK_ZS_NOT_WP = 0x0,
+    BLK_ZS_EMPTY = 0x1,
+    BLK_ZS_IOPEN = 0x2,
+    BLK_ZS_EOPEN = 0x3,
+    BLK_ZS_CLOSED = 0x4,
+    BLK_ZS_RDONLY = 0xD,
+    BLK_ZS_FULL = 0xE,
+    BLK_ZS_OFFLINE = 0xF,
+} BlockZoneCondition;
+
+typedef enum BlockZoneType {
+    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+    BLK_ZT_SWR = 0x2, /* Sequential writes required */
+    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+    uint64_t start;
+    uint64_t length;
+    uint64_t cap;
+    uint64_t wp;
+    BlockZoneType type;
+    BlockZoneCondition cond;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
     /* in bytes, 0 if irrelevant */
     int cluster_size;
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
  2022-09-10  5:27 ` [PATCH v9 1/7] include: add zoned device structs Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-11  4:56   ` Damien Le Moal
  2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

Use get_sysfs_str_val() to get the string value of device
zoned model. Then get_sysfs_zoned_model() can convert it to
BlockZoneModel type of QEMU.

Use get_sysfs_long_val() to get the long value of zoned device
information.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/file-posix.c               | 121 ++++++++++++++++++++++---------
 include/block/block_int-common.h |   3 +
 2 files changed, 88 insertions(+), 36 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..0a8b4b426e 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1210,66 +1210,109 @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get a sysfs attribute value as character string.
+ */
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+                             char **val) {
 #ifdef CONFIG_LINUX
-    char buf[32];
-    const char *end;
-    char *sysfspath = NULL;
+    g_autofree char *sysfspath = NULL;
     int ret;
-    int sysfd = -1;
-    long max_segments;
+    size_t len;
 
-    if (S_ISCHR(st->st_mode)) {
-        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
-            return ret;
-        }
+    if (!S_ISBLK(st->st_mode)) {
         return -ENOTSUP;
     }
 
-    if (!S_ISBLK(st->st_mode)) {
-        return -ENOTSUP;
+    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+                                major(st->st_rdev), minor(st->st_rdev),
+                                attribute);
+    ret = g_file_get_contents(sysfspath, val, &len, NULL);
+    if (ret == -1) {
+        return -ENOENT;
     }
 
-    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-                                major(st->st_rdev), minor(st->st_rdev));
-    sysfd = open(sysfspath, O_RDONLY);
-    if (sysfd == -1) {
-        ret = -errno;
-        goto out;
+    /* The file is ended with '\n' */
+    char *p;
+    p = *val;
+    if (*(p + len - 1) == '\n') {
+        *(p + len - 1) = '\0';
     }
-    do {
-        ret = read(sysfd, buf, sizeof(buf) - 1);
-    } while (ret == -1 && errno == EINTR);
+    return ret;
+#else
+    return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned) {
+    g_autofree char *val = NULL;
+    int ret;
+
+    ret = get_sysfs_str_val(st, "zoned", &val);
     if (ret < 0) {
-        ret = -errno;
-        goto out;
-    } else if (ret == 0) {
-        ret = -EIO;
-        goto out;
+        return ret;
     }
-    buf[ret] = 0;
-    /* The file is ended with '\n', pass 'end' to accept that. */
-    ret = qemu_strtol(buf, &end, 10, &max_segments);
-    if (ret == 0 && end && *end == '\n') {
-        ret = max_segments;
+
+    if (strcmp(val, "host-managed") == 0) {
+        *zoned = BLK_Z_HM;
+    } else if (strcmp(val, "host-aware") == 0) {
+        *zoned = BLK_Z_HA;
+    } else if (strcmp(val, "none") == 0) {
+        *zoned = BLK_Z_NONE;
+    } else {
+        return -ENOTSUP;
     }
+    return 0;
+}
 
-out:
-    if (sysfd != -1) {
-        close(sysfd);
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
+static long get_sysfs_long_val(struct stat *st, const char *attribute) {
+#ifdef CONFIG_LINUX
+    g_autofree char *str = NULL;
+    const char *end;
+    long val;
+    int ret;
+
+    ret = get_sysfs_str_val(st, attribute, &str);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* The file is ended with '\n', pass 'end' to accept that. */
+    ret = qemu_strtol(str, &end, 10, &val);
+    if (ret == 0 && end && *end == '\0') {
+        ret = val;
     }
-    g_free(sysfspath);
     return ret;
 #else
     return -ENOTSUP;
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st) {
+#ifdef CONFIG_LINUX
+    int ret;
+
+    if (S_ISCHR(st->st_mode)) {
+        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+            return ret;
+        }
+        return -ENOTSUP;
+    }
+    return get_sysfs_long_val(st, "max_segments");
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
     struct stat st;
+    int ret;
+    BlockZoneModel zoned;
 
     s->needs_alignment = raw_needs_alignment(bs);
     raw_probe_alignment(bs, s->fd, errp);
@@ -1307,6 +1350,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
             bs->bl.max_hw_iov = ret;
         }
     }
+
+    ret = get_sysfs_zoned_model(&st, &zoned);
+    if (ret < 0) {
+        zoned = BLK_Z_NONE;
+    }
+    bs->bl.zoned = zoned;
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..7f7863cc9e 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -825,6 +825,9 @@ typedef struct BlockLimits {
 
     /* maximum number of iovec elements */
     int max_iov;
+
+    /* device zone model */
+    BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
  2022-09-10  5:27 ` [PATCH v9 1/7] include: add zoned device structs Sam Li
  2022-09-10  5:27 ` [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-11  5:31   ` Damien Le Moal
                     ` (3 more replies)
  2022-09-10  5:27 ` [PATCH v9 4/7] raw-format: add zone operations to pass through requests Sam Li
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

Add a new zoned_host_device BlockDriver. The zoned_host_device option
accepts only zoned host block devices. By adding zone management
operations in this new BlockDriver, users can use the new block
layer APIs including Report Zone and four zone management operations
(open, close, finish, reset).

Qemu-io uses the new APIs to perform zoned storage commands of the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use following command:
$ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/block-backend.c             | 145 ++++++++++++++
 block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
 block/io.c                        |  41 ++++
 include/block/block-io.h          |   7 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h           |   6 +-
 include/sysemu/block-backend-io.h |  17 ++
 meson.build                       |   1 +
 qapi/block-core.json              |   8 +-
 qemu-io-cmds.c                    | 143 +++++++++++++
 10 files changed, 708 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d4a5df2ac2..ebe8d7bdf3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
     void *iobuf;
     int ret;
     BdrvRequestFlags flags;
+    union {
+        struct {
+            unsigned int *nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report;
+        struct {
+            BlockZoneOp op;
+        } zone_mgmt;
+    };
 } BlkRwCo;
 
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
@@ -1775,6 +1784,142 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
     return ret;
 }
 
+static void blk_aio_zone_report_entry(void *opaque) {
+    BlkAioEmAIOCB *acb = opaque;
+    BlkRwCo *rwco = &acb->rwco;
+
+    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+                                   rwco->zone_report.nr_zones,
+                                   rwco->zone_report.zones);
+    blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+                                unsigned int *nr_zones,
+                                BlockZoneDescriptor  *zones,
+                                BlockCompletionFunc *cb, void *opaque)
+{
+    BlkAioEmAIOCB *acb;
+    Coroutine *co;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+    acb->rwco = (BlkRwCo) {
+            .blk    = blk,
+            .offset = offset,
+            .ret    = NOT_DONE,
+            .zone_report = {
+                    .zones = zones,
+                    .nr_zones = nr_zones,
+            },
+    };
+    acb->has_returned = false;
+
+    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+    bdrv_coroutine_enter(blk_bs(blk), co);
+
+    acb->has_returned = true;
+    if (acb->rwco.ret != NOT_DONE) {
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
+    }
+
+    return &acb->common;
+}
+
+static void blk_aio_zone_mgmt_entry(void *opaque) {
+    BlkAioEmAIOCB *acb = opaque;
+    BlkRwCo *rwco = &acb->rwco;
+
+    rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
+                                 rwco->offset, acb->bytes);
+    blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                              int64_t offset, int64_t len,
+                              BlockCompletionFunc *cb, void *opaque) {
+    BlkAioEmAIOCB *acb;
+    Coroutine *co;
+    IO_CODE();
+
+    blk_inc_in_flight(blk);
+    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+    acb->rwco = (BlkRwCo) {
+            .blk    = blk,
+            .offset = offset,
+            .ret    = NOT_DONE,
+            .zone_mgmt = {
+                    .op = op,
+            },
+    };
+    acb->bytes = len;
+    acb->has_returned = false;
+
+    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+    bdrv_coroutine_enter(blk_bs(blk), co);
+
+    acb->has_returned = true;
+    if (acb->rwco.ret != NOT_DONE) {
+        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+                                         blk_aio_complete_bh, acb);
+    }
+
+    return &acb->common;
+}
+
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+                                    unsigned int *nr_zones,
+                                    BlockZoneDescriptor *zones)
+{
+    int ret;
+    IO_CODE();
+
+    blk_inc_in_flight(blk); /* increase before waiting */
+    blk_wait_while_drained(blk);
+    if (!blk_is_available(blk)) {
+        blk_dec_in_flight(blk);
+        return -ENOMEDIUM;
+    }
+    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+    blk_dec_in_flight(blk);
+    return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * op is the zone operation;
+ * offset is the byte offset from the start of the zoned device;
+ * len is the maximum number of bytes the command should operate on. It
+ * should be aligned with the zone sector size.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+        int64_t offset, int64_t len)
+{
+    int ret;
+    IO_CODE();
+
+
+    blk_inc_in_flight(blk);
+    blk_wait_while_drained(blk);
+
+    ret = blk_check_byte_request(blk, offset, len);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
+    blk_dec_in_flight(blk);
+    return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
     BlockDriverState *bs = blk_bs(blk);
diff --git a/block/file-posix.c b/block/file-posix.c
index 0a8b4b426e..4edfa25d04 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -67,6 +67,9 @@
 #include <sys/param.h>
 #include <sys/syscall.h>
 #include <sys/vfs.h>
+#if defined(CONFIG_BLKZONED)
+#include <linux/blkzoned.h>
+#endif
 #include <linux/cdrom.h>
 #include <linux/fd.h>
 #include <linux/fs.h>
@@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
             PreallocMode prealloc;
             Error **errp;
         } truncate;
+        struct {
+            unsigned int *nr_zones;
+            BlockZoneDescriptor *zones;
+        } zone_report;
+        struct {
+            unsigned long zone_op;
+            const char *zone_op_name;
+            bool all;
+        } zone_mgmt;
     };
 } RawPosixAIOData;
 
@@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 #endif
 
     if (bs->sg || S_ISBLK(st.st_mode)) {
-        int ret = hdev_get_max_hw_transfer(s->fd, &st);
+        ret = hdev_get_max_hw_transfer(s->fd, &st);
 
         if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
             bs->bl.max_hw_transfer = ret;
@@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
         zoned = BLK_Z_NONE;
     }
     bs->bl.zoned = zoned;
+    if (zoned != BLK_Z_NONE) {
+        ret = get_sysfs_long_val(&st, "chunk_sectors");
+        if (ret > 0) {
+            bs->bl.zone_sectors = ret;
+        }
+
+        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
+        if (ret > 0) {
+            bs->bl.max_append_sectors = ret / 512;
+        }
+
+        ret = get_sysfs_long_val(&st, "max_open_zones");
+        if (ret >= 0) {
+            bs->bl.max_open_zones = ret;
+        }
+
+        ret = get_sysfs_long_val(&st, "max_active_zones");
+        if (ret >= 0) {
+            bs->bl.max_active_zones = ret;
+        }
+    }
 }
 
 static int check_for_dasd(int fd)
@@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
 }
 #endif
 
+/*
+ * parse_zone - Fill a zone descriptor
+ */
+#if defined(CONFIG_BLKZONED)
+static inline void parse_zone(struct BlockZoneDescriptor *zone,
+                              const struct blk_zone *blkz) {
+    zone->start = blkz->start;
+    zone->length = blkz->len;
+    zone->cap = blkz->capacity;
+    zone->wp = blkz->wp;
+
+    switch (blkz->type) {
+    case BLK_ZONE_TYPE_SEQWRITE_REQ:
+        zone->type = BLK_ZT_SWR;
+        break;
+    case BLK_ZONE_TYPE_SEQWRITE_PREF:
+        zone->type = BLK_ZT_SWP;
+        break;
+    case BLK_ZONE_TYPE_CONVENTIONAL:
+        zone->type = BLK_ZT_CONV;
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    switch (blkz->cond) {
+    case BLK_ZONE_COND_NOT_WP:
+        zone->cond = BLK_ZS_NOT_WP;
+        break;
+    case BLK_ZONE_COND_EMPTY:
+        zone->cond = BLK_ZS_EMPTY;
+        break;
+    case BLK_ZONE_COND_IMP_OPEN:
+        zone->cond =BLK_ZS_IOPEN;
+        break;
+    case BLK_ZONE_COND_EXP_OPEN:
+        zone->cond = BLK_ZS_EOPEN;
+        break;
+    case BLK_ZONE_COND_CLOSED:
+        zone->cond = BLK_ZS_CLOSED;
+        break;
+    case BLK_ZONE_COND_READONLY:
+        zone->cond = BLK_ZS_RDONLY;
+        break;
+    case BLK_ZONE_COND_FULL:
+        zone->cond = BLK_ZS_FULL;
+        break;
+    case BLK_ZONE_COND_OFFLINE:
+        zone->cond = BLK_ZS_OFFLINE;
+        break;
+    default:
+        g_assert_not_reached();
+    }
+}
+#endif
+
+#if defined(CONFIG_BLKZONED)
+static int do_zone_report(int64_t sector, int fd,
+                          struct BlockZoneDescriptor *zones,
+                          unsigned int nrz) {
+    struct blk_zone *blkz;
+    int ret, n = 0, i = 0;
+
+    int64_t rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+    g_autofree struct blk_zone_report *rep = NULL;
+    rep = g_malloc(rep_size);
+
+    blkz = (struct blk_zone *)(rep + 1);
+    while (n < nrz) {
+        memset(rep, 0, rep_size);
+        rep->sector = sector;
+        rep->nr_zones = nrz - n;
+
+        do {
+            ret = ioctl(fd, BLKREPORTZONE, rep);
+        } while (ret != 0 && errno == EINTR);
+        if (ret != 0) {
+            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+                    fd, sector, errno);
+            return -errno;
+        }
+
+        if (!rep->nr_zones) {
+            break;
+        }
+
+        for (i = 0; i < rep->nr_zones; i++, n++) {
+            parse_zone(&zones[n], &blkz[i]);
+            /* The next report should start after the last zone reported */
+            sector = blkz[i].start + blkz[i].len;
+        }
+    }
+    return n;
+}
+#endif
+
+static int handle_aiocb_zone_report(void *opaque) {
+#if defined(CONFIG_BLKZONED)
+    RawPosixAIOData *aiocb = opaque;
+    int fd = aiocb->aio_fildes;
+    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
+    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
+    /* zoned block devices use 512-byte sectors */
+    int64_t sector = aiocb->aio_offset / 512;
+
+    *nr_zones = do_zone_report(sector, fd, zones, *nr_zones);
+    return 0;
+#else
+    return -ENOTSUP;
+#endif
+}
+
+static int handle_aiocb_zone_mgmt(void *opaque) {
+#if defined(CONFIG_BLKZONED)
+    RawPosixAIOData *aiocb = opaque;
+    int fd = aiocb->aio_fildes;
+    int64_t sector = aiocb->aio_offset / 512;
+    int64_t nr_sectors = aiocb->aio_nbytes / 512;
+    struct blk_zone_range range;
+    int ret;
+
+    /* Execute the operation */
+    range.sector = sector;
+    range.nr_sectors = nr_sectors;
+    do {
+        ret = ioctl(fd, aiocb->zone_mgmt.zone_op, &range);
+    } while (ret != 0 && errno == EINTR);
+
+    if (ret != 0) {
+        error_report("ioctl %s failed %d", aiocb->zone_mgmt.zone_op_name,
+                     errno);
+        return -errno;
+    }
+    return ret;
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static int handle_aiocb_copy_range(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
@@ -3022,6 +3194,104 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
     }
 }
 
+/*
+ * zone report - Get a zone block device's information in the form
+ * of an array of zone descriptors.
+ * zones is an array of zone descriptors to hold zone information on reply;
+ * offset can be any byte within the entire size of the device;
+ * nr_zones is the maxium number of sectors the command should operate on.
+ */
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                           unsigned int *nr_zones,
+                                           BlockZoneDescriptor *zones) {
+#if defined(CONFIG_BLKZONED)
+    BDRVRawState *s = bs->opaque;
+    RawPosixAIOData acb;
+
+    acb = (RawPosixAIOData) {
+        .bs         = bs,
+        .aio_fildes = s->fd,
+        .aio_type   = QEMU_AIO_ZONE_REPORT,
+        .aio_offset = offset,
+        .zone_report    = {
+                .nr_zones       = nr_zones,
+                .zones          = zones,
+        },
+    };
+
+    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
+#else
+    return -ENOTSUP;
+#endif
+}
+
+/*
+ * zone management operations - Execute an operation on a zone
+ */
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+        int64_t offset, int64_t len) {
+#if defined(CONFIG_BLKZONED)
+    BDRVRawState *s = bs->opaque;
+    RawPosixAIOData acb;
+    int64_t zone_sector, zone_sector_mask;
+    const char *zone_op_name;
+    unsigned long zone_op;
+    bool is_all = false;
+
+    zone_sector = bs->bl.zone_sectors;
+    zone_sector_mask = zone_sector - 1;
+    if (offset & zone_sector_mask) {
+        error_report("sector offset %" PRId64 " is not aligned to zone size "
+                     "%" PRId64 "", offset, zone_sector);
+        return -EINVAL;
+    }
+
+    if (len & zone_sector_mask) {
+        error_report("number of sectors %" PRId64 " is not aligned to zone size"
+                      " %" PRId64 "", len, zone_sector);
+        return -EINVAL;
+    }
+
+    switch (op) {
+    case BLK_ZO_OPEN:
+        zone_op_name = "BLKOPENZONE";
+        zone_op = BLKOPENZONE;
+        break;
+    case BLK_ZO_CLOSE:
+        zone_op_name = "BLKCLOSEZONE";
+        zone_op = BLKCLOSEZONE;
+        break;
+    case BLK_ZO_FINISH:
+        zone_op_name = "BLKFINISHZONE";
+        zone_op = BLKFINISHZONE;
+        break;
+    case BLK_ZO_RESET:
+        zone_op_name = "BLKRESETZONE";
+        zone_op = BLKRESETZONE;
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    acb = (RawPosixAIOData) {
+        .bs             = bs,
+        .aio_fildes     = s->fd,
+        .aio_type       = QEMU_AIO_ZONE_MGMT,
+        .aio_offset     = offset,
+        .aio_nbytes     = len,
+        .zone_mgmt  = {
+                .zone_op = zone_op,
+                .zone_op_name = zone_op_name,
+                .all = is_all,
+        },
+    };
+
+    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
                 bool blkdev)
@@ -3752,6 +4022,54 @@ static BlockDriver bdrv_host_device = {
 #endif
 };
 
+#if defined(CONFIG_BLKZONED)
+static BlockDriver bdrv_zoned_host_device = {
+    .format_name = "zoned_host_device",
+    .protocol_name = "zoned_host_device",
+    .instance_size = sizeof(BDRVRawState),
+    .bdrv_needs_filename = true,
+    .bdrv_probe_device  = hdev_probe_device,
+    .bdrv_file_open     = hdev_open,
+    .bdrv_close         = raw_close,
+    .bdrv_reopen_prepare = raw_reopen_prepare,
+    .bdrv_reopen_commit  = raw_reopen_commit,
+    .bdrv_reopen_abort   = raw_reopen_abort,
+    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
+    .create_opts         = &bdrv_create_opts_simple,
+    .mutable_opts        = mutable_opts,
+    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
+    .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
+
+    .bdrv_co_preadv         = raw_co_preadv,
+    .bdrv_co_pwritev        = raw_co_pwritev,
+    .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
+    .bdrv_co_pdiscard       = hdev_co_pdiscard,
+    .bdrv_co_copy_range_from = raw_co_copy_range_from,
+    .bdrv_co_copy_range_to  = raw_co_copy_range_to,
+    .bdrv_refresh_limits = raw_refresh_limits,
+    .bdrv_io_plug = raw_aio_plug,
+    .bdrv_io_unplug = raw_aio_unplug,
+    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
+
+    .bdrv_co_truncate       = raw_co_truncate,
+    .bdrv_getlength = raw_getlength,
+    .bdrv_get_info = raw_get_info,
+    .bdrv_get_allocated_file_size
+                        = raw_get_allocated_file_size,
+    .bdrv_get_specific_stats = hdev_get_specific_stats,
+    .bdrv_check_perm = raw_check_perm,
+    .bdrv_set_perm   = raw_set_perm,
+    .bdrv_abort_perm_update = raw_abort_perm_update,
+    .bdrv_probe_blocksizes = hdev_probe_blocksizes,
+    .bdrv_probe_geometry = hdev_probe_geometry,
+    .bdrv_co_ioctl = hdev_co_ioctl,
+
+    /* zone management operations */
+    .bdrv_co_zone_report = raw_co_zone_report,
+    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
+};
+#endif
+
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
 static void cdrom_parse_filename(const char *filename, QDict *options,
                                  Error **errp)
@@ -4012,6 +4330,9 @@ static void bdrv_file_init(void)
     bdrv_register(&bdrv_file);
 #if defined(HAVE_HOST_BLOCK_DEVICE)
     bdrv_register(&bdrv_host_device);
+#if defined(CONFIG_BLKZONED)
+    bdrv_register(&bdrv_zoned_host_device);
+#endif
 #ifdef __linux__
     bdrv_register(&bdrv_host_cdrom);
 #endif
diff --git a/block/io.c b/block/io.c
index 0a8cbefe86..de9ec1d740 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3198,6 +3198,47 @@ out:
     return co.ret;
 }
 
+int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
+                        unsigned int *nr_zones,
+                        BlockZoneDescriptor *zones)
+{
+    BlockDriver *drv = bs->drv;
+    CoroutineIOCompletion co = {
+            .coroutine = qemu_coroutine_self(),
+    };
+    IO_CODE();
+
+    bdrv_inc_in_flight(bs);
+    if (!drv || !drv->bdrv_co_zone_report) {
+        co.ret = -ENOTSUP;
+        goto out;
+    }
+    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
+out:
+    bdrv_dec_in_flight(bs);
+    return co.ret;
+}
+
+int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+        int64_t offset, int64_t len)
+{
+    BlockDriver *drv = bs->drv;
+    CoroutineIOCompletion co = {
+            .coroutine = qemu_coroutine_self(),
+    };
+    IO_CODE();
+
+    bdrv_inc_in_flight(bs);
+    if (!drv || !drv->bdrv_co_zone_mgmt) {
+        co.ret = -ENOTSUP;
+        goto out;
+    }
+    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
+out:
+    bdrv_dec_in_flight(bs);
+    return co.ret;
+}
+
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
     IO_CODE();
diff --git a/include/block/block-io.h b/include/block/block-io.h
index fd25ffa9be..65463b88d9 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -88,6 +88,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
 /* Ensure contents are flushed to disk.  */
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
 
+/* Report zone information of zone block device. */
+int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                     unsigned int *nr_zones,
+                                     BlockZoneDescriptor *zones);
+int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+                                   int64_t offset, int64_t len);
+
 int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 7f7863cc9e..078ddd7e67 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -691,6 +691,12 @@ struct BlockDriver {
                                           QEMUIOVector *qiov,
                                           int64_t pos);
 
+    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
+            int64_t offset, unsigned int *nr_zones,
+            BlockZoneDescriptor *zones);
+    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
+            int64_t offset, int64_t len);
+
     /* removable device specific */
     bool (*bdrv_is_inserted)(BlockDriverState *bs);
     void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
@@ -828,6 +834,21 @@ typedef struct BlockLimits {
 
     /* device zone model */
     BlockZoneModel zoned;
+
+    /* zone size expressed in 512-byte sectors */
+    uint32_t zone_sectors;
+
+    /* total number of zones */
+    unsigned int nr_zones;
+
+    /* maximum sectors of a zone append write operation */
+    int64_t max_append_sectors;
+
+    /* maximum number of open zones */
+    int64_t max_open_zones;
+
+    /* maximum number of active zones */
+    int64_t max_active_zones;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 21fc10c4c9..3d26929cdd 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -29,6 +29,8 @@
 #define QEMU_AIO_WRITE_ZEROES 0x0020
 #define QEMU_AIO_COPY_RANGE   0x0040
 #define QEMU_AIO_TRUNCATE     0x0080
+#define QEMU_AIO_ZONE_REPORT  0x0100
+#define QEMU_AIO_ZONE_MGMT    0x0200
 #define QEMU_AIO_TYPE_MASK \
         (QEMU_AIO_READ | \
          QEMU_AIO_WRITE | \
@@ -37,7 +39,9 @@
          QEMU_AIO_DISCARD | \
          QEMU_AIO_WRITE_ZEROES | \
          QEMU_AIO_COPY_RANGE | \
-         QEMU_AIO_TRUNCATE)
+         QEMU_AIO_TRUNCATE  | \
+         QEMU_AIO_ZONE_REPORT | \
+         QEMU_AIO_ZONE_MGMT)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
index 50f5aa2e07..6835525582 100644
--- a/include/sysemu/block-backend-io.h
+++ b/include/sysemu/block-backend-io.h
@@ -45,6 +45,12 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
                             BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_flush(BlockBackend *blk,
                           BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+                                unsigned int *nr_zones, BlockZoneDescriptor *zones,
+                                BlockCompletionFunc *cb, void *opaque);
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                              int64_t offset, int64_t len,
+                              BlockCompletionFunc *cb, void *opaque);
 BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
                              BlockCompletionFunc *cb, void *opaque);
 void blk_aio_cancel_async(BlockAIOCB *acb);
@@ -156,6 +162,17 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
 int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
                                       int64_t bytes, BdrvRequestFlags flags);
 
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+                                    unsigned int *nr_zones,
+                                    BlockZoneDescriptor *zones);
+int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
+                                         unsigned int *nr_zones,
+                                         BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                  int64_t offset, int64_t len);
+int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+                                       int64_t offset, int64_t len);
+
 int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset,
                                       int64_t bytes);
 int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
diff --git a/meson.build b/meson.build
index 20fddbd707..2f436bb355 100644
--- a/meson.build
+++ b/meson.build
@@ -1883,6 +1883,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('live_block_migration').al
 # has_header
 config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
 config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
+config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
 config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
 config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
 config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2173e7734a..c6bbb7a037 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2942,6 +2942,7 @@
 # @compress: Since 5.0
 # @copy-before-write: Since 6.2
 # @snapshot-access: Since 7.0
+# @zoned_host_device: Since 7.2
 #
 # Since: 2.9
 ##
@@ -2955,7 +2956,8 @@
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
             'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
-            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat',
+            { 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] }
 
 ##
 # @BlockdevOptionsFile:
@@ -4329,7 +4331,9 @@
       'vhdx':       'BlockdevOptionsGenericFormat',
       'vmdk':       'BlockdevOptionsGenericCOWFormat',
       'vpc':        'BlockdevOptionsGenericFormat',
-      'vvfat':      'BlockdevOptionsVVFAT'
+      'vvfat':      'BlockdevOptionsVVFAT',
+      'zoned_host_device': { 'type': 'BlockdevOptionsFile',
+                             'if': 'CONFIG_BLKZONED' }
   } }
 
 ##
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 952dc940f1..446a059603 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = {
     .oneline    = "flush all in-core file state to disk",
 };
 
+static int zone_report_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset;
+    unsigned int nr_zones;
+
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    nr_zones = cvtnum(argv[optind]);
+
+    g_autofree BlockZoneDescriptor *zones = NULL;
+    zones = g_new(BlockZoneDescriptor, nr_zones);
+    ret = blk_zone_report(blk, offset, &nr_zones, zones);
+    if (ret < 0) {
+        printf("zone report failed: %s\n", strerror(-ret));
+    } else {
+        for (int i = 0; i < nr_zones; ++i) {
+            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
+                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
+                   "zcond:%u, [type: %u]\n",
+                   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
+                   zones[i].cond, zones[i].type);
+        }
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_report_cmd = {
+        .name = "zone_report",
+        .altname = "zrp",
+        .cfunc = zone_report_f,
+        .argmin = 2,
+        .argmax = 2,
+        .args = "offset number",
+        .oneline = "report zone information",
+};
+
+static int zone_open_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
+    if (ret < 0) {
+        printf("zone open failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_open_cmd = {
+        .name = "zone_open",
+        .altname = "zo",
+        .cfunc = zone_open_f,
+        .argmin = 2,
+        .argmax = 2,
+        .args = "offset len",
+        .oneline = "explicit open a range of zones in zone block device",
+};
+
+static int zone_close_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
+    if (ret < 0) {
+        printf("zone close failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_close_cmd = {
+        .name = "zone_close",
+        .altname = "zc",
+        .cfunc = zone_close_f,
+        .argmin = 2,
+        .argmax = 2,
+        .args = "offset len",
+        .oneline = "close a range of zones in zone block device",
+};
+
+static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
+    if (ret < 0) {
+        printf("zone finish failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_finish_cmd = {
+        .name = "zone_finish",
+        .altname = "zf",
+        .cfunc = zone_finish_f,
+        .argmin = 2,
+        .argmax = 2,
+        .args = "offset len",
+        .oneline = "finish a range of zones in zone block device",
+};
+
+static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
+{
+    int ret;
+    int64_t offset, len;
+    ++optind;
+    offset = cvtnum(argv[optind]);
+    ++optind;
+    len = cvtnum(argv[optind]);
+    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
+    if (ret < 0) {
+        printf("zone reset failed: %s\n", strerror(-ret));
+    }
+    return ret;
+}
+
+static const cmdinfo_t zone_reset_cmd = {
+        .name = "zone_reset",
+        .altname = "zrs",
+        .cfunc = zone_reset_f,
+        .argmin = 2,
+        .argmax = 2,
+        .args = "offset len",
+        .oneline = "reset a zone write pointer in zone block device",
+};
+
 static int truncate_f(BlockBackend *blk, int argc, char **argv);
 static const cmdinfo_t truncate_cmd = {
     .name       = "truncate",
@@ -2504,6 +2642,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
     qemuio_add_command(&aio_write_cmd);
     qemuio_add_command(&aio_flush_cmd);
     qemuio_add_command(&flush_cmd);
+    qemuio_add_command(&zone_report_cmd);
+    qemuio_add_command(&zone_open_cmd);
+    qemuio_add_command(&zone_close_cmd);
+    qemuio_add_command(&zone_finish_cmd);
+    qemuio_add_command(&zone_reset_cmd);
     qemuio_add_command(&truncate_cmd);
     qemuio_add_command(&length_cmd);
     qemuio_add_command(&info_cmd);
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 4/7] raw-format: add zone operations to pass through requests
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
                   ` (2 preceding siblings ...)
  2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-11  5:32   ` Damien Le Moal
  2022-09-10  5:27 ` [PATCH v9 5/7] config: add check to block layer Sam Li
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

raw-format driver usually sits on top of file-posix driver. It needs to
pass through requests of zone commands.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-format.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 69fd650eaf..6b20bd22ef 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState *bs,
     return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+                                           unsigned int *nr_zones,
+                                           BlockZoneDescriptor *zones) {
+    return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+                                         int64_t offset, int64_t len) {
+    return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
     int64_t len;
@@ -614,6 +625,8 @@ BlockDriver bdrv_raw = {
     .bdrv_co_pwritev      = &raw_co_pwritev,
     .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
     .bdrv_co_pdiscard     = &raw_co_pdiscard,
+    .bdrv_co_zone_report  = &raw_co_zone_report,
+    .bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
     .bdrv_co_block_status = &raw_co_block_status,
     .bdrv_co_copy_range_from = &raw_co_copy_range_from,
     .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 5/7] config: add check to block layer
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
                   ` (3 preceding siblings ...)
  2022-09-10  5:27 ` [PATCH v9 4/7] raw-format: add zone operations to pass through requests Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-11  5:34   ` Damien Le Moal
  2022-09-16 15:22   ` Stefan Hajnoczi
  2022-09-10  5:27 ` [PATCH v9 6/7] qemu-iotests: test new zone operations Sam Li
  2022-09-10  5:27 ` [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation Sam Li
  6 siblings, 2 replies; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c                          | 14 ++++++++++++++
 block/file-posix.c               | 14 ++++++++++++++
 block/raw-format.c               |  1 +
 include/block/block_int-common.h |  5 +++++
 4 files changed, 34 insertions(+)

diff --git a/block.c b/block.c
index bc85f46eed..dad2ed3959 100644
--- a/block.c
+++ b/block.c
@@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
         return;
     }
 
+    /*
+     * Non-zoned block drivers do not follow zoned storage constraints
+     * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+     * drivers in a graph.
+     */
+    if (!parent_bs->drv->supports_zoned_children &&
+        child_bs->bl.zoned == BLK_Z_HM) {
+        error_setg(errp, "Cannot add a %s child to a %s parent",
+                   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+                   parent_bs->drv->supports_zoned_children ?
+                   "support zoned children" : "not support zoned children");
+        return;
+    }
+
     if (!QLIST_EMPTY(&child_bs->parents)) {
         error_setg(errp, "The node %s already has a parent",
                    child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 4edfa25d04..354de22860 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
             goto fail;
         }
     }
+#ifdef CONFIG_BLKZONED
+    /*
+     * The kernel page chache does not reliably work for writes to SWR zones
+     * of zoned block device because it can not guarantee the order of writes.
+     */
+    if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
+        if (!(s->open_flags & O_DIRECT)) {
+            error_setg(errp, "driver=zoned_host_device was specified, but it "
+                             "requires cache.direct=on, which was not specified.");
+            ret = -EINVAL;
+            return ret; /* No host kernel page cache */
+        }
+    }
+#endif
 
     if (S_ISBLK(st.st_mode)) {
 #ifdef BLKDISCARDZEROES
diff --git a/block/raw-format.c b/block/raw-format.c
index 6b20bd22ef..9441536819 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
 BlockDriver bdrv_raw = {
     .format_name          = "raw",
     .instance_size        = sizeof(BDRVRawState),
+    .supports_zoned_children = true,
     .bdrv_probe           = &raw_probe,
     .bdrv_reopen_prepare  = &raw_reopen_prepare,
     .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 078ddd7e67..043aa161a0 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -127,6 +127,11 @@ struct BlockDriver {
      */
     bool is_format;
 
+    /*
+     * Set to true if the BlockDriver supports zoned children.
+     */
+    bool supports_zoned_children;
+
     /*
      * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
      * this field set to true, except ones that are defined only by their
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 6/7] qemu-iotests: test new zone operations
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
                   ` (4 preceding siblings ...)
  2022-09-10  5:27 ` [PATCH v9 5/7] config: add check to block layer Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-10  5:27 ` [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation Sam Li
  6 siblings, 0 replies; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

We have added new block layer APIs of zoned block devices. Test it with:
Create a null_blk device, run each zone operation on it and see
whether reporting right zone information.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/qemu-iotests/tests/zoned.out | 53 +++++++++++++++++++
 tests/qemu-iotests/tests/zoned.sh  | 85 ++++++++++++++++++++++++++++++
 2 files changed, 138 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 0000000000..0c8f96deb9
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+Simple cases: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+start: 0x100000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:1, [type: 2]
+start: 0x180000, len 0x80000, cap 0x80000, wptr 0x180000, zcond:1, [type: 2]
+start: 0x200000, len 0x80000, cap 0x80000, wptr 0x200000, zcond:1, [type: 2]
+start: 0x280000, len 0x80000, cap 0x80000, wptr 0x280000, zcond:1, [type: 2]
+start: 0x300000, len 0x80000, cap 0x80000, wptr 0x300000, zcond:1, [type: 2]
+start: 0x380000, len 0x80000, cap 0x80000, wptr 0x380000, zcond:1, [type: 2]
+start: 0x400000, len 0x80000, cap 0x80000, wptr 0x400000, zcond:1, [type: 2]
+start: 0x480000, len 0x80000, cap 0x80000, wptr 0x480000, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:3, [type: 2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 0000000000..871f47efed
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts -n driver=zoned_host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2" # 0x3e70000000 / 512 = 0x1f380000
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 268435456 268435456" #
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x3e70000000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x3e70000000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(4) finishing the second zone"
+sudo $QEMU_IO $IMG -c "zf 268435456 268435456"
+echo "After finishing a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(5) resetting the second zone"
+sudo $QEMU_IO $IMG -c "zrs 268435456 268435456"
+echo "After resetting a zone:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation
  2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
                   ` (5 preceding siblings ...)
  2022-09-10  5:27 ` [PATCH v9 6/7] qemu-iotests: test new zone operations Sam Li
@ 2022-09-10  5:27 ` Sam Li
  2022-09-11  5:38   ` Damien Le Moal
  6 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-10  5:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz, Sam Li

Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 docs/devel/zoned-storage.rst           | 41 ++++++++++++++++++++++++++
 docs/system/qemu-block-drivers.rst.inc |  6 ++++
 2 files changed, 47 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 0000000000..ead2d149cc
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,41 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) devide the LBA space into block regions called zones
+that are larger than the LBA size. It can only allow sequential writes, which
+reduces write amplification in SSDs, leading to higher throughput and increased
+capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-------------------------------------
+QEMU block layer has three zoned storage model:
+- BLK_Z_HM: This model only allows sequential writes access. It supports a set
+of ZBD-specific I/O request that used by the host to manage device zones.
+- BLK_Z_HA: It deals with both sequential writes and random writes access.
+- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as
+non-zoned devices.
+
+The block device information resides inside BlockDriverState. QEMU uses
+BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph(for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--------------------------------------
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, the command line for zone report testing a null_blk device of
+qemu-io-cmds.c is:
+$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
+"zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..0b97227fd9 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated storage
+  controller supports zoned storage. Use ``--blockdev zoned_host_device,
+  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
+  as ``drive0``.
+
 Windows
 ^^^^^^^
 
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes
  2022-09-10  5:27 ` [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes Sam Li
@ 2022-09-11  4:56   ` Damien Le Moal
  0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  4:56 UTC (permalink / raw)
  To: Sam Li, qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare, Hanna Reitz

On 2022/09/10 14:27, Sam Li wrote:
> Use get_sysfs_str_val() to get the string value of device
> zoned model. Then get_sysfs_zoned_model() can convert it to
> BlockZoneModel type of QEMU.
> 
> Use get_sysfs_long_val() to get the long value of zoned device
> information.
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Looks good to me.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  block/file-posix.c               | 121 ++++++++++++++++++++++---------
>  include/block/block_int-common.h |   3 +
>  2 files changed, 88 insertions(+), 36 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 48cd096624..0a8b4b426e 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1210,66 +1210,109 @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
>  #endif
>  }
>  
> -static int hdev_get_max_segments(int fd, struct stat *st)
> -{
> +/*
> + * Get a sysfs attribute value as character string.
> + */
> +static int get_sysfs_str_val(struct stat *st, const char *attribute,
> +                             char **val) {
>  #ifdef CONFIG_LINUX
> -    char buf[32];
> -    const char *end;
> -    char *sysfspath = NULL;
> +    g_autofree char *sysfspath = NULL;
>      int ret;
> -    int sysfd = -1;
> -    long max_segments;
> +    size_t len;
>  
> -    if (S_ISCHR(st->st_mode)) {
> -        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
> -            return ret;
> -        }
> +    if (!S_ISBLK(st->st_mode)) {
>          return -ENOTSUP;
>      }
>  
> -    if (!S_ISBLK(st->st_mode)) {
> -        return -ENOTSUP;
> +    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> +                                major(st->st_rdev), minor(st->st_rdev),
> +                                attribute);
> +    ret = g_file_get_contents(sysfspath, val, &len, NULL);
> +    if (ret == -1) {
> +        return -ENOENT;
>      }
>  
> -    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> -                                major(st->st_rdev), minor(st->st_rdev));
> -    sysfd = open(sysfspath, O_RDONLY);
> -    if (sysfd == -1) {
> -        ret = -errno;
> -        goto out;
> +    /* The file is ended with '\n' */
> +    char *p;
> +    p = *val;
> +    if (*(p + len - 1) == '\n') {
> +        *(p + len - 1) = '\0';
>      }
> -    do {
> -        ret = read(sysfd, buf, sizeof(buf) - 1);
> -    } while (ret == -1 && errno == EINTR);
> +    return ret;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned) {
> +    g_autofree char *val = NULL;
> +    int ret;
> +
> +    ret = get_sysfs_str_val(st, "zoned", &val);
>      if (ret < 0) {
> -        ret = -errno;
> -        goto out;
> -    } else if (ret == 0) {
> -        ret = -EIO;
> -        goto out;
> +        return ret;
>      }
> -    buf[ret] = 0;
> -    /* The file is ended with '\n', pass 'end' to accept that. */
> -    ret = qemu_strtol(buf, &end, 10, &max_segments);
> -    if (ret == 0 && end && *end == '\n') {
> -        ret = max_segments;
> +
> +    if (strcmp(val, "host-managed") == 0) {
> +        *zoned = BLK_Z_HM;
> +    } else if (strcmp(val, "host-aware") == 0) {
> +        *zoned = BLK_Z_HA;
> +    } else if (strcmp(val, "none") == 0) {
> +        *zoned = BLK_Z_NONE;
> +    } else {
> +        return -ENOTSUP;
>      }
> +    return 0;
> +}
>  
> -out:
> -    if (sysfd != -1) {
> -        close(sysfd);
> +/*
> + * Get a sysfs attribute value as a long integer.
> + */
> +static long get_sysfs_long_val(struct stat *st, const char *attribute) {
> +#ifdef CONFIG_LINUX
> +    g_autofree char *str = NULL;
> +    const char *end;
> +    long val;
> +    int ret;
> +
> +    ret = get_sysfs_str_val(st, attribute, &str);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /* The file is ended with '\n', pass 'end' to accept that. */
> +    ret = qemu_strtol(str, &end, 10, &val);
> +    if (ret == 0 && end && *end == '\0') {
> +        ret = val;
>      }
> -    g_free(sysfspath);
>      return ret;
>  #else
>      return -ENOTSUP;
>  #endif
>  }
>  
> +static int hdev_get_max_segments(int fd, struct stat *st) {
> +#ifdef CONFIG_LINUX
> +    int ret;
> +
> +    if (S_ISCHR(st->st_mode)) {
> +        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
> +            return ret;
> +        }
> +        return -ENOTSUP;
> +    }
> +    return get_sysfs_long_val(st, "max_segments");
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>      BDRVRawState *s = bs->opaque;
>      struct stat st;
> +    int ret;
> +    BlockZoneModel zoned;
>  
>      s->needs_alignment = raw_needs_alignment(bs);
>      raw_probe_alignment(bs, s->fd, errp);
> @@ -1307,6 +1350,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>              bs->bl.max_hw_iov = ret;
>          }
>      }
> +
> +    ret = get_sysfs_zoned_model(&st, &zoned);
> +    if (ret < 0) {
> +        zoned = BLK_Z_NONE;
> +    }
> +    bs->bl.zoned = zoned;
>  }
>  
>  static int check_for_dasd(int fd)
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 8947abab76..7f7863cc9e 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -825,6 +825,9 @@ typedef struct BlockLimits {
>  
>      /* maximum number of iovec elements */
>      int max_iov;
> +
> +    /* device zone model */
> +    BlockZoneModel zoned;
>  } BlockLimits;
>  
>  typedef struct BdrvOpBlocker BdrvOpBlocker;

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
@ 2022-09-11  5:31   ` Damien Le Moal
  2022-09-11  6:33     ` Sam Li
  2022-09-11  7:02   ` Damien Le Moal
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  5:31 UTC (permalink / raw)
  To: Sam Li, qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare, Hanna Reitz

On 2022/09/10 14:27, Sam Li wrote:
[...]
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk); /* increase before waiting */
> +    blk_wait_while_drained(blk);
> +    if (!blk_is_available(blk)) {
> +        blk_dec_in_flight(blk);
> +        return -ENOMEDIUM;
> +    }
> +    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * op is the zone operation;
> + * offset is the byte offset from the start of the zoned device;
> + * len is the maximum number of bytes the command should operate on. It
> + * should be aligned with the zone sector size.

This should read:

* offset is the byte offset of the start of the first zone to operate on;
* len is the maximum number of bytes the command should operate on. It
* should be aligned with the device zone size.

No ?

> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +
> +    blk_inc_in_flight(blk);
> +    blk_wait_while_drained(blk);
> +
> +    ret = blk_check_byte_request(blk, offset, len);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>      BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0a8b4b426e..4edfa25d04 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -67,6 +67,9 @@
>  #include <sys/param.h>
>  #include <sys/syscall.h>
>  #include <sys/vfs.h>
> +#if defined(CONFIG_BLKZONED)
> +#include <linux/blkzoned.h>
> +#endif
>  #include <linux/cdrom.h>
>  #include <linux/fd.h>
>  #include <linux/fs.h>
> @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
>              PreallocMode prealloc;
>              Error **errp;
>          } truncate;
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            unsigned long zone_op;
> +            const char *zone_op_name;
> +            bool all;
> +        } zone_mgmt;
>      };
>  } RawPosixAIOData;
>  
> @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  #endif
>  
>      if (bs->sg || S_ISBLK(st.st_mode)) {
> -        int ret = hdev_get_max_hw_transfer(s->fd, &st);
> +        ret = hdev_get_max_hw_transfer(s->fd, &st);
>  
>          if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
>              bs->bl.max_hw_transfer = ret;
> @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>          zoned = BLK_Z_NONE;
>      }
>      bs->bl.zoned = zoned;
> +    if (zoned != BLK_Z_NONE) {
> +        ret = get_sysfs_long_val(&st, "chunk_sectors");
> +        if (ret > 0) {
> +            bs->bl.zone_sectors = ret;
> +        }

It may be good to check that we are getting a valid zone size here. So may be
change the check to something like this ?

	if (ret <= 0) {
	    *** print some error message mentioning the invalid zone size ***
	    bs->bl.zoned = BLK_Z_NONE;
	    return;
	}
	bs->bl.zone_sectors = ret;

> +
> +        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> +        if (ret > 0) {
> +            bs->bl.max_append_sectors = ret / 512;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_open_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_open_zones = ret;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_active_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_active_zones = ret;
> +        }
> +    }
>  }
>  
>  static int check_for_dasd(int fd)
> @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>  }
>  #endif
>  
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +#if defined(CONFIG_BLKZONED)
> +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> +                              const struct blk_zone *blkz) {
> +    zone->start = blkz->start;
> +    zone->length = blkz->len;
> +    zone->cap = blkz->capacity;
> +    zone->wp = blkz->wp;
> +
> +    switch (blkz->type) {
> +    case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +        zone->type = BLK_ZT_SWR;
> +        break;
> +    case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +        zone->type = BLK_ZT_SWP;
> +        break;
> +    case BLK_ZONE_TYPE_CONVENTIONAL:
> +        zone->type = BLK_ZT_CONV;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    switch (blkz->cond) {
> +    case BLK_ZONE_COND_NOT_WP:
> +        zone->cond = BLK_ZS_NOT_WP;
> +        break;
> +    case BLK_ZONE_COND_EMPTY:
> +        zone->cond = BLK_ZS_EMPTY;
> +        break;
> +    case BLK_ZONE_COND_IMP_OPEN:
> +        zone->cond =BLK_ZS_IOPEN;

Missing a space after the "=".

> +        break;
> +    case BLK_ZONE_COND_EXP_OPEN:
> +        zone->cond = BLK_ZS_EOPEN;
> +        break;
> +    case BLK_ZONE_COND_CLOSED:
> +        zone->cond = BLK_ZS_CLOSED;
> +        break;
> +    case BLK_ZONE_COND_READONLY:
> +        zone->cond = BLK_ZS_RDONLY;
> +        break;
> +    case BLK_ZONE_COND_FULL:
> +        zone->cond = BLK_ZS_FULL;
> +        break;
> +    case BLK_ZONE_COND_OFFLINE:
> +        zone->cond = BLK_ZS_OFFLINE;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +#endif
> +
> +#if defined(CONFIG_BLKZONED)
> +static int do_zone_report(int64_t sector, int fd,
> +                          struct BlockZoneDescriptor *zones,
> +                          unsigned int nrz) {
> +    struct blk_zone *blkz;
> +    int ret, n = 0, i = 0;
> +
> +    int64_t rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> +    g_autofree struct blk_zone_report *rep = NULL;
> +    rep = g_malloc(rep_size);
> +
> +    blkz = (struct blk_zone *)(rep + 1);
> +    while (n < nrz) {
> +        memset(rep, 0, rep_size);
> +        rep->sector = sector;
> +        rep->nr_zones = nrz - n;
> +
> +        do {
> +            ret = ioctl(fd, BLKREPORTZONE, rep);
> +        } while (ret != 0 && errno == EINTR);
> +        if (ret != 0) {
> +            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> +                    fd, sector, errno);
> +            return -errno;
> +        }
> +
> +        if (!rep->nr_zones) {
> +            break;
> +        }
> +
> +        for (i = 0; i < rep->nr_zones; i++, n++) {
> +            parse_zone(&zones[n], &blkz[i]);
> +            /* The next report should start after the last zone reported */
> +            sector = blkz[i].start + blkz[i].len;
> +        }
> +    }
> +    return n;
> +}
> +#endif
> +
> +static int handle_aiocb_zone_report(void *opaque) {
> +#if defined(CONFIG_BLKZONED)
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
> +    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> +    /* zoned block devices use 512-byte sectors */
> +    int64_t sector = aiocb->aio_offset / 512;

This variable is not really necessary I think.

> +
> +    *nr_zones = do_zone_report(sector, fd, zones, *nr_zones);
> +    return 0;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +static int handle_aiocb_zone_mgmt(void *opaque) {
> +#if defined(CONFIG_BLKZONED)
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    int64_t sector = aiocb->aio_offset / 512;
> +    int64_t nr_sectors = aiocb->aio_nbytes / 512;
> +    struct blk_zone_range range;
> +    int ret;
> +
> +    /* Execute the operation */
> +    range.sector = sector;
> +    range.nr_sectors = nr_sectors;
> +    do {
> +        ret = ioctl(fd, aiocb->zone_mgmt.zone_op, &range);
> +    } while (ret != 0 && errno == EINTR);
> +
> +    if (ret != 0) {
> +        error_report("ioctl %s failed %d", aiocb->zone_mgmt.zone_op_name,
> +                     errno);
> +        return -errno;
> +    }
> +    return ret;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static int handle_aiocb_copy_range(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -3022,6 +3194,104 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
>      }
>  }
>  
> +/*
> + * zone report - Get a zone block device's information in the form
> + * of an array of zone descriptors.
> + * zones is an array of zone descriptors to hold zone information on reply;
> + * offset can be any byte within the entire size of the device;
> + * nr_zones is the maxium number of sectors the command should operate on.
> + */
> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                           unsigned int *nr_zones,
> +                                           BlockZoneDescriptor *zones) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +
> +    acb = (RawPosixAIOData) {
> +        .bs         = bs,
> +        .aio_fildes = s->fd,
> +        .aio_type   = QEMU_AIO_ZONE_REPORT,
> +        .aio_offset = offset,
> +        .zone_report    = {
> +                .nr_zones       = nr_zones,
> +                .zones          = zones,
> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +/*
> + * zone management operations - Execute an operation on a zone
> + */
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +    int64_t zone_sector, zone_sector_mask;
> +    const char *zone_op_name;
> +    unsigned long zone_op;
> +    bool is_all = false;
> +
> +    zone_sector = bs->bl.zone_sectors;
> +    zone_sector_mask = zone_sector - 1;
> +    if (offset & zone_sector_mask) {
> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> +                     "%" PRId64 "", offset, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    if (len & zone_sector_mask) {

Linux allows SMR drives to have a smaller last zone. So this needs to be
accounted for here. Otherwise, a zone operation that includes the last smaller
zone would always fail. Something like this would work:

	if (((offset + len) < capacity &&
	    len & zone_sector_mask) ||
	    offset + len > capacity) {

> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> +                      " %" PRId64 "", len, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    switch (op) {
> +    case BLK_ZO_OPEN:
> +        zone_op_name = "BLKOPENZONE";
> +        zone_op = BLKOPENZONE;
> +        break;
> +    case BLK_ZO_CLOSE:
> +        zone_op_name = "BLKCLOSEZONE";
> +        zone_op = BLKCLOSEZONE;
> +        break;
> +    case BLK_ZO_FINISH:
> +        zone_op_name = "BLKFINISHZONE";
> +        zone_op = BLKFINISHZONE;
> +        break;
> +    case BLK_ZO_RESET:
> +        zone_op_name = "BLKRESETZONE";
> +        zone_op = BLKRESETZONE;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    acb = (RawPosixAIOData) {
> +        .bs             = bs,
> +        .aio_fildes     = s->fd,
> +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> +        .aio_offset     = offset,
> +        .aio_nbytes     = len,
> +        .zone_mgmt  = {
> +                .zone_op = zone_op,
> +                .zone_op_name = zone_op_name,
> +                .all = is_all,
> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static coroutine_fn int
>  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
>                  bool blkdev)
> @@ -3752,6 +4022,54 @@ static BlockDriver bdrv_host_device = {
>  #endif
>  };
>  
> +#if defined(CONFIG_BLKZONED)
> +static BlockDriver bdrv_zoned_host_device = {
> +    .format_name = "zoned_host_device",
> +    .protocol_name = "zoned_host_device",
> +    .instance_size = sizeof(BDRVRawState),
> +    .bdrv_needs_filename = true,
> +    .bdrv_probe_device  = hdev_probe_device,
> +    .bdrv_file_open     = hdev_open,
> +    .bdrv_close         = raw_close,
> +    .bdrv_reopen_prepare = raw_reopen_prepare,
> +    .bdrv_reopen_commit  = raw_reopen_commit,
> +    .bdrv_reopen_abort   = raw_reopen_abort,
> +    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
> +    .create_opts         = &bdrv_create_opts_simple,
> +    .mutable_opts        = mutable_opts,
> +    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
> +    .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
> +
> +    .bdrv_co_preadv         = raw_co_preadv,
> +    .bdrv_co_pwritev        = raw_co_pwritev,
> +    .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> +    .bdrv_co_pdiscard       = hdev_co_pdiscard,
> +    .bdrv_co_copy_range_from = raw_co_copy_range_from,
> +    .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> +    .bdrv_refresh_limits = raw_refresh_limits,
> +    .bdrv_io_plug = raw_aio_plug,
> +    .bdrv_io_unplug = raw_aio_unplug,
> +    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> +
> +    .bdrv_co_truncate       = raw_co_truncate,
> +    .bdrv_getlength = raw_getlength,
> +    .bdrv_get_info = raw_get_info,
> +    .bdrv_get_allocated_file_size
> +                        = raw_get_allocated_file_size,
> +    .bdrv_get_specific_stats = hdev_get_specific_stats,
> +    .bdrv_check_perm = raw_check_perm,
> +    .bdrv_set_perm   = raw_set_perm,
> +    .bdrv_abort_perm_update = raw_abort_perm_update,
> +    .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> +    .bdrv_probe_geometry = hdev_probe_geometry,
> +    .bdrv_co_ioctl = hdev_co_ioctl,
> +
> +    /* zone management operations */
> +    .bdrv_co_zone_report = raw_co_zone_report,
> +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> +};
> +#endif
> +
>  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
>  static void cdrom_parse_filename(const char *filename, QDict *options,
>                                   Error **errp)
> @@ -4012,6 +4330,9 @@ static void bdrv_file_init(void)
>      bdrv_register(&bdrv_file);
>  #if defined(HAVE_HOST_BLOCK_DEVICE)
>      bdrv_register(&bdrv_host_device);
> +#if defined(CONFIG_BLKZONED)
> +    bdrv_register(&bdrv_zoned_host_device);
> +#endif
>  #ifdef __linux__
>      bdrv_register(&bdrv_host_cdrom);
>  #endif
> diff --git a/block/io.c b/block/io.c
> index 0a8cbefe86..de9ec1d740 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3198,6 +3198,47 @@ out:
>      return co.ret;
>  }
>  
> +int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                        unsigned int *nr_zones,
> +                        BlockZoneDescriptor *zones)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_report) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
> +int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_mgmt) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>  {
>      IO_CODE();
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index fd25ffa9be..65463b88d9 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -88,6 +88,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
>  /* Ensure contents are flushed to disk.  */
>  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
>  
> +/* Report zone information of zone block device. */
> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                     unsigned int *nr_zones,
> +                                     BlockZoneDescriptor *zones);
> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +                                   int64_t offset, int64_t len);
> +
>  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
>  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
>  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 7f7863cc9e..078ddd7e67 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -691,6 +691,12 @@ struct BlockDriver {
>                                            QEMUIOVector *qiov,
>                                            int64_t pos);
>  
> +    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
> +            int64_t offset, unsigned int *nr_zones,
> +            BlockZoneDescriptor *zones);
> +    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
> +            int64_t offset, int64_t len);
> +
>      /* removable device specific */
>      bool (*bdrv_is_inserted)(BlockDriverState *bs);
>      void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> @@ -828,6 +834,21 @@ typedef struct BlockLimits {
>  
>      /* device zone model */
>      BlockZoneModel zoned;
> +
> +    /* zone size expressed in 512-byte sectors */
> +    uint32_t zone_sectors;
> +
> +    /* total number of zones */
> +    unsigned int nr_zones;
> +
> +    /* maximum sectors of a zone append write operation */
> +    int64_t max_append_sectors;
> +
> +    /* maximum number of open zones */
> +    int64_t max_open_zones;
> +
> +    /* maximum number of active zones */
> +    int64_t max_active_zones;
>  } BlockLimits;
>  
>  typedef struct BdrvOpBlocker BdrvOpBlocker;
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 21fc10c4c9..3d26929cdd 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -29,6 +29,8 @@
>  #define QEMU_AIO_WRITE_ZEROES 0x0020
>  #define QEMU_AIO_COPY_RANGE   0x0040
>  #define QEMU_AIO_TRUNCATE     0x0080
> +#define QEMU_AIO_ZONE_REPORT  0x0100
> +#define QEMU_AIO_ZONE_MGMT    0x0200
>  #define QEMU_AIO_TYPE_MASK \
>          (QEMU_AIO_READ | \
>           QEMU_AIO_WRITE | \
> @@ -37,7 +39,9 @@
>           QEMU_AIO_DISCARD | \
>           QEMU_AIO_WRITE_ZEROES | \
>           QEMU_AIO_COPY_RANGE | \
> -         QEMU_AIO_TRUNCATE)
> +         QEMU_AIO_TRUNCATE  | \
> +         QEMU_AIO_ZONE_REPORT | \
> +         QEMU_AIO_ZONE_MGMT)
>  
>  /* AIO flags */
>  #define QEMU_AIO_MISALIGNED   0x1000
> diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> index 50f5aa2e07..6835525582 100644
> --- a/include/sysemu/block-backend-io.h
> +++ b/include/sysemu/block-backend-io.h
> @@ -45,6 +45,12 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
>                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
>                            BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones, BlockZoneDescriptor *zones,
> +                                BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
>                               BlockCompletionFunc *cb, void *opaque);
>  void blk_aio_cancel_async(BlockAIOCB *acb);
> @@ -156,6 +162,17 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>  int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes, BdrvRequestFlags flags);
>  
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones);
> +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
> +                                         unsigned int *nr_zones,
> +                                         BlockZoneDescriptor *zones);
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                  int64_t offset, int64_t len);
> +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                       int64_t offset, int64_t len);
> +
>  int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes);
>  int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> diff --git a/meson.build b/meson.build
> index 20fddbd707..2f436bb355 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1883,6 +1883,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('live_block_migration').al
>  # has_header
>  config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
>  config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
> +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
>  config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
>  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
>  config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 2173e7734a..c6bbb7a037 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2942,6 +2942,7 @@
>  # @compress: Since 5.0
>  # @copy-before-write: Since 6.2
>  # @snapshot-access: Since 7.0
> +# @zoned_host_device: Since 7.2
>  #
>  # Since: 2.9
>  ##
> @@ -2955,7 +2956,8 @@
>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>              'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>              { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
> -            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> +            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat',
> +            { 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] }
>  
>  ##
>  # @BlockdevOptionsFile:
> @@ -4329,7 +4331,9 @@
>        'vhdx':       'BlockdevOptionsGenericFormat',
>        'vmdk':       'BlockdevOptionsGenericCOWFormat',
>        'vpc':        'BlockdevOptionsGenericFormat',
> -      'vvfat':      'BlockdevOptionsVVFAT'
> +      'vvfat':      'BlockdevOptionsVVFAT',
> +      'zoned_host_device': { 'type': 'BlockdevOptionsFile',
> +                             'if': 'CONFIG_BLKZONED' }
>    } }
>  
>  ##
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 952dc940f1..446a059603 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = {
>      .oneline    = "flush all in-core file state to disk",
>  };
>  
> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset;
> +    unsigned int nr_zones;
> +
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    nr_zones = cvtnum(argv[optind]);
> +
> +    g_autofree BlockZoneDescriptor *zones = NULL;
> +    zones = g_new(BlockZoneDescriptor, nr_zones);
> +    ret = blk_zone_report(blk, offset, &nr_zones, zones);
> +    if (ret < 0) {
> +        printf("zone report failed: %s\n", strerror(-ret));
> +    } else {
> +        for (int i = 0; i < nr_zones; ++i) {
> +            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
> +                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
> +                   "zcond:%u, [type: %u]\n",
> +                   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> +                   zones[i].cond, zones[i].type);
> +        }
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_report_cmd = {
> +        .name = "zone_report",
> +        .altname = "zrp",
> +        .cfunc = zone_report_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset number",
> +        .oneline = "report zone information",
> +};
> +
> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
> +    if (ret < 0) {
> +        printf("zone open failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_open_cmd = {
> +        .name = "zone_open",
> +        .altname = "zo",
> +        .cfunc = zone_open_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "explicit open a range of zones in zone block device",
> +};
> +
> +static int zone_close_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
> +    if (ret < 0) {
> +        printf("zone close failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_close_cmd = {
> +        .name = "zone_close",
> +        .altname = "zc",
> +        .cfunc = zone_close_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "close a range of zones in zone block device",
> +};
> +
> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
> +    if (ret < 0) {
> +        printf("zone finish failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_finish_cmd = {
> +        .name = "zone_finish",
> +        .altname = "zf",
> +        .cfunc = zone_finish_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "finish a range of zones in zone block device",
> +};
> +
> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
> +    if (ret < 0) {
> +        printf("zone reset failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_reset_cmd = {
> +        .name = "zone_reset",
> +        .altname = "zrs",
> +        .cfunc = zone_reset_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "reset a zone write pointer in zone block device",
> +};
> +
>  static int truncate_f(BlockBackend *blk, int argc, char **argv);
>  static const cmdinfo_t truncate_cmd = {
>      .name       = "truncate",
> @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
>      qemuio_add_command(&aio_write_cmd);
>      qemuio_add_command(&aio_flush_cmd);
>      qemuio_add_command(&flush_cmd);
> +    qemuio_add_command(&zone_report_cmd);
> +    qemuio_add_command(&zone_open_cmd);
> +    qemuio_add_command(&zone_close_cmd);
> +    qemuio_add_command(&zone_finish_cmd);
> +    qemuio_add_command(&zone_reset_cmd);
>      qemuio_add_command(&truncate_cmd);
>      qemuio_add_command(&length_cmd);
>      qemuio_add_command(&info_cmd);

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 4/7] raw-format: add zone operations to pass through requests
  2022-09-10  5:27 ` [PATCH v9 4/7] raw-format: add zone operations to pass through requests Sam Li
@ 2022-09-11  5:32   ` Damien Le Moal
  0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  5:32 UTC (permalink / raw)
  To: Sam Li, qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare, Hanna Reitz

On 2022/09/10 14:27, Sam Li wrote:
> raw-format driver usually sits on top of file-posix driver. It needs to
> pass through requests of zone commands.
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  block/raw-format.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/block/raw-format.c b/block/raw-format.c
> index 69fd650eaf..6b20bd22ef 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState *bs,
>      return bdrv_co_pdiscard(bs->file, offset, bytes);
>  }
>  
> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                           unsigned int *nr_zones,
> +                                           BlockZoneDescriptor *zones) {
> +    return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
> +}
> +
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +                                         int64_t offset, int64_t len) {
> +    return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
> +}
> +
>  static int64_t raw_getlength(BlockDriverState *bs)
>  {
>      int64_t len;
> @@ -614,6 +625,8 @@ BlockDriver bdrv_raw = {
>      .bdrv_co_pwritev      = &raw_co_pwritev,
>      .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
>      .bdrv_co_pdiscard     = &raw_co_pdiscard,
> +    .bdrv_co_zone_report  = &raw_co_zone_report,
> +    .bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
>      .bdrv_co_block_status = &raw_co_block_status,
>      .bdrv_co_copy_range_from = &raw_co_copy_range_from,
>      .bdrv_co_copy_range_to  = &raw_co_copy_range_to,

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 5/7] config: add check to block layer
  2022-09-10  5:27 ` [PATCH v9 5/7] config: add check to block layer Sam Li
@ 2022-09-11  5:34   ` Damien Le Moal
  2022-09-11  6:54     ` Sam Li
  2022-09-16 15:22   ` Stefan Hajnoczi
  1 sibling, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  5:34 UTC (permalink / raw)
  To: Sam Li, qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare, Hanna Reitz

On 2022/09/10 14:27, Sam Li wrote:
> Putting zoned/non-zoned BlockDrivers on top of each other is not
> allowed.
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block.c                          | 14 ++++++++++++++
>  block/file-posix.c               | 14 ++++++++++++++
>  block/raw-format.c               |  1 +
>  include/block/block_int-common.h |  5 +++++
>  4 files changed, 34 insertions(+)
> 
> diff --git a/block.c b/block.c
> index bc85f46eed..dad2ed3959 100644
> --- a/block.c
> +++ b/block.c
> @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
>          return;
>      }
>  
> +    /*
> +     * Non-zoned block drivers do not follow zoned storage constraints
> +     * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
> +     * drivers in a graph.
> +     */
> +    if (!parent_bs->drv->supports_zoned_children &&
> +        child_bs->bl.zoned == BLK_Z_HM) {

Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ?

> +        error_setg(errp, "Cannot add a %s child to a %s parent",
> +                   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
> +                   parent_bs->drv->supports_zoned_children ?
> +                   "support zoned children" : "not support zoned children");
> +        return;
> +    }
> +
>      if (!QLIST_EMPTY(&child_bs->parents)) {
>          error_setg(errp, "The node %s already has a parent",
>                     child_bs->node_name);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 4edfa25d04..354de22860 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>              goto fail;
>          }
>      }
> +#ifdef CONFIG_BLKZONED
> +    /*
> +     * The kernel page chache does not reliably work for writes to SWR zones
> +     * of zoned block device because it can not guarantee the order of writes.
> +     */
> +    if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
> +        if (!(s->open_flags & O_DIRECT)) {
> +            error_setg(errp, "driver=zoned_host_device was specified, but it "
> +                             "requires cache.direct=on, which was not specified.");
> +            ret = -EINVAL;

This line is not needed. Simply "return -EINVAL;".

> +            return ret; /* No host kernel page cache */
> +        }
> +    }
> +#endif
>  
>      if (S_ISBLK(st.st_mode)) {
>  #ifdef BLKDISCARDZEROES
> diff --git a/block/raw-format.c b/block/raw-format.c
> index 6b20bd22ef..9441536819 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
>  BlockDriver bdrv_raw = {
>      .format_name          = "raw",
>      .instance_size        = sizeof(BDRVRawState),
> +    .supports_zoned_children = true,
>      .bdrv_probe           = &raw_probe,
>      .bdrv_reopen_prepare  = &raw_reopen_prepare,
>      .bdrv_reopen_commit   = &raw_reopen_commit,
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 078ddd7e67..043aa161a0 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -127,6 +127,11 @@ struct BlockDriver {
>       */
>      bool is_format;
>  
> +    /*
> +     * Set to true if the BlockDriver supports zoned children.
> +     */
> +    bool supports_zoned_children;
> +
>      /*
>       * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
>       * this field set to true, except ones that are defined only by their

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation
  2022-09-10  5:27 ` [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation Sam Li
@ 2022-09-11  5:38   ` Damien Le Moal
  0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  5:38 UTC (permalink / raw)
  To: Sam Li, qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare, Hanna Reitz

On 2022/09/10 14:27, Sam Li wrote:
> Add the documentation about the zoned device support to virtio-blk
> emulation.
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  docs/devel/zoned-storage.rst           | 41 ++++++++++++++++++++++++++
>  docs/system/qemu-block-drivers.rst.inc |  6 ++++
>  2 files changed, 47 insertions(+)
>  create mode 100644 docs/devel/zoned-storage.rst
> 
> diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
> new file mode 100644
> index 0000000000..ead2d149cc
> --- /dev/null
> +++ b/docs/devel/zoned-storage.rst
> @@ -0,0 +1,41 @@
> +=============
> +zoned-storage
> +=============
> +
> +Zoned Block Devices (ZBDs) devide the LBA space into block regions called zones
> +that are larger than the LBA size. It can only allow sequential writes, which

s/It/They

> +reduces write amplification in SSDs, leading to higher throughput and increased
> +capacity. More details about ZBDs can be found at:

I would rephrase this like this, to be less assertive about the potential
benefits (as they depend on the vendor implementation):

..., which can reduce write amplification in SSDs, and potentially lead to
higher throughput and increased device capacity.

> +
> +https://zonedstorage.io/docs/introduction/zoned-storage
> +
> +1. Block layer APIs for zoned storage
> +-------------------------------------
> +QEMU block layer has three zoned storage model:
> +- BLK_Z_HM: This model only allows sequential writes access. It supports a set
> +of ZBD-specific I/O request that used by the host to manage device zones.
> +- BLK_Z_HA: It deals with both sequential writes and random writes access.
> +- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as
> +non-zoned devices.
> +
> +The block device information resides inside BlockDriverState. QEMU uses
> +BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the
> +block layer while processing I/O requests. A BlockBackend has a root pointer to
> +a BlockDriverState graph(for example, raw format on top of file-posix). The
> +zoned storage information can be propagated from the leaf BlockDriverState all
> +the way up to the BlockBackend. If the zoned storage model in file-posix is
> +set to BLK_Z_HM, then block drivers will declare support for zoned host device.
> +
> +The block layer APIs support commands needed for zoned storage devices,
> +including report zones, four zone operations, and zone append.
> +
> +2. Emulating zoned storage controllers
> +--------------------------------------
> +When the BlockBackend's BlockLimits model reports a zoned storage device, users
> +like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
> +APIs for zoned storage emulation or testing.
> +
> +For example, the command line for zone report testing a null_blk device of
> +qemu-io-cmds.c is:
> +$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
> +"zrp offset nr_zones"
> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index dfe5d2293d..0b97227fd9 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -430,6 +430,12 @@ Hard disks
>    you may corrupt your host data (use the ``-snapshot`` command
>    line option or modify the device permissions accordingly).
>  
> +Zoned block devices
> +  Zoned block devices can be passed through to the guest if the emulated storage
> +  controller supports zoned storage. Use ``--blockdev zoned_host_device,
> +  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
> +  as ``drive0``.
> +
>  Windows
>  ^^^^^^^
>  

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-11  5:31   ` Damien Le Moal
@ 2022-09-11  6:33     ` Sam Li
  2022-09-11  6:48       ` Damien Le Moal
  0 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-11  6:33 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, Eric Blake,
	qemu block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	Hannes Reinecke, Hanna Reitz

Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年9月11日周日 13:31写道:
>
> On 2022/09/10 14:27, Sam Li wrote:
> [...]
> > +/*
> > + * Send a zone_report command.
> > + * offset is a byte offset from the start of the device. No alignment
> > + * required for offset.
> > + * nr_zones represents IN maximum and OUT actual.
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +                                    unsigned int *nr_zones,
> > +                                    BlockZoneDescriptor *zones)
> > +{
> > +    int ret;
> > +    IO_CODE();
> > +
> > +    blk_inc_in_flight(blk); /* increase before waiting */
> > +    blk_wait_while_drained(blk);
> > +    if (!blk_is_available(blk)) {
> > +        blk_dec_in_flight(blk);
> > +        return -ENOMEDIUM;
> > +    }
> > +    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> > +    blk_dec_in_flight(blk);
> > +    return ret;
> > +}
> > +
> > +/*
> > + * Send a zone_management command.
> > + * op is the zone operation;
> > + * offset is the byte offset from the start of the zoned device;
> > + * len is the maximum number of bytes the command should operate on. It
> > + * should be aligned with the zone sector size.
>
> This should read:
>
> * offset is the byte offset of the start of the first zone to operate on;
> * len is the maximum number of bytes the command should operate on. It
> * should be aligned with the device zone size.
>
> No ?

Right. The zone sector size here is meant for the zone size whose unit
is a 512-byte sector.

>
> > + */
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +        int64_t offset, int64_t len)
> > +{
> > +    int ret;
> > +    IO_CODE();
> > +
> > +
> > +    blk_inc_in_flight(blk);
> > +    blk_wait_while_drained(blk);
> > +
> > +    ret = blk_check_byte_request(blk, offset, len);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> > +    blk_dec_in_flight(blk);
> > +    return ret;
> > +}
> > +
> >  void blk_drain(BlockBackend *blk)
> >  {
> >      BlockDriverState *bs = blk_bs(blk);
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 0a8b4b426e..4edfa25d04 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -67,6 +67,9 @@
> >  #include <sys/param.h>
> >  #include <sys/syscall.h>
> >  #include <sys/vfs.h>
> > +#if defined(CONFIG_BLKZONED)
> > +#include <linux/blkzoned.h>
> > +#endif
> >  #include <linux/cdrom.h>
> >  #include <linux/fd.h>
> >  #include <linux/fs.h>
> > @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
> >              PreallocMode prealloc;
> >              Error **errp;
> >          } truncate;
> > +        struct {
> > +            unsigned int *nr_zones;
> > +            BlockZoneDescriptor *zones;
> > +        } zone_report;
> > +        struct {
> > +            unsigned long zone_op;
> > +            const char *zone_op_name;
> > +            bool all;
> > +        } zone_mgmt;
> >      };
> >  } RawPosixAIOData;
> >
> > @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >  #endif
> >
> >      if (bs->sg || S_ISBLK(st.st_mode)) {
> > -        int ret = hdev_get_max_hw_transfer(s->fd, &st);
> > +        ret = hdev_get_max_hw_transfer(s->fd, &st);
> >
> >          if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> >              bs->bl.max_hw_transfer = ret;
> > @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >          zoned = BLK_Z_NONE;
> >      }
> >      bs->bl.zoned = zoned;
> > +    if (zoned != BLK_Z_NONE) {
> > +        ret = get_sysfs_long_val(&st, "chunk_sectors");
> > +        if (ret > 0) {
> > +            bs->bl.zone_sectors = ret;
> > +        }
>
> It may be good to check that we are getting a valid zone size here. So may be
> change the check to something like this ?
>
>         if (ret <= 0) {
>             *** print some error message mentioning the invalid zone size ***
>             bs->bl.zoned = BLK_Z_NONE;
>             return;
>         }
>         bs->bl.zone_sectors = ret;
>

Ok, thanks!

> > +
> > +        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> > +        if (ret > 0) {
> > +            bs->bl.max_append_sectors = ret / 512;
> > +        }
> > +
> > +        ret = get_sysfs_long_val(&st, "max_open_zones");
> > +        if (ret >= 0) {
> > +            bs->bl.max_open_zones = ret;
> > +        }
> > +
> > +        ret = get_sysfs_long_val(&st, "max_active_zones");
> > +        if (ret >= 0) {
> > +            bs->bl.max_active_zones = ret;
> > +        }
> > +    }
> >  }
> >
> >  static int check_for_dasd(int fd)
> > @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
> >  }
> >  #endif
> >
> > +/*
> > + * parse_zone - Fill a zone descriptor
> > + */
> > +#if defined(CONFIG_BLKZONED)
> > +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> > +                              const struct blk_zone *blkz) {
> > +    zone->start = blkz->start;
> > +    zone->length = blkz->len;
> > +    zone->cap = blkz->capacity;
> > +    zone->wp = blkz->wp;
> > +
> > +    switch (blkz->type) {
> > +    case BLK_ZONE_TYPE_SEQWRITE_REQ:
> > +        zone->type = BLK_ZT_SWR;
> > +        break;
> > +    case BLK_ZONE_TYPE_SEQWRITE_PREF:
> > +        zone->type = BLK_ZT_SWP;
> > +        break;
> > +    case BLK_ZONE_TYPE_CONVENTIONAL:
> > +        zone->type = BLK_ZT_CONV;
> > +        break;
> > +    default:
> > +        g_assert_not_reached();
> > +    }
> > +
> > +    switch (blkz->cond) {
> > +    case BLK_ZONE_COND_NOT_WP:
> > +        zone->cond = BLK_ZS_NOT_WP;
> > +        break;
> > +    case BLK_ZONE_COND_EMPTY:
> > +        zone->cond = BLK_ZS_EMPTY;
> > +        break;
> > +    case BLK_ZONE_COND_IMP_OPEN:
> > +        zone->cond =BLK_ZS_IOPEN;
>
> Missing a space after the "=".
>
> > +        break;
> > +    case BLK_ZONE_COND_EXP_OPEN:
> > +        zone->cond = BLK_ZS_EOPEN;
> > +        break;
> > +    case BLK_ZONE_COND_CLOSED:
> > +        zone->cond = BLK_ZS_CLOSED;
> > +        break;
> > +    case BLK_ZONE_COND_READONLY:
> > +        zone->cond = BLK_ZS_RDONLY;
> > +        break;
> > +    case BLK_ZONE_COND_FULL:
> > +        zone->cond = BLK_ZS_FULL;
> > +        break;
> > +    case BLK_ZONE_COND_OFFLINE:
> > +        zone->cond = BLK_ZS_OFFLINE;
> > +        break;
> > +    default:
> > +        g_assert_not_reached();
> > +    }
> > +}
> > +#endif
> > +
> > +#if defined(CONFIG_BLKZONED)
> > +static int do_zone_report(int64_t sector, int fd,
> > +                          struct BlockZoneDescriptor *zones,
> > +                          unsigned int nrz) {
> > +    struct blk_zone *blkz;
> > +    int ret, n = 0, i = 0;
> > +
> > +    int64_t rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> > +    g_autofree struct blk_zone_report *rep = NULL;
> > +    rep = g_malloc(rep_size);
> > +
> > +    blkz = (struct blk_zone *)(rep + 1);
> > +    while (n < nrz) {
> > +        memset(rep, 0, rep_size);
> > +        rep->sector = sector;
> > +        rep->nr_zones = nrz - n;
> > +
> > +        do {
> > +            ret = ioctl(fd, BLKREPORTZONE, rep);
> > +        } while (ret != 0 && errno == EINTR);
> > +        if (ret != 0) {
> > +            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> > +                    fd, sector, errno);
> > +            return -errno;
> > +        }
> > +
> > +        if (!rep->nr_zones) {
> > +            break;
> > +        }
> > +
> > +        for (i = 0; i < rep->nr_zones; i++, n++) {
> > +            parse_zone(&zones[n], &blkz[i]);
> > +            /* The next report should start after the last zone reported */
> > +            sector = blkz[i].start + blkz[i].len;
> > +        }
> > +    }
> > +    return n;
> > +}
> > +#endif
> > +
> > +static int handle_aiocb_zone_report(void *opaque) {
> > +#if defined(CONFIG_BLKZONED)
> > +    RawPosixAIOData *aiocb = opaque;
> > +    int fd = aiocb->aio_fildes;
> > +    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
> > +    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> > +    /* zoned block devices use 512-byte sectors */
> > +    int64_t sector = aiocb->aio_offset / 512;
>
> This variable is not really necessary I think.
>
> > +
> > +    *nr_zones = do_zone_report(sector, fd, zones, *nr_zones);
> > +    return 0;
> > +#else
> > +    return -ENOTSUP;
> > +#endif
> > +}
> > +
> > +static int handle_aiocb_zone_mgmt(void *opaque) {
> > +#if defined(CONFIG_BLKZONED)
> > +    RawPosixAIOData *aiocb = opaque;
> > +    int fd = aiocb->aio_fildes;
> > +    int64_t sector = aiocb->aio_offset / 512;
> > +    int64_t nr_sectors = aiocb->aio_nbytes / 512;
> > +    struct blk_zone_range range;
> > +    int ret;
> > +
> > +    /* Execute the operation */
> > +    range.sector = sector;
> > +    range.nr_sectors = nr_sectors;
> > +    do {
> > +        ret = ioctl(fd, aiocb->zone_mgmt.zone_op, &range);
> > +    } while (ret != 0 && errno == EINTR);
> > +
> > +    if (ret != 0) {
> > +        error_report("ioctl %s failed %d", aiocb->zone_mgmt.zone_op_name,
> > +                     errno);
> > +        return -errno;
> > +    }
> > +    return ret;
> > +#else
> > +    return -ENOTSUP;
> > +#endif
> > +}
> > +
> >  static int handle_aiocb_copy_range(void *opaque)
> >  {
> >      RawPosixAIOData *aiocb = opaque;
> > @@ -3022,6 +3194,104 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
> >      }
> >  }
> >
> > +/*
> > + * zone report - Get a zone block device's information in the form
> > + * of an array of zone descriptors.
> > + * zones is an array of zone descriptors to hold zone information on reply;
> > + * offset can be any byte within the entire size of the device;
> > + * nr_zones is the maxium number of sectors the command should operate on.
> > + */
> > +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +                                           unsigned int *nr_zones,
> > +                                           BlockZoneDescriptor *zones) {
> > +#if defined(CONFIG_BLKZONED)
> > +    BDRVRawState *s = bs->opaque;
> > +    RawPosixAIOData acb;
> > +
> > +    acb = (RawPosixAIOData) {
> > +        .bs         = bs,
> > +        .aio_fildes = s->fd,
> > +        .aio_type   = QEMU_AIO_ZONE_REPORT,
> > +        .aio_offset = offset,
> > +        .zone_report    = {
> > +                .nr_zones       = nr_zones,
> > +                .zones          = zones,
> > +        },
> > +    };
> > +
> > +    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
> > +#else
> > +    return -ENOTSUP;
> > +#endif
> > +}
> > +
> > +/*
> > + * zone management operations - Execute an operation on a zone
> > + */
> > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +        int64_t offset, int64_t len) {
> > +#if defined(CONFIG_BLKZONED)
> > +    BDRVRawState *s = bs->opaque;
> > +    RawPosixAIOData acb;
> > +    int64_t zone_sector, zone_sector_mask;
> > +    const char *zone_op_name;
> > +    unsigned long zone_op;
> > +    bool is_all = false;
> > +
> > +    zone_sector = bs->bl.zone_sectors;
> > +    zone_sector_mask = zone_sector - 1;
> > +    if (offset & zone_sector_mask) {
> > +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> > +                     "%" PRId64 "", offset, zone_sector);
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (len & zone_sector_mask) {
>
> Linux allows SMR drives to have a smaller last zone. So this needs to be
> accounted for here. Otherwise, a zone operation that includes the last smaller
> zone would always fail. Something like this would work:
>
>         if (((offset + len) < capacity &&
>             len & zone_sector_mask) ||
>             offset + len > capacity) {
>

I see. I think the offset can be removed, like:
if (((len < capacity && len & zone_sector_mask) || len > capacity) {
Then if we use the previous zone's len for the last smaller zone, it
will be greater than its capacity.

I will also include "opening the last zone" as a test case later.

> > +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> > +                      " %" PRId64 "", len, zone_sector);
> > +        return -EINVAL;
> > +    }
> > +
> > +    switch (op) {
> > +    case BLK_ZO_OPEN:
> > +        zone_op_name = "BLKOPENZONE";
> > +        zone_op = BLKOPENZONE;
> > +        break;
> > +    case BLK_ZO_CLOSE:
> > +        zone_op_name = "BLKCLOSEZONE";
> > +        zone_op = BLKCLOSEZONE;
> > +        break;
> > +    case BLK_ZO_FINISH:
> > +        zone_op_name = "BLKFINISHZONE";
> > +        zone_op = BLKFINISHZONE;
> > +        break;
> > +    case BLK_ZO_RESET:
> > +        zone_op_name = "BLKRESETZONE";
> > +        zone_op = BLKRESETZONE;
> > +        break;
> > +    default:
> > +        g_assert_not_reached();
> > +    }
> > +
> > +    acb = (RawPosixAIOData) {
> > +        .bs             = bs,
> > +        .aio_fildes     = s->fd,
> > +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> > +        .aio_offset     = offset,
> > +        .aio_nbytes     = len,
> > +        .zone_mgmt  = {
> > +                .zone_op = zone_op,
> > +                .zone_op_name = zone_op_name,
> > +                .all = is_all,
> > +        },
> > +    };
> > +
> > +    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> > +#else
> > +    return -ENOTSUP;
> > +#endif
> > +}
> > +
> >  static coroutine_fn int
> >  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
> >                  bool blkdev)
> > @@ -3752,6 +4022,54 @@ static BlockDriver bdrv_host_device = {
> >  #endif
> >  };
> >
> > +#if defined(CONFIG_BLKZONED)
> > +static BlockDriver bdrv_zoned_host_device = {
> > +    .format_name = "zoned_host_device",
> > +    .protocol_name = "zoned_host_device",
> > +    .instance_size = sizeof(BDRVRawState),
> > +    .bdrv_needs_filename = true,
> > +    .bdrv_probe_device  = hdev_probe_device,
> > +    .bdrv_file_open     = hdev_open,
> > +    .bdrv_close         = raw_close,
> > +    .bdrv_reopen_prepare = raw_reopen_prepare,
> > +    .bdrv_reopen_commit  = raw_reopen_commit,
> > +    .bdrv_reopen_abort   = raw_reopen_abort,
> > +    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
> > +    .create_opts         = &bdrv_create_opts_simple,
> > +    .mutable_opts        = mutable_opts,
> > +    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
> > +    .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
> > +
> > +    .bdrv_co_preadv         = raw_co_preadv,
> > +    .bdrv_co_pwritev        = raw_co_pwritev,
> > +    .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> > +    .bdrv_co_pdiscard       = hdev_co_pdiscard,
> > +    .bdrv_co_copy_range_from = raw_co_copy_range_from,
> > +    .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> > +    .bdrv_refresh_limits = raw_refresh_limits,
> > +    .bdrv_io_plug = raw_aio_plug,
> > +    .bdrv_io_unplug = raw_aio_unplug,
> > +    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > +
> > +    .bdrv_co_truncate       = raw_co_truncate,
> > +    .bdrv_getlength = raw_getlength,
> > +    .bdrv_get_info = raw_get_info,
> > +    .bdrv_get_allocated_file_size
> > +                        = raw_get_allocated_file_size,
> > +    .bdrv_get_specific_stats = hdev_get_specific_stats,
> > +    .bdrv_check_perm = raw_check_perm,
> > +    .bdrv_set_perm   = raw_set_perm,
> > +    .bdrv_abort_perm_update = raw_abort_perm_update,
> > +    .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> > +    .bdrv_probe_geometry = hdev_probe_geometry,
> > +    .bdrv_co_ioctl = hdev_co_ioctl,
> > +
> > +    /* zone management operations */
> > +    .bdrv_co_zone_report = raw_co_zone_report,
> > +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> > +};
> > +#endif
> > +
> >  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> >  static void cdrom_parse_filename(const char *filename, QDict *options,
> >                                   Error **errp)
> > @@ -4012,6 +4330,9 @@ static void bdrv_file_init(void)
> >      bdrv_register(&bdrv_file);
> >  #if defined(HAVE_HOST_BLOCK_DEVICE)
> >      bdrv_register(&bdrv_host_device);
> > +#if defined(CONFIG_BLKZONED)
> > +    bdrv_register(&bdrv_zoned_host_device);
> > +#endif
> >  #ifdef __linux__
> >      bdrv_register(&bdrv_host_cdrom);
> >  #endif
> > diff --git a/block/io.c b/block/io.c
> > index 0a8cbefe86..de9ec1d740 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -3198,6 +3198,47 @@ out:
> >      return co.ret;
> >  }
> >
> > +int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +                        unsigned int *nr_zones,
> > +                        BlockZoneDescriptor *zones)
> > +{
> > +    BlockDriver *drv = bs->drv;
> > +    CoroutineIOCompletion co = {
> > +            .coroutine = qemu_coroutine_self(),
> > +    };
> > +    IO_CODE();
> > +
> > +    bdrv_inc_in_flight(bs);
> > +    if (!drv || !drv->bdrv_co_zone_report) {
> > +        co.ret = -ENOTSUP;
> > +        goto out;
> > +    }
> > +    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
> > +out:
> > +    bdrv_dec_in_flight(bs);
> > +    return co.ret;
> > +}
> > +
> > +int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +        int64_t offset, int64_t len)
> > +{
> > +    BlockDriver *drv = bs->drv;
> > +    CoroutineIOCompletion co = {
> > +            .coroutine = qemu_coroutine_self(),
> > +    };
> > +    IO_CODE();
> > +
> > +    bdrv_inc_in_flight(bs);
> > +    if (!drv || !drv->bdrv_co_zone_mgmt) {
> > +        co.ret = -ENOTSUP;
> > +        goto out;
> > +    }
> > +    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> > +out:
> > +    bdrv_dec_in_flight(bs);
> > +    return co.ret;
> > +}
> > +
> >  void *qemu_blockalign(BlockDriverState *bs, size_t size)
> >  {
> >      IO_CODE();
> > diff --git a/include/block/block-io.h b/include/block/block-io.h
> > index fd25ffa9be..65463b88d9 100644
> > --- a/include/block/block-io.h
> > +++ b/include/block/block-io.h
> > @@ -88,6 +88,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
> >  /* Ensure contents are flushed to disk.  */
> >  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
> >
> > +/* Report zone information of zone block device. */
> > +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> > +                                     unsigned int *nr_zones,
> > +                                     BlockZoneDescriptor *zones);
> > +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +                                   int64_t offset, int64_t len);
> > +
> >  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
> >  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
> >  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> > diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> > index 7f7863cc9e..078ddd7e67 100644
> > --- a/include/block/block_int-common.h
> > +++ b/include/block/block_int-common.h
> > @@ -691,6 +691,12 @@ struct BlockDriver {
> >                                            QEMUIOVector *qiov,
> >                                            int64_t pos);
> >
> > +    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
> > +            int64_t offset, unsigned int *nr_zones,
> > +            BlockZoneDescriptor *zones);
> > +    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
> > +            int64_t offset, int64_t len);
> > +
> >      /* removable device specific */
> >      bool (*bdrv_is_inserted)(BlockDriverState *bs);
> >      void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> > @@ -828,6 +834,21 @@ typedef struct BlockLimits {
> >
> >      /* device zone model */
> >      BlockZoneModel zoned;
> > +
> > +    /* zone size expressed in 512-byte sectors */
> > +    uint32_t zone_sectors;
> > +
> > +    /* total number of zones */
> > +    unsigned int nr_zones;
> > +
> > +    /* maximum sectors of a zone append write operation */
> > +    int64_t max_append_sectors;
> > +
> > +    /* maximum number of open zones */
> > +    int64_t max_open_zones;
> > +
> > +    /* maximum number of active zones */
> > +    int64_t max_active_zones;
> >  } BlockLimits;
> >
> >  typedef struct BdrvOpBlocker BdrvOpBlocker;
> > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > index 21fc10c4c9..3d26929cdd 100644
> > --- a/include/block/raw-aio.h
> > +++ b/include/block/raw-aio.h
> > @@ -29,6 +29,8 @@
> >  #define QEMU_AIO_WRITE_ZEROES 0x0020
> >  #define QEMU_AIO_COPY_RANGE   0x0040
> >  #define QEMU_AIO_TRUNCATE     0x0080
> > +#define QEMU_AIO_ZONE_REPORT  0x0100
> > +#define QEMU_AIO_ZONE_MGMT    0x0200
> >  #define QEMU_AIO_TYPE_MASK \
> >          (QEMU_AIO_READ | \
> >           QEMU_AIO_WRITE | \
> > @@ -37,7 +39,9 @@
> >           QEMU_AIO_DISCARD | \
> >           QEMU_AIO_WRITE_ZEROES | \
> >           QEMU_AIO_COPY_RANGE | \
> > -         QEMU_AIO_TRUNCATE)
> > +         QEMU_AIO_TRUNCATE  | \
> > +         QEMU_AIO_ZONE_REPORT | \
> > +         QEMU_AIO_ZONE_MGMT)
> >
> >  /* AIO flags */
> >  #define QEMU_AIO_MISALIGNED   0x1000
> > diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> > index 50f5aa2e07..6835525582 100644
> > --- a/include/sysemu/block-backend-io.h
> > +++ b/include/sysemu/block-backend-io.h
> > @@ -45,6 +45,12 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
> >                              BlockCompletionFunc *cb, void *opaque);
> >  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
> >                            BlockCompletionFunc *cb, void *opaque);
> > +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> > +                                unsigned int *nr_zones, BlockZoneDescriptor *zones,
> > +                                BlockCompletionFunc *cb, void *opaque);
> > +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                              int64_t offset, int64_t len,
> > +                              BlockCompletionFunc *cb, void *opaque);
> >  BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
> >                               BlockCompletionFunc *cb, void *opaque);
> >  void blk_aio_cancel_async(BlockAIOCB *acb);
> > @@ -156,6 +162,17 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> >  int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
> >                                        int64_t bytes, BdrvRequestFlags flags);
> >
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +                                    unsigned int *nr_zones,
> > +                                    BlockZoneDescriptor *zones);
> > +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
> > +                                         unsigned int *nr_zones,
> > +                                         BlockZoneDescriptor *zones);
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                                  int64_t offset, int64_t len);
> > +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +                                       int64_t offset, int64_t len);
> > +
> >  int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset,
> >                                        int64_t bytes);
> >  int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> > diff --git a/meson.build b/meson.build
> > index 20fddbd707..2f436bb355 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -1883,6 +1883,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('live_block_migration').al
> >  # has_header
> >  config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
> >  config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
> > +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
> >  config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
> >  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
> >  config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 2173e7734a..c6bbb7a037 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -2942,6 +2942,7 @@
> >  # @compress: Since 5.0
> >  # @copy-before-write: Since 6.2
> >  # @snapshot-access: Since 7.0
> > +# @zoned_host_device: Since 7.2
> >  #
> >  # Since: 2.9
> >  ##
> > @@ -2955,7 +2956,8 @@
> >              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
> >              'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
> >              { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
> > -            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> > +            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat',
> > +            { 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] }
> >
> >  ##
> >  # @BlockdevOptionsFile:
> > @@ -4329,7 +4331,9 @@
> >        'vhdx':       'BlockdevOptionsGenericFormat',
> >        'vmdk':       'BlockdevOptionsGenericCOWFormat',
> >        'vpc':        'BlockdevOptionsGenericFormat',
> > -      'vvfat':      'BlockdevOptionsVVFAT'
> > +      'vvfat':      'BlockdevOptionsVVFAT',
> > +      'zoned_host_device': { 'type': 'BlockdevOptionsFile',
> > +                             'if': 'CONFIG_BLKZONED' }
> >    } }
> >
> >  ##
> > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> > index 952dc940f1..446a059603 100644
> > --- a/qemu-io-cmds.c
> > +++ b/qemu-io-cmds.c
> > @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = {
> >      .oneline    = "flush all in-core file state to disk",
> >  };
> >
> > +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset;
> > +    unsigned int nr_zones;
> > +
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    nr_zones = cvtnum(argv[optind]);
> > +
> > +    g_autofree BlockZoneDescriptor *zones = NULL;
> > +    zones = g_new(BlockZoneDescriptor, nr_zones);
> > +    ret = blk_zone_report(blk, offset, &nr_zones, zones);
> > +    if (ret < 0) {
> > +        printf("zone report failed: %s\n", strerror(-ret));
> > +    } else {
> > +        for (int i = 0; i < nr_zones; ++i) {
> > +            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
> > +                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
> > +                   "zcond:%u, [type: %u]\n",
> > +                   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> > +                   zones[i].cond, zones[i].type);
> > +        }
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_report_cmd = {
> > +        .name = "zone_report",
> > +        .altname = "zrp",
> > +        .cfunc = zone_report_f,
> > +        .argmin = 2,
> > +        .argmax = 2,
> > +        .args = "offset number",
> > +        .oneline = "report zone information",
> > +};
> > +
> > +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone open failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_open_cmd = {
> > +        .name = "zone_open",
> > +        .altname = "zo",
> > +        .cfunc = zone_open_f,
> > +        .argmin = 2,
> > +        .argmax = 2,
> > +        .args = "offset len",
> > +        .oneline = "explicit open a range of zones in zone block device",
> > +};
> > +
> > +static int zone_close_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone close failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_close_cmd = {
> > +        .name = "zone_close",
> > +        .altname = "zc",
> > +        .cfunc = zone_close_f,
> > +        .argmin = 2,
> > +        .argmax = 2,
> > +        .args = "offset len",
> > +        .oneline = "close a range of zones in zone block device",
> > +};
> > +
> > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone finish failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_finish_cmd = {
> > +        .name = "zone_finish",
> > +        .altname = "zf",
> > +        .cfunc = zone_finish_f,
> > +        .argmin = 2,
> > +        .argmax = 2,
> > +        .args = "offset len",
> > +        .oneline = "finish a range of zones in zone block device",
> > +};
> > +
> > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +    int ret;
> > +    int64_t offset, len;
> > +    ++optind;
> > +    offset = cvtnum(argv[optind]);
> > +    ++optind;
> > +    len = cvtnum(argv[optind]);
> > +    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
> > +    if (ret < 0) {
> > +        printf("zone reset failed: %s\n", strerror(-ret));
> > +    }
> > +    return ret;
> > +}
> > +
> > +static const cmdinfo_t zone_reset_cmd = {
> > +        .name = "zone_reset",
> > +        .altname = "zrs",
> > +        .cfunc = zone_reset_f,
> > +        .argmin = 2,
> > +        .argmax = 2,
> > +        .args = "offset len",
> > +        .oneline = "reset a zone write pointer in zone block device",
> > +};
> > +
> >  static int truncate_f(BlockBackend *blk, int argc, char **argv);
> >  static const cmdinfo_t truncate_cmd = {
> >      .name       = "truncate",
> > @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
> >      qemuio_add_command(&aio_write_cmd);
> >      qemuio_add_command(&aio_flush_cmd);
> >      qemuio_add_command(&flush_cmd);
> > +    qemuio_add_command(&zone_report_cmd);
> > +    qemuio_add_command(&zone_open_cmd);
> > +    qemuio_add_command(&zone_close_cmd);
> > +    qemuio_add_command(&zone_finish_cmd);
> > +    qemuio_add_command(&zone_reset_cmd);
> >      qemuio_add_command(&truncate_cmd);
> >      qemuio_add_command(&length_cmd);
> >      qemuio_add_command(&info_cmd);
>
> --
> Damien Le Moal
> Western Digital Research
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-11  6:33     ` Sam Li
@ 2022-09-11  6:48       ` Damien Le Moal
  2022-09-11  7:30         ` Sam Li
  0 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  6:48 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, Eric Blake,
	qemu block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	Hannes Reinecke, Hanna Reitz

On 2022/09/11 15:33, Sam Li wrote:
> Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年9月11日周日 13:31写道:
[...]
>>> +/*
>>> + * zone management operations - Execute an operation on a zone
>>> + */
>>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
>>> +        int64_t offset, int64_t len) {
>>> +#if defined(CONFIG_BLKZONED)
>>> +    BDRVRawState *s = bs->opaque;
>>> +    RawPosixAIOData acb;
>>> +    int64_t zone_sector, zone_sector_mask;
>>> +    const char *zone_op_name;
>>> +    unsigned long zone_op;
>>> +    bool is_all = false;
>>> +
>>> +    zone_sector = bs->bl.zone_sectors;
>>> +    zone_sector_mask = zone_sector - 1;
>>> +    if (offset & zone_sector_mask) {
>>> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
>>> +                     "%" PRId64 "", offset, zone_sector);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    if (len & zone_sector_mask) {
>>
>> Linux allows SMR drives to have a smaller last zone. So this needs to be
>> accounted for here. Otherwise, a zone operation that includes the last smaller
>> zone would always fail. Something like this would work:
>>
>>         if (((offset + len) < capacity &&
>>             len & zone_sector_mask) ||
>>             offset + len > capacity) {
>>
> 
> I see. I think the offset can be removed, like:
> if (((len < capacity && len & zone_sector_mask) || len > capacity) {
> Then if we use the previous zone's len for the last smaller zone, it
> will be greater than its capacity.

Nope, you cannot remove the offset since the zone operation may be for that last
zone only, that is, offset == last zone start and len == last zone smaller size.
In that case, len is alwats smaller than capacity.

> 
> I will also include "opening the last zone" as a test case later.

Note that you can create such smaller last zone on the host with null_blk by
specifying a device capacity that is *not* a multiple of the zone size.

> 
>>> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
>>> +                      " %" PRId64 "", len, zone_sector);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    switch (op) {
>>> +    case BLK_ZO_OPEN:
>>> +        zone_op_name = "BLKOPENZONE";
>>> +        zone_op = BLKOPENZONE;
>>> +        break;
>>> +    case BLK_ZO_CLOSE:
>>> +        zone_op_name = "BLKCLOSEZONE";
>>> +        zone_op = BLKCLOSEZONE;
>>> +        break;
>>> +    case BLK_ZO_FINISH:
>>> +        zone_op_name = "BLKFINISHZONE";
>>> +        zone_op = BLKFINISHZONE;
>>> +        break;
>>> +    case BLK_ZO_RESET:
>>> +        zone_op_name = "BLKRESETZONE";
>>> +        zone_op = BLKRESETZONE;
>>> +        break;
>>> +    default:
>>> +        g_assert_not_reached();
>>> +    }
>>> +
>>> +    acb = (RawPosixAIOData) {
>>> +        .bs             = bs,
>>> +        .aio_fildes     = s->fd,
>>> +        .aio_type       = QEMU_AIO_ZONE_MGMT,
>>> +        .aio_offset     = offset,
>>> +        .aio_nbytes     = len,
>>> +        .zone_mgmt  = {
>>> +                .zone_op = zone_op,
>>> +                .zone_op_name = zone_op_name,
>>> +                .all = is_all,
>>> +        },
>>> +    };
>>> +
>>> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
>>> +#else
>>> +    return -ENOTSUP;
>>> +#endif
>>> +}

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 5/7] config: add check to block layer
  2022-09-11  5:34   ` Damien Le Moal
@ 2022-09-11  6:54     ` Sam Li
  2022-09-11  7:05       ` Damien Le Moal
  0 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-11  6:54 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, Eric Blake,
	qemu block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	Hannes Reinecke, Hanna Reitz

Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年9月11日周日 13:34写道:
>
> On 2022/09/10 14:27, Sam Li wrote:
> > Putting zoned/non-zoned BlockDrivers on top of each other is not
> > allowed.
> >
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  block.c                          | 14 ++++++++++++++
> >  block/file-posix.c               | 14 ++++++++++++++
> >  block/raw-format.c               |  1 +
> >  include/block/block_int-common.h |  5 +++++
> >  4 files changed, 34 insertions(+)
> >
> > diff --git a/block.c b/block.c
> > index bc85f46eed..dad2ed3959 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
> >          return;
> >      }
> >
> > +    /*
> > +     * Non-zoned block drivers do not follow zoned storage constraints
> > +     * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
> > +     * drivers in a graph.
> > +     */
> > +    if (!parent_bs->drv->supports_zoned_children &&
> > +        child_bs->bl.zoned == BLK_Z_HM) {
>
> Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ?

The host-aware model allows zoned storage constraints(sequentially
write) and random write. Is mixing HA and non-zoned drivers allowed?
What's the difference?

>
> > +        error_setg(errp, "Cannot add a %s child to a %s parent",
> > +                   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
> > +                   parent_bs->drv->supports_zoned_children ?
> > +                   "support zoned children" : "not support zoned children");
> > +        return;
> > +    }
> > +
> >      if (!QLIST_EMPTY(&child_bs->parents)) {
> >          error_setg(errp, "The node %s already has a parent",
> >                     child_bs->node_name);
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 4edfa25d04..354de22860 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >              goto fail;
> >          }
> >      }
> > +#ifdef CONFIG_BLKZONED
> > +    /*
> > +     * The kernel page chache does not reliably work for writes to SWR zones
> > +     * of zoned block device because it can not guarantee the order of writes.
> > +     */
> > +    if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
> > +        if (!(s->open_flags & O_DIRECT)) {
> > +            error_setg(errp, "driver=zoned_host_device was specified, but it "
> > +                             "requires cache.direct=on, which was not specified.");
> > +            ret = -EINVAL;
>
> This line is not needed. Simply "return -EINVAL;".
>
> > +            return ret; /* No host kernel page cache */
> > +        }
> > +    }
> > +#endif
> >
> >      if (S_ISBLK(st.st_mode)) {
> >  #ifdef BLKDISCARDZEROES
> > diff --git a/block/raw-format.c b/block/raw-format.c
> > index 6b20bd22ef..9441536819 100644
> > --- a/block/raw-format.c
> > +++ b/block/raw-format.c
> > @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
> >  BlockDriver bdrv_raw = {
> >      .format_name          = "raw",
> >      .instance_size        = sizeof(BDRVRawState),
> > +    .supports_zoned_children = true,
> >      .bdrv_probe           = &raw_probe,
> >      .bdrv_reopen_prepare  = &raw_reopen_prepare,
> >      .bdrv_reopen_commit   = &raw_reopen_commit,
> > diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> > index 078ddd7e67..043aa161a0 100644
> > --- a/include/block/block_int-common.h
> > +++ b/include/block/block_int-common.h
> > @@ -127,6 +127,11 @@ struct BlockDriver {
> >       */
> >      bool is_format;
> >
> > +    /*
> > +     * Set to true if the BlockDriver supports zoned children.
> > +     */
> > +    bool supports_zoned_children;
> > +
> >      /*
> >       * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
> >       * this field set to true, except ones that are defined only by their
>
> --
> Damien Le Moal
> Western Digital Research
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
  2022-09-11  5:31   ` Damien Le Moal
@ 2022-09-11  7:02   ` Damien Le Moal
  2022-09-16 16:00   ` Stefan Hajnoczi
  2022-09-20  8:51   ` Klaus Jensen
  3 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  7:02 UTC (permalink / raw)
  To: Sam Li, qemu-devel
  Cc: dmitry.fomichev, Markus Armbruster, Eric Blake, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare, Hanna Reitz

On 2022/09/10 14:27, Sam Li wrote:
> Add a new zoned_host_device BlockDriver. The zoned_host_device option
> accepts only zoned host block devices. By adding zone management
> operations in this new BlockDriver, users can use the new block
> layer APIs including Report Zone and four zone management operations
> (open, close, finish, reset).
> 
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
> 
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>  block/block-backend.c             | 145 ++++++++++++++
>  block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
>  block/io.c                        |  41 ++++
>  include/block/block-io.h          |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h           |   6 +-
>  include/sysemu/block-backend-io.h |  17 ++
>  meson.build                       |   1 +
>  qapi/block-core.json              |   8 +-
>  qemu-io-cmds.c                    | 143 +++++++++++++
>  10 files changed, 708 insertions(+), 4 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index d4a5df2ac2..ebe8d7bdf3 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
>      void *iobuf;
>      int ret;
>      BdrvRequestFlags flags;
> +    union {
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            BlockZoneOp op;
> +        } zone_mgmt;
> +    };
>  } BlkRwCo;
>  
>  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> @@ -1775,6 +1784,142 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
>      return ret;
>  }
>  
> +static void blk_aio_zone_report_entry(void *opaque) {
> +    BlkAioEmAIOCB *acb = opaque;
> +    BlkRwCo *rwco = &acb->rwco;
> +
> +    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> +                                   rwco->zone_report.nr_zones,
> +                                   rwco->zone_report.zones);
> +    blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones,
> +                                BlockZoneDescriptor  *zones,
> +                                BlockCompletionFunc *cb, void *opaque)
> +{
> +    BlkAioEmAIOCB *acb;
> +    Coroutine *co;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +    acb->rwco = (BlkRwCo) {
> +            .blk    = blk,
> +            .offset = offset,
> +            .ret    = NOT_DONE,
> +            .zone_report = {
> +                    .zones = zones,
> +                    .nr_zones = nr_zones,
> +            },
> +    };
> +    acb->has_returned = false;
> +
> +    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> +    bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +    acb->has_returned = true;
> +    if (acb->rwco.ret != NOT_DONE) {
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
> +    }
> +
> +    return &acb->common;
> +}
> +
> +static void blk_aio_zone_mgmt_entry(void *opaque) {
> +    BlkAioEmAIOCB *acb = opaque;
> +    BlkRwCo *rwco = &acb->rwco;
> +
> +    rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> +                                 rwco->offset, acb->bytes);
> +    blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque) {
> +    BlkAioEmAIOCB *acb;
> +    Coroutine *co;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +    acb->rwco = (BlkRwCo) {
> +            .blk    = blk,
> +            .offset = offset,
> +            .ret    = NOT_DONE,
> +            .zone_mgmt = {
> +                    .op = op,
> +            },
> +    };
> +    acb->bytes = len;
> +    acb->has_returned = false;
> +
> +    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
> +    bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +    acb->has_returned = true;
> +    if (acb->rwco.ret != NOT_DONE) {
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
> +    }
> +
> +    return &acb->common;
> +}
> +
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk); /* increase before waiting */
> +    blk_wait_while_drained(blk);
> +    if (!blk_is_available(blk)) {
> +        blk_dec_in_flight(blk);
> +        return -ENOMEDIUM;
> +    }
> +    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * op is the zone operation;
> + * offset is the byte offset from the start of the zoned device;
> + * len is the maximum number of bytes the command should operate on. It
> + * should be aligned with the zone sector size.
> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +
> +    blk_inc_in_flight(blk);
> +    blk_wait_while_drained(blk);
> +
> +    ret = blk_check_byte_request(blk, offset, len);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>      BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0a8b4b426e..4edfa25d04 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -67,6 +67,9 @@
>  #include <sys/param.h>
>  #include <sys/syscall.h>
>  #include <sys/vfs.h>
> +#if defined(CONFIG_BLKZONED)
> +#include <linux/blkzoned.h>
> +#endif
>  #include <linux/cdrom.h>
>  #include <linux/fd.h>
>  #include <linux/fs.h>
> @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
>              PreallocMode prealloc;
>              Error **errp;
>          } truncate;
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            unsigned long zone_op;
> +            const char *zone_op_name;
> +            bool all;
> +        } zone_mgmt;
>      };
>  } RawPosixAIOData;
>  
> @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  #endif
>  
>      if (bs->sg || S_ISBLK(st.st_mode)) {
> -        int ret = hdev_get_max_hw_transfer(s->fd, &st);
> +        ret = hdev_get_max_hw_transfer(s->fd, &st);
>  
>          if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
>              bs->bl.max_hw_transfer = ret;
> @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>          zoned = BLK_Z_NONE;
>      }
>      bs->bl.zoned = zoned;
> +    if (zoned != BLK_Z_NONE) {
> +        ret = get_sysfs_long_val(&st, "chunk_sectors");
> +        if (ret > 0) {
> +            bs->bl.zone_sectors = ret;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> +        if (ret > 0) {
> +            bs->bl.max_append_sectors = ret / 512;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_open_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_open_zones = ret;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_active_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_active_zones = ret;
> +        }
> +    }
>  }
>  
>  static int check_for_dasd(int fd)
> @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>  }
>  #endif
>  
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +#if defined(CONFIG_BLKZONED)
> +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> +                              const struct blk_zone *blkz) {
> +    zone->start = blkz->start;
> +    zone->length = blkz->len;
> +    zone->cap = blkz->capacity;

One thing I forgot to mention here: the capacity field was added to Linux with
kernel 5.9. Previous kernels do not have this, so this will not compile with
kernels older than 5.9. Some trickiness is needed here.

You need to conditionally compile this depending on if the BLK_ZONE_REP_CAPACITY
flag is defined or not in /usr/include/linux/blkzoned.h.

See libzbd code as an example of how to do this:

https://github.com/westerndigitalcorporation/libzbd/blob/master/lib/zbd.h#L27

and

https://github.com/westerndigitalcorporation/libzbd/blob/master/lib/zbd.c#L495

The HAVE_BLK_ZONE_REP_V2 macro comes from a config time test that you need to
add to meson build, similarly to CONFIG_BLKZONED.


> +    zone->wp = blkz->wp;
> +
> +    switch (blkz->type) {
> +    case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +        zone->type = BLK_ZT_SWR;
> +        break;
> +    case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +        zone->type = BLK_ZT_SWP;
> +        break;
> +    case BLK_ZONE_TYPE_CONVENTIONAL:
> +        zone->type = BLK_ZT_CONV;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    switch (blkz->cond) {
> +    case BLK_ZONE_COND_NOT_WP:
> +        zone->cond = BLK_ZS_NOT_WP;
> +        break;
> +    case BLK_ZONE_COND_EMPTY:
> +        zone->cond = BLK_ZS_EMPTY;
> +        break;
> +    case BLK_ZONE_COND_IMP_OPEN:
> +        zone->cond =BLK_ZS_IOPEN;
> +        break;
> +    case BLK_ZONE_COND_EXP_OPEN:
> +        zone->cond = BLK_ZS_EOPEN;
> +        break;
> +    case BLK_ZONE_COND_CLOSED:
> +        zone->cond = BLK_ZS_CLOSED;
> +        break;
> +    case BLK_ZONE_COND_READONLY:
> +        zone->cond = BLK_ZS_RDONLY;
> +        break;
> +    case BLK_ZONE_COND_FULL:
> +        zone->cond = BLK_ZS_FULL;
> +        break;
> +    case BLK_ZONE_COND_OFFLINE:
> +        zone->cond = BLK_ZS_OFFLINE;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +#endif
> +
> +#if defined(CONFIG_BLKZONED)
> +static int do_zone_report(int64_t sector, int fd,
> +                          struct BlockZoneDescriptor *zones,
> +                          unsigned int nrz) {
> +    struct blk_zone *blkz;
> +    int ret, n = 0, i = 0;
> +
> +    int64_t rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> +    g_autofree struct blk_zone_report *rep = NULL;
> +    rep = g_malloc(rep_size);
> +
> +    blkz = (struct blk_zone *)(rep + 1);
> +    while (n < nrz) {
> +        memset(rep, 0, rep_size);
> +        rep->sector = sector;
> +        rep->nr_zones = nrz - n;
> +
> +        do {
> +            ret = ioctl(fd, BLKREPORTZONE, rep);
> +        } while (ret != 0 && errno == EINTR);
> +        if (ret != 0) {
> +            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> +                    fd, sector, errno);
> +            return -errno;
> +        }
> +
> +        if (!rep->nr_zones) {
> +            break;
> +        }
> +
> +        for (i = 0; i < rep->nr_zones; i++, n++) {
> +            parse_zone(&zones[n], &blkz[i]);
> +            /* The next report should start after the last zone reported */
> +            sector = blkz[i].start + blkz[i].len;
> +        }
> +    }
> +    return n;
> +}
> +#endif
> +
> +static int handle_aiocb_zone_report(void *opaque) {
> +#if defined(CONFIG_BLKZONED)
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
> +    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> +    /* zoned block devices use 512-byte sectors */
> +    int64_t sector = aiocb->aio_offset / 512;
> +
> +    *nr_zones = do_zone_report(sector, fd, zones, *nr_zones);
> +    return 0;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +static int handle_aiocb_zone_mgmt(void *opaque) {
> +#if defined(CONFIG_BLKZONED)
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    int64_t sector = aiocb->aio_offset / 512;
> +    int64_t nr_sectors = aiocb->aio_nbytes / 512;
> +    struct blk_zone_range range;
> +    int ret;
> +
> +    /* Execute the operation */
> +    range.sector = sector;
> +    range.nr_sectors = nr_sectors;
> +    do {
> +        ret = ioctl(fd, aiocb->zone_mgmt.zone_op, &range);
> +    } while (ret != 0 && errno == EINTR);
> +
> +    if (ret != 0) {
> +        error_report("ioctl %s failed %d", aiocb->zone_mgmt.zone_op_name,
> +                     errno);
> +        return -errno;
> +    }
> +    return ret;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static int handle_aiocb_copy_range(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -3022,6 +3194,104 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
>      }
>  }
>  
> +/*
> + * zone report - Get a zone block device's information in the form
> + * of an array of zone descriptors.
> + * zones is an array of zone descriptors to hold zone information on reply;
> + * offset can be any byte within the entire size of the device;
> + * nr_zones is the maxium number of sectors the command should operate on.
> + */
> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                           unsigned int *nr_zones,
> +                                           BlockZoneDescriptor *zones) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +
> +    acb = (RawPosixAIOData) {
> +        .bs         = bs,
> +        .aio_fildes = s->fd,
> +        .aio_type   = QEMU_AIO_ZONE_REPORT,
> +        .aio_offset = offset,
> +        .zone_report    = {
> +                .nr_zones       = nr_zones,
> +                .zones          = zones,
> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +/*
> + * zone management operations - Execute an operation on a zone
> + */
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +    int64_t zone_sector, zone_sector_mask;
> +    const char *zone_op_name;
> +    unsigned long zone_op;
> +    bool is_all = false;
> +
> +    zone_sector = bs->bl.zone_sectors;
> +    zone_sector_mask = zone_sector - 1;
> +    if (offset & zone_sector_mask) {
> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> +                     "%" PRId64 "", offset, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    if (len & zone_sector_mask) {
> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> +                      " %" PRId64 "", len, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    switch (op) {
> +    case BLK_ZO_OPEN:
> +        zone_op_name = "BLKOPENZONE";
> +        zone_op = BLKOPENZONE;
> +        break;
> +    case BLK_ZO_CLOSE:
> +        zone_op_name = "BLKCLOSEZONE";
> +        zone_op = BLKCLOSEZONE;
> +        break;
> +    case BLK_ZO_FINISH:
> +        zone_op_name = "BLKFINISHZONE";
> +        zone_op = BLKFINISHZONE;
> +        break;
> +    case BLK_ZO_RESET:
> +        zone_op_name = "BLKRESETZONE";
> +        zone_op = BLKRESETZONE;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    acb = (RawPosixAIOData) {
> +        .bs             = bs,
> +        .aio_fildes     = s->fd,
> +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> +        .aio_offset     = offset,
> +        .aio_nbytes     = len,
> +        .zone_mgmt  = {
> +                .zone_op = zone_op,
> +                .zone_op_name = zone_op_name,
> +                .all = is_all,
> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static coroutine_fn int
>  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
>                  bool blkdev)
> @@ -3752,6 +4022,54 @@ static BlockDriver bdrv_host_device = {
>  #endif
>  };
>  
> +#if defined(CONFIG_BLKZONED)
> +static BlockDriver bdrv_zoned_host_device = {
> +    .format_name = "zoned_host_device",
> +    .protocol_name = "zoned_host_device",
> +    .instance_size = sizeof(BDRVRawState),
> +    .bdrv_needs_filename = true,
> +    .bdrv_probe_device  = hdev_probe_device,
> +    .bdrv_file_open     = hdev_open,
> +    .bdrv_close         = raw_close,
> +    .bdrv_reopen_prepare = raw_reopen_prepare,
> +    .bdrv_reopen_commit  = raw_reopen_commit,
> +    .bdrv_reopen_abort   = raw_reopen_abort,
> +    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
> +    .create_opts         = &bdrv_create_opts_simple,
> +    .mutable_opts        = mutable_opts,
> +    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
> +    .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
> +
> +    .bdrv_co_preadv         = raw_co_preadv,
> +    .bdrv_co_pwritev        = raw_co_pwritev,
> +    .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> +    .bdrv_co_pdiscard       = hdev_co_pdiscard,
> +    .bdrv_co_copy_range_from = raw_co_copy_range_from,
> +    .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> +    .bdrv_refresh_limits = raw_refresh_limits,
> +    .bdrv_io_plug = raw_aio_plug,
> +    .bdrv_io_unplug = raw_aio_unplug,
> +    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> +
> +    .bdrv_co_truncate       = raw_co_truncate,
> +    .bdrv_getlength = raw_getlength,
> +    .bdrv_get_info = raw_get_info,
> +    .bdrv_get_allocated_file_size
> +                        = raw_get_allocated_file_size,
> +    .bdrv_get_specific_stats = hdev_get_specific_stats,
> +    .bdrv_check_perm = raw_check_perm,
> +    .bdrv_set_perm   = raw_set_perm,
> +    .bdrv_abort_perm_update = raw_abort_perm_update,
> +    .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> +    .bdrv_probe_geometry = hdev_probe_geometry,
> +    .bdrv_co_ioctl = hdev_co_ioctl,
> +
> +    /* zone management operations */
> +    .bdrv_co_zone_report = raw_co_zone_report,
> +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> +};
> +#endif
> +
>  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
>  static void cdrom_parse_filename(const char *filename, QDict *options,
>                                   Error **errp)
> @@ -4012,6 +4330,9 @@ static void bdrv_file_init(void)
>      bdrv_register(&bdrv_file);
>  #if defined(HAVE_HOST_BLOCK_DEVICE)
>      bdrv_register(&bdrv_host_device);
> +#if defined(CONFIG_BLKZONED)
> +    bdrv_register(&bdrv_zoned_host_device);
> +#endif
>  #ifdef __linux__
>      bdrv_register(&bdrv_host_cdrom);
>  #endif
> diff --git a/block/io.c b/block/io.c
> index 0a8cbefe86..de9ec1d740 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3198,6 +3198,47 @@ out:
>      return co.ret;
>  }
>  
> +int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                        unsigned int *nr_zones,
> +                        BlockZoneDescriptor *zones)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_report) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
> +int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_mgmt) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>  {
>      IO_CODE();
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index fd25ffa9be..65463b88d9 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -88,6 +88,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
>  /* Ensure contents are flushed to disk.  */
>  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
>  
> +/* Report zone information of zone block device. */
> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                     unsigned int *nr_zones,
> +                                     BlockZoneDescriptor *zones);
> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +                                   int64_t offset, int64_t len);
> +
>  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
>  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
>  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 7f7863cc9e..078ddd7e67 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -691,6 +691,12 @@ struct BlockDriver {
>                                            QEMUIOVector *qiov,
>                                            int64_t pos);
>  
> +    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
> +            int64_t offset, unsigned int *nr_zones,
> +            BlockZoneDescriptor *zones);
> +    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
> +            int64_t offset, int64_t len);
> +
>      /* removable device specific */
>      bool (*bdrv_is_inserted)(BlockDriverState *bs);
>      void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> @@ -828,6 +834,21 @@ typedef struct BlockLimits {
>  
>      /* device zone model */
>      BlockZoneModel zoned;
> +
> +    /* zone size expressed in 512-byte sectors */
> +    uint32_t zone_sectors;
> +
> +    /* total number of zones */
> +    unsigned int nr_zones;
> +
> +    /* maximum sectors of a zone append write operation */
> +    int64_t max_append_sectors;
> +
> +    /* maximum number of open zones */
> +    int64_t max_open_zones;
> +
> +    /* maximum number of active zones */
> +    int64_t max_active_zones;
>  } BlockLimits;
>  
>  typedef struct BdrvOpBlocker BdrvOpBlocker;
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 21fc10c4c9..3d26929cdd 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -29,6 +29,8 @@
>  #define QEMU_AIO_WRITE_ZEROES 0x0020
>  #define QEMU_AIO_COPY_RANGE   0x0040
>  #define QEMU_AIO_TRUNCATE     0x0080
> +#define QEMU_AIO_ZONE_REPORT  0x0100
> +#define QEMU_AIO_ZONE_MGMT    0x0200
>  #define QEMU_AIO_TYPE_MASK \
>          (QEMU_AIO_READ | \
>           QEMU_AIO_WRITE | \
> @@ -37,7 +39,9 @@
>           QEMU_AIO_DISCARD | \
>           QEMU_AIO_WRITE_ZEROES | \
>           QEMU_AIO_COPY_RANGE | \
> -         QEMU_AIO_TRUNCATE)
> +         QEMU_AIO_TRUNCATE  | \
> +         QEMU_AIO_ZONE_REPORT | \
> +         QEMU_AIO_ZONE_MGMT)
>  
>  /* AIO flags */
>  #define QEMU_AIO_MISALIGNED   0x1000
> diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> index 50f5aa2e07..6835525582 100644
> --- a/include/sysemu/block-backend-io.h
> +++ b/include/sysemu/block-backend-io.h
> @@ -45,6 +45,12 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
>                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
>                            BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones, BlockZoneDescriptor *zones,
> +                                BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
>                               BlockCompletionFunc *cb, void *opaque);
>  void blk_aio_cancel_async(BlockAIOCB *acb);
> @@ -156,6 +162,17 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>  int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes, BdrvRequestFlags flags);
>  
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones);
> +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
> +                                         unsigned int *nr_zones,
> +                                         BlockZoneDescriptor *zones);
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                  int64_t offset, int64_t len);
> +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                       int64_t offset, int64_t len);
> +
>  int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes);
>  int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> diff --git a/meson.build b/meson.build
> index 20fddbd707..2f436bb355 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1883,6 +1883,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('live_block_migration').al
>  # has_header
>  config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
>  config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
> +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
>  config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
>  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
>  config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 2173e7734a..c6bbb7a037 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2942,6 +2942,7 @@
>  # @compress: Since 5.0
>  # @copy-before-write: Since 6.2
>  # @snapshot-access: Since 7.0
> +# @zoned_host_device: Since 7.2
>  #
>  # Since: 2.9
>  ##
> @@ -2955,7 +2956,8 @@
>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>              'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>              { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
> -            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> +            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat',
> +            { 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] }
>  
>  ##
>  # @BlockdevOptionsFile:
> @@ -4329,7 +4331,9 @@
>        'vhdx':       'BlockdevOptionsGenericFormat',
>        'vmdk':       'BlockdevOptionsGenericCOWFormat',
>        'vpc':        'BlockdevOptionsGenericFormat',
> -      'vvfat':      'BlockdevOptionsVVFAT'
> +      'vvfat':      'BlockdevOptionsVVFAT',
> +      'zoned_host_device': { 'type': 'BlockdevOptionsFile',
> +                             'if': 'CONFIG_BLKZONED' }
>    } }
>  
>  ##
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 952dc940f1..446a059603 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = {
>      .oneline    = "flush all in-core file state to disk",
>  };
>  
> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset;
> +    unsigned int nr_zones;
> +
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    nr_zones = cvtnum(argv[optind]);
> +
> +    g_autofree BlockZoneDescriptor *zones = NULL;
> +    zones = g_new(BlockZoneDescriptor, nr_zones);
> +    ret = blk_zone_report(blk, offset, &nr_zones, zones);
> +    if (ret < 0) {
> +        printf("zone report failed: %s\n", strerror(-ret));
> +    } else {
> +        for (int i = 0; i < nr_zones; ++i) {
> +            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
> +                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
> +                   "zcond:%u, [type: %u]\n",
> +                   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> +                   zones[i].cond, zones[i].type);
> +        }
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_report_cmd = {
> +        .name = "zone_report",
> +        .altname = "zrp",
> +        .cfunc = zone_report_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset number",
> +        .oneline = "report zone information",
> +};
> +
> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
> +    if (ret < 0) {
> +        printf("zone open failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_open_cmd = {
> +        .name = "zone_open",
> +        .altname = "zo",
> +        .cfunc = zone_open_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "explicit open a range of zones in zone block device",
> +};
> +
> +static int zone_close_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
> +    if (ret < 0) {
> +        printf("zone close failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_close_cmd = {
> +        .name = "zone_close",
> +        .altname = "zc",
> +        .cfunc = zone_close_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "close a range of zones in zone block device",
> +};
> +
> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
> +    if (ret < 0) {
> +        printf("zone finish failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_finish_cmd = {
> +        .name = "zone_finish",
> +        .altname = "zf",
> +        .cfunc = zone_finish_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "finish a range of zones in zone block device",
> +};
> +
> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
> +    if (ret < 0) {
> +        printf("zone reset failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_reset_cmd = {
> +        .name = "zone_reset",
> +        .altname = "zrs",
> +        .cfunc = zone_reset_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "reset a zone write pointer in zone block device",
> +};
> +
>  static int truncate_f(BlockBackend *blk, int argc, char **argv);
>  static const cmdinfo_t truncate_cmd = {
>      .name       = "truncate",
> @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
>      qemuio_add_command(&aio_write_cmd);
>      qemuio_add_command(&aio_flush_cmd);
>      qemuio_add_command(&flush_cmd);
> +    qemuio_add_command(&zone_report_cmd);
> +    qemuio_add_command(&zone_open_cmd);
> +    qemuio_add_command(&zone_close_cmd);
> +    qemuio_add_command(&zone_finish_cmd);
> +    qemuio_add_command(&zone_reset_cmd);
>      qemuio_add_command(&truncate_cmd);
>      qemuio_add_command(&length_cmd);
>      qemuio_add_command(&info_cmd);

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 5/7] config: add check to block layer
  2022-09-11  6:54     ` Sam Li
@ 2022-09-11  7:05       ` Damien Le Moal
  0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2022-09-11  7:05 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, Eric Blake,
	qemu block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	Hannes Reinecke, Hanna Reitz

On 2022/09/11 15:54, Sam Li wrote:
> Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年9月11日周日 13:34写道:
>>
>> On 2022/09/10 14:27, Sam Li wrote:
>>> Putting zoned/non-zoned BlockDrivers on top of each other is not
>>> allowed.
>>>
>>> Signed-off-by: Sam Li <faithilikerun@gmail.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>  block.c                          | 14 ++++++++++++++
>>>  block/file-posix.c               | 14 ++++++++++++++
>>>  block/raw-format.c               |  1 +
>>>  include/block/block_int-common.h |  5 +++++
>>>  4 files changed, 34 insertions(+)
>>>
>>> diff --git a/block.c b/block.c
>>> index bc85f46eed..dad2ed3959 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
>>>          return;
>>>      }
>>>
>>> +    /*
>>> +     * Non-zoned block drivers do not follow zoned storage constraints
>>> +     * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
>>> +     * drivers in a graph.
>>> +     */
>>> +    if (!parent_bs->drv->supports_zoned_children &&
>>> +        child_bs->bl.zoned == BLK_Z_HM) {
>>
>> Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ?
> 
> The host-aware model allows zoned storage constraints(sequentially
> write) and random write. Is mixing HA and non-zoned drivers allowed?
> What's the difference?

Yes, HA devices can be used as regular devices too. If you are allowing this
here, then add a comment explaining it. It may also be good to add a message
like "Using host-aware device as a regular device" here for the HA case.
> 
>>
>>> +        error_setg(errp, "Cannot add a %s child to a %s parent",
>>> +                   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
>>> +                   parent_bs->drv->supports_zoned_children ?
>>> +                   "support zoned children" : "not support zoned children");
>>> +        return;
>>> +    }
>>> +
>>>      if (!QLIST_EMPTY(&child_bs->parents)) {
>>>          error_setg(errp, "The node %s already has a parent",
>>>                     child_bs->node_name);
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 4edfa25d04..354de22860 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>>              goto fail;
>>>          }
>>>      }
>>> +#ifdef CONFIG_BLKZONED
>>> +    /*
>>> +     * The kernel page chache does not reliably work for writes to SWR zones
>>> +     * of zoned block device because it can not guarantee the order of writes.
>>> +     */
>>> +    if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
>>> +        if (!(s->open_flags & O_DIRECT)) {
>>> +            error_setg(errp, "driver=zoned_host_device was specified, but it "
>>> +                             "requires cache.direct=on, which was not specified.");
>>> +            ret = -EINVAL;
>>
>> This line is not needed. Simply "return -EINVAL;".
>>
>>> +            return ret; /* No host kernel page cache */
>>> +        }
>>> +    }
>>> +#endif
>>>
>>>      if (S_ISBLK(st.st_mode)) {
>>>  #ifdef BLKDISCARDZEROES
>>> diff --git a/block/raw-format.c b/block/raw-format.c
>>> index 6b20bd22ef..9441536819 100644
>>> --- a/block/raw-format.c
>>> +++ b/block/raw-format.c
>>> @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
>>>  BlockDriver bdrv_raw = {
>>>      .format_name          = "raw",
>>>      .instance_size        = sizeof(BDRVRawState),
>>> +    .supports_zoned_children = true,
>>>      .bdrv_probe           = &raw_probe,
>>>      .bdrv_reopen_prepare  = &raw_reopen_prepare,
>>>      .bdrv_reopen_commit   = &raw_reopen_commit,
>>> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
>>> index 078ddd7e67..043aa161a0 100644
>>> --- a/include/block/block_int-common.h
>>> +++ b/include/block/block_int-common.h
>>> @@ -127,6 +127,11 @@ struct BlockDriver {
>>>       */
>>>      bool is_format;
>>>
>>> +    /*
>>> +     * Set to true if the BlockDriver supports zoned children.
>>> +     */
>>> +    bool supports_zoned_children;
>>> +
>>>      /*
>>>       * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
>>>       * this field set to true, except ones that are defined only by their
>>
>> --
>> Damien Le Moal
>> Western Digital Research
>>

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-11  6:48       ` Damien Le Moal
@ 2022-09-11  7:30         ` Sam Li
  0 siblings, 0 replies; 31+ messages in thread
From: Sam Li @ 2022-09-11  7:30 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, Eric Blake,
	qemu block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	Hannes Reinecke, Hanna Reitz

Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年9月11日周日 14:48写道:
>
> On 2022/09/11 15:33, Sam Li wrote:
> > Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年9月11日周日 13:31写道:
> [...]
> >>> +/*
> >>> + * zone management operations - Execute an operation on a zone
> >>> + */
> >>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> >>> +        int64_t offset, int64_t len) {
> >>> +#if defined(CONFIG_BLKZONED)
> >>> +    BDRVRawState *s = bs->opaque;
> >>> +    RawPosixAIOData acb;
> >>> +    int64_t zone_sector, zone_sector_mask;
> >>> +    const char *zone_op_name;
> >>> +    unsigned long zone_op;
> >>> +    bool is_all = false;
> >>> +
> >>> +    zone_sector = bs->bl.zone_sectors;
> >>> +    zone_sector_mask = zone_sector - 1;
> >>> +    if (offset & zone_sector_mask) {
> >>> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> >>> +                     "%" PRId64 "", offset, zone_sector);
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    if (len & zone_sector_mask) {
> >>
> >> Linux allows SMR drives to have a smaller last zone. So this needs to be
> >> accounted for here. Otherwise, a zone operation that includes the last smaller
> >> zone would always fail. Something like this would work:
> >>
> >>         if (((offset + len) < capacity &&
> >>             len & zone_sector_mask) ||
> >>             offset + len > capacity) {
> >>
> >
> > I see. I think the offset can be removed, like:
> > if (((len < capacity && len & zone_sector_mask) || len > capacity) {
> > Then if we use the previous zone's len for the last smaller zone, it
> > will be greater than its capacity.
>
> Nope, you cannot remove the offset since the zone operation may be for that last
> zone only, that is, offset == last zone start and len == last zone smaller size.
> In that case, len is alwats smaller than capacity.

Ok, I was mixing opening one zone with opening several zones.

>
> >
> > I will also include "opening the last zone" as a test case later.
>
> Note that you can create such smaller last zone on the host with null_blk by
> specifying a device capacity that is *not* a multiple of the zone size.
>
> >
> >>> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> >>> +                      " %" PRId64 "", len, zone_sector);
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    switch (op) {
> >>> +    case BLK_ZO_OPEN:
> >>> +        zone_op_name = "BLKOPENZONE";
> >>> +        zone_op = BLKOPENZONE;
> >>> +        break;
> >>> +    case BLK_ZO_CLOSE:
> >>> +        zone_op_name = "BLKCLOSEZONE";
> >>> +        zone_op = BLKCLOSEZONE;
> >>> +        break;
> >>> +    case BLK_ZO_FINISH:
> >>> +        zone_op_name = "BLKFINISHZONE";
> >>> +        zone_op = BLKFINISHZONE;
> >>> +        break;
> >>> +    case BLK_ZO_RESET:
> >>> +        zone_op_name = "BLKRESETZONE";
> >>> +        zone_op = BLKRESETZONE;
> >>> +        break;
> >>> +    default:
> >>> +        g_assert_not_reached();
> >>> +    }
> >>> +
> >>> +    acb = (RawPosixAIOData) {
> >>> +        .bs             = bs,
> >>> +        .aio_fildes     = s->fd,
> >>> +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> >>> +        .aio_offset     = offset,
> >>> +        .aio_nbytes     = len,
> >>> +        .zone_mgmt  = {
> >>> +                .zone_op = zone_op,
> >>> +                .zone_op_name = zone_op_name,
> >>> +                .all = is_all,
> >>> +        },
> >>> +    };
> >>> +
> >>> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> >>> +#else
> >>> +    return -ENOTSUP;
> >>> +#endif
> >>> +}
>
> --
> Damien Le Moal
> Western Digital Research
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 1/7] include: add zoned device structs
  2022-09-10  5:27 ` [PATCH v9 1/7] include: add zoned device structs Sam Li
@ 2022-09-15  8:05   ` Eric Blake
  2022-09-15 10:06     ` Sam Li
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Blake @ 2022-09-15  8:05 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, dmitry.fomichev, Markus Armbruster, qemu-block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz

On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote:
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> ---
>  include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/include/block/block-common.h b/include/block/block-common.h
> index fdb7306e78..36bd0e480e 100644
> --- a/include/block/block-common.h
> +++ b/include/block/block-common.h
> @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
>  typedef struct BdrvChild BdrvChild;
>  typedef struct BdrvChildClass BdrvChildClass;
>  
> +typedef enum BlockZoneOp {
> +    BLK_ZO_OPEN,
> +    BLK_ZO_CLOSE,
> +    BLK_ZO_FINISH,
> +    BLK_ZO_RESET,
> +} BlockZoneOp;
> +
> +typedef enum BlockZoneModel {
> +    BLK_Z_NONE = 0x0, /* Regular block device */
> +    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
> +    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
> +} BlockZoneModel;
> +
> +typedef enum BlockZoneCondition {
> +    BLK_ZS_NOT_WP = 0x0,
> +    BLK_ZS_EMPTY = 0x1,
> +    BLK_ZS_IOPEN = 0x2,
> +    BLK_ZS_EOPEN = 0x3,
> +    BLK_ZS_CLOSED = 0x4,
> +    BLK_ZS_RDONLY = 0xD,
> +    BLK_ZS_FULL = 0xE,
> +    BLK_ZS_OFFLINE = 0xF,
> +} BlockZoneCondition;
> +
> +typedef enum BlockZoneType {
> +    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
> +    BLK_ZT_SWR = 0x2, /* Sequential writes required */
> +    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
> +} BlockZoneType;
> +
> +/*
> + * Zone descriptor data structure.
> + * Provides information on a zone with all position and size values in bytes.

I'm glad that you chose bytes here for use in qemu.  But since the
kernel struct blk_zone uses sectors instead of bytes, is it worth
adding a sentence that we intentionally use bytes here, different from
Linux, to make it easier for reviewers to realize that scaling when
translating between qemu and kernel is necessary?

> + */
> +typedef struct BlockZoneDescriptor {
> +    uint64_t start;
> +    uint64_t length;
> +    uint64_t cap;
> +    uint64_t wp;
> +    BlockZoneType type;
> +    BlockZoneCondition cond;
> +} BlockZoneDescriptor;
> +
>  typedef struct BlockDriverInfo {
>      /* in bytes, 0 if irrelevant */
>      int cluster_size;
> -- 
> 2.37.3
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 1/7] include: add zoned device structs
  2022-09-15  8:05   ` Eric Blake
@ 2022-09-15 10:06     ` Sam Li
  2022-09-16 15:16       ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-15 10:06 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, qemu block,
	Stefan Hajnoczi, Kevin Wolf, Fam Zheng, Damien Le Moal,
	Hannes Reinecke, Hanna Reitz

Eric Blake <eblake@redhat.com> 于2022年9月15日周四 16:05写道:
>
> On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote:
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > ---
> >  include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 43 insertions(+)
> >
> > diff --git a/include/block/block-common.h b/include/block/block-common.h
> > index fdb7306e78..36bd0e480e 100644
> > --- a/include/block/block-common.h
> > +++ b/include/block/block-common.h
> > @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
> >  typedef struct BdrvChild BdrvChild;
> >  typedef struct BdrvChildClass BdrvChildClass;
> >
> > +typedef enum BlockZoneOp {
> > +    BLK_ZO_OPEN,
> > +    BLK_ZO_CLOSE,
> > +    BLK_ZO_FINISH,
> > +    BLK_ZO_RESET,
> > +} BlockZoneOp;
> > +
> > +typedef enum BlockZoneModel {
> > +    BLK_Z_NONE = 0x0, /* Regular block device */
> > +    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
> > +    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
> > +} BlockZoneModel;
> > +
> > +typedef enum BlockZoneCondition {
> > +    BLK_ZS_NOT_WP = 0x0,
> > +    BLK_ZS_EMPTY = 0x1,
> > +    BLK_ZS_IOPEN = 0x2,
> > +    BLK_ZS_EOPEN = 0x3,
> > +    BLK_ZS_CLOSED = 0x4,
> > +    BLK_ZS_RDONLY = 0xD,
> > +    BLK_ZS_FULL = 0xE,
> > +    BLK_ZS_OFFLINE = 0xF,
> > +} BlockZoneCondition;
> > +
> > +typedef enum BlockZoneType {
> > +    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
> > +    BLK_ZT_SWR = 0x2, /* Sequential writes required */
> > +    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
> > +} BlockZoneType;
> > +
> > +/*
> > + * Zone descriptor data structure.
> > + * Provides information on a zone with all position and size values in bytes.
>
> I'm glad that you chose bytes here for use in qemu.  But since the
> kernel struct blk_zone uses sectors instead of bytes, is it worth
> adding a sentence that we intentionally use bytes here, different from
> Linux, to make it easier for reviewers to realize that scaling when
> translating between qemu and kernel is necessary?

Sorry about the unit mistake. The zone information is in sectors which
is the same as kernel struct blk_zone. I think adding a sentence to
inform the sector unit makes it clear what the zone descriptor is.

>
> > + */
> > +typedef struct BlockZoneDescriptor {
> > +    uint64_t start;
> > +    uint64_t length;
> > +    uint64_t cap;
> > +    uint64_t wp;
> > +    BlockZoneType type;
> > +    BlockZoneCondition cond;
> > +} BlockZoneDescriptor;
> > +
> >  typedef struct BlockDriverInfo {
> >      /* in bytes, 0 if irrelevant */
> >      int cluster_size;
> > --
> > 2.37.3
> >
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 1/7] include: add zoned device structs
  2022-09-15 10:06     ` Sam Li
@ 2022-09-16 15:16       ` Stefan Hajnoczi
  2022-09-19  0:50         ` Sam Li
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2022-09-16 15:16 UTC (permalink / raw)
  To: Sam Li
  Cc: Eric Blake, qemu-devel, Dmitry Fomichev, Markus Armbruster,
	qemu block, Kevin Wolf, Fam Zheng, Damien Le Moal,
	Hannes Reinecke, Hanna Reitz

[-- Attachment #1: Type: text/plain, Size: 2932 bytes --]

On Thu, Sep 15, 2022 at 06:06:38PM +0800, Sam Li wrote:
> Eric Blake <eblake@redhat.com> 于2022年9月15日周四 16:05写道:
> >
> > On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote:
> > > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > ---
> > >  include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 43 insertions(+)
> > >
> > > diff --git a/include/block/block-common.h b/include/block/block-common.h
> > > index fdb7306e78..36bd0e480e 100644
> > > --- a/include/block/block-common.h
> > > +++ b/include/block/block-common.h
> > > @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
> > >  typedef struct BdrvChild BdrvChild;
> > >  typedef struct BdrvChildClass BdrvChildClass;
> > >
> > > +typedef enum BlockZoneOp {
> > > +    BLK_ZO_OPEN,
> > > +    BLK_ZO_CLOSE,
> > > +    BLK_ZO_FINISH,
> > > +    BLK_ZO_RESET,
> > > +} BlockZoneOp;
> > > +
> > > +typedef enum BlockZoneModel {
> > > +    BLK_Z_NONE = 0x0, /* Regular block device */
> > > +    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
> > > +    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
> > > +} BlockZoneModel;
> > > +
> > > +typedef enum BlockZoneCondition {
> > > +    BLK_ZS_NOT_WP = 0x0,
> > > +    BLK_ZS_EMPTY = 0x1,
> > > +    BLK_ZS_IOPEN = 0x2,
> > > +    BLK_ZS_EOPEN = 0x3,
> > > +    BLK_ZS_CLOSED = 0x4,
> > > +    BLK_ZS_RDONLY = 0xD,
> > > +    BLK_ZS_FULL = 0xE,
> > > +    BLK_ZS_OFFLINE = 0xF,
> > > +} BlockZoneCondition;
> > > +
> > > +typedef enum BlockZoneType {
> > > +    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
> > > +    BLK_ZT_SWR = 0x2, /* Sequential writes required */
> > > +    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
> > > +} BlockZoneType;
> > > +
> > > +/*
> > > + * Zone descriptor data structure.
> > > + * Provides information on a zone with all position and size values in bytes.
> >
> > I'm glad that you chose bytes here for use in qemu.  But since the
> > kernel struct blk_zone uses sectors instead of bytes, is it worth
> > adding a sentence that we intentionally use bytes here, different from
> > Linux, to make it easier for reviewers to realize that scaling when
> > translating between qemu and kernel is necessary?
> 
> Sorry about the unit mistake. The zone information is in sectors which
> is the same as kernel struct blk_zone. I think adding a sentence to
> inform the sector unit makes it clear what the zone descriptor is.

I'd make the units bytes for consistency with the rest of the QEMU block
layer. For example, the MapEntry structure that "qemu-img map" reports
has names with similar fields and they are in bytes:

  struct MapEntry {
      int64_t start;
      int64_t length;

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 5/7] config: add check to block layer
  2022-09-10  5:27 ` [PATCH v9 5/7] config: add check to block layer Sam Li
  2022-09-11  5:34   ` Damien Le Moal
@ 2022-09-16 15:22   ` Stefan Hajnoczi
  1 sibling, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2022-09-16 15:22 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, dmitry.fomichev, Markus Armbruster, Eric Blake,
	qemu-block, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz

[-- Attachment #1: Type: text/plain, Size: 472 bytes --]

On Sat, Sep 10, 2022 at 01:27:57PM +0800, Sam Li wrote:
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 4edfa25d04..354de22860 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>              goto fail;
>          }
>      }
> +#ifdef CONFIG_BLKZONED
> +    /*
> +     * The kernel page chache does not reliably work for writes to SWR zones

s/chache/cache/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
  2022-09-11  5:31   ` Damien Le Moal
  2022-09-11  7:02   ` Damien Le Moal
@ 2022-09-16 16:00   ` Stefan Hajnoczi
  2022-09-20  8:51   ` Klaus Jensen
  3 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2022-09-16 16:00 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, dmitry.fomichev, Markus Armbruster, Eric Blake,
	qemu-block, Kevin Wolf, Fam Zheng, damien.lemoal, hare,
	Hanna Reitz

[-- Attachment #1: Type: text/plain, Size: 34377 bytes --]

On Sat, Sep 10, 2022 at 01:27:55PM +0800, Sam Li wrote:
> Add a new zoned_host_device BlockDriver. The zoned_host_device option
> accepts only zoned host block devices. By adding zone management
> operations in this new BlockDriver, users can use the new block
> layer APIs including Report Zone and four zone management operations
> (open, close, finish, reset).
> 
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
> 
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>  block/block-backend.c             | 145 ++++++++++++++
>  block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
>  block/io.c                        |  41 ++++
>  include/block/block-io.h          |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h           |   6 +-
>  include/sysemu/block-backend-io.h |  17 ++
>  meson.build                       |   1 +
>  qapi/block-core.json              |   8 +-
>  qemu-io-cmds.c                    | 143 +++++++++++++
>  10 files changed, 708 insertions(+), 4 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index d4a5df2ac2..ebe8d7bdf3 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
>      void *iobuf;
>      int ret;
>      BdrvRequestFlags flags;
> +    union {
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            BlockZoneOp op;
> +        } zone_mgmt;
> +    };
>  } BlkRwCo;
>  
>  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> @@ -1775,6 +1784,142 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
>      return ret;
>  }
>  
> +static void blk_aio_zone_report_entry(void *opaque) {
> +    BlkAioEmAIOCB *acb = opaque;
> +    BlkRwCo *rwco = &acb->rwco;
> +
> +    rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> +                                   rwco->zone_report.nr_zones,
> +                                   rwco->zone_report.zones);
> +    blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones,
> +                                BlockZoneDescriptor  *zones,
> +                                BlockCompletionFunc *cb, void *opaque)
> +{
> +    BlkAioEmAIOCB *acb;
> +    Coroutine *co;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +    acb->rwco = (BlkRwCo) {
> +            .blk    = blk,
> +            .offset = offset,
> +            .ret    = NOT_DONE,
> +            .zone_report = {
> +                    .zones = zones,
> +                    .nr_zones = nr_zones,

Indentation is off here. QEMU uses 4-space indentation.

> +            },
> +    };
> +    acb->has_returned = false;
> +
> +    co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> +    bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +    acb->has_returned = true;
> +    if (acb->rwco.ret != NOT_DONE) {
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
> +    }
> +
> +    return &acb->common;
> +}
> +
> +static void blk_aio_zone_mgmt_entry(void *opaque) {
> +    BlkAioEmAIOCB *acb = opaque;
> +    BlkRwCo *rwco = &acb->rwco;
> +
> +    rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> +                                 rwco->offset, acb->bytes);
> +    blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque) {
> +    BlkAioEmAIOCB *acb;
> +    Coroutine *co;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk);
> +    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +    acb->rwco = (BlkRwCo) {
> +            .blk    = blk,
> +            .offset = offset,
> +            .ret    = NOT_DONE,
> +            .zone_mgmt = {
> +                    .op = op,

Indentation is off here. QEMU uses 4-space indentation.

> +            },
> +    };
> +    acb->bytes = len;
> +    acb->has_returned = false;
> +
> +    co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
> +    bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +    acb->has_returned = true;
> +    if (acb->rwco.ret != NOT_DONE) {
> +        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> +                                         blk_aio_complete_bh, acb);
> +    }
> +
> +    return &acb->common;
> +}
> +
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +    blk_inc_in_flight(blk); /* increase before waiting */
> +    blk_wait_while_drained(blk);
> +    if (!blk_is_available(blk)) {
> +        blk_dec_in_flight(blk);
> +        return -ENOMEDIUM;
> +    }
> +    ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * op is the zone operation;
> + * offset is the byte offset from the start of the zoned device;
> + * len is the maximum number of bytes the command should operate on. It
> + * should be aligned with the zone sector size.
> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +        int64_t offset, int64_t len)
> +{
> +    int ret;
> +    IO_CODE();
> +
> +
> +    blk_inc_in_flight(blk);
> +    blk_wait_while_drained(blk);
> +
> +    ret = blk_check_byte_request(blk, offset, len);
> +    if (ret < 0) {

Missing blk_dec_in_flight(blk).

> +        return ret;
> +    }
> +
> +    ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +    blk_dec_in_flight(blk);
> +    return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>      BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0a8b4b426e..4edfa25d04 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -67,6 +67,9 @@
>  #include <sys/param.h>
>  #include <sys/syscall.h>
>  #include <sys/vfs.h>
> +#if defined(CONFIG_BLKZONED)
> +#include <linux/blkzoned.h>
> +#endif
>  #include <linux/cdrom.h>
>  #include <linux/fd.h>
>  #include <linux/fs.h>
> @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
>              PreallocMode prealloc;
>              Error **errp;
>          } truncate;
> +        struct {
> +            unsigned int *nr_zones;
> +            BlockZoneDescriptor *zones;
> +        } zone_report;
> +        struct {
> +            unsigned long zone_op;
> +            const char *zone_op_name;
> +            bool all;
> +        } zone_mgmt;
>      };
>  } RawPosixAIOData;
>  
> @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  #endif
>  
>      if (bs->sg || S_ISBLK(st.st_mode)) {
> -        int ret = hdev_get_max_hw_transfer(s->fd, &st);
> +        ret = hdev_get_max_hw_transfer(s->fd, &st);
>  
>          if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
>              bs->bl.max_hw_transfer = ret;
> @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>          zoned = BLK_Z_NONE;
>      }
>      bs->bl.zoned = zoned;
> +    if (zoned != BLK_Z_NONE) {
> +        ret = get_sysfs_long_val(&st, "chunk_sectors");
> +        if (ret > 0) {
> +            bs->bl.zone_sectors = ret;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> +        if (ret > 0) {
> +            bs->bl.max_append_sectors = ret / 512;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_open_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_open_zones = ret;
> +        }
> +
> +        ret = get_sysfs_long_val(&st, "max_active_zones");
> +        if (ret >= 0) {
> +            bs->bl.max_active_zones = ret;
> +        }
> +    }
>  }
>  
>  static int check_for_dasd(int fd)
> @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>  }
>  #endif
>  
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +#if defined(CONFIG_BLKZONED)
> +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> +                              const struct blk_zone *blkz) {
> +    zone->start = blkz->start;
> +    zone->length = blkz->len;
> +    zone->cap = blkz->capacity;
> +    zone->wp = blkz->wp;
> +
> +    switch (blkz->type) {
> +    case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +        zone->type = BLK_ZT_SWR;
> +        break;
> +    case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +        zone->type = BLK_ZT_SWP;
> +        break;
> +    case BLK_ZONE_TYPE_CONVENTIONAL:
> +        zone->type = BLK_ZT_CONV;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    switch (blkz->cond) {
> +    case BLK_ZONE_COND_NOT_WP:
> +        zone->cond = BLK_ZS_NOT_WP;
> +        break;
> +    case BLK_ZONE_COND_EMPTY:
> +        zone->cond = BLK_ZS_EMPTY;
> +        break;
> +    case BLK_ZONE_COND_IMP_OPEN:
> +        zone->cond =BLK_ZS_IOPEN;
> +        break;
> +    case BLK_ZONE_COND_EXP_OPEN:
> +        zone->cond = BLK_ZS_EOPEN;
> +        break;
> +    case BLK_ZONE_COND_CLOSED:
> +        zone->cond = BLK_ZS_CLOSED;
> +        break;
> +    case BLK_ZONE_COND_READONLY:
> +        zone->cond = BLK_ZS_RDONLY;
> +        break;
> +    case BLK_ZONE_COND_FULL:
> +        zone->cond = BLK_ZS_FULL;
> +        break;
> +    case BLK_ZONE_COND_OFFLINE:
> +        zone->cond = BLK_ZS_OFFLINE;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +#endif
> +
> +#if defined(CONFIG_BLKZONED)
> +static int do_zone_report(int64_t sector, int fd,
> +                          struct BlockZoneDescriptor *zones,
> +                          unsigned int nrz) {
> +    struct blk_zone *blkz;
> +    int ret, n = 0, i = 0;
> +
> +    int64_t rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
> +    g_autofree struct blk_zone_report *rep = NULL;
> +    rep = g_malloc(rep_size);
> +
> +    blkz = (struct blk_zone *)(rep + 1);
> +    while (n < nrz) {
> +        memset(rep, 0, rep_size);
> +        rep->sector = sector;
> +        rep->nr_zones = nrz - n;
> +
> +        do {
> +            ret = ioctl(fd, BLKREPORTZONE, rep);
> +        } while (ret != 0 && errno == EINTR);
> +        if (ret != 0) {
> +            error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> +                    fd, sector, errno);
> +            return -errno;
> +        }
> +
> +        if (!rep->nr_zones) {
> +            break;
> +        }
> +
> +        for (i = 0; i < rep->nr_zones; i++, n++) {
> +            parse_zone(&zones[n], &blkz[i]);
> +            /* The next report should start after the last zone reported */
> +            sector = blkz[i].start + blkz[i].len;
> +        }
> +    }
> +    return n;
> +}
> +#endif
> +
> +static int handle_aiocb_zone_report(void *opaque) {
> +#if defined(CONFIG_BLKZONED)
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    unsigned int *nr_zones = aiocb->zone_report.nr_zones;
> +    BlockZoneDescriptor *zones = aiocb->zone_report.zones;
> +    /* zoned block devices use 512-byte sectors */
> +    int64_t sector = aiocb->aio_offset / 512;
> +
> +    *nr_zones = do_zone_report(sector, fd, zones, *nr_zones);
> +    return 0;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +static int handle_aiocb_zone_mgmt(void *opaque) {
> +#if defined(CONFIG_BLKZONED)
> +    RawPosixAIOData *aiocb = opaque;
> +    int fd = aiocb->aio_fildes;
> +    int64_t sector = aiocb->aio_offset / 512;
> +    int64_t nr_sectors = aiocb->aio_nbytes / 512;
> +    struct blk_zone_range range;
> +    int ret;
> +
> +    /* Execute the operation */
> +    range.sector = sector;
> +    range.nr_sectors = nr_sectors;
> +    do {
> +        ret = ioctl(fd, aiocb->zone_mgmt.zone_op, &range);
> +    } while (ret != 0 && errno == EINTR);
> +
> +    if (ret != 0) {
> +        error_report("ioctl %s failed %d", aiocb->zone_mgmt.zone_op_name,
> +                     errno);
> +        return -errno;
> +    }
> +    return ret;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static int handle_aiocb_copy_range(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -3022,6 +3194,104 @@ static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
>      }
>  }
>  
> +/*
> + * zone report - Get a zone block device's information in the form
> + * of an array of zone descriptors.
> + * zones is an array of zone descriptors to hold zone information on reply;
> + * offset can be any byte within the entire size of the device;
> + * nr_zones is the maxium number of sectors the command should operate on.
> + */
> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                           unsigned int *nr_zones,
> +                                           BlockZoneDescriptor *zones) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +
> +    acb = (RawPosixAIOData) {
> +        .bs         = bs,
> +        .aio_fildes = s->fd,
> +        .aio_type   = QEMU_AIO_ZONE_REPORT,
> +        .aio_offset = offset,
> +        .zone_report    = {
> +                .nr_zones       = nr_zones,
> +                .zones          = zones,

Indentation is off here. QEMU uses 4-space indentation.

> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +/*
> + * zone management operations - Execute an operation on a zone
> + */
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +    int64_t zone_sector, zone_sector_mask;
> +    const char *zone_op_name;
> +    unsigned long zone_op;
> +    bool is_all = false;
> +
> +    zone_sector = bs->bl.zone_sectors;
> +    zone_sector_mask = zone_sector - 1;
> +    if (offset & zone_sector_mask) {
> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> +                     "%" PRId64 "", offset, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    if (len & zone_sector_mask) {
> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> +                      " %" PRId64 "", len, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    switch (op) {
> +    case BLK_ZO_OPEN:
> +        zone_op_name = "BLKOPENZONE";
> +        zone_op = BLKOPENZONE;
> +        break;
> +    case BLK_ZO_CLOSE:
> +        zone_op_name = "BLKCLOSEZONE";
> +        zone_op = BLKCLOSEZONE;
> +        break;
> +    case BLK_ZO_FINISH:
> +        zone_op_name = "BLKFINISHZONE";
> +        zone_op = BLKFINISHZONE;
> +        break;
> +    case BLK_ZO_RESET:
> +        zone_op_name = "BLKRESETZONE";
> +        zone_op = BLKRESETZONE;
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    acb = (RawPosixAIOData) {
> +        .bs             = bs,
> +        .aio_fildes     = s->fd,
> +        .aio_type       = QEMU_AIO_ZONE_MGMT,
> +        .aio_offset     = offset,
> +        .aio_nbytes     = len,
> +        .zone_mgmt  = {
> +                .zone_op = zone_op,
> +                .zone_op_name = zone_op_name,
> +                .all = is_all,

This is unused. Please remove the field for now.

> +        },
> +    };
> +
> +    return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
>  static coroutine_fn int
>  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes,
>                  bool blkdev)
> @@ -3752,6 +4022,54 @@ static BlockDriver bdrv_host_device = {
>  #endif
>  };
>  
> +#if defined(CONFIG_BLKZONED)
> +static BlockDriver bdrv_zoned_host_device = {
> +    .format_name = "zoned_host_device",
> +    .protocol_name = "zoned_host_device",
> +    .instance_size = sizeof(BDRVRawState),
> +    .bdrv_needs_filename = true,
> +    .bdrv_probe_device  = hdev_probe_device,
> +    .bdrv_file_open     = hdev_open,
> +    .bdrv_close         = raw_close,
> +    .bdrv_reopen_prepare = raw_reopen_prepare,
> +    .bdrv_reopen_commit  = raw_reopen_commit,
> +    .bdrv_reopen_abort   = raw_reopen_abort,
> +    .bdrv_co_create_opts = bdrv_co_create_opts_simple,
> +    .create_opts         = &bdrv_create_opts_simple,
> +    .mutable_opts        = mutable_opts,
> +    .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
> +    .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
> +
> +    .bdrv_co_preadv         = raw_co_preadv,
> +    .bdrv_co_pwritev        = raw_co_pwritev,
> +    .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> +    .bdrv_co_pdiscard       = hdev_co_pdiscard,
> +    .bdrv_co_copy_range_from = raw_co_copy_range_from,
> +    .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> +    .bdrv_refresh_limits = raw_refresh_limits,
> +    .bdrv_io_plug = raw_aio_plug,
> +    .bdrv_io_unplug = raw_aio_unplug,
> +    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> +
> +    .bdrv_co_truncate       = raw_co_truncate,
> +    .bdrv_getlength = raw_getlength,
> +    .bdrv_get_info = raw_get_info,
> +    .bdrv_get_allocated_file_size
> +                        = raw_get_allocated_file_size,
> +    .bdrv_get_specific_stats = hdev_get_specific_stats,
> +    .bdrv_check_perm = raw_check_perm,
> +    .bdrv_set_perm   = raw_set_perm,
> +    .bdrv_abort_perm_update = raw_abort_perm_update,
> +    .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> +    .bdrv_probe_geometry = hdev_probe_geometry,
> +    .bdrv_co_ioctl = hdev_co_ioctl,
> +
> +    /* zone management operations */
> +    .bdrv_co_zone_report = raw_co_zone_report,
> +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> +};
> +#endif
> +
>  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
>  static void cdrom_parse_filename(const char *filename, QDict *options,
>                                   Error **errp)
> @@ -4012,6 +4330,9 @@ static void bdrv_file_init(void)
>      bdrv_register(&bdrv_file);
>  #if defined(HAVE_HOST_BLOCK_DEVICE)
>      bdrv_register(&bdrv_host_device);
> +#if defined(CONFIG_BLKZONED)
> +    bdrv_register(&bdrv_zoned_host_device);
> +#endif
>  #ifdef __linux__
>      bdrv_register(&bdrv_host_cdrom);
>  #endif
> diff --git a/block/io.c b/block/io.c
> index 0a8cbefe86..de9ec1d740 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3198,6 +3198,47 @@ out:
>      return co.ret;
>  }
>  
> +int bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,

Missing coroutine_fn:

  int coroutine_fn bdrv_co_foo(...)

All coroutine functions must be labelled like this.

> +                        unsigned int *nr_zones,
> +                        BlockZoneDescriptor *zones)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_report) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
> +int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,

coroutine_fn is missing.

> +        int64_t offset, int64_t len)
> +{
> +    BlockDriver *drv = bs->drv;
> +    CoroutineIOCompletion co = {
> +            .coroutine = qemu_coroutine_self(),
> +    };
> +    IO_CODE();
> +
> +    bdrv_inc_in_flight(bs);
> +    if (!drv || !drv->bdrv_co_zone_mgmt) {
> +        co.ret = -ENOTSUP;
> +        goto out;
> +    }
> +    co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
> +out:
> +    bdrv_dec_in_flight(bs);
> +    return co.ret;
> +}
> +
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>  {
>      IO_CODE();
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index fd25ffa9be..65463b88d9 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -88,6 +88,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf);
>  /* Ensure contents are flushed to disk.  */
>  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
>  
> +/* Report zone information of zone block device. */
> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                     unsigned int *nr_zones,
> +                                     BlockZoneDescriptor *zones);
> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +                                   int64_t offset, int64_t len);
> +
>  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
>  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
>  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
> index 7f7863cc9e..078ddd7e67 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -691,6 +691,12 @@ struct BlockDriver {
>                                            QEMUIOVector *qiov,
>                                            int64_t pos);
>  
> +    int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
> +            int64_t offset, unsigned int *nr_zones,
> +            BlockZoneDescriptor *zones);
> +    int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
> +            int64_t offset, int64_t len);
> +
>      /* removable device specific */
>      bool (*bdrv_is_inserted)(BlockDriverState *bs);
>      void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);
> @@ -828,6 +834,21 @@ typedef struct BlockLimits {
>  
>      /* device zone model */
>      BlockZoneModel zoned;
> +
> +    /* zone size expressed in 512-byte sectors */
> +    uint32_t zone_sectors;
> +
> +    /* total number of zones */
> +    unsigned int nr_zones;
> +
> +    /* maximum sectors of a zone append write operation */
> +    int64_t max_append_sectors;
> +
> +    /* maximum number of open zones */
> +    int64_t max_open_zones;
> +
> +    /* maximum number of active zones */
> +    int64_t max_active_zones;
>  } BlockLimits;
>  
>  typedef struct BdrvOpBlocker BdrvOpBlocker;
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 21fc10c4c9..3d26929cdd 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -29,6 +29,8 @@
>  #define QEMU_AIO_WRITE_ZEROES 0x0020
>  #define QEMU_AIO_COPY_RANGE   0x0040
>  #define QEMU_AIO_TRUNCATE     0x0080
> +#define QEMU_AIO_ZONE_REPORT  0x0100
> +#define QEMU_AIO_ZONE_MGMT    0x0200
>  #define QEMU_AIO_TYPE_MASK \
>          (QEMU_AIO_READ | \
>           QEMU_AIO_WRITE | \
> @@ -37,7 +39,9 @@
>           QEMU_AIO_DISCARD | \
>           QEMU_AIO_WRITE_ZEROES | \
>           QEMU_AIO_COPY_RANGE | \
> -         QEMU_AIO_TRUNCATE)
> +         QEMU_AIO_TRUNCATE  | \
> +         QEMU_AIO_ZONE_REPORT | \
> +         QEMU_AIO_ZONE_MGMT)
>  
>  /* AIO flags */
>  #define QEMU_AIO_MISALIGNED   0x1000
> diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h
> index 50f5aa2e07..6835525582 100644
> --- a/include/sysemu/block-backend-io.h
> +++ b/include/sysemu/block-backend-io.h
> @@ -45,6 +45,12 @@ BlockAIOCB *blk_aio_pwritev(BlockBackend *blk, int64_t offset,
>                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_flush(BlockBackend *blk,
>                            BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> +                                unsigned int *nr_zones, BlockZoneDescriptor *zones,
> +                                BlockCompletionFunc *cb, void *opaque);
> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                              int64_t offset, int64_t len,
> +                              BlockCompletionFunc *cb, void *opaque);
>  BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes,
>                               BlockCompletionFunc *cb, void *opaque);
>  void blk_aio_cancel_async(BlockAIOCB *acb);
> @@ -156,6 +162,17 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>  int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes, BdrvRequestFlags flags);
>  
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    unsigned int *nr_zones,
> +                                    BlockZoneDescriptor *zones);
> +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
> +                                         unsigned int *nr_zones,
> +                                         BlockZoneDescriptor *zones);
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                  int64_t offset, int64_t len);
> +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +                                       int64_t offset, int64_t len);
> +
>  int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset,
>                                        int64_t bytes);
>  int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset,
> diff --git a/meson.build b/meson.build
> index 20fddbd707..2f436bb355 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1883,6 +1883,7 @@ config_host_data.set('CONFIG_REPLICATION', get_option('live_block_migration').al
>  # has_header
>  config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h'))
>  config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h'))
> +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h'))
>  config_host_data.set('CONFIG_VALGRIND_H', cc.has_header('valgrind/valgrind.h'))
>  config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
>  config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h'))
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 2173e7734a..c6bbb7a037 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2942,6 +2942,7 @@
>  # @compress: Since 5.0
>  # @copy-before-write: Since 6.2
>  # @snapshot-access: Since 7.0
> +# @zoned_host_device: Since 7.2
>  #
>  # Since: 2.9
>  ##
> @@ -2955,7 +2956,8 @@
>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>              'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>              { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
> -            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> +            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat',
> +            { 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] }
>  
>  ##
>  # @BlockdevOptionsFile:
> @@ -4329,7 +4331,9 @@
>        'vhdx':       'BlockdevOptionsGenericFormat',
>        'vmdk':       'BlockdevOptionsGenericCOWFormat',
>        'vpc':        'BlockdevOptionsGenericFormat',
> -      'vvfat':      'BlockdevOptionsVVFAT'
> +      'vvfat':      'BlockdevOptionsVVFAT',
> +      'zoned_host_device': { 'type': 'BlockdevOptionsFile',
> +                             'if': 'CONFIG_BLKZONED' }
>    } }
>  
>  ##
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 952dc940f1..446a059603 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = {
>      .oneline    = "flush all in-core file state to disk",
>  };
>  
> +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset;
> +    unsigned int nr_zones;
> +
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    nr_zones = cvtnum(argv[optind]);
> +
> +    g_autofree BlockZoneDescriptor *zones = NULL;
> +    zones = g_new(BlockZoneDescriptor, nr_zones);
> +    ret = blk_zone_report(blk, offset, &nr_zones, zones);
> +    if (ret < 0) {
> +        printf("zone report failed: %s\n", strerror(-ret));
> +    } else {
> +        for (int i = 0; i < nr_zones; ++i) {
> +            printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
> +                   "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", "
> +                   "zcond:%u, [type: %u]\n",
> +                   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
> +                   zones[i].cond, zones[i].type);
> +        }
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_report_cmd = {
> +        .name = "zone_report",
> +        .altname = "zrp",
> +        .cfunc = zone_report_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset number",
> +        .oneline = "report zone information",
> +};
> +
> +static int zone_open_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
> +    if (ret < 0) {
> +        printf("zone open failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_open_cmd = {
> +        .name = "zone_open",
> +        .altname = "zo",
> +        .cfunc = zone_open_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "explicit open a range of zones in zone block device",
> +};
> +
> +static int zone_close_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
> +    if (ret < 0) {
> +        printf("zone close failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_close_cmd = {
> +        .name = "zone_close",
> +        .altname = "zc",
> +        .cfunc = zone_close_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "close a range of zones in zone block device",
> +};
> +
> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
> +    if (ret < 0) {
> +        printf("zone finish failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_finish_cmd = {
> +        .name = "zone_finish",
> +        .altname = "zf",
> +        .cfunc = zone_finish_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "finish a range of zones in zone block device",
> +};
> +
> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int ret;
> +    int64_t offset, len;
> +    ++optind;
> +    offset = cvtnum(argv[optind]);
> +    ++optind;
> +    len = cvtnum(argv[optind]);
> +    ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
> +    if (ret < 0) {
> +        printf("zone reset failed: %s\n", strerror(-ret));
> +    }
> +    return ret;
> +}
> +
> +static const cmdinfo_t zone_reset_cmd = {
> +        .name = "zone_reset",
> +        .altname = "zrs",
> +        .cfunc = zone_reset_f,
> +        .argmin = 2,
> +        .argmax = 2,
> +        .args = "offset len",
> +        .oneline = "reset a zone write pointer in zone block device",
> +};
> +
>  static int truncate_f(BlockBackend *blk, int argc, char **argv);
>  static const cmdinfo_t truncate_cmd = {
>      .name       = "truncate",
> @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) init_qemuio_commands(void)
>      qemuio_add_command(&aio_write_cmd);
>      qemuio_add_command(&aio_flush_cmd);
>      qemuio_add_command(&flush_cmd);
> +    qemuio_add_command(&zone_report_cmd);
> +    qemuio_add_command(&zone_open_cmd);
> +    qemuio_add_command(&zone_close_cmd);
> +    qemuio_add_command(&zone_finish_cmd);
> +    qemuio_add_command(&zone_reset_cmd);
>      qemuio_add_command(&truncate_cmd);
>      qemuio_add_command(&length_cmd);
>      qemuio_add_command(&info_cmd);
> -- 
> 2.37.3
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 1/7] include: add zoned device structs
  2022-09-16 15:16       ` Stefan Hajnoczi
@ 2022-09-19  0:50         ` Sam Li
  2022-09-19  8:04           ` Damien Le Moal
  0 siblings, 1 reply; 31+ messages in thread
From: Sam Li @ 2022-09-19  0:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eric Blake, qemu-devel, Dmitry Fomichev, Markus Armbruster,
	qemu block, Kevin Wolf, Fam Zheng, Damien Le Moal,
	Hannes Reinecke, Hanna Reitz

Stefan Hajnoczi <stefanha@redhat.com> 于2022年9月18日周日 04:17写道:
>
> On Thu, Sep 15, 2022 at 06:06:38PM +0800, Sam Li wrote:
> > Eric Blake <eblake@redhat.com> 于2022年9月15日周四 16:05写道:
> > >
> > > On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote:
> > > > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > > ---
> > > >  include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 43 insertions(+)
> > > >
> > > > diff --git a/include/block/block-common.h b/include/block/block-common.h
> > > > index fdb7306e78..36bd0e480e 100644
> > > > --- a/include/block/block-common.h
> > > > +++ b/include/block/block-common.h
> > > > @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
> > > >  typedef struct BdrvChild BdrvChild;
> > > >  typedef struct BdrvChildClass BdrvChildClass;
> > > >
> > > > +typedef enum BlockZoneOp {
> > > > +    BLK_ZO_OPEN,
> > > > +    BLK_ZO_CLOSE,
> > > > +    BLK_ZO_FINISH,
> > > > +    BLK_ZO_RESET,
> > > > +} BlockZoneOp;
> > > > +
> > > > +typedef enum BlockZoneModel {
> > > > +    BLK_Z_NONE = 0x0, /* Regular block device */
> > > > +    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
> > > > +    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
> > > > +} BlockZoneModel;
> > > > +
> > > > +typedef enum BlockZoneCondition {
> > > > +    BLK_ZS_NOT_WP = 0x0,
> > > > +    BLK_ZS_EMPTY = 0x1,
> > > > +    BLK_ZS_IOPEN = 0x2,
> > > > +    BLK_ZS_EOPEN = 0x3,
> > > > +    BLK_ZS_CLOSED = 0x4,
> > > > +    BLK_ZS_RDONLY = 0xD,
> > > > +    BLK_ZS_FULL = 0xE,
> > > > +    BLK_ZS_OFFLINE = 0xF,
> > > > +} BlockZoneCondition;
> > > > +
> > > > +typedef enum BlockZoneType {
> > > > +    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
> > > > +    BLK_ZT_SWR = 0x2, /* Sequential writes required */
> > > > +    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
> > > > +} BlockZoneType;
> > > > +
> > > > +/*
> > > > + * Zone descriptor data structure.
> > > > + * Provides information on a zone with all position and size values in bytes.
> > >
> > > I'm glad that you chose bytes here for use in qemu.  But since the
> > > kernel struct blk_zone uses sectors instead of bytes, is it worth
> > > adding a sentence that we intentionally use bytes here, different from
> > > Linux, to make it easier for reviewers to realize that scaling when
> > > translating between qemu and kernel is necessary?
> >
> > Sorry about the unit mistake. The zone information is in sectors which
> > is the same as kernel struct blk_zone. I think adding a sentence to
> > inform the sector unit makes it clear what the zone descriptor is.
>
> I'd make the units bytes for consistency with the rest of the QEMU block
> layer. For example, the MapEntry structure that "qemu-img map" reports
> has names with similar fields and they are in bytes:
>
>   struct MapEntry {
>       int64_t start;
>       int64_t length;
>

I think the zone descriptor uses sector units because ioctl() will
report zones in sector units. Making blk_zone.offset =
zone_descriptor.offset is more convenient than using byte units where
it needs make conversions twice(sector -> byte -> sector in zone
descriptors and offset argument in bdrv_co_zone_report). The MapEntry
uses byte units because lseek() in bdrv_co_block_status suggests the
file offset is set to bytes and I think it may be why the rest of the
block layer uses bytes(not sure).

I do not object to using bytes here but it would require some
compromises. If I was wrong about anything, please let me know.


Sam


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 1/7] include: add zoned device structs
  2022-09-19  0:50         ` Sam Li
@ 2022-09-19  8:04           ` Damien Le Moal
  2022-09-19  8:06             ` Sam Li
  0 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2022-09-19  8:04 UTC (permalink / raw)
  To: Sam Li, Stefan Hajnoczi
  Cc: Eric Blake, qemu-devel, Dmitry Fomichev, Markus Armbruster,
	qemu block, Kevin Wolf, Fam Zheng, Hannes Reinecke, Hanna Reitz

On 9/19/22 09:50, Sam Li wrote:
> Stefan Hajnoczi <stefanha@redhat.com> 于2022年9月18日周日 04:17写道:
>>
>> On Thu, Sep 15, 2022 at 06:06:38PM +0800, Sam Li wrote:
>>> Eric Blake <eblake@redhat.com> 于2022年9月15日周四 16:05写道:
>>>>
>>>> On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote:
>>>>> Signed-off-by: Sam Li <faithilikerun@gmail.com>
>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>>>>> ---
>>>>>  include/block/block-common.h | 43 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 43 insertions(+)
>>>>>
>>>>> diff --git a/include/block/block-common.h b/include/block/block-common.h
>>>>> index fdb7306e78..36bd0e480e 100644
>>>>> --- a/include/block/block-common.h
>>>>> +++ b/include/block/block-common.h
>>>>> @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
>>>>>  typedef struct BdrvChild BdrvChild;
>>>>>  typedef struct BdrvChildClass BdrvChildClass;
>>>>>
>>>>> +typedef enum BlockZoneOp {
>>>>> +    BLK_ZO_OPEN,
>>>>> +    BLK_ZO_CLOSE,
>>>>> +    BLK_ZO_FINISH,
>>>>> +    BLK_ZO_RESET,
>>>>> +} BlockZoneOp;
>>>>> +
>>>>> +typedef enum BlockZoneModel {
>>>>> +    BLK_Z_NONE = 0x0, /* Regular block device */
>>>>> +    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
>>>>> +    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
>>>>> +} BlockZoneModel;
>>>>> +
>>>>> +typedef enum BlockZoneCondition {
>>>>> +    BLK_ZS_NOT_WP = 0x0,
>>>>> +    BLK_ZS_EMPTY = 0x1,
>>>>> +    BLK_ZS_IOPEN = 0x2,
>>>>> +    BLK_ZS_EOPEN = 0x3,
>>>>> +    BLK_ZS_CLOSED = 0x4,
>>>>> +    BLK_ZS_RDONLY = 0xD,
>>>>> +    BLK_ZS_FULL = 0xE,
>>>>> +    BLK_ZS_OFFLINE = 0xF,
>>>>> +} BlockZoneCondition;
>>>>> +
>>>>> +typedef enum BlockZoneType {
>>>>> +    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
>>>>> +    BLK_ZT_SWR = 0x2, /* Sequential writes required */
>>>>> +    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
>>>>> +} BlockZoneType;
>>>>> +
>>>>> +/*
>>>>> + * Zone descriptor data structure.
>>>>> + * Provides information on a zone with all position and size values in bytes.
>>>>
>>>> I'm glad that you chose bytes here for use in qemu.  But since the
>>>> kernel struct blk_zone uses sectors instead of bytes, is it worth
>>>> adding a sentence that we intentionally use bytes here, different from
>>>> Linux, to make it easier for reviewers to realize that scaling when
>>>> translating between qemu and kernel is necessary?
>>>
>>> Sorry about the unit mistake. The zone information is in sectors which
>>> is the same as kernel struct blk_zone. I think adding a sentence to
>>> inform the sector unit makes it clear what the zone descriptor is.
>>
>> I'd make the units bytes for consistency with the rest of the QEMU block
>> layer. For example, the MapEntry structure that "qemu-img map" reports
>> has names with similar fields and they are in bytes:
>>
>>   struct MapEntry {
>>       int64_t start;
>>       int64_t length;
>>
> 
> I think the zone descriptor uses sector units because ioctl() will
> report zones in sector units. Making blk_zone.offset =
> zone_descriptor.offset is more convenient than using byte units where
> it needs make conversions twice(sector -> byte -> sector in zone
> descriptors and offset argument in bdrv_co_zone_report). The MapEntry
> uses byte units because lseek() in bdrv_co_block_status suggests the
> file offset is set to bytes and I think it may be why the rest of the
> block layer uses bytes(not sure).
> 
> I do not object to using bytes here but it would require some
> compromises. If I was wrong about anything, please let me know.

The conversion can be done using 9-bits left and right shifts, which are
cheap to do. I think it is important to be consistent with qemu block API,
so using for the API bytes is preferred. That will avoid confusions.

> 
> 
> Sam

-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 1/7] include: add zoned device structs
  2022-09-19  8:04           ` Damien Le Moal
@ 2022-09-19  8:06             ` Sam Li
  0 siblings, 0 replies; 31+ messages in thread
From: Sam Li @ 2022-09-19  8:06 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Dmitry Fomichev, Eric Blake, Fam Zheng, Hanna Reitz,
	Hannes Reinecke, Kevin Wolf, Markus Armbruster, Stefan Hajnoczi,
	qemu block, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4372 bytes --]

Damien Le Moal <damien.lemoal@opensource.wdc.com>于2022年9月19日 周一16:04写道:

> On 9/19/22 09:50, Sam Li wrote:
> > Stefan Hajnoczi <stefanha@redhat.com> 于2022年9月18日周日 04:17写道:
> >>
> >> On Thu, Sep 15, 2022 at 06:06:38PM +0800, Sam Li wrote:
> >>> Eric Blake <eblake@redhat.com> 于2022年9月15日周四 16:05写道:
> >>>>
> >>>> On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote:
> >>>>> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> >>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> >>>>> ---
> >>>>>  include/block/block-common.h | 43
> ++++++++++++++++++++++++++++++++++++
> >>>>>  1 file changed, 43 insertions(+)
> >>>>>
> >>>>> diff --git a/include/block/block-common.h
> b/include/block/block-common.h
> >>>>> index fdb7306e78..36bd0e480e 100644
> >>>>> --- a/include/block/block-common.h
> >>>>> +++ b/include/block/block-common.h
> >>>>> @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
> >>>>>  typedef struct BdrvChild BdrvChild;
> >>>>>  typedef struct BdrvChildClass BdrvChildClass;
> >>>>>
> >>>>> +typedef enum BlockZoneOp {
> >>>>> +    BLK_ZO_OPEN,
> >>>>> +    BLK_ZO_CLOSE,
> >>>>> +    BLK_ZO_FINISH,
> >>>>> +    BLK_ZO_RESET,
> >>>>> +} BlockZoneOp;
> >>>>> +
> >>>>> +typedef enum BlockZoneModel {
> >>>>> +    BLK_Z_NONE = 0x0, /* Regular block device */
> >>>>> +    BLK_Z_HM = 0x1, /* Host-managed zoned block device */
> >>>>> +    BLK_Z_HA = 0x2, /* Host-aware zoned block device */
> >>>>> +} BlockZoneModel;
> >>>>> +
> >>>>> +typedef enum BlockZoneCondition {
> >>>>> +    BLK_ZS_NOT_WP = 0x0,
> >>>>> +    BLK_ZS_EMPTY = 0x1,
> >>>>> +    BLK_ZS_IOPEN = 0x2,
> >>>>> +    BLK_ZS_EOPEN = 0x3,
> >>>>> +    BLK_ZS_CLOSED = 0x4,
> >>>>> +    BLK_ZS_RDONLY = 0xD,
> >>>>> +    BLK_ZS_FULL = 0xE,
> >>>>> +    BLK_ZS_OFFLINE = 0xF,
> >>>>> +} BlockZoneCondition;
> >>>>> +
> >>>>> +typedef enum BlockZoneType {
> >>>>> +    BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
> >>>>> +    BLK_ZT_SWR = 0x2, /* Sequential writes required */
> >>>>> +    BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
> >>>>> +} BlockZoneType;
> >>>>> +
> >>>>> +/*
> >>>>> + * Zone descriptor data structure.
> >>>>> + * Provides information on a zone with all position and size values
> in bytes.
> >>>>
> >>>> I'm glad that you chose bytes here for use in qemu.  But since the
> >>>> kernel struct blk_zone uses sectors instead of bytes, is it worth
> >>>> adding a sentence that we intentionally use bytes here, different from
> >>>> Linux, to make it easier for reviewers to realize that scaling when
> >>>> translating between qemu and kernel is necessary?
> >>>
> >>> Sorry about the unit mistake. The zone information is in sectors which
> >>> is the same as kernel struct blk_zone. I think adding a sentence to
> >>> inform the sector unit makes it clear what the zone descriptor is.
> >>
> >> I'd make the units bytes for consistency with the rest of the QEMU block
> >> layer. For example, the MapEntry structure that "qemu-img map" reports
> >> has names with similar fields and they are in bytes:
> >>
> >>   struct MapEntry {
> >>       int64_t start;
> >>       int64_t length;
> >>
> >
> > I think the zone descriptor uses sector units because ioctl() will
> > report zones in sector units. Making blk_zone.offset =
> > zone_descriptor.offset is more convenient than using byte units where
> > it needs make conversions twice(sector -> byte -> sector in zone
> > descriptors and offset argument in bdrv_co_zone_report). The MapEntry
> > uses byte units because lseek() in bdrv_co_block_status suggests the
> > file offset is set to bytes and I think it may be why the rest of the
> > block layer uses bytes(not sure).
> >
> > I do not object to using bytes here but it would require some
> > compromises. If I was wrong about anything, please let me know.
>
> The conversion can be done using 9-bits left and right shifts, which are
> cheap to do. I think it is important to be consistent with qemu block API,
> so using for the API bytes is preferred. That will avoid confusions.


Ok, will change it. Thanks!


>
> >
> >
> > Sam
>
> --
> Damien Le Moal
> Western Digital Research
>
>

[-- Attachment #2: Type: text/html, Size: 6367 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
                     ` (2 preceding siblings ...)
  2022-09-16 16:00   ` Stefan Hajnoczi
@ 2022-09-20  8:51   ` Klaus Jensen
  2022-09-20 13:21     ` Sam Li
  2022-09-21  4:44     ` Damien Le Moal
  3 siblings, 2 replies; 31+ messages in thread
From: Klaus Jensen @ 2022-09-20  8:51 UTC (permalink / raw)
  To: Sam Li
  Cc: qemu-devel, dmitry.fomichev, Markus Armbruster, Eric Blake,
	qemu-block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	damien.lemoal, hare, Hanna Reitz

[-- Attachment #1: Type: text/plain, Size: 2579 bytes --]

On Sep 10 13:27, Sam Li wrote:
> Add a new zoned_host_device BlockDriver. The zoned_host_device option
> accepts only zoned host block devices. By adding zone management
> operations in this new BlockDriver, users can use the new block
> layer APIs including Report Zone and four zone management operations
> (open, close, finish, reset).
> 
> Qemu-io uses the new APIs to perform zoned storage commands of the device:
> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> zone_finish(zf).
> 
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
> -c "zrp offset nr_zones"
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>  block/block-backend.c             | 145 ++++++++++++++
>  block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
>  block/io.c                        |  41 ++++
>  include/block/block-io.h          |   7 +
>  include/block/block_int-common.h  |  21 ++
>  include/block/raw-aio.h           |   6 +-
>  include/sysemu/block-backend-io.h |  17 ++
>  meson.build                       |   1 +
>  qapi/block-core.json              |   8 +-
>  qemu-io-cmds.c                    | 143 +++++++++++++
>  10 files changed, 708 insertions(+), 4 deletions(-)
> 
> +/*
> + * zone management operations - Execute an operation on a zone
> + */
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +        int64_t offset, int64_t len) {
> +#if defined(CONFIG_BLKZONED)
> +    BDRVRawState *s = bs->opaque;
> +    RawPosixAIOData acb;
> +    int64_t zone_sector, zone_sector_mask;
> +    const char *zone_op_name;
> +    unsigned long zone_op;
> +    bool is_all = false;
> +
> +    zone_sector = bs->bl.zone_sectors;
> +    zone_sector_mask = zone_sector - 1;
> +    if (offset & zone_sector_mask) {
> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> +                     "%" PRId64 "", offset, zone_sector);
> +        return -EINVAL;
> +    }
> +
> +    if (len & zone_sector_mask) {
> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> +                      " %" PRId64 "", len, zone_sector);
> +        return -EINVAL;
> +    }

These checks impose a power-of-two constraint on the zone size. Can they
be changed to divisions to lift that constraint? I don't see anything in
this patch set that relies on power of two zone sizes.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-20  8:51   ` Klaus Jensen
@ 2022-09-20 13:21     ` Sam Li
  2022-09-21  4:44     ` Damien Le Moal
  1 sibling, 0 replies; 31+ messages in thread
From: Sam Li @ 2022-09-20 13:21 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-devel, Dmitry Fomichev, Markus Armbruster, Eric Blake,
	qemu block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	Damien Le Moal, Hannes Reinecke, Hanna Reitz

Klaus Jensen <its@irrelevant.dk> 于2022年9月20日周二 16:51写道:
>
> On Sep 10 13:27, Sam Li wrote:
> > Add a new zoned_host_device BlockDriver. The zoned_host_device option
> > accepts only zoned host block devices. By adding zone management
> > operations in this new BlockDriver, users can use the new block
> > layer APIs including Report Zone and four zone management operations
> > (open, close, finish, reset).
> >
> > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use following command:
> > $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> >
> > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > ---
> >  block/block-backend.c             | 145 ++++++++++++++
> >  block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
> >  block/io.c                        |  41 ++++
> >  include/block/block-io.h          |   7 +
> >  include/block/block_int-common.h  |  21 ++
> >  include/block/raw-aio.h           |   6 +-
> >  include/sysemu/block-backend-io.h |  17 ++
> >  meson.build                       |   1 +
> >  qapi/block-core.json              |   8 +-
> >  qemu-io-cmds.c                    | 143 +++++++++++++
> >  10 files changed, 708 insertions(+), 4 deletions(-)
> >
> > +/*
> > + * zone management operations - Execute an operation on a zone
> > + */
> > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > +        int64_t offset, int64_t len) {
> > +#if defined(CONFIG_BLKZONED)
> > +    BDRVRawState *s = bs->opaque;
> > +    RawPosixAIOData acb;
> > +    int64_t zone_sector, zone_sector_mask;
> > +    const char *zone_op_name;
> > +    unsigned long zone_op;
> > +    bool is_all = false;
> > +
> > +    zone_sector = bs->bl.zone_sectors;
> > +    zone_sector_mask = zone_sector - 1;
> > +    if (offset & zone_sector_mask) {
> > +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> > +                     "%" PRId64 "", offset, zone_sector);
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (len & zone_sector_mask) {
> > +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> > +                      " %" PRId64 "", len, zone_sector);
> > +        return -EINVAL;
> > +    }
>
> These checks impose a power-of-two constraint on the zone size. Can they
> be changed to divisions to lift that constraint? I don't see anything in
> this patch set that relies on power of two zone sizes.

Yes, it can be changed when the kernel block layer has non-power-of-2
zone sizes support. For now, Dmitry's work on zoned device support on
the kernel side put the constraint on zone sectors to be po2. The
checks here follow this constraint.

Dmitry's patches can be found here:
https://lore.kernel.org/linux-block/20220919022921.946344-1-dmitry.fomichev@wdc.com/T/#m9b0b3c8220de1307e53235d1a73dab1e3a10f62b


Sam


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-20  8:51   ` Klaus Jensen
  2022-09-20 13:21     ` Sam Li
@ 2022-09-21  4:44     ` Damien Le Moal
  2022-09-21  9:08       ` Klaus Jensen
  1 sibling, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2022-09-21  4:44 UTC (permalink / raw)
  To: Klaus Jensen, Sam Li
  Cc: qemu-devel, dmitry.fomichev, Markus Armbruster, Eric Blake,
	qemu-block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng, hare,
	Hanna Reitz

On 9/20/22 17:51, Klaus Jensen wrote:
> On Sep 10 13:27, Sam Li wrote:
>> Add a new zoned_host_device BlockDriver. The zoned_host_device option
>> accepts only zoned host block devices. By adding zone management
>> operations in this new BlockDriver, users can use the new block
>> layer APIs including Report Zone and four zone management operations
>> (open, close, finish, reset).
>>
>> Qemu-io uses the new APIs to perform zoned storage commands of the device:
>> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
>> zone_finish(zf).
>>
>> For example, to test zone_report, use following command:
>> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
>> -c "zrp offset nr_zones"
>>
>> Signed-off-by: Sam Li <faithilikerun@gmail.com>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   block/block-backend.c             | 145 ++++++++++++++
>>   block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
>>   block/io.c                        |  41 ++++
>>   include/block/block-io.h          |   7 +
>>   include/block/block_int-common.h  |  21 ++
>>   include/block/raw-aio.h           |   6 +-
>>   include/sysemu/block-backend-io.h |  17 ++
>>   meson.build                       |   1 +
>>   qapi/block-core.json              |   8 +-
>>   qemu-io-cmds.c                    | 143 +++++++++++++
>>   10 files changed, 708 insertions(+), 4 deletions(-)
>>
>> +/*
>> + * zone management operations - Execute an operation on a zone
>> + */
>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
>> +        int64_t offset, int64_t len) {
>> +#if defined(CONFIG_BLKZONED)
>> +    BDRVRawState *s = bs->opaque;
>> +    RawPosixAIOData acb;
>> +    int64_t zone_sector, zone_sector_mask;
>> +    const char *zone_op_name;
>> +    unsigned long zone_op;
>> +    bool is_all = false;
>> +
>> +    zone_sector = bs->bl.zone_sectors;
>> +    zone_sector_mask = zone_sector - 1;
>> +    if (offset & zone_sector_mask) {
>> +        error_report("sector offset %" PRId64 " is not aligned to zone size "
>> +                     "%" PRId64 "", offset, zone_sector);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (len & zone_sector_mask) {
>> +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
>> +                      " %" PRId64 "", len, zone_sector);
>> +        return -EINVAL;
>> +    }
> 
> These checks impose a power-of-two constraint on the zone size. Can they
> be changed to divisions to lift that constraint? I don't see anything in
> this patch set that relies on power of two zone sizes.

Given that Linux will only expose zoned devices that have a zone size 
that is a power of 2 number of LBAs, this will work as is and avoid 
divisions in the IO path. But given that zone management operations are 
not performance critical, generalizing the code should be fine.

However, once we start adding the code for full zone emulation on top of 
a regular file or qcow image, sector-to-zone conversions requiring 
divisions will hurt. So I really would prefer the code be left as-is for 
now.


-- 
Damien Le Moal
Western Digital Research



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  2022-09-21  4:44     ` Damien Le Moal
@ 2022-09-21  9:08       ` Klaus Jensen
  0 siblings, 0 replies; 31+ messages in thread
From: Klaus Jensen @ 2022-09-21  9:08 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Sam Li, qemu-devel, dmitry.fomichev, Markus Armbruster,
	Eric Blake, qemu-block, Stefan Hajnoczi, Kevin Wolf, Fam Zheng,
	hare, Hanna Reitz, Pankaj Raghav

[-- Attachment #1: Type: text/plain, Size: 4141 bytes --]

On Sep 21 13:44, Damien Le Moal wrote:
> On 9/20/22 17:51, Klaus Jensen wrote:
> > On Sep 10 13:27, Sam Li wrote:
> > > Add a new zoned_host_device BlockDriver. The zoned_host_device option
> > > accepts only zoned host block devices. By adding zone management
> > > operations in this new BlockDriver, users can use the new block
> > > layer APIs including Report Zone and four zone management operations
> > > (open, close, finish, reset).
> > > 
> > > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > > zone_finish(zf).
> > > 
> > > For example, to test zone_report, use following command:
> > > $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0
> > > -c "zrp offset nr_zones"
> > > 
> > > Signed-off-by: Sam Li <faithilikerun@gmail.com>
> > > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > > ---
> > >   block/block-backend.c             | 145 ++++++++++++++
> > >   block/file-posix.c                | 323 +++++++++++++++++++++++++++++-
> > >   block/io.c                        |  41 ++++
> > >   include/block/block-io.h          |   7 +
> > >   include/block/block_int-common.h  |  21 ++
> > >   include/block/raw-aio.h           |   6 +-
> > >   include/sysemu/block-backend-io.h |  17 ++
> > >   meson.build                       |   1 +
> > >   qapi/block-core.json              |   8 +-
> > >   qemu-io-cmds.c                    | 143 +++++++++++++
> > >   10 files changed, 708 insertions(+), 4 deletions(-)
> > > 
> > > +/*
> > > + * zone management operations - Execute an operation on a zone
> > > + */
> > > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> > > +        int64_t offset, int64_t len) {
> > > +#if defined(CONFIG_BLKZONED)
> > > +    BDRVRawState *s = bs->opaque;
> > > +    RawPosixAIOData acb;
> > > +    int64_t zone_sector, zone_sector_mask;
> > > +    const char *zone_op_name;
> > > +    unsigned long zone_op;
> > > +    bool is_all = false;
> > > +
> > > +    zone_sector = bs->bl.zone_sectors;
> > > +    zone_sector_mask = zone_sector - 1;
> > > +    if (offset & zone_sector_mask) {
> > > +        error_report("sector offset %" PRId64 " is not aligned to zone size "
> > > +                     "%" PRId64 "", offset, zone_sector);
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    if (len & zone_sector_mask) {
> > > +        error_report("number of sectors %" PRId64 " is not aligned to zone size"
> > > +                      " %" PRId64 "", len, zone_sector);
> > > +        return -EINVAL;
> > > +    }
> > 
> > These checks impose a power-of-two constraint on the zone size. Can they
> > be changed to divisions to lift that constraint? I don't see anything in
> > this patch set that relies on power of two zone sizes.
> 
> Given that Linux will only expose zoned devices that have a zone size that
> is a power of 2 number of LBAs, this will work as is and avoid divisions in
> the IO path. But given that zone management operations are not performance
> critical, generalizing the code should be fine.
> 

Aight. That's fine, we can relax the constraint later. But I don't see why.

> However, once we start adding the code for full zone emulation on top of a
> regular file or qcow image, sector-to-zone conversions requiring divisions
> will hurt. So I really would prefer the code be left as-is for now.
> 

Block driver based zone emulation would not hit the above code path
anyway (there would be no iotcl to call), so I don't think that is an
argument for keeping it as-is.

On the sector-to-zone convertions. Yes, they may potentially hurt, but
it would be emulation after all. Why would we care about optimizing
those conversions at the expense of not being able to emulate all valid
geometries? The performance option (which this series enables) is to use
an underlying zoned device through virtio (or, potentially, hw/nvme).
What would be the usecase for deploying a performance sensitive emulated
zoned device?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-09-21  9:11 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-10  5:27 [PATCH v9 0/7] Add support for zoned device Sam Li
2022-09-10  5:27 ` [PATCH v9 1/7] include: add zoned device structs Sam Li
2022-09-15  8:05   ` Eric Blake
2022-09-15 10:06     ` Sam Li
2022-09-16 15:16       ` Stefan Hajnoczi
2022-09-19  0:50         ` Sam Li
2022-09-19  8:04           ` Damien Le Moal
2022-09-19  8:06             ` Sam Li
2022-09-10  5:27 ` [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes Sam Li
2022-09-11  4:56   ` Damien Le Moal
2022-09-10  5:27 ` [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls Sam Li
2022-09-11  5:31   ` Damien Le Moal
2022-09-11  6:33     ` Sam Li
2022-09-11  6:48       ` Damien Le Moal
2022-09-11  7:30         ` Sam Li
2022-09-11  7:02   ` Damien Le Moal
2022-09-16 16:00   ` Stefan Hajnoczi
2022-09-20  8:51   ` Klaus Jensen
2022-09-20 13:21     ` Sam Li
2022-09-21  4:44     ` Damien Le Moal
2022-09-21  9:08       ` Klaus Jensen
2022-09-10  5:27 ` [PATCH v9 4/7] raw-format: add zone operations to pass through requests Sam Li
2022-09-11  5:32   ` Damien Le Moal
2022-09-10  5:27 ` [PATCH v9 5/7] config: add check to block layer Sam Li
2022-09-11  5:34   ` Damien Le Moal
2022-09-11  6:54     ` Sam Li
2022-09-11  7:05       ` Damien Le Moal
2022-09-16 15:22   ` Stefan Hajnoczi
2022-09-10  5:27 ` [PATCH v9 6/7] qemu-iotests: test new zone operations Sam Li
2022-09-10  5:27 ` [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation Sam Li
2022-09-11  5:38   ` Damien Le Moal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.