All of lore.kernel.org
 help / color / mirror / Atom feed
* [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
@ 2022-06-01 23:55 Dmitry Fomichev
  2022-06-02  7:23 ` Stefan Hajnoczi
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Fomichev @ 2022-06-01 23:55 UTC (permalink / raw)
  To: virtio-dev, virtio-comment
  Cc: Damien Le Moal, Matias Bjorling, Niklas Cassel, Hans Holmberg,
	Stefan Hajnoczi, Hannes Reinecke, Sam Li, Dmitry Fomichev

Introduce support for Zoned Block Devices to virtio.

Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
and/or cost characteristics compared to commonly available block
devices by getting the entire LBA space of the device divided to block
regions that are much larger than the LBA size. These regions are
called zones and they can only be written sequentially. More details
about ZBDs can be found at

https://zonedstorage.io/docs/introduction/zoned-storage .

In its current form, the virtio protocol for block devices (virtio-blk)
is not aware of ZBDs but it allows the guest to successfully scan a
host-managed drive provided by the host. As the result, the
host-managed drive appears at the guest as a regular drive that will
operate erroneously under the most common write workloads.

To fix this, the virtio-blk protocol needs to be extended to add the
capabilities to convey the zone characteristics of host ZBDs to the
guest and to provide the support for ZBD-specific commands - Report
Zones, four zone operations and (optionally) Zone Append. The proposed
standard extension aims to provide this functionality.

This patch extends the virtio-blk section of virtio specification with
the minimum set of requirements that are necessary to support ZBDs.
The resulting device model is a subset of the models defined in ZAC/ZBC
and ZNS standards documents. The included functionality mirrors
the existing Linux kernel block layer ZBD support and should be
sufficient to handle the host-managed and host-aware HDDs that are on
the market today as well as ZNS SSDs that are entering the market at
the moment of this patch submission.

I have developed a proof of concept patch series that adds ZBD support
to virtio-blk Linux kernel driver by implementing the protocol
extensions defined in the spec patch. I would like to receive the
initial feedback on this specification patch before posting that series
to the block LKML.

I would like to thank the following people for their useful feedback
and suggestions while working on the initial iterations of this patch.

Damien Le Moal <damien.lemoal@opensource.wdc.com>
Matias Bjørling <Matias.Bjorling@wdc.com>
Niklas Cassel <Niklas.Cassel@wdc.com>
Hans Holmberg <Hans.Holmberg@wdc.com>

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 content.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 685 insertions(+), 1 deletion(-)

diff --git a/content.tex b/content.tex
index 7508dd1..8ae7578 100644
--- a/content.tex
+++ b/content.tex
@@ -4557,6 +4557,11 @@ \subsection{Feature bits}\label{sec:Device Types / Block Device / Feature bits}
      maximum erase sectors count in \field{max_secure_erase_sectors} and
      maximum erase segment number in \field{max_secure_erase_seg}.
 
+\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a device
+	that behaves as defined by the T10 Zoned Block Command standard (ZBC r05) or
+	the NVMe(TM) NVM Express Zoned Namespace Command Set Specification 1.1b
+	(ZNS).
+
 \end{description}
 
 \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits}
@@ -4589,6 +4594,31 @@ \subsection{Device configuration layout}\label{sec:Device Types / Block Device /
 \field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are expressed
 in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is negotiated.
 
+If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
+\field{virtio_blk_zoned_characteristics},
+\begin{itemize}
+\item \field{zone_sectors} value is expressed in 512-byte sectors.
+\item \field{max_append_sectors} value is expressed in 512-byte sectors.
+\item \field{write_granularity} value is expressed in bytes.
+\end{itemize}
+
+The \field{model} field in \field{zoned} may have the following values:
+VIRTIO_BLK_Z_HM(1) and VIRTIO_BLK_Z_HA(2).
+
+\begin{lstlisting}
+#define VIRTIO_BLK_Z_HM        1
+#define VIRTIO_BLK_Z_HA        2
+\end{lstlisting}
+
+If the VIRTIO_BLK_F_ZONED feature is negotiated, then
+\begin{itemize}
+\item The value of VIRTIO_BLK_Z_HM MUST be set by the device if it operates
+    as a host-managed zoned block device.
+
+\item The value of VIRTIO_BLK_Z_HA MUST be set by the device if it operates
+    as a host-aware zoned block device.
+\end{itemize}
+
 \begin{lstlisting}
 struct virtio_blk_config {
         le64 capacity;
@@ -4623,6 +4653,15 @@ \subsection{Device configuration layout}\label{sec:Device Types / Block Device /
         le32 max_secure_erase_sectors;
         le32 max_secure_erase_seg;
         le32 secure_erase_sector_alignment;
+        struct virtio_blk_zoned_characteristics {
+                le32 zone_sectors;
+                le32 max_open_zones;
+                le32 max_active_zones;
+                le32 max_append_sectors;
+                le32 write_granularity;
+                u8 model;
+                u8 unused2[3];
+        } zoned;
 };
 \end{lstlisting}
 
@@ -4686,6 +4725,10 @@ \subsection{Device Initialization}\label{sec:Device Types / Block Device / Devic
     \field{secure_erase_sector_alignment} can be used by OS when splitting a
     request based on alignment.
 
+\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
+    \field{zoned} can be read by the driver to determine the zone
+    characteristics of the device. All \field{zoned} fields are read-only.
+
 \end{enumerate}
 
 \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
@@ -4701,6 +4744,24 @@ \subsection{Device Initialization}\label{sec:Device Types / Block Device / Devic
 The driver MUST NOT read \field{writeback} before setting
 the FEATURES_OK \field{device status} bit.
 
+Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are incapable
+of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned model.
+
+If the VIRTIO_BLK_F_ZONED feature is offered by the device, the
+VIRTIO_BLK_F_DISCARD feature MUST NOT be offered.
+
+If the VIRTIO_BLK_F_ZONED feature is negotiated, then
+\begin{itemize}
+\item If the driver that can not support host-managed zoned devices
+    reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the
+    driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit.
+
+\item If the driver that can not support support zoned devices reads
+    VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the driver
+    MAY present the device to the guest as a non-zoned device. In this case, the
+    driver SHOULD ignore all other fields in \field{zoned}.
+\end{itemize}
+
 \devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
 
 Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it
@@ -4712,6 +4773,77 @@ \subsection{Device Initialization}\label{sec:Device Types / Block Device / Devic
 The device MUST initialize padding bytes \field{unused0} and
 \field{unused1} to 0.
 
+If the device that is being initialized is a not a zoned device, the device MUST
+NOT offer the VIRTIO_BLK_F_ZONED feature.
+
+If the VIRTIO_BLK_F_ZONED bit is not set by the driver,
+\begin{itemize}
+\item the device with the VIRTIO_BLK_Z_HA zone model SHOULD proceed with the
+    initialization while setting all zoned topology fields to zero.
+
+\item the device with the VIRTIO_BLK_Z_HM zone model MUST report the device
+    capacity in \field{capacity} in the configuration space as zero to prevent
+    the use of the device that is incorrectly recognized by the driver as
+    a non-zoned device.
+\end{itemize}
+
+If the VIRTIO_BLK_F_ZONED feature is negotiated,
+\begin{itemize}
+\item \field{zoned} can be read by the driver to discover the size of a single
+    zone on the device. All zones of the device have the same size indicated by
+    the \field{zone_sectors} field of \field{zoned} except for the last zone
+    that MAY be smaller than all other zones. The driver can calculate the
+    number of zones on the device as
+    \begin{lstlisting}
+        nr_zones = (capacity + zone_sectors - 1) / zone_sectors;
+    \end{lstlisting}
+    and the size of the last zone as
+    \begin{lstlisting}
+        zs_last = capacity - (nr_zones - 1) * zone_sectors;
+    \end{lstlisting}
+
+\item Zones consume volatile device resources while being in certain states and
+    the device MAY set limits on the number of zones that can be in these states
+    simultaneously.
+
+    Zoned block devices use two internal counters to account for the device
+    resources in use, the number of currently open zones and the number of
+    currently active zones.
+
+    Any zone state transition from a state that doesn't consume a zone resource
+    to a state that consumes the same resource increments the internal device
+    counter for that resource. Any zone transition out of a state that consumes
+    a zone resource to a state that doesn't consume the same resource decrements
+    the counter. Any request that causes the device to exceed the reported zone
+    resource limits is terminated by the device with an error.
+
+\item The \field{max_open_zones} field of the \field{zoned} structure can be
+    read by the driver to discover the maximum number of zones that can be open
+    on the device (zones in the implicit open or explicit open state). A value
+    of zero indicates that the device does not have any limit on the number of
+    open zones.
+
+\item The \field{max_active_zones} field of the \field{zoned} structure can be
+    read by the driver to discover the maximum number zones that can be active
+    on the device (zones in the implicit open, explicit open or closed state).
+    A value of zero indicates that the device does not have any limit on the
+    number of active zones.
+
+\item the \field{max_append_sectors} field of \field{zoned} can be read by the
+    driver to get the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND request
+    issued to the device. The value of this field MUST NOT exceed the
+    \field{seg_max} * \field{size_max} value. A device MAY set the
+    \field{max_append_sectors} to zero if it doesn't support
+    VIRTIO_BLK_T_ZONE_APPEND requests.
+
+\item the \field{write_granularity} field of \field{zoned} can be read by the
+    driver to discover the offset and size alignment constraint for
+    VIRTIO_BLK_T_OUT and VIRTIO_BLK_T_ZONE_APPEND requests issued to
+    a sequential zone of the device.
+
+\item the device MUST initialize padding bytes \field{unused2} to 0.
+\end{itemize}
+
 \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization}
 
 Because legacy devices do not have FEATURES_OK, transitional devices
@@ -4738,7 +4870,8 @@ \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types /
 \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation}
 
 The driver queues requests to the virtqueues, and they are used by
-the device (not necessarily in order). Each request is of form:
+the device (not necessarily in order). If the VIRTIO_BLK_F_ZONED feature
+is not negotiated, then each request is of form:
 
 \begin{lstlisting}
 struct virtio_blk_req {
@@ -4853,6 +4986,331 @@ \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Ope
 command produces VIRTIO_BLK_S_IOERR.  A segment may have completed
 successfully, failed, or not been processed by the device.
 
+The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
+negotiated.
+
+Each request is of form:
+
+\begin{lstlisting}
+struct virtio_blk_zoned_req {
+        le32 type;
+        le32 reserved;
+        le64 sector;
+        union zoned_params {
+                struct {
+                        /* ALL zone operation flag */
+                        __u8 all;
+                        __u8 unused1[3];
+                } mgmt_send;
+                struct {
+                        /* Partial zone report flag */
+                        __u8 partial;
+                        __u8 unused2[3];
+                } mgmt_receive;
+                struct {
+                        __u8 unused3[4];
+                } append;
+        } zone;
+        u8 data[];
+        le64 zone_result;
+        u8 status;
+        u8 reserved1[3];
+};
+\end{lstlisting}
+
+In addition to the request types defined for non-zoned devices, the type of the
+request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone open
+(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close (VIRTIO_BLK_T_ZONE_CLOSE), a
+zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append (VIRTIO_BLK_T_ZONE_APPEND)
+or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
+
+\begin{lstlisting}
+#define VIRTIO_BLK_T_ZONE_APPEND    15
+#define VIRTIO_BLK_T_ZONE_REPORT    16
+#define VIRTIO_BLK_T_ZONE_OPEN      18
+#define VIRTIO_BLK_T_ZONE_CLOSE     20
+#define VIRTIO_BLK_T_ZONE_FINISH    22
+#define VIRTIO_BLK_T_ZONE_RESET     24
+\end{lstlisting}
+
+Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the type
+VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
+VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and VIRTIO_BLK_T_ZONE_RESET
+are non-data requests.
+
+In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
+Management Receive" command category and VIRTIO_BLK_T_ZONE_OPEN,
+VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and VIRTIO_BLK_T_ZONE_RESET
+requests are categorized as "Zone Management Send" commands.
+VIRTIO_BLK_T_ZONE_APPEND is categorized separately from the zone management
+commands. Each of these categories has a distinct set of command parameters and
+these parameters are defined in the \field{zone} union field of the struct
+\field{virtio_blk_zoned_req}.
+
+VIRTIO_BLK_T_ZONE_REPORT is a read request that returns the information about
+the current state of zones on the device starting from the zone containing the
+\field{sector} of the request. The report consists of a header followed by zero
+or more zone descriptors.
+
+A zone report reply has the following structure:
+
+\begin{lstlisting}
+struct virtio_blk_zone_report {
+        le64   nr_zones;
+        u8     reserved[56];
+        struct virtio_blk_zone_descriptor zones[];
+};
+\end{lstlisting}
+
+If the field \field{zone.mgmt_receive.partial} in \field{virtio_blk_zoned_req}
+structure has zero value, then \field{nr_zones} in
+\field{virtio_blk_zone_report} structure is set by the device to the value that
+equals the number of zones that can be reported starting from the report start
+sector, regardless of the number of zone descriptors that can fit in the command
+data buffer. If the field \field{zone.mgmt_receive.partial} is non-zero, the
+device sets the \field{nr_zones} field in the report header to the number of
+fully transferred zone descriptors in the data buffer.
+
+A zone descriptor has the following structure:
+
+\begin{lstlisting}
+struct virtio_blk_zone_descriptor {
+        le64   z_cap;
+        le64   z_start;
+        le64   z_wp;
+        u8     z_type;
+        u8     z_state;
+        u8     reserved[38];
+};
+\end{lstlisting}
+
+The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor}
+indicates the type of the zone. The available types are VIRTIO_BLK_ZT_CONV(1),
+VIRTIO_BLK_ZT_SWR(2) or VIRTIO_BLK_ZT_SWP(3).
+
+\begin{lstlisting}
+#define VIRTIO_BLK_ZT_CONV     1
+#define VIRTIO_BLK_ZT_SWR      2
+#define VIRTIO_BLK_ZT_SWP      3
+\end{lstlisting}
+
+Read and write operations into zones with the VIRTIO_BLK_ZT_CONV (Conventional)
+type have the same behavior as read and write operations on a regular block
+device. Any block in a conventional zone can be read or written at any time and
+in any order.
+
+Zones with VIRTIO_BLK_ZT_SWR (Sequential Write Required or SWR) can be read
+randomly, but MUST be written sequentially at a certain point in the zone called
+the Write Pointer (WP). With every write, the Write Pointer is incremented by
+the number of sectors written.
+
+Zones with VIRTIO_BLK_ZT_SWP (Sequential Write Preferred or SWP) can be read
+randomly and SHOULD be written sequentially, similarly to SWR zones. However,
+SWP zones can accept random write operations, that is, VIRTIO_BLK_T_OUT requests
+with a start sector different from the zone write pointer position.
+
+The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates the
+state of the device zone. The available zone states are VIRTIO_BLK_ZS_NOT_WP(0),
+VIRTIO_BLK_ZS_EMPTY(1), VIRTIO_BLK_ZS_IOPEN(2), VIRTIO_BLK_ZS_EOPEN(3),
+VIRTIO_BLK_ZS_CLOSED(4), VIRTIO_BLK_ZS_RDONLY(13), VIRTIO_BLK_ZS_FULL(14) and
+VIRTIO_BLK_ZS_OFFLINE(15).
+
+\begin{lstlisting}
+#define VIRTIO_BLK_ZS_NOT_WP   0
+#define VIRTIO_BLK_ZS_EMPTY    1
+#define VIRTIO_BLK_ZS_IOPEN    2
+#define VIRTIO_BLK_ZS_EOPEN    3
+#define VIRTIO_BLK_ZS_CLOSED   4
+#define VIRTIO_BLK_ZS_RDONLY   13
+#define VIRTIO_BLK_ZS_FULL     14
+#define VIRTIO_BLK_ZS_OFFLINE  15
+\end{lstlisting}
+
+Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be in
+the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and
+VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state.
+
+Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open),
+VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed) state
+are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only), VIRTIO_BLK_ZS_FULL
+(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer
+value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones.
+
+The zone descriptor field \field{z_cap} contains the maximum number of 512-byte
+sectors that are available to be written with user data when the zone is in the
+Empty state. This value shall be less than or equal to the \field{zone_sectors}
+value in \field{virtio_blk_zoned_characteristics} structure in the device
+configuration space.
+
+The zone descriptor field \field{z_start} contains the 64-bit address of the
+first 512-byte sector of the zone.
+
+The zone descriptor field \field{z_wp} contains the 64-bit sector address where
+the next write operation for this zone should be issued. This value is undefined
+for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY, VIRTIO_BLK_ZS_FULL
+and VIRTIO_BLK_ZS_OFFLINE state.
+
+Depending on their state, zones consume resources as follows:
+\begin{itemize}
+\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes one
+    open zone resource and, additionally,
+
+\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and
+    VIRTIO_BLK_ZS_CLOSED state consumes one active resource.
+\end{itemize}
+
+Attempts for zone transitions that violate zone resource limits MUST fail with
+VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE
+\field{zone_result}.
+
+Zones in the VIRTIO_BLK_ZS_EMPTY (Empty) state have the write pointer value
+equal to the start sector of the zone. In this state, the entire capacity of the
+zone is available for writing. A zone can transition from this state to
+\begin{itemize}
+\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
+    VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone.
+
+\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
+    received for the zone
+\end{itemize}
+
+When a VIRTIO_BLK_T_ZONE_RESET request is issued to an Empty zone, the request
+is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EMPTY state.
+
+Zones in the VIRTIO_BLK_ZS_IOPEN (Implicitly Open) state can transition from
+this state to
+\begin{itemize}
+\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
+    received for the zone,
+
+\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
+    received for the zone,
+
+\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
+    received for the zone,
+
+\item VIRTIO_BLK_ZS_CLOSED implicitly by the device when another zone is
+    entering the VIRTIO_BLK_ZS_IOPEN or VIRTIO_BLK_ZS_EOPEN state and the number
+    of currently open zones is at \field{max_open_zones} limit,
+
+\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is
+    received for the zone.
+
+\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
+    VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable
+    capacity is received for the zone.
+\end{itemize}
+
+Zones in the VIRTIO_BLK_ZS_EOPEN (Explicitly Open) state can transition from
+this state to
+\begin{itemize}
+\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
+    received for the zone,
+
+\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
+    received for the zone and the write pointer of the zone has the value equal
+    to the start sector of the zone,
+
+\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
+    received for the zone and the zone write pointer is larger then the start
+    sector of the zone,
+
+\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is
+    received for the zone,
+
+\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
+    VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable
+    capacity is received for the zone.
+\end{itemize}
+
+When a VIRTIO_BLK_T_ZONE_EOPEN request is issued to an Explicitly Open zone, the
+request is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EOPEN
+state.
+
+Zones in the VIRTIO_BLK_ZS_CLOSED (Closed) state can transition from this state
+to
+\begin{itemize}
+\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
+    received for the zone,
+
+\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
+    VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone.
+
+\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
+    received for the zone,
+\end{itemize}
+
+When a VIRTIO_BLK_T_ZONE_CLOSE request is issued to a Closed zone, the request
+is completed successfully and the zone stays in the VIRTIO_BLK_ZS_CLOSED state.
+
+Zones in the VIRTIO_BLK_ZS_FULL (Full) state can transition from this state to
+VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
+received for the zone
+
+When a VIRTIO_BLK_T_ZONE_FINISH request is issued to a Full zone, the request
+is completed successfully and the zone stays in the VIRTIO_BLK_ZS_FULL state.
+
+The device MAY automatically transition zones to VIRTIO_BLK_ZS_RDONLY
+(Read-Only) or VIRTIO_BLK_ZS_OFFLINE (Offline) state from any other state. The
+device MAY also automatically transition zones in the Read-Only state to the
+Offline state. Zones in the Offline state MAY NOT transition to any other state.
+Such automatic transitions usually indicate hardware failures. The previously
+written data may only be read from zones in the Read-Only state. Zones in the
+Offline state can not be read or written.
+
+If a request of the type VIRTIO_BLK_T_ZONE_APPEND is completed with
+VIRTIO_BLK_S_OK status, the field \field{zone_result} in
+\field{virtio_blk_zoned_req} can be read by the driver to obtain the start
+sector of the data written to the zone. The field \field{zone_result} MUST
+be set to zero by the driver for requests of any other type that are completed
+with VIRTIO_BLK_S_OK status.
+
+If a request of type VIRTIO_BLK_T_OUT, VIRTIO_BLK_T_ZONE_OPEN,
+VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_APPEND
+or VIRTIO_BLK_T_ZONE_RESET is completed with VIRTIO_BLK_S_IOERR status, the
+driver can read the result of the zone operation from the field
+\field{zone_result}. In this case, the possible values of \field{zone_result}
+are VIRTIO_BLK_S_ZONE_INVALID_CMD (0), VIRTIO_BLK_S_ZONE_UNALIGNED_WP(1),
+VIRTIO_BLK_S_ZONE_OPEN_RESOURCE(2) or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE(3).
+
+\begin{lstlisting}
+#define VIRTIO_BLK_S_ZONE_INVALID_CMD     0
+#define VIRTIO_BLK_S_ZONE_UNALIGNED_WP    1
+#define VIRTIO_BLK_S_ZONE_OPEN_RESOURCE   2
+#define VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE 3
+\end{lstlisting}
+
+VIRTIO_BLK_S_ZONE_UNALIGNED_WP is set by the device when the request received
+from the driver attempts to perform a write to an SWR zone and at least one of
+the following conditions is met:
+
+\begin{itemize}
+\item the starting sector of the request is not equal to the current value of
+    the zone write pointer.
+
+\item the ending sector of the request data multiplied by 512 is not a multiple
+    of the value reported by the device in the field \field{write_granularity}
+    in the device configuration space.
+\end{itemize}
+
+VIRTIO_BLK_S_ZONE_OPEN_RESOURCE is set by the device when a zone operation or
+write request received from the driver can not be handled without exceeding the
+\field{max_open_zones} limit value reported by the device in the configuration
+space.
+
+VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE is set by the device when a zone operation or
+write request received from the driver can not be handled without exceeding the
+\field{max_active_zones} limit value reported by the device in the configuration
+space.
+
+A zone transition request that leads to both the \field{max_open_zones} and the
+\field{max_active_zones} limits to be exceeded is terminated by the device with
+VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE \field{zoned_result} value.
+
+The device SHALL report all other error conditions related to zoned block model
+operation by setting the VIRTIO_BLK_S_ZONE_INVALID_CMD value in
+\field{zone_result} of \field{virtio_blk_zoned_req} structure.
+
 \drivernormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation}
 
 A driver MUST NOT submit a request which would cause a read or write
@@ -4899,6 +5357,68 @@ \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Ope
 successfully, failed, or were processed by the device at all if the request
 failed with VIRTIO_BLK_S_IOERR.
 
+The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
+negotiated.
+
+When forming a VIRTIO_BLK_T_ZONE_REPORT request, the driver sets the sector
+address of the starting zone to report to \field{sector}. The driver MAY
+set \field{zone.mgmt_receive.partial} field to a non-zero value to request a
+partial zone report from the device. If the request is successful, the number of
+reported zone descriptors is determined by the zone topology of the device, the
+provided start sector to report and the size of the data buffer provided for the
+report. The number of zones returned in the field \field{nr_zones} of
+\field{virtio_blk_zone_report} depends on the value of the field
+\field{zone.mgmt_receive.partial}.
+
+When forming a VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
+VIRTIO_BLK_T_ZONE_FINISH and VIRTIO_BLK_T_ZONE_RESET requests, the driver may
+either make the zone operation to act on a particular zone or apply the
+operation to all applicable zones on the device. To specify a single zone for
+the operation, the driver MUST set the \field{zone.mgmt_send.all} flag to zero
+value and set the zone sector address in the \field{sector} of the request. The
+zone sector address is a 64-bit value expressed in 512-byte units that points at
+the first sector in the target zone. To make a zone operation act upon all
+applicable zones, the driver MUST set the \field{zone.mgmt_send.all} field in
+the request to a non-zero value. If the field \field{zone.mgmt_send.all} is set,
+the driver MUST set the field \field{sector} to zero value because in this case
+it is ignored by the device.
+
+The \field{sector} field of the VIRTIO_BLK_T_ZONE_APPEND request MUST specify
+the first sector of the zone to which data is to be appended at the position of
+the write pointer. The zone sector address is a 64-bit value expressed in
+512-byte units that points anywhere in the target zone. The size of the data
+that is appended MUST NOT exceed the \field{max_append_sectors} value provided
+by the device in \field{virtio_blk_zoned_characteristics} configuration space
+structure.
+
+Upon a successful completion of a VIRTIO_BLK_T_ZONE_APPEND request, the driver
+can read the starting sector location of the written data from the request field
+\field{zone_result}.
+
+All VIRTIO_BLK_T_OUT requests issued by the driver to sequential zones and
+VIRTIO_BLK_T_ZONE_APPEND requests MUST have:
+
+\begin{enumerate}
+\item the data size that is a multiple of the number of bytes reported
+    by the device in the field \field{write_granularity} in the
+    \field{virtio_blk_zoned_characteristics} configuration space structure.
+
+\item the value of the field \field{sector} that is a multiple of the number of
+    bytes reported by the device in the field \field{write_granularity} in the
+    \field{virtio_blk_zoned_characteristics} configuration space structure.
+
+\item the data size that will not exceed the writable zone capacity when its
+    value is added to the current value of the write pointer of the zone.
+
+\end{enumerate}
+
+If the device has set the \field{model} field of
+\field{virtio_blk_zoned_characteristics} structure in the configuration space to
+VIRTIO_BLK_Z_HA and the driver exposes it to the guest as a non-zoned device,
+the driver MUST use \field{virtio_blk_zoned_req} for all requests. In this case,
+the fields of \field{zone} union MUST be initialized to zero by the driver and
+the driver MUST consider \field{zone_result} field reserved.
+
 \devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation}
 
 A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR
@@ -4990,6 +5510,170 @@ \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Ope
   simplfy passthrough implementations from eMMC devices.
 \end{note}
 
+If the VIRTIO_BLK_F_ZONED feature is not negotiated, the device MUST reject
+VIRTIO_BLK_T_ZONE_REPORT, VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
+VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_APPEND and VIRTIO_BLK_T_ZONE_RESET
+requests with VIRTIO_BLK_S_UNSUPP status.
+
+The following device requirements only apply if the VIRTIO_BLK_F_ZONED feature
+is negotiated.
+
+If a request of type VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
+VIRTIO_BLK_T_ZONE_FINISH or VIRTIO_BLK_T_ZONE_RESET is issued for a Conventional
+zone (type VIRTIO_BLK_ZT_CONV), the request is completed with VIRTIO_BLK_S_IOERR
+status and the field \field{zone_result} is set to the value
+VIRTIO_BLK_S_ZONE_INVALID_CMD.
+
+If the zone specified by the VIRTIO_BLK_T_ZONE_APPEND request is not a SWR zone,
+then the request SHALL be completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{zone_result}.
+
+The device handles a VIRTIO_BLK_T_ZONE_OPEN request with non-zero
+\field{zone.mgmt_send.all} field by transitioning all zones in
+VIRTIO_BLK_ZS_CLOSED state to VIRTIO_BLK_ZS_EOPEN state. If, while processing
+this request, the available zone resources are insufficient, then no zone state
+transitions shall take place and the request is completed with
+VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or
+VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in the field \field{zone_result}.
+
+The device handles a VIRTIO_BLK_T_ZONE_OPEN request with zero
+\field{zone.mgmt_send.all} field by attempting to change the state of the zone
+with the \field{sector} address to VIRTIO_BLK_ZS_EOPEN state. If the transition
+to this state can not be performed, the request is completed with
+VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_INVALID_CMD value in
+\field{zone_result}. If, while processing this request, the available zone
+resources are insufficient, then the zone state does not change and the request
+is completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in
+the field \field{zone_result}.
+
+The device handles a VIRTIO_BLK_T_ZONE_CLOSE request with non-zero
+\field{zone.mgmt_send.all} field by transitioning all zones in
+VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state to VIRTIO_BLK_ZS_CLOSED state.
+
+The device handles a VIRTIO_BLK_T_ZONE_CLOSE request with zero
+\field{zone.mgmt_send.all} field by attempting to change the state of the zone
+with the \field{sector} address to VIRTIO_BLK_ZS_CLOSED state. If the transition
+to this state can not be performed, the request is completed with
+VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_INVALID_CMD value in the
+field \field{zone_result}.
+
+The device handles a VIRTIO_BLK_T_ZONE_FINISH request with non-zero
+\field{zone.mgmt_send.all} field by transitioning all zones in
+VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN or VIRTIO_BLK_ZS_CLOSED state to
+VIRTIO_BLK_ZS_FULL state.
+
+The device handles a VIRTIO_BLK_T_ZONE_FINISH request with zero
+\field{zone.mgmt_send.all} field by attempting to change the state of the zone
+with the \field{sector} address to VIRTIO_BLK_ZS_FULL state. If the transition
+to this state can not be performed, the zone state does not change and the
+request is completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{zone_result}.
+
+The device handles a VIRTIO_BLK_T_ZONE_RESET request with non-zero
+\field{zone.mgmt_send.all} field by transitioning all zones in
+VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN, VIRTIO_BLK_ZS_CLOSED and
+VIRTIO_BLK_ZS_FULL state to VIRTIO_BLK_ZS_EMPTY state.
+
+The device handles a VIRTIO_BLK_T_ZONE_RESET request with zero
+\field{zone.mgmt_send.all} field by attempting to change the state of the zone
+with the \field{sector} address to VIRTIO_BLK_ZS_EMPTY state. If the transition
+to this state can not be performed, the zone state does not change and the
+request is completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{zone_result}.
+
+Upon receiving a VIRTIO_BLK_T_ZONE_APPEND request or a VIRTIO_BLK_T_OUT
+request issued to a SWR zone in VIRTIO_BLK_ZS_EMPTY or VIRTIO_BLK_ZS_CLOSED
+state, the device attempts to perform the transition of the zone to
+VIRTIO_BLK_ZS_IOPEN state before writing data. This transition may fail due to
+insufficient open and/or active zone resources available on the device. In this
+case, the request is completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in
+the \field{zone_result}.
+
+If the \field{sector} field in the VIRTIO_BLK_T_ZONE_APPEND request does not
+specify the lowest sector for a zone, then the request SHALL be completed with
+VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_INVALID_CMD value in
+\field{zone_result}.
+
+A VIRTIO_BLK_T_ZONE_APPEND request or a VIRTIO_BLK_T_OUT request that has the
+data range that that exceeds the remaining writable capacity for the zone, then
+the request SHALL be completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_INVALID_CMD value in \field{zone_result}.
+
+A VIRTIO_BLK_T_ZONE_APPEND request that has the data size that exceeds
+\field{max_append_sectors} configuration space value, then,
+\begin{itemize}
+\item if \field{max_append_sectors} configuration space value is reported as
+    zero by the device, the request SHALL be completed with VIRTIO_BLK_S_UNSUPP
+    \field{status}.
+
+\item if \field{max_append_sectors} configuration space value is reported as
+    a non-zero value by the device, the request SHALL be completed with
+    VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_INVALID_CMD value in
+    the field \field{zone_result}.
+\end{itemize}
+
+If a VIRTIO_BLK_T_ZONE_APPEND request, a VIRTIO_BLK_T_IN request or a
+VIRTIO_BLK_T_OUT request issued to a SWR zone has the range that has sectors in
+more than one zone, then the request SHALL completed with VIRTIO_BLK_S_IOERR
+\field{status} and VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field
+\field{zone_result}.
+
+A VIRTIO_BLK_T_OUT request that has the \field{sector} value that is not aligned
+with the write pointer for the zone, then the request SHALL completed with
+VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_UNALIGNED_WP value in
+the field \field{zone_result}.
+
+In order to avoid resource-related errors while opening zones implicitly, the
+device MAY automatically transition zones in VIRTIO_BLK_ZS_IOPEN state to
+VIRTIO_BLK_ZS_CLOSED state.
+
+All VIRTIO_BLK_T_OUT requests or VIRTIO_BLK_T_ZONE_APPEND requests issued
+to a zone in the VIRTIO_BLK_ZS_RDONLY state SHALL be completed with
+VIRTIO_BLK_S_IOERR \field{status} and VIRTIO_BLK_S_ZONE_INVALID_CMD value in the
+field \field{zone_result}.
+
+All requests issued to a zone in the VIRTIO_BLK_ZS_OFFLINE state SHALL be
+completed with VIRTIO_BLK_S_IOERR \field{status} and
+VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{zone_result}.
+
+The device MUST consider the data that is read above the write pointer of a zone
+as unwritten data. The sectors between the write pointer position and the upper
+write boundary of the zone during VIRTIO_BLK_T_ZONE_FINISH request processing
+are also considered unwritten data.
+
+When unwritten data is present in the sector range of a read request, the device
+MUST process this data in one of the following ways -
+
+\begin{enumerate}
+\item Fill the unwritten data with a device-specific byte pattern. The
+configuration, control and reporting of this byte pattern is beyond the scope
+of this standard. This is the preferred approach.
+
+\item Fail the request. This may prevent the device from being operational in
+some guest operating systems.
+
+\item Return stale, previously written data to the driver. This approach is the
+least preferred for its obvious negative security implications.
+\end{enumerate}
+
+If the both VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_SECURE_ERASE features are
+negotiated, then
+
+\begin{enumerate}
+\item the field \field{secure_erase_sector_alignment} in the configuration space
+of the device MUST be a multiple of \field{zone_sectors} value reported in the
+device configuration space.
+
+\item the data size in VIRTIO_BLK_T_SECURE_ERASE requests MUST be a multiple of
+\field{zone_sectors} value in the device configuration space.
+\end{enumerate}
+
+The device MUST handle a VIRTIO_BLK_T_SECURE_ERASE request in the same way it
+handles VIRTIO_BLK_T_ZONE_RESET request for the zone range specified in the
+VIRTIO_BLK_T_SECURE_ERASE request.
+
 \subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}
 When using the legacy interface, transitional devices and drivers
 MUST format the fields in struct virtio_blk_req
-- 
2.34.1



---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
  2022-06-01 23:55 [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification Dmitry Fomichev
@ 2022-06-02  7:23 ` Stefan Hajnoczi
  2022-06-03  2:30   ` Dmitry Fomichev
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2022-06-02  7:23 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: virtio-dev, virtio-comment, Damien Le Moal, Matias Bjorling,
	Niklas Cassel, Hans Holmberg, Stefan Hajnoczi, Hannes Reinecke,
	Sam Li

[-- Attachment #1: Type: text/plain, Size: 26775 bytes --]

(Resending from my mailing list subscriber address so this makes it onto
the list.)

On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> Introduce support for Zoned Block Devices to virtio.
>
> Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> and/or cost characteristics compared to commonly available block
> devices by getting the entire LBA space of the device divided to block
> regions that are much larger than the LBA size. These regions are
> called zones and they can only be written sequentially. More details
> about ZBDs can be found at
>
> https://zonedstorage.io/docs/introduction/zoned-storage .
>
> In its current form, the virtio protocol for block devices (virtio-blk)
> is not aware of ZBDs but it allows the guest to successfully scan a
> host-managed drive provided by the host. As the result, the
> host-managed drive appears at the guest as a regular drive that will
> operate erroneously under the most common write workloads.
>
> To fix this, the virtio-blk protocol needs to be extended to add the
> capabilities to convey the zone characteristics of host ZBDs to the
> guest and to provide the support for ZBD-specific commands - Report
> Zones, four zone operations and (optionally) Zone Append. The proposed
> standard extension aims to provide this functionality.
>
> This patch extends the virtio-blk section of virtio specification with
> the minimum set of requirements that are necessary to support ZBDs.
> The resulting device model is a subset of the models defined in ZAC/ZBC
> and ZNS standards documents. The included functionality mirrors
> the existing Linux kernel block layer ZBD support and should be
> sufficient to handle the host-managed and host-aware HDDs that are on
> the market today as well as ZNS SSDs that are entering the market at
> the moment of this patch submission.
>
> I have developed a proof of concept patch series that adds ZBD support
> to virtio-blk Linux kernel driver by implementing the protocol
> extensions defined in the spec patch. I would like to receive the
> initial feedback on this specification patch before posting that series
> to the block LKML.
>
> I would like to thank the following people for their useful feedback
> and suggestions while working on the initial iterations of this patch.
>
> Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Matias Bjørling <Matias.Bjorling@wdc.com>
> Niklas Cassel <Niklas.Cassel@wdc.com>
> Hans Holmberg <Hans.Holmberg@wdc.com>
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  content.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 685 insertions(+), 1 deletion(-)
>
> diff --git a/content.tex b/content.tex
> index 7508dd1..8ae7578 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -4557,6 +4557,11 @@ \subsection{Feature bits}\label{sec:Device Types / Block Device / Feature bits}
>       maximum erase sectors count in \field{max_secure_erase_sectors} and
>       maximum erase segment number in \field{max_secure_erase_seg}.
>
> +\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a device
> +       that behaves as defined by the T10 Zoned Block Command standard (ZBC r05) or
> +       the NVMe(TM) NVM Express Zoned Namespace Command Set Specification 1.1b
> +       (ZNS).

The rest of the spec doesn't mention ZBC/ZNS but at least the zone
state machine is likely to be derived from these standards. Have you
reviewed the license/IP terms of VIRTIO, ZBC, and ZNS to check that
it's okay to do this? It's probably okay but I wanted to mention it
because it would be a shame to discover an IP issue after the VIRTIO
spec has been released.

Also, the wording suggests to me that the spec will refer to ZBC/ZNS
instead of defining the behavior itself, but actually you've written a
full self-contained spec (which is great!). Maybe change "behaves as
defined by" to "follows the zoned storage device model that is also
supported by industry standards such as".


> +
>  \end{description}
>
>  \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits}
> @@ -4589,6 +4594,31 @@ \subsection{Device configuration layout}\label{sec:Device Types / Block Device /
>  \field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are expressed
>  in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is negotiated.
>
> +If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
> +\field{virtio_blk_zoned_characteristics},
> +\begin{itemize}
> +\item \field{zone_sectors} value is expressed in 512-byte sectors.
> +\item \field{max_append_sectors} value is expressed in 512-byte sectors.
> +\item \field{write_granularity} value is expressed in bytes.
> +\end{itemize}
> +
> +The \field{model} field in \field{zoned} may have the following values:
> +VIRTIO_BLK_Z_HM(1) and VIRTIO_BLK_Z_HA(2).
> +
> +\begin{lstlisting}
> +#define VIRTIO_BLK_Z_HM        1
> +#define VIRTIO_BLK_Z_HA        2
> +\end{lstlisting}
> +
> +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> +\begin{itemize}
> +\item The value of VIRTIO_BLK_Z_HM MUST be set by the device if it operates
> +    as a host-managed zoned block device.

The term host-managed is not defined. Please either define it or
include a reference to the zoned storage device model (maybe
zonedstorage.io?).

> +
> +\item The value of VIRTIO_BLK_Z_HA MUST be set by the device if it operates
> +    as a host-aware zoned block device.

The term host-aware is not defined.

> +\end{itemize}

The VIRTIO spec is organized into normative sections containing
MUST/MAY/SHOULD and descriptive sections. MUST/MAY/SHOULD need to go
in drivernormative or devicenormative sections and can't be used in
descriptive sections.

(I believe the reason for this is to give driver/device implementors a
definitive list of normative statements they can easily review without
re-reading the entire text of the specification.)
This raises the question whether devices can offer VIRTIO_BLK_F_ZONED
but fall back to regular block device operation if the driver does not
negotiate the bit?

This statement partially precludes that because it requires devices to
disable VIRTIO_BLK_F_DISCARD if they advertize VIRTIO_BLK_F_ZONED. I
think there is a way to allow VIRTIO_BLK_F_DISCARD in
!VIRTIO_BLK_F_ZONED mode while forbidding it in VIRTIO_BLK_F_ZONED
mode: the device can fail to set FEATURES_OK when the driver writes
the Device Status field with VIRTIO_BLK_F_DISCARD &&
VIRTIO_BLK_F_ZONED. That way a host-aware device in
!VIRTIO_BLK_F_ZONED mode could allow VIRTIO_BLK_F_DISCARD. Is this
useful?

> +
> +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> +\begin{itemize}
> +\item If the driver that can not support host-managed zoned devices
> +    reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the
> +    driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit.
> +
> +\item If the driver that can not support support zoned devices reads
> +    VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the driver
> +    MAY present the device to the guest as a non-zoned device. In this case, the

Please avoid using the terms "guest" or "host" in the virtualization
sense and instead use "driver" and "device". VIRTIO can be used in
non-virtualization use cases and it's clearer to avoid virtualization
terminology.

> +    driver SHOULD ignore all other fields in \field{zoned}.

Why would the driver negotiate VIRTIO_BLK_F_ZONED if it doesn't
support zoned devices?

> +\end{itemize}
> +
>  \devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
>
>  Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it
> @@ -4712,6 +4773,77 @@ \subsection{Device Initialization}\label{sec:Device Types / Block Device / Devic
>  The device MUST initialize padding bytes \field{unused0} and
>  \field{unused1} to 0.
>
> +If the device that is being initialized is a not a zoned device, the device MUST
> +NOT offer the VIRTIO_BLK_F_ZONED feature.
> +
> +If the VIRTIO_BLK_F_ZONED bit is not set by the driver,
> +\begin{itemize}
> +\item the device with the VIRTIO_BLK_Z_HA zone model SHOULD proceed with the
> +    initialization while setting all zoned topology fields to zero.
> +
> +\item the device with the VIRTIO_BLK_Z_HM zone model MUST report the device
> +    capacity in \field{capacity} in the configuration space as zero to prevent
> +    the use of the device that is incorrectly recognized by the driver as
> +    a non-zoned device.

I suggest changing this to "MUST fail to set the FEATURES_OK device
status bit when the driver writes the Device Status field". That way
the driver gets a clear error instead of seeing a virtio-blk device
with 0 capacity.


> +\end{itemize}
> +
> +If the VIRTIO_BLK_F_ZONED feature is negotiated,
> +\begin{itemize}
> +\item \field{zoned} can be read by the driver to discover the size of a single
> +    zone on the device. All zones of the device have the same size indicated by
> +    the \field{zone_sectors} field of \field{zoned} except for the last zone
> +    that MAY be smaller than all other zones. The driver can calculate the
> +    number of zones on the device as
> +    \begin{lstlisting}
> +        nr_zones = (capacity + zone_sectors - 1) / zone_sectors;
> +    \end{lstlisting}
> +    and the size of the last zone as
> +    \begin{lstlisting}
> +        zs_last = capacity - (nr_zones - 1) * zone_sectors;
> +    \end{lstlisting}
> +
> +\item Zones consume volatile device resources while being in certain states and
> +    the device MAY set limits on the number of zones that can be in these states
> +    simultaneously.
> +
> +    Zoned block devices use two internal counters to account for the device
> +    resources in use, the number of currently open zones and the number of
> +    currently active zones.
> +
> +    Any zone state transition from a state that doesn't consume a zone resource
> +    to a state that consumes the same resource increments the internal device
> +    counter for that resource. Any zone transition out of a state that consumes
> +    a zone resource to a state that doesn't consume the same resource decrements
> +    the counter. Any request that causes the device to exceed the reported zone
> +    resource limits is terminated by the device with an error.

"an error" is vague. I was wondering what the exact error should be.
Please either mention the error codes here or add "as defined for
specific commands later" so it's clear that the device doesn't get the
choose the error code but they are defined by the spec.

> +
> +\item The \field{max_open_zones} field of the \field{zoned} structure can be
> +    read by the driver to discover the maximum number of zones that can be open
> +    on the device (zones in the implicit open or explicit open state). A value
> +    of zero indicates that the device does not have any limit on the number of
> +    open zones.
> +
> +\item The \field{max_active_zones} field of the \field{zoned} structure can be
> +    read by the driver to discover the maximum number zones that can be active
> +    on the device (zones in the implicit open, explicit open or closed state).
> +    A value of zero indicates that the device does not have any limit on the
> +    number of active zones.
> +
> +\item the \field{max_append_sectors} field of \field{zoned} can be read by the
> +    driver to get the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND request
> +    issued to the device. The value of this field MUST NOT exceed the
> +    \field{seg_max} * \field{size_max} value. A device MAY set the
> +    \field{max_append_sectors} to zero if it doesn't support
> +    VIRTIO_BLK_T_ZONE_APPEND requests.

The MUST NOT/MAY statements need to be in a drivernormative section.
There are more instances below.


> +
> +\item the \field{write_granularity} field of \field{zoned} can be read by the
> +    driver to discover the offset and size alignment constraint for
> +    VIRTIO_BLK_T_OUT and VIRTIO_BLK_T_ZONE_APPEND requests issued to
> +    a sequential zone of the device.
> +
> +\item the device MUST initialize padding bytes \field{unused2} to 0.
> +\end{itemize}
> +
>  \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization}
>
>  Because legacy devices do not have FEATURES_OK, transitional devices
> @@ -4738,7 +4870,8 @@ \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types /
>  \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation}
>
>  The driver queues requests to the virtqueues, and they are used by
> -the device (not necessarily in order). Each request is of form:
> +the device (not necessarily in order). If the VIRTIO_BLK_F_ZONED feature
> +is not negotiated, then each request is of form:
>
>  \begin{lstlisting}
>  struct virtio_blk_req {
> @@ -4853,6 +4986,331 @@ \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Ope
>  command produces VIRTIO_BLK_S_IOERR.  A segment may have completed
>  successfully, failed, or not been processed by the device.
>
> +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
> +negotiated.
> +
> +Each request is of form:
> +
> +\begin{lstlisting}
> +struct virtio_blk_zoned_req {
> +        le32 type;
> +        le32 reserved;
> +        le64 sector;
> +        union zoned_params {
> +                struct {
> +                        /* ALL zone operation flag */
> +                        __u8 all;
> +                        __u8 unused1[3];
> +                } mgmt_send;
> +                struct {
> +                        /* Partial zone report flag */
> +                        __u8 partial;
> +                        __u8 unused2[3];
> +                } mgmt_receive;
> +                struct {
> +                        __u8 unused3[4];
> +                } append;
> +        } zone;
> +        u8 data[];
> +        le64 zone_result;
> +        u8 status;
> +        u8 reserved1[3];

Please make status the last byte so device implementations can reuse
their existing request completion code to handle virtio_blk_zoned_req.
Either reserved1[] can be removed or it can be moved before status.

Here is the non-union representation of the structs. Note that this
means existing request processing code can handle IN/OUT/FLUSH even
when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
code paths for these common commands:

/* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
struct virtio_blk_req {
        le32 type;
        le32 reserved;
        le64 sector;
        u8 data[];
        u8 status;
};

/* ZONE_APPEND */
struct virtio_blk_req_zone_append {
        le32 type;
        le32 reserved;
        le64 sector;
        u8 data[];
        le64 zone_result;
        u8 status;
};

/* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
struct virtio_blk_req_zone_mgmt_send {
        le32 type;
        le32 reserved;
        le64 sector;
        /* ALL zone operation flag */
        __u8 all;
        __u8 unused1[3];
        u8 data[];
        u8 status;
};

/* ZONE_REPORT */
struct virtio_blk_req_zone_mgmt_receive {
        le32 type;
        le32 reserved;
        le64 sector;
        /* Partial zone report flag */
        __u8 partial;
        __u8 unused2[3];
        u8 data[];
        u8 status;
};

> +};
> +\end{lstlisting}
> +
> +In addition to the request types defined for non-zoned devices, the type of the
> +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone open
> +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close (VIRTIO_BLK_T_ZONE_CLOSE), a
> +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append (VIRTIO_BLK_T_ZONE_APPEND)
> +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> +
> +\begin{lstlisting}
> +#define VIRTIO_BLK_T_ZONE_APPEND    15
> +#define VIRTIO_BLK_T_ZONE_REPORT    16
> +#define VIRTIO_BLK_T_ZONE_OPEN      18
> +#define VIRTIO_BLK_T_ZONE_CLOSE     20
> +#define VIRTIO_BLK_T_ZONE_FINISH    22
> +#define VIRTIO_BLK_T_ZONE_RESET     24
> +\end{lstlisting}
> +
> +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the type

Here "reads" means data transfers from the device to the driver rather
than disk reads. I found that confusing. Maybe
"VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?

> +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and VIRTIO_BLK_T_ZONE_RESET
> +are non-data requests.
> +
> +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone

"ZBD standards" is not defined. I think this refers to the zoned
storage device model and its embodiment in the ZBC/ZNS specs.
The configuration space fields already allow the driver to calculate
the total number of zones. What is the purpose of partial=0, it
doesn't seem to provide any new information?

When the device marks a virtqueue buffer used, it indicates the number
of bytes written (struct virtq_used_elem's len field). The driver can
use this field to determine the number of descriptors filled in by the
device. Is the nr_zones field really necessary?


> +
> +A zone descriptor has the following structure:
> +
> +\begin{lstlisting}
> +struct virtio_blk_zone_descriptor {
> +        le64   z_cap;
> +        le64   z_start;
> +        le64   z_wp;
> +        u8     z_type;
> +        u8     z_state;
> +        u8     reserved[38];
> +};
> +\end{lstlisting}
> +
> +The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor}
> +indicates the type of the zone. The available types are VIRTIO_BLK_ZT_CONV(1),
> +VIRTIO_BLK_ZT_SWR(2) or VIRTIO_BLK_ZT_SWP(3).
> +
> +\begin{lstlisting}
> +#define VIRTIO_BLK_ZT_CONV     1
> +#define VIRTIO_BLK_ZT_SWR      2
> +#define VIRTIO_BLK_ZT_SWP      3
> +\end{lstlisting}
> +
> +Read and write operations into zones with the VIRTIO_BLK_ZT_CONV (Conventional)
> +type have the same behavior as read and write operations on a regular block
> +device. Any block in a conventional zone can be read or written at any time and
> +in any order.
> +
> +Zones with VIRTIO_BLK_ZT_SWR (Sequential Write Required or SWR) can be read
> +randomly, but MUST be written sequentially at a certain point in the zone called
> +the Write Pointer (WP). With every write, the Write Pointer is incremented by
> +the number of sectors written.
> +
> +Zones with VIRTIO_BLK_ZT_SWP (Sequential Write Preferred or SWP) can be read
> +randomly and SHOULD be written sequentially, similarly to SWR zones. However,
> +SWP zones can accept random write operations, that is, VIRTIO_BLK_T_OUT requests
> +with a start sector different from the zone write pointer position.
> +
> +The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates the
> +state of the device zone. The available zone states are VIRTIO_BLK_ZS_NOT_WP(0),
> +VIRTIO_BLK_ZS_EMPTY(1), VIRTIO_BLK_ZS_IOPEN(2), VIRTIO_BLK_ZS_EOPEN(3),
> +VIRTIO_BLK_ZS_CLOSED(4), VIRTIO_BLK_ZS_RDONLY(13), VIRTIO_BLK_ZS_FULL(14) and
> +VIRTIO_BLK_ZS_OFFLINE(15).
> +
> +\begin{lstlisting}
> +#define VIRTIO_BLK_ZS_NOT_WP   0
> +#define VIRTIO_BLK_ZS_EMPTY    1
> +#define VIRTIO_BLK_ZS_IOPEN    2
> +#define VIRTIO_BLK_ZS_EOPEN    3
> +#define VIRTIO_BLK_ZS_CLOSED   4
> +#define VIRTIO_BLK_ZS_RDONLY   13
> +#define VIRTIO_BLK_ZS_FULL     14
> +#define VIRTIO_BLK_ZS_OFFLINE  15
> +\end{lstlisting}
> +
> +Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be in
> +the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and
> +VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state.
> +
> +Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open),
> +VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed) state
> +are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only), VIRTIO_BLK_ZS_FULL
> +(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer
> +value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones.
> +
> +The zone descriptor field \field{z_cap} contains the maximum number of 512-byte
> +sectors that are available to be written with user data when the zone is in the
> +Empty state. This value shall be less than or equal to the \field{zone_sectors}
> +value in \field{virtio_blk_zoned_characteristics} structure in the device
> +configuration space.
> +
> +The zone descriptor field \field{z_start} contains the 64-bit address of the
> +first 512-byte sector of the zone.
> +
> +The zone descriptor field \field{z_wp} contains the 64-bit sector address where
> +the next write operation for this zone should be issued. This value is undefined
> +for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY, VIRTIO_BLK_ZS_FULL
> +and VIRTIO_BLK_ZS_OFFLINE state.
> +
> +Depending on their state, zones consume resources as follows:
> +\begin{itemize}
> +\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes one
> +    open zone resource and, additionally,
> +
> +\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and
> +    VIRTIO_BLK_ZS_CLOSED state consumes one active resource.
> +\end{itemize}
> +
> +Attempts for zone transitions that violate zone resource limits MUST fail with
> +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE
> +\field{zone_result}.

Why use zone_result instead of the status field? The _S_ naming
suggests this is a status field constant, not a zone_result constant.


> +
> +Zones in the VIRTIO_BLK_ZS_EMPTY (Empty) state have the write pointer value
> +equal to the start sector of the zone. In this state, the entire capacity of the
> +zone is available for writing. A zone can transition from this state to
> +\begin{itemize}
> +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
> +    VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone.
> +
> +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
> +    received for the zone
> +\end{itemize}
> +
> +When a VIRTIO_BLK_T_ZONE_RESET request is issued to an Empty zone, the request
> +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EMPTY state.
> +
> +Zones in the VIRTIO_BLK_ZS_IOPEN (Implicitly Open) state can transition from
> +this state to
> +\begin{itemize}
> +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
> +    received for the zone,
> +
> +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
> +    received for the zone,
> +
> +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
> +    received for the zone,
> +
> +\item VIRTIO_BLK_ZS_CLOSED implicitly by the device when another zone is
> +    entering the VIRTIO_BLK_ZS_IOPEN or VIRTIO_BLK_ZS_EOPEN state and the number
> +    of currently open zones is at \field{max_open_zones} limit,
> +
> +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is
> +    received for the zone.
> +
> +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
> +    VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable
> +    capacity is received for the zone.
> +\end{itemize}
> +
> +Zones in the VIRTIO_BLK_ZS_EOPEN (Explicitly Open) state can transition from
> +this state to
> +\begin{itemize}
> +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
> +    received for the zone,
> +
> +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
> +    received for the zone and the write pointer of the zone has the value equal
> +    to the start sector of the zone,
> +
> +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
> +    received for the zone and the zone write pointer is larger then the start
> +    sector of the zone,
> +
> +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is
> +    received for the zone,
> +
> +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
> +    VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable
> +    capacity is received for the zone.
> +\end{itemize}
> +
> +When a VIRTIO_BLK_T_ZONE_EOPEN request is issued to an Explicitly Open zone, the
> +request is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EOPEN
> +state.
> +
> +Zones in the VIRTIO_BLK_ZS_CLOSED (Closed) state can transition from this state
> +to
> +\begin{itemize}
> +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
> +    received for the zone,
> +
> +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
> +    VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone.
> +
> +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
> +    received for the zone,
> +\end{itemize}
> +
> +When a VIRTIO_BLK_T_ZONE_CLOSE request is issued to a Closed zone, the request
> +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_CLOSED state.
> +
> +Zones in the VIRTIO_BLK_ZS_FULL (Full) state can transition from this state to
> +VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
> +received for the zone
> +
> +When a VIRTIO_BLK_T_ZONE_FINISH request is issued to a Full zone, the request
> +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_FULL state.
> +
> +The device MAY automatically transition zones to VIRTIO_BLK_ZS_RDONLY
> +(Read-Only) or VIRTIO_BLK_ZS_OFFLINE (Offline) state from any other state. The
> +device MAY also automatically transition zones in the Read-Only state to the
> +Offline state. Zones in the Offline state MAY NOT transition to any other state.
> +Such automatic transitions usually indicate hardware failures. The previously
> +written data may only be read from zones in the Read-Only state. Zones in the
> +Offline state can not be read or written.

What is the status code when this is attempted?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
  2022-06-02  7:23 ` Stefan Hajnoczi
@ 2022-06-03  2:30   ` Dmitry Fomichev
  2022-06-03  8:08     ` [virtio-comment] " Stefan Hajnoczi
       [not found]     ` <CAJSP0QXiUy-=pyGtxjpguWFt=+HJFqjKo3ipwN+YpF43nsxAag@mail.gmail.com>
  0 siblings, 2 replies; 7+ messages in thread
From: Dmitry Fomichev @ 2022-06-03  2:30 UTC (permalink / raw)
  To: stefanha
  Cc: Niklas Cassel, hare, its, Matias Bjørling, virtio-dev,
	virtio-comment, Hans Holmberg, faithilikerun, damien.lemoal,
	stefanha

On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> (Resending from my mailing list subscriber address so this makes it onto
> the list.)
> 
> On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
> > 
> > Introduce support for Zoned Block Devices to virtio.
> > 
> > Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> > and/or cost characteristics compared to commonly available block
> > devices by getting the entire LBA space of the device divided to block
> > regions that are much larger than the LBA size. These regions are
> > called zones and they can only be written sequentially. More details
> > about ZBDs can be found at
> > 
> > https://zonedstorage.io/docs/introduction/zoned-storage .
> > 
> > In its current form, the virtio protocol for block devices (virtio-blk)
> > is not aware of ZBDs but it allows the guest to successfully scan a
> > host-managed drive provided by the host. As the result, the
> > host-managed drive appears at the guest as a regular drive that will
> > operate erroneously under the most common write workloads.
> > 
> > To fix this, the virtio-blk protocol needs to be extended to add the
> > capabilities to convey the zone characteristics of host ZBDs to the
> > guest and to provide the support for ZBD-specific commands - Report
> > Zones, four zone operations and (optionally) Zone Append. The proposed
> > standard extension aims to provide this functionality.
> > 
> > This patch extends the virtio-blk section of virtio specification with
> > the minimum set of requirements that are necessary to support ZBDs.
> > The resulting device model is a subset of the models defined in ZAC/ZBC
> > and ZNS standards documents. The included functionality mirrors
> > the existing Linux kernel block layer ZBD support and should be
> > sufficient to handle the host-managed and host-aware HDDs that are on
> > the market today as well as ZNS SSDs that are entering the market at
> > the moment of this patch submission.
> > 
> > I have developed a proof of concept patch series that adds ZBD support
> > to virtio-blk Linux kernel driver by implementing the protocol
> > extensions defined in the spec patch. I would like to receive the
> > initial feedback on this specification patch before posting that series
> > to the block LKML.
> > 
> > I would like to thank the following people for their useful feedback
> > and suggestions while working on the initial iterations of this patch.
> > 
> > Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > Matias Bjørling <Matias.Bjorling@wdc.com>
> > Niklas Cassel <Niklas.Cassel@wdc.com>
> > Hans Holmberg <Hans.Holmberg@wdc.com>
> > 
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  content.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 685 insertions(+), 1 deletion(-)
> > 
> > diff --git a/content.tex b/content.tex
> > index 7508dd1..8ae7578 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -4557,6 +4557,11 @@ \subsection{Feature bits}\label{sec:Device Types /
> > Block Device / Feature bits}
> >       maximum erase sectors count in \field{max_secure_erase_sectors} and
> >       maximum erase segment number in \field{max_secure_erase_seg}.
> > 
> > +\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a
> > device
> > +       that behaves as defined by the T10 Zoned Block Command standard (ZBC
> > r05) or
> > +       the NVMe(TM) NVM Express Zoned Namespace Command Set Specification
> > 1.1b
> > +       (ZNS).
> 
> The rest of the spec doesn't mention ZBC/ZNS but at least the zone
> state machine is likely to be derived from these standards. Have you
> reviewed the license/IP terms of VIRTIO, ZBC, and ZNS to check that
> it's okay to do this? It's probably okay but I wanted to mention it
> because it would be a shame to discover an IP issue after the VIRTIO
> spec has been released.

I've read the virtio LICENSE.md file and visited the links that are included in
this file.

I am not a legal expert, but I don't expect any IP-related issues to arise by
including this patch. The T10 committee is an INCITS entity which is a part of
ANSI, a public standards body. This is the committee that develops SCSI standards
that are already supported by the virtio spec and ZBC is simply another SCSI
standard, alongside with SPC, SBC and so on. The member companies of T10 usually
have a legal review process to make sure that all the standards that are being
released are not encumbered by any patent claims, etc.

ZNS is a product of NVM Express, an open standard consortium. Here is the page on
their site that has the link to ZNS command set specification pdf file -

https://nvmexpress.org/developers/nvme-command-set-specifications/

This consortium also has a legal review process and I don't foresee any IP-
related difficulties that would arise from following this specification for the
purposes of making a virtio ZBD standard.

> 
> Also, the wording suggests to me that the spec will refer to ZBC/ZNS
> instead of defining the behavior itself, but actually you've written a
> full self-contained spec (which is great!). Maybe change "behaves as
> defined by" to "follows the zoned storage device model that is also
> supported by industry standards such as".

Agreed, the wording here could be more precise. This could be "follows the zoned
storage device behavior that is also supported by industry standards such as". I
would avoid using the word "model" here because it usually refers specifically to
host-managed vs. host-aware vs. drive-managed device models in ZBD world.

> 
> 
> > +
> >  \end{description}
> > 
> >  \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types /
> > Block Device / Feature bits / Legacy Interface: Feature bits}
> > @@ -4589,6 +4594,31 @@ \subsection{Device configuration
> > layout}\label{sec:Device Types / Block Device /
> >  \field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are
> > expressed
> >  in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is
> > negotiated.
> > 
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
> > +\field{virtio_blk_zoned_characteristics},
> > +\begin{itemize}
> > +\item \field{zone_sectors} value is expressed in 512-byte sectors.
> > +\item \field{max_append_sectors} value is expressed in 512-byte sectors.
> > +\item \field{write_granularity} value is expressed in bytes.
> > +\end{itemize}
> > +
> > +The \field{model} field in \field{zoned} may have the following values:
> > +VIRTIO_BLK_Z_HM(1) and VIRTIO_BLK_Z_HA(2).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_Z_HM        1
> > +#define VIRTIO_BLK_Z_HA        2
> > +\end{lstlisting}
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > +\begin{itemize}
> > +\item The value of VIRTIO_BLK_Z_HM MUST be set by the device if it operates
> > +    as a host-managed zoned block device.
> 
> The term host-managed is not defined. Please either define it or
> include a reference to the zoned storage device model (maybe
> zonedstorage.io?).

OK, indeed. There is some text later in the patch describing the differences
between Sequential Write Required (SWR) and Sequential Write Preferred (SWP)
zones, but the wording that relates HM to SWR and HA to SWP is missing, will add
it here.

> 
> > +
> > +\item The value of VIRTIO_BLK_Z_HA MUST be set by the device if it operates
> > +    as a host-aware zoned block device.
> 
> The term host-aware is not defined.
> 
> > +\end{itemize}
> 
> The VIRTIO spec is organized into normative sections containing
> MUST/MAY/SHOULD and descriptive sections. MUST/MAY/SHOULD need to go
> in drivernormative or devicenormative sections and can't be used in
> descriptive sections.
> (I believe the reason for this is to give driver/device implementors a
> definitive list of normative statements they can easily review without
> re-reading the entire text of the specification.)


Got it, will reword.

> +
>  \begin{lstlisting}
>  struct virtio_blk_config {
>          le64 capacity;
> @@ -4623,6 +4653,15 @@ \subsection{Device configuration
layout}\label{sec:Device Types / Block Device /
>          le32 max_secure_erase_sectors;
>          le32 max_secure_erase_seg;
>          le32 secure_erase_sector_alignment;
> +        struct virtio_blk_zoned_characteristics {
> +                le32 zone_sectors;
> +                le32 max_open_zones;
> +                le32 max_active_zones;
> +                le32 max_append_sectors;
> +                le32 write_granularity;
> +                u8 model;
> +                u8 unused2[3];
> +        } zoned;
>  };
>  \end{lstlisting}
>
> @@ -4686,6 +4725,10 @@ \subsection{Device Initialization}\label{sec:Device
Types / Block Device / Devic
>      \field{secure_erase_sector_alignment} can be used by OS when splitting a
>      request based on alignment.
>
> +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
> +    \field{zoned} can be read by the driver to determine the zone
> +    characteristics of the device. All \field{zoned} fields are read-only.
> +
>  \end{enumerate}
>
>  \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block
Device / Device Initialization}
> @@ -4701,6 +4744,24 @@ \subsection{Device Initialization}\label{sec:Device
Types / Block Device / Devic
>  The driver MUST NOT read \field{writeback} before setting
>  the FEATURES_OK \field{device status} bit.
>
> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are incapable
> +of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned model.

> This raises the question whether devices can offer VIRTIO_BLK_F_ZONED
> but fall back to regular block device operation if the driver does not
> negotiate the bit?

It could be useful (for testing/development, etc.) to allow the device to offer
the VIRTIO_BLK_F_ZONED feature even when the actual backing block device is not
zoned. In this case, it the driver is capable of handling ZBDs, it will accept it
and the i/o should be operational via virtio_blk_zoned_req requests with
VIRTIO_BLK_Z_NONE model. Therefore there is a SHOULD NOT. Perhaps it is worth it
to add this explanation to the text... or not to allow this at all and change the
text to "must not".

Regarding fallbacks, there are clauses below that define these situations and
they mostly boil down to the following -
1. if the device offers VIRTIO_BLK_F_ZONED with an HM backing device and the
driver doesn't negotiate the feature, the device needs to disallow the device to
operate by not setting FEATURES_OK bit.
2. if the device offers VIRTIO_BLK_F_ZONED with an *HA* backing device and the
driver doesn't negotiate the feature, the device needs to present the device as
non-zoned because in this case it will still work.

> +
> +If the VIRTIO_BLK_F_ZONED feature is offered by the device, the
> +VIRTIO_BLK_F_DISCARD feature MUST NOT be offered.
> 
> This statement partially precludes that because it requires devices to
> disable VIRTIO_BLK_F_DISCARD if they advertize VIRTIO_BLK_F_ZONED. I
> think there is a way to allow VIRTIO_BLK_F_DISCARD in
> !VIRTIO_BLK_F_ZONED mode while forbidding it in VIRTIO_BLK_F_ZONED
> mode: the device can fail to set FEATURES_OK when the driver writes
> the Device Status field with VIRTIO_BLK_F_DISCARD &&
> VIRTIO_BLK_F_ZONED. That way a host-aware device in
> !VIRTIO_BLK_F_ZONED mode could allow VIRTIO_BLK_F_DISCARD. Is this
> useful?

Good catch, I verified that HA devices can usually handle discard. In this case,
the clause above becomes too strict. Perhaps it should be changed to

If the VIRTIO_BLK_F_ZONED feature is offered by the device with the
VIRTIO_BLK_Z_HM zone model, then
 - the VIRTIO_BLK_F_DISCARD feature must not be offered by the driver.
 - if the driver sets both VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_ZONED feature
bits, the device must fail by not setting FEATURES_OK bit.

> 
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > +\begin{itemize}
> > +\item If the driver that can not support host-managed zoned devices
> > +    reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the
> > +    driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit.
> > +
> > +\item If the driver that can not support support zoned devices reads
> > +    VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the
> > driver
> > +    MAY present the device to the guest as a non-zoned device. In this case,
> > the
> 
> Please avoid using the terms "guest" or "host" in the virtualization
> sense and instead use "driver" and "device". VIRTIO can be used in
> non-virtualization use cases and it's clearer to avoid virtualization
> terminology.

OK, will change this.

> 
> > +    driver SHOULD ignore all other fields in \field{zoned}.
> 
> Why would the driver negotiate VIRTIO_BLK_F_ZONED if it doesn't
> support zoned devices?

This covers cases when the driver can recognize the VIRTIO_BLK_F_ZONED feature,
but it is incapable to support ZBDs for some reason. For example, the Linux
virtio-blk driver could be up to date with ZBD support, but if the kernel is
built without CONFIG_BLK_DEV_ZONED option, it will not be able to handle HM
devices. But it still can deal with HA drives in this case.

> 
> > +\end{itemize}
> > +
> >  \devicenormative{\subsubsection}{Device Initialization}{Device Types / Block
> > Device / Device Initialization}
> > 
> >  Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it
> > @@ -4712,6 +4773,77 @@ \subsection{Device Initialization}\label{sec:Device
> > Types / Block Device / Devic
> >  The device MUST initialize padding bytes \field{unused0} and
> >  \field{unused1} to 0.
> > 
> > +If the device that is being initialized is a not a zoned device, the device
> > MUST
> > +NOT offer the VIRTIO_BLK_F_ZONED feature.
> > +
> > +If the VIRTIO_BLK_F_ZONED bit is not set by the driver,
> > +\begin{itemize}
> > +\item the device with the VIRTIO_BLK_Z_HA zone model SHOULD proceed with the
> > +    initialization while setting all zoned topology fields to zero.
> > +
> > +\item the device with the VIRTIO_BLK_Z_HM zone model MUST report the device
> > +    capacity in \field{capacity} in the configuration space as zero to
> > prevent
> > +    the use of the device that is incorrectly recognized by the driver as
> > +    a non-zoned device.
> 
> I suggest changing this to "MUST fail to set the FEATURES_OK device
> status bit when the driver writes the Device Status field". That way
> the driver gets a clear error instead of seeing a virtio-blk device
> with 0 capacity.

OK. I was under impression that only the driver can do this. It's certainly
better to fail the initialization in this case.

> 
> 
> > +\end{itemize}
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated,
> > +\begin{itemize}
> > +\item \field{zoned} can be read by the driver to discover the size of a
> > single
> > +    zone on the device. All zones of the device have the same size indicated
> > by
> > +    the \field{zone_sectors} field of \field{zoned} except for the last zone
> > +    that MAY be smaller than all other zones. The driver can calculate the
> > +    number of zones on the device as
> > +    \begin{lstlisting}
> > +        nr_zones = (capacity + zone_sectors - 1) / zone_sectors;
> > +    \end{lstlisting}
> > +    and the size of the last zone as
> > +    \begin{lstlisting}
> > +        zs_last = capacity - (nr_zones - 1) * zone_sectors;
> > +    \end{lstlisting}
> > +
> > +\item Zones consume volatile device resources while being in certain states
> > and
> > +    the device MAY set limits on the number of zones that can be in these
> > states
> > +    simultaneously.
> > +
> > +    Zoned block devices use two internal counters to account for the device
> > +    resources in use, the number of currently open zones and the number of
> > +    currently active zones.
> > +
> > +    Any zone state transition from a state that doesn't consume a zone
> > resource
> > +    to a state that consumes the same resource increments the internal
> > device
> > +    counter for that resource. Any zone transition out of a state that
> > consumes
> > +    a zone resource to a state that doesn't consume the same resource
> > decrements
> > +    the counter. Any request that causes the device to exceed the reported
> > zone
> > +    resource limits is terminated by the device with an error.
> 
> "an error" is vague. I was wondering what the exact error should be.
> Please either mention the error codes here or add "as defined for
> specific commands later" so it's clear that the device doesn't get the
> choose the error code but they are defined by the spec.

Right. The error is defined later in the text, it's worth mentioning it here.

> 
> > +
> > +\item The \field{max_open_zones} field of the \field{zoned} structure can be
> > +    read by the driver to discover the maximum number of zones that can be
> > open
> > +    on the device (zones in the implicit open or explicit open state). A
> > value
> > +    of zero indicates that the device does not have any limit on the number
> > of
> > +    open zones.
> > +
> > +\item The \field{max_active_zones} field of the \field{zoned} structure can
> > be
> > +    read by the driver to discover the maximum number zones that can be
> > active
> > +    on the device (zones in the implicit open, explicit open or closed
> > state).
> > +    A value of zero indicates that the device does not have any limit on the
> > +    number of active zones.
> > +
> > +\item the \field{max_append_sectors} field of \field{zoned} can be read by
> > the
> > +    driver to get the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND
> > request
> > +    issued to the device. The value of this field MUST NOT exceed the
> > +    \field{seg_max} * \field{size_max} value. A device MAY set the
> > +    \field{max_append_sectors} to zero if it doesn't support
> > +    VIRTIO_BLK_T_ZONE_APPEND requests.
> 
> The MUST NOT/MAY statements need to be in a drivernormative section.
> There are more instances below.

Yea, I need to sort this out everywhere in the patch.

> 
> 
> > +
> > +\item the \field{write_granularity} field of \field{zoned} can be read by
> > the
> > +    driver to discover the offset and size alignment constraint for
> > +    VIRTIO_BLK_T_OUT and VIRTIO_BLK_T_ZONE_APPEND requests issued to
> > +    a sequential zone of the device.
> > +
> > +\item the device MUST initialize padding bytes \field{unused2} to 0.
> > +\end{itemize}
> > +
> >  \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device
> > Types / Block Device / Device Initialization / Legacy Interface: Device
> > Initialization}
> > 
> >  Because legacy devices do not have FEATURES_OK, transitional devices
> > @@ -4738,7 +4870,8 @@ \subsubsection{Legacy Interface: Device
> > Initialization}\label{sec:Device Types /
> >  \subsection{Device Operation}\label{sec:Device Types / Block Device / Device
> > Operation}
> > 
> >  The driver queues requests to the virtqueues, and they are used by
> > -the device (not necessarily in order). Each request is of form:
> > +the device (not necessarily in order). If the VIRTIO_BLK_F_ZONED feature
> > +is not negotiated, then each request is of form:
> > 
> >  \begin{lstlisting}
> >  struct virtio_blk_req {
> > @@ -4853,6 +4986,331 @@ \subsection{Device Operation}\label{sec:Device Types
> > / Block Device / Device Ope
> >  command produces VIRTIO_BLK_S_IOERR.  A segment may have completed
> >  successfully, failed, or not been processed by the device.
> > 
> > +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
> > +negotiated.
> > +
> > +Each request is of form:
> > +
> > +\begin{lstlisting}
> > +struct virtio_blk_zoned_req {
> > +        le32 type;
> > +        le32 reserved;
> > +        le64 sector;
> > +        union zoned_params {
> > +                struct {
> > +                        /* ALL zone operation flag */
> > +                        __u8 all;
> > +                        __u8 unused1[3];
> > +                } mgmt_send;
> > +                struct {
> > +                        /* Partial zone report flag */
> > +                        __u8 partial;
> > +                        __u8 unused2[3];
> > +                } mgmt_receive;
> > +                struct {
> > +                        __u8 unused3[4];
> > +                } append;
> > +        } zone;
> > +        u8 data[];
> > +        le64 zone_result;
> > +        u8 status;
> > +        u8 reserved1[3];
> 
> Please make status the last byte so device implementations can reuse
> their existing request completion code to handle virtio_blk_zoned_req.
> Either reserved1[] can be removed or it can be moved before status.

OK.

> 
> Here is the non-union representation of the structs. Note that this
> means existing request processing code can handle IN/OUT/FLUSH even
> when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> code paths for these common commands:
> 
> /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> struct virtio_blk_req {
>         le32 type;
>         le32 reserved;
>         le64 sector;
>         u8 data[];
>         u8 status;
> };
> 
> /* ZONE_APPEND */
> struct virtio_blk_req_zone_append {
>         le32 type;
>         le32 reserved;
>         le64 sector;
>         u8 data[];
>         le64 zone_result;
>         u8 status;
> };
> 
> /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> struct virtio_blk_req_zone_mgmt_send {
>         le32 type;
>         le32 reserved;
>         le64 sector;
>         /* ALL zone operation flag */
>         __u8 all;
>         __u8 unused1[3];
>         u8 data[];
>         u8 status;
> };
> 
> /* ZONE_REPORT */
> struct virtio_blk_req_zone_mgmt_receive {
>         le32 type;
>         le32 reserved;
>         le64 sector;
>         /* Partial zone report flag */
>         __u8 partial;
>         __u8 unused2[3];
>         u8 data[];
>         u8 status;
> };

We felt that it is less confusing to have a single ZBD request structure so there
is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice of
the request -

VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req

We also want to make sure that zone management/send/receive and append requests
are always the same size and it should be easier with a single request
structure. 

It is possible to use a simpler, but less structured union here -

struct virtio_blk_zoned_req {
        le32 type;
        le32 reserved;
        le64 sector;
        union zoned_params {
	        /* ALL zone operation flag */
	        __u8 all;
	        /* Partial zone report flag */
	        __u8 partial;
	        __u32 rsvd1;
        } zone;
        u8 data[];
        le64 zone_result;
        u8 reserved1[3];
        u8 status;
}

Using the single code path is very important and the PoC Linux driver code that I
have sketched up achieves that. Since the request header is identical for the
both cases, I hope that it can be done in QEMU too.

> 
> > +};
> > +\end{lstlisting}
> > +
> > +In addition to the request types defined for non-zoned devices, the type of
> > the
> > +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone
> > open
> > +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close (VIRTIO_BLK_T_ZONE_CLOSE),
> > a
> > +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append
> > (VIRTIO_BLK_T_ZONE_APPEND)
> > +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_T_ZONE_APPEND    15
> > +#define VIRTIO_BLK_T_ZONE_REPORT    16
> > +#define VIRTIO_BLK_T_ZONE_OPEN      18
> > +#define VIRTIO_BLK_T_ZONE_CLOSE     20
> > +#define VIRTIO_BLK_T_ZONE_FINISH    22
> > +#define VIRTIO_BLK_T_ZONE_RESET     24
> > +\end{lstlisting}
> > +
> > +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the
> > type
> 
> Here "reads" means data transfers from the device to the driver rather
> than disk reads. I found that confusing. Maybe
> "VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?

Sure, your wording is better.

> For a cur
> > +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> > +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
> > VIRTIO_BLK_T_ZONE_RESET
> > +are non-data requests.
> > +
> > +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
> 
> "ZBD standards" is not defined. I think this refers to the zoned
> storage device model and its embodiment in the ZBC/ZNS specs.

> The configuration space fields already allow the driver to calculate
> the total number of zones. What is the purpose of partial=0, it
> doesn't seem to provide any new information?

Both ZBC and ZNS have this categorization. "ZBD standards" can be defined earlier
in the text as a catchall for ZBC/ZNS.

If partial is 0, the nr_zones will contain the number of zones that potentially
can be reported starting from the request "sector" position and until the end of
sector range of the device. It does not necessarily equal the total number of
zones on the drive. 

> 
> When the device marks a virtqueue buffer used, it indicates the number
> of bytes written (struct virtq_used_elem's len field). The driver can
> use this field to determine the number of descriptors filled in by the
> device. Is the nr_zones field really necessary?

This field can help when the buffer is larger than the returned number of
descriptors, but nr_zones is still needed for the partial=0 case above.

> 
> 
> > +
> > +A zone descriptor has the following structure:
> > +
> > +\begin{lstlisting}
> > +struct virtio_blk_zone_descriptor {
> > +        le64   z_cap;
> > +        le64   z_start;
> > +        le64   z_wp;
> > +        u8     z_type;
> > +        u8     z_state;
> > +        u8     reserved[38];
> > +};
> > +\end{lstlisting}
> > +
> > +The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor}
> > +indicates the type of the zone. The available types are
> > VIRTIO_BLK_ZT_CONV(1),
> > +VIRTIO_BLK_ZT_SWR(2) or VIRTIO_BLK_ZT_SWP(3).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_ZT_CONV     1
> > +#define VIRTIO_BLK_ZT_SWR      2
> > +#define VIRTIO_BLK_ZT_SWP      3
> > +\end{lstlisting}
> > +
> > +Read and write operations into zones with the VIRTIO_BLK_ZT_CONV
> > (Conventional)
> > +type have the same behavior as read and write operations on a regular block
> > +device. Any block in a conventional zone can be read or written at any time
> > and
> > +in any order.
> > +
> > +Zones with VIRTIO_BLK_ZT_SWR (Sequential Write Required or SWR) can be read
> > +randomly, but MUST be written sequentially at a certain point in the zone
> > called
> > +the Write Pointer (WP). With every write, the Write Pointer is incremented
> > by
> > +the number of sectors written.
> > +
> > +Zones with VIRTIO_BLK_ZT_SWP (Sequential Write Preferred or SWP) can be read
> > +randomly and SHOULD be written sequentially, similarly to SWR zones.
> > However,
> > +SWP zones can accept random write operations, that is, VIRTIO_BLK_T_OUT
> > requests
> > +with a start sector different from the zone write pointer position.
> > +
> > +The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates
> > the
> > +state of the device zone. The available zone states are
> > VIRTIO_BLK_ZS_NOT_WP(0),
> > +VIRTIO_BLK_ZS_EMPTY(1), VIRTIO_BLK_ZS_IOPEN(2), VIRTIO_BLK_ZS_EOPEN(3),
> > +VIRTIO_BLK_ZS_CLOSED(4), VIRTIO_BLK_ZS_RDONLY(13), VIRTIO_BLK_ZS_FULL(14)
> > and
> > +VIRTIO_BLK_ZS_OFFLINE(15).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_ZS_NOT_WP   0
> > +#define VIRTIO_BLK_ZS_EMPTY    1
> > +#define VIRTIO_BLK_ZS_IOPEN    2
> > +#define VIRTIO_BLK_ZS_EOPEN    3
> > +#define VIRTIO_BLK_ZS_CLOSED   4
> > +#define VIRTIO_BLK_ZS_RDONLY   13
> > +#define VIRTIO_BLK_ZS_FULL     14
> > +#define VIRTIO_BLK_ZS_OFFLINE  15
> > +\end{lstlisting}
> > +
> > +Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be
> > in
> > +the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and
> > +VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state.
> > +
> > +Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open),
> > +VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed)
> > state
> > +are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only),
> > VIRTIO_BLK_ZS_FULL
> > +(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer
> > +value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones.
> > +
> > +The zone descriptor field \field{z_cap} contains the maximum number of 512-
> > byte
> > +sectors that are available to be written with user data when the zone is in
> > the
> > +Empty state. This value shall be less than or equal to the
> > \field{zone_sectors}
> > +value in \field{virtio_blk_zoned_characteristics} structure in the device
> > +configuration space.
> > +
> > +The zone descriptor field \field{z_start} contains the 64-bit address of the
> > +first 512-byte sector of the zone.
> > +
> > +The zone descriptor field \field{z_wp} contains the 64-bit sector address
> > where
> > +the next write operation for this zone should be issued. This value is
> > undefined
> > +for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY,
> > VIRTIO_BLK_ZS_FULL
> > +and VIRTIO_BLK_ZS_OFFLINE state.
> > +
> > +Depending on their state, zones consume resources as follows:
> > +\begin{itemize}
> > +\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes
> > one
> > +    open zone resource and, additionally,
> > +
> > +\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and
> > +    VIRTIO_BLK_ZS_CLOSED state consumes one active resource.
> > +\end{itemize}
> > +
> > +Attempts for zone transitions that violate zone resource limits MUST fail
> > with
> > +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE
> > +\field{zone_result}.
> 
> Why use zone_result instead of the status field? The _S_ naming
> suggests this is a status field constant, not a zone_result constant.

zone_result only a valid ZBD error if the status is VIRTIO_BLK_S_IOERR. If the
status is VIRTIO_BLK_S_OK, this field may contain ALBA for Zone Append request.
It is possible to just define the ZBD-specific error codes following
VIRTIO_BLK_S_UNSUPP, but I think it is cleaner to create a separate numbering for
ZBD-specific codes and keep the non-zoned errors as-is in their impressive
simplicity.

The ZBD error codes all currently have _S_ZONE_ naming, but it might be better to
use something like _ZR_ for example, i.e.

#define VIRTIO_BLK_ZR_INVALID_CMD     0
#define VIRTIO_BLK_ZR_UNALIGNED_WP    1
#define VIRTIO_BLK_ZR_OPEN_RESOURCE   2
#define VIRTIO_BLK_ZR_ACTIVE_RESOURCE 3

> 
> 
> > +
> > +Zones in the VIRTIO_BLK_ZS_EMPTY (Empty) state have the write pointer value
> > +equal to the start sector of the zone. In this state, the entire capacity of
> > the
> > +zone is available for writing. A zone can transition from this state to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
> > +    VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the
> > zone.
> > +
> > +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request
> > is
> > +    received for the zone
> > +\end{itemize}
> > +
> > +When a VIRTIO_BLK_T_ZONE_RESET request is issued to an Empty zone, the
> > request
> > +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EMPTY
> > state.
> > +
> > +Zones in the VIRTIO_BLK_ZS_IOPEN (Implicitly Open) state can transition from
> > +this state to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request
> > is
> > +    received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request
> > is
> > +    received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request
> > is
> > +    received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_CLOSED implicitly by the device when another zone is
> > +    entering the VIRTIO_BLK_ZS_IOPEN or VIRTIO_BLK_ZS_EOPEN state and the
> > number
> > +    of currently open zones is at \field{max_open_zones} limit,
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request
> > is
> > +    received for the zone.
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
> > +    VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its
> > writable
> > +    capacity is received for the zone.
> > +\end{itemize}
> > +
> > +Zones in the VIRTIO_BLK_ZS_EOPEN (Explicitly Open) state can transition from
> > +this state to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request
> > is
> > +    received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_CLOSE request
> > is
> > +    received for the zone and the write pointer of the zone has the value
> > equal
> > +    to the start sector of the zone,
> > +
> > +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request
> > is
> > +    received for the zone and the zone write pointer is larger then the
> > start
> > +    sector of the zone,
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request
> > is
> > +    received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
> > +    VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its
> > writable
> > +    capacity is received for the zone.
> > +\end{itemize}
> > +
> > +When a VIRTIO_BLK_T_ZONE_EOPEN request is issued to an Explicitly Open zone,
> > the
> > +request is completed successfully and the zone stays in the
> > VIRTIO_BLK_ZS_EOPEN
> > +state.
> > +
> > +Zones in the VIRTIO_BLK_ZS_CLOSED (Closed) state can transition from this
> > state
> > +to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request
> > is
> > +    received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
> > +    VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the
> > zone.
> > +
> > +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request
> > is
> > +    received for the zone,
> > +\end{itemize}
> > +
> > +When a VIRTIO_BLK_T_ZONE_CLOSE request is issued to a Closed zone, the
> > request
> > +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_CLOSED
> > state.
> > +
> > +Zones in the VIRTIO_BLK_ZS_FULL (Full) state can transition from this state
> > to
> > +VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
> > +received for the zone
> > +
> > +When a VIRTIO_BLK_T_ZONE_FINISH request is issued to a Full zone, the
> > request
> > +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_FULL
> > state.
> > +
> > +The device MAY automatically transition zones to VIRTIO_BLK_ZS_RDONLY
> > +(Read-Only) or VIRTIO_BLK_ZS_OFFLINE (Offline) state from any other state.
> > The
> > +device MAY also automatically transition zones in the Read-Only state to the
> > +Offline state. Zones in the Offline state MAY NOT transition to any other
> > state.
> > +Such automatic transitions usually indicate hardware failures. The
> > previously
> > +written data may only be read from zones in the Read-Only state. Zones in
> > the
> > +Offline state can not be read or written.
> 
> What is the status code when this is attempted?

It is stated elsewhere -
All requests issued to a zone in the VIRTIO_BLK_ZS_OFFLINE state SHALL be
completed with VIRTIO_BLK_S_IOERR \field{status} and
VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{zone_result}.




^ permalink raw reply	[flat|nested] 7+ messages in thread

* [virtio-comment] Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
  2022-06-03  2:30   ` Dmitry Fomichev
@ 2022-06-03  8:08     ` Stefan Hajnoczi
  2022-06-04  0:09       ` Dmitry Fomichev
       [not found]     ` <CAJSP0QXiUy-=pyGtxjpguWFt=+HJFqjKo3ipwN+YpF43nsxAag@mail.gmail.com>
  1 sibling, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2022-06-03  8:08 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Niklas Cassel, hare, its, Matias Bjørling, virtio-dev,
	virtio-comment, Hans Holmberg, faithilikerun, damien.lemoal,
	stefanha

[-- Attachment #1: Type: text/plain, Size: 34882 bytes --]

(Resent from my subscriber email.)

On Fri, 3 Jun 2022 at 03:30, Dmitry Fomichev <Dmitry.Fomichev@wdc.com> wrote:
>
> On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> > (Resending from my mailing list subscriber address so this makes it onto
> > the list.)
> >
> > On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
> > >
> > > Introduce support for Zoned Block Devices to virtio.
> > >
> > > Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> > > and/or cost characteristics compared to commonly available block
> > > devices by getting the entire LBA space of the device divided to block
> > > regions that are much larger than the LBA size. These regions are
> > > called zones and they can only be written sequentially. More details
> > > about ZBDs can be found at
> > >
> > > https://zonedstorage.io/docs/introduction/zoned-storage .
> > >
> > > In its current form, the virtio protocol for block devices (virtio-blk)
> > > is not aware of ZBDs but it allows the guest to successfully scan a
> > > host-managed drive provided by the host. As the result, the
> > > host-managed drive appears at the guest as a regular drive that will
> > > operate erroneously under the most common write workloads.
> > >
> > > To fix this, the virtio-blk protocol needs to be extended to add the
> > > capabilities to convey the zone characteristics of host ZBDs to the
> > > guest and to provide the support for ZBD-specific commands - Report
> > > Zones, four zone operations and (optionally) Zone Append. The proposed
> > > standard extension aims to provide this functionality.
> > >
> > > This patch extends the virtio-blk section of virtio specification with
> > > the minimum set of requirements that are necessary to support ZBDs.
> > > The resulting device model is a subset of the models defined in ZAC/ZBC
> > > and ZNS standards documents. The included functionality mirrors
> > > the existing Linux kernel block layer ZBD support and should be
> > > sufficient to handle the host-managed and host-aware HDDs that are on
> > > the market today as well as ZNS SSDs that are entering the market at
> > > the moment of this patch submission.
> > >
> > > I have developed a proof of concept patch series that adds ZBD support
> > > to virtio-blk Linux kernel driver by implementing the protocol
> > > extensions defined in the spec patch. I would like to receive the
> > > initial feedback on this specification patch before posting that series
> > > to the block LKML.
> > >
> > > I would like to thank the following people for their useful feedback
> > > and suggestions while working on the initial iterations of this patch.
> > >
> > > Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > Matias Bjørling <Matias.Bjorling@wdc.com>
> > > Niklas Cassel <Niklas.Cassel@wdc.com>
> > > Hans Holmberg <Hans.Holmberg@wdc.com>
> > >
> > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > ---
> > >  content.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 685 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/content.tex b/content.tex
> > > index 7508dd1..8ae7578 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -4557,6 +4557,11 @@ \subsection{Feature bits}\label{sec:Device Types /
> > > Block Device / Feature bits}
> > >       maximum erase sectors count in \field{max_secure_erase_sectors} and
> > >       maximum erase segment number in \field{max_secure_erase_seg}.
> > >
> > > +\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a
> > > device
> > > +       that behaves as defined by the T10 Zoned Block Command standard (ZBC
> > > r05) or
> > > +       the NVMe(TM) NVM Express Zoned Namespace Command Set Specification
> > > 1.1b
> > > +       (ZNS).
> >
> > The rest of the spec doesn't mention ZBC/ZNS but at least the zone
> > state machine is likely to be derived from these standards. Have you
> > reviewed the license/IP terms of VIRTIO, ZBC, and ZNS to check that
> > it's okay to do this? It's probably okay but I wanted to mention it
> > because it would be a shame to discover an IP issue after the VIRTIO
> > spec has been released.
>
> I've read the virtio LICENSE.md file and visited the links that are included in
> this file.
>
> I am not a legal expert, but I don't expect any IP-related issues to arise by
> including this patch. The T10 committee is an INCITS entity which is a part of
> ANSI, a public standards body. This is the committee that develops SCSI standards
> that are already supported by the virtio spec and ZBC is simply another SCSI
> standard, alongside with SPC, SBC and so on. The member companies of T10 usually
> have a legal review process to make sure that all the standards that are being
> released are not encumbered by any patent claims, etc.
>
> ZNS is a product of NVM Express, an open standard consortium. Here is the page on
> their site that has the link to ZNS command set specification pdf file -
>
> https://nvmexpress.org/developers/nvme-command-set-specifications/
>
> This consortium also has a legal review process and I don't foresee any IP-
> related difficulties that would arise from following this specification for the
> purposes of making a virtio ZBD standard.
>
> >
> > Also, the wording suggests to me that the spec will refer to ZBC/ZNS
> > instead of defining the behavior itself, but actually you've written a
> > full self-contained spec (which is great!). Maybe change "behaves as
> > defined by" to "follows the zoned storage device model that is also
> > supported by industry standards such as".
>
> Agreed, the wording here could be more precise. This could be "follows the zoned
> storage device behavior that is also supported by industry standards such as". I
> would avoid using the word "model" here because it usually refers specifically to
> host-managed vs. host-aware vs. drive-managed device models in ZBD world.
>
> >
> >
> > > +
> > >  \end{description}
> > >
> > >  \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types /
> > > Block Device / Feature bits / Legacy Interface: Feature bits}
> > > @@ -4589,6 +4594,31 @@ \subsection{Device configuration
> > > layout}\label{sec:Device Types / Block Device /
> > >  \field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are
> > > expressed
> > >  in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is
> > > negotiated.
> > >
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
> > > +\field{virtio_blk_zoned_characteristics},
> > > +\begin{itemize}
> > > +\item \field{zone_sectors} value is expressed in 512-byte sectors.
> > > +\item \field{max_append_sectors} value is expressed in 512-byte sectors.
> > > +\item \field{write_granularity} value is expressed in bytes.
> > > +\end{itemize}
> > > +
> > > +The \field{model} field in \field{zoned} may have the following values:
> > > +VIRTIO_BLK_Z_HM(1) and VIRTIO_BLK_Z_HA(2).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_Z_HM        1
> > > +#define VIRTIO_BLK_Z_HA        2
> > > +\end{lstlisting}
> > > +
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > > +\begin{itemize}
> > > +\item The value of VIRTIO_BLK_Z_HM MUST be set by the device if it operates
> > > +    as a host-managed zoned block device.
> >
> > The term host-managed is not defined. Please either define it or
> > include a reference to the zoned storage device model (maybe
> > zonedstorage.io?).
>
> OK, indeed. There is some text later in the patch describing the differences
> between Sequential Write Required (SWR) and Sequential Write Preferred (SWP)
> zones, but the wording that relates HM to SWR and HA to SWP is missing, will add
> it here.
>
> >
> > > +
> > > +\item The value of VIRTIO_BLK_Z_HA MUST be set by the device if it operates
> > > +    as a host-aware zoned block device.
> >
> > The term host-aware is not defined.
> >
> > > +\end{itemize}
> >
> > The VIRTIO spec is organized into normative sections containing
> > MUST/MAY/SHOULD and descriptive sections. MUST/MAY/SHOULD need to go
> > in drivernormative or devicenormative sections and can't be used in
> > descriptive sections.
> > (I believe the reason for this is to give driver/device implementors a
> > definitive list of normative statements they can easily review without
> > re-reading the entire text of the specification.)
>
>
> Got it, will reword.
>
> > +
> >  \begin{lstlisting}
> >  struct virtio_blk_config {
> >          le64 capacity;
> > @@ -4623,6 +4653,15 @@ \subsection{Device configuration
> layout}\label{sec:Device Types / Block Device /
> >          le32 max_secure_erase_sectors;
> >          le32 max_secure_erase_seg;
> >          le32 secure_erase_sector_alignment;
> > +        struct virtio_blk_zoned_characteristics {
> > +                le32 zone_sectors;
> > +                le32 max_open_zones;
> > +                le32 max_active_zones;
> > +                le32 max_append_sectors;
> > +                le32 write_granularity;
> > +                u8 model;
> > +                u8 unused2[3];
> > +        } zoned;
> >  };
> >  \end{lstlisting}
> >
> > @@ -4686,6 +4725,10 @@ \subsection{Device Initialization}\label{sec:Device
> Types / Block Device / Devic
> >      \field{secure_erase_sector_alignment} can be used by OS when splitting a
> >      request based on alignment.
> >
> > +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
> > +    \field{zoned} can be read by the driver to determine the zone
> > +    characteristics of the device. All \field{zoned} fields are read-only.
> > +
> >  \end{enumerate}
> >
> >  \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block
> Device / Device Initialization}
> > @@ -4701,6 +4744,24 @@ \subsection{Device Initialization}\label{sec:Device
> Types / Block Device / Devic
> >  The driver MUST NOT read \field{writeback} before setting
> >  the FEATURES_OK \field{device status} bit.
> >
> > +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are incapable
> > +of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned model.
>
> > This raises the question whether devices can offer VIRTIO_BLK_F_ZONED
> > but fall back to regular block device operation if the driver does not
> > negotiate the bit?
>
> It could be useful (for testing/development, etc.) to allow the device to offer
> the VIRTIO_BLK_F_ZONED feature even when the actual backing block device is not
> zoned. In this case, it the driver is capable of handling ZBDs, it will accept it
> and the i/o should be operational via virtio_blk_zoned_req requests with
> VIRTIO_BLK_Z_NONE model. Therefore there is a SHOULD NOT. Perhaps it is worth it
> to add this explanation to the text... or not to allow this at all and change the
> text to "must not".
>
> Regarding fallbacks, there are clauses below that define these situations and
> they mostly boil down to the following -
> 1. if the device offers VIRTIO_BLK_F_ZONED with an HM backing device and the
> driver doesn't negotiate the feature, the device needs to disallow the device to
> operate by not setting FEATURES_OK bit.
> 2. if the device offers VIRTIO_BLK_F_ZONED with an *HA* backing device and the
> driver doesn't negotiate the feature, the device needs to present the device as
> non-zoned because in this case it will still work.
>
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is offered by the device, the
> > +VIRTIO_BLK_F_DISCARD feature MUST NOT be offered.
> >
> > This statement partially precludes that because it requires devices to
> > disable VIRTIO_BLK_F_DISCARD if they advertize VIRTIO_BLK_F_ZONED. I
> > think there is a way to allow VIRTIO_BLK_F_DISCARD in
> > !VIRTIO_BLK_F_ZONED mode while forbidding it in VIRTIO_BLK_F_ZONED
> > mode: the device can fail to set FEATURES_OK when the driver writes
> > the Device Status field with VIRTIO_BLK_F_DISCARD &&
> > VIRTIO_BLK_F_ZONED. That way a host-aware device in
> > !VIRTIO_BLK_F_ZONED mode could allow VIRTIO_BLK_F_DISCARD. Is this
> > useful?
>
> Good catch, I verified that HA devices can usually handle discard. In this case,
> the clause above becomes too strict. Perhaps it should be changed to
>
> If the VIRTIO_BLK_F_ZONED feature is offered by the device with the
> VIRTIO_BLK_Z_HM zone model, then
>  - the VIRTIO_BLK_F_DISCARD feature must not be offered by the driver.
>  - if the driver sets both VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_ZONED feature
> bits, the device must fail by not setting FEATURES_OK bit.

Sounds good.


>
> >
> > > +
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > > +\begin{itemize}
> > > +\item If the driver that can not support host-managed zoned devices
> > > +    reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the
> > > +    driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit.
> > > +
> > > +\item If the driver that can not support support zoned devices reads
> > > +    VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the
> > > driver
> > > +    MAY present the device to the guest as a non-zoned device. In this case,
> > > the
> >
> > Please avoid using the terms "guest" or "host" in the virtualization
> > sense and instead use "driver" and "device". VIRTIO can be used in
> > non-virtualization use cases and it's clearer to avoid virtualization
> > terminology.
>
> OK, will change this.
>
> >
> > > +    driver SHOULD ignore all other fields in \field{zoned}.
> >
> > Why would the driver negotiate VIRTIO_BLK_F_ZONED if it doesn't
> > support zoned devices?
>
> This covers cases when the driver can recognize the VIRTIO_BLK_F_ZONED feature,
> but it is incapable to support ZBDs for some reason. For example, the Linux
> virtio-blk driver could be up to date with ZBD support, but if the kernel is
> built without CONFIG_BLK_DEV_ZONED option, it will not be able to handle HM
> devices. But it still can deal with HA drives in this case.

I see, thanks.


>
> >
> > > +\end{itemize}
> > > +
> > >  \devicenormative{\subsubsection}{Device Initialization}{Device Types / Block
> > > Device / Device Initialization}
> > >
> > >  Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it
> > > @@ -4712,6 +4773,77 @@ \subsection{Device Initialization}\label{sec:Device
> > > Types / Block Device / Devic
> > >  The device MUST initialize padding bytes \field{unused0} and
> > >  \field{unused1} to 0.
> > >
> > > +If the device that is being initialized is a not a zoned device, the device
> > > MUST
> > > +NOT offer the VIRTIO_BLK_F_ZONED feature.
> > > +
> > > +If the VIRTIO_BLK_F_ZONED bit is not set by the driver,
> > > +\begin{itemize}
> > > +\item the device with the VIRTIO_BLK_Z_HA zone model SHOULD proceed with the
> > > +    initialization while setting all zoned topology fields to zero.
> > > +
> > > +\item the device with the VIRTIO_BLK_Z_HM zone model MUST report the device
> > > +    capacity in \field{capacity} in the configuration space as zero to
> > > prevent
> > > +    the use of the device that is incorrectly recognized by the driver as
> > > +    a non-zoned device.
> >
> > I suggest changing this to "MUST fail to set the FEATURES_OK device
> > status bit when the driver writes the Device Status field". That way
> > the driver gets a clear error instead of seeing a virtio-blk device
> > with 0 capacity.
>
> OK. I was under impression that only the driver can do this. It's certainly
> better to fail the initialization in this case.
>
> >
> >
> > > +\end{itemize}
> > > +
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated,
> > > +\begin{itemize}
> > > +\item \field{zoned} can be read by the driver to discover the size of a
> > > single
> > > +    zone on the device. All zones of the device have the same size indicated
> > > by
> > > +    the \field{zone_sectors} field of \field{zoned} except for the last zone
> > > +    that MAY be smaller than all other zones. The driver can calculate the
> > > +    number of zones on the device as
> > > +    \begin{lstlisting}
> > > +        nr_zones = (capacity + zone_sectors - 1) / zone_sectors;
> > > +    \end{lstlisting}
> > > +    and the size of the last zone as
> > > +    \begin{lstlisting}
> > > +        zs_last = capacity - (nr_zones - 1) * zone_sectors;
> > > +    \end{lstlisting}
> > > +
> > > +\item Zones consume volatile device resources while being in certain states
> > > and
> > > +    the device MAY set limits on the number of zones that can be in these
> > > states
> > > +    simultaneously.
> > > +
> > > +    Zoned block devices use two internal counters to account for the device
> > > +    resources in use, the number of currently open zones and the number of
> > > +    currently active zones.
> > > +
> > > +    Any zone state transition from a state that doesn't consume a zone
> > > resource
> > > +    to a state that consumes the same resource increments the internal
> > > device
> > > +    counter for that resource. Any zone transition out of a state that
> > > consumes
> > > +    a zone resource to a state that doesn't consume the same resource
> > > decrements
> > > +    the counter. Any request that causes the device to exceed the reported
> > > zone
> > > +    resource limits is terminated by the device with an error.
> >
> > "an error" is vague. I was wondering what the exact error should be.
> > Please either mention the error codes here or add "as defined for
> > specific commands later" so it's clear that the device doesn't get the
> > choose the error code but they are defined by the spec.
>
> Right. The error is defined later in the text, it's worth mentioning it here.
>
> >
> > > +
> > > +\item The \field{max_open_zones} field of the \field{zoned} structure can be
> > > +    read by the driver to discover the maximum number of zones that can be
> > > open
> > > +    on the device (zones in the implicit open or explicit open state). A
> > > value
> > > +    of zero indicates that the device does not have any limit on the number
> > > of
> > > +    open zones.
> > > +
> > > +\item The \field{max_active_zones} field of the \field{zoned} structure can
> > > be
> > > +    read by the driver to discover the maximum number zones that can be
> > > active
> > > +    on the device (zones in the implicit open, explicit open or closed
> > > state).
> > > +    A value of zero indicates that the device does not have any limit on the
> > > +    number of active zones.
> > > +
> > > +\item the \field{max_append_sectors} field of \field{zoned} can be read by
> > > the
> > > +    driver to get the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND
> > > request
> > > +    issued to the device. The value of this field MUST NOT exceed the
> > > +    \field{seg_max} * \field{size_max} value. A device MAY set the
> > > +    \field{max_append_sectors} to zero if it doesn't support
> > > +    VIRTIO_BLK_T_ZONE_APPEND requests.
> >
> > The MUST NOT/MAY statements need to be in a drivernormative section.
> > There are more instances below.
>
> Yea, I need to sort this out everywhere in the patch.
>
> >
> >
> > > +
> > > +\item the \field{write_granularity} field of \field{zoned} can be read by
> > > the
> > > +    driver to discover the offset and size alignment constraint for
> > > +    VIRTIO_BLK_T_OUT and VIRTIO_BLK_T_ZONE_APPEND requests issued to
> > > +    a sequential zone of the device.
> > > +
> > > +\item the device MUST initialize padding bytes \field{unused2} to 0.
> > > +\end{itemize}
> > > +
> > >  \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device
> > > Types / Block Device / Device Initialization / Legacy Interface: Device
> > > Initialization}
> > >
> > >  Because legacy devices do not have FEATURES_OK, transitional devices
> > > @@ -4738,7 +4870,8 @@ \subsubsection{Legacy Interface: Device
> > > Initialization}\label{sec:Device Types /
> > >  \subsection{Device Operation}\label{sec:Device Types / Block Device / Device
> > > Operation}
> > >
> > >  The driver queues requests to the virtqueues, and they are used by
> > > -the device (not necessarily in order). Each request is of form:
> > > +the device (not necessarily in order). If the VIRTIO_BLK_F_ZONED feature
> > > +is not negotiated, then each request is of form:
> > >
> > >  \begin{lstlisting}
> > >  struct virtio_blk_req {
> > > @@ -4853,6 +4986,331 @@ \subsection{Device Operation}\label{sec:Device Types
> > > / Block Device / Device Ope
> > >  command produces VIRTIO_BLK_S_IOERR.  A segment may have completed
> > >  successfully, failed, or not been processed by the device.
> > >
> > > +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
> > > +negotiated.
> > > +
> > > +Each request is of form:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_blk_zoned_req {
> > > +        le32 type;
> > > +        le32 reserved;
> > > +        le64 sector;
> > > +        union zoned_params {
> > > +                struct {
> > > +                        /* ALL zone operation flag */
> > > +                        __u8 all;
> > > +                        __u8 unused1[3];
> > > +                } mgmt_send;
> > > +                struct {
> > > +                        /* Partial zone report flag */
> > > +                        __u8 partial;
> > > +                        __u8 unused2[3];
> > > +                } mgmt_receive;
> > > +                struct {
> > > +                        __u8 unused3[4];
> > > +                } append;
> > > +        } zone;
> > > +        u8 data[];
> > > +        le64 zone_result;
> > > +        u8 status;
> > > +        u8 reserved1[3];
> >
> > Please make status the last byte so device implementations can reuse
> > their existing request completion code to handle virtio_blk_zoned_req.
> > Either reserved1[] can be removed or it can be moved before status.
>
> OK.
>
> >
> > Here is the non-union representation of the structs. Note that this
> > means existing request processing code can handle IN/OUT/FLUSH even
> > when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> > code paths for these common commands:
> >
> > /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> > struct virtio_blk_req {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         u8 data[];
> >         u8 status;
> > };
> >
> > /* ZONE_APPEND */
> > struct virtio_blk_req_zone_append {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         u8 data[];
> >         le64 zone_result;
> >         u8 status;
> > };
> >
> > /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> > struct virtio_blk_req_zone_mgmt_send {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         /* ALL zone operation flag */
> >         __u8 all;
> >         __u8 unused1[3];
> >         u8 data[];
> >         u8 status;
> > };
> >
> > /* ZONE_REPORT */
> > struct virtio_blk_req_zone_mgmt_receive {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         /* Partial zone report flag */
> >         __u8 partial;
> >         __u8 unused2[3];
> >         u8 data[];
> >         u8 status;
> > };
>
> We felt that it is less confusing to have a single ZBD request structure so there
> is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice of
> the request -
>
> VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
> VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req
>
> We also want to make sure that zone management/send/receive and append requests
> are always the same size and it should be easier with a single request
> structure.
>
> It is possible to use a simpler, but less structured union here -
>
> struct virtio_blk_zoned_req {
>         le32 type;
>         le32 reserved;
>         le64 sector;
>         union zoned_params {
>                 /* ALL zone operation flag */
>                 __u8 all;
>                 /* Partial zone report flag */
>                 __u8 partial;
>                 __u32 rsvd1;
>         } zone;
>         u8 data[];
>         le64 zone_result;
>         u8 reserved1[3];
>         u8 status;
> }
>
> Using the single code path is very important and the PoC Linux driver code that I
> have sketched up achieves that. Since the request header is identical for the
> both cases, I hope that it can be done in QEMU too.

There are two possibilities for optimizing the request layout:
1. For drivers and devices that implement both !VIRTIO_BLK_F_ZONED and
VIRTIO_BLK_F_ZONED. Here it's best to share the same request layout
for commands available in both modes. In other words IN and OUT
requests to zoned devices should use struct virtio_blk_req so code can
be shared and only commands unique to zoned devices use struct
virtio_blk_zoned_req.
2. For drivers and devices that implement only one of
!VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_ZONED. Here consistent request
layout is best so the code parses all commands in the same way. In
other words, IN and OUT requests to zoned devices use struct
virtio_blk_zoned_req so that the implementation doesn't have to bother
with struct virtio_blk_req at all.

I think you chose #2 and I gravitated towards #1. Which one is optimal
depends on the use case. Feel free to stick with #2.


>
> >
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +In addition to the request types defined for non-zoned devices, the type of
> > > the
> > > +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone
> > > open
> > > +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close (VIRTIO_BLK_T_ZONE_CLOSE),
> > > a
> > > +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append
> > > (VIRTIO_BLK_T_ZONE_APPEND)
> > > +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_T_ZONE_APPEND    15
> > > +#define VIRTIO_BLK_T_ZONE_REPORT    16
> > > +#define VIRTIO_BLK_T_ZONE_OPEN      18
> > > +#define VIRTIO_BLK_T_ZONE_CLOSE     20
> > > +#define VIRTIO_BLK_T_ZONE_FINISH    22
> > > +#define VIRTIO_BLK_T_ZONE_RESET     24
> > > +\end{lstlisting}
> > > +
> > > +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the
> > > type
> >
> > Here "reads" means data transfers from the device to the driver rather
> > than disk reads. I found that confusing. Maybe
> > "VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?
>
> Sure, your wording is better.
>
> > For a cur
> > > +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> > > +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
> > > VIRTIO_BLK_T_ZONE_RESET
> > > +are non-data requests.
> > > +
> > > +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
> >
> > "ZBD standards" is not defined. I think this refers to the zoned
> > storage device model and its embodiment in the ZBC/ZNS specs.
>
> > The configuration space fields already allow the driver to calculate
> > the total number of zones. What is the purpose of partial=0, it
> > doesn't seem to provide any new information?
>
> Both ZBC and ZNS have this categorization. "ZBD standards" can be defined earlier
> in the text as a catchall for ZBC/ZNS.

Great.

> If partial is 0, the nr_zones will contain the number of zones that potentially
> can be reported starting from the request "sector" position and until the end of
> sector range of the device. It does not necessarily equal the total number of
> zones on the drive.

The driver knows zone_sectors and nr_zones, so it can calculate
nr_zones - (sector / zone_sectors) itself. Why send a ZONE_REPORT
partial=0 request to the device to get this value?


>
> >
> > When the device marks a virtqueue buffer used, it indicates the number
> > of bytes written (struct virtq_used_elem's len field). The driver can
> > use this field to determine the number of descriptors filled in by the
> > device. Is the nr_zones field really necessary?
>
> This field can help when the buffer is larger than the returned number of
> descriptors, but nr_zones is still needed for the partial=0 case above.
>
> >
> >
> > > +
> > > +A zone descriptor has the following structure:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_blk_zone_descriptor {
> > > +        le64   z_cap;
> > > +        le64   z_start;
> > > +        le64   z_wp;
> > > +        u8     z_type;
> > > +        u8     z_state;
> > > +        u8     reserved[38];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor}
> > > +indicates the type of the zone. The available types are
> > > VIRTIO_BLK_ZT_CONV(1),
> > > +VIRTIO_BLK_ZT_SWR(2) or VIRTIO_BLK_ZT_SWP(3).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_ZT_CONV     1
> > > +#define VIRTIO_BLK_ZT_SWR      2
> > > +#define VIRTIO_BLK_ZT_SWP      3
> > > +\end{lstlisting}
> > > +
> > > +Read and write operations into zones with the VIRTIO_BLK_ZT_CONV
> > > (Conventional)
> > > +type have the same behavior as read and write operations on a regular block
> > > +device. Any block in a conventional zone can be read or written at any time
> > > and
> > > +in any order.
> > > +
> > > +Zones with VIRTIO_BLK_ZT_SWR (Sequential Write Required or SWR) can be read
> > > +randomly, but MUST be written sequentially at a certain point in the zone
> > > called
> > > +the Write Pointer (WP). With every write, the Write Pointer is incremented
> > > by
> > > +the number of sectors written.
> > > +
> > > +Zones with VIRTIO_BLK_ZT_SWP (Sequential Write Preferred or SWP) can be read
> > > +randomly and SHOULD be written sequentially, similarly to SWR zones.
> > > However,
> > > +SWP zones can accept random write operations, that is, VIRTIO_BLK_T_OUT
> > > requests
> > > +with a start sector different from the zone write pointer position.
> > > +
> > > +The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates
> > > the
> > > +state of the device zone. The available zone states are
> > > VIRTIO_BLK_ZS_NOT_WP(0),
> > > +VIRTIO_BLK_ZS_EMPTY(1), VIRTIO_BLK_ZS_IOPEN(2), VIRTIO_BLK_ZS_EOPEN(3),
> > > +VIRTIO_BLK_ZS_CLOSED(4), VIRTIO_BLK_ZS_RDONLY(13), VIRTIO_BLK_ZS_FULL(14)
> > > and
> > > +VIRTIO_BLK_ZS_OFFLINE(15).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_ZS_NOT_WP   0
> > > +#define VIRTIO_BLK_ZS_EMPTY    1
> > > +#define VIRTIO_BLK_ZS_IOPEN    2
> > > +#define VIRTIO_BLK_ZS_EOPEN    3
> > > +#define VIRTIO_BLK_ZS_CLOSED   4
> > > +#define VIRTIO_BLK_ZS_RDONLY   13
> > > +#define VIRTIO_BLK_ZS_FULL     14
> > > +#define VIRTIO_BLK_ZS_OFFLINE  15
> > > +\end{lstlisting}
> > > +
> > > +Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be
> > > in
> > > +the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and
> > > +VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state.
> > > +
> > > +Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open),
> > > +VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed)
> > > state
> > > +are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only),
> > > VIRTIO_BLK_ZS_FULL
> > > +(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer
> > > +value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones.
> > > +
> > > +The zone descriptor field \field{z_cap} contains the maximum number of 512-
> > > byte
> > > +sectors that are available to be written with user data when the zone is in
> > > the
> > > +Empty state. This value shall be less than or equal to the
> > > \field{zone_sectors}
> > > +value in \field{virtio_blk_zoned_characteristics} structure in the device
> > > +configuration space.
> > > +
> > > +The zone descriptor field \field{z_start} contains the 64-bit address of the
> > > +first 512-byte sector of the zone.
> > > +
> > > +The zone descriptor field \field{z_wp} contains the 64-bit sector address
> > > where
> > > +the next write operation for this zone should be issued. This value is
> > > undefined
> > > +for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY,
> > > VIRTIO_BLK_ZS_FULL
> > > +and VIRTIO_BLK_ZS_OFFLINE state.
> > > +
> > > +Depending on their state, zones consume resources as follows:
> > > +\begin{itemize}
> > > +\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes
> > > one
> > > +    open zone resource and, additionally,
> > > +
> > > +\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and
> > > +    VIRTIO_BLK_ZS_CLOSED state consumes one active resource.
> > > +\end{itemize}
> > > +
> > > +Attempts for zone transitions that violate zone resource limits MUST fail
> > > with
> > > +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE
> > > +\field{zone_result}.
> >
> > Why use zone_result instead of the status field? The _S_ naming
> > suggests this is a status field constant, not a zone_result constant.
>
> zone_result only a valid ZBD error if the status is VIRTIO_BLK_S_IOERR. If the
> status is VIRTIO_BLK_S_OK, this field may contain ALBA for Zone Append request.
> It is possible to just define the ZBD-specific error codes following
> VIRTIO_BLK_S_UNSUPP, but I think it is cleaner to create a separate numbering for
> ZBD-specific codes and keep the non-zoned errors as-is in their impressive
> simplicity.
>
> The ZBD error codes all currently have _S_ZONE_ naming, but it might be better to
> use something like _ZR_ for example, i.e.
>
> #define VIRTIO_BLK_ZR_INVALID_CMD     0
> #define VIRTIO_BLK_ZR_UNALIGNED_WP    1
> #define VIRTIO_BLK_ZR_OPEN_RESOURCE   2
> #define VIRTIO_BLK_ZR_ACTIVE_RESOURCE 3

IMO multiple levels of result codes adds complexity in the driver
implementation, especially when zoned and non-zoned code is shared. It
can make error handling and reporting in drivers awkward (e.g. do
error messages for zoned devices include a separate zone_result field
while non-zoned devices just have the status field in their error
messages?). I would use status codes but if you really want to keep
the zoned error codes separate then _ZR_ naming is good.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [virtio-comment] Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
       [not found]       ` <d0c0fcfc-01d2-9af3-3018-e8ca03f21fab@opensource.wdc.com>
@ 2022-06-03 16:31         ` Stefan Hajnoczi
  0 siblings, 0 replies; 7+ messages in thread
From: Stefan Hajnoczi @ 2022-06-03 16:31 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Stefan Hajnoczi, Dmitry Fomichev, Niklas Cassel, hare, its,
	Matias Bjørling, virtio-dev, virtio-comment, Hans Holmberg,
	faithilikerun

[-- Attachment #1: Type: text/plain, Size: 1924 bytes --]

On Fri, Jun 03, 2022 at 05:16:01PM +0900, Damien Le Moal wrote:
> On 6/3/22 16:33, Stefan Hajnoczi wrote:
> [...]
> >> If partial is 0, the nr_zones will contain the number of zones that potentially
> >> can be reported starting from the request "sector" position and until the end of
> >> sector range of the device. It does not necessarily equal the total number of
> >> zones on the drive.
> > 
> > The driver knows zone_sectors and nr_zones, so it can calculate
> > nr_zones - (sector / zone_sectors) itself. Why send a ZONE_REPORT
> > partial=0 request to the device to get this value?
> 
> Good point. Indeed, the partial bit is not useful when we have all zone
> with the same size. ZBC does not enforce that. So in general, it is not
> possible to calculate the number of zones starting from a particular LBA
> even if the total capacity is known. A report zones reply thus always
> indicate the *total* number of zones that could be reported from a given
> LBA, regardless of the report buffer size. This forces the drives to go
> through all zones even if the report zones buffer is already filled up. So
> longer command execution times.
> 
> The partial bit prevents that and is very useful to speedup a zone report
> for a single zone (e.g. to know that zone wp location).
> 
> For a Linux host, since the kernel enforces that all zones must have the
> same size, the partial bit is not useful. But other hosts may not enforce
> this, in which case the partial bit can be useful.
> 
> That said, I would not mind getting rid of it and assume a partial=1
> behavior, always. The kernel ioctl API does not have that parameter
> anyway, so it would have to be emulated by the driver. Not too hard to do,
> but still more code than strictly necessary.

Okay. We can start simple and introduce a feature bit later for devices
that have multiple zone sizes, if necessary.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
  2022-06-03  8:08     ` [virtio-comment] " Stefan Hajnoczi
@ 2022-06-04  0:09       ` Dmitry Fomichev
  2022-06-04  8:18         ` [virtio-comment] " Stefan Hajnoczi
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Fomichev @ 2022-06-04  0:09 UTC (permalink / raw)
  To: stefanha
  Cc: its, hare, Niklas Cassel, Matias Bjørling, virtio-dev,
	Hans Holmberg, virtio-comment, faithilikerun, damien.lemoal,
	stefanha

On Fri, 2022-06-03 at 09:08 +0100, Stefan Hajnoczi wrote:
> (Resent from my subscriber email.)
> 
> On Fri, 3 Jun 2022 at 03:30, Dmitry Fomichev <Dmitry.Fomichev@wdc.com> wrote:
> > 
> > On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> > > (Resending from my mailing list subscriber address so this makes it onto
> > > the list.)
> > > 
> > > On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > wrote:
> > > > 
> > > > Introduce support for Zoned Block Devices to virtio.
> > > > 
> > > > Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> > > > and/or cost characteristics compared to commonly available block
> > > > devices by getting the entire LBA space of the device divided to block
> > > > regions that are much larger than the LBA size. These regions are
> > > > called zones and they can only be written sequentially. More details
> > > > about ZBDs can be found at
> > > > 
> > > > https://zonedstorage.io/docs/introduction/zoned-storage .
> > > > 
> > > > In its current form, the virtio protocol for block devices (virtio-blk)
> > > > is not aware of ZBDs but it allows the guest to successfully scan a
> > > > host-managed drive provided by the host. As the result, the
> > > > host-managed drive appears at the guest as a regular drive that will
> > > > operate erroneously under the most common write workloads.
> > > > 
> > > > To fix this, the virtio-blk protocol needs to be extended to add the
> > > > capabilities to convey the zone characteristics of host ZBDs to the
> > > > guest and to provide the support for ZBD-specific commands - Report
> > > > Zones, four zone operations and (optionally) Zone Append. The proposed
> > > > standard extension aims to provide this functionality.
> > > > 
> > > > This patch extends the virtio-blk section of virtio specification with
> > > > the minimum set of requirements that are necessary to support ZBDs.
> > > > The resulting device model is a subset of the models defined in ZAC/ZBC
> > > > and ZNS standards documents. The included functionality mirrors
> > > > the existing Linux kernel block layer ZBD support and should be
> > > > sufficient to handle the host-managed and host-aware HDDs that are on
> > > > the market today as well as ZNS SSDs that are entering the market at
> > > > the moment of this patch submission.
> > > > 
> > > > I have developed a proof of concept patch series that adds ZBD support
> > > > to virtio-blk Linux kernel driver by implementing the protocol
> > > > extensions defined in the spec patch. I would like to receive the
> > > > initial feedback on this specification patch before posting that series
> > > > to the block LKML.
> > > > 
> > > > I would like to thank the following people for their useful feedback
> > > > and suggestions while working on the initial iterations of this patch.
> > > > 
> > > > Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > > Matias Bjørling <Matias.Bjorling@wdc.com>
> > > > Niklas Cassel <Niklas.Cassel@wdc.com>
> > > > Hans Holmberg <Hans.Holmberg@wdc.com>
> > > > 
> > > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > > ---
> > > >  content.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 685 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/content.tex b/content.tex
> > > > index 7508dd1..8ae7578 100644
> > > > --- a/content.tex
> > > > +++ b/content.tex
> > 

<snip>

> > > 
> > > Here is the non-union representation of the structs. Note that this
> > > means existing request processing code can handle IN/OUT/FLUSH even
> > > when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> > > code paths for these common commands:
> > > 
> > > /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> > > struct virtio_blk_req {
> > >         le32 type;
> > >         le32 reserved;
> > >         le64 sector;
> > >         u8 data[];
> > >         u8 status;
> > > };
> > > 
> > > /* ZONE_APPEND */
> > > struct virtio_blk_req_zone_append {
> > >         le32 type;
> > >         le32 reserved;
> > >         le64 sector;
> > >         u8 data[];
> > >         le64 zone_result;
> > >         u8 status;
> > > };
> > > 
> > > /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> > > struct virtio_blk_req_zone_mgmt_send {
> > >         le32 type;
> > >         le32 reserved;
> > >         le64 sector;
> > >         /* ALL zone operation flag */
> > >         __u8 all;
> > >         __u8 unused1[3];
> > >         u8 data[];
> > >         u8 status;
> > > };
> > > 
> > > /* ZONE_REPORT */
> > > struct virtio_blk_req_zone_mgmt_receive {
> > >         le32 type;
> > >         le32 reserved;
> > >         le64 sector;
> > >         /* Partial zone report flag */
> > >         __u8 partial;
> > >         __u8 unused2[3];
> > >         u8 data[];
> > >         u8 status;
> > > };
> > 
> > We felt that it is less confusing to have a single ZBD request structure so
> > there
> > is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice
> > of
> > the request -
> > 
> > VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
> > VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req
> > 
> > We also want to make sure that zone management/send/receive and append
> > requests
> > are always the same size and it should be easier with a single request
> > structure.
> > 
> > It is possible to use a simpler, but less structured union here -
> > 
> > struct virtio_blk_zoned_req {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         union zoned_params {
> >                 /* ALL zone operation flag */
> >                 __u8 all;
> >                 /* Partial zone report flag */
> >                 __u8 partial;
> >                 __u32 rsvd1;
> >         } zone;
> >         u8 data[];
> >         le64 zone_result;
> >         u8 reserved1[3];
> >         u8 status;
> > }
> > 
> > Using the single code path is very important and the PoC Linux driver code
> > that I
> > have sketched up achieves that. Since the request header is identical for the
> > both cases, I hope that it can be done in QEMU too.
> 
> There are two possibilities for optimizing the request layout:
> 1. For drivers and devices that implement both !VIRTIO_BLK_F_ZONED and
> VIRTIO_BLK_F_ZONED. Here it's best to share the same request layout
> for commands available in both modes. In other words IN and OUT
> requests to zoned devices should use struct virtio_blk_req so code can
> be shared and only commands unique to zoned devices use struct
> virtio_blk_zoned_req.
> 2. For drivers and devices that implement only one of
> !VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_ZONED. Here consistent request
> layout is best so the code parses all commands in the same way. In
> other words, IN and OUT requests to zoned devices use struct
> virtio_blk_zoned_req so that the implementation doesn't have to bother
> with struct virtio_blk_req at all.
> 
> I think you chose #2 and I gravitated towards #1. Which one is optimal
> depends on the use case. Feel free to stick with #2.
> 

OK. We certainly considered #1, but we found that in Linux virtio-blk driver
implementation, the request size is fixed. There is the member called vblk-
>tag_set.cmd_size and it is set to sizeof(struct virtblk_req) + <sg list size> as
a part of the device scan routine. Since the Zone Append request has to be larger
than the regular (non-zoned) one by at least 8 bytes to accommodate the ALBA
field, we can't mix and match the two for the same device. Here is how this
device initialization part can be modified to handle the VIRTIO_BLK_F_ZONED
feature bit if we stick with #2 -


@@ -810,13 +1287,18 @@ static int virtblk_probe(struct virtio_device *vdev)
        }
 
        memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
-       vblk->tag_set.ops = &virtio_mq_ops;
        vblk->tag_set.queue_depth = queue_depth;
        vblk->tag_set.numa_node = NUMA_NO_NODE;
        vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
-       vblk->tag_set.cmd_size =
-               sizeof(struct virtblk_req) +
-               sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
+       if (virtio_has_feature(vdev, VIRTIO_BLK_F_ZONED)) {
+               vblk->tag_set.ops = &virtio_mq_zoned_ops;
+               vblk->tag_set.cmd_size = sizeof(struct virtblk_zoned_req);
+       } else {
+               vblk->tag_set.ops = &virtio_mq_ops;
+               vblk->tag_set.cmd_size = sizeof(struct virtblk_req);
+       };
+       vblk->tag_set.cmd_size +=
+                       sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
        vblk->tag_set.driver_data = vblk;
        vblk->tag_set.nr_hw_queues = vblk->num_vqs;

> 
> > 
> > > 
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +In addition to the request types defined for non-zoned devices, the type
> > > > of
> > > > the
> > > > +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit
> > > > zone
> > > > open
> > > > +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close
> > > > (VIRTIO_BLK_T_ZONE_CLOSE),
> > > > a
> > > > +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append
> > > > (VIRTIO_BLK_T_ZONE_APPEND)
> > > > +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> > > > +
> > > > +\begin{lstlisting}
> > > > +#define VIRTIO_BLK_T_ZONE_APPEND    15
> > > > +#define VIRTIO_BLK_T_ZONE_REPORT    16
> > > > +#define VIRTIO_BLK_T_ZONE_OPEN      18
> > > > +#define VIRTIO_BLK_T_ZONE_CLOSE     20
> > > > +#define VIRTIO_BLK_T_ZONE_FINISH    22
> > > > +#define VIRTIO_BLK_T_ZONE_RESET     24
> > > > +\end{lstlisting}
> > > > +
> > > > +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of
> > > > the
> > > > type
> > > 
> > > Here "reads" means data transfers from the device to the driver rather
> > > than disk reads. I found that confusing. Maybe
> > > "VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?
> > 
> > Sure, your wording is better.
> > 
> > > For a cur
> > > > +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> > > > +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
> > > > VIRTIO_BLK_T_ZONE_RESET
> > > > +are non-data requests.
> > > > +
> > > > +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
> > > 
> > > "ZBD standards" is not defined. I think this refers to the zoned
> > > storage device model and its embodiment in the ZBC/ZNS specs.
> > 
> > > The configuration space fields already allow the driver to calculate
> > > the total number of zones. What is the purpose of partial=0, it
> > > doesn't seem to provide any new information?
> > 
> > Both ZBC and ZNS have this categorization. "ZBD standards" can be defined
> > earlier
> > in the text as a catchall for ZBC/ZNS.
> 
> Great.
> 
> > If partial is 0, the nr_zones will contain the number of zones that
> > potentially
> > can be reported starting from the request "sector" position and until the end
> > of
> > sector range of the device. It does not necessarily equal the total number of
> > zones on the drive.
> 
> The driver knows zone_sectors and nr_zones, so it can calculate
> nr_zones - (sector / zone_sectors) itself. Why send a ZONE_REPORT
> partial=0 request to the device to get this value?
> 

As Damien mentioned in the other reply, the partial bit is needed for the cases
when this calculation is not possible, i.e. when the zone size varies. Yes, we
can omit it for now and possibly add a new feature bit in the future if there's
need to handle ZBDs with varying zone sizes.

> 
<snip>

> > zone_result only a valid ZBD error if the status is VIRTIO_BLK_S_IOERR. If
> > the
> > status is VIRTIO_BLK_S_OK, this field may contain ALBA for Zone Append
> > request.
> > It is possible to just define the ZBD-specific error codes following
> > VIRTIO_BLK_S_UNSUPP, but I think it is cleaner to create a separate numbering
> > for
> > ZBD-specific codes and keep the non-zoned errors as-is in their impressive
> > simplicity.
> > 
> > The ZBD error codes all currently have _S_ZONE_ naming, but it might be
> > better to
> > use something like _ZR_ for example, i.e.
> > 
> > #define VIRTIO_BLK_ZR_INVALID_CMD     0
> > #define VIRTIO_BLK_ZR_UNALIGNED_WP    1
> > #define VIRTIO_BLK_ZR_OPEN_RESOURCE   2
> > #define VIRTIO_BLK_ZR_ACTIVE_RESOURCE 3
> 
> IMO multiple levels of result codes adds complexity in the driver
> implementation, especially when zoned and non-zoned code is shared. It
> can make error handling and reporting in drivers awkward (e.g. do
> error messages for zoned devices include a separate zone_result field
> while non-zoned devices just have the status field in their error
> messages?). I would use status codes but if you really want to keep
> the zoned error codes separate then _ZR_ naming is good.

Ok, let's just define the new zone error codes following VIRTIO_BLK_S_UNSUPP.
This is more straightforward and, as an added benefit, the ALBA field can be
named more appropriately in this case.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [virtio-comment] Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification
  2022-06-04  0:09       ` Dmitry Fomichev
@ 2022-06-04  8:18         ` Stefan Hajnoczi
  0 siblings, 0 replies; 7+ messages in thread
From: Stefan Hajnoczi @ 2022-06-04  8:18 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: its, hare, Niklas Cassel, Matias Bjørling, virtio-dev,
	Hans Holmberg, virtio-comment, faithilikerun, damien.lemoal,
	stefanha

[-- Attachment #1: Type: text/plain, Size: 7934 bytes --]

On Sat, Jun 04, 2022 at 12:09:46AM +0000, Dmitry Fomichev wrote:
> On Fri, 2022-06-03 at 09:08 +0100, Stefan Hajnoczi wrote:
> > (Resent from my subscriber email.)
> > 
> > On Fri, 3 Jun 2022 at 03:30, Dmitry Fomichev <Dmitry.Fomichev@wdc.com> wrote:
> > > 
> > > On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> > > > (Resending from my mailing list subscriber address so this makes it onto
> > > > the list.)
> > > > 
> > > > On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > > wrote:
> > > > Here is the non-union representation of the structs. Note that this
> > > > means existing request processing code can handle IN/OUT/FLUSH even
> > > > when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> > > > code paths for these common commands:
> > > > 
> > > > /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> > > > struct virtio_blk_req {
> > > >         le32 type;
> > > >         le32 reserved;
> > > >         le64 sector;
> > > >         u8 data[];
> > > >         u8 status;
> > > > };
> > > > 
> > > > /* ZONE_APPEND */
> > > > struct virtio_blk_req_zone_append {
> > > >         le32 type;
> > > >         le32 reserved;
> > > >         le64 sector;
> > > >         u8 data[];
> > > >         le64 zone_result;
> > > >         u8 status;
> > > > };
> > > > 
> > > > /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> > > > struct virtio_blk_req_zone_mgmt_send {
> > > >         le32 type;
> > > >         le32 reserved;
> > > >         le64 sector;
> > > >         /* ALL zone operation flag */
> > > >         __u8 all;
> > > >         __u8 unused1[3];
> > > >         u8 data[];
> > > >         u8 status;
> > > > };
> > > > 
> > > > /* ZONE_REPORT */
> > > > struct virtio_blk_req_zone_mgmt_receive {
> > > >         le32 type;
> > > >         le32 reserved;
> > > >         le64 sector;
> > > >         /* Partial zone report flag */
> > > >         __u8 partial;
> > > >         __u8 unused2[3];
> > > >         u8 data[];
> > > >         u8 status;
> > > > };
> > > 
> > > We felt that it is less confusing to have a single ZBD request structure so
> > > there
> > > is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice
> > > of
> > > the request -
> > > 
> > > VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
> > > VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req
> > > 
> > > We also want to make sure that zone management/send/receive and append
> > > requests
> > > are always the same size and it should be easier with a single request
> > > structure.
> > > 
> > > It is possible to use a simpler, but less structured union here -
> > > 
> > > struct virtio_blk_zoned_req {
> > >         le32 type;
> > >         le32 reserved;
> > >         le64 sector;
> > >         union zoned_params {
> > >                 /* ALL zone operation flag */
> > >                 __u8 all;
> > >                 /* Partial zone report flag */
> > >                 __u8 partial;
> > >                 __u32 rsvd1;
> > >         } zone;
> > >         u8 data[];
> > >         le64 zone_result;
> > >         u8 reserved1[3];
> > >         u8 status;
> > > }
> > > 
> > > Using the single code path is very important and the PoC Linux driver code
> > > that I
> > > have sketched up achieves that. Since the request header is identical for the
> > > both cases, I hope that it can be done in QEMU too.
> > 
> > There are two possibilities for optimizing the request layout:
> > 1. For drivers and devices that implement both !VIRTIO_BLK_F_ZONED and
> > VIRTIO_BLK_F_ZONED. Here it's best to share the same request layout
> > for commands available in both modes. In other words IN and OUT
> > requests to zoned devices should use struct virtio_blk_req so code can
> > be shared and only commands unique to zoned devices use struct
> > virtio_blk_zoned_req.
> > 2. For drivers and devices that implement only one of
> > !VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_ZONED. Here consistent request
> > layout is best so the code parses all commands in the same way. In
> > other words, IN and OUT requests to zoned devices use struct
> > virtio_blk_zoned_req so that the implementation doesn't have to bother
> > with struct virtio_blk_req at all.
> > 
> > I think you chose #2 and I gravitated towards #1. Which one is optimal
> > depends on the use case. Feel free to stick with #2.
> > 
> 
> OK. We certainly considered #1, but we found that in Linux virtio-blk driver
> implementation, the request size is fixed. There is the member called vblk-
> >tag_set.cmd_size and it is set to sizeof(struct virtblk_req) + <sg list size> as
> a part of the device scan routine. Since the Zone Append request has to be larger
> than the regular (non-zoned) one by at least 8 bytes to accommodate the ALBA
> field, we can't mix and match the two for the same device. Here is how this
> device initialization part can be modified to handle the VIRTIO_BLK_F_ZONED
> feature bit if we stick with #2 -
> 
> 
> @@ -810,13 +1287,18 @@ static int virtblk_probe(struct virtio_device *vdev)
>         }
>  
>         memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> -       vblk->tag_set.ops = &virtio_mq_ops;
>         vblk->tag_set.queue_depth = queue_depth;
>         vblk->tag_set.numa_node = NUMA_NO_NODE;
>         vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> -       vblk->tag_set.cmd_size =
> -               sizeof(struct virtblk_req) +
> -               sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
> +       if (virtio_has_feature(vdev, VIRTIO_BLK_F_ZONED)) {
> +               vblk->tag_set.ops = &virtio_mq_zoned_ops;
> +               vblk->tag_set.cmd_size = sizeof(struct virtblk_zoned_req);
> +       } else {
> +               vblk->tag_set.ops = &virtio_mq_ops;
> +               vblk->tag_set.cmd_size = sizeof(struct virtblk_req);
> +       };
> +       vblk->tag_set.cmd_size +=
> +                       sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
>         vblk->tag_set.driver_data = vblk;
>         vblk->tag_set.nr_hw_queues = vblk->num_vqs;

I've included a #1 solution below in case you like it. cmd_size doesn't
need to be changed, sizeof(struct virtblk_req) is fine. Here is struct
virtblk_req extended to handle the ALBA field:

  struct virtblk_req {
      struct virtio_blk_outhdr out_hdr;
+     struct {
+             __virtio64 alba;
+             u8 status;
+     } in;
-     u8 status;
      struct sg_table sg_table;
      struct scatterlist sg[];
  };

The in struct could also be a union of several structs. I kept it simple
assuming just one request (ZONE_APPEND) has additional device-to-driver
(in) fields.

The request layout variation is taken into account when building the
virtqueue scatterlist:

  static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
-         struct scatterlist *data_sg, bool have_data)
+         struct scatterlist *data_sg, bool have_data, bool have_alba)
  {
-     struct scatterlist hdr, status, *sgs[3];
+     struct scatterlist hdr, in, *sgs[3];
      unsigned int num_out = 0, num_in = 0;

      sg_init_one(&hdr, &vbr->out_hdr, sizeof(vbr->out_hdr));
      sgs[num_out++] = &hdr;

      if (have_data) {
          if (vbr->out_hdr.type & cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_OUT))
              sgs[num_out++] = data_sg;
          else
              sgs[num_out + num_in++] = data_sg;
      }

+     if (have_alba)
+         sg_init_one(&in, &vbr->in, sizeof(vbr->in));
+     else
+         sg_init_one(&in, &vbr->in.status, sizeof(vbr->in.status));
+     sgs[num_out + num_in++] = &in;
-     sg_init_one(&status, &vbr->status, sizeof(vbr->status));
-     sgs[num_out + num_in++] = &status;

      return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
  }

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-06-04  8:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-01 23:55 [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification Dmitry Fomichev
2022-06-02  7:23 ` Stefan Hajnoczi
2022-06-03  2:30   ` Dmitry Fomichev
2022-06-03  8:08     ` [virtio-comment] " Stefan Hajnoczi
2022-06-04  0:09       ` Dmitry Fomichev
2022-06-04  8:18         ` [virtio-comment] " Stefan Hajnoczi
     [not found]     ` <CAJSP0QXiUy-=pyGtxjpguWFt=+HJFqjKo3ipwN+YpF43nsxAag@mail.gmail.com>
     [not found]       ` <d0c0fcfc-01d2-9af3-3018-e8ca03f21fab@opensource.wdc.com>
2022-06-03 16:31         ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.