* Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
@ 2022-05-30  5:09 Sam Li
  2022-05-30  5:38 ` Damien Le Moal
  2022-05-30 11:19 ` Stefan Hajnoczi
  0 siblings, 2 replies; 14+ messages in thread
From: Sam Li @ 2022-05-30  5:09 UTC (permalink / raw)
  Cc: Stefan Hajnoczi, Damien Le Moal, Dmitry Fomichev,
	Hannes Reinecke, qemu-devel, qemu-block

Hi everyone,
I'm Sam Li, working on the Outreachy project which is to add zoned
device support to QEMU's virtio-blk emulation.

For the first goal, adding QEMU block layer APIs resembling Linux ZBD
ioctls, I think the naive approach would be to introduce a new stable
struct zbd_zone descriptor for the library function interface. More
specifically, what I'd like to add to the BlockDriver struct are:
1. zbd_info as zone block device information: includes numbers of
zones, size of logical blocks, and physical blocks.
2. zbd_zone_type and zbd_zone_state
3. zbd_dev_model: host-managed zbd, host-aware zbd
With those basic structs, we can start to implement new functions as
bdrv*() APIs for BLOCK*ZONE ioctls.

I'll start to finish this task based on the above description. If
there is any problem or something I may miss in the design, please let
me know.

Best regards,
Sam



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-05-30  5:09 Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls Sam Li
@ 2022-05-30  5:38 ` Damien Le Moal
  2022-05-30 11:21   ` Stefan Hajnoczi
  2022-05-30 11:19 ` Stefan Hajnoczi
  1 sibling, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2022-05-30  5:38 UTC (permalink / raw)
  To: Sam Li
  Cc: Stefan Hajnoczi, Dmitry Fomichev, Hannes Reinecke, qemu-devel,
	qemu-block

On 2022/05/30 14:09, Sam Li wrote:
> Hi everyone,

Hi Sam,

> I'm Sam Li, working on the Outreachy project which is to add zoned
> device support to QEMU's virtio-blk emulation.
> 
> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> ioctls, I think the naive approach would be to introduce a new stable
> struct zbd_zone descriptor for the library function interface. More
> specifically, what I'd like to add to the BlockDriver struct are:
> 1. zbd_info as zone block device information: includes numbers of
> zones, size of logical blocks, and physical blocks.

Virtio block devices only advertise a "block size" (blk_size field of struct
virtio_blk_config). So I do not think that you need to distinguish between
logical and physical blocks. However, for zoned devices, we need to add a "write
granularity" field which indicates the minimum write size and alignment. This is
to be able to handle 512e SMR disk drives as these have a 512 B logical block
size and 4096 B physical block size. And SMR only allows writing in units of
physical block size, regardless of the LBA size. For NVMe ZNS devices, there is
no logical/physical block size difference, so the write granularity will always
be equal to the block size.
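
As a rough illustration (hypothetical field and variable names, only to show the
rule above), the probing code could end up doing something like:

/* sketch only: pick the write granularity advertised for the zoned device */
if (is_512e_smr_disk) {
    /* 512 B logical blocks, but SMR writes must be aligned to 4096 B */
    zbd_info.write_granularity = physical_block_size;  /* 4096 */
} else {
    /* e.g. NVMe ZNS: no logical/physical split */
    zbd_info.write_granularity = logical_block_size;
}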

> 2. zbd_zone_type and zbd_zone_state

As a first step, I would recommend to only have the zone type. That will allow
you to not issue a zone ioctl that you know will fail, e.g. if the user tries to
reset a conventional zone, we know this will fail, so no point in executing the
BLKRESETZONE ioctl for that zone. With the zone type cached, you can easily
catch such cases. But even that is actually optional as a first step. You can
rely on the host device failing any invalid operation and return the errors back
to the guest.
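
For example, with the zone type cached, the reset path could bail out before
issuing the ioctl at all (sketch only, made-up names):

/* sketch: skip BLKRESETZONE for conventional zones */
struct zbd_zone *z = &dev->zones[zone_index];

if (z->type == ZONE_TYPE_CONVENTIONAL) {
    return -EINVAL;   /* resetting a conventional zone can never succeed */
}
/* otherwise issue BLKRESETZONE and let the host device report any error */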

Once you have an API working and the ability to execute all zone operations from
a guest, you can then work on adding the more difficult bits: supporting the
zone append operation will require tracking the write pointer position and state
of the device sequential zones. For that, the zone information will need the
zone capacity and write pointer position of all zones. The zone state may not be
necessary as you can infer the empty and full states from the zone capacity and
zone write pointer position. States such as explicit/implicit open, closed,
read-only and offline do not need to be tracked. If an operation cannot be
executed, the device will fail the io on the host side and you can simply
propagate that error to the guest.

See the Linux kernel sd_zbc driver and its emulation of zone append operations
for inspiration: drivers/scsi/sd_zbc.c. Looking at that code (e.g.
sd_zbc_prepare_zone_append()), you will see that the only thing being tracked is
the write pointer position of zones (relative to the zone start sector). The
state is inferred from that value, indicating if the zone can be written (it is
not full) or not (the zone is full).
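
In other words, with only the write pointer tracked, the interesting states can
be derived on the fly, roughly (assuming a zone descriptor with start, cap and
an absolute wp):

/* sketch: infer zone state from the write pointer, as sd_zbc does */
static inline bool zone_is_full(const struct zbd_zone *z)
{
    return z->wp >= z->start + z->cap;
}

static inline bool zone_is_empty(const struct zbd_zone *z)
{
    return z->wp == z->start;
}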

> 3. zbd_dev_model: host-managed zbd, host-aware zbd

Yes. The current virtio specs draft adding zoned block device support adds
struct virtio_blk_zoned_characteristics. Most, if not all, of the fields in that
structure can be kept as part of the device zone information.

> With those basic structs, we can start to implement new functions as
> bdrv*() APIs for BLOCK*ZONE ioctls.

BLK*ZONE :)

> 
> I'll start to finish this task based on the above description. If
> there is any problem or something I may miss in the design, please let
> me know.

Supporting all operations will be difficult to do in one go. But as explained
above, if you initially exclude zone append support, you will not need to
dynamically track the zone state and wp. This will simplify the code and give
you a solid working base on which to build the remaining support.

> 
> Best regards,
> Sam
> 


-- 
Damien Le Moal
Western Digital Research



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-05-30  5:09 Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls Sam Li
  2022-05-30  5:38 ` Damien Le Moal
@ 2022-05-30 11:19 ` Stefan Hajnoczi
  2022-06-01  2:57   ` Sam Li
  1 sibling, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2022-05-30 11:19 UTC (permalink / raw)
  To: Sam Li
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
>
> Hi everyone,
> I'm Sam Li, working on the Outreachy project which is to add zoned
> device support to QEMU's virtio-blk emulation.
>
> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> ioctls, I think the naive approach would be to introduce a new stable
> struct zbd_zone descriptor for the library function interface. More
> specifically, what I'd like to add to the BlockDriver struct are:
> 1. zbd_info as zone block device information: includes numbers of
> zones, size of logical blocks, and physical blocks.
> 2. zbd_zone_type and zbd_zone_state
> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> With those basic structs, we can start to implement new functions as
> bdrv*() APIs for BLOCK*ZONE ioctls.
>
> I'll start to finish this task based on the above description. If
> there is any problem or something I may miss in the design, please let
> me know.

Hi Sam,
Can you propose function prototypes for the new BlockDriver callbacks
needed for zoned devices?

Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-05-30  5:38 ` Damien Le Moal
@ 2022-05-30 11:21   ` Stefan Hajnoczi
  2022-05-30 11:26     ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2022-05-30 11:21 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Sam Li, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On Mon, 30 May 2022 at 06:38, Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> On 2022/05/30 14:09, Sam Li wrote:
> Once you have an API working and the ability to execute all zone operations from
> a guest, you can then work on adding the more difficult bits: supporting the
> zone append operation will require tracking the write pointer position and state
> of the device sequential zones. For that, the zone information will need the
> zone capacity and write pointer position of all zones. The zone state may not be
> necessary as you can infer the empty and full states from the zone capacity and
> zone write pointer position. States such as explicit/implicit open, closed,
> read-only and offline do not need to be tracked. If an operation cannot be
> executed, the device will fail the io on the host side and you can simply
> propagate that error to the guest.
>
> See the Linux kernel sd_zbc driver and its emulation of zone append operations
> for inspiration: drivers/scsi/sd_zbc.c. Looking at that code (e.g.
> sd_zbc_prepare_zone_append()), you will see that the only thing being tracked is
> the write pointer position of zones (relative to the zone start sector). The
> state is inferred from that value, indicating if the zone can be written (it is
> not full) or not (the zone is full).

It sounds like QEMU BlockDrivers need to maintain state? I haven't
thought this through but the guest probably has some state and the
device below QEMU also has state. Can QEMU just invoke the BLK*ZONE
ioctls on behalf of the guest without maintaining its own state?

Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-05-30 11:21   ` Stefan Hajnoczi
@ 2022-05-30 11:26     ` Damien Le Moal
  0 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Sam Li, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On 2022/05/30 20:22, Stefan Hajnoczi wrote:
> On Mon, 30 May 2022 at 06:38, Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>> On 2022/05/30 14:09, Sam Li wrote:
>> Once you have an API working and the ability to execute all zone operations from
>> a guest, you can then work on adding the more difficult bits: supporting the
>> zone append operation will require tracking the write pointer position and state
>> of the device sequential zones. For that, the zone information will need the
>> zone capacity and write pointer position of all zones. The zone state may not be
>> necessary as you can infer the empty and full states from the zone capacity and
>> zone write pointer position. States such as explicit/implicit open, closed,
>> read-only and offline do not need to be tracked. If an operation cannot be
>> executed, the device will fail the io on the host side and you can simply
>> propagate that error to the guest.
>>
>> See the Linux kernel sd_zbc driver and its emulation of zone append operations
>> for inspiration: drivers/scsi/sd_zbc.c. Looking at that code (e.g.
>> sd_zbc_prepare_zone_append()), you will see that the only thing being tracked is
>> the write pointer position of zones (relative to the zone start sector). The
>> state is inferred from that value, indicating if the zone can be written (it is
>> not full) or not (the zone is full).
> 
> It sounds like QEMU BlockDrivers need to maintain state? I haven't
> thought this through but the guest probably has some state and the
> device below QEMU also has state. Can QEMU just invoke the BLK*ZONE
> ioctls on behalf of the guest without maintaining its own state?

For all operations except zone append, yes, it can, and that is the way it should
be done. However, for a Linux host, since there is no user interface for the zone
append operation, QEMU will need to emulate that command using regular writes.
And for that to work, the zone state (at least the zone write pointer position)
needs to be tracked. Linux mandates zone append working for zoned devices...
This kind of emulation is thus implemented in the scsi disk driver since ZBC &
ZAC do not have zone append. It is not too hard to do, but care must be applied
to zone wp tracking and locking (mutual exclusion for zone wp test + change
depending on the operation). This tracking can be fully initialized on startup
and does not need any persistent "metadata".
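
The shape of that emulation is roughly the following (pseudo-code, not the
actual sd_zbc or QEMU code; lock_zone()/unlock_zone() and do_write() are
placeholders):

/* emulate zone append as a regular write at the current write pointer */
lock_zone(z);                         /* per-zone mutual exclusion */
if (z->wp >= z->start + z->cap) {     /* zone full, append must fail */
    unlock_zone(z);
    return -ENOSPC;
}
append_sector = z->wp;                /* sector reported back to the guest */
ret = do_write(fd, buf, len, append_sector);
if (ret == 0) {
    z->wp += len >> BDRV_SECTOR_BITS; /* advance wp only on success */
}
unlock_zone(z);
return ret;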


> 
> Stefan
> 


-- 
Damien Le Moal
Western Digital Research



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-05-30 11:19 ` Stefan Hajnoczi
@ 2022-06-01  2:57   ` Sam Li
  2022-06-01  5:47     ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: Sam Li @ 2022-06-01  2:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

Hi Stefan,

Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:


>
> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> >
> > Hi everyone,
> > I'm Sam Li, working on the Outreachy project which is to add zoned
> > device support to QEMU's virtio-blk emulation.
> >
> > For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> > ioctls, I think the naive approach would be to introduce a new stable
> > struct zbd_zone descriptor for the library function interface. More
> > specifically, what I'd like to add to the BlockDriver struct are:
> > 1. zbd_info as zone block device information: includes numbers of
> > zones, size of logical blocks, and physical blocks.
> > 2. zbd_zone_type and zbd_zone_state
> > 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > With those basic structs, we can start to implement new functions as
> > bdrv*() APIs for BLOCK*ZONE ioctls.
> >
> > I'll start to finish this task based on the above description. If
> > there is any problem or something I may miss in the design, please let
> > me know.
>
> Hi Sam,
> Can you propose function prototypes for the new BlockDriver callbacks
> needed for zoned devices?

I have made some modifications based on Damien's advice in design part
1 and added the function prototypes in design part 2. If there is any
problem or part I missed, please let me know.

Design of Block Layer APIs in BlockDriver:
1. introduce a new stable struct zbd_zone descriptor for the library
function interface.
  a. zbd_info as zone block device information: includes number of
zones, size of blocks, write granularity in bytes (minimal write size
and alignment)
    - write granularity: 512e SMRs: writes in units of physical block
size, 4096 bytes; NVMe ZNS write granularity is equal to the block
size.
    - zone descriptor: start, length, capacity, write pointer, zone type
  b. zbd_zone_type
    - zone type: conventional, sequential write required, sequential
write preferred
  c. zbd_dev_model: host-managed zbd, host-aware zbd

 2. implement new functions as bdrv*() APIs for BLK*ZONE ioctls
   a. support basic operations: get the APIs working when executing
the zone operations from a guest
    - zone information access: report
    - zone manipulation: reset,open,close,finish
  b. support zone append operation: zone capacity, write pointer
positions of all zones (excluded for now)
    - can track the zone state we need: zone is full or not.

More specifically, the function prototypes for 2a are as follows:

int zbd_report_zones(int fd, off_t offset, off_t len, enum
zbd_report_option ro, struct zbd_zone *zones, unsigned int
*nr_zones);
int zbd_reset_zones(int fd, off_t offset, off_t len);
int zbd_open_zones(int fd, off_t offset, off_t len);
int zbd_close_zones(int fd, off_t offset, off_t len);
int zbd_finish_zones(int fd, off_t offset, off_t len);

>
> Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-01  2:57   ` Sam Li
@ 2022-06-01  5:47     ` Damien Le Moal
  2022-06-01 10:19       ` Sam Li
  2022-06-01 11:43       ` Stefan Hajnoczi
  0 siblings, 2 replies; 14+ messages in thread
From: Damien Le Moal @ 2022-06-01  5:47 UTC (permalink / raw)
  To: Sam Li, Stefan Hajnoczi
  Cc: Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On 6/1/22 11:57, Sam Li wrote:
> Hi Stefan,
> 
> Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> 
> 
>>
>> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
>>>
>>> Hi everyone,
>>> I'm Sam Li, working on the Outreachy project which is to add zoned
>>> device support to QEMU's virtio-blk emulation.
>>>
>>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
>>> ioctls, I think the naive approach would be to introduce a new stable
>>> struct zbd_zone descriptor for the library function interface. More
>>> specifically, what I'd like to add to the BlockDriver struct are:
>>> 1. zbd_info as zone block device information: includes numbers of
>>> zones, size of logical blocks, and physical blocks.
>>> 2. zbd_zone_type and zbd_zone_state
>>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
>>> With those basic structs, we can start to implement new functions as
>>> bdrv*() APIs for BLOCK*ZONE ioctls.
>>>
>>> I'll start to finish this task based on the above description. If
>>> there is any problem or something I may miss in the design, please let
>>> me know.
>>
>> Hi Sam,
>> Can you propose function prototypes for the new BlockDriver callbacks
>> needed for zoned devices?
> 
> I have made some modifications based on Damien's device in design part
> 1 and added the function prototypes in design part 2. If there is any
> problem or part I missed, please let me know.
> 
> Design of Block Layer APIs in BlockDriver:
> 1. introduce a new stable struct zbd_zone descriptor for the library
> function interface.
>   a. zbd_info as zone block device information: includes numbers of
> zones, size of blocks, write granularity in byte(minimal write size
> and alignment
>     - write granularity: 512e SMRs: writes in units of physical block
> size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> size.
>     - zone descriptor: start, length, capacity, write pointer, zone type
>   b. zbd_zone_type
>     - zone type: conventional, sequential write required, sequential
> write preferred
>   c. zbd_dev_model: host-managed zbd, host-aware zbd

This explanation is a little hard to understand. It seems to be mixing up
device level information and per-zone information. I think it would be a
lot simpler to write a struct definition to directly illustrate what you
are planning.

It is something like this ?

struct zbd_zone {
	enum zone_type	type;
	enum zone_cond	cond;
	uint64_t	start;
	uint32_t	length;
	uint32_t	cap;
	uint64_t	wp;
};

struct zbd_dev {
	enum zone_model	model;
	uint32_t	block_size;
	uint32_t	write_granularity;
	uint32_t	nr_zones;
	struct zbd_zone	*zones;	/* array of zones */
};

If yes, then my comments are as follows.

For the device struct: It may be good to have also the maximum number of
open zones and the maximum number of active zones.

For the zone struct: You may need to add a read-write lock per zone to be
able to write lock zones to ensure a sequential write pattern (virtio
devices can be multi-queue and so writes may be coming in from different
contexts) and to correctly emulate zone append operations with an atomic
update of the wp field.

These need to be integrated into the generic block driver interface in
include/block/block_int-common.h or include/block/block-common.h.
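
Concretely, the sketch above would then grow into something like this (still
only a sketch; CoRwlock is QEMU's coroutine-aware read-write lock):

struct zbd_zone {
	enum zone_type	type;
	enum zone_cond	cond;
	uint64_t	start;
	uint32_t	length;
	uint32_t	cap;
	uint64_t	wp;
	CoRwlock	lock;	/* serialize writers, atomic wp updates */
};

struct zbd_dev {
	enum zone_model	model;
	uint32_t	block_size;
	uint32_t	write_granularity;
	uint32_t	nr_zones;
	uint32_t	max_open_zones;
	uint32_t	max_active_zones;
	struct zbd_zone	*zones;	/* array of zones */
};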

> 
>  2. implement new functions as bdrv*() APIs for BLK*ZONE ioctls
>    a. support basic operations: get the APIs working when executing
> the zone operations from a guest
>     - zone information access: report
>     - zone manipulation: reset,open,close,finish

These operations are generally referred to as "zone management" operations.

>   b. support zone append operation: zone capacity, write pointer
> positions of all zones(excluded for now)
>     - can track the zone state we need: zone is full or not.
> 
> More specifically, the function prototypes for 2a are as follows:
> 
> int zbd_report_zones(int fd, off_t offset, off_t len, enum
> zbd_report_option ro, struct zbd_zone *zones, unsigned int
> *nr_zones);

The current virtio zone specification draft does not have a reporting
option field for the report zones command. However, it has a "partial"
field that will need to be passed to this function so that the correct
report zones reply can be built by the driver.

> int zbd_reset_zones(int fd, off_t offset, off_t len);
> int zbd_open_zones(int fd, off_t offset, off_t len);
> int zbd_close_zones(int fd, off_t offset, off_t len);
> int zbd_finish_zones(int fd, off_t offset, off_t len);

These 4 functions have the exact same signature, modulo the function name.
So we should simplify here to reduce the number of functions. Something like:

int zbd_zone_mgmt(int fd, enum zone_op op, off_t offset, off_t len);

where op can be BDRV_ZONE_RESET, BDRV_ZONE_OPEN, etc., can aggregate all 4
functions into one.

In any case, if you look at how block drivers are defined (an example is
the one for raw files in qemu/block/file-posix.c) using the BlockDriver
data type (defined as a struct in qemu/include/block/block_int-common.h),
you will see that the driver callback functions do not use a file
descriptor (fd) but a pointer to some data structure (most of the time a
BlockDriverState pointer).

Another thing: you will need one more callback to get the device
information related to zones: maximum number of open and active zones at
least (the number of zones can be discovered with a report zones call).
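
Putting those points together, the BlockDriver side could look roughly like
this (names are purely illustrative, not a final proposal; note the
BlockDriverState pointer instead of an fd):

typedef enum BlockZoneOp {
    BDRV_ZONE_OPEN,
    BDRV_ZONE_CLOSE,
    BDRV_ZONE_FINISH,
    BDRV_ZONE_RESET,
} BlockZoneOp;

/* new callbacks in struct BlockDriver (include/block/block_int-common.h) */
int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs, int64_t offset,
                                        unsigned int *nr_zones,
                                        struct zbd_zone *zones);
int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp op,
                                      int64_t offset, int64_t len);
/* device-level info: model, max_open_zones, max_active_zones, ... */
int (*bdrv_get_zone_info)(BlockDriverState *bs, struct zbd_dev *info);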

Cheers.

-- 
Damien Le Moal
Western Digital Research



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-01  5:47     ` Damien Le Moal
@ 2022-06-01 10:19       ` Sam Li
  2022-06-01 11:28         ` Stefan Hajnoczi
  2022-06-01 11:43       ` Stefan Hajnoczi
  1 sibling, 1 reply; 14+ messages in thread
From: Sam Li @ 2022-06-01 10:19 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Stefan Hajnoczi, Dmitry Fomichev, Hannes Reinecke, qemu-devel,
	qemu-block

Hi Damien,

Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年6月1日周三 13:47写道:
>
> On 6/1/22 11:57, Sam Li wrote:
> > Hi Stefan,
> >
> > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> >
> >
> >>
> >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> >>>
> >>> Hi everyone,
> >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> >>> device support to QEMU's virtio-blk emulation.
> >>>
> >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> >>> ioctls, I think the naive approach would be to introduce a new stable
> >>> struct zbd_zone descriptor for the library function interface. More
> >>> specifically, what I'd like to add to the BlockDriver struct are:
> >>> 1. zbd_info as zone block device information: includes numbers of
> >>> zones, size of logical blocks, and physical blocks.
> >>> 2. zbd_zone_type and zbd_zone_state
> >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> >>> With those basic structs, we can start to implement new functions as
> >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> >>>
> >>> I'll start to finish this task based on the above description. If
> >>> there is any problem or something I may miss in the design, please let
> >>> me know.
> >>
> >> Hi Sam,
> >> Can you propose function prototypes for the new BlockDriver callbacks
> >> needed for zoned devices?
> >
> > I have made some modifications based on Damien's device in design part
> > 1 and added the function prototypes in design part 2. If there is any
> > problem or part I missed, please let me know.
> >
> > Design of Block Layer APIs in BlockDriver:
> > 1. introduce a new stable struct zbd_zone descriptor for the library
> > function interface.
> >   a. zbd_info as zone block device information: includes numbers of
> > zones, size of blocks, write granularity in byte(minimal write size
> > and alignment
> >     - write granularity: 512e SMRs: writes in units of physical block
> > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > size.
> >     - zone descriptor: start, length, capacity, write pointer, zone type
> >   b. zbd_zone_type
> >     - zone type: conventional, sequential write required, sequential
> > write preferred
> >   c. zbd_dev_model: host-managed zbd, host-aware zbd
>
> This explanation is a little hard to understand. It seems to be mixing up
> device level information and per-zone information. I think it would be a
> lot simpler to write a struct definition to directly illustrate what you
> are planning.
>
> It is something like this ?
>
> struct zbd_zone {
>         enum zone_type  type;
>         enum zone_cond  cond;
>         uint64_t        start;
>         uint32_t        length;
>         uint32_t        cap;
>         uint64_t        wp;
> };
>
> struct zbd_dev {
>         enum zone_model model;
>         uint32_t        block_size;
>         uint32_t        write_granularity;
>         uint32_t        nr_zones
>         struct zbd_zone *zones; /* array of zones */
> };
>

Yes! Sorry that I was not very clear in the descriptions.

> If yes, then my comments are as follows.
>
> For the device struct: It may be good to have also the maximum number of
> open zones and the maximum number of active zones.
>
> For the zone struct: You may need to add a read-write lock per zone to be
> able to write lock zones to ensure a sequential write pattern (virtio
> devices can be multi-queue and so writes may be coming in from different
> contexts) and to correctly emulate zone append operations with an atomic
> update of the wp field.
>

Yes, I haven't thought through the thread-safety problem but I'll come
up with an approach.

> These need to be integrated into the generic block driver interface in
> include/block/block_int-common.h or include/block/block-common.h.
>
> >
> >  2. implement new functions as bdrv*() APIs for BLK*ZONE ioctls
> >    a. support basic operations: get the APIs working when executing
> > the zone operations from a guest
> >     - zone information access: report
> >     - zone manipulation: reset,open,close,finish
>
> These operations are generally referred to as "zone management" operations.
>
> >   b. support zone append operation: zone capacity, write pointer
> > positions of all zones(excluded for now)
> >     - can track the zone state we need: zone is full or not.
> >
> > More specifically, the function prototypes for 2a are as follows:
> >
> > int zbd_report_zones(int fd, off_t offset, off_t len, enum
> > zbd_report_option ro, struct zbd_zone *zones, unsigned int
> > *nr_zones);
>
> The current virtio zone specification draft does not have a reporting
> option field for the report zones command. However, it has a "partial"
> field that will need to be passed to this function so that the correct
> report zones reply can be built by the driver.
>
> > int zbd_reset_zones(int fd, off_t offset, off_t len);
> > int zbd_open_zones(int fd, off_t offset, off_t len);
> > int zbd_close_zones(int fd, off_t offset, off_t len);
> > int zbd_finish_zones(int fd, off_t offset, off_t len);
>
> These 4 functions have the exact same signature, modulo the function name.
> So we should simplify here to reduce the number of functions. Something like:
>
> int zbd_zone_mgmt(int fd, enum zone_op op, off_t offset, off_t len);
>
> where op can be BDRV_ZONE_RESET, BDRV_ZONE_OPEN, etc can aggregate all 4
> functions into one.
>

Thanks for the suggestions. It would be better to use one general
function supporting the four operations rather than four separate functions.

> In any case, if you look at how block drivers are defined (an example is
> the one for raw files in qemu/block/file-posix.c) using the BlockDriver
> data type (defined as a struct in qemu/include/block/block_int-common.h),
> you will see that the driver callback functions do not use a file
> descriptor (fd) but a pointer to some data structure (most of the time a
> BlockDriverState pointer).
>

Yes, I'll use it instead.

Meanwhile, the BlockDriver callbacks can be (with Damien's suggestion):
virtio_blk_handle_zone_mngmt() -> blk_aio_zone_mngmt() ->
blk_aio_prwv() + blk_aio_zone_mngmt_entry() -> bdrv_co_do_zone_mngmt() ->
bdrv_co_zone_mngmt().

I am still getting familiar with the BlockDriver structures, so the
above may need a second thought.


> Another thing: you will need one more callback to get the device
> information related to zones: maximum number of open and active zones at
> least (the number of zones can be discovered with a report zones call).
>

Is this callback zbd_get_sysfs_attrr(char *devname, const char *attr,
char *str, int str_len) in libzbd?

Thanks for reviewing! Have a good night.

Sam

> Cheers.
>
> --
> Damien Le Moal
> Western Digital Research



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-01 10:19       ` Sam Li
@ 2022-06-01 11:28         ` Stefan Hajnoczi
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2022-06-01 11:28 UTC (permalink / raw)
  To: Sam Li
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On Wed, 1 Jun 2022 at 11:19, Sam Li <faithilikerun@gmail.com> wrote:
> Damien Le Moal <damien.lemoal@opensource.wdc.com> 于2022年6月1日周三 13:47写道:
> > On 6/1/22 11:57, Sam Li wrote:
> > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > For the zone struct: You may need to add a read-write lock per zone to be
> > able to write lock zones to ensure a sequential write pattern (virtio
> > devices can be multi-queue and so writes may be coming in from different
> > contexts) and to correctly emulate zone append operations with an atomic
> > update of the wp field.
> >
>
> Yes, I haven't thought through the thread-safety problem but I'll come
> up with an approach.

Operations in the I/O code path (as opposed to the control/management
code path) will probably be declared as coroutine_fn, which means that
they execute in a coroutine and can yield back to the event loop when
waiting for something to happen.

Coroutines can use CoMutex locks to serialize execution. This way only
one write request will be in flight at a time and the write pointer
can be updated atomically.

Here is a sketch of what the block/file-posix.c driver's write append
function would look like:

static int coroutine_fn raw_write_append_zone(BlockDriverState *bs,
uint64_t offset, uint64_t bytes, QEMUIOVector *qiov)
{
    BDRVRawState *s = bs->opaque;
    RawPosixAIOData acb = ...; /* fill in aio request state */

    /* Serialize append requests */
    QEMU_LOCK_GUARD(&s->append_locks[offset_to_lock_idx(offset)]);
    return raw_thread_pool_submit(bs, handle_aiocb_append_zone, &acb);
}

The actual system call runs in a thread pool worker function
handle_aiocb_append_zone() that performs the write on the underlying
fd and updates the write pointer if the syscall succeeds.

Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-01  5:47     ` Damien Le Moal
  2022-06-01 10:19       ` Sam Li
@ 2022-06-01 11:43       ` Stefan Hajnoczi
  2022-06-02  5:43         ` Sam Li
  1 sibling, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2022-06-01 11:43 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Sam Li, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
<damien.lemoal@opensource.wdc.com> wrote:
>
> On 6/1/22 11:57, Sam Li wrote:
> > Hi Stefan,
> >
> > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> >
> >
> >>
> >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> >>>
> >>> Hi everyone,
> >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> >>> device support to QEMU's virtio-blk emulation.
> >>>
> >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> >>> ioctls, I think the naive approach would be to introduce a new stable
> >>> struct zbd_zone descriptor for the library function interface. More
> >>> specifically, what I'd like to add to the BlockDriver struct are:
> >>> 1. zbd_info as zone block device information: includes numbers of
> >>> zones, size of logical blocks, and physical blocks.
> >>> 2. zbd_zone_type and zbd_zone_state
> >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> >>> With those basic structs, we can start to implement new functions as
> >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> >>>
> >>> I'll start to finish this task based on the above description. If
> >>> there is any problem or something I may miss in the design, please let
> >>> me know.
> >>
> >> Hi Sam,
> >> Can you propose function prototypes for the new BlockDriver callbacks
> >> needed for zoned devices?
> >
> > I have made some modifications based on Damien's device in design part
> > 1 and added the function prototypes in design part 2. If there is any
> > problem or part I missed, please let me know.
> >
> > Design of Block Layer APIs in BlockDriver:
> > 1. introduce a new stable struct zbd_zone descriptor for the library
> > function interface.
> >   a. zbd_info as zone block device information: includes numbers of
> > zones, size of blocks, write granularity in byte(minimal write size
> > and alignment
> >     - write granularity: 512e SMRs: writes in units of physical block
> > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > size.
> >     - zone descriptor: start, length, capacity, write pointer, zone type
> >   b. zbd_zone_type
> >     - zone type: conventional, sequential write required, sequential
> > write preferred
> >   c. zbd_dev_model: host-managed zbd, host-aware zbd
>
> This explanation is a little hard to understand. It seems to be mixing up
> device level information and per-zone information. I think it would be a
> lot simpler to write a struct definition to directly illustrate what you
> are planning.
>
> It is something like this ?
>
> struct zbd_zone {
>         enum zone_type  type;
>         enum zone_cond  cond;
>         uint64_t        start;
>         uint32_t        length;
>         uint32_t        cap;
>         uint64_t        wp;
> };
>
> > struct zbd_dev {
>         enum zone_model model;
>         uint32_t        block_size;
>         uint32_t        write_granularity;
>         uint32_t        nr_zones
>         struct zbd_zone *zones; /* array of zones */
> };
>
> If yes, then my comments are as follows.
>
> For the device struct: It may be good to have also the maximum number of
> open zones and the maximum number of active zones.
>
> For the zone struct: You may need to add a read-write lock per zone to be
> able to write lock zones to ensure a sequential write pattern (virtio
> devices can be multi-queue and so writes may be coming in from different
> contexts) and to correctly emulate zone append operations with an atomic
> update of the wp field.
>
> These need to be integrated into the generic block driver interface in
> include/block/block_int-common.h or include/block/block-common.h.

QEMU's block layer has a few ways of exposing information about block devices:

    int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
    ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
Error **errp);

These fetch information from the BlockDriver and are good when a small
amount of data is reported occasionally and consumed by the caller.

For data that is continuously accessed or that could be large, it may
be necessary for the data to reside inside BlockDriverState so that it
can be accessed in place (without copying):

    void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);

QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
is continuously accessed by the block layer while processing I/O
requests. The "refresh" function updates the data in case the
underlying storage device has changed somehow. If no update function
is necessary then data can simply be populated during .bdrv_open() and
no new BlockDriver callback needs to be added.

So in the simplest case BlockDriverState can be extended with a struct
zbd_dev field that is populated during .bdrv_open(). If the
BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
or the model field indicates that this is not a zoned storage device.
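
For example (only to sketch the idea, not actual QEMU code):

/* in struct BlockDriverState */
struct zbd_dev zbd;   /* zbd.model / zbd.nr_zones say "not zoned" by default */

/* in block/file-posix.c, during .bdrv_open(): probe the host device, e.g.
 * via sysfs (queue/zoned, queue/nr_zones, queue/max_open_zones, ...) or the
 * BLKGETZONESZ / BLKGETNRZONES ioctls, and fill in bs->zbd accordingly. */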

However, a BlockBackend (not BlockDriverState!) API will be needed to
expose this data to users like the hw/block/virtio-blk.c emulation
code or the qemu-io-cmds.c utility that can be used for testing. A
BlockBackend has a root pointer to a BlockDriverState graph (for
example, qcow2 on top of file-posix). It will be necessary to
propagate zoned storage information from the leaf BlockDriverState all
the way up to the BlockBackend. In simple cases the BB root points
directly to the file-posix BDS that has Linux ZBD support but the
design needs to account for additional BDS graph nodes.

In order to make progress on this interface I suggest looking at the
virtio-blk spec extension for zoned storage and thinking what the
BlockBackend API should look like that hw/block/virtio-blk.c will use.
Then it may be easier to decide how to report zone information from
BlockDriverState.

Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-01 11:43       ` Stefan Hajnoczi
@ 2022-06-02  5:43         ` Sam Li
  2022-06-02  8:05           ` Stefan Hajnoczi
  0 siblings, 1 reply; 14+ messages in thread
From: Sam Li @ 2022-06-02  5:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

Hi Stefan,

Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月1日周三 19:43写道:
>
> On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> <damien.lemoal@opensource.wdc.com> wrote:
> >
> > On 6/1/22 11:57, Sam Li wrote:
> > > Hi Stefan,
> > >
> > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> > >
> > >
> > >>
> > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > >>>
> > >>> Hi everyone,
> > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > >>> device support to QEMU's virtio-blk emulation.
> > >>>
> > >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> > >>> ioctls, I think the naive approach would be to introduce a new stable
> > >>> struct zbd_zone descriptor for the library function interface. More
> > >>> specifically, what I'd like to add to the BlockDriver struct are:
> > >>> 1. zbd_info as zone block device information: includes numbers of
> > >>> zones, size of logical blocks, and physical blocks.
> > >>> 2. zbd_zone_type and zbd_zone_state
> > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > >>> With those basic structs, we can start to implement new functions as
> > >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> > >>>
> > >>> I'll start to finish this task based on the above description. If
> > >>> there is any problem or something I may miss in the design, please let
> > >>> me know.
> > >>
> > >> Hi Sam,
> > >> Can you propose function prototypes for the new BlockDriver callbacks
> > >> needed for zoned devices?
> > >
> > > I have made some modifications based on Damien's device in design part
> > > 1 and added the function prototypes in design part 2. If there is any
> > > problem or part I missed, please let me know.
> > >
> > > Design of Block Layer APIs in BlockDriver:
> > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > function interface.
> > >   a. zbd_info as zone block device information: includes numbers of
> > > zones, size of blocks, write granularity in byte(minimal write size
> > > and alignment
> > >     - write granularity: 512e SMRs: writes in units of physical block
> > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > > size.
> > >     - zone descriptor: start, length, capacity, write pointer, zone type
> > >   b. zbd_zone_type
> > >     - zone type: conventional, sequential write required, sequential
> > > write preferred
> > >   c. zbd_dev_model: host-managed zbd, host-aware zbd
> >
> > This explanation is a little hard to understand. It seems to be mixing up
> > device level information and per-zone information. I think it would be a
> > lot simpler to write a struct definition to directly illustrate what you
> > are planning.
> >
> > It is something like this ?
> >
> > struct zbd_zone {
> >         enum zone_type  type;
> >         enum zone_cond  cond;
> >         uint64_t        start;
> >         uint32_t        length;
> >         uint32_t        cap;
> >         uint64_t        wp;
> > };
> >
> > struct zbd_dev {
> >         enum zone_model model;
> >         uint32_t        block_size;
> >         uint32_t        write_granularity;
> >         uint32_t        nr_zones
> >         struct zbd_zone *zones; /* array of zones */
> > };
> >
> > If yes, then my comments are as follows.
> >
> > For the device struct: It may be good to have also the maximum number of
> > open zones and the maximum number of active zones.
> >
> > For the zone struct: You may need to add a read-write lock per zone to be
> > able to write lock zones to ensure a sequential write pattern (virtio
> > devices can be multi-queue and so writes may be coming in from different
> > contexts) and to correctly emulate zone append operations with an atomic
> > update of the wp field.
> >
> > These need to be integrated into the generic block driver interface in
> > include/block/block_int-common.h or include/block/block-common.h.
>
> QEMU's block layer has a few ways of exposing information about block devices:
>
>     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
>     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> Error **errp);
>
> These fetch information from the BlockDriver and are good when a small
> amount of data is reported occassionally and consumed by the caller.
>
> For data that is continuously accessed or that could be large, it may
> be necessary for the data to reside inside BlockDriverState so that it
> can be accessed in place (without copying):
>
>     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
>
> QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> is continuously accessed by the block layer while processing I/O
> requests. The "refresh" function updates the data in case the
> underlying storage device has changed somehow. If no update function
> is necessary then data can simply be populated during .bdrv_open() and
> no new BlockDriver callback needs to be added.
>
> So in the simplest case BlockDriverState can be extended with a struct
> zbd_dev field that is populated during .bdrv_open(). If the
> BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> or the model field indicates that this is not a zoned storage device.
>
> However, a BlockBackend (not BlockDriverState!) API will be needed to
> expose this data to users like the hw/block/virtio-blk.c emulation
> code or the qemu-io-cmds.c utility that can be used for testing. A
> BlockBackend has a root pointer to a BlockDriverState graph (for
> example, qcow2 on top of file-posix). It will be necessary to
> propagate zoned storage information from the leaf BlockDriverState all
> the way up to the BlockBackend. In simple cases the BB root points
> directly to the file-posix BDS that has Linux ZBD support but the
> design needs to account for additional BDS graph nodes.

I think a simple way to think about the BlockBackend APIs is the
following call chain: blk_aio_zone_mgmt() -> blk_aio_prwv() +
blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
The last function will call bdrv_co_zone_mgmt() in
block/file-posix.c. If I understand the additional case correctly, the
BlockBackend API can then expose the zone information to the virtio-blk
emulation.

Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c,
which leads to include/block/block-io.h, we may need to consider the case
of calling the block layer API from a non-coroutine context. Meanwhile,
using bdrv_co_writev()/bdrv_co_readv() instead of a read-write lock per
zone may be an option too.

If there is any problem, please let me know.

Best regards,
Sam

>
> In order to make progress on this interface I suggest looking at the
> virtio-blk spec extension for zoned storage and thinking what the
> BlockBackend API should look like that hw/block/virtio-blk.c will use.
> Then it may be easier to decide how to report zone information from
> BlockDriverState.
>
>
> Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-02  5:43         ` Sam Li
@ 2022-06-02  8:05           ` Stefan Hajnoczi
  2022-06-02 10:28             ` Sam Li
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2022-06-02  8:05 UTC (permalink / raw)
  To: Sam Li
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilikerun@gmail.com> wrote:
>
> Hi Stefan,
>
> Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月1日周三 19:43写道:
> >
> > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > <damien.lemoal@opensource.wdc.com> wrote:
> > >
> > > On 6/1/22 11:57, Sam Li wrote:
> > > > Hi Stefan,
> > > >
> > > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> > > >
> > > >
> > > >>
> > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > > >>>
> > > >>> Hi everyone,
> > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > >>> device support to QEMU's virtio-blk emulation.
> > > >>>
> > > >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> > > >>> ioctls, I think the naive approach would be to introduce a new stable
> > > >>> struct zbd_zone descriptor for the library function interface. More
> > > >>> specifically, what I'd like to add to the BlockDriver struct are:
> > > >>> 1. zbd_info as zone block device information: includes numbers of
> > > >>> zones, size of logical blocks, and physical blocks.
> > > >>> 2. zbd_zone_type and zbd_zone_state
> > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > >>> With those basic structs, we can start to implement new functions as
> > > >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> > > >>>
> > > >>> I'll start to finish this task based on the above description. If
> > > >>> there is any problem or something I may miss in the design, please let
> > > >>> me know.
> > > >>
> > > >> Hi Sam,
> > > >> Can you propose function prototypes for the new BlockDriver callbacks
> > > >> needed for zoned devices?
> > > >
> > > > I have made some modifications based on Damien's device in design part
> > > > 1 and added the function prototypes in design part 2. If there is any
> > > > problem or part I missed, please let me know.
> > > >
> > > > Design of Block Layer APIs in BlockDriver:
> > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > function interface.
> > > >   a. zbd_info as zone block device information: includes numbers of
> > > > zones, size of blocks, write granularity in byte(minimal write size
> > > > and alignment
> > > >     - write granularity: 512e SMRs: writes in units of physical block
> > > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > > > size.
> > > >     - zone descriptor: start, length, capacity, write pointer, zone type
> > > >   b. zbd_zone_type
> > > >     - zone type: conventional, sequential write required, sequential
> > > > write preferred
> > > >   c. zbd_dev_model: host-managed zbd, host-aware zbd
> > >
> > > This explanation is a little hard to understand. It seems to be mixing up
> > > device level information and per-zone information. I think it would be a
> > > lot simpler to write a struct definition to directly illustrate what you
> > > are planning.
> > >
> > > It is something like this ?
> > >
> > > struct zbd_zone {
> > >         enum zone_type  type;
> > >         enum zone_cond  cond;
> > >         uint64_t        start;
> > >         uint32_t        length;
> > >         uint32_t        cap;
> > >         uint64_t        wp;
> > > };
> > >
> > > struct zbd_dev {
> > >         enum zone_model model;
> > >         uint32_t        block_size;
> > >         uint32_t        write_granularity;
> > >         uint32_t        nr_zones
> > >         struct zbd_zone *zones; /* array of zones */
> > > };
> > >
> > > If yes, then my comments are as follows.
> > >
> > > For the device struct: It may be good to have also the maximum number of
> > > open zones and the maximum number of active zones.
> > >
> > > For the zone struct: You may need to add a read-write lock per zone to be
> > > able to write lock zones to ensure a sequential write pattern (virtio
> > > devices can be multi-queue and so writes may be coming in from different
> > > contexts) and to correctly emulate zone append operations with an atomic
> > > update of the wp field.
> > >
> > > These need to be integrated into the generic block driver interface in
> > > include/block/block_int-common.h or include/block/block-common.h.
> >
> > QEMU's block layer has a few ways of exposing information about block devices:
> >
> >     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> >     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > Error **errp);
> >
> > These fetch information from the BlockDriver and are good when a small
> > amount of data is reported occassionally and consumed by the caller.
> >
> > For data that is continuously accessed or that could be large, it may
> > be necessary for the data to reside inside BlockDriverState so that it
> > can be accessed in place (without copying):
> >
> >     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> >
> > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > is continuously accessed by the block layer while processing I/O
> > requests. The "refresh" function updates the data in case the
> > underlying storage device has changed somehow. If no update function
> > is necessary then data can simply be populated during .bdrv_open() and
> > no new BlockDriver callback needs to be added.
> >
> > So in the simplest case BlockDriverState can be extended with a struct
> > zbd_dev field that is populated during .bdrv_open(). If the
> > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > or the model field indicates that this is not a zoned storage device.
> >
> > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > expose this data to users like the hw/block/virtio-blk.c emulation
> > code or the qemu-io-cmds.c utility that can be used for testing. A
> > BlockBackend has a root pointer to a BlockDriverState graph (for
> > example, qcow2 on top of file-posix). It will be necessary to
> > propagate zoned storage information from the leaf BlockDriverState all
> > the way up to the BlockBackend. In simple cases the BB root points
> > directly to the file-posix BDS that has Linux ZBD support but the
> > design needs to account for additional BDS graph nodes.
>
> I think a simple way to think BlockBackend APIs is to use following
> callbacks: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> The last function call will call bdrv_co_zone_mgmt() in
> block/file-posix.c. If I understand the additional case correctly, the
> BlockBackend API can expose the zone information to the virtio-blk
> emulation now.

Yes!

block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
raw-format.c driver usually sits on top of file-posix.c and has to
pass through requests.

There are filter block drivers like block/throttle.c,
block/blkdebug.c, etc (git grep is_filter block/) that will also need
to be modified to pass through requests in the same way.
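
The pass-through itself is mostly mechanical, something along these lines in
block/raw-format.c (sketch only, reusing the hypothetical BlockZoneOp /
bdrv_co_zone_mgmt() names from earlier in the thread):

static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
                                         int64_t offset, int64_t len)
{
    /* raw-format adds nothing of its own, just forward to the child node */
    return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
}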

Based on what I've read in Dmitry's virtio-blk spec proposal, the
BlockBackend API could look something like:

typedef struct { ...model, zone_sectors, max_open_zones, etc... } BlockZoneInfo;
void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);

virtio-blk.c calls this to fill out configuration space fields and
determine whether the BlockBackend is a zoned device.

Then there are 3 commands that happen in the I/O code path:

typedef struct { ... } BlockZoneDescriptor;
BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
BlockZoneDescriptor *zones, size_t max_zones, BlockCompletionFunc *cb,
void *opaque);

typedef enum { ... } BlockZoneMgmtCmd;
BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb, void
*opaque);

typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
int64_t new_wp);
BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb, void *opaque);

> Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c
> which lead to include/block/block-io.h, we may need consider the case
> when calling block layer API from non-coroutine context. Meanwhile,
> using bdrv_co_writev()/bdrv_co_readv() instead of read-write lock per
> zone may be a option too.

Yes, device emulation code usually uses the aio versions of the
BlockBackend I/O functions (read, write, flush). The QEMU block layer
runs the aio I/O request inside a coroutine and usually also exposes
coroutine versions of the same functions. For example, block jobs
(e.g. storage mirroring, backup, and migration background tasks)
usually call the coroutine versions of the BlockBackend APIs instead
of the aio ones.

qemu-io-cmds.c will want synchronous versions of the aio commands
(blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
block until the command completes. This is because the qemu-io utility
typically executes one command at a time and it's written mostly in
blocking style rather than async callbacks or coroutines.
docs/devel/block-coroutine-wrapper.rst describes how to generate
synchronous versions of coroutine functions.

Do you want to start implementing blk_get_zone_info()? This will
require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
functions.
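
Just to make the layering concrete, the call chain for that first function
could be roughly as follows (names as proposed above; bdrv_get_zone_info() is
a hypothetical middle layer, and error handling is omitted):

/* BlockBackend layer, block/block-backend.c */
void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info)
{
    bdrv_get_zone_info(blk_bs(blk), info);
}

/* generic block layer */
void bdrv_get_zone_info(BlockDriverState *bs, BlockZoneInfo *info)
{
    BlockDriver *drv = bs->drv;

    if (drv && drv->bdrv_get_zone_info) {
        drv->bdrv_get_zone_info(bs, info);       /* e.g. block/file-posix.c */
    } else if (bs->file) {
        bdrv_get_zone_info(bs->file->bs, info);  /* walk down towards the leaf */
    } else {
        memset(info, 0, sizeof(*info));          /* not a zoned device */
    }
}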

Stefan



* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-02  8:05           ` Stefan Hajnoczi
@ 2022-06-02 10:28             ` Sam Li
  2022-06-02 19:36               ` Stefan Hajnoczi
  0 siblings, 1 reply; 14+ messages in thread
From: Sam Li @ 2022-06-02 10:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月2日周四 16:05写道:
>
> On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilikerun@gmail.com> wrote:
> >
> > Hi Stefan,
> >
> > Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月1日周三 19:43写道:
> > >
> > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > > <damien.lemoal@opensource.wdc.com> wrote:
> > > >
> > > > On 6/1/22 11:57, Sam Li wrote:
> > > > > Hi Stefan,
> > > > >
> > > > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> > > > >
> > > > >
> > > > >>
> > > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > > > >>>
> > > > >>> Hi everyone,
> > > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > > >>> device support to QEMU's virtio-blk emulation.
> > > > >>>
> > > > >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> > > > >>> ioctls, I think the naive approach would be to introduce a new stable
> > > > >>> struct zbd_zone descriptor for the library function interface. More
> > > > >>> specifically, what I'd like to add to the BlockDriver struct are:
> > > > >>> 1. zbd_info as zone block device information: includes numbers of
> > > > >>> zones, size of logical blocks, and physical blocks.
> > > > >>> 2. zbd_zone_type and zbd_zone_state
> > > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > >>> With those basic structs, we can start to implement new functions as
> > > > >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> > > > >>>
> > > > >>> I'll start to finish this task based on the above description. If
> > > > >>> there is any problem or something I may miss in the design, please let
> > > > >>> me know.
> > > > >>
> > > > >> Hi Sam,
> > > > >> Can you propose function prototypes for the new BlockDriver callbacks
> > > > >> needed for zoned devices?
> > > > >
> > > > > I have made some modifications based on Damien's device in design part
> > > > > 1 and added the function prototypes in design part 2. If there is any
> > > > > problem or part I missed, please let me know.
> > > > >
> > > > > Design of Block Layer APIs in BlockDriver:
> > > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > > function interface.
> > > > >   a. zbd_info as zone block device information: includes numbers of
> > > > > zones, size of blocks, write granularity in byte(minimal write size
> > > > > and alignment
> > > > >     - write granularity: 512e SMRs: writes in units of physical block
> > > > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > > > > size.
> > > > >     - zone descriptor: start, length, capacity, write pointer, zone type
> > > > >   b. zbd_zone_type
> > > > >     - zone type: conventional, sequential write required, sequential
> > > > > write preferred
> > > > >   c. zbd_dev_model: host-managed zbd, host-aware zbd
> > > >
> > > > This explanation is a little hard to understand. It seems to be mixing up
> > > > device level information and per-zone information. I think it would be a
> > > > lot simpler to write a struct definition to directly illustrate what you
> > > > are planning.
> > > >
> > > > It is something like this ?
> > > >
> > > > struct zbd_zone {
> > > >         enum zone_type  type;
> > > >         enum zone_cond  cond;
> > > >         uint64_t        start;
> > > >         uint32_t        length;
> > > >         uint32_t        cap;
> > > >         uint64_t        wp;
> > > > };
> > > >
> > > > strcut zbd_dev {
> > > >         enum zone_model model;
> > > >         uint32_t        block_size;
> > > >         uint32_t        write_granularity;
> > > >         uint32_t        nr_zones
> > > >         struct zbd_zone *zones; /* array of zones */
> > > > };
> > > >
> > > > If yes, then my comments are as follows.
> > > >
> > > > For the device struct: It may be good to have also the maximum number of
> > > > open zones and the maximum number of active zones.
> > > >
> > > > For the zone struct: You may need to add a read-write lock per zone to be
> > > > able to write lock zones to ensure a sequential write pattern (virtio
> > > > devices can be multi-queue and so writes may be coming in from different
> > > > contexts) and to correctly emulate zone append operations with an atomic
> > > > update of the wp field.
> > > >
> > > > These need to be integrated into the generic block driver interface in
> > > > include/block/block_int-common.h or include/block/block-common.h.
> > >
> > > QEMU's block layer has a few ways of exposing information about block devices:
> > >
> > >     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> > >     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > > Error **errp);
> > >
> > > These fetch information from the BlockDriver and are good when a small
> > > amount of data is reported occassionally and consumed by the caller.
> > >
> > > For data that is continuously accessed or that could be large, it may
> > > be necessary for the data to reside inside BlockDriverState so that it
> > > can be accessed in place (without copying):
> > >
> > >     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> > >
> > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > > is continuously accessed by the block layer while processing I/O
> > > requests. The "refresh" function updates the data in case the
> > > underlying storage device has changed somehow. If no update function
> > > is necessary then data can simply be populated during .bdrv_open() and
> > > no new BlockDriver callback needs to be added.
> > >
> > > So in the simplest case BlockDriverState can be extended with a struct
> > > zbd_dev field that is populated during .bdrv_open(). If the
> > > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > > or the model field indicates that this is not a zoned storage device.
> > >
> > > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > > expose this data to users like the hw/block/virtio-blk.c emulation
> > > code or the qemu-io-cmds.c utility that can be used for testing. A
> > > BlockBackend has a root pointer to a BlockDriverState graph (for
> > > example, qcow2 on top of file-posix). It will be necessary to
> > > propagate zoned storage information from the leaf BlockDriverState all
> > > the way up to the BlockBackend. In simple cases the BB root points
> > > directly to the file-posix BDS that has Linux ZBD support but the
> > > design needs to account for additional BDS graph nodes.
> >
> > I think a simple way to think BlockBackend APIs is to use following
> > callbacks: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> > blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> > The last function call will call bdrv_co_zone_mgmt() in
> > block/file-posix.c. If I understand the additional case correctly, the
> > BlockBackend API can expose the zone information to the virtio-blk
> > emulation now.
>
> Yes!
>
> block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
> calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
> raw-format.c driver usually sits on top of file-posix.c and has to
> pass through requests.
>
> There are filter block drivers like block/throttle.c,
> block/blkdebug.c, etc (git grep is_filter block/) that will also need
> to be modified to pass through requests in the same way.
>

Are the filter block drivers also on top of file-posix.c but below
block-backend.c? I read that filter block drivers and format drivers
are designed as manageable pieces that make block device configuration
easier and clearer.

> Based on what I've read in Dmitry's virtio-blk spec proposal, the
> BlockBackend API could look something like:
>
> typedef struct { ...model, zone_sectors, max_open_zones, etc... } BlockZoneInfo;
> void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);
>
> virtio-blk.c calls this to fill out configuration space fields and
> determine whether the BlockBackend is a zoned device.
>
> Then there are 3 commands that happen in the I/O code path:
>
> typedef struct { ... } BlockZoneDescriptor;
> BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> BlockZoneDescriptor *zones, size_t max_zones, BlockCompletionFunc *cb,
> void *opaque);
>
> typedef enum { ... } BlockZoneMgmtCmd;
> BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
> BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb, void
> *opaque);
>
> typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
> int64_t new_wp);
> BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
> QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb, void *opaque);
>
> > Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c
> > which lead to include/block/block-io.h, we may need consider the case
> > when calling block layer API from non-coroutine context. Meanwhile,
> > using bdrv_co_writev()/bdrv_co_readv() instead of read-write lock per
> > zone may be a option too.
>
> Yes, device emulation code usually uses the aio versions of the
> BlockBackend I/O functions (read, write, flush). The QEMU block layer
> runs the aio I/O request inside a coroutine and usually also exposes
> coroutine versions of the same functions. For example, block jobs
> (e.g. storage mirroring, backup, and migration background tasks)
> usually call the coroutine versions of the BlockBackend APIs instead
> of the aio ones.
>
> qemu-io-cmds.c will want synchronous versions of the aio commands
> (blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
> block until the command completes. This is because the qemu-io utility
> typically executes one command at a time and it's written mostly in
> blocking style rather than async callbacks or coroutines.
> docs/devel/block-coroutine-wrapper.rst describes how to generate
> synchronous versions of coroutine functions.
>
> Do you want to start implementing blk_get_zone_info()? This will
> require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
> functions.

I want to implement the smallest part that can be tested first and
then move on to the next part. I'd like to test the zone report
operation first. Does the qemu-iotests test require the following
parts to work: bdrv_co_zone_report() in file-posix.c,
blk_get_zone_info() in block-backend.c, blk_aio_zone_report() in the
I/O code path, plus modifying a test in tests/qemu-iotests? If it
does, then yes.

Have a nice day!

Sam
>
> Stefan


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
  2022-06-02 10:28             ` Sam Li
@ 2022-06-02 19:36               ` Stefan Hajnoczi
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2022-06-02 19:36 UTC (permalink / raw)
  To: Sam Li
  Cc: Damien Le Moal, Dmitry Fomichev, Hannes Reinecke, qemu-devel, qemu-block

On Thu, 2 Jun 2022 at 11:28, Sam Li <faithilikerun@gmail.com> wrote:
>
> Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月2日周四 16:05写道:
> >
> > On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilikerun@gmail.com> wrote:
> > >
> > > Hi Stefan,
> > >
> > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年6月1日周三 19:43写道:
> > > >
> > > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > > > <damien.lemoal@opensource.wdc.com> wrote:
> > > > >
> > > > > On 6/1/22 11:57, Sam Li wrote:
> > > > > > Hi Stefan,
> > > > > >
> > > > > > Stefan Hajnoczi <stefanha@gmail.com> 于2022年5月30日周一 19:19写道:
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > > > > >>>
> > > > > >>> Hi everyone,
> > > > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > > > >>> device support to QEMU's virtio-blk emulation.
> > > > > >>>
> > > > > >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> > > > > >>> ioctls, I think the naive approach would be to introduce a new stable
> > > > > >>> struct zbd_zone descriptor for the library function interface. More
> > > > > >>> specifically, what I'd like to add to the BlockDriver struct are:
> > > > > >>> 1. zbd_info as zone block device information: includes numbers of
> > > > > >>> zones, size of logical blocks, and physical blocks.
> > > > > >>> 2. zbd_zone_type and zbd_zone_state
> > > > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > > >>> With those basic structs, we can start to implement new functions as
> > > > > >>> bdrv*() APIs for BLOCK*ZONE ioctls.
> > > > > >>>
> > > > > >>> I'll start to finish this task based on the above description. If
> > > > > >>> there is any problem or something I may miss in the design, please let
> > > > > >>> me know.
> > > > > >>
> > > > > >> Hi Sam,
> > > > > >> Can you propose function prototypes for the new BlockDriver callbacks
> > > > > >> needed for zoned devices?
> > > > > >
> > > > > > I have made some modifications based on Damien's device in design part
> > > > > > 1 and added the function prototypes in design part 2. If there is any
> > > > > > problem or part I missed, please let me know.
> > > > > >
> > > > > > Design of Block Layer APIs in BlockDriver:
> > > > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > > > function interface.
> > > > > >   a. zbd_info as zone block device information: includes numbers of
> > > > > > zones, size of blocks, write granularity in byte(minimal write size
> > > > > > and alignment
> > > > > >     - write granularity: 512e SMRs: writes in units of physical block
> > > > > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block
> > > > > > size.
> > > > > >     - zone descriptor: start, length, capacity, write pointer, zone type
> > > > > >   b. zbd_zone_type
> > > > > >     - zone type: conventional, sequential write required, sequential
> > > > > > write preferred
> > > > > >   c. zbd_dev_model: host-managed zbd, host-aware zbd
> > > > >
> > > > > This explanation is a little hard to understand. It seems to be mixing up
> > > > > device level information and per-zone information. I think it would be a
> > > > > lot simpler to write a struct definition to directly illustrate what you
> > > > > are planning.
> > > > >
> > > > > It is something like this ?
> > > > >
> > > > > struct zbd_zone {
> > > > >         enum zone_type  type;
> > > > >         enum zone_cond  cond;
> > > > >         uint64_t        start;
> > > > >         uint32_t        length;
> > > > >         uint32_t        cap;
> > > > >         uint64_t        wp;
> > > > > };
> > > > >
> > > > > strcut zbd_dev {
> > > > >         enum zone_model model;
> > > > >         uint32_t        block_size;
> > > > >         uint32_t        write_granularity;
> > > > >         uint32_t        nr_zones
> > > > >         struct zbd_zone *zones; /* array of zones */
> > > > > };
> > > > >
> > > > > If yes, then my comments are as follows.
> > > > >
> > > > > For the device struct: It may be good to have also the maximum number of
> > > > > open zones and the maximum number of active zones.
> > > > >
> > > > > For the zone struct: You may need to add a read-write lock per zone to be
> > > > > able to write lock zones to ensure a sequential write pattern (virtio
> > > > > devices can be multi-queue and so writes may be coming in from different
> > > > > contexts) and to correctly emulate zone append operations with an atomic
> > > > > update of the wp field.
> > > > >
> > > > > These need to be integrated into the generic block driver interface in
> > > > > include/block/block_int-common.h or include/block/block-common.h.
> > > >
> > > > QEMU's block layer has a few ways of exposing information about block devices:
> > > >
> > > >     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> > > >     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > > > Error **errp);
> > > >
> > > > These fetch information from the BlockDriver and are good when a small
> > > > amount of data is reported occassionally and consumed by the caller.
> > > >
> > > > For data that is continuously accessed or that could be large, it may
> > > > be necessary for the data to reside inside BlockDriverState so that it
> > > > can be accessed in place (without copying):
> > > >
> > > >     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> > > >
> > > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > > > is continuously accessed by the block layer while processing I/O
> > > > requests. The "refresh" function updates the data in case the
> > > > underlying storage device has changed somehow. If no update function
> > > > is necessary then data can simply be populated during .bdrv_open() and
> > > > no new BlockDriver callback needs to be added.
> > > >
> > > > So in the simplest case BlockDriverState can be extended with a struct
> > > > zbd_dev field that is populated during .bdrv_open(). If the
> > > > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > > > or the model field indicates that this is not a zoned storage device.
> > > >
> > > > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > > > expose this data to users like the hw/block/virtio-blk.c emulation
> > > > code or the qemu-io-cmds.c utility that can be used for testing. A
> > > > BlockBackend has a root pointer to a BlockDriverState graph (for
> > > > example, qcow2 on top of file-posix). It will be necessary to
> > > > propagate zoned storage information from the leaf BlockDriverState all
> > > > the way up to the BlockBackend. In simple cases the BB root points
> > > > directly to the file-posix BDS that has Linux ZBD support but the
> > > > design needs to account for additional BDS graph nodes.
> > >
> > > I think a simple way to think BlockBackend APIs is to use following
> > > callbacks: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> > > blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> > > The last function call will call bdrv_co_zone_mgmt() in
> > > block/file-posix.c. If I understand the additional case correctly, the
> > > BlockBackend API can expose the zone information to the virtio-blk
> > > emulation now.
> >
> > Yes!
> >
> > block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
> > calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
> > raw-format.c driver usually sits on top of file-posix.c and has to
> > pass through requests.
> >
> > There are filter block drivers like block/throttle.c,
> > block/blkdebug.c, etc (git grep is_filter block/) that will also need
> > to be modified to pass through requests in the same way.
> >
>
> Are the filter block drivers also on top of file-posix.c but below
> block-backend.c? I read that filter block drivers and format drivers
> are designed as manageable pieces that make block device configuration
> easier and clearer.

Yes, the filters are BlockDrivers and each instance has a
BlockDriverState. They are part of the same BlockDriverState graph as
file-posix.c.

BlockBackend has a "root" BlockDriverState pointer. It points to a
graph of BlockDriverStates and there are 3 types of nodes:
- Filter nodes like block/throttle.c that provide some extra
functionality like I/O throttling.
- Format nodes like qcow2, raw, or vmdk that implement disk image file formats.
- Protocol nodes like file-posix, iSCSI, etc that implement access to
underlying storage.

The graph is pretty flexible. It's possible to insert/remove nodes to
construct arbitrary graphs.

Protocol nodes are the leaf nodes in the graph. Filter and format
nodes are above protocol nodes.

If we want to open /dev/nullb0 and limit I/O rates to 10 MB/s the
graph would be:
throttle (10 MB/s) -> raw-format -> file-posix (/dev/nullb0)

The BlockBackend root would point at the throttle node. I/O requests
made using the blk_*() APIs are forwarded to the throttle node, which
forwards them to the raw-format node, which in turn forwards them to
the file-posix node; each hop uses the bdrv_*() APIs. Here is
block/raw-format.c's preadv implementation:
  static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
                                        int64_t bytes, QEMUIOVector *qiov,
                                        BdrvRequestFlags flags)
  {
      int ret;

      ret = raw_adjust_offset(bs, &offset, bytes, false);
      if (ret) {
          return ret;
      }

      BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
      /* forward the I/O request to the child graph node (bs->file) */
      return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
  }
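
A zone command would pass through the graph in the same way. For
example, a raw-format passthrough for zone report could look roughly
like this (assuming the bdrv_co_zone_report()/BlockZoneDescriptor
names discussed earlier; the signature is only a sketch):

  /* block/raw-format.c (sketch, not actual code) */
  static int coroutine_fn raw_co_zone_report(BlockDriverState *bs,
                                             int64_t offset,
                                             unsigned int *nr_zones,
                                             BlockZoneDescriptor *zones)
  {
      /* raw-format adds no zone semantics of its own, so just forward
       * the request to the child node (e.g. file-posix). A real
       * implementation would also apply raw_adjust_offset() first. */
      return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
  }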

>
> > Based on what I've read in Dmitry's virtio-blk spec proposal, the
> > BlockBackend API could look something like:
> >
> > typedef struct { ...model, zone_sectors, max_open_zones, etc... } BlockZoneInfo;
> > void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);
> >
> > virtio-blk.c calls this to fill out configuration space fields and
> > determine whether the BlockBackend is a zoned device.
> >
> > Then there are 3 commands that happen in the I/O code path:
> >
> > typedef struct { ... } BlockZoneDescriptor;
> > BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> > BlockZoneDescriptor *zones, size_t max_zones, BlockCompletionFunc *cb,
> > void *opaque);
> >
> > typedef enum { ... } BlockZoneMgmtCmd;
> > BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
> > BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb, void
> > *opaque);
> >
> > typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
> > int64_t new_wp);
> > BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
> > QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb, void *opaque);
> >
> > > Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c
> > > which lead to include/block/block-io.h, we may need consider the case
> > > when calling block layer API from non-coroutine context. Meanwhile,
> > > using bdrv_co_writev()/bdrv_co_readv() instead of read-write lock per
> > > zone may be a option too.
> >
> > Yes, device emulation code usually uses the aio versions of the
> > BlockBackend I/O functions (read, write, flush). The QEMU block layer
> > runs the aio I/O request inside a coroutine and usually also exposes
> > coroutine versions of the same functions. For example, block jobs
> > (e.g. storage mirroring, backup, and migration background tasks)
> > usually call the coroutine versions of the BlockBackend APIs instead
> > of the aio ones.
> >
> > qemu-io-cmds.c will want synchronous versions of the aio commands
> > (blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
> > block until the command completes. This is because the qemu-io utility
> > typically executes one command at a time and it's written mostly in
> > blocking style rather than async callbacks or coroutines.
> > docs/devel/block-coroutine-wrapper.rst describes how to generate
> > synchronous versions of coroutine functions.
> >
> > Do you want to start implementing blk_get_zone_info()? This will
> > require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
> > functions.
>
> I want to implement the smallest part that can be tested first and
> then move on to the next part. I'd like to test the zone report
> operation first. Does the qemu-iotests test require the following
> parts to work: bdrv_co_zone_report() in file-posix.c,
> blk_get_zone_info() in block-backend.c, blk_aio_zone_report() in the
> I/O code path, plus modifying a test in tests/qemu-iotests? If it
> does, then yes.

blk_aio_zone_report() can be implemented later. It is not needed by qemu-io.

blk_get_zone_info() will be needed soon but maybe you can skip it
while working on the first version of blk_co_zone_report().

The steps are as follows (a rough code sketch comes after the list):
1. Add a .bdrv_co_zone_report() callback to BlockDriver and define a
BlockZoneDescriptor struct.
2. Implement the .bdrv_co_zone_report() callback in block/file-posix.c
using ioctl(BLKREPORTZONE).
3. Implement bdrv_co_zone_report() in block/io.c. It calls
bs->drv->bdrv_co_zone_report() or returns -ENOTSUP if
bs->drv->bdrv_co_zone_report is NULL.
4. Implement blk_co_zone_report() in block/block-backend.c. It calls
bdrv_co_zone_report(blk->root, ...).
5. Generate a synchronous blk_zone_report() wrapper. See
docs/devel/block-coroutine-wrapper.rst.
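
Putting steps 1-4 together, a rough sketch of the resulting code (all
names, fields, and signatures are placeholders to show the shape of
the code, not the final API):

  /* Step 1: include/block/block-common.h */
  typedef struct BlockZoneDescriptor {
      uint64_t start;    /* offset of the zone */
      uint64_t length;   /* total zone size */
      uint64_t cap;      /* usable capacity */
      uint64_t wp;       /* write pointer */
      uint8_t  type;     /* conventional / sequential write required */
      uint8_t  state;    /* empty, open, closed, full, ... */
  } BlockZoneDescriptor;

  /* Step 1: include/block/block_int-common.h, in struct BlockDriver */
  int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,
                                          int64_t offset,
                                          unsigned int *nr_zones,
                                          BlockZoneDescriptor *zones);

  /* Step 2: block/file-posix.c fills *zones via ioctl(fd, BLKREPORTZONE) */

  /* Step 3: block/io.c */
  int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
                                       unsigned int *nr_zones,
                                       BlockZoneDescriptor *zones)
  {
      BlockDriver *drv = bs->drv;

      if (!drv || !drv->bdrv_co_zone_report) {
          return -ENOTSUP;
      }
      return drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
  }

  /* Step 4: block/block-backend.c */
  int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
                                      unsigned int *nr_zones,
                                      BlockZoneDescriptor *zones)
  {
      return bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
  }

Step 5 then declares blk_zone_report() with the same arguments and
lets the block-coroutine-wrapper generator produce the synchronous
version for qemu-io.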

You now have a working zone report command. It will work with
--blockdev file,filename=/dev/nullb0,node-name=blk0, which creates a
graph with just one block/file-posix.c BlockDriverState node. It won't
work with QEMU's older --drive
if=none,id=blk0,format=raw,file=/dev/nullb0 syntax because that
creates a raw-format -> file-posix graph and you haven't implemented
.bdrv_co_zone_report() in block/raw-format.c yet (but you can skip it
for now).

For testing you can add a qemu-io -c zone_report command to
qemu-io-cmds.c that calls blk_zone_report(). Then you can write a
tests/qemu-iotests/tests/zoned test script that reports zones using
qemu-io, writes to the first sectors of the disk using qemu-io, and
then reports zones again to prove that the output has changed. Use the
qemu-io --image-opts driver=host_device,filename=/dev/nullb0 option to
open a Linux null_blk device using the block/file-posix.c BlockDriver.
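
As an illustration, the qemu-io command could be wired up roughly like
this (following the cmdinfo_t conventions used in qemu-io-cmds.c; the
command name, output format, and fixed-size zone array are just a
sketch):

  /* qemu-io-cmds.c (sketch) */
  static int zone_report_f(BlockBackend *blk, int argc, char **argv)
  {
      int64_t offset = cvtnum(argv[1]);   /* error checking omitted */
      unsigned int nr_zones = 16;
      BlockZoneDescriptor zones[16];
      int ret;

      ret = blk_zone_report(blk, offset, &nr_zones, zones);
      if (ret < 0) {
          printf("zone report failed: %s\n", strerror(-ret));
          return ret;
      }
      for (unsigned int i = 0; i < nr_zones; i++) {
          printf("start 0x%" PRIx64 ", len 0x%" PRIx64 ", wp 0x%" PRIx64
                 ", type %d\n",
                 zones[i].start, zones[i].length, zones[i].wp,
                 zones[i].type);
      }
      return 0;
  }

  static const cmdinfo_t zone_report_cmd = {
      .name    = "zone_report",
      .altname = "zrp",
      .cfunc   = zone_report_f,
      .argmin  = 1,
      .argmax  = 1,
      .args    = "offset",
      .oneline = "report zone information starting at offset",
  };

The command also needs a qemuio_add_command(&zone_report_cmd) call so
that qemu-io registers it at startup.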

About qemu-iotests: the output from running the test case is diffed
against a reference file that contains the expected output. This is
quite convenient because you don't have to write code that checks for
the expected output, you just provide a
tests/qemu-iotests/tests/zoned.out file containing the output for a
passing test. There is documentation about qemu-iotests here:
https://qemu.readthedocs.io/en/latest/devel/testing.html#qemu-iotests

Stefan


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-06-02 19:37 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-30  5:09 Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls Sam Li
2022-05-30  5:38 ` Damien Le Moal
2022-05-30 11:21   ` Stefan Hajnoczi
2022-05-30 11:26     ` Damien Le Moal
2022-05-30 11:19 ` Stefan Hajnoczi
2022-06-01  2:57   ` Sam Li
2022-06-01  5:47     ` Damien Le Moal
2022-06-01 10:19       ` Sam Li
2022-06-01 11:28         ` Stefan Hajnoczi
2022-06-01 11:43       ` Stefan Hajnoczi
2022-06-02  5:43         ` Sam Li
2022-06-02  8:05           ` Stefan Hajnoczi
2022-06-02 10:28             ` Sam Li
2022-06-02 19:36               ` Stefan Hajnoczi
