[PATCH v2 0/2] zone-append support in io-uring and aio

From: Kanchan Joshi @ 2020-06-25 17:15 UTC
  To: axboe, viro, bcrl
  Cc: asml.silence, Damien.LeMoal, hch, linux-fsdevel, mb,
	linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1,
	nj.shetty, javier.gonz, Kanchan Joshi

[Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]

This patchset enables zone-append through io-uring and linux-aio on the block
IO path. The purpose is to let applications that use a zoned block device
directly issue zone-append.

The application may specify the RWF_ZONE_APPEND flag with a write when it wants
to send a zone-append. RWF_* flags work with a certain subset of APIs, e.g.
uring, aio, and pwritev2. An error is reported if zone-append is requested using
pwritev2. Supporting pwritev2 or any other sync write API is outside the scope
of this patchset, for reasons described later.

Zone-append completion result --->
With zone-append, the location where the write took place is known only after
completion. So apart from the usual return value of write, an additional means
is needed to obtain the actual written location.

In aio, this is returned to the application in the res2 field of io_event -

struct io_event {
        __u64           data;           /* the data field from the iocb */
        __u64           obj;            /* what iocb this event came from */
        __s64           res;            /* result code for this event */
        __s64           res2;           /* secondary result */
};

In io-uring, cqe->flags is repurposed for the zone-append result.

struct io_uring_cqe {
        __u64   user_data;      /* sqe->data submission passed back */
        __s32   res;            /* result code for this event */
        __u32   flags;
};

Since the 32-bit flags field cannot hold a full disk-relative offset, we choose
to return the zone-relative offset in sector (512b) units. This can cover the
zone size represented by chunk_sectors. Applications must combine this value
with the zone start to obtain the disk-relative offset. Obtaining more bits by
borrowing from the res field would compel applications to interpret res
differently, which seems more painful than the former option.
To keep uniformity, the zone-relative offset is returned with aio as well.
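
As a rough illustration of the arithmetic left to the application, the sketch
below turns a zone-relative result in 512-byte sectors into a disk-relative
byte offset. The helper names are hypothetical and not part of this patchset.
It assumes the caller already knows the start sector of the target zone (for
example from a zone report) and that a negative res carries an error.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, the unit named above */

/*
 * Hypothetical helper, not part of the patchset.
 *   res               - usual write result (cqe->res or io_event.res)
 *   zone_rel_sectors  - cqe->flags for io-uring, io_event.res2 for aio
 *   zone_start_sector - start of the target zone, in 512-byte sectors
 * Returns the disk-relative byte offset of the written data, or the
 * negative value of res unchanged when the write failed.
 */
static int64_t append_result_to_disk_bytes(int64_t res,
					   uint64_t zone_rel_sectors,
					   uint64_t zone_start_sector)
{
	uint64_t sector;

	if (res < 0)
		return res;

	sector = zone_start_sector + zone_rel_sectors;
	/* The byte offset must still fit in a signed 64-bit value. */
	assert(sector <= ((uint64_t)INT64_MAX >> SECTOR_SHIFT));
	return (int64_t)(sector << SECTOR_SHIFT);
}

/* Largest zone size, in bytes, that a 32-bit sector offset can address. */
static uint64_t max_zone_bytes_u32_sectors(void)
{
	return ((uint64_t)UINT32_MAX + 1) << SECTOR_SHIFT;
}
```

With 32 bits of sector offset the addressable zone size works out to 2^41
bytes (2 TiB), which is why the sector unit was preferred over a byte offset.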

Append using io_uring fixed-buffer --->
This is flagged as not supported at the moment. The reason is that, for
fixed buffers, io-uring sends an iov_iter of bvec type, but the current append
infrastructure in the block layer does not support such an iov_iter.

Block IO vs File IO --->
For now, the user zone-append interface is supported only for zoned block
devices. Regular files and regular block devices are not supported. A regular
file system (e.g. F2FS) will not need this anyway, because zone peculiarities
are abstracted within the FS. At this point, ZoneFS also prefers to use append
implicitly rather than explicitly. But if/when ZoneFS starts supporting
explicit/on-demand zone-append, the check that allows only block devices
should be changed.

Semantics --->
Zone-append, by its nature, may perform the write at a different location than
the one specified. It does not fit into POSIX, and trying to make it fit may
just undermine its benefit. It may be better to keep the semantics as close to
zone-append as possible, i.e. specify the zone-start location and obtain the
actual write location after completion. Towards that goal, the existing async
APIs seem to fit fine. Async APIs (uring, linux aio) do not work on an implicit
write pointer and demand an explicit write offset (which is what we need for
append). The write pointer is neither taken as input nor updated on completion,
and there is a clear way to get the zone-append result. Zone-aware applications
using these async APIs can be fine with, for lack of a better word, zone-append
semantics itself.
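
The semantics described above can be pictured with a toy in-memory model. This
is only a simulation for illustration: struct toy_zone and toy_zone_append()
are made-up names, and nothing here touches a real device or the interfaces in
this patchset. The caller names the zone by its start sector and learns the
actual location only from the result.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a single zone; all names are illustrative. */
struct toy_zone {
	uint64_t start;	/* zone start, in 512-byte sectors */
	uint64_t len;	/* zone length, in sectors */
	uint64_t wp;	/* zone-relative position of the next write */
};

/*
 * Model of an append: the caller passes the zone start as the offset and
 * the "device" picks the location, which is wherever the zone is filled to.
 * Returns 0 and stores the zone-relative location in *where, or -1 when
 * the offset is not the zone start or the zone lacks space.
 */
static int toy_zone_append(struct toy_zone *z, uint64_t offset,
			   uint64_t nr_sectors, uint64_t *where)
{
	if (offset != z->start || nr_sectors > z->len - z->wp)
		return -1;

	*where = z->wp;
	z->wp += nr_sectors;
	return 0;
}
```

Two appends issued with the same zone-start offset land at different
locations, and only the returned value tells the caller which is which.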

Sync APIs work with an implicit write pointer (at least a few of them do), and
there is no way to obtain the zone-append result, making user-space zone-append
hard to support.

Tests --->
Tested using the new interface in fio (uring and libaio engines) by extending
the zbd tests for zone-append: https://github.com/axboe/fio/pull/1026

Changes since v1:
- No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead.
- linux-aio changes are no longer needed because there is no new opcode
- Fixed the overflow and other issues mentioned by Damien
- Simplified uring support code, fixed the issues mentioned by Pavel
- Added error checks

Kanchan Joshi (1):
  fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path

Selvakumar S (1):
  io_uring: add support for zone-append

 fs/block_dev.c          | 28 ++++++++++++++++++++++++----
 fs/io_uring.c           | 32 ++++++++++++++++++++++++++++++--
 include/linux/fs.h      |  9 +++++++++
 include/uapi/linux/fs.h |  5 ++++-
 4 files changed, 67 insertions(+), 7 deletions(-)

-- 
2.7.4
