IO-Uring Archive on lore.kernel.org
From: "javier.gonz@samsung.com" <javier.gonz@samsung.com>
To: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: Kanchan Joshi <joshi.k@samsung.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"bcrl@kvack.org" <bcrl@kvack.org>,
	"asml.silence@gmail.com" <asml.silence@gmail.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"mb@lightnvm.io" <mb@lightnvm.io>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-aio@kvack.org" <linux-aio@kvack.org>,
	"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"selvakuma.s1@samsung.com" <selvakuma.s1@samsung.com>,
	"nj.shetty@samsung.com" <nj.shetty@samsung.com>
Subject: Re: [PATCH v2 0/2] zone-append support in io-uring and aio
Date: Fri, 26 Jun 2020 08:37:17 +0200
Message-ID: <20200626063717.4dhsydpcnezjhj3o@mpHalley.localdomain> (raw)
In-Reply-To: <CY4PR04MB37511E3B19035012A143D006E7930@CY4PR04MB3751.namprd04.prod.outlook.com>



On 26.06.2020 03:11, Damien Le Moal wrote:
>On 2020/06/26 2:18, Kanchan Joshi wrote:
>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>
>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>> Purpose is to provide zone-append consumption ability to applications which are
>> using zoned-block-device directly.
>>
>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>> aio, and pwritev2. An error is reported if zone-append is requested using
>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>> other sync write API for reasons described later.
>>
>> Zone-append completion result --->
>> With zone-append, where the write took place can only be known after completion.
>> So apart from the usual return value of write, an additional means is needed to
>> obtain the actual written location.
>>
>> In aio, this is returned to application using res2 field of io_event -
>>
>> struct io_event {
>>         __u64           data;           /* the data field from the iocb */
>>         __u64           obj;            /* what iocb this event came from */
>>         __s64           res;            /* result code for this event */
>>         __s64           res2;           /* secondary result */
>> };
>>
>> In io-uring, cqe->flags is repurposed for zone-append result.
>>
>> struct io_uring_cqe {
>>         __u64   user_data;      /* sqe->data submission passed back */
>>         __s32   res;            /* result code for this event */
>>         __u32   flags;
>> };
>>
>> Since the 32-bit flags field is not sufficient, we choose to return the
>> zone-relative offset in sector/512b units. This can cover the zone size
>> represented by chunk_sectors. Applications will have to combine this with the
>> zone start to get the disk-relative offset. If more bits were obtained by
>> pulling from the res field, that too would compel applications to interpret
>> the res field differently, which seems more painstaking than the former option.
>> To keep uniformity, the zone-relative offset is returned with aio as well.
>
>I am really not a fan of this, to say the least. The input is byte offset, the
>output is 512B relative sector count... Arg... We really cannot do better than
>that ?
>
>At the very least, byte relative offset ? The main reason is that this is
>_somewhat_ acceptable for raw block device accesses since the "sector"
>abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>support, we really will want to have byte unit as the interface is regular
>files, not block device file. We could argue that 512B sector unit is still
>around even for files (e.g. block counts in file stat). But the different unit
>for input and output of one operation is really ugly. This is not nice for the user.
>

You can refer to the discussion with Jens, Pavel and Alex on the uring
interface. With the bits we have and considering the maximum zone size
supported, there is no space for a byte relative offset. We can take
some bits from cqe->res, but we were afraid this is not very
future-proof. Do you have a better idea?
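
To make the bit budget concrete, here is a minimal sketch of the
arithmetic. It assumes only what the cover letter states: a 32-bit
cqe->flags field and 512B sectors. The function names are made up for
illustration and are not part of the patchset.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors */

/*
 * Byte span addressable by an offset field that is `bits` wide and
 * counts in units of (1 << unit_shift) bytes.
 */
static uint64_t addressable_bytes(unsigned int bits, unsigned int unit_shift)
{
	assert(bits + unit_shift < 64);
	return (uint64_t)1 << (bits + unit_shift);
}

/*
 * Application-side conversion under the v2 proposal: combine the
 * zone-relative sector count with the zone start (known to the
 * application, e.g. from a zone report) to get a disk-relative
 * byte offset.
 */
static uint64_t zone_rel_sectors_to_disk_bytes(uint64_t zone_start_bytes,
					       uint32_t rel_sectors)
{
	return zone_start_bytes + ((uint64_t)rel_sectors << SECTOR_SHIFT);
}
```

With 32 bits, a byte offset covers 4 GiB of zone, while a 512B sector
count covers 2 TiB. Borrowing 8 more bits would give a 40-bit byte
offset, i.e. 1 TiB.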


>>
>> Append using io_uring fixed-buffer --->
>> This is flagged as not-supported at the moment. Reason being, for fixed-buffer
>> io-uring sends iov_iter of bvec type. But current append-infra in block-layer
>> does not support such iov_iter.
>>
>> Block IO vs File IO --->
>> For now, the user zone-append interface is supported only for zoned-block-device.
>> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
>> will not need this anyway, because zone peculiarities are abstracted within FS.
>> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
>> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
>> allowing-only-block-device should be changed.
>
>Sure, but I think the interface is still a problem. I am not super happy about
>the 512B sector unit. Zonefs will be the only file system that will be impacted
>since other normal POSIX file systems will not have a zone append interface for
>users. So this is a limited problem. Still, even for raw block device file
>accesses, POSIX system calls use byte units everywhere. Let's try to use that.
>
>For aio, it is easy since res2 is 64 bits wide. For io_uring, as discussed
>already, we can steal 8 bits from the cqe res. All you need is to add a small
>helper function in userspace iouring.h to simplify the work of the application
>to get that result.

Ok. See above. We can do this.
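
For reference, such a userspace helper could look roughly like the
sketch below. The bit layout is purely an assumption for illustration
(top 8 bits of res carry bits 32..39 of the byte offset, the low 24
bits carry the written length); nothing in this thread fixes a layout,
and all za_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* HYPOTHETICAL layout, not taken from the patchset. */
#define ZA_RES_OFF_SHIFT 24
#define ZA_RES_LEN_MASK  ((1u << ZA_RES_OFF_SHIFT) - 1)

struct za_result {
	uint32_t bytes_written;  /* low 24 bits of res */
	uint64_t zone_rel_bytes; /* 40-bit zone-relative byte offset */
};

/* Decode a completion: 8 bits borrowed from res plus 32 bits of flags. */
static struct za_result za_decode(int32_t res, uint32_t flags)
{
	uint32_t raw = (uint32_t)res;
	struct za_result r;

	r.bytes_written = raw & ZA_RES_LEN_MASK;
	r.zone_rel_bytes = ((uint64_t)(raw >> ZA_RES_OFF_SHIFT) << 32) | flags;
	return r;
}

/* Matching encoders, only here to show the round trip. */
static int32_t za_encode_res(uint32_t bytes_written, uint64_t zone_rel_bytes)
{
	assert(bytes_written <= ZA_RES_LEN_MASK);
	assert(zone_rel_bytes < ((uint64_t)1 << 40));
	return (int32_t)(((uint32_t)(zone_rel_bytes >> 32) << ZA_RES_OFF_SHIFT) |
			 bytes_written);
}

static uint32_t za_encode_flags(uint64_t zone_rel_bytes)
{
	return (uint32_t)zone_rel_bytes; /* low 32 bits of the offset */
}
```

Note that in this particular layout the sign bit of res is consumed, so
a negative errno can no longer be told apart from a large offset. That
is the kind of trade-off that makes borrowing bits from res a concern.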

Jens: Do you see this as a problem in the future?

[...]

Javier


Thread overview: 20+ messages
     [not found] <CGME20200625171829epcas5p268486a0780571edb4999fc7b3caab602@epcas5p2.samsung.com>
2020-06-25 17:15 ` Kanchan Joshi
     [not found]   ` <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com>
2020-06-25 17:15     ` [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path Kanchan Joshi
2020-06-26  2:50       ` Damien Le Moal
2020-06-29 18:32         ` Kanchan Joshi
2020-06-30  0:37           ` Damien Le Moal
2020-06-30  7:40             ` Kanchan Joshi
2020-06-30  7:52               ` Damien Le Moal
2020-06-30  7:56                 ` Damien Le Moal
2020-06-30  8:16                   ` Kanchan Joshi
2020-06-26  8:58       ` Christoph Hellwig
2020-06-26 21:15         ` Kanchan Joshi
2020-06-27  6:51           ` Christoph Hellwig
     [not found]   ` <CGME20200625171838epcas5p449183e12770187142d8d55a9bf422a8d@epcas5p4.samsung.com>
2020-06-25 17:15     ` [PATCH v2 2/2] io_uring: add support for zone-append Kanchan Joshi
2020-06-25 19:40       ` Pavel Begunkov
2020-06-26  3:11   ` [PATCH v2 0/2] zone-append support in io-uring and aio Damien Le Moal
2020-06-26  6:37     ` javier.gonz [this message]
2020-06-26  6:56       ` Damien Le Moal
2020-06-26  7:03         ` javier.gonz@samsung.com
2020-06-26 22:15     ` Kanchan Joshi
2020-06-30 12:46   ` Matthew Wilcox
