IO-Uring Archive on lore.kernel.org
From: "javier.gonz@samsung.com" <javier@javigon.com>
To: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: Kanchan Joshi <joshi.k@samsung.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"bcrl@kvack.org" <bcrl@kvack.org>,
	"asml.silence@gmail.com" <asml.silence@gmail.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"mb@lightnvm.io" <mb@lightnvm.io>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-aio@kvack.org" <linux-aio@kvack.org>,
	"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"selvakuma.s1@samsung.com" <selvakuma.s1@samsung.com>,
	"nj.shetty@samsung.com" <nj.shetty@samsung.com>
Subject: Re: [PATCH v2 0/2] zone-append support in io-uring and aio
Date: Fri, 26 Jun 2020 09:03:45 +0200
Message-ID: <20200626070345.vuxic46l3agy3jay@mpHalley.localdomain>
In-Reply-To: <CY4PR04MB375154780F0B8073AB83DA9CE7930@CY4PR04MB3751.namprd04.prod.outlook.com>

On 26.06.2020 06:56, Damien Le Moal wrote:
>On 2020/06/26 15:37, javier.gonz@samsung.com wrote:
>> On 26.06.2020 03:11, Damien Le Moal wrote:
>>> On 2020/06/26 2:18, Kanchan Joshi wrote:
>>>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>>>
>>>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>>>> Purpose is to provide zone-append consumption ability to applications which are
>>>> using zoned-block-device directly.
>>>>
>>>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>>>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>>>> aio, and pwritev2. An error is reported if zone-append is requested using
>>>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>>>> other sync write API for reasons described later.
>>>>
>>>> Zone-append completion result --->
>>>> With zone-append, where the write took place can only be known after completion.
>>>> So apart from the usual return value of write, an additional means is needed to
>>>> obtain the actual written location.
>>>>
>>>> In aio, this is returned to application using res2 field of io_event -
>>>>
>>>> struct io_event {
>>>>         __u64           data;           /* the data field from the iocb */
>>>>         __u64           obj;            /* what iocb this event came from */
>>>>         __s64           res;            /* result code for this event */
>>>>         __s64           res2;           /* secondary result */
>>>> };
>>>>
>>>> In io-uring, cqe->flags is repurposed for zone-append result.
>>>>
>>>> struct io_uring_cqe {
>>>>         __u64   user_data;      /* sqe->data submission passed back */
>>>>         __s32   res;            /* result code for this event */
>>>>         __u32   flags;
>>>> };
>>>>
>>>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
>>>> in sector/512b units. This can cover zone-size represented by chunk_sectors.
>>>> Applications will have to combine this with the zone start to obtain the
>>>> disk-relative offset. But if more bits were obtained by pulling them from the
>>>> res field, that too would compel applications to interpret res differently,
>>>> which seems more painstaking than the former option.
>>>> To keep uniformity, even with aio, zone-relative offset is returned.
>>>
>>> I am really not a fan of this, to say the least. The input is byte offset, the
>>> output is 512B relative sector count... Arg... We really cannot do better than
>>> that ?
>>>
>>> At the very least, byte relative offset ? The main reason is that this is
>>> _somewhat_ acceptable for raw block device accesses since the "sector"
>>> abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>>> support, we really will want to have byte unit as the interface is regular
>>> files, not block device file. We could argue that 512B sector unit is still
>>> around even for files (e.g. block counts in file stat). But the different unit
>>> for input and output of one operation is really ugly. This is not nice for the user.
>>>
>>
>> You can refer to the discussion with Jens, Pavel and Alex on the uring
>> interface. With the bits we have and considering the maximum zone size
>> supported, there is no space for a byte relative offset. We can take
>> some bits from cqe->res, but we were afraid this is not very
>> future-proof. Do you have a better idea?
>
>If you can take 8 bits, that gives you 40 bits, enough to support byte relative
>offsets for any zone size defined as a number of 512B sectors using an unsigned
>int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am
>already too tired and my math is failing me...

Yes, the math is correct. I was thinking more of the bits being needed
for other use cases that could collide with append. We considered this
and discarded it for being messy - when Pavel brought up the 512B
alignment we saw it as a good alternative.
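
As a rough illustration of the arithmetic being confirmed here, the sketch below splits a zone-relative byte offset into a 32-bit part (the size of cqe->flags) plus 8 extra bits. The struct and function names are hypothetical and nothing like this exists in the patchset; where the extra 8 bits would actually live is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; this only illustrates the bit arithmetic. */
#define ZA_SECTOR_SHIFT   9    /* 512-byte sectors */
#define ZA_MAX_ZONE_SHIFT 31   /* largest zone size quoted above: 2^31 sectors */
#define ZA_OFFSET_BITS    (ZA_MAX_ZONE_SHIFT + ZA_SECTOR_SHIFT)

struct za_split {
	uint32_t lo32;  /* would fit in the 32-bit cqe->flags */
	uint8_t  hi8;   /* the 8 additional bits under discussion */
};

/* Split a zone-relative byte offset into 32 + 8 bits. */
static inline struct za_split za_pack(uint64_t byte_off)
{
	struct za_split s;

	/* Any offset inside a zone of at most 2^31 sectors fits in 40 bits. */
	assert(byte_off < (1ULL << ZA_OFFSET_BITS));
	s.lo32 = (uint32_t)(byte_off & 0xffffffffULL);
	s.hi8  = (uint8_t)(byte_off >> 32);
	return s;
}

/* Recombine the two parts into the original byte offset. */
static inline uint64_t za_unpack(struct za_split s)
{
	return ((uint64_t)s.hi8 << 32) | s.lo32;
}
```

With these definitions ZA_OFFSET_BITS evaluates to 40, and za_unpack(za_pack(x)) returns x for any x below 2^40.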

Note also that we could translate to a byte offset in iouring.h, so the
user would not need to think about this.
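
A minimal sketch of what such a translation might look like, assuming the completion carries a zone-relative offset in 512B sectors and the application already knows the zone start in bytes. The helper name and its parameters are hypothetical and are not part of any existing header.

```c
#include <stdint.h>

#define ZA_SECTOR_SHIFT 9  /* 512-byte sectors */

/*
 * Hypothetical helper: convert the zone-relative sector offset reported
 * at completion into a disk-relative byte offset.
 *
 * zone_start_bytes: byte offset of the zone start (known to the caller)
 * zone_rel_sectors: value reported at completion, in 512B sectors
 */
static inline uint64_t za_cqe_to_byte_offset(uint64_t zone_start_bytes,
					     uint32_t zone_rel_sectors)
{
	/* Widen before shifting so large sector counts do not overflow 32 bits. */
	return zone_start_bytes + ((uint64_t)zone_rel_sectors << ZA_SECTOR_SHIFT);
}
```

An application would call this with the start of the zone it appended to and the value read from the completion, and get back a byte offset in the same unit as the one it used for submission.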

I do not feel strongly about this, so whichever option better fits the
current and near-future needs of uring is the one we will send in V3. We
will wait until next week for others to comment too.

>
>zone size is defined by chunk_sectors, which is used for raid and software raids
>too. This has been an unsigned int forever. I do not see the need for changing
>this to a 64bit anytime soon, if ever. A raid with a stripe size larger than 1TB
>does not really make any sense. Same for zone size...

Yes. I think max zone sizes are already pretty huge. This might change,
but we will deal with it when it happens.

[...]

Javier


Thread overview: 20+ messages
     [not found] <CGME20200625171829epcas5p268486a0780571edb4999fc7b3caab602@epcas5p2.samsung.com>
2020-06-25 17:15 ` Kanchan Joshi
     [not found]   ` <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com>
2020-06-25 17:15     ` [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path Kanchan Joshi
2020-06-26  2:50       ` Damien Le Moal
2020-06-29 18:32         ` Kanchan Joshi
2020-06-30  0:37           ` Damien Le Moal
2020-06-30  7:40             ` Kanchan Joshi
2020-06-30  7:52               ` Damien Le Moal
2020-06-30  7:56                 ` Damien Le Moal
2020-06-30  8:16                   ` Kanchan Joshi
2020-06-26  8:58       ` Christoph Hellwig
2020-06-26 21:15         ` Kanchan Joshi
2020-06-27  6:51           ` Christoph Hellwig
     [not found]   ` <CGME20200625171838epcas5p449183e12770187142d8d55a9bf422a8d@epcas5p4.samsung.com>
2020-06-25 17:15     ` [PATCH v2 2/2] io_uring: add support for zone-append Kanchan Joshi
2020-06-25 19:40       ` Pavel Begunkov
2020-06-26  3:11   ` [PATCH v2 0/2] zone-append support in io-uring and aio Damien Le Moal
2020-06-26  6:37     ` javier.gonz
2020-06-26  6:56       ` Damien Le Moal
2020-06-26  7:03         ` javier.gonz@samsung.com [this message]
2020-06-26 22:15     ` Kanchan Joshi
2020-06-30 12:46   ` Matthew Wilcox
