IO-Uring Archive on lore.kernel.org
From: Matthew Wilcox <willy@infradead.org>
To: Kanchan Joshi <joshi.k@samsung.com>
Cc: axboe@kernel.dk, viro@zeniv.linux.org.uk, bcrl@kvack.org,
	asml.silence@gmail.com, Damien.LeMoal@wdc.com, hch@infradead.org,
	linux-fsdevel@vger.kernel.org, mb@lightnvm.io,
	linux-kernel@vger.kernel.org, linux-aio@kvack.org,
	io-uring@vger.kernel.org, linux-block@vger.kernel.org,
	selvakuma.s1@samsung.com, nj.shetty@samsung.com,
	javier.gonz@samsung.com
Subject: Re: [PATCH v2 0/2] zone-append support in io-uring and aio
Date: Tue, 30 Jun 2020 13:46:41 +0100
Message-ID: <20200630124641.GN25523@casper.infradead.org>
In-Reply-To: <1593105349-19270-1-git-send-email-joshi.k@samsung.com>

On Thu, Jun 25, 2020 at 10:45:47PM +0530, Kanchan Joshi wrote:
> Zone-append completion result --->
> With zone-append, where the write took place can be known only after completion.
> So apart from the usual return value of write, an additional means is needed to
> obtain the actual written location.
> 
> In aio, this is returned to the application using the res2 field of io_event -
> 
> struct io_event {
>         __u64           data;           /* the data field from the iocb */
>         __u64           obj;            /* what iocb this event came from */
>         __s64           res;            /* result code for this event */
>         __s64           res2;           /* secondary result */
> };

Ah, now I understand.  I think you're being a little too specific by
calling this zone-append.  This is really a "write-anywhere" operation,
and the specified address is only a hint.

> In io-uring, cqe->flags is repurposed for zone-append result.
> 
> struct io_uring_cqe {
>         __u64   user_data;      /* sqe->data submission passed back */
>         __s32   res;            /* result code for this event */
>         __u32   flags;
> };
> 
> Since 32-bit flags are not sufficient, we chose to return the zone-relative offset
> in sector/512b units. This can cover the zone size represented by chunk_sectors.
> Applications will have to combine this with the zone start to obtain the
> disk-relative offset. But if more bits were obtained by pulling from the res field,
> that too would compel applications to interpret the res field differently, and it
> seems more painstaking than the former option.
> To keep uniformity, the zone-relative offset is returned with aio as well.

Urgh, no, that's dreadful.  I'm not familiar with the io_uring code.
Maybe the first 8 bytes of the user_data could be required to be the
result offset for this submission type?
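
For reference, the arithmetic that the quoted scheme pushes onto applications
looks roughly like the sketch below. It assumes the zone start is known to the
application as a byte offset and that the completion value is a zone-relative
offset in 512-byte sectors, as the proposal describes. The function name
zone_append_disk_offset() and the SECTOR_SHIFT macro are illustrative, not
part of any existing interface.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the proposal */

/*
 * Combine the zone start (in bytes) with the zone-relative offset
 * returned on completion (in 512-byte sectors) to get the
 * disk-relative byte offset of the written data.
 */
static uint64_t zone_append_disk_offset(uint64_t zone_start_bytes,
					uint32_t zone_rel_sectors)
{
	/* A zone start is expected to be sector aligned. */
	assert((zone_start_bytes & ((1u << SECTOR_SHIFT) - 1)) == 0);

	/* Widen before shifting so the 32-bit value cannot overflow. */
	return zone_start_bytes + ((uint64_t)zone_rel_sectors << SECTOR_SHIFT);
}
```

With a zone starting at 256 MiB and a returned value of 8, the data landed
4096 bytes into that zone.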

> Block IO vs File IO --->
> For now, the user zone-append interface is supported only for zoned block devices.
> Regular files/block devices are not supported. A regular file system (e.g. F2FS)
> will not need this anyway, because zone peculiarities are abstracted within the FS.
> At this point, ZoneFS also prefers to use append implicitly rather than explicitly.
> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
> allowing only block devices should be changed.

But we also have O_APPEND files.  And maybe we'll have other kinds of file
in future for which this would make sense.

> Semantics --->
> Zone-append, by its nature, may perform the write at a different location than what
> was specified. It does not fit into POSIX, and trying to fit may just undermine

... I disagree that it doesn't fit into POSIX.  As I said above, O_APPEND
is a POSIX concept, so POSIX already understands that writes may not end
up at the current write pointer.
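
A small sketch of the O_APPEND behaviour referred to here, using only standard
POSIX calls. The helper name append_and_locate() is made up for illustration.
It moves the file offset to 0, writes, and then works out where the data
actually went. It assumes a single writer; with concurrent appenders the
lseek() after the write only tells you where this descriptor's offset ended up.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to a descriptor opened with O_APPEND, after
 * deliberately seeking to offset 0. Returns the offset at which the
 * data landed, or -1 on error.
 */
static off_t append_and_locate(int fd, const void *buf, size_t len)
{
	ssize_t n;
	off_t end;

	assert(buf != NULL);

	/* Point the file offset at the start; O_APPEND overrides it. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;

	n = write(fd, buf, len);
	if (n < 0)
		return -1;

	/* After the write, the offset sits just past the written data. */
	end = lseek(fd, 0, SEEK_CUR);
	if (end < 0)
		return -1;

	return end - n;
}
```

On a file that already holds 4 bytes, the second write lands at offset 4 even
though the offset was reset to 0 just before it.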

> its benefit. It may be better to keep semantics as close to zone-append as
> possible, i.e. specify the zone-start location and obtain the actual write location
> after completion. Towards that goal, existing async APIs seem to fit fine.
> Async APIs (uring, linux aio) do not work on an implicit write pointer and demand
> an explicit write offset (which is what we need for append). The write pointer is
> neither taken as input nor updated on completion. And there is a clear way to
> get the zone-append result. Zone-aware applications using these async APIs
> can be fine with, for lack of a better word, zone-append semantics itself.
> 
> Sync APIs work with an implicit write pointer (at least a few of them), and there is
> no way to obtain the zone-append result, making it hard for user-space zone-append.
