IO-Uring Archive on lore.kernel.org
 help / color / Atom feed
From: "hch@infradead.org" <hch@infradead.org>
To: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: "hch@infradead.org" <hch@infradead.org>,
	Kanchan Joshi <joshiiitr@gmail.com>, Jens Axboe <axboe@kernel.dk>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Kanchan Joshi <joshi.k@samsung.com>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"bcrl@kvack.org" <bcrl@kvack.org>,
	Matthew Wilcox <willy@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-aio@kvack.org" <linux-aio@kvack.org>,
	"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	SelvaKumar S <selvakuma.s1@samsung.com>,
	Nitesh Shetty <nj.shetty@samsung.com>,
	Javier Gonzalez <javier.gonz@samsung.com>,
	Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	Naohiro Aota <Naohiro.Aota@wdc.com>
Subject: Re: [PATCH v4 6/6] io_uring: add support for zone-append
Date: Fri, 31 Jul 2020 13:51:10 +0100
Message-ID: <20200731125110.GA11500@infradead.org> (raw)
In-Reply-To: <MWHPR04MB3758A4B2967DB1FABAAD9265E74E0@MWHPR04MB3758.namprd04.prod.outlook.com>

On Fri, Jul 31, 2020 at 10:16:49AM +0000, Damien Le Moal wrote:
> > 
> > Let's keep semantics and implementation separate.  For the case
> > where we report the actual offset we need a size imitation and no
> > short writes.
> 
> OK. So the name of the flag confused me. The flag name should reflect "Do zone
> append and report written offset", right ?

Well, we already have O_APPEND, which is the equivalent to append to
the write pointer.  The only interesting addition is that we also want
to report where we wrote.  So I'd rather have RWF_REPORT_OFFSET or so.

> Just to clarify, here was my thinking for zonefs:
> 1) file open with O_APPEND/aio has RWF_APPEND: then it is OK to assume that the
> application did not set the aio offset since APPEND means offset==file size. In
> that case, do zone append and report back the written offset.

Yes.

> 2) file open without O_APPEND/aio does not have RWF_APPEND: the application
> specified an aio offset and we must respect it, write it that exact same order,
> so use regular writes.

Yes.

> 
> For regular file systems, with case (1) condition, the FS use whatever it wants
> for the implementation, and report back the written offset, which  will always
> be the file size at the time the aio was issued.

Yes. 

> 
> Your method with a new flag to switch between (1) and (2) is OK with me, but the
> "no short writes" may be difficult to achieve in a regular FS, no ? I do not
> think current FSes have such guarantees... Especially in the case of buffered
> async writes I think.

I'll have to check what Jens recently changed with io_uring, but
traditionally the only short write case for a normal file system is the
artifical INT_MAX limit for the write size.

> > Anything with those semantics can be implemented using Zone Append
> > trivially in zonefs, and we don't even need the exclusive lock in that
> > case.  But even without that flag anything that has an exclusive lock can
> > at least in theory be implemented using Zone Append, it just need
> > support for submitting another request from the I/O completion handler
> > of the first.  I just don't think it is worth it - with the exclusive
> > lock we do have access to the zone serialied so a normal write works
> > just fine.  Both for the sync and async case.
> 
> We did switch to have zonefs do append writes in the sync case, always. Hmmm...
> Not sure anymore it was such a good idea.

It might be a good idea as long as we have the heavy handed scheduler
locking.  But if we don't have that there doesn't seem to be much of
a reason for zone append.

> OK. Makes sense. That said, taking Naohiro's work on btrfs as an example, zone
> append is used for every data write, no matter if it is O_APPEND/RWF_APPEND or
> not. The size limitation for zone append writes is not needed at all by
> applications. Maximum extent size is aligned to the max append write size
> internally, and if the application issued a larger write, it loops over multiple
> extents, similarly to any regular write may do (if there is overwrite etc...).

True, we probably don't need the limit for a normal file system.

> For the regular FS case, my thinking on the semantic really was: if asked to do
> so, return the written offset for a RWF_APPEND aios. And I think that
> implementing that does not depend in any way on what the FS does internally.

Exactly.

> 
> But I think I am starting to see the picture you are drawing here:
> 1) Introduce a fcntl() to get "maximum size for atomic append writes"
> 2) Introduce an aio flag specifying "Do atomic append write and report written
> offset"

I think we just need the 'report written offset flag', in fact even for
zonefs with the right locking we can handle unlimited write sizes, just
with lower performance.  E.g.

1) check if the write size is larger than the zone append limit

if no:

 - take the shared lock  and issue a zone append, done

if yes:

 - take the exclusive per-inode (zone) lock and just issue either normal
   writes or zone append at your choice, relying on the lock to
   serialize other writers.  For the async case this means we need a
   lock than can be release in a different context than it was acquired,
   which is a little ugly but can be done.

> And the implementation is actually completely free to use zone append writes or
> regular writes regardless of the "Do atomic append write and report written
> offset" being used or not.

Yes, that was my point about keeping the semantics and implementation
separate.

  reply index

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20200724155244epcas5p2902f57e36e490ee8772da19aa9408cdc@epcas5p2.samsung.com>
2020-07-24 15:49 ` [PATCH v4 0/6] zone-append support in io-uring and aio Kanchan Joshi
     [not found]   ` <CGME20200724155258epcas5p1a75b926950a18cd1e6c8e7a047e6c589@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 1/6] fs: introduce FMODE_ZONE_APPEND and IOCB_ZONE_APPEND Kanchan Joshi
2020-07-24 16:34       ` Jens Axboe
2020-07-26 15:18       ` Christoph Hellwig
2020-07-28  1:49         ` Matthew Wilcox
2020-07-28  7:26           ` Christoph Hellwig
     [not found]   ` <CGME20200724155324epcas5p18e1d3b4402d1e4a8eca87d0b56a3fa9b@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 2/6] fs: change ki_complete interface to support 64bit ret2 Kanchan Joshi
2020-07-26 15:18       ` Christoph Hellwig
     [not found]   ` <CGME20200724155329epcas5p345ba6bad0b8fe18056bb4bcd26c10019@epcas5p3.samsung.com>
2020-07-24 15:49     ` [PATCH v4 3/6] uio: return status with iov truncation Kanchan Joshi
     [not found]   ` <CGME20200724155341epcas5p15bfc55927f2abb60f19784270fe8e377@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 4/6] block: add zone append handling for direct I/O path Kanchan Joshi
2020-07-26 15:19       ` Christoph Hellwig
     [not found]   ` <CGME20200724155346epcas5p2cfb383fe9904a45280c6145f4c13e1b4@epcas5p2.samsung.com>
2020-07-24 15:49     ` [PATCH v4 5/6] block: enable zone-append for iov_iter of bvec type Kanchan Joshi
2020-07-26 15:20       ` Christoph Hellwig
     [not found]   ` <CGME20200724155350epcas5p3b8f1d59eda7f8fbb38c828f692d42fd6@epcas5p3.samsung.com>
2020-07-24 15:49     ` [PATCH v4 6/6] io_uring: add support for zone-append Kanchan Joshi
2020-07-24 16:29       ` Jens Axboe
2020-07-27 19:16         ` Kanchan Joshi
2020-07-27 20:34           ` Jens Axboe
2020-07-30 16:08             ` Pavel Begunkov
2020-07-30 16:13               ` Jens Axboe
2020-07-30 16:26                 ` Pavel Begunkov
2020-07-30 17:16                   ` Jens Axboe
2020-07-30 17:38                     ` Pavel Begunkov
2020-07-30 17:51                       ` Kanchan Joshi
2020-07-30 17:54                         ` Jens Axboe
2020-07-30 18:25                           ` Kanchan Joshi
2020-07-31  6:42                             ` Damien Le Moal
2020-07-31  6:45                               ` hch
2020-07-31  6:59                                 ` Damien Le Moal
2020-07-31  7:58                                   ` Kanchan Joshi
2020-07-31  8:14                                     ` Damien Le Moal
2020-07-31  9:14                                       ` hch
2020-07-31  9:34                                         ` Damien Le Moal
2020-07-31  9:41                                           ` hch
2020-07-31 10:16                                             ` Damien Le Moal
2020-07-31 12:51                                               ` hch [this message]
2020-07-31 13:08                                                 ` hch
2020-07-31 15:07                                                   ` Kanchan Joshi
2020-08-05  7:35                                                 ` Damien Le Moal
2020-07-31  9:38                                       ` Kanchan Joshi
2020-07-31  7:08                               ` Kanchan Joshi
2020-07-30 15:57       ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200731125110.GA11500@infradead.org \
    --to=hch@infradead.org \
    --cc=Damien.LeMoal@wdc.com \
    --cc=Johannes.Thumshirn@wdc.com \
    --cc=Naohiro.Aota@wdc.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bcrl@kvack.org \
    --cc=io-uring@vger.kernel.org \
    --cc=javier.gonz@samsung.com \
    --cc=joshi.k@samsung.com \
    --cc=joshiiitr@gmail.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nj.shetty@samsung.com \
    --cc=selvakuma.s1@samsung.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

IO-Uring Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/io-uring/0 io-uring/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 io-uring io-uring/ https://lore.kernel.org/io-uring \
		io-uring@vger.kernel.org
	public-inbox-index io-uring

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.io-uring


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git