IO-Uring Archive on lore.kernel.org
 help / color / Atom feed
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Kanchan Joshi <joshiiitr@gmail.com>,
	"hch@infradead.org" <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Kanchan Joshi <joshi.k@samsung.com>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"bcrl@kvack.org" <bcrl@kvack.org>,
	Matthew Wilcox <willy@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-aio@kvack.org" <linux-aio@kvack.org>,
	"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	SelvaKumar S <selvakuma.s1@samsung.com>,
	Nitesh Shetty <nj.shetty@samsung.com>,
	Javier Gonzalez <javier.gonz@samsung.com>,
	Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	Naohiro Aota <Naohiro.Aota@wdc.com>
Subject: Re: [PATCH v4 6/6] io_uring: add support for zone-append
Date: Fri, 25 Sep 2020 02:52:47 +0000
Message-ID: <MWHPR04MB3758A78AFAED3543F8D38266E7360@MWHPR04MB3758.namprd04.prod.outlook.com> (raw)
In-Reply-To: <CA+1E3r+MSEW=-SL8L+pquq+cFAu+nQOULQ+HZoQsCvdjKMkrNw@mail.gmail.com>

On 2020/09/25 2:20, Kanchan Joshi wrote:
> On Tue, Sep 8, 2020 at 8:48 PM hch@infradead.org <hch@infradead.org> wrote:
>>
>> On Mon, Sep 07, 2020 at 12:31:42PM +0530, Kanchan Joshi wrote:
>>> But there are use-cases which benefit from supporting zone-append on
>>> raw block-dev path.
>>> Certain user-space log-structured/cow FS/DB will use the device that
>>> way. Aerospike is one example.
>>> Pass-through is synchronous, and we lose the ability to use io-uring.
>>
>> So use zonefs, which is designed exactly for that use case.
> 
> Not specific to zone-append, but in general it may not be good to lock
> new features/interfaces to ZoneFS alone, given that direct-block
> interface has its own merits.
> Mapping one file to a one zone is good for some use-cases, but
> limiting for others.
> Some user-space FS/DBs would be more efficient (less meta, indirection)
> with the freedom to decide file-to-zone mapping/placement.

There is no metadata in zonefs. One file == one zone and the mapping between
zonefs files and zones is static, determined at mount time simply using report
zones. Zonefs files cannot be renamed nor deleted in anyway. Choosing a zonefs
file *is* the same as choosing a zone. Zonfes is *not* a POSIX file system doing
dynamic block allocation to files. The backing storage of files in zonefs is
static and fixed to the zone they represent. The difference between zonefs vs
raw zoned block device is the API that has to be used by the application, that
is, file descriptor representing the entire disk for raw disk vs file descriptor
representing one zone in zonefs. Note that the later has *a lot* of advantages
over the former: enables O_APPEND use, protects against bugs with user write
offsets mistakes, adds consistency of cached data against zone resets, and more.

> - Rocksdb and those LSM style DBs would map SSTable to zone, but
> SSTable file may be two small (initially) and may become too large
> (after compaction) for a zone.

You are contradicting yourself here. If a SSTable is mapped to a zone, then its
size cannot exceed the zone capacity, regardless of the interface used to access
the zones. And except for L0 tables which can be smaller (and are in memory
anyway), all levels tables have the same maximum size, which for zoned drives
must be the zone capacity. In any case, solving any problem in this area does
not depend in any way on zonefs vs raw disk interface. The implementation will
differ depending on the chosen interface, but what needs to be done to map
SSTables to zones is the same in both cases.

> - The internal parallelism of a single zone is a design-choice, and
> depends on the drive. Writing multiple zones parallely (striped/raid
> way) can give better performance than writing on one. In that case one
> would want to file that seamlessly combines multiple-zones in a
> striped fashion.

Then write a FS for that... Or have a library do it in user space. For the
library case, the implementation will differ for zonefs vs raw disk due to the
different API (regular file vs block devicer file), but the principles to follow
for stripping zones into a single storage object remain the same.

> Also it seems difficult (compared to block dev) to fit simple-copy TP
> in ZoneFS. The new
> command needs: one NVMe drive, list of source LBAs and one destination
> LBA. In ZoneFS, we would deal with N+1 file-descriptors (N source zone
> file, and one destination zone file) for that. While with block
> interface, we do not need  more than one file-descriptor representing
> the entire device. With more zone-files, we face open/close overhead too.

Are you expecting simple-copy to allow requests that are not zone aligned ? I do
not think that will ever happen. Otherwise, the gotcha cases for it would be far
too numerous. Simple-copy is essentially an optimized regular write command.
Similarly to that command, it will not allow copies over zone boundaries and
will need the destination LBA to be aligned to the destination zone WP. I have
not checked the TP though and given the NVMe NDA, I will stop the discussion here.

filesend() could be used as the interface for simple-copy. Implementing that in
zonefs would not be that hard. What is your plan for simple-copy interface for
raw block device ? An  ioctl ? filesend() too ? As as with any other user level
API, we should not be restricted to a particular device type if we can avoid it,
so in-kernel emulation of the feature is needed for devices that do not have
simple-copy or scsi extended copy. filesend() seems to me like the best choice
since all of that is already implemented there.

As for the open()/close() overhead for zonefs, may be some use cases may suffer
from it, but my tests with LevelDB+zonefs did not show any significant
difference. zonefs open()/close() operations are way faster than for a regular
file system since there is no metadata and all inodes always exist in-memory.
And zonefs() now supports MAR/MOR limits for O_WRONLY open(). That can simplify
things for the user.


-- 
Damien Le Moal
Western Digital Research

  reply index

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20200724155244epcas5p2902f57e36e490ee8772da19aa9408cdc@epcas5p2.samsung.com>
2020-07-24 15:49 ` [PATCH v4 0/6] zone-append support in io-uring and aio Kanchan Joshi
     [not found]   ` <CGME20200724155258epcas5p1a75b926950a18cd1e6c8e7a047e6c589@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 1/6] fs: introduce FMODE_ZONE_APPEND and IOCB_ZONE_APPEND Kanchan Joshi
2020-07-24 16:34       ` Jens Axboe
2020-07-26 15:18       ` Christoph Hellwig
2020-07-28  1:49         ` Matthew Wilcox
2020-07-28  7:26           ` Christoph Hellwig
     [not found]   ` <CGME20200724155324epcas5p18e1d3b4402d1e4a8eca87d0b56a3fa9b@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 2/6] fs: change ki_complete interface to support 64bit ret2 Kanchan Joshi
2020-07-26 15:18       ` Christoph Hellwig
     [not found]   ` <CGME20200724155329epcas5p345ba6bad0b8fe18056bb4bcd26c10019@epcas5p3.samsung.com>
2020-07-24 15:49     ` [PATCH v4 3/6] uio: return status with iov truncation Kanchan Joshi
     [not found]   ` <CGME20200724155341epcas5p15bfc55927f2abb60f19784270fe8e377@epcas5p1.samsung.com>
2020-07-24 15:49     ` [PATCH v4 4/6] block: add zone append handling for direct I/O path Kanchan Joshi
2020-07-26 15:19       ` Christoph Hellwig
     [not found]   ` <CGME20200724155346epcas5p2cfb383fe9904a45280c6145f4c13e1b4@epcas5p2.samsung.com>
2020-07-24 15:49     ` [PATCH v4 5/6] block: enable zone-append for iov_iter of bvec type Kanchan Joshi
2020-07-26 15:20       ` Christoph Hellwig
     [not found]   ` <CGME20200724155350epcas5p3b8f1d59eda7f8fbb38c828f692d42fd6@epcas5p3.samsung.com>
2020-07-24 15:49     ` [PATCH v4 6/6] io_uring: add support for zone-append Kanchan Joshi
2020-07-24 16:29       ` Jens Axboe
2020-07-27 19:16         ` Kanchan Joshi
2020-07-27 20:34           ` Jens Axboe
2020-07-30 16:08             ` Pavel Begunkov
2020-07-30 16:13               ` Jens Axboe
2020-07-30 16:26                 ` Pavel Begunkov
2020-07-30 17:16                   ` Jens Axboe
2020-07-30 17:38                     ` Pavel Begunkov
2020-07-30 17:51                       ` Kanchan Joshi
2020-07-30 17:54                         ` Jens Axboe
2020-07-30 18:25                           ` Kanchan Joshi
2020-07-31  6:42                             ` Damien Le Moal
2020-07-31  6:45                               ` hch
2020-07-31  6:59                                 ` Damien Le Moal
2020-07-31  7:58                                   ` Kanchan Joshi
2020-07-31  8:14                                     ` Damien Le Moal
2020-07-31  9:14                                       ` hch
2020-07-31  9:34                                         ` Damien Le Moal
2020-07-31  9:41                                           ` hch
2020-07-31 10:16                                             ` Damien Le Moal
2020-07-31 12:51                                               ` hch
2020-07-31 13:08                                                 ` hch
2020-07-31 15:07                                                   ` Kanchan Joshi
2020-08-05  7:35                                                 ` Damien Le Moal
2020-08-14  8:14                                                   ` hch
2020-08-14  8:27                                                     ` Damien Le Moal
2020-08-14 12:04                                                       ` hch
2020-08-14 12:20                                                         ` Damien Le Moal
2020-09-07  7:01                                                     ` Kanchan Joshi
2020-09-08 15:18                                                       ` hch
2020-09-24 17:19                                                         ` Kanchan Joshi
2020-09-25  2:52                                                           ` Damien Le Moal [this message]
2020-09-28 18:58                                                             ` Kanchan Joshi
2020-09-29  1:24                                                               ` Damien Le Moal
2020-09-29 18:49                                                                 ` Kanchan Joshi
2020-07-31  9:38                                       ` Kanchan Joshi
2020-07-31  7:08                               ` Kanchan Joshi
2020-07-30 15:57       ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MWHPR04MB3758A78AFAED3543F8D38266E7360@MWHPR04MB3758.namprd04.prod.outlook.com \
    --to=damien.lemoal@wdc.com \
    --cc=Johannes.Thumshirn@wdc.com \
    --cc=Naohiro.Aota@wdc.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bcrl@kvack.org \
    --cc=hch@infradead.org \
    --cc=io-uring@vger.kernel.org \
    --cc=javier.gonz@samsung.com \
    --cc=joshi.k@samsung.com \
    --cc=joshiiitr@gmail.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nj.shetty@samsung.com \
    --cc=selvakuma.s1@samsung.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

IO-Uring Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/io-uring/0 io-uring/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 io-uring io-uring/ https://lore.kernel.org/io-uring \
		io-uring@vger.kernel.org
	public-inbox-index io-uring

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.io-uring


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git