linux-nvme.lists.infradead.org archive mirror
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Heiner Litz <hlitz@ucsc.edu>, Keith Busch <kbusch@kernel.org>
Cc: "Jens Axboe" <axboe@kernel.dk>,
	"Niklas Cassel" <Niklas.Cassel@wdc.com>,
	"Javier González" <javier@javigon.com>,
	"Ajay Joshi" <Ajay.Joshi@wdc.com>,
	"Sagi Grimberg" <sagi@grimberg.me>,
	"Keith Busch" <Keith.Busch@wdc.com>,
	"Dmitry Fomichev" <Dmitry.Fomichev@wdc.com>,
	"Aravind Ramesh" <Aravind.Ramesh@wdc.com>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"Hans Holmberg" <Hans.Holmberg@wdc.com>,
	"Matias Bjørling" <mb@lightnvm.io>,
	"Judy Brock" <judy.brock@samsung.com>,
	"Christoph Hellwig" <hch@lst.de>,
	"Matias Bjorling" <Matias.Bjorling@wdc.com>
Subject: Re: [PATCH 5/5] nvme: support for zoned namespaces
Date: Fri, 19 Jun 2020 00:57:47 +0000
Message-ID: <CY4PR04MB37511C5698D3D2EEFD3115BCE7980@CY4PR04MB3751.namprd04.prod.outlook.com>
In-Reply-To: CAJbgVnVxtfs3m6HKJOQw4E1sqTQBmtF_P-D4aAZ5zsz4rQUXNA@mail.gmail.com

On 2020/06/19 7:05, Heiner Litz wrote:
> Matias, Keith,
> thanks, this all sounds good and it makes total sense to hide striping
> from the user.
> 
> In the end, the real problem really seems to be that ZNS effectively
> requires in-order IO delivery which the kernel cannot guarantee. I
> think fixing this problem in the ZNS specification instead of in the
> communication substrate (kernel) is problematic, especially as
> out-of-order delivery absolutely has no benefit in the case of ZNS.
> But I guess this has been discussed before..

From the device interface perspective, that is, from the ZNS specification's point
of view, only regular writes require in-order dispatching by the host. Zone
append write commands can be issued in any order and will succeed as long as
there are enough unwritten blocks in the target zone to fit the append request.
The drive can also process zone append commands in any order it sees
fit. So there is indeed no guarantee back to the host that zone append commands
will be executed in the same order as issued by the host.
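
To make the difference concrete, here is a minimal sketch using a toy in-memory
zone model. The Zone class, its methods and demo() are hypothetical names chosen
for illustration; they are not a kernel or NVMe API, and the model ignores zone
states, block sizes and errors other than "zone full".

```python
class Zone:
    """Toy model of one zone (hypothetical, for illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of blocks the zone can hold
        self.wp = 0               # write pointer: next unwritten block

    def write(self, lba, nblocks):
        # Regular write: must start exactly at the write pointer.
        if lba != self.wp:
            raise ValueError("write at %d does not match wp %d" % (lba, self.wp))
        if self.wp + nblocks > self.capacity:
            raise ValueError("zone full")
        self.wp += nblocks

    def append(self, nblocks):
        # Zone append: the device picks the location and reports it back.
        if self.wp + nblocks > self.capacity:
            raise ValueError("zone full")
        lba = self.wp
        self.wp += nblocks
        return lba


def demo():
    # Host issues request A (blocks 0-3) then request B (blocks 4-7),
    # but B reaches the device first.
    z = Zone(capacity=16)
    try:
        z.write(4, 4)             # B arrives first: wp is still 0
        reordered_write_ok = True
    except ValueError:
        reordered_write_ok = False

    # The same two requests as zone appends, processed B first then A:
    # both succeed, and each learns its location from the return value.
    z = Zone(capacity=16)
    lba_b = z.append(4)
    lba_a = z.append(4)
    return reordered_write_ok, lba_a, lba_b
```

In this model demo() returns (False, 4, 0): the reordered regular write is
rejected, while both appends succeed and B simply lands in front of A.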

That is from the interface perspective, for the protocol. Now the question that
I think you are after seems to be "does this work for the user?" The answer is
a simple "it depends on what the use case is". The device user is free to choose
between issuing regular writes or zone append writes. This choice heavily depends
on the answer to the question: "Can I tolerate out-of-order writes?" For a
file system, the answer is yes, since metadata is used to indicate the mapping
of file offsets to on-disk locations. It does not matter, functionally speaking,
if the file data blocks for increasing file offsets are out of order. That can
happen with any file system on any regular disk due to block
allocation/fragmentation today.
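
As a rough sketch of why the file system case works, assume the file system
records, for each completed append, a (file offset, length, device block) tuple.
The function names and the tuple layout below are hypothetical and do not
reflect any particular file system's on-disk format.

```python
def build_extent_map(completions):
    """Build a file-offset-ordered extent list from append completions.

    completions: iterable of (file_offset, length, device_lba) tuples, in
    whatever order the device completed them (assumed layout).
    """
    return sorted(completions)


def lookup(extents, file_offset):
    """Translate a file offset (in blocks) to a device block via the map."""
    for off, length, lba in extents:
        if off <= file_offset < off + length:
            return lba + (file_offset - off)
    raise KeyError(file_offset)


# File blocks 0-3, 4-7 and 8-11 were appended, but the device placed them
# at blocks 8, 0 and 4 respectively.
extents = build_extent_map([(4, 4, 0), (8, 4, 4), (0, 4, 8)])
first_block = lookup(extents, 0)   # file block 0 lives at device block 8
sixth_block = lookup(extents, 5)   # file block 5 lives at device block 1
```

The data is physically out of order with respect to file offsets, yet every
read resolves correctly because only the metadata defines the logical order.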

For an application using raw block device accesses without a file system, the
usability of zone append will heavily depend on the structure/format of the data
being written. A simple logging application where every write to a device stores
a single independent "record" will likely be fine with zone append. If the
application is writing something like a B-tree with dependencies between data
blocks pointing to each other, zone append may not be the best choice as the
final location on disk of a write is only approximately known when the write is
issued (i.e., one can only guarantee that it will land "somewhere" in a zone).
That, however, depends on how the application issues IO requests.
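
A minimal sketch of the logging case, under the assumption (mine, for
illustration) that each record carries its own sequence number so that a reader
can rebuild the logical order if it ever needs to. The helper names are
hypothetical.

```python
def append_records(records, completion_order):
    """Simulate zone append placement.

    The physical order in the zone follows the order in which the device
    completed the appends, not the order in which the host issued them.
    """
    return [records[i] for i in completion_order]


def recover(zone):
    """Scan the zone and return payloads in logical (sequence) order."""
    return [payload for _seq, payload in sorted(zone)]


# Three self-describing records issued as 0, 1, 2 but completed as 2, 0, 1.
records = [(0, "open"), (1, "update"), (2, "close")]
zone = append_records(records, completion_order=[2, 0, 1])
replayed = recover(zone)
```

Here the zone holds the records as close, open, update, and recover() still
returns them as open, update, close. A B-tree node that must embed the block
address of its child has no such freedom: that address is unknown until the
child's append has completed.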

Zone append is not a magic command solving all problems. But it certainly does
simplify a lot of things in the kernel IO stack (no need for strong ordering)
and also can simplify file system implementation (no need to control write
issuing order).

> 
> On Thu, Jun 18, 2020 at 2:19 PM Keith Busch <kbusch@kernel.org> wrote:
>>
>> On Thu, Jun 18, 2020 at 01:47:20PM -0700, Heiner Litz wrote:
>>> the striping explanation makes sense. In this case I will rephrase to: It
>>> is sufficient to support large enough un-splittable writes to achieve
>>> full per-zone bandwidth with a single writer/single QD.
>>
>> This is subject to the capabilities of the device and software's memory
>> constraints. The maximum DMA size for a single request an nvme device can
>> handle often ranges anywhere from 64k to 4MB. The pci nvme driver maxes out at
>> 4MB anyway because that's the most for which we can guarantee forward progress
>> right now, otherwise the scatter lists become too big to ensure we'll be able
>> to allocate one to dispatch a write command.
>>
>> We do report the size and the alignment constraints so that it won't get split,
>> but we still have to work with applications that don't abide by those
>> constraints.
>>
>>> My main point is: There is no fundamental reason for splitting up
>>> requests intermittently just to re-assemble them in the same form
>>> later.
> 


-- 
Damien Le Moal
Western Digital Research

