From: Dave Chinner <david@fromorbit.com>
To: "Javier González" <javier@javigon.com>
Cc: "Luis Chamberlain" <mcgrof@kernel.org>,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	lsf-pc@lists.linux-foundation.org,
	"Matias Bjørling" <Matias.Bjorling@wdc.com>,
	"Damien Le Moal" <Damien.LeMoal@wdc.com>,
	"Bart Van Assche" <bvanassche@acm.org>,
	"Adam Manzanares" <a.manzanares@samsung.com>,
	"Keith Busch" <Keith.Busch@wdc.com>,
	"Johannes Thumshirn" <Johannes.Thumshirn@wdc.com>,
	"Naohiro Aota" <Naohiro.Aota@wdc.com>,
	"Pankaj Raghav" <pankydev8@gmail.com>,
	"Kanchan Joshi" <joshi.k@samsung.com>,
	"Nitesh Shetty" <nj.shetty@samsung.com>
Subject: Re: [LSF/MM/BPF BoF] BoF for Zoned Storage
Date: Mon, 7 Mar 2022 18:12:29 +1100	[thread overview]
Message-ID: <20220307071229.GR3927073@dread.disaster.area> (raw)
In-Reply-To: <20220305073321.5apdknpmctcvo3qj@ArmHalley.localdomain>

On Sat, Mar 05, 2022 at 08:33:21AM +0100, Javier González wrote:
> On 04.03.2022 14:55, Luis Chamberlain wrote:
> > On Sat, Mar 05, 2022 at 09:42:57AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 04, 2022 at 02:10:08PM -0800, Luis Chamberlain wrote:
> > > > On Fri, Mar 04, 2022 at 11:10:22AM +1100, Dave Chinner wrote:
> > > > > On Wed, Mar 02, 2022 at 04:56:54PM -0800, Luis Chamberlain wrote:
> > > > > > Thinking proactively about LSFMM, regarding just Zone storage..
> > > > > >
> > > > > > I'd like to propose a BoF for Zoned Storage. The point of it is
> > > > > > to address the existing pain points we have and take advantage of
> > > > > > having folks in the room, where we can likely settle things that
> > > > > > would otherwise take years.
> > > > > >
> > > > > > I'll throw at least one topic out:
> > > > > >
> > > > > >   * Raw access for zone append for microbenchmarks:
> > > > > >   	- are we really happy with the status quo?
> > > > > > 	- if not what outlets do we have?
> > > > > >
> > > > > > I think the NVMe passthrough stuff deserves its own shared
> > > > > > discussion though, and should not be part of the BoF.
> > > > >
> > > > > Reading through the discussion on this thread, perhaps this session
> > > > > should be used to educate application developers about how to use
> > > > > ZoneFS so they never need to manage low level details of zone
> > > > > storage such as enumerating zones, controlling write pointers
> > > > > safely for concurrent IO, performing zone resets, etc.
> > > >
> > > > I'm not even sure users are really aware that, given the zone
> > > > capacity can differ from the zone size and btrfs uses the zone size
> > > > to compute its size, the reported size is a flat-out lie.
> > > 
> > > Sorry, I don't see how what btrfs does with zone management has
> > > anything to do with using ZoneFS to get direct, raw IO access to
> > > individual zones.
> > 
> > You are right about direct raw access. My point was that even for
> > filesystem design I don't think the communication on expectations is
> > clear. Similar computations need to be managed by the filesystem
> > design, for instance.
> 
> Dave,
> 
> I understand that you point to ZoneFS for this. It is true that it was
> presented at the time as the way to do raw zone access from
> user-space.
> 
> However, there are no users of ZoneFS for ZNS devices that I am aware
> of (maybe for SMR this is a different story).  The main open-source
> implementations out there for RocksDB that are being used in production
> (ZenFS and xZTL) rely on either raw zone block access or the generic
> char device in NVMe (/dev/ngXnY).

That's exactly the situation we want to avoid.

You're talking about accessing zoned storage by knowing directly how
the hardware works and interfacing directly with hardware-specific
device commands.

This is exactly what is wrong with this whole conversation - direct
access to hardware is fragile and very limiting, and the whole
purpose of having an operating system is to abstract the hardware
functionality into a generally usable API. That way, when something
new gets added to the hardware or something gets removed, the
applications don't break, even though they weren't written with that
sort of hardware functionality extension in mind.

I understand that RocksDB probably went direct to the hardware
because, at the time, it was the only choice the developers had for
making use of ZNS-based storage.

However, I also understand that there are *better options now* that
allow applications to target zone storage in a way that doesn't
expose them to the foibles of hardware support and storage protocol
specifications and characteristics.

The generic interface that the kernel provides for zoned storage is
called ZoneFS. Forget about the fact that it is a filesystem; all it
does is provide userspace with a named zone abstraction for a zoned
device: every zone is an append-only file.
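
In concrete terms, a zone-native application ends up looking like
plain file code. A minimal sketch (the mount point, I/O size and
alignment are assumptions for the example, and error handling is
trimmed):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* One sequential zone, exposed by ZoneFS as a file. */
        const char *zone = "/mnt/zonefs/seq/0";
        const size_t iosize = 65536;
        struct stat st;
        void *buf;
        int fd;

        /* Sequential zone files take direct, append-only writes. */
        fd = open(zone, O_WRONLY | O_APPEND | O_DIRECT);
        if (fd < 0)
            return 1;

        if (posix_memalign(&buf, 4096, iosize))
            return 1;
        memset(buf, 0xab, iosize);

        /* The write lands at the zone's current write pointer. */
        if (write(fd, buf, iosize) != (ssize_t)iosize)
            return 1;

        /* The file size tracks the write pointer position. */
        fstat(fd, &st);
        printf("write pointer at %lld bytes\n", (long long)st.st_size);

        /* Truncating to zero resets the zone. */
        ftruncate(fd, 0);

        free(buf);
        close(fd);
        return 0;
    }

No zone enumeration, no write pointer bookkeeping, no vendor-specific
reset commands - just open, append, stat and truncate.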

That's what I'm trying to get across here - this whole discussion
about zone capacity not matching zone size is a hardware/specification
detail that applications *do not need to know about* to use zone
storage. That's something that ZoneFS can and does hide from
applications completely - the zone files behave exactly the same
from the user perspective regardless of whether the hardware zone
capacity is the same as or less than the zone size.

Expanding access to the hardware and/or raw block devices so that
userspace applications can directly manage zone write pointers, zone
capacity/space limits, etc. is the wrong architectural direction to
be taking. The sort of *hardware quirks* being discussed in this
thread need to be managed by the kernel and hidden from userspace;
userspace shouldn't need to care about such weird and esoteric
hardware and storage protocol/specification/implementation
differences.
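
For contrast, this is roughly what "directly managing zone write
pointers" looks like against a raw zoned block device (a sketch only;
the device path is made up and error handling is trimmed) - every
application has to carry and maintain this sort of code itself:

    #include <fcntl.h>
    #include <linux/blkzoned.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        struct blk_zone_report *rep;
        struct blk_zone_range range;
        unsigned int nr = 4;
        int fd;

        fd = open("/dev/nvme0n2", O_RDWR);
        if (fd < 0)
            return 1;

        /* Userspace has to pull zone reports itself... */
        rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
        rep->nr_zones = nr;
        if (ioctl(fd, BLKREPORTZONE, rep) < 0)
            return 1;

        /* ...track write pointers and zone conditions by hand... */
        printf("zone 0: wp %llu cond %u\n",
               (unsigned long long)rep->zones[0].wp,
               (unsigned int)rep->zones[0].cond);

        /* ...and issue explicit zone resets when space is recycled. */
        range.sector = rep->zones[0].start;
        range.nr_sectors = rep->zones[0].len;
        if (ioctl(fd, BLKRESETZONE, &range) < 0)
            return 1;

        free(rep);
        close(fd);
        return 0;
    }

And that's before zone append, zone capacity, active/open zone limits
and per-device quirks enter the picture.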

IMO, while RocksDB is the technology leader for ZNS, it is not the
model that new applications should be trying to emulate. They should
be designed from the ground up to use ZoneFS instead of directly
accessing NVMe devices or trying to use the raw block devices for
zoned storage. Use the generic kernel abstraction for the hardware
like applications do for all other things!

> This is because having the capability to do zone management from
> applications that already work with objects fits much better.

ZoneFS doesn't absolve applications from having to perform zone
management to pack their objects and garbage collect stale storage
space.  ZoneFS merely provides a generic, file-based,
hardware-independent API for performing these zone management tasks.
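
To illustrate, zone garbage collection on top of ZoneFS is just file
manipulation. A rough sketch, where copy_live_objects() is a
placeholder for the application-specific object iteration (what is
"live" is entirely the application's business) and error handling is
trimmed:

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <unistd.h>

    /* Application-specific: relocate live objects from src to dst. */
    extern int copy_live_objects(int src_fd, int dst_fd);

    int gc_zone(const char *victim, const char *target)
    {
        /* Both zones are just files under the ZoneFS mount point. */
        int src = open(victim, O_RDWR | O_DIRECT);
        int dst = open(target, O_WRONLY | O_APPEND | O_DIRECT);
        int ret;

        /* Repack live data at the target zone's write pointer. */
        ret = copy_live_objects(src, dst);

        /* Reclaim the stale space by resetting the victim zone. */
        if (!ret)
            ret = ftruncate(src, 0);

        close(src);
        close(dst);
        return ret;
    }

No zone state machines, no vendor-specific commands - the same
operations work on any zoned device the kernel supports.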

> My point is that there is space for both ZoneFS and raw zoned block
> device access. And regarding non-power-of-2 (!PO2) zone sizes, my
> point is that support for them can be leveraged both by btrfs and by
> this raw zoned block device access.

On that I disagree - any argument that starts with "we need raw
zoned block device access to ...." is starting from an invalid
premise. We should be hiding hardware quirks from userspace, not
exposing them further.

IMO, we want writing zone-storage-native applications to be simple
and approachable by anyone who knows how to write to append-only
files.  We do not want such applications to be limited to people who
have deep and rare expertise in the dark details of, say, the largely
undocumented, niche NVMe ZNS specification and its protocol quirks.

ZoneFS provides us with a path to the former; what you are
advocating is the latter....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

