From: Adam Manzanares <a.manzanares@samsung.com>
To: Damien Le Moal <Damien.LeMoal@wdc.com>
Cc: "Javier González" <javier@javigon.com>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"Matias Bjørling" <Matias.Bjorling@wdc.com>,
	"Bart Van Assche" <bvanassche@acm.org>,
	"Keith Busch" <Keith.Busch@wdc.com>,
	"Johannes Thumshirn" <Johannes.Thumshirn@wdc.com>,
	"Naohiro Aota" <Naohiro.Aota@wdc.com>,
	"Pankaj Raghav" <pankydev8@gmail.com>,
	"Kanchan Joshi" <joshi.k@samsung.com>,
	"Nitesh Shetty" <nj.shetty@samsung.com>
Subject: Re: [LSF/MM/BPF BoF] BoF for Zoned Storage
Date: Thu, 3 Mar 2022 14:55:56 +0000
Message-ID: <20220303145551.GA7057@bgt-140510-bm01>
In-Reply-To: <8386a6b9-3f06-0963-a132-5562b9c93283@wdc.com>

On Thu, Mar 03, 2022 at 09:49:06AM +0000, Damien Le Moal wrote:
> On 2022/03/03 8:29, Javier González wrote:
> > On 03.03.2022 06:32, Javier González wrote:
> >>
> >>> On 3 Mar 2022, at 04.24, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>>
> >>> Thinking proactively about LSFMM, regarding just Zone storage..
> >>>
> >>> I'd like to propose a BoF for Zoned Storage. The point of it is
> >>> to address the existing pain points we have and take advantage of
> >>> having folks in the room so we can settle on things faster which
> >>> otherwise would take years.
> >>>
> >>> I'll throw at least one topic out:
> >>>
> >>>  * Raw access for zone append for microbenchmarks:
> >>>      - are we really happy with the status quo?
> >>>      - if not, what outlets do we have?
> >>>
> >>> I think the NVMe passthrough stuff deserves its own shared
> >>> discussion though and should not be made part of the BoF.
> >>>
> >>>  Luis
> >>
> >> Thanks for proposing this, Luis.
> >>
> >> I’d like to join this discussion too.
> >>
> >> Thanks,
> >> Javier
> > 
> > Let me expand a bit on this. There is one topic that I would like to
> > cover in this session:
> > 
> >    - PO2 zone sizes
> >        In the past weeks we have been talking to Damien and Matias around
> >        the constraint that we currently have for PO2 zone sizes. While
> >        this has not been an issue for SMR HDDs, the gap that ZNS
> >        introduces between zone capacity and zone size causes holes in the
> >        address space. This unmapped LBA space has been the topic of
> >        discussion with several ZNS adopters.
> > 
> >        One of the things to note here is that even if the zone size is a
> >        PO2, the zone capacity is typically not. This means that even when
> >        we can use shifts to move around zones, the actual data placement
> >        algorithms need to deal with arbitrary sizes. So at the end of the
> >        day applications that use a contiguous address space - like in a
> >        conventional block device -, will have to deal with this.
> 
> "the actual data placement algorithms need to deal with arbitrary sizes"
> 
> ???
> 
> No it does not. With zone cap < zone size, the amount of sectors that can be
> used within a zone may be smaller than the zone size, but:
> 1) Writes still must be issued at the WP location so choosing a zone for writing
> data has the same constraint regardless of the zone capacity: Do I have enough
> usable sectors left in the zone ?

Are you saying the holes are irrelevant because an application has to query
the device for a zone's status before using it, and at that point it knows
the start LBA? I see your point here, but we have to make assumptions to
arrive at this conclusion.

Let's think of another scenario, where the drive is managed by a user space
application that tracks the status of zones and picks a zone because it knows
the zone is free. To calculate the write offset in terms of LBAs, the
application has to account for the difference between zone_size and zone_cap.
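
A rough sketch of that calculation, in Python for brevity. The geometry
values are made up for illustration, and zone_start_lba() and dense_to_lba()
are hypothetical helper names, not an existing API:

```python
# Hypothetical geometry in 512-byte sectors; not taken from any real device.
ZONE_SIZE = 1 << 19      # PO2 zone size: 524288 sectors (256 MiB)
ZONE_CAP = 393216        # usable capacity: 192 MiB, not a power of 2


def zone_start_lba(zone_idx: int) -> int:
    """Start LBA of a zone; a shift works because ZONE_SIZE is a PO2."""
    return zone_idx << (ZONE_SIZE.bit_length() - 1)


def dense_to_lba(offset: int) -> int:
    """Map an offset in a hole-free address space to a device LBA.

    The application must skip the (ZONE_SIZE - ZONE_CAP) hole at the
    end of every zone, so it needs a division by the non-PO2 capacity.
    """
    zone_idx, in_zone = divmod(offset, ZONE_CAP)
    return zone_start_lba(zone_idx) + in_zone
```

Even with a PO2 zone_size, the shift only helps for finding the zone start;
the mapping from a contiguous offset still goes through zone_cap.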

My argument is that zone_size is a construct conceived to make a ZNS zone
a power of 2, and that it creates a hole in the LBA space. Applications don't want
to deal with the power of 2 constraint and neither do devices. It seems like
the existing zoned kernel infrastructure, which made sense for SMR, pushed 
this constraint onto devices and onto users. Arguments can be made for where 
complexity should lie, but I don't think this decision made things easier for
someone to use a ZNS SSD as a block device. 

> 2) Reading after the WP is not useful (if not outright stupid), regardless of
> where the last usable sector in the zone is (at zone start + zone size or at
> zone start + zone cap).

Of course, but with PO2 you force useless LBA space even if you fill a zone.
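
To put a number on that, here is a small sketch with made-up sizes
(hypothetical values in 512-byte sectors, not from a real device) showing how
much LBA space stays unmapped even when every zone is completely full:

```python
def unmapped_lbas(zone_size: int, zone_cap: int, nr_zones: int) -> int:
    """LBAs that can never hold data: the gap past zone_cap in each zone."""
    if zone_cap > zone_size:
        raise ValueError("zone capacity cannot exceed zone size")
    return (zone_size - zone_cap) * nr_zones


# Hypothetical device: 1000 zones, 256 MiB zone size, 192 MiB capacity.
zone_size, zone_cap, nr_zones = 524288, 393216, 1000
holes = unmapped_lbas(zone_size, zone_cap, nr_zones)
fraction = holes / (zone_size * nr_zones)   # share of the LBA range that is holes
```

With these example numbers a quarter of the advertised LBA range is holes.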


> 
> And talking about "use a contiguous address space" is in my opinion nonsense in
> the context of zoned storage since by definition, everything has to be managed
> using zones as units. The only sensible range for a "contiguous address space"
> is "zone start + min(zone cap, zone size)".

I definitely disagree with this, given the previous arguments. This is a
construct forced upon us because of the zoned storage legacy.

> 
> >        Since chunk_sectors is no longer required to be a PO2, we have
> >        started work on removing this constraint. We are working in 2
> >        phases:
> > 
> >          1. Add an emulation layer in NVMe driver to simulate PO2 devices
> > 	when the HW presents a zone_capacity = zone_size. This is a
> > 	product of one of Damien's early concerns about supporting
> > 	existing applications and FSs that work under the PO2
> > 	assumption. We will post these patches in the next few days.
> > 
> >          2. Remove the PO2 constraint from the block layer and add
> >             support for arbitrary zone sizes in btrfs. This will allow the
> >             raw block device to be present for arbitrary zone sizes (and
> >             capacities) and btrfs will be able to use it natively.
> 
> Zone sizes cannot be arbitrary in btrfs since block groups must be a multiple of
> 64K. So constraints remain and should be enforced, at least by btrfs.

I don't think we should base a lot of decisions on the work that has gone into 
btrfs. I think it is very promising, but I don't think it is settled that it 
is the only way people will consume ZNS SSDs.
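
For reference, the two constraints being compared can be written as simple
checks. This is only an illustration in Python: the function names are made
up, the 512-byte sector size is an assumption, and the 64K figure is the
block group granularity Damien mentions above:

```python
SECTOR_SIZE = 512                          # assumed logical sector size
ALIGN_64K = 64 * 1024 // SECTOR_SIZE       # 64K expressed in sectors (128)


def is_po2(zone_size: int) -> bool:
    """True if the zone size (in sectors) is a power of 2."""
    return zone_size > 0 and (zone_size & (zone_size - 1)) == 0


def is_64k_multiple(zone_size: int) -> bool:
    """True if the zone size (in sectors) is a multiple of 64K."""
    return zone_size > 0 and zone_size % ALIGN_64K == 0
```

A 192 MiB zone (393216 sectors) passes the 64K check but not the PO2 one, so
the 64K requirement is a much weaker constraint than PO2.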

> 
> > 
> > 	For completeness, F2FS works natively in PO2 zone sizes, so we
> > 	will not do work here for now, as the changes will not bring any
> > 	benefit. For F2FS, the emulation layer will help use devices
> > 	that do not have PO2 zone sizes.
> > 
> >       We are working towards having at least an RFC of (2) before LSF/MM.
> >       Since this is a topic that involves several parties across the
> >       stack, I believe that an F2F conversation will help lay the path
> >       forward.
> > 
> > Thanks,
> > Javier
> > 
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research
