From: Kent Overstreet <kent.overstreet@linux.dev>
To: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: lsf-pc@lists.linux-foundation.org,
	linux-bcachefs@vger.kernel.org,  linux-fsdevel@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] bcachefs
Date: Wed, 3 Jan 2024 12:52:36 -0500
Message-ID: <cgivkso5ugccwkhtd5rh3d6rkoxdrra3hxgxhp5e5m45kn623s@f6hd3iajb3zg>
In-Reply-To: <74751256-EA58-4EBB-8CA9-F1DD5E2F23FA@dubeyko.com>

On Wed, Jan 03, 2024 at 10:39:50AM +0300, Viacheslav Dubeyko wrote:
> 
> 
> > On Jan 2, 2024, at 7:05 PM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > On Tue, Jan 02, 2024 at 11:02:59AM +0300, Viacheslav Dubeyko wrote:
> >> 
> >> 
> >>> On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >>> 
> >>> LSF topic: bcachefs status & roadmap
> >>> 
> >> 
> >> <skipped>
> >> 
> >>> 
> >>> A delayed-allocation mode for btree nodes is coming, which is the
> >>> main piece needed for ZNS support.
> >>> 
> >> 
> >> I may have missed some emails, but have you already shared your vision
> >> of the ZNS support architecture for bcachefs? It would be interesting
> >> to hear the high-level concept.
> > 
> > There's not a whole lot to it. bcache/bcachefs allocation is already
> > bucket based, where the model is that we allocate a bucket, then write
> > to it sequentially and never overwrite until the whole bucket is reused.
> > 
> > The main exception has been btree nodes, which are log structured and
> > typically smaller than a bucket; that doesn't break the "no overwrites"
> > property ZNS wants, but it does mean writes within a bucket aren't
> > happening sequentially.
> > 
> > So I'm adding a mode where every time we do a btree node write we write
> > out the whole node to a new location, instead of appending at an
> > existing location. It won't be as efficient for random updates across a
> > large working set, but in practice that doesn't happen too much; average
> > btree write size has always been quite high on any filesystem I've
> > looked at.
> > 
> > Aside from that, it's mostly just plumbing and integration; bcachefs on
> > ZNS will work pretty much just the same as bcachefs on regular block devices.
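
To make that concrete, here's a minimal sketch of the two ideas above --
sequential-only writes within a bucket, and whole-node rewrites for
btree nodes. All names are hypothetical; this is not the actual
bcachefs code:

    #include <stdint.h>
    #include <errno.h>

    struct bucket {
        uint64_t offset;    /* start of bucket on disk, in sectors */
        uint64_t cursor;    /* next sequential write offset in bucket */
        uint64_t size;      /* bucket size, in sectors */
    };

    /*
     * Writes within a bucket only ever advance the cursor; nothing is
     * overwritten until the whole bucket is reclaimed and reused.
     * This is what maps directly onto a ZNS zone's write pointer.
     */
    static int bucket_append(struct bucket *b, uint64_t sectors)
    {
        if (b->cursor + sectors > b->size)
            return -ENOSPC;    /* bucket full: caller allocates a new one */
        /* ... submit I/O at b->offset + b->cursor ... */
        b->cursor += sectors;
        return 0;
    }

    /*
     * The new btree node mode: write out the whole node to a fresh
     * location instead of appending at its existing one.
     */
    struct btree_node;
    extern struct bucket *alloc_bucket(void);    /* assumed allocator hook */

    static int btree_node_rewrite(struct btree_node *node,
                                  uint64_t node_sectors)
    {
        struct bucket *b = alloc_bucket();

        /* node contents would be serialized here */
        return b ? bucket_append(b, node_sectors) : -ENOSPC;
    }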
> 
> I assume you are aware of the limited number of open/active zones on
> ZNS devices. It means you can open only N zones for write operations
> simultaneously (for example, 14 zones in the case of a WDC ZNS device).
> Can bcachefs survive with such a limitation? Can you limit the number
> of buckets open for write operations?

Yes, open/active zones correspond to write points in the bcachefs
allocator. The default number of write points is 32 for user writes plus
a few for internal ones, but it's not a problem to run with fewer.
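
Roughly, the correspondence looks like this (a sketch with hypothetical
names and numbers -- not actual bcachefs option handling):

    #define WRITE_POINTS_USER       32  /* default for user writes */
    #define WRITE_POINTS_INTERNAL    4  /* journal, copygc, etc.; "a few" */

    static unsigned nr_write_points(unsigned max_active_zones)
    {
        unsigned want = WRITE_POINTS_USER + WRITE_POINTS_INTERNAL;

        /* 0 means no limit, i.e. a conventional block device */
        if (!max_active_zones || max_active_zones >= want)
            return want;

        /*
         * e.g. a device allowing 14 active zones: keep the internal
         * write points and shrink the user write points to 10.
         */
        return max_active_zones;
    }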

> Another potential issue could be the zone size. The WDC ZNS device has
> a 2GB zone size (with 1GB capacity). Can a bucket be that huge? And can
> the btree model of operations work with such huge zones?

Yes. It'll put more pressure on copying garbage collection, but that's
about it.
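
Back-of-the-envelope, to quantify "more pressure" (rough arithmetic, not
a measurement): evacuating a bucket that is a fraction u live rewrites
u * bucket_size of data to reclaim (1 - u) * bucket_size of space, i.e.
write amplification of roughly 1 / (1 - u). That ratio is independent of
bucket size, but each evacuation becomes a much larger unit of work: a
1GB bucket that is 80% live costs an 800MB copy to recover 200MB, versus
an 800KB copy to recover 200KB with 1MB buckets.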

> Technically speaking, this limitation (14 open/active zones) could be a
> factor in performance degradation. Could such a limitation affect
> bcachefs performance?

I'm not sure what performance degradation you speak of, but no, that
won't affect bcachefs. 

> Could the ZNS model affect GC operations? Or, conversely, could the ZNS
> model help manage GC operations more efficiently?

The ZNS model only adds restrictions on top of a regular block device,
so no, it's not _helpful_ for our GC operations.

But: since our existing allocation model maps so well to zones, our
existing GC model won't be hurt either, and doing GC in the filesystem
naturally has benefits: we know exactly what data is live, and we have
access to the LBA mapping, so we can better avoid fragmentation.
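
For example, copygc victim selection can use exact per-bucket accounting
(a minimal sketch with hypothetical names -- an FTL doing GC below the
block layer never has this information):

    #include <stdint.h>

    struct bucket_stats {
        uint64_t live_sectors;  /* exact, from filesystem accounting */
    };

    /* Pick the cheapest bucket to evacuate: the one with the least
     * live data still in it. */
    static int best_copygc_victim(const struct bucket_stats *b, unsigned n)
    {
        int best = -1;
        uint64_t best_live = UINT64_MAX;

        for (unsigned i = 0; i < n; i++)
            if (b[i].live_sectors < best_live) {
                best_live = b[i].live_sectors;
                best = (int) i;
            }

        return best;    /* index of least-live bucket, or -1 if n == 0 */
    }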

> Do you need a conventional zone? Could bcachefs work without using
> the conventional zone of a ZNS device?

Not required, but if zones are all 1GB+ you'd want a small conventional
zone so as to avoid burning two whole zones for the superblock.
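
To put numbers on "burning two whole zones": the superblock itself is
small (on the order of kilobytes to a megabyte), but without a
conventional zone each of the two superblock copies would pin an entire
sequential zone, so on a 2GB-zone device that's ~4GB sequestered for a
few KB of metadata. A single small conventional zone, which allows
in-place overwrite, can hold both copies instead.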
