From: Demi Marie Obenour <demi@invisiblethingslab.com>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: Comparison to ZFS and BTRFS
Date: Tue, 19 Apr 2022 09:16:43 -0400
Message-ID: <Yl62Pe/mrvb1mzuD@itl-email>
In-Reply-To: <20220419013534.fb5m6kd6f6ithcig@moria.home.lan>

On Mon, Apr 18, 2022 at 09:35:34PM -0400, Kent Overstreet wrote:
> On Mon, Apr 18, 2022 at 10:07:38AM -0400, Demi Marie Obenour wrote:
> > On Fri, Apr 15, 2022 at 03:11:40PM -0400, Kent Overstreet wrote:
> > > On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > > > How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> > > > licensed under GPL-compatible terms is an advantage for inclusion in
> > > > Linux, but I am more interested in the technical aspects.
> > > > 
> > > > - How does bcachefs avoid the nasty performance pitfalls that plague
> > > >   BTRFS?  Are VM disks and databases on bcachefs fast?
> > > 
> > > Clean modular design (the result of years of slow incremental work), and a
> > > _blazingly_ fast B+ tree implementation.
> > > 
> > > We're not fast in every situation yet. We don't have a nocow (non copy-on-write)
> > > mode, and slow random reads can be slow due to checksum granularity being at the
> > > extent level (which is a good tradeoff in most situations, but we need an option
> > > for smaller checksum granularity at some point).
> > 
> > How well does bcachefs handle writes to files that have extents shared
> > (via reflinks or snapshots) with other files?  I would like to use
> > bcachefs in Qubes OS once it reaches mainline, and in Qubes OS, each VM
> > disk image is typically a snapshot of the previous revision.  Therefore,
> > each write breaks sharing.  I am curious how well bcachefs handles this
> > situation; I know that at least dm-thin is not optimized for it.  Also,
> > for a file of size N, are reflinks O(N), or are they O(log N) or better?
> 
> O(N), but they're also cheap to overwrite.

That’s understandable, if somewhat unfortunate.  If the constant factor
is small enough, it should not be too big a problem in practice unless
the files are huge.  Qubes OS also has an optimization that allows the
reflinks to be created in the background, rather than while users are
waiting on them.  Are there optimizations for already-reflinked files?
Or are subvolumes better suited to this use case?
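
For concreteness, here is a rough sketch of the pattern I mean (the
file names are made up; this uses the FICLONE ioctl that
cp --reflink=always uses under the hood):

/*
 * Hypothetical illustration of the Qubes-style pattern: create a cheap
 * copy-on-write clone of the previous VM image revision, then let
 * subsequent writes to the new image break sharing extent by extent.
 */
#include <fcntl.h>
#include <linux/fs.h>      /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("root-rev1.img", O_RDONLY);
    int dst = open("root-rev2.img", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* Share every extent of src with dst; this is the O(N) step. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("FICLONE");
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}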

> > How much of a performance hit can one expect from erasure coding,
> > compared to mirroring?
> 
> Should be very little, but it's not yet stable enough for real world performance
> testing.

Thanks!

> > Is there something lower-level available?  For instance, where should
> > one look if they want to add (read-only) bcachefs support to GRUB?
> 
> The sanest thing to do would be to port bcachefs to grub - you can't read
> anything without reading the journal and overlaying that over the btree, if
> you're not doing journal replay, so that's a lot of code that you really don't
> want to rewrite - and just reading from btree nodes is non trivial. Bcachefs has
> been ported to userspace already, so it'd be a big undertaking but not crazy.

That makes sense.  GRUB has a policy of never mutating anything except
a tiny environment block, which makes it effectively equivalent to
mounting with ‘-o nochanges’.
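
For reference, the userspace equivalent of that constraint would look
something like the following (a rough sketch only; the device path is a
placeholder, and I am going from the manual's description of the
option):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /*
     * MS_RDONLY plus the bcachefs "nochanges" option: nothing at all
     * is written to the device, and the journal is overlaid over the
     * btree in memory rather than replayed on disk.
     */
    if (mount("/dev/sdX", "/mnt", "bcachefs", MS_RDONLY, "nochanges") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}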

> > Also, is it possible to mount a bcachefs filesystem off of a truly
> > immutable volume?
> 
> Yes.

Thanks.  I was worried that this was not possible without replaying the
journal.  I should have read the manual first :).

> > > > - Can bcachefs use faster storage as a cache for slower storage, or
> > > >   otherwise move data around based on usage patterns?
> > > 
> > > Yes.
> > 
> > I am not surprised, considering that bcachefs is based on bcache.  Is
> > there any manual configuration required, or can bcachefs detect fast and
> > slow storage automatically?  Also, does the data remain on the slow
> > storage, or can bcachefs move frequently-used data entirely off of slow
> > storage to make room for infrequently used data?
> 
> You should be reading the manual for these kinds of questions:
> https://bcachefs.org/bcachefs-principles-of-operation.pdf

Indeed I should, sorry!

> Long story short, you tell the IO path where to put things and it can be
> configured filesystem wide, or per file/directory.

Nice!  I was especially impressed by this: “Devices need not have the
same performance characteristics: we track device IO latency and direct
reads to the device that is currently fastest.”  That adaptive behavior
is something I would have expected from a high-end storage array.
Having it in an open source filesystem will be amazing.
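
From my reading of the manual, per-directory targeting would look
something like the following from userspace.  I am guessing at the
exact xattr name, value format, and directory, so please treat it
purely as an illustration:

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    /* Assumed xattr name and target label -- check the manual. */
    const char *name  = "bcachefs.background_target";
    const char *value = "hdd";

    /* Ask that cold data under this directory migrate to the slow tier. */
    if (setxattr("/srv/archive", name, value, strlen(value), 0) != 0) {
        perror("setxattr");
        return 1;
    }
    return 0;
}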

> > > > - Can bcachefs saturate your typical NVMe drive on realistic workloads?
> > > >   Can it do so with encryption enabled?
> > > 
> > > This sounds like a question for someone interested in benchmarking :)
> > 
> > I would love to benchmark, but right now I don’t have any machines on
> > which I am willing to install a bespoke kernel build.  I might be able
> > to try bcachefs in a VM, though.  I’m also no expert in storage
> > benchmarking.
> > 
> > > > - Is support for swap files on bcachefs planned?  That would require
> > > >   being able to perform O_DIRECT asynchronous writes without any memory
> > > >   allocations.
> > > 
> > > Yes it's planned, the IO path already has the necessary support
> > 
> > That is awesome!  Will it require disabling CoW or checksums, or will it
> > work even with CoW and checksums enabled and without risking deadlocks?
> 
> Normal IO path, so CoW and checksums and encryption and all.

That is incredible.

> > > > - Is bcachefs being used in production anywhere?
> > > 
> > > Yes
> > 
> > Are there any places that are willing to talk about their use of
> > bcachefs?  Is bcachefs basically the WireGuard of filesystems?
> > 
> > A few other questions:
> > 
> > 1. What would it take for bcachefs to be buildable as a loadable kernel
> >    module?  That would be much more convenient than building a kernel,
> >    and might allow bcachefs to be packaged in distributions.
> 
> Not gonna happen. When I'm ready for more users I'll focus on upstreaming it,
> right now I've still got bugs to fix :)

And I am glad that is your priority :).  A stable, high-quality
filesystem is worth the wait.

> > 2. Would it be possible to digitally sign releases?  The means to sign
> >    them is not particularly relevant, so long as it is secure.  OpenPGP,
> >    signify, minisign, and ssh-keygen -Y are all fine.
> > 
> > 3. Are there plans to add longer, random nonces to the encryption
> >    implementation?  One long-term goal of Qubes OS is untrusted storage
> >    domains, and that requires that encrypted bcachefs be safe against a
> >    malicious block device.  A simple way to implement this is to use a
> >    192-bit random nonce stored along each 128-bit authentication tag,
> >    and use XChaCha20-Poly1305 as the cipher.  A 192-bit nonce is long
> >    enough that one can safely pick a random number at each boot, and
> >    then increment it for each encryption.  This also requires that any
> >    data read from disk that has not been authenticated be treated as
> >    untrusted.
> 
> Nonces are stored with pointers, not with the data they protect, so this isn't
> necessary for what you're talking about - nonces are themselves encrypted and
> authenticated, with a chain of trust up to the superblock, or journal after an
> unclean shutdown.

The problem with this approach is a whole-volume replay attack.  It’s
easy for a malicious storage device to roll back the entire volume, but
keep a snapshot for future use.  The next time the volume is mounted,
bcachefs might reuse the same nonces, but with different data.  Disaster
ensues.  Adding randomness is necessary to prevent this, and the
approach I recommended is the simplest one I am aware of.  In
cryptography, simpler is generally better.  I see that a ‘wide_macs’
option is available; could this be an extension of that?
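
Concretely, the scheme I have in mind looks roughly like this with
libsodium; it is an illustration of the idea, not a proposed patch:

#include <sodium.h>
#include <stdio.h>

int main(void)
{
    if (sodium_init() < 0)
        return 1;

    unsigned char key[crypto_aead_xchacha20poly1305_ietf_KEYBYTES];
    unsigned char nonce[crypto_aead_xchacha20poly1305_ietf_NPUBBYTES]; /* 24 bytes */
    crypto_aead_xchacha20poly1305_ietf_keygen(key);
    randombytes_buf(nonce, sizeof nonce);   /* fresh 192-bit nonce at boot */

    const unsigned char msg[] = "extent data";
    unsigned char ct[sizeof msg + crypto_aead_xchacha20poly1305_ietf_ABYTES];
    unsigned long long ct_len;

    /* Encrypt one block, then bump the nonce for the next encryption. */
    crypto_aead_xchacha20poly1305_ietf_encrypt(ct, &ct_len, msg, sizeof msg,
                                               NULL, 0, NULL, nonce, key);
    sodium_increment(nonce, sizeof nonce);

    printf("ciphertext+tag: %llu bytes\n", ct_len);
    return 0;
}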

> However, the superblock isn't currently authenticated - that would be nice to
> fix.

It would indeed; I will file an issue for that if none has already been
filed.  How is the journal handled?  For instance, could each
journal entry have a MAC or hash of the previous one, with the
superblock having a MAC or hash of the most recent journal entry as well
as a pointer to the first one?
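
To sketch what I mean (purely conceptual; this is not the actual
on-disk format):

#include <sodium.h>
#include <string.h>

#define MAC_BYTES crypto_generichash_BYTES

struct journal_entry {
    unsigned char prev_mac[MAC_BYTES];  /* MAC over the previous entry */
    unsigned char payload[256];         /* journal data, simplified */
};

/* MAC an entire entry (prev_mac + payload) under a filesystem key. */
static void mac_entry(const struct journal_entry *e,
                      const unsigned char key[crypto_generichash_KEYBYTES],
                      unsigned char out[MAC_BYTES])
{
    crypto_generichash(out, MAC_BYTES, (const unsigned char *)e, sizeof *e,
                       key, crypto_generichash_KEYBYTES);
}

int main(void)
{
    if (sodium_init() < 0)
        return 1;

    unsigned char key[crypto_generichash_KEYBYTES];
    unsigned char mac1[MAC_BYTES], superblock_mac[MAC_BYTES];
    struct journal_entry e1 = {0}, e2 = {0};

    crypto_generichash_keygen(key);

    memcpy(e1.payload, "entry 1", 7);
    mac_entry(&e1, key, mac1);

    /* Entry 2 commits to entry 1; the superblock commits to entry 2. */
    memcpy(e2.prev_mac, mac1, MAC_BYTES);
    memcpy(e2.payload, "entry 2", 7);
    mac_entry(&e2, key, superblock_mac);

    return 0;
}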

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
