linux-bcachefs.vger.kernel.org archive mirror
From: Kent Overstreet <kent.overstreet@gmail.com>
To: Demi Marie Obenour <demi@invisiblethingslab.com>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: Comparison to ZFS and BTRFS
Date: Mon, 18 Apr 2022 21:35:34 -0400
Message-ID: <20220419013534.fb5m6kd6f6ithcig@moria.home.lan>
In-Reply-To: <Yl1wqhHUGRnPdKjx@itl-email>

On Mon, Apr 18, 2022 at 10:07:38AM -0400, Demi Marie Obenour wrote:
> On Fri, Apr 15, 2022 at 03:11:40PM -0400, Kent Overstreet wrote:
> > On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > > How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> > > licensed under GPL-compatible terms is an advantage for inclusion in
> > > Linux, but I am more interested in the technical aspects.
> > > 
> > > - How does bcachefs avoid the nasty performance pitfalls that plague
> > >   BTRFS?  Are VM disks and databases on bcachefs fast?
> > 
> > Clean modular design (the result of years of slow incremental work), and a
> > _blazingly_ fast B+ tree implementation.
> > 
> > We're not fast in every situation yet. We don't have a nocow (non copy-on-write)
> > mode, and random reads can be slow due to checksum granularity being at the
> > extent level (which is a good tradeoff in most situations, but we need an option
> > for smaller checksum granularity at some point).
> 
> How well does bcachefs handle writes to files that have extents shared
> (via reflinks or snapshots) with other files?  I would like to use
> bcachefs in Qubes OS once it reaches mainline, and in Qubes OS, each VM
> disk image is typically a snapshot of the previous revision.  Therefore,
> each write breaks sharing.  I am curious how well bcachefs handles this
> situation; I know that at least dm-thin is not optimized for it.  Also,
> for a file of size N, are reflinks O(N), or are they O(log N) or better?

O(N), but they're also cheap to overwrite.

> How much of a performance hit can one expect from erasure coding,
> compared to mirroring?

Should be very little, but it's not yet stable enough for real world performance
testing.

> Is there something lower-level available?  For instance, where should
> one look if they want to add (read-only) bcachefs support to GRUB?

The sanest thing to do would be to port bcachefs to grub. If you're not doing
journal replay, you can't read anything without reading the journal and
overlaying it on the btree, so that's a lot of code that you really don't want
to rewrite, and just reading from btree nodes is non-trivial. Bcachefs has
already been ported to userspace, so it'd be a big undertaking but not crazy.
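
To illustrate what "overlaying the journal over the btree" means for a
read-only reader, here is a minimal sketch. It is a toy model, not bcachefs
code: keys and values are plain strings, the btree is a dict, the journal is a
list of updates ordered oldest first, and all function names are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical model: a journal entry is (key, value); value None means "deleted".
JournalEntry = Tuple[str, Optional[str]]


def build_overlay(journal: List[JournalEntry]) -> Dict[str, Optional[str]]:
    """Collapse journal entries (oldest first) so the newest update per key wins."""
    overlay: Dict[str, Optional[str]] = {}
    for key, value in journal:
        overlay[key] = value
    return overlay


def lookup(key: str, btree: Dict[str, str],
           overlay: Dict[str, Optional[str]]) -> Optional[str]:
    """Read one key without writing anything back to the btree."""
    if key in overlay:
        return overlay[key]  # may be None: the journal deleted this key
    return btree.get(key)


def iterate(btree: Dict[str, str],
            overlay: Dict[str, Optional[str]]) -> Iterator[Tuple[str, str]]:
    """Yield live (key, value) pairs in key order from the merged view."""
    for key in sorted(set(btree) | set(overlay)):
        value = lookup(key, btree, overlay)
        if value is not None:
            yield key, value


btree = {"a": "1", "b": "2", "c": "3"}
journal = [("b", "20"), ("c", None), ("d", "4"), ("b", "21")]
overlay = build_overlay(journal)
merged = list(iterate(btree, overlay))
```

The on-disk btree is never modified here, which is the property a bootloader
would need; the real formats and iterators are far more involved.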

> Also, is it possible to mount a bcachefs filesystem off of a truly
> immutable volume?

Yes.

> > > - Can bcachefs use faster storage as a cache for slower storage, or
> > >   otherwise move data around based on usage patterns?
> > 
> > Yes.
> 
> I am not surprised, considering that bcachefs is based on bcache.  Is
> there any manual configuration required, or can bcachefs detect fast and
> slow storage automatically?  Also, does the data remain on the slow
> storage, or can bcachefs move frequently-used data entirely off of slow
> storage to make room for infrequently used data?

You should be reading the manual for these kinds of questions:
https://bcachefs.org/bcachefs-principles-of-operation.pdf

Long story short, you tell the IO path where to put things and it can be
configured filesystem wide, or per file/directory.
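
As a rough illustration of "filesystem wide, or per file/directory", the
sketch below resolves an option by taking the nearest override on a path or
one of its parent directories and falling back to the filesystem-wide value.
This is an assumed model for explanation only: the function, the dictionaries
and the "ssd"/"hdd" labels are hypothetical, and bcachefs itself stores
options on inodes rather than walking paths like this.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

Options = Dict[str, str]


def resolve_option(path: str, name: str, fs_options: Options,
                   overrides: Dict[str, Options]) -> Optional[str]:
    """Nearest override on the path or an ancestor wins, else the fs-wide value."""
    p = PurePosixPath(path)
    for candidate in [p, *p.parents]:
        opts = overrides.get(str(candidate))
        if opts is not None and name in opts:
            return opts[name]
    return fs_options.get(name)


# Filesystem-wide setting, plus one directory that overrides it (labels are made up).
fs_options = {"background_target": "hdd"}
overrides = {"/vm": {"background_target": "ssd"}}

vm_target = resolve_option("/vm/images/a.img", "background_target",
                           fs_options, overrides)
home_target = resolve_option("/home/user/notes.txt", "background_target",
                             fs_options, overrides)
```

See the manual linked above for the actual option names and how they are set.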

> 
> > > - Can bcachefs saturate your typical NVMe drive on realistic workloads?
> > >   Can it do so with encryption enabled?
> > 
> > This sounds like a question for someone interested in benchmarking :)
> 
> I would love to benchmark, but right now I don’t have any machines on
> which I am willing to install a bespoke kernel build.  I might be able
> to try bcachefs in a VM, though.  I’m also no expert in storage
> benchmarking.
> 
> > > - Is support for swap files on bcachefs planned?  That would require
> > >   being able to perform O_DIRECT asynchronous writes without any memory
> > >   allocations.
> > 
> > Yes it's planned, the IO path already has the necessary support
> 
> That is awesome!  Will it require disabling CoW or checksums, or will it
> work even with CoW and checksums enabled and without risking deadlocks?

Normal IO path, so CoW and checksums and encryption and all.

> 
> > > - Is bcachefs being used in production anywhere?
> > 
> > Yes
> 
> Are there any places that are willing to talk about their use of
> bcachefs?  Is bcachefs basically the WireGuard of filesystems?
> 
> A few other questions:
> 
> 1. What would it take for bcachefs to be buildable as a loadable kernel
>    module?  That would be much more convenient than building a kernel,
>    and might allow bcachefs to be packaged in distributions.

Not gonna happen. When I'm ready for more users I'll focus on upstreaming it,
right now I've still got bugs to fix :)

> 
> 2. Would it be possible to digitally sign releases?  The means to sign
>    them is not particularly relevant, so long as it is secure.  OpenPGP,
>    signify, minisign, and ssh-keygen -Y are all fine.
> 
> 3. Are there plans to add longer, random nonces to the encryption
>    implementation?  One long-term goal of Qubes OS is untrusted storage
>    domains, and that requires that encrypted bcachefs be safe against a
>    malicious block device.  A simple way to implement this is to use a
>    192-bit random nonce stored along each 128-bit authentication tag,
>    and use XChaCha20-Poly1305 as the cipher.  A 192-bit nonce is long
>    enough that one can safely pick a random number at each boot, and
>    then increment it for each encryption.  This also requires that any
>    data read from disk that has not been authenticated be treated as
>    untrusted.

Nonces are stored with pointers, not with the data they protect, so this isn't
necessary for what you're talking about - nonces are themselves encrypted and
authenticated, with a chain of trust up to the superblock, or journal after an
unclean shutdown.

However, the superblock isn't currently authenticated - that would be nice to
fix.
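
For reference, the nonce scheme proposed in item 3 of the quoted message (pick
a random 192-bit value at boot, then increment it for each encryption) can be
sketched as follows. This models the proposal only, not what bcachefs does,
and it performs no encryption; the class name and the little-endian encoding
are assumptions made for the example.

```python
import secrets
from typing import Optional

NONCE_BITS = 192
NONCE_BYTES = NONCE_BITS // 8  # 24 bytes, the XChaCha20 nonce length


class BootNonceCounter:
    """Hypothetical helper: random starting nonce per boot, then a counter."""

    def __init__(self, start: Optional[int] = None) -> None:
        # A fresh random start per boot; `start` exists only to make tests repeatable.
        self._value = secrets.randbits(NONCE_BITS) if start is None else start

    def next_nonce(self) -> bytes:
        """Return the current nonce and advance the counter (wrapping at 2**192)."""
        nonce = self._value.to_bytes(NONCE_BYTES, "little")
        self._value = (self._value + 1) % (1 << NONCE_BITS)
        return nonce


counter = BootNonceCounter()
first = counter.next_nonce()
second = counter.next_nonce()
```

Each nonce would be stored next to the authentication tag it was used with, as
the quoted proposal describes.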
