linux-bcachefs.vger.kernel.org archive mirror
* Comparison to ZFS and BTRFS
@ 2022-04-06  6:55 Demi Marie Obenour
  2022-04-13 22:43 ` Eric Wheeler
  2022-04-15 19:11 ` Kent Overstreet
  0 siblings, 2 replies; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-06  6:55 UTC (permalink / raw)
  To: linux-bcachefs


How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
licensed under GPL-compatible terms is an advantage for inclusion in
Linux, but I am more interested in the technical aspects.

- How does bcachefs avoid the nasty performance pitfalls that plague
  BTRFS?  Are VM disks and databases on bcachefs fast?
- How does bcachefs avoid the dreaded RAID write hole? 
- How does an O_DIRECT loop device on bcachefs compare to a zvol on ZFS?
- Is there a good description of the bcachefs on-disk format anywhere?
- What are the internal abstraction layers used in bcachefs?  Is it a
  key-value store with a filesystem on top of it, the way ZFS is?
- Is it possible to shrink a bcachefs filesystem?  Does bcachefs have
  any restrictions regarding the size of disks in a pool, or can I just
  throw a bunch of varying-size disks at bcachefs and have it spread the
  data around automatically to provide the level of redundancy I want?
- Can bcachefs use faster storage as a cache for slower storage, or
  otherwise move data around based on usage patterns?
- Can bcachefs saturate your typical NVMe drive on realistic workloads?
  Can it do so with encryption enabled?
- Is support for swap files on bcachefs planned?  That would require
  being able to perform O_DIRECT asynchronous writes without any memory
  allocations.
- Is bcachefs being used in production anywhere?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



* Re: Comparison to ZFS and BTRFS
  2022-04-06  6:55 Comparison to ZFS and BTRFS Demi Marie Obenour
@ 2022-04-13 22:43 ` Eric Wheeler
  2022-04-15 19:11 ` Kent Overstreet
  1 sibling, 0 replies; 10+ messages in thread
From: Eric Wheeler @ 2022-04-13 22:43 UTC (permalink / raw)
  To: Demi Marie Obenour; +Cc: linux-bcachefs

On Wed, 6 Apr 2022, Demi Marie Obenour wrote:
> How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> licensed under GPL-compatible terms is an advantage for inclusion in
> Linux, but I am more interested in the technical aspects.

Hi Demi,

It sounds like you are curious about the technical implementation details,
so you might have a look at the architecture writeup and related
documentation:
	https://bcachefs.org/Architecture/

> - How does bcachefs avoid the dreaded RAID write hole? 

I believe erasure coding will handle this, though the Wiki says EC is 
still in development.  RAID1/10 works, however.

> - How does bcachefs avoid the nasty performance pitfalls that plague
>   BTRFS?  Are VM disks and databases on bcachefs fast?
> - How does an O_DIRECT loop device on bcachefs compare to a zvol on ZFS?

It would be great to see some benchmarks!  Demi, can you set up a disk 
benchmark on the Phoronix Test Suite and send us your findings?

Benchmarks of NVMe-tiered vs. non-tiered bcachefs against ZFS (with and 
without ARC) and btrfs would be interesting.  Including XFS and ext4 as a 
baseline would be good to see, too.

A second benchmark showing performance across snapshots would be 
informative as well; it would indicate the CoW performance behavior of 
bcachefs vs. ZFS vs. btrfs.

Kent might be interested in updating the performance page on the Wiki if 
you can provide numbers!

> - Is there a good description of the bcachefs on-disk format anywhere?

Same link: https://bcachefs.org/Architecture/ and also see these on the 
right side of the page:
    BtreeIterators
    BtreeNodes
    BtreeWhiteouts
    Encryption
    Transactions
    Snapshots
    Allocator
    Fsck
    Roadmap


> - What are the internal abstraction layers used in bcachefs?  Is it a
>   key-value store with a filesystem on top of it, the way ZFS is?

b-tree :)

> - Is it possible to shrink a bcachefs filesystem?  Does bcachefs have
>   any restrictions regarding the size of disks in a pool, or can I just
>   throw a bunch of varying-size disks at bcachefs and have it spread the
>   data around automatically to provide the level of redundancy I want?

Kent?

> - Can bcachefs use faster storage as a cache for slower storage, or
>   otherwise move data around based on usage patterns?

Tiered storage.  See "Feature Status" here:
	https://bcachefs.org/

> - Can bcachefs saturate your typical NVMe drive on realistic workloads?
>   Can it do so with encryption enabled?

Benchmarks welcome ;)

> - Is support for swap files on bcachefs planned?  That would require
>   being able to perform O_DIRECT asynchronous writes without any memory
>   allocations.

It's on the roadmap:
	https://bcachefs.org/Todo/

> - Is bcachefs being used in production anywhere?

I believe there are users with bcachefs running as the root filesystem 
(Kent, didn't you say you boot from bcachefs?).

We are experimenting with using it as a MySQL database filesystem but do 
not yet have data on that subject.

-Eric


* Re: Comparison to ZFS and BTRFS
  2022-04-06  6:55 Comparison to ZFS and BTRFS Demi Marie Obenour
  2022-04-13 22:43 ` Eric Wheeler
@ 2022-04-15 19:11 ` Kent Overstreet
  2022-04-18 14:07   ` Demi Marie Obenour
  2022-04-19  1:16   ` bcachefs loop devs (was: Comparison to ZFS and BTRFS) Eric Wheeler
  1 sibling, 2 replies; 10+ messages in thread
From: Kent Overstreet @ 2022-04-15 19:11 UTC (permalink / raw)
  To: Demi Marie Obenour; +Cc: linux-bcachefs

On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> licensed under GPL-compatible terms is an advantage for inclusion in
> Linux, but I am more interested in the technical aspects.
> 
> - How does bcachefs avoid the nasty performance pitfalls that plague
>   BTRFS?  Are VM disks and databases on bcachefs fast?

Clean modular design (the result of years of slow incremental work), and a
_blazingly_ fast B+ tree implementation.

We're not fast in every situation yet. We don't have a nocow (non copy-on-write)
mode, and small random reads can be slow due to checksum granularity being at the
extent level (which is a good tradeoff in most situations, but we need an option
for smaller checksum granularity at some point).

> - How does bcachefs avoid the dreaded RAID write hole? 

We're copy on write - and this extends to our erasure coding implementation:
we don't update existing stripes in place - we create new stripes as needed,
reusing buckets from existing stripes that still hold live data.
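
To sketch the idea (illustrative pseudo-C with invented names, not our actual
code): a replacement stripe is written in full, data and parity both, before
any pointers are flipped, so there is never a window where parity is
inconsistent with the data it protects:

	struct stripe *write_new_stripe(struct data_blocks *d)
	{
		struct stripe *s = alloc_stripe_buckets(); /* fresh buckets, never in place */

		fill_data(s, d);
		compute_parity(s);		/* parity matches exactly what we write */
		write_stripe_blocks(s);		/* data + parity, all to the new location */
		commit_pointers_atomically(s);	/* btree update makes it visible - last */
		return s;
	}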

> - How does an O_DIRECT loop device on bcachefs compare to a zvol on ZFS?

I'd have to benchmark/profile it. According to xfstests, there are some bugs
in the way the loop driver in O_DIRECT mode interacts with bcachefs, and the
loopback driver is implemented in a more heavyweight way than it needs to be -
there's room for improvement.

> - Is there a good description of the bcachefs on-disk format anywhere?

Try this: https://bcachefs.org/Architecture/

> - What are the internal abstraction layers used in bcachefs?  Is it a
>   key-value store with a filesystem on top of it, the way ZFS is?

It's just a key value store with a filesystem on top - even more so than
ZFS, from what I understand of ZFS.
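
Roughly speaking - a simplified sketch with invented field names, see the
Architecture page for the real key types - everything in the filesystem is
just a key in one of a handful of btrees:

	struct extent_key { u64 inode; u64 offset; };	     /* -> data pointers */
	struct inode_key  { u64 inode; };		     /* -> stat data     */
	struct dirent_key { u64 dir_inode; u64 name_hash; }; /* -> target inode  */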

> - Is it possible to shrink a bcachefs filesystem?

Not yet, but it won't take much work to add.

> Does bcachefs have
>   any restrictions regarding the size of disks in a pool, or can I just
>   throw a bunch of varying-size disks at bcachefs and have it spread the
>   data around automatically to provide the level of redundancy I want?

No restrictions: the allocator stripes across available devices, but biases in
favor of devices with more free space.
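
Schematically, the free-space bias looks something like this (illustrative
only, invented names and caller-supplied randomness - the real allocator is
more involved):

	struct dev { u64 free_space; };

	/* Choose the device for the next replica with probability
	 * proportional to its free space, so devices of different
	 * sizes fill up at roughly the same rate: */
	struct dev *pick_device(struct dev *devs, int nr, u64 rand)
	{
		u64 total = 0;
		int i;

		for (i = 0; i < nr; i++)
			total += devs[i].free_space;
		rand %= total;			/* assumes total > 0 */
		for (i = 0; i < nr; i++) {
			if (rand < devs[i].free_space)
				return &devs[i];
			rand -= devs[i].free_space;
		}
		return NULL;			/* unreachable while total > 0 */
	}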

> - Can bcachefs use faster storage as a cache for slower storage, or
>   otherwise move data around based on usage patterns?

Yes.

> - Can bcachefs saturate your typical NVMe drive on realistic workloads?
>   Can it do so with encryption enabled?

This sounds like a question for someone interested in benchmarking :)

> - Is support for swap files on bcachefs planned?  That would require
>   being able to perform O_DIRECT asynchronous writes without any memory
>   allocations.

Yes, it's planned - the IO path already has the necessary support.

> - Is bcachefs being used in production anywhere?

Yes


* Re: Comparison to ZFS and BTRFS
  2022-04-15 19:11 ` Kent Overstreet
@ 2022-04-18 14:07   ` Demi Marie Obenour
  2022-04-19  1:35     ` Kent Overstreet
  2022-04-19  1:16   ` bcachefs loop devs (was: Comparison to ZFS and BTRFS) Eric Wheeler
  1 sibling, 1 reply; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-18 14:07 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs


On Fri, Apr 15, 2022 at 03:11:40PM -0400, Kent Overstreet wrote:
> On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> > licensed under GPL-compatible terms is an advantage for inclusion in
> > Linux, but I am more interested in the technical aspects.
> > 
> > - How does bcachefs avoid the nasty performance pitfalls that plague
> >   BTRFS?  Are VM disks and databases on bcachefs fast?
> 
> Clean modular design (the result of years of slow incremental work), and a
> _blazingly_ fast B+ tree implementation.
> 
> We're not fast in every situation yet. We don't have a nocow (non copy-on-write)
> mode, and small random reads can be slow due to checksum granularity being at the
> extent level (which is a good tradeoff in most situations, but we need an option
> for smaller checksum granularity at some point).

How well does bcachefs handle writes to files that have extents shared
(via reflinks or snapshots) with other files?  I would like to use
bcachefs in Qubes OS once it reaches mainline, and in Qubes OS, each VM
disk image is typically a snapshot of the previous revision.  Therefore,
each write breaks sharing.  I am curious how well bcachefs handles this
situation; I know that at least dm-thin is not optimized for it.  Also,
for a file of size N, are reflinks O(N), or are they O(log N) or better?

> > - How does bcachefs avoid the dreaded RAID write hole? 
> 
> We're copy on write - and this extends to our erasure coding implementation:
> we don't update existing stripes in place - we create new stripes as needed,
> reusing buckets from existing stripes that still hold live data.

How much of a performance hit can one expect from erasure coding,
compared to mirroring?

> > - Is there a good description of the bcachefs on-disk format anywhere?
> 
> Try this: https://bcachefs.org/Architecture/

Is there something lower-level available?  For instance, where should
one look if they want to add (read-only) bcachefs support to GRUB?
Also, is it possible to mount a bcachefs filesystem off of a truly
immutable volume?

> > - What are the internal abstraction layers used in bcachefs?  Is it a
> >   key-value store with a filesystem on top of it, the way ZFS is?
> 
> It's just a key value store with a filesystem on top - even more so than
> ZFS, from what I understand of ZFS.
> 
> > - Is it possible to shrink a bcachefs filesystem?
> 
> Not yet, but it won't take much work to add.

That would be fantastic for desktop use.  Desktop users need to do all
sorts of wild things that are basically never needed in servers.

> > Does bcachefs have
> >   any restrictions regarding the size of disks in a pool, or can I just
> >   throw a bunch of varying-size disks at bcachefs and have it spread the
> >   data around automatically to provide the level of redundancy I want?
> 
> No restrictions: the allocator stripes across available devices, but biases
> in favor of devices with more free space.

That is awesome!  Is there a way to ask bcachefs to explicitly
redistribute the data, and let me know when it has finished?

> > - Can bcachefs use faster storage as a cache for slower storage, or
> >   otherwise move data around based on usage patterns?
> 
> Yes.

I am not surprised, considering that bcachefs is based on bcache.  Is
there any manual configuration required, or can bcachefs detect fast and
slow storage automatically?  Also, does the data remain on the slow
storage, or can bcachefs move frequently-used data entirely off of slow
storage to make room for infrequently used data?

> > - Can bcachefs saturate your typical NVMe drive on realistic workloads?
> >   Can it do so with encryption enabled?
> 
> This sounds like a question for someone interested in benchmarking :)

I would love to benchmark, but right now I don’t have any machines on
which I am willing to install a bespoke kernel build.  I might be able
to try bcachefs in a VM, though.  I’m also no expert in storage
benchmarking.

> > - Is support for swap files on bcachefs planned?  That would require
> >   being able to perform O_DIRECT asynchronous writes without any memory
> >   allocations.
> 
> Yes, it's planned - the IO path already has the necessary support.

That is awesome!  Will it require disabling CoW or checksums, or will it
work even with CoW and checksums enabled and without risking deadlocks?

> > - Is bcachefs being used in production anywhere?
> 
> Yes

Are there any places that are willing to talk about their use of
bcachefs?  Is bcachefs basically the WireGuard of filesystems?

A few other questions:

1. What would it take for bcachefs to be buildable as a loadable kernel
   module?  That would be much more convenient than building a kernel,
   and might allow bcachefs to be packaged in distributions.

2. Would it be possible to digitally sign releases?  The means to sign
   them is not particularly relevant, so long as it is secure.  OpenPGP,
   signify, minisign, and ssh-keygen -Y are all fine.

3. Are there plans to add longer, random nonces to the encryption
   implementation?  One long-term goal of Qubes OS is untrusted storage
   domains, and that requires that encrypted bcachefs be safe against a
   malicious block device.  A simple way to implement this is to use a
   192-bit random nonce stored alongside each 128-bit authentication tag,
   and use XChaCha20-Poly1305 as the cipher.  A 192-bit nonce is long
   enough that one can safely pick a random number at each boot, and
   then increment it for each encryption.  This also requires that any
   data read from disk that has not been authenticated be treated as
   untrusted.
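
A minimal userspace sketch of what I mean in item 3, using libsodium
(illustrative only - the framing and names here are mine, not a proposed
on-disk format):

	#include <sodium.h>
	#include <string.h>

	#define NONCE_LEN crypto_aead_xchacha20poly1305_ietf_NPUBBYTES	/* 24 bytes = 192 bits */

	/* out receives nonce || ciphertext+tag.  The nonce starts random
	 * at boot and is incremented per encryption, so a rolled-back
	 * volume can never force nonce reuse with different data. */
	int encrypt_block(unsigned char *out, unsigned char nonce[NONCE_LEN],
			  const unsigned char *msg, unsigned long long mlen,
			  const unsigned char *key)
	{
		unsigned long long clen;

		sodium_increment(nonce, NONCE_LEN);
		memcpy(out, nonce, NONCE_LEN);
		return crypto_aead_xchacha20poly1305_ietf_encrypt(
			out + NONCE_LEN, &clen, msg, mlen,
			NULL, 0, NULL, nonce, key);
	}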

I hope I have not taken too much of your time, Kent!  Thanks for the
quick responses!

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: bcachefs loop devs (was: Comparison to ZFS and BTRFS)
  2022-04-15 19:11 ` Kent Overstreet
  2022-04-18 14:07   ` Demi Marie Obenour
@ 2022-04-19  1:16   ` Eric Wheeler
  2022-04-19  1:41     ` Kent Overstreet
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Wheeler @ 2022-04-19  1:16 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Demi Marie Obenour, linux-bcachefs

On Fri, 15 Apr 2022, Kent Overstreet wrote:
> On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > - How does an O_DIRECT loop device on bcachefs compare to a zvol on ZFS?
> 
> I'd have to benchmark/profile it. According to xfstests, there are some bugs
> in the way the loop driver in O_DIRECT mode interacts with bcachefs, and the
> loopback driver is implemented in a more heavyweight way than it needs to be -
> there's room for improvement.

Hi Kent, regarding loop devs:

We wrote this up before realizing that REQ_OP_FLUSH does not order writes 
like REQ_FLUSH once did, so my premise in the email linked below was 
incorrect - but perhaps the concept is still relevant.

I wonder if something is going on between (1) the filesystem backing loop.c 
(bcachefs in this case), (2) block layer re-ordering, and (3) the kiocb 
ki_complete callback in loop.c that could create out-of-order journal 
commits in the filesystem above the loop device (eg, xfs):

	https://www.spinics.net/lists/linux-block/msg82730.html

  From loop.c in lo_rw_aio():
	[...]
	cmd->iocb.ki_pos = pos;
	cmd->iocb.ki_filp = file;
	cmd->iocb.ki_complete = lo_rw_aio_complete; 
	cmd->iocb.ki_flags = IOCB_DIRECT;
	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);

  A more detailed loop.c call tree summary is here:
	https://lore.kernel.org/all/59a58637-837-fc28-6cb9-d584aa21d60@ewheeler.net/T/ 

If bcachefs immediately calls .ki_complete() after queueing the IO within 
bcachefs but before it commits to bcachefs's disk, then loop.c will mark 
the IO as complete (blk_mq_complete_request via lo_rw_aio_complete) too 
soon after .write_iter is called, thus breaking the expected ordering in 
the filesystem (eg, xfs) atop the loop device.

This could be compounded if bcachefs's .write_iter calls can complete early 
_and_ out of order relative to how loop.c issued them (if they are queued 
and dequeued on a tree structure, for example). Perhaps loop.c or the fs 
under the loopdev (like bcachefs) needs a bit of help with completion 
notification (or ordering) in this case.
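
In other words - a hypothetical sketch with invented names, not actual
bcachefs or loop.c code - the ordering the upper filesystem depends on is
only safe if the backing filesystem completes the kiocb strictly last:

	static void backing_fs_dio_write_done(struct my_dio *dio)
	{
		/* 1: the data blocks are on stable storage by this point */
		commit_index_update(dio);	/* 2: journal/btree commit  */
		dio->iocb->ki_complete(dio->iocb, dio->bytes_written);
						/* 3: only now tell loop.c */
	}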

I'm not sure if this is the issue or not, so I'm just passing it along in 
case it helps.

-Eric


* Re: Comparison to ZFS and BTRFS
  2022-04-18 14:07   ` Demi Marie Obenour
@ 2022-04-19  1:35     ` Kent Overstreet
  2022-04-19 13:16       ` Demi Marie Obenour
  0 siblings, 1 reply; 10+ messages in thread
From: Kent Overstreet @ 2022-04-19  1:35 UTC (permalink / raw)
  To: Demi Marie Obenour; +Cc: linux-bcachefs

On Mon, Apr 18, 2022 at 10:07:38AM -0400, Demi Marie Obenour wrote:
> On Fri, Apr 15, 2022 at 03:11:40PM -0400, Kent Overstreet wrote:
> > On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > > How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> > > licensed under GPL-compatible terms is an advantage for inclusion in
> > > Linux, but I am more interested in the technical aspects.
> > > 
> > > - How does bcachefs avoid the nasty performance pitfalls that plague
> > >   BTRFS?  Are VM disks and databases on bcachefs fast?
> > 
> > Clean modular design (the result of years of slow incremental work), and a
> > _blazingly_ fast B+ tree implementation.
> > 
> > We're not fast in every situation yet. We don't have a nocow (non copy-on-write)
> > mode, and small random reads can be slow due to checksum granularity being at the
> > extent level (which is a good tradeoff in most situations, but we need an option
> > for smaller checksum granularity at some point).
> 
> How well does bcachefs handle writes to files that have extents shared
> (via reflinks or snapshots) with other files?  I would like to use
> bcachefs in Qubes OS once it reaches mainline, and in Qubes OS, each VM
> disk image is typically a snapshot of the previous revision.  Therefore,
> each write breaks sharing.  I am curious how well bcachefs handles this
> situation; I know that at least dm-thin is not optimized for it.  Also,
> for a file of size N, are reflinks O(N), or are they O(log N) or better?

O(N), but they're also cheap to overwrite.

> How much of a performance hit can one expect from erasure coding,
> compared to mirroring?

Should be very little, but it's not yet stable enough for real world performance
testing.

> Is there something lower-level available?  For instance, where should
> one look if they want to add (read-only) bcachefs support to GRUB?

The sanest thing to do would be to port bcachefs to grub. You can't read
anything without reading the journal and overlaying it over the btree if
you're not doing journal replay, so that's a lot of code that you really don't
want to rewrite - and just reading from btree nodes is non-trivial. Bcachefs
has been ported to userspace already, so it'd be a big undertaking but not
crazy.
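
Very roughly, the shape of the read path a port would have to reproduce
(sketch only, invented names):

	struct bkey *fs_lookup(struct fs *fs, struct bpos pos)
	{
		/* keys not yet replayed from the journal override the btree: */
		struct bkey *k = journal_overlay_lookup(fs->journal_keys, pos);

		if (k)
			return k;
		return btree_lookup(fs->btree_root, pos);	/* on-disk btree walk */
	}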

> Also, is it possible to mount a bcachefs filesystem off of a truly
> immutable volume?

Yes.

> > > - Can bcachefs use faster storage as a cache for slower storage, or
> > >   otherwise move data around based on usage patterns?
> > 
> > Yes.
> 
> I am not surprised, considering that bcachefs is based on bcache.  Is
> there any manual configuration required, or can bcachefs detect fast and
> slow storage automatically?  Also, does the data remain on the slow
> storage, or can bcachefs move frequently-used data entirely off of slow
> storage to make room for infrequently used data?

You should be reading the manual for these kinds of questions:
https://bcachefs.org/bcachefs-principles-of-operation.pdf

Long story short, you tell the IO path where to put things and it can be
configured filesystem wide, or per file/directory.

> 
> > > - Can bcachefs saturate your typical NVMe drive on realistic workloads?
> > >   Can it do so with encryption enabled?
> > 
> > This sounds like a question for someone interested in benchmarking :)
> 
> I would love to benchmark, but right now I don’t have any machines on
> which I am willing to install a bespoke kernel build.  I might be able
> to try bcachefs in a VM, though.  I’m also no expert in storage
> benchmarking.
> 
> > > - Is support for swap files on bcachefs planned?  That would require
> > >   being able to perform O_DIRECT asynchronous writes without any memory
> > >   allocations.
> > 
> > Yes, it's planned - the IO path already has the necessary support.
> 
> That is awesome!  Will it require disabling CoW or checksums, or will it
> work even with CoW and checksums enabled and without risking deadlocks?

Normal IO path, so CoW and checksums and encryption and all.

> 
> > > - Is bcachefs being used in production anywhere?
> > 
> > Yes
> 
> Are there any places that are willing to talk about their use of
> bcachefs?  Is bcachefs basically the WireGuard of filesystems?
> 
> A few other questions:
> 
> 1. What would it take for bcachefs to be buildable as a loadable kernel
>    module?  That would be much more convenient than building a kernel,
>    and might allow bcachefs to be packaged in distributions.

Not gonna happen. When I'm ready for more users I'll focus on upstreaming it,
right now I've still got bugs to fix :)

> 
> 2. Would it be possible to digitally sign releases?  The means to sign
>    them is not particularly relevant, so long as it is secure.  OpenPGP,
>    signify, minisign, and ssh-keygen -Y are all fine.
> 
> 3. Are there plans to add longer, random nonces to the encryption
>    implementation?  One long-term goal of Qubes OS is untrusted storage
>    domains, and that requires that encrypted bcachefs be safe against a
>    malicious block device.  A simple way to implement this is to use a
>    192-bit random nonce stored alongside each 128-bit authentication tag,
>    and use XChaCha20-Poly1305 as the cipher.  A 192-bit nonce is long
>    enough that one can safely pick a random number at each boot, and
>    then increment it for each encryption.  This also requires that any
>    data read from disk that has not been authenticated be treated as
>    untrusted.

Nonces are stored with pointers, not with the data they protect, so this isn't
necessary for what you're talking about - nonces are themselves encrypted and
authenticated, with a chain of trust up to the superblock, or journal after an
unclean shutdown.

However, the superblock isn't currently authenticated - that would be nice to
fix.


* Re: bcachefs loop devs (was: Comparison to ZFS and BTRFS)
  2022-04-19  1:16   ` bcachefs loop devs (was: Comparison to ZFS and BTRFS) Eric Wheeler
@ 2022-04-19  1:41     ` Kent Overstreet
  2022-04-19 20:42       ` bcachefs loop devs Eric Wheeler
  0 siblings, 1 reply; 10+ messages in thread
From: Kent Overstreet @ 2022-04-19  1:41 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Demi Marie Obenour, linux-bcachefs

On Mon, Apr 18, 2022 at 06:16:09PM -0700, Eric Wheeler wrote:
> On Fri, 15 Apr 2022, Kent Overstreet wrote:
> > On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > > - How does an O_DIRECT loop device on bcachefs compare to a zvol on ZFS?
> > 
> > I'd have to benchmark/profile it. According to xfstests, there are some bugs
> > in the way the loop driver in O_DIRECT mode interacts with bcachefs, and the
> > loopback driver is implemented in a more heavyweight way than it needs to be -
> > there's room for improvement.
> 
> Hi Kent, regarding loop devs:
> 
> We wrote this up before realizing that REQ_OP_FLUSH does not order writes 
> like REQ_FLUSH once did, so my premise in the email linked below was 
> incorrect - but perhaps the concept is still relevant.
> 
> I wonder if something is going on between (1) the filesystem backing loop.c 
> (bcachefs in this case), (2) block layer re-ordering, and (3) the kiocb 
> ki_complete callback in loop.c that could create out-of-order journal 
> commits in the filesystem above the loop device (eg, xfs):
> 
> 	https://www.spinics.net/lists/linux-block/msg82730.html
> 
>   From loop.c in lo_rw_aio():
> 	[...]
> 	cmd->iocb.ki_pos = pos;
> 	cmd->iocb.ki_filp = file;
> 	cmd->iocb.ki_complete = lo_rw_aio_complete; 
> 	cmd->iocb.ki_flags = IOCB_DIRECT;
> 	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> 
>   A more detailed loop.c call tree summary is here:
> 	https://lore.kernel.org/all/59a58637-837-fc28-6cb9-d584aa21d60@ewheeler.net/T/ 
> 
> If bcachefs immediately calls .ki_complete() after queueing the IO within 
> bcachefs but before it commits to bcachefs's disk, then loop.c will mark 
> the IO as complete (blk_mq_complete_request via lo_rw_aio_complete) too 
> soon after .write_iter is called, thus breaking the expected ordering in 
> the filesystem (eg, xfs) atop the loop device.

We don't call .ki_complete (in DIO mode) until the write has completed,
including the btree update - this is necessary for read-after-write consistency.

If your description of the loopback code is correct, that does sound
suspicious though - queuing every IO to a work item shouldn't hurt anything
from a correctness POV, but it definitely shouldn't be needed or wanted from
a performance POV.

What are you seeing?


* Re: Comparison to ZFS and BTRFS
  2022-04-19  1:35     ` Kent Overstreet
@ 2022-04-19 13:16       ` Demi Marie Obenour
  0 siblings, 0 replies; 10+ messages in thread
From: Demi Marie Obenour @ 2022-04-19 13:16 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs


On Mon, Apr 18, 2022 at 09:35:34PM -0400, Kent Overstreet wrote:
> On Mon, Apr 18, 2022 at 10:07:38AM -0400, Demi Marie Obenour wrote:
> > On Fri, Apr 15, 2022 at 03:11:40PM -0400, Kent Overstreet wrote:
> > > On Wed, Apr 06, 2022 at 02:55:04AM -0400, Demi Marie Obenour wrote:
> > > > How does bcachefs manage to outperform ZFS and BTRFS?  Obviously being
> > > > licensed under GPL-compatible terms is an advantage for inclusion in
> > > > Linux, but I am more interested in the technical aspects.
> > > > 
> > > > - How does bcachefs avoid the nasty performance pitfalls that plague
> > > >   BTRFS?  Are VM disks and databases on bcachefs fast?
> > > 
> > > Clean modular design (the result of years of slow incremental work), and a
> > > _blazingly_ fast B+ tree implementation.
> > > 
> > > We're not fast in every situation yet. We don't have a nocow (non copy-on-write)
> > > mode, and small random reads can be slow due to checksum granularity being at the
> > > extent level (which is a good tradeoff in most situations, but we need an option
> > > for smaller checksum granularity at some point).
> > 
> > How well does bcachefs handle writes to files that have extents shared
> > (via reflinks or snapshots) with other files?  I would like to use
> > bcachefs in Qubes OS once it reaches mainline, and in Qubes OS, each VM
> > disk image is typically a snapshot of the previous revision.  Therefore,
> > each write breaks sharing.  I am curious how well bcachefs handles this
> > situation; I know that at least dm-thin is not optimized for it.  Also,
> > for a file of size N, are reflinks O(N), or are they O(log N) or better?
> 
> O(N), but they're also cheap to overwrite.

That’s understandable, if somewhat unfortunate.  If the constant factor
is small enough it should not be too big of a problem in practice,
unless the files are huge.  Qubes OS also has an optimization that
allows the reflinks to be created in the background, rather than when
users are waiting on them.  Are there optimizations for
already-reflinked files?  Or are subvolumes better for this use-case?

> > How much of a performance hit can one expect from erasure coding,
> > compared to mirroring?
> 
> Should be very little, but it's not yet stable enough for real world performance
> testing.

Thanks!

> > Is there something lower-level available?  For instance, where should
> > one look if they want to add (read-only) bcachefs support to GRUB?
> 
> The sanest thing to do would be to port bcachefs to grub. You can't read
> anything without reading the journal and overlaying it over the btree if
> you're not doing journal replay, so that's a lot of code that you really don't
> want to rewrite - and just reading from btree nodes is non-trivial. Bcachefs
> has been ported to userspace already, so it'd be a big undertaking but not
> crazy.

That makes sense.  grub has a policy of never mutating anything except a
tiny environment block, but that is equivalent to ‘-o nochanges’.

> > Also, is it possible to mount a bcachefs filesystem off of a truly
> > immutable volume?
> 
> Yes.

Thanks.  I was worried that this was not possible without replaying the
journal.  I should have read the manual first :).

> > > > - Can bcachefs use faster storage as a cache for slower storage, or
> > > >   otherwise move data around based on usage patterns?
> > > 
> > > Yes.
> > 
> > I am not surprised, considering that bcachefs is based on bcache.  Is
> > there any manual configuration required, or can bcachefs detect fast and
> > slow storage automatically?  Also, does the data remain on the slow
> > storage, or can bcachefs move frequently-used data entirely off of slow
> > storage to make room for infrequently used data?
> 
> You should be reading the manual for these kinds of questions:
> https://bcachefs.org/bcachefs-principles-of-operation.pdf

Indeed I should, sorry!

> Long story short, you tell the IO path where to put things and it can be
> configured filesystem wide, or per file/directory.

Nice!  I was especially impressed by this: “Devices need not have the
same performance characteristics: we track device IO latency and direct
reads to the device that is currently fastest.”  That adaptive behavior
is something I would have expected from a high-end storage array.
Having it in an open source filesystem will be amazing.

> > > > - Can bcachefs saturate your typical NVMe drive on realistic workloads?
> > > >   Can it do so with encryption enabled?
> > > 
> > > This sounds like a question for someone interested in benchmarking :)
> > 
> > I would love to benchmark, but right now I don’t have any machines on
> > which I am willing to install a bespoke kernel build.  I might be able
> > to try bcachefs in a VM, though.  I’m also no expert in storage
> > benchmarking.
> > 
> > > > - Is support for swap files on bcachefs planned?  That would require
> > > >   being able to perform O_DIRECT asynchronous writes without any memory
> > > >   allocations.
> > > 
> > > Yes, it's planned - the IO path already has the necessary support.
> > 
> > That is awesome!  Will it require disabling CoW or checksums, or will it
> > work even with CoW and checksums enabled and without risking deadlocks?
> 
> Normal IO path, so CoW and checksums and encryption and all.

That is incredible.

> > > > - Is bcachefs being used in production anywhere?
> > > 
> > > Yes
> > 
> > Are there any places that are willing to talk about their use of
> > bcachefs?  Is bcachefs basically the WireGuard of filesystems?
> > 
> > A few other questions:
> > 
> > 1. What would it take for bcachefs to be buildable as a loadable kernel
> >    module?  That would be much more convenient than building a kernel,
> >    and might allow bcachefs to be packaged in distributions.
> 
> Not gonna happen. When I'm ready for more users I'll focus on upstreaming it,
> right now I've still got bugs to fix :)

And I am glad that is your priority :).  A stable, high-quality
filesystem is worth the wait.

> > 2. Would it be possible to digitally sign releases?  The means to sign
> >    them is not particularly relevant, so long as it is secure.  OpenPGP,
> >    signify, minisign, and ssh-keygen -Y are all fine.
> > 
> > 3. Are there plans to add longer, random nonces to the encryption
> >    implementation?  One long-term goal of Qubes OS is untrusted storage
> >    domains, and that requires that encrypted bcachefs be safe against a
> >    malicious block device.  A simple way to implement this is to use a
> >    192-bit random nonce stored alongside each 128-bit authentication tag,
> >    and use XChaCha20-Poly1305 as the cipher.  A 192-bit nonce is long
> >    enough that one can safely pick a random number at each boot, and
> >    then increment it for each encryption.  This also requires that any
> >    data read from disk that has not been authenticated be treated as
> >    untrusted.
> 
> Nonces are stored with pointers, not with the data they protect, so this isn't
> necessary for what you're talking about - nonces are themselves encrypted and
> authenticated, with a chain of trust up to the superblock, or journal after an
> unclean shutdown.

The problem with this approach is a whole-volume replay attack.  It’s
easy for a malicious storage device to roll back the entire volume, but
keep a snapshot for future use.  The next time the volume is mounted,
bcachefs might reuse the same nonces, but with different data.  Disaster
ensues.  Adding randomness is necessary to prevent this, and the
approach I recommended is the simplest one I am aware of.  In
cryptography, simpler is generally better.  I see that a ‘wide_macs’
option is available; could this be an extension of that?

> However, the superblock isn't currently authenticated - that would be nice to
> fix.

It would indeed; I will file an issue for that if none has already
been filed.  How is the journal handled?  For instance, could each
journal entry have a MAC or hash of the previous one, with the
superblock having a MAC or hash of the most recent journal entry as well
as a pointer to the first one?
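
Schematically, what I have in mind (field names invented):

	struct journal_entry {
		unsigned char prev_mac[16];	/* MAC of the previous entry   */
		/* ... payload ... */
		unsigned char mac[16];		/* MAC over prev_mac + payload */
	};

	struct superblock {
		unsigned char newest_journal_mac[16];	/* anchors the chain      */
		unsigned long long oldest_journal_seq;	/* where the chain starts */
	};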

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



* Re: bcachefs loop devs
  2022-04-19  1:41     ` Kent Overstreet
@ 2022-04-19 20:42       ` Eric Wheeler
  2022-06-02  8:45         ` Demi Marie Obenour
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wheeler @ 2022-04-19 20:42 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Demi Marie Obenour, linux-bcachefs

On Mon, 18 Apr 2022, Kent Overstreet wrote:
> On Mon, Apr 18, 2022 at 06:16:09PM -0700, Eric Wheeler wrote:
> > If bcachefs immediately calls .ki_complete() after queueing the IO within 
> > bcachefs but before it commits to bcachefs's disk, then loop.c will mark 
> > the IO as complete (blk_mq_complete_request via lo_rw_aio_complete) too 
> > soon after .write_iter is called, thus breaking the expected ordering in 
> > the filesystem (eg, xfs) atop the loop device.
> 
> We don't call .ki_complete (in DIO mode) until the write has completed,
> including the btree update - this is necessary for read-after-write consistency.

Good, I figured it would and thought I would ask in case that was the 
issue.  
 
> If your description of the loopback code is correct, that does sound
> suspicious though - queuing every IO to a work item shouldn't hurt anything
> from a correctness POV, but it definitely shouldn't be needed or wanted from
> a performance POV.

REQ_OP_FLUSH just calls vfs_fsync (not WQ-queued) and all READ/WRITE IOs
hit the WQ.  Parallel per-socket WQs might help performance, since the block
layer doesn't care about ordering and filesystems (or at least bcachefs!)
call ki_complete() after the write finishes, so consistency should be ok.

Generally speaking I avoid loop devs for production systems unless
absolutely necessary.

> What are you seeing?

Nothing real-world.

I was just reviewing loop.c in preparation for leaving bcache+dm-thin
for bcachefs+loop to see if there are any DIO issues to consider.

IMHO, it would be neat to have native bcachefs block devices and avoid
the weird loop.c serial WQ (and possibly other issues loop.c has to deal
with that native bcachefs wouldn't).

This is a possible workflow for native bcachefs devices.  Since bcachefs 
is awesome, it would provide SSD caching, snapshots, encryption, and raw 
DIO block devices into VMs:

	]# bcachefs subvolume create /volumes/vol1
	]# truncate -s 1T /volumes/vol1/data.raw
	]# bcachefs blkdev register /volumes/vol1/data.raw
	/dev/bcachefs0
	]# bcachefs subvolume snapshot /volumes/vol1 /volumes/2022-04-19_vol1
	]# bcachefs blkdev register /volumes/2022-04-19_vol1/data.raw
	/dev/bcachefs1
	]# bcachefs blkdev unregister /dev/bcachefs0

And udev could be made to do something like this:
	]# ls -l /dev/bcachefs/volumes/vol1/data.raw
	lrwxrwxrwx 1 root root 7 Apr  9 17:35   data.raw -> /dev/bcachefs0

Which means the VM can have its disk defined as 
/dev/bcachefs/volumes/vol1/data.raw in its libvirt config, and thus point 
at a real block device!

That would make bcachefs the most awesome disk volume manager, ever!




* Re: bcachefs loop devs
  2022-04-19 20:42       ` bcachefs loop devs Eric Wheeler
@ 2022-06-02  8:45         ` Demi Marie Obenour
  0 siblings, 0 replies; 10+ messages in thread
From: Demi Marie Obenour @ 2022-06-02  8:45 UTC (permalink / raw)
  To: Eric Wheeler, Kent Overstreet; +Cc: linux-bcachefs


On Tue, Apr 19, 2022 at 01:42:49PM -0700, Eric Wheeler wrote:
> On Mon, 18 Apr 2022, Kent Overstreet wrote:
> > On Mon, Apr 18, 2022 at 06:16:09PM -0700, Eric Wheeler wrote:
> > > If bcachefs immediately calls .ki_complete() after queueing the IO within 
> > > bcachefs but before it commits to bcachefs's disk, then loop.c will mark 
> > > the IO as complete (blk_mq_complete_request via lo_rw_aio_complete) too 
> > > soon after .write_iter is called, thus breaking the expected ordering in 
> > > the filesystem (eg, xfs) atop the loop device.
> > 
> > We don't call .ki_complete (in DIO mode) until the write has completed,
> > including the btree update - this is necessary for read-after-write consistency.
> 
> Good, I figured it would and thought I would ask in case that was the 
> issue.  
>  
> > If your description of the loopback code is correct, that does sound
> > suspicious though - queuing every IO to a work item shouldn't hurt anything
> > from a correctness POV, but it definitely shouldn't be needed or wanted
> > from a performance POV.
> 
> REQ_OP_FLUSH just calls vfs_fsync (not WQ-queued) and all READ/WRITE IOs
> hit the WQ.  Parallel per-socket WQs might help performance, since the block
> layer doesn't care about ordering and filesystems (or at least bcachefs!)
> call ki_complete() after the write finishes, so consistency should be ok.
> 
> Generally speaking I avoid loop devs for production systems unless
> absolutely necessary.
> 
> > What are you seeing?
> 
> Nothing real-world.
> 
> I was just reviewing loop.c in preparation for leaving bcache+dm-thin
> for bcachefs+loop to see if there are any DIO issues to consider.
> 
> IMHO, it would be neat to have native bcachefs block devices and avoid
> the weird loop.c serial WQ (and possibly other issues loop.c has to deal
> with that native bcachefs wouldn't).
> 
> This is a possible workflow for native bcachefs devices.  Since bcachefs 
> is awesome, it would provide SSD caching, snapshots, encryption, and raw 
> DIO block devices into VMs:
> 
> 	]# bcachefs subvolume create /volumes/vol1
> 	]# truncate -s 1T /volumes/vol1/data.raw
> 	]# bcachefs blkdev register /volumes/vol1/data.raw
> 	/dev/bcachefs0
> 	]# bcachefs subvolume snapshot /volumes/vol1 /volumes/2022-04-19_vol1
> 	]# bcachefs blkdev register /volumes/2022-04-19_vol1/data.raw
> 	/dev/bcachefs1
> 	]# bcachefs blkdev unregister /dev/bcachefs0
> 
> And udev could be made to do something like this:
> 	]# ls -l /dev/bcachefs/volumes/vol1/data.raw
> 	lrwxrwxrwx 1 root root 7 Apr  9 17:35   data.raw -> /dev/bcachefs0
> 
> Which means the VM can have its disk defined as 
> /dev/bcachefs/volumes/vol1/data.raw in its libvirt config, and thus point 
> at a real block device!
> 
> That would make bcachefs the most awesome disk volume manager, ever!

Kent, if you do decide to go this route, please use the disk sequence
number as the number part of the device name.  So instead of
/dev/bcachefs<minor>, it would be /dev/bcachefs<diskseq>.  The latter is
guaranteed to never be reused, while the former is not.

Yes, other block device drivers all have the same problem, but I would
rather fix it in at least one of them.  Also, this would mean that
opening /dev/bcachefs/volumes/something would be just as race-free as
opening a filesystem path, which otherwise could not be guaranteed
without some additional kernel support.
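
For example (userspace sketch - BLKGETDISKSEQ exists since Linux 5.15;
everything else here is illustrative):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* BLKGETDISKSEQ */

	/* Open a block device node and verify it is the device instance
	 * we expect; a recycled minor number fails the diskseq check. */
	int open_by_diskseq(const char *path, __u64 expected)
	{
		__u64 seq;
		int fd = open(path, O_RDWR | O_CLOEXEC);

		if (fd < 0)
			return -1;
		if (ioctl(fd, BLKGETDISKSEQ, &seq) != 0 || seq != expected) {
			close(fd);	/* stale or reused device node */
			return -1;
		}
		return fd;
	}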

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


