* Btrfs on LUKS on loopback mounted image on Btrfs
@ 2019-08-21 19:42 Chris Murphy
  2019-08-21 20:12 ` Roman Mamedov
  0 siblings, 1 reply; 3+ messages in thread
From: Chris Murphy @ 2019-08-21 19:42 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi,

Why do this? a) compression for home, b) encryption for home, c) home
is portable because it's a file, d) I still get Btrfs snapshots
anywhere (I tend to snapshot subvolumes inside of cryptohome, but I
could "snapshot" outside of it by reflink-copying the backing file).
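As a sketch, such a reflink "snapshot" of the backing file could look
like this (paths are hypothetical; --reflink only shares extents on
file systems that support it, such as Btrfs):

```shell
set -e
# Stand-in backing file for demonstration (the real one is 4GiB).
truncate -s 1M /tmp/home.img
# Reflink-copy: on Btrfs this shares extents instead of duplicating
# data; --reflink=auto falls back to a plain copy elsewhere,
# --reflink=always would refuse instead.
cp --reflink=auto /tmp/home.img /tmp/home.img.snap
```

Note this copies whatever is on disk at that instant; consistency
while the stack is mounted is a separate question.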

But I'm curious about others' experiences. I've been doing this for a
while (a few years) in the following configuration.

NVMe -> plain partition -> Btrfs sysroot -> fallocated file with
chattr +C applied -> attached to a loop device -> luksOpen -> Btrfs on
the dm-crypt device -> mounted at /home.

sysroot mkfs options: -dsingle -msingle
cryptohome mkfs options: -M

Btrfs sysroot mount options: noatime,compress=zstd:3,ssd,space_cache=v2
dmcrypt discard passthrough is enabled
Btrfs crypto home mount options: noatime,compress=zstd:3,ssd,space_cache=v2
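For reference, the stack above might be assembled roughly like this;
device paths, file names, and sizes are hypothetical, and every
command needs root:

```shell
# Create the backing file on the Btrfs sysroot; +C must be set while
# the file is empty for NOCOW to take effect.
truncate -s 0 /var/home.img
chattr +C /var/home.img
fallocate -l 4G /var/home.img

# Attach it to a loop device and set up LUKS on top.
LOOP=$(losetup --find --show /var/home.img)
cryptsetup luksFormat "$LOOP"
cryptsetup open --allow-discards "$LOOP" cryptohome

# Mixed block groups (-M) for the small encrypted file system.
mkfs.btrfs -M /dev/mapper/cryptohome
mount -o noatime,compress=zstd:3,ssd,space_cache=v2 \
    /dev/mapper/cryptohome /home
```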

Ergo, pretty much the same, except the smallish home uses mixed block
groups. I did that mainly to avoid any balance-related issues in
home, figuring the allocation behavior at this layer is
irrelevant/virtual anyway. The Btrfs on top of the actual device does
use separate block groups, and sees the "stream" from the loop device
as all data.

I have done some crazy things with this: I routinely, intentionally
force power off on the laptop while this is all assembled as
described. Literally hundreds of times. Zero complaints from either
Btrfs file system (no mount-time complaints, no btrfs check
complaints, no scrub complaints, no Firefox database complaints). I
admit I do not often do super crazy things like simultaneous heavy
writes to both sysroot and home, and *then* force the power off. I
have done it, just not enough times to say for sure it's not possible
to corrupt either of these file systems.

I have not benchmarked this setup at all, but I notice no unusual
latency. It might exist; it's just that my regular use cases don't
show any additional latency (I do go back and forth between a crypto
home and a plaintext home on the same system). For VMs, the images
tend to be +C raw images in /var/lib/libvirt/images; but a valid use
case exists for VM user sessions, including GNOME Boxes, which
creates a qcow2 file in /home. That's a curious case I haven't
tested. There's now a new virtio-fs driver that might be better for
this use case, and could directly use a subvolume in cryptohome, with
no VM backing file needed. (?)

Cryptohome is subject to fstrim.timer, which passes through and
punches holes in the file just fine. But, as a consequence of this
entire arrangement, the loopback-mounted file fragments quite a lot.
It's only a 4GiB image file, not even half full, and the file has
18000+ fragments. I never defragment it, and I don't use autodefrag.
But I'm on NVMe, which has very low latency and supports multiqueue.
I think it would be a problem on a conventional single-queue SATA SSD
or HDD.
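The fragment count and the trim pass-through can be observed with
standard tools (hypothetical path; fstrim needs root and a mounted
file system that supports discard):

```shell
# Count the extents of the backing file; hole-punching from trims
# inside the guest file system shows up here as extra extents.
filefrag /var/home.img

# Trim from inside the mounted cryptohome, propagating discards down
# through dm-crypt (with discard passthrough enabled) and the loop
# device into hole punches on the backing file.
fstrim -v /home
```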

And to amp this up a notch, I wonder about parallelism or multiqueue
limitations of the loop device. I know XFS and Btrfs both leverage
parallelism quite a bit.

Anyway, the point is, I'm curious about this arrangement, other
arrangements, and avoiding pathological cases.

-- 
Chris Murphy


* Re: Btrfs on LUKS on loopback mounted image on Btrfs
  2019-08-21 19:42 Btrfs on LUKS on loopback mounted image on Btrfs Chris Murphy
@ 2019-08-21 20:12 ` Roman Mamedov
  2019-08-22 21:21   ` Chris Murphy
  0 siblings, 1 reply; 3+ messages in thread
From: Roman Mamedov @ 2019-08-21 20:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, 21 Aug 2019 13:42:53 -0600
Chris Murphy <lists@colorremedies.com> wrote:

> Why do this? a) compression for home, b) encryption for home, c) home
> is portable because it's a file, d) I still get btrfs snapshots
> anywhere (I tend to snapshot subvolumes inside of cryptohome; but

Storing Btrfs on Btrfs really feels suboptimal; good that at least you
are using NOCOW, though of course it will still be CoW'ed in the case
of snapshots. Also, you are likely to run into the space-wasting issue
discussed in
https://www.spinics.net/lists/linux-btrfs/msg90352.html

I'd strongly suggest that you look into deploying LVM Thin instead. There you
can specify an arbitrary CoW chunk size, and a value such as 1MB or more will
reduce management overhead and fragmentation dramatically.
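A sketch of that suggestion, with a hypothetical volume group vg0 and
hypothetical sizes (needs root):

```shell
# Thin pool with a 1MiB chunk size, so CoW happens at a coarse
# granularity instead of per-4KiB-block.
lvcreate --type thin-pool --size 20G --chunksize 1m --name homepool vg0

# Thin volume to hold the LUKS + Btrfs stack for /home.
lvcreate --thin --virtualsize 4G --name cryptohome vg0/homepool
```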

Or, if the partition size in question is just 4GB, then with today's
SSD sizes just store it as a regular LV and save quite a bit of
complexity and brittleness.

> I could "snapshot" outside of it by reflink copying the backing file.

Pretty sure "cp -a" is not atomic, so beware: you cannot safely do
this while /home is open and mounted. On the other hand, if you keep
this file inside a subvolume and then snapshot the subvolume, it is
safe(r).
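In other words, something like this (hypothetical paths; needs root
and a Btrfs file system):

```shell
# Keep the backing image in its own subvolume.
btrfs subvolume create /var/images
# ... place home.img inside /var/images ...

# Snapshotting the subvolume is atomic even while the loop/LUKS stack
# is active, though the captured state is crash-consistent rather
# than a clean unmount.
btrfs subvolume snapshot -r /var/images /var/images-snap
```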

> sysroot mkfs options: -dsingle -msingle

This is asking for trouble. Even if, as you said, you power-cut it
constantly, there is little reason to run with "single" metadata, not
even on SSDs. Some insinuate that "DUP" is always magically 100%
deduped internally by the SSD during writes at speeds of 600-2500
MB/sec; but since we can't see the internals, and SSD firmware is
proprietary, nobody can reliably confirm or deny that, and it seems
very unlikely. More importantly, there are other places where one
(and in your case the only) copy of metadata might get corrupted:
RAM, the storage controller, cabling. Even a sudden poweroff has more
chance of finally doing its thing when there's no possible "other
copy of metadata" to refer to, and the broken one is all you get.
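For comparison, DUP metadata is chosen at mkfs time, or converted in
place with a balance (hypothetical device and mount point; mkfs
destroys existing contents):

```shell
# Two copies of every metadata block, single copy of data.
mkfs.btrfs -m dup -d single /dev/mapper/cryptohome

# Or convert an existing file system's metadata in place:
btrfs balance start -mconvert=dup /mnt/point

# Check which profiles are actually in use:
btrfs filesystem df /mnt/point
```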

-- 
With respect,
Roman


* Re: Btrfs on LUKS on loopback mounted image on Btrfs
  2019-08-21 20:12 ` Roman Mamedov
@ 2019-08-22 21:21   ` Chris Murphy
  0 siblings, 0 replies; 3+ messages in thread
From: Chris Murphy @ 2019-08-22 21:21 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Chris Murphy, Btrfs BTRFS

On Wed, Aug 21, 2019 at 2:12 PM Roman Mamedov <rm@romanrm.net> wrote:
>
> On Wed, 21 Aug 2019 13:42:53 -0600
> Chris Murphy <lists@colorremedies.com> wrote:
>
> > Why do this? a) compression for home, b) encryption for home, c) home
> > is portable because it's a file, d) I still get btrfs snapshots
> > anywhere (I tend to snapshot subvolumes inside of cryptohome; but
>
> Storing Btrfs on Btrfs really feels suboptimal,

Yes, although not having native encryption in Btrfs is also
suboptimal. That leaves a separate file system on LUKS for /home.

>good that at least you are
> using NOCOW; of course it still will be CoW'ed in case of snapshots. Also you
> are likely to run into the space wasting issue as discussed in
> https://www.spinics.net/lists/linux-btrfs/msg90352.html

Interesting problem. I think it can mostly be worked around by
snapshotting the "upper" (plaintext) file system subvolumes, rather
than the ciphertext backing file.

> I'd strongly suggest that you look into deploying LVM Thin instead. There you
> can specify an arbitrary CoW chunk size, and a value such as 1MB or more will
> reduce management overhead and fragmentation dramatically.

Yes, on paper LVM thinp is well suited for this. I used to use it
quite a lot for throwaway VMs; it's not directly supported by
virt-manager, but it is possible to add a thin LV using virsh. The
thing is, for mortal users it's even more complicated than plain LVM,
both conceptually and should any repairs be needed. I'm looking for
something simpler that doesn't depend on LVM.

> > sysroot mkfs options: -dsingle -msingle
>
> This is asking for trouble, even if you said you power-cut it constantly,

In any case, if the hardware is working correctly, the file system is
always consistent regardless of how many copies of metadata there
are. I'm not sure what this gets me, even hypothetically speaking,
setting aside that the upstream default is single metadata for all
SSDs.

The file system definitely needs one copy committed to stable media;
two copies don't improve the chances of commitment to stable media.
Two copies are insurance against subsequent corruption. There's no
such thing as torn or redirected writes with SSDs. If the first
committed copy is corrupt but the second isn't, then Btrfs
automatically recovers and repairs the bad copy. But I don't see how
it improves the chance of data or metadata getting onto stable media.

If anything, the slight additional latency of writing out a second
copy delays writing the superblock that points to the new tree roots.
So it improves handling of corruption, but maybe increases the chance
of an automatic rollback to an older tree at the next mount?


-- 
Chris Murphy

