APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?

linux-btrfs.vger.kernel.org archive mirror

* APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
@ 2019-06-11 18:31 Neal Gompa
  2019-06-12  4:03 ` Chris Murphy
  0 siblings, 1 reply; 7+ messages in thread
From: Neal Gompa @ 2019-06-11 18:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chris Murphy, Josef Bacik, David Sterba

Hey,

So Apple held its WWDC event last week, and among other things, they
talked about improvements they've made to filesystems in macOS[1].

One of the things introduced was the concept of "firm links", which
are something like NTFS directory junctions, except that they can
cross (sub)volumes. This concept makes it easier to handle uglier
layouts. While bind mounts work kind of okay for this in simpler
configurations, they require operating system awareness rather than
being set up automatically as the volume is mounted. Firm links are
less brittle and work better for recovery environments, and they help
make it easier to do read-only system volumes while supporting
read-write sections in a more flexible way.

For example, this would be useful if a volume has two subvolumes: OS
and data. OS would have /usr and data would have /var and /home. But,
importantly, a couple of system data things need to be part of the OS
that are on /var: /var/lib/rpm and /var/lib/alternatives. These two
belong with the OS, and it's incredibly difficult to move them around
due to all kinds of ecosystem knock-on effects. (If you want to know
more about that, just ask the SUSE kiwi team... it's the gift that
keeps on giving...). Both /var/lib/rpm and /var/lib/alternatives are
part of the OS, but they're in /var. It'd be great to stitch that in
from the read-only OS volume into the /var subvolume so that it's
actually part of the OS volume even though it looks like it's in the
data one. It's completely transparent to everything. Supporting atomic
updates (with something like a dnf plugin) becomes much easier because
we can trigger snapshot and subvolume mounts while preserving enough
structure to make things work. In this circumstance, we can flip the
properties so that the new location has a rw OS and ro data volume
mount for doing only software updates (or leave data volume rw during
this transaction and merge the changes back into the OS). We could
also do creative things with /etc if we so wish...

Another thing that APFS seems to support now is creating linked
snapshots (snapshots of multiple subvolumes that are paired together
as a single snapshot) for full system replication. Obviously, with firm
links, it makes sense to be able to do such a thing so that full
system replication works properly. As far as I know, it shouldn't be a
difficult concept to implement in Btrfs, but I guess it wouldn't be
really necessary if we don't have firm links...

What do you guys think?

[1]: https://developer.apple.com/videos/play/wwdc2019/710/

--
真実はいつも一つ！/ Always, there's only one truth!

* Re: APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
  2019-06-11 18:31 APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs? Neal Gompa
@ 2019-06-12  4:03 ` Chris Murphy
  2019-06-12  8:06   ` Neal Gompa
  2019-06-12  9:58   ` David Sterba
  0 siblings, 2 replies; 7+ messages in thread
From: Chris Murphy @ 2019-06-12  4:03 UTC (permalink / raw)
  To: Neal Gompa; +Cc: Btrfs BTRFS, Chris Murphy, Josef Bacik, David Sterba

On Tue, Jun 11, 2019 at 12:31 PM Neal Gompa <ngompa13@gmail.com> wrote:
>
> Hey,
>
> So Apple held its WWDC event last week, and among other things, they
> talked about improvements they've made to filesystems in macOS[1].
>
> Among other things, one of the things introduced was a concept of
> "firm links", which is something like NTFS' directory junctions,
> except they can cross (sub)volumes.

My understanding is it's a workaround for APFS lacking support for
directory hardlinks. Btrfs does support directory hardlinks but a
hardlink points to a particular inode within a particular subvolume
(files tree) so it's not possible to have a hard link that crosses
subvolumes. A reflink can already do this, but it's really just an
efficient copy; the resulting directory is independent. A directory
symlink can mirror a directory across subvolumes, but like any symlink
it must have a fixed path available to always find the real deal.
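
To make the reflink-versus-symlink distinction concrete, here is a
minimal sketch. It assumes GNU cp (for --reflink=auto, which falls back
to an ordinary copy where reflinks are not supported) and uses plain
directories under a temporary path as stand-ins for subvolumes; the
os/ and data/ names are made up for illustration.

```shell
#!/bin/sh
# Stand-in layout: "os" and "data" are ordinary directories here, not
# real Btrfs subvolumes (an assumption made so this runs anywhere).
tmp=$(mktemp -d)
mkdir -p "$tmp/os/rpm" "$tmp/data"
echo v1 > "$tmp/os/rpm/db"

# Reflink-style copy: shares extents where the filesystem supports it,
# but the result is an independent directory tree.
cp -a --reflink=auto "$tmp/os/rpm" "$tmp/data/rpm-copy"

# Directory symlink: no copy at all, just a fixed path to the original.
ln -s "$tmp/os/rpm" "$tmp/data/rpm-link"

# Change the original afterwards.
echo v2 > "$tmp/os/rpm/db"
copy_sees=$(cat "$tmp/data/rpm-copy/db")
link_sees=$(cat "$tmp/data/rpm-link/db")

# Move the original away: the symlink's fixed path no longer resolves.
mv "$tmp/os/rpm" "$tmp/os/rpm.moved"
if [ -e "$tmp/data/rpm-link" ]; then link_state=ok; else link_state=dangling; fi

printf 'copy sees: %s\nlink sees: %s\nlink after move: %s\n' \
    "$copy_sees" "$link_sees" "$link_state"
rm -rf "$tmp"
```

The copy keeps the old content while the symlink follows the original,
and the symlink stops resolving once the original path changes. Neither
behaves like a link that stays attached to the same directory object.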

I think a firm-link-like thing on Btrfs would require a format change,
but I'm not certain. My best guess at what it'd be is a dir/file
object that gets its own inode but contains a hard reference (not an
independent object) to a subvolid+inode.

> This concept makes it easier to
> handle uglier layouts. While bind mounts work kind of okay for this
> with simpler configurations, it requires operating system awareness,
> rather than being set up automatically as the volume is mounted. This
> is less brittle and works better for recovery environments, and helps
> make it easier to do read-only system volumes while supporting read-write
> sections in a more flexible way.

There are a couple of things going on. One is that something between
VFS and Btrfs makes this goofy assumption that bind mounts are
subvolumes, which is definitely not true. I bring this up here:
https://lore.kernel.org/linux-btrfs/CAJCQCtT=-YoFJgEo=BFqfiPdtMoJCYR3dJPSekf+HQ22GYGztw@mail.gmail.com/

Near as I can tell, Btrfs kernel code just needs to be smarter about
distinguishing between bind mounts of directories and the
behind-the-scenes bind mount used for subvolumes mounted using
-o subvol= or -o subvolid=; I don't think that's difficult. It's just
that someone needs to work through the logic and set aside the
resources to do it.

Second, the FHS is a PITA anyway, but it really shows its unhelpful
ways when it comes to read-only, recoverable/resettable systems. Just
see the massively complicated subvolume carveouts openSUSE has to do
when installed on Btrfs, and the even more complicated gymnastics
libostree is doing on the various rpm-ostree variants including Fedora
Silverblue.

Apple, a long long time ago said, fuck that insanity, we're burying
the FHS so mortal users can't see that shit. And we're going to have a
plain language set of directories for, you know, actual people who
need to get work done.

So definitely consider me in the camp of the FHS making life harder, not easier.

>
> For example, this would be useful if a volume has two subvolumes: OS
> and data. OS would have /usr and data would have /var and /home. But,
> importantly, a couple of system data things need to be part of the OS
> that are on /var: /var/lib/rpm and /var/lib/alternatives. These two
> belong with the OS, and it's incredibly difficult to move it around
> due to all kinds of ecosystem knock-on effects. (If you want to know
> more about that, just ask the SUSE kiwi team... it's the gift that
> keeps on giving...). Both /var/lib/rpm and /var/lib/alternatives are
> part of the OS, but they're in /var. It'd be great to stitch that in
> from the read-only OS volume into the /var subvolume so that it's
> actually part of the OS volume even though it looks like it's in the
> data one. It's completely transparent to everything. Supporting atomic
> updates (with something like a dnf plugin) becomes much easier because
> we can trigger snapshot and subvolume mounts with preserving enough
> structure to make things work. In this circumstance, we can flip the
> properties so that the new location has a rw OS and ro data volume
> mount for doing only software updates (or leave data volume rw during
> this transaction and merge the changes back into the OS). We could
> also do creative things with /etc if we so wish...

Is it really best to do this in Btrfs proper, rather than in VFS?

> Another thing that APFS seems to support now is creating linked
> snapshots (snapshots of multiple subvolumes that are paired together
> as single snapshot) for full system replication. Obviously, with firm
> links, it makes sense to be able to do such a thing so that full
> system replication works properly. As far as I know, it shouldn't be a
> difficult concept to implement in Btrfs, but I guess it wouldn't be
> really necessary if we don't have firm links...

Right now a subvolume is really just a files tree. It's not as
separate from the pool as it might seem, compared to a ZFS dataset or
what I guess is called a volume in APFS. To do this on Btrfs probably
requires another disk format change. My guess is something based on
the seed-sprout feature, but without the mandatory 2nd block device
for the sprout, i.e. freeze all the trees.

--
Chris Murphy

* Re: APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
  2019-06-12  4:03 ` Chris Murphy
@ 2019-06-12  8:06   ` Neal Gompa
  2019-06-12 20:02     ` Chris Murphy
  2019-06-12  9:58   ` David Sterba
  1 sibling, 1 reply; 7+ messages in thread
From: Neal Gompa @ 2019-06-12  8:06 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Josef Bacik, David Sterba

On Wed, Jun 12, 2019 at 12:04 AM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Tue, Jun 11, 2019 at 12:31 PM Neal Gompa <ngompa13@gmail.com> wrote:
> >
> > Hey,
> >
> > So Apple held its WWDC event last week, and among other things, they
> > talked about improvements they've made to filesystems in macOS[1].
> >
> > Among other things, one of the things introduced was a concept of
> > "firm links", which is something like NTFS' directory junctions,
> > except they can cross (sub)volumes.
>
> My understanding is it's a work around for the lack of APFS supporting
> directory hardlinks. Btrfs does support directory hardlinks but a
> hardlink points to a particular inode within a particular subvolume
> (files tree) so it's not possible to have a hard link that crosses
> subvolumes. A reflink can already do this, but it's really just an
> efficient copy, the resulting directory is independent. A directory
> symlink can mirror a directory across subvolumes, but like any symlink
> it must have a fixed path available to always find the real deal.
>
> I think a firm link like thing on Btrfs would require a format change,
> but I'm not certain. My best guess of what it'd be, is a dir/file
> object that gets its own inode but contains a hard reference (not
> independent object) to a subvolid+inode.
>
>
> > This concept makes it easier to
> > handle uglier layouts. While bind mounts work kind of okay for this
> > with simpler configurations, it requires operating system awareness,
> > rather than being set up automatically as the volume is mounted. This
> > is less brittle and works better for recovery environments, and helps
> > make it easier to do read-only system volumes while supporting read-write
> > sections in a more flexible way.
>
> There are a couple of things going on. One is something between VFS
> and Btrfs does this goofy assumption that bind mounts are subvolumes,
> which is definitely not true. I bring this up here:
> https://lore.kernel.org/linux-btrfs/CAJCQCtT=-YoFJgEo=BFqfiPdtMoJCYR3dJPSekf+HQ22GYGztw@mail.gmail.com/
>
> Near as I can tell, Btrfs kernel code just needs to be smarter about
> distinguishing between bind mounts of directories versus the behind
> the scene bind mount used for subvolumes mounted using -o subvol= or
> -o subvolid= ; I don't think that's difficult. It's just someone needs
> to work through the logic and set aside the resources to do it.
>
> Second, the FHS is a PITA anyway, but it really shows its unhelpful
> ways when it comes to read-only, recoverable/resettable systems. Just
> see the massively complicated subvolume carveouts opensuse has to do
> when installed on Btrfs, and the even more complicated gymnastics
> libostree is doing on the various rpm-ostree variants including Fedora
> Silverblue.
>
> Apple, a long long time ago said, fuck that insanity, we're burying
> the FHS so mortal users can't see that shit. And we're going to have a
> plain language set of directories for, you know, actual people who
> need to get work done.
>
> So definitely consider me in the camp of the FHS making life harder, not easier.
>

I mean, yes... FHS is definitely unhelpful, but Apple conforms to FHS
pretty well, even though it's not obvious that it does. Apple just has
the benefit of being able to shuffle things around without people
noticing, whereas no Linux distribution has that.

> >
> > For example, this would be useful if a volume has two subvolumes: OS
> > and data. OS would have /usr and data would have /var and /home. But,
> > importantly, a couple of system data things need to be part of the OS
> > that are on /var: /var/lib/rpm and /var/lib/alternatives. These two
> > belong with the OS, and it's incredibly difficult to move it around
> > due to all kinds of ecosystem knock-on effects. (If you want to know
> > more about that, just ask the SUSE kiwi team... it's the gift that
> > keeps on giving...). Both /var/lib/rpm and /var/lib/alternatives are
> > part of the OS, but they're in /var. It'd be great to stitch that in
> > from the read-only OS volume into the /var subvolume so that it's
> > actually part of the OS volume even though it looks like it's in the
> > data one. It's completely transparent to everything. Supporting atomic
> > updates (with something like a dnf plugin) becomes much easier because
> > we can trigger snapshot and subvolume mounts with preserving enough
> > structure to make things work. In this circumstance, we can flip the
> > properties so that the new location has a rw OS and ro data volume
> > mount for doing only software updates (or leave data volume rw during
> > this transaction and merge the changes back into the OS). We could
> > also do creative things with /etc if we so wish...
>
> Is it really best to do this in Btrfs proper, rather than in VFS?
>

If we can handle it in VFS where things like firm links drag linked
subvolumes to be automatically mounted together at their individually
set snapshot level, then yeah. But best as I understand it, the VFS
layer is not capable of this level of granularity.

This is probably one issue with Btrfs that ZFS gets to avoid, since
ZFS can't use VFS and thus implements everything at its level. I'm not
suggesting Btrfs do it for everything, but the filesystem needs some
intelligence about subvolume handling that it doesn't have now.

>
> > Another thing that APFS seems to support now is creating linked
> > snapshots (snapshots of multiple subvolumes that are paired together
> > as single snapshot) for full system replication. Obviously, with firm
> > links, it makes sense to be able to do such a thing so that full
> > system replication works properly. As far as I know, it shouldn't be a
> > difficult concept to implement in Btrfs, but I guess it wouldn't be
> > really necessary if we don't have firm links...
>
> Right now a subvolume is really just a files tree. It's not as
> separate as it might seem from the pool, compared to what a ZFS
> dataset is, or what I guess is called a volume in APFS. To do this on
> Btrfs probably is another disk format change. My guess is something
> based on seed-sprout feature, but without the mandatory 2nd block
> device for the sprout, i.e. freeze all the trees.
>

Hmm... That makes sense. I think it would be good to have it for the
cases I've mentioned...


-- 
真実はいつも一つ！/ Always, there's only one truth!

* Re: APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
  2019-06-12  4:03 ` Chris Murphy
  2019-06-12  8:06   ` Neal Gompa
@ 2019-06-12  9:58   ` David Sterba
  2019-08-05 20:59     ` Chris Murphy
  1 sibling, 1 reply; 7+ messages in thread
From: David Sterba @ 2019-06-12  9:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Neal Gompa, Btrfs BTRFS, Josef Bacik, David Sterba

On Tue, Jun 11, 2019 at 10:03:51PM -0600, Chris Murphy wrote:
> On Tue, Jun 11, 2019 at 12:31 PM Neal Gompa <ngompa13@gmail.com> wrote:
> >
> > Hey,
> >
> > So Apple held its WWDC event last week, and among other things, they
> > talked about improvements they've made to filesystems in macOS[1].
> >
> > Among other things, one of the things introduced was a concept of
> > "firm links", which is something like NTFS' directory junctions,
> > except they can cross (sub)volumes.
> 
> My understanding is it's a work around for the lack of APFS supporting
> directory hardlinks. Btrfs does support directory hardlinks but a

Directory hardlinks are not supported in general on Linux and are
prohibited at the VFS level (check vfs_link in fs/namei.c, which
explicitly returns -EPERM for a directory).
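
A quick way to see the user-visible side of this is the sketch below.
It only assumes a POSIX shell with mktemp and ln; the directory names
are arbitrary. Note that some ln implementations reject the request
themselves before the kernel is ever asked, so this demonstrates the
outcome rather than the vfs_link check itself.

```shell
#!/bin/sh
# Try to hard link a directory and record whether it was permitted.
tmp=$(mktemp -d)
mkdir "$tmp/dir"

if ln "$tmp/dir" "$tmp/dirlink" 2>/dev/null; then
    result=allowed
else
    result=refused
fi

printf 'hard link to a directory: %s\n' "$result"
rm -rf "$tmp"
```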

> hardlink points to a particular inode within a particular subvolume
> (files tree) so it's not possible to have a hard link that crosses
> subvolumes. A reflink can already do this, but it's really just an
> efficient copy, the resulting directory is independent. A directory
> symlink can mirror a directory across subvolumes, but like any symlink
> it must have a fixed path available to always find the real deal.
> 
> I think a firm link like thing on Btrfs would require a format change,
> but I'm not certain. My best guess of what it'd be, is a dir/file
> object that gets its own inode but contains a hard reference (not
> independent object) to a subvolid+inode.
> 
> 
> > This concept makes it easier to
> > handle uglier layouts. While bind mounts work kind of okay for this
> > with simpler configurations, it requires operating system awareness,
> > rather than being set up automatically as the volume is mounted. This
> > is less brittle and works better for recovery environments, and helps
> > make it easier to do read-only system volumes while supporting read-write
> > sections in a more flexible way.
> 
> There are a couple of things going on. One is something between VFS
> and Btrfs does this goofy assumption that bind mounts are subvolumes,
> which is definitely not true. I bring this up here:
> https://lore.kernel.org/linux-btrfs/CAJCQCtT=-YoFJgEo=BFqfiPdtMoJCYR3dJPSekf+HQ22GYGztw@mail.gmail.com/

Subvolume mounts build on top of the bind mount API internally, but a
subvolume is, or should be, a different kind of object.

> Near as I can tell, Btrfs kernel code just needs to be smarter about
> distinguishing between bind mounts of directories versus the behind
> the scene bind mount used for subvolumes mounted using -o subvol= or
> -o subvolid= ; I don't think that's difficult. It's just someone needs
> to work through the logic and set aside the resources to do it.

I tried to fix that and got halfway through, then hit the difficult
problems, mainly with nested subvolumes. For leaf subvolumes, telling
the difference between

  subvolume/dir/dir/dir (bind mounted)

and

  subvolume (mounted with -o)

means traversing the path backwards until a subvolume is hit, which in
both cases would be 'subvolume'. However, with nested subvolumes it's
not easy to see where to stop. Consider

  subvol1/dir/dir/subvol2/dir/dir/subvol3/dir/dir

and take 3 cases:

  mount -o subvol=subvol1
  mount -o subvol=subvol2
  mount -o subvol=subvol3

the backward path traversal will always say it's subvol3 (that's wrong
from the user's POV). Keeping track of the exact subvolume that was
mounted is not trivial because it partially has to duplicate the
internal VFS information, which makes it hard to keep consistent after
moves.
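
The ambiguity can be illustrated with a toy model in shell. This is
only a sketch of the idea as described above, not kernel code: the
function name nearest_subvol is hypothetical, and the rule that any
path component starting with "subvol" is a subvolume root is an
assumption made purely for the example.

```shell
#!/bin/sh
# Toy model: walk a path backwards and report the first component that
# is a subvolume root. Components named subvol* are treated as
# subvolumes (an illustrative convention, nothing Btrfs enforces).
nearest_subvol() {
    path=$1
    while [ -n "$path" ]; do
        last=${path##*/}
        case $last in
            subvol*) printf '%s\n' "$last"; return 0 ;;
        esac
        # Drop the last component, or stop when none are left.
        case $path in
            */*) path=${path%/*} ;;
            *)   path= ;;
        esac
    done
    return 1
}

deep=subvol1/dir/dir/subvol2/dir/dir/subvol3/dir/dir

# The walk only sees the path, so it cannot tell which subvolume was
# actually passed to -o subvol=.
for mounted in subvol1 subvol2 subvol3; do
    printf 'mounted with -o subvol=%s, traversal reports %s\n' \
        "$mounted" "$(nearest_subvol "$deep")"
done
```

Every line reports subvol3, regardless of which subvolume was mounted,
while the leaf case (subvolume/dir/dir/dir) resolves unambiguously to
'subvolume'.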

There was a concept proposal called 'fs view' that would add a proper
abstraction for subvolumes to VFS, but I don't know how far this got.

* Re: APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
  2019-06-12  8:06   ` Neal Gompa
@ 2019-06-12 20:02     ` Chris Murphy
  2019-06-13 11:37       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Murphy @ 2019-06-12 20:02 UTC (permalink / raw)
  To: Neal Gompa; +Cc: Chris Murphy, Btrfs BTRFS, Josef Bacik, David Sterba

On Wed, Jun 12, 2019 at 2:07 AM Neal Gompa <ngompa13@gmail.com> wrote:
> I mean, yes... FHS is definitely unhelpful, but Apple conforms to FHS
> pretty well, even though it's not obvious that it does. Apple just has
> the benefit of being able to shuffle things around without people
> noticing, whereas no Linux distribution has that.

The vast majority of macOS is located outside the FHS in Applications,
Library, System, and Users. And in fact I don't even see the FHS
structure in the GUI where 99.9% of macOS users interact. So the FHS
is really incidental to macOS and macOS users.

My Mac looks like this:

 14G    Applications
3.7G    Library
  0B    Network
9.1G    System
4.0K    TMVersion.ini
 11G    Users
226G    Volumes
2.6M    bin
  0B    cores
4.5K    dev
  0B    etc
1.0K    home
504M    macOS Install Data
1.0K    net
6.2G    private
1.2M    sbin
  0B    tmp
473M    usr
  0B    var

The vast majority of private is a 4G hibernation image file, and a 2G
dyld cache. I'm not really sure how Apple is going to leverage APFS
for moving things around behind the scenes, it doesn't strike me as
having been a problem in search of a solution (or even vice versa).

But I can see why they'd like to have a "clean" system snapshot to
reset to, without resetting /Users (user data files). To date, they have
no reset/refresh like Windows has had for some time, and software
updates are not safe and do not roll back either automatically or
manually. Their primary troubleshooting step is the "clean install" or
really the "clean reinstall". It's expected that the user has a backup
(preferably Time Machine) because it's a reformat, point and shoot
installer with no separate partition for user data.

> > > For example, this would be useful if a volume has two subvolumes: OS
> > > and data. OS would have /usr and data would have /var and /home. But,
> > > importantly, a couple of system data things need to be part of the OS
> > > that are on /var: /var/lib/rpm and /var/lib/alternatives. These two
> > > belong with the OS, and it's incredibly difficult to move it around
> > > due to all kinds of ecosystem knock-on effects.

I'd say achieving functionally what you're after is possible with
Btrfs as it exists today. But it would require two "views": the
dirty reality that the updating system manages, and the kind unreality
for the user.

On Btrfs you can have a /var/lib/rpm and /var/lib/alternatives
read-write subvolume, just snapshot it, as either ro or rw (your
choice) and at the time the snapshot is created, have it created in
the OS "area" (rather than calling it a subvolume). A snapshot is just
a pre-populated subvolume, otherwise they are the same thing.

Subvolumes can also be moved just like directories if they're rw. If
they're ro, they can be renamed but not moved within the hierarchy.
It's also possible to change the ro property on any subvolume.

Of course this means there is a time delay. Your updating system would
update the "data" version of /var/lib/rpm and /var/lib/alternatives,
and then once the out of band update is finished, it would snapshot
them into the "OS" hierarchy, replacing its now stale versions of
/var/lib/rpm and /var/lib/alternatives.
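
A minimal sketch of that snapshot-and-replace step follows. The layout
is hypothetical: it assumes the top-level volume is mounted at
/mnt/top, with read-write subvolumes under data/ and their read-only
snapshots under os/, and the names run, refresh_os_snapshot and rpmdb
are made up for illustration. DRY_RUN defaults to 1, so the btrfs
commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical layout: $TOP/data/<name> is the read-write subvolume the
# updater writes to; $TOP/os/<name> is the read-only snapshot the OS sees.
TOP=${TOP:-/mnt/top}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Replace the stale OS-side snapshot with a fresh read-only one.
# Assumes a previous snapshot already exists at the destination.
refresh_os_snapshot() {
    name=$1
    src="$TOP/data/$name"
    dst="$TOP/os/$name"
    run btrfs subvolume snapshot -r "$src" "$dst.new"
    run btrfs subvolume delete "$dst"
    # Rename in place; the snapshot stays in the same parent directory.
    run mv "$dst.new" "$dst"
}

plan=$(refresh_os_snapshot rpmdb)
printf '%s\n' "$plan"
```

The window between the delete and the rename is exactly the kind of
time delay mentioned above; a real updater would have to decide what
the OS view should show during it.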

Also, subvolumes can act as a barrier to snapshots. That's a big part
of how (open)SUSE is using Btrfs subvolumes, to delimit the
snapshotting. i.e. they use a grub subvolume instead of a regular
directory so that when /boot is snapshot, the changing state of the
bootloader is NOT snapshot. Same for /etc/ and parts of /var.

I understand that what you don't have on Btrfs, and what it sounds like
you'd like to have, is a kind of wormhole between two whatevers (be
they directories or subvolumes or wormvolumes) that always contain
identical things. But they can independently be made ro or rw. They
can both be rw. They can both be ro. Or one can be rw and the other
ro, which would mean one of them "sends" data into the other one,
live.

You could fake that today. But you'd need two perspectives, where 99%
of user space is aware of the fabricated perspective; and the 1% of
user space that is your OS updates and switcheroo system would need to
be aware of both perspectives and be the domain owner of managing both
perspectives.

> (If you want to know
> > > more about that, just ask the SUSE kiwi team... it's the gift that
> > > keeps on giving...). Both /var/lib/rpm and /var/lib/alternatives are
> > > part of the OS, but they're in /var. It'd be great to stitch that in
> > > from the read-only OS volume into the /var subvolume so that it's
> > > actually part of the OS volume even though it looks like it's in the
> > > data one. It's completely transparent to everything.

I'm having a hard time visualizing what Apple is doing or going to do.
I expect like most things there will be some daemon that understands
all the new ioctls and does the actual work of changing the storage
hierarchy states. It's transparent to 99% of the OS and user space,
but this one daemon must understand the real deal in order to "fake"
it with the creation of firm links and whatever else they do.
Something in user space must know about firm links in order to
leverage firm links.

> Supporting atomic
> > > updates (with something like a dnf plugin) becomes much easier because
> > > we can trigger snapshot and subvolume mounts with preserving enough
> > > structure to make things work. In this circumstance, we can flip the
> > > properties so that the new location has a rw OS and ro data volume
> > > mount for doing only software updates (or leave data volume rw during
> > > this transaction and merge the changes back into the OS). We could
> > > also do creative things with /etc if we so wish...

I've done out of band software updates with dnf already myself, this
is a quick and dirty example:

# btrfs sub snap root.20190610 root.20190612
# mount -o subvol=root.20190612 /mnt/updates
# for d in dev proc run sys; do mount -B /$d /mnt/updates/$d; done
# chroot /mnt/updates
# dnf update -y
# exit
# vi /mnt/updates/etc/fstab ## point the root to root.20190612
# umount --recursive /mnt/updates ## tear it down
# grub2-editenv - set kernelopts= ## change rootflags to use the new subvol

Reboot. Now the last step is janky and does not take advantage of any
fallback we now have in GRUB, at least in Fedora land. To do that I
need to do a bit more surgery manually to create new BLS snippets to
make explicit "former" and "current" boot menu entries that cause the
correct subvolume to become root.

But anyway, point is, I can do out of band updates, the currently
active OS doesn't get confused as the update process is yanking out
updates from underneath it all, and I don't have to mount /home in
that process at all. That chroot (or it can be done with bwrap or
nspawn or whatever) environment does not have my home in it. The
update process can't hurt it. And if the update fails, just delete the
bad subvolume, no harm done.

And libostree actually does a very close variation on that.

> >
> > Is it really best to do this in Btrfs proper, rather than in VFS?
> >
>
> If we can handle it in VFS where things like firm links drag linked
> subvolumes to be automatically mounted together at their individually
> set snapshot level, then yeah. But best as I understand it, the VFS
> layer is not capable of this level of granularity.

Yeah I kinda need pretty pictures and animations and things to
understand this better, I'm not really completely understanding the
problem. I'm very aware of Apple's severe limitations with updates and
rollbacks, and why they want a way out of that. But I'm not clear on
why Btrfs should mimic that particular behavior and solution for
their problem, seeing as only one distro defaults to Btrfs yet doesn't
require it, and Fedoraland is pretty much completely over Btrfs, while
Red Hat is building a whole new userland file system which for sure is
not going to have wormhole dirs between its fully independent volumes.
It'd really need something done in VFS or, if it can't be done in VFS,
then possibly leverage overlayfs for some of this.

>
> This is probably one issue with Btrfs that ZFS gets to avoid, since
> ZFS can't use VFS and thus implements everything at its level. I'm not
> suggesting Btrfs do it for everything, but the filesystem needs some
> intelligence about subvolume handling that it doesn't have now.

ZFS is way more limited in this regard than Btrfs. Snapshots are
always children of datasets, they are always read only. You can clone
a snapshot into a dataset. I don't think (?) there is such a thing as
nested datasets or snapshots. Datasets can't be deleted until all
children (snapshots) are deleted first. In many ways ZFS limitations
keep users from doing things like many nested subvolumes like we see
on Btrfs and then it turns into logical problems for users and
developers alike.

So a non-Btrfs solution is going to need higher level advancements
anyway, either VFS or overlayfs or some combination of the two.

--
Chris Murphy

* Re: APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
  2019-06-12 20:02     ` Chris Murphy
@ 2019-06-13 11:37       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 7+ messages in thread
From: Austin S. Hemmelgarn @ 2019-06-13 11:37 UTC (permalink / raw)
  To: Chris Murphy, Neal Gompa; +Cc: Btrfs BTRFS, Josef Bacik, David Sterba

On 2019-06-12 16:02, Chris Murphy wrote:
> On Wed, Jun 12, 2019 at 2:07 AM Neal Gompa <ngompa13@gmail.com> wrote:
>> I mean, yes... FHS is definitely unhelpful, but Apple conforms to FHS
>> pretty well, even though it's not obvious that it does. Apple just has
>> the benefit of being able to shuffle things around without people
>> noticing, whereas no Linux distribution has that.
> 
> The vast majority of macOS is located outside the FHS in Applications,
> Library, System, and Users. And in fact I don't even see the FHS
> structure in the GUI where 99.9% of macOS users interact. So the FHS
> is really incidental to macOS and macOS users.
> 
> My Mac looks like this:
> 
>  14G    Applications
> 3.7G    Library
>   0B    Network
> 9.1G    System
> 4.0K    TMVersion.ini
>  11G    Users
> 226G    Volumes
> 2.6M    bin
>   0B    cores
> 4.5K    dev
>   0B    etc
> 1.0K    home
> 504M    macOS Install Data
> 1.0K    net
> 6.2G    private
> 1.2M    sbin
>   0B    tmp
> 473M    usr
>   0B    var
> 
> The vast majority of private is a 4G hibernation image file, and a 2G
> dyld cache. I'm not really sure how Apple is going to leverage APFS
> for moving things around behind the scenes, it doesn't strike me as
> having been a problem in search of a solution (or even vice versa).
> 
> But I can see why they'd like to have a "clean" system snapshot to
> reset to, without resetting /Users (user data files). To date, they have
> no reset/refresh like Windows has had for some time, and software
> updates are not safe and do not rollback either automatically or
> manually. Their primary troubleshooting step is the "clean install" or
> really the "clean reinstall". It's expected that the user has a backup
> (preferably Time Machine) because it's a reformat, point and shoot
> installer with no separate partition for user data.
> 
> 
>>>> For example, this would be useful if a volume has two subvolumes: OS
>>>> and data. OS would have /usr and data would have /var and /home. But,
>>>> importantly, a couple of system data things need to be part of the OS
>>>> that are on /var: /var/lib/rpm and /var/lib/alternatives. These two
>>>> belong with the OS, and it's incredibly difficult to move it around
>>>> due to all kinds of ecosystem knock-on effects.
> 
> I'd say that achieving functionally what you're after is possible with
> Btrfs as it exists today. But it would require two "views": the dirty
> reality that the updating system manages, and the kind unreality
> for the user.
> 
> On Btrfs you can have a /var/lib/rpm and /var/lib/alternatives
> read-write subvolume, just snapshot it, as either ro or rw (your
> choice) and at the time the snapshot is created, have it created in
> the OS "area" (rather than calling it a subvolume). A snapshot is just
> a pre-populated subvolume, otherwise they are the same thing.
> 
> Subvolumes can also be moved just like directories if they're rw. If
> they're ro, they can be renamed but not moved within the hierarchy.
> It's also possible to change the ro property on any subvolume.
> 
> Of course this means there is a time delay. Your updating system would
> update the "data" version of /var/lib/rpm and /var/lib/alternatives,
> and then once the out of band update is finished, it would snapshot
> them into the "OS" hierarchy, replacing its now stale versions of
> /var/lib/rpm and /var/lib/alternatives.
> 
> Also, subvolumes can act as a barrier to snapshots. That's a big part
> of how (open)SUSE is using Btrfs subvolumes, to delimit the
> snapshotting. i.e. they use a grub subvolume instead of a regular
> directory so that when /boot is snapshot, the changing state of the
> bootloader is NOT snapshot. Same for /etc/ and parts of /var.
> 
> I understand what you don't have on Btrfs, that it sounds like you'd
> like to have, is a kind of wormhole between two whatevers (be they
> directories or subvolumes or wormvolumes) where they are always
> containing identical things. But they can independently be made ro or
> rw. They can both be rw. They can both be ro. Or one can be rw and the
> other ro, which would mean one of them "sends" data into the other one,
> live.
> 
> You could fake that today. But you'd need two perspectives, where 99%
> of user space is aware of the fabricated perspective; and the 1% of
> user space that is your OS updates and switcheroo system would need to
> be aware of both perspectives and be the domain owner of managing both
> perspectives.
> 
> 
>> (If you want to know
>>>> more about that, just ask the SUSE kiwi team... it's the gift that
>>>> keeps on giving...). Both /var/lib/rpm and /var/lib/alternatives are
>>>> part of the OS, but they're in /var. It'd be great to stitch that in
>>>> from the read-only OS volume into the /var subvolume so that it's
>>>> actually part of the OS volume even though it looks like it's in the
>>>> data one. It's completely transparent to everything.
> 
> I'm having a hard time visualizing what Apple is doing or going to do.
> I expect like most things there will be some daemon that understands
> all the new ioctls and does the actual work of changing the storage
> hierarchy states. It's transparent to 99% of the OS and user space,
> but this one daemon must understand the real deal in order to "fake"
> it with the creation of firm links and whatever else they do.
> Something in user space must know about firm links in order to
> leverage firm links.
> 
>> Supporting atomic
>>>> updates (with something like a dnf plugin) becomes much easier because
>>>> we can trigger snapshot and subvolume mounts with preserving enough
>>>> structure to make things work. In this circumstance, we can flip the
>>>> properties so that the new location has a rw OS and ro data volume
>>>> mount for doing only software updates (or leave data volume rw during
>>>> this transaction and merge the changes back into the OS). We could
>>>> also do creative things with /etc if we so wish...
> 
> I've done out of band software updates with dnf already myself, this
> is a quick and dirty example:
> 
> # btrfs sub snap root.20190610 root.20190612
> # mount -o subvol=root.20190612 /mnt/updates
> # mount -B ## all the dev proc run sys stuff to /mnt/updates
> # chroot /mnt/updates
> # dnf update -y
> # exit
> # vi /mnt/updates/etc/fstab ##point the root to root.20190612
> # umount --recursive ## tear it down
> # grub2-editenv - set kernelopts= ##change rootflags to use the new subvol
> 
> Reboot. Now the last step is janky and does not take advantage of any
> fallback we now have in GRUB, at least in Fedora land. To do that I
> need to do a bit more surgery manually to create new BLS snippets to
> make explicit "former" and "current" boot menu entries that cause the
> correct subvolume to become root.
> 
> But anyway, point is, I can do out of band updates, the currently
> active OS doesn't get confused as the update process is yanking out
> updates from underneath it all, and I don't have to mount /home in
> that process at all. That chroot (or it can be done with bwrap or
> nspawn or whatever) environment does not have my home in it. The
> update process can't hurt it. And if the update fails, just delete the
> bad subvolume, no harm done.
> 
> And libostree actually does a very close variation on that.
And there are other people doing essentially the same thing too.  Where 
I work, we do something similar to implement fast rollback of failed 
updates, and I've been working intermittently on a wrapper for emerge
(Gentoo's package manager) to do the same type of thing.
> 
>>>
>>> Is it really best to do this in Btrfs proper, rather than in VFS?
>>>
>>
>> If we can handle it in VFS where things like firm links drag linked
>> subvolumes to be automatically mounted together at their individually
>> set snapshot level, then yeah. But best as I understand it, the VFS
>> layer is not capable of this level of granularity.
> 
> Yeah I kinda need pretty pictures and animations and things to
> understand this better, I'm not really completely understanding the
> problem. I'm very aware of Apple's severe limitations with updates and
> rollbacks, and why they want a way out of that. But I'm not clear on
> why Btrfs should mimic that particular behavior and solution for
> their problem, seeing as only one distro defaults to Btrfs yet doesn't
> require it, and Fedoraland is pretty much completely over Btrfs, while
> Red Hat is building a whole new userland file system which for sure is
> not going to have wormhole dirs between its fully independent volumes
> - it'd really need something done in VFS or if it can't be done in VFS
> then possibly leverage overlayfs for some of this.
Put simply, firm links are essentially hard links for subvolumes that 
can have differing VFS-level metadata.

Also, you can achieve the same outcome WRT updates just using regular 
OverlayFS and snapshots.  Make your root filesystem an overlay mount 
with the user data subvolume as the upper writable layer, and the OS 
subvolume as the lower read-only layer.  When you go to update, you 
create a writable snapshot of the OS subvolume, run the updates on the 
snapshot, update the configuration to use the snapshot as the lower 
layer on reboot, then reboot.  The only hard part is handling the 
OverlayFS configuration.
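
As a rough illustration of that configuration step, here is a minimal
Python sketch that only builds the mount command for the overlay root; it
does not run anything. Every path in it is a hypothetical placeholder, and
the helper name overlay_mount_command is made up for this example. The
only real syntax it relies on is the overlay mount option string
(lowerdir=, upperdir=, workdir=).

```python
import shlex


def overlay_mount_command(lower, upper, work, target):
    """Build (but do not run) the mount command for an overlay root.

    lower:  read-only OS subvolume (or a snapshot of it)
    upper:  writable directory on the user data subvolume
    work:   scratch directory; overlayfs needs it on the same
            filesystem as upper
    target: where the merged view is mounted
    """
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]


# All paths below are hypothetical placeholders, not a recommended layout.
current = overlay_mount_command(
    "/sysroot/os", "/sysroot/data/upper",
    "/sysroot/data/work", "/sysroot/merged")

# After updating a writable snapshot of the OS subvolume, only the lower
# layer changes for the next boot; the upper (user data) layer is untouched.
updated = overlay_mount_command(
    "/sysroot/os.new", "/sysroot/data/upper",
    "/sysroot/data/work", "/sysroot/merged")

print(shlex.join(current))
print(shlex.join(updated))
```

The point of the sketch is that switching to the updated OS amounts to
changing one field (lowerdir) in whatever generates the mount at boot.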
> 
> 
>>
>> This is probably one issue with Btrfs that ZFS gets to avoid, since
>> ZFS can't use VFS and thus implements everything at its level. I'm not
>> suggesting Btrfs do it for everything, but the filesystem needs some
>> intelligence about subvolume handling that it doesn't have now.
> 
> ZFS is way more limited in this regard than Btrfs. Snapshots are
> always children of datasets, they are always read only. You can clone
> a snapshot into a dataset. I don't think (?) there is such a thing as
> nested datasets or snapshots. Datasets can't be deleted until all
> children (snapshots) are deleted first. In many ways ZFS's limitations
> keep users from creating the many nested subvolumes we see on Btrfs,
> which then turn into logical problems for users and
> developers alike.
There is hierarchical nesting of datasets, but it's 100% independent of
the mount hierarchy as far as ZFS itself is concerned; it only gets used
for things like quota management and property inheritance.
> 
> So a non-Btrfs solution is going to need higher level advancements
> anyway, either VFS or overlayfs or some combination of the two.


* Re: APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs?
  2019-06-12  9:58   ` David Sterba
@ 2019-08-05 20:59     ` Chris Murphy
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Murphy @ 2019-08-05 20:59 UTC (permalink / raw)
  To: David Sterba, Josef Bacik; +Cc: Btrfs BTRFS

On Wed, Jun 12, 2019 at 3:58 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Jun 11, 2019 at 10:03:51PM -0600, Chris Murphy wrote:

> > There are a couple of things going on. One is that something between
> > VFS and Btrfs makes this goofy assumption that bind mounts are
> > subvolumes, which is definitely not true. I bring this up here:
> > https://lore.kernel.org/linux-btrfs/CAJCQCtT=-YoFJgEo=BFqfiPdtMoJCYR3dJPSekf+HQ22GYGztw@mail.gmail.com/
>
> Subvolume mounts build on top of the bind mount API internally, but a
> subvolume is, or should be, a different kind of object.
>
> > Near as I can tell, Btrfs kernel code just needs to be smarter about
> > distinguishing between bind mounts of directories versus the behind
> > the scene bind mount used for subvolumes mounted using -o subvol= or
> > -o subvolid= ; I don't think that's difficult. It's just someone needs
> > to work through the logic and set aside the resources to do it.
>
> I tried to fix that and got half way through, then hit the difficult
> problems mainly with nested subvolumes. For leaf subvolumes, the
> difference between
>
>   subvolume/dir/dir/dir (bind mounted)
>
> and
>
>   subvolume (mounted with -o)
>
> is to traverse back the path until the subvolume is hit, which in both
> cases would be 'subvolume'. However, with nested subvolumes it's not
> easy to see where to stop:
>
>   subvol1/dir/dir/subvol2/dir/dir/subvol3/dir/dir
>
> and take 3 cases:
>
>   mount -o subvol=subvol1
>   mount -o subvol=subvol2
>   mount -o subvol=subvol3
>
> the backward path traversal will always say it's subvol3 (that's wrong
> from the user's POV). Keeping track of the exact subvolume that was mounted
> is not trivial because it partially has to duplicate the internal VFS
> information which makes it hard to keep consistent after moves.
>
> There was a concept proposal called 'fs view' that would add a proper
> subvolume abstraction to VFS, but I don't know how far
> this got.

I guess I'm curious why in these cases the subvolid number is correct
in mount and mountinfo, but the subvol name is wrong? And if it's not
just anomalous that the id is correct, why not just use that and do a
lookup of id to name instead of however the name is currently
determined?
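
To make the question concrete, here is a toy Python model of the two
approaches. It is not kernel code and does not claim to reproduce how the
name is actually derived; the subvolume ids (256, 257, 258) and the helper
names are invented for illustration. It only shows that walking back from
the deepest path stops at the innermost subvolume no matter which one was
mounted, whereas a lookup keyed on the id is unambiguous.

```python
from pathlib import PurePosixPath

# Hypothetical nested layout from the example above: path -> subvolume id.
# The ids are made up for this sketch.
SUBVOLUMES = {
    "subvol1": 256,
    "subvol1/dir/dir/subvol2": 257,
    "subvol1/dir/dir/subvol2/dir/dir/subvol3": 258,
}


def nearest_subvolume(path, subvolumes):
    """Walk back from path until a known subvolume root is hit."""
    p = PurePosixPath(path)
    for candidate in [p, *p.parents]:
        if str(candidate) in subvolumes:
            return str(candidate)
    return None


def name_for_id(subvolid, subvolumes):
    """Resolve a subvolume id to its path by direct lookup."""
    for name, sid in subvolumes.items():
        if sid == subvolid:
            return name
    return None


deepest = "subvol1/dir/dir/subvol2/dir/dir/subvol3/dir/dir"

# Path traversal cannot tell which of the three subvolumes was mounted:
# from the deepest path it always stops at the innermost one.
print(nearest_subvolume(deepest, SUBVOLUMES))

# An id recorded at mount time resolves to exactly one subvolume.
for subvolid in (256, 257, 258):
    print(subvolid, name_for_id(subvolid, SUBVOLUMES))
```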

-- 
Chris Murphy


end of thread, other threads:[~2019-08-05 21:00 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-11 18:31 APFS improvements (e.g. firm links, volume w/ subvols replication) as ideas for Btrfs? Neal Gompa
2019-06-12  4:03 ` Chris Murphy
2019-06-12  8:06   ` Neal Gompa
2019-06-12 20:02     ` Chris Murphy
2019-06-13 11:37       ` Austin S. Hemmelgarn
2019-06-12  9:58   ` David Sterba
2019-08-05 20:59     ` Chris Murphy
