Re: [PATCH v2 3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen - Roman Anasal

From: Roman Anasal | BDSU <roman.anasal@bdsu.de>
To: "fdmanana@gmail.com" <fdmanana@gmail.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH v2 3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen
Date: Wed, 3 Feb 2021 16:20:48 +0000	[thread overview]
Message-ID: <311d6f4a38230bf3bf6d264665f8bfb6a277320b.camel@bdsu.de> (raw)
In-Reply-To: <CAL3q7H7DcSuM4ba6HU3vjsZ+h-i61Gb-21D1sqLScD0sdX2xZQ@mail.gmail.com>

On Tue, 2021-02-02 at 11:56 +0000, Filipe Manana wrote:
> On Sun, Jan 31, 2021 at 3:52 PM Roman Anasal | BDSU
> <roman.anasal@bdsu.de> wrote:
> > On Mon, Jan 25, 2021 at 20:51 +0000 Filipe Manana wrote:
> > > On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <
> > > roman.anasal@bdsu.de>
> > > wrote:
> > > > Second example:
> > > >   # case 2: same ino at different path
> > > >   btrfs subvolume create subvol1
> > > >   btrfs subvolume create subvol2
> > > >   mknod subvol1/a c 1 3
> > > >   mknod subvol2/b c 1 5
> > > >   btrfs property set subvol1 ro true
> > > >   btrfs property set subvol2 ro true
> > > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > > 
> > > As I've told you before for the v1 patchset from a week or two
> > > ago,
> > > this is not a supported scenario for incremental sends.
> > > Incremental sends are meant to be used on RO snapshots of the
> > > same
> > > subvolume, and those snapshots must never be changed after they
> > > were
> > > created.
> > > 
> > > Incremental sends were simply not designed for these cases, and
> > > can
> > > never be guaranteed to work with such cases.
> > > 
> > > The bug is not having incremental sends fail right away, with an
> > > explicit error message, when the send and parent roots aren't RO
> > > snapshots of the same subvolume.
> > 
> > Since this should be fixed then I'd like to propose to add the
> > following check:
> > 
> > The inodes of the subvolumes' root directories (ino
> > BTRFS_FIRST_FREE_OBJECTID = 256) must have the same generation.
> > 
> > Since create_subvol() will always commit the transaction, i.e.
> > increment the generation, no two _independently_ created subvolumes
> > can
> > be created within the same generation (are there race conditions
> > possible here?).
> 
> That is currently true, but it has been discussed and proposed the
> ability to skip the transaction commit when creating a subvolume
> Boris sent a proposal patch for that a few months ago.

Ah, okay then, if this may change in the future then this idea isn't
safe and should be dismissed.

> I don't think that should be assumed. Avoiding the transaction
> commit,
> either by default or optionally, is something that makes sense.
> Plus for a case like snapshots, we can actually batch the creation of
> several ones in a single transaction.
> 
> > Taking a snapshot of a subvolume does not modify the generation of
> > the
> > root dir inode. Also it is not possible to change or delete/re-
> > create
> > the root directory of a subvolume since this would delete the
> > subvolume
> > itself.
> > 
> > 
> > So having two subvolumes with root directories created with
> > different
> > generations means they were created independently and can not share
> > a
> > common ancestor. Doing an incremental send with them is unsafe and
> > thus
> > must return an error.
> > With the root directories at the same generation though the
> > subvolumes
> > are based on a common ancestor which is a requirement for a safe
> > incremental send.
> > 
> > Are my assumptions and my understanding here correct? Then this
> > check
> > would catch most of the unsafe parents.
> > If so I could have a shot at a patch for this if you'd like me to?
> 
> That is too complex and makes too many assumptions.
> 
> To check if two roots are snapshots of the same subvolume (the send
> and parent roots), you can simply check if they have non-null uuids
> in
> the "parent_uuid" field of their root items and that they match.

I thought of this, too, but see it break in some scenarios I'd expect
it to work, mostly with "chains" of snapshots as they happen on a
receiving side.

Consider this scenario:

   btrfs subvolume create /subvol/
   # modify /subvol
   btrfs subvolume snapshot -r /subvol/ /snapshots/snap1
   # modify /subvol
   btrfs subvolume snapshot -r /subvol/ /snapshots/snap2
   # modify /subvol
btrfs subvolume snapshot -r /subvol/ /snapshots/snap3

I.e. have a single RW subvolume and taking incremental snapshots of it.

   cd /snapshots/
   btrfs send snap1 | btrfs receive /mnt/backups/
btrfs send -p snap1 snap2 | btrfs receive /mnt/backups/   btrfs send -p snap2 snap3 | btrfs receive /mnt/backups/

I.e. incrementally send the snapshots to another btrfs volume.

   cd /mnt/backups
   btrfs subvolume delete snap2
   btrfs send snap1 | btrfs receive /mnt/backups2/
   btrfs send -p snap1 snap3 | btrfs receive /mnt/backups2/

I.e. delete the intermediate snapshot snap2 and incrementally send
snap1 and snap3 from the receiving filesystem to yet another btrfs
filesystem.

The last command would fail since snap3 was based on snap2 which was
based on snap1; so neither is snap1 the (direct) parent of snap3 nor do
they share a common (direct) parent nor would it be possible to
reconstruct their relation by walking the chain since snap2 does no
longer exist.

While on the orignal filesystem all snapshots have the same parent on
the reciving volume it is a chain:

orignal volume:

        subvolume
        ^   ^   ^
       /    |    \
   snap1  snap2  snap3

receiving volume:

   snap1 <- snap2 <- snap3

So for this to work it would probably require another attribute
"original subvol UUID" for the root of the ancestry tree...

> While this is more straightforward to do in the kernel, I would
> prefer
> to have it in btrfs-progs, because:
> 
> 1) In btrfs-progs we can explicitly print an informative error
> message
> to the user, while in the kernel you can only return an errno value
> and log something dmesg/syslog, which is much less user friendly;

I was thinking about implementing it in the kernel as an (additional)
check to block unsafe sends regardless of the user space tool (are
there any besides btrfs-progs?); but proper handling and an explaining
error message must be imlpemented in btrfs-progs, totally.

> 2) The check would be on by default but could be skipped with some
> new
> flag - this is just being conservative to avoid breaking any existing
> workflows we might not be aware of.
>     In particular I'm thinking about people using "btrfs send" with
> -c
> and omitting -p, in which case btrfs-progs selects one of the -c
> roots
> to be used as the parent root,
>     but the selected root might not be a snapshot of the same
> subvolume as the send root.
>     Then maybe one day that option to skip the check would be
> removed,
> after we are more sure no one is using or really needs such
> workflows.

The way I read find_good_parent() it will only select a clone source as
parent if it is the parent subvolume of the send subvolume [1] or if
both have the same parent [2]?
Which makes sense since selecting an snapshot of an unrelated subvolume
would be unsafe.

[1] https://github.com/kdave/btrfs-
progs/blob/273380d98f4412ae8b0f35ad69debf682e48c6bd/cmds/send.c#L118
[2]
https://github.com/kdave/btrfs-
progs/blob/273380d98f4412ae8b0f35ad69debf682e48c6bd/cmds/send.c#L131

> > 
> > This check still does not solve the second edge case though, when
> > snapshots are modified afterwards and diverge independently form
> > one
> > another. For this I still see no good solution besides a new on-
> > disk
> > flag whether a snapshot was *ever* set to ro=false. But with that
> > I'm
> > not sure how to (not) inherit that flag in a safe way ...
> 
> I'm afraid there's nothing, codewise, to do about that case.
> 
> Setting some flag on the root to make it unusable for send in case it
> was ever RW would break send in at least one way:
> 
> During a receive we create the root as RW, apply the send stream and
> then change the root to RO.
> After such change, it would mean we could not send the received
> snapshot anymore. There's no way to make sure that only btrfs-receive
> can do that, since anyone can use the ioctl.

Another case where allowing to switch to RW before send would be
desirable: make snapshot RW, delete files you don't need anymore, make
RO again, send to backup disk.
Only deleting files/inodes should even be safe now.

> Perhaps all that needs to be done is to document this well in the man
> pages and wiki in case it's not already there.

Yes. Since these are all very unlikely edge cases and reliably detecting them without false positives is hard, just explicitly documenting them is probably the best solution.

> 
> Thanks.
>