On Fri, Jan 01, 2021 at 11:40:28PM +0300, Andrei Borzenkov wrote:
> 01.01.2021 14:42, Andrei Borzenkov wrote:
> > 01.01.2021 00:36, Zygo Blaxell wrote:
> > ...
> >>
> >> Yeah, I only checked that send completed without error and produced a
> >> smaller stream.
> >>
> >> I just dumped the send metadata stream from the incremental snapshot now,
> >> and it's more or less garbage at the start:
> >>
> >> 	# btrfs sub create A
> >> 	# btrfs sub create B
> >> 	# date > A/date
> >> 	# date > B/date
> >> 	# mkdir A/t B/u
> >> 	# btrfs sub snap -r A A_RO
> >> 	# btrfs sub snap -r B B_RO
> > ...
> >> 	# btrfs send A_RO | btrfs receive -v /tmp/test
> >> 	At subvol A_RO
> >> 	At subvol A_RO
> >> 	receiving subvol A_RO uuid=995adde4-00ac-5e49-8c6f-f01743def072, stransid=7329268
> >> 	write date - offset=0 length=29
> >> 	BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=995adde4-00ac-5e49-8c6f-f01743def072, stransid=7329268
> >> 	# btrfs send B_RO -p A_RO | btrfs receive -v /tmp/test
> >> 	At subvol B_RO
> >> 	At snapshot B_RO
> >> 	receiving snapshot B_RO uuid=4aa7db26-b219-694e-9b3c-f8f737a46bdb, ctransid=7329268 parent_uuid=995adde4-00ac-5e49-8c6f-f01743def072, parent_ctransid=7329268
> >> 	ERROR: link date -> date failed: File exists
> >>
> >> The btrfs_compare_trees function can handle arbitrary tree differences,
> > 
> > I am not sure. It apparently relies on inode numbers always being
> > monotonically increasing. This is probably true for clones of the same
> > subvolume (I assume a clone inherits highest_objectid), but two subvolumes
> > created independently have the same range of inode numbers.
> > 
> 
> In particular, in your example both A/date and B/date have identical
> inode numbers, and in general their INODE_ITEMs are identical (including
> generation numbers) up to times, so the two inodes are compared as changed.
> At the same time, the INODE_REFs for them are considered different because
> the INODE_ITEMs for root have different generations. This leads to a code
> path that attempts to create an additional alias to the existing inode; as
> it is a regular file, it tries to link it. It does not really compare ref
> names at this point at all.
> 
> This would not really be possible if A and B were clones of the same
> subvolume (not necessarily consecutive), as A/date and B/date would always
> have different inode numbers.

After v5.11-rc1 inode_cache can no longer be used, but any filesystem that
has inode_cache in its history might have cases like this hiding in
metadata even with a linear series of snapshots.

The send code is mostly used to transmit linear sequences of snapshots
(a series of snapshots which capture the state of a single subvol at
different times, ordered from oldest to newest) between machines that
are not using the inode_cache mount option.  Any other case isn't getting
very well tested in the field, even if it happens to work sometimes.

> If I force different generation numbers for A/date and B/date (by
> syncing in between), the send stream contains the correct sequence of
> removing the old B/date (from the A clone) and re-creating it.
>
> Which shows that, unfortunately, generation numbers are not reliable to
> differentiate between different object generations (pun unintended). As
> I understand it, a generation is tied to a transaction, and multiple
> changes can be packed into one transaction.
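
To make that concrete, here is a toy model of the comparison (Python,
not the kernel code).  It assumes a subvolume can be reduced to a map
of inode number -> generation, and that two entries with the same
(inode number, generation) pair are taken to be the same object.  The
function name, the verdict strings and the use of 257 as the first
file inode are illustrative assumptions; send.c is far more involved
than this.

```python
from typing import Dict

# Toy model only: a subvolume is a map of inode number -> generation.
Subvol = Dict[int, int]

SAME = "same inode, updated in place"
REPLACED = "different inode, unlink and recreate"


def classify(parent: Subvol, child: Subvol) -> Dict[int, str]:
    """Classify each inode number the way a naive tree diff would."""
    verdicts = {}
    for ino in sorted(parent.keys() | child.keys()):
        if ino not in parent:
            verdicts[ino] = "created"
        elif ino not in child:
            verdicts[ino] = "deleted"
        elif parent[ino] == child[ino]:
            # Same number and same generation: assumed to be one object.
            verdicts[ino] = SAME
        else:
            # Same number, different generation: the number was reused.
            verdicts[ino] = REPLACED
    return verdicts


GEN = 7329268                 # transid taken from the receive output above
a_ro = {257: GEN}             # A/date
b_ro = {257: GEN}             # B/date, created in the same transaction
b_ro_synced = {257: GEN + 1}  # B/date, created after a sync

print(classify(a_ro, b_ro)[257])         # unrelated files look identical
print(classify(a_ro, b_ro_synced)[257])  # now they look like a replacement
```

In this model the two unrelated files are indistinguishable unless a
transaction commit happens to fall between their creation, which
matches what you saw with and without the sync.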

I'm pretty sure that the 6000+ lines of special-case code in send.c still
don't cover every possible case, or even all of the likely ones, even
with linear snapshot sequences.  We still get people on IRC reporting
strange receive issues, and usually the best solution we can find is
to start over with a new full send.  That's OK for small filesystems,
but when you have to unexpectedly do a full send of dozens of terabytes
over a medium-speed link, it's probably time to switch to rsync.

Subversion used to have problems like this (maybe it still does, I
switched to git years ago) where a complicated commit that combined
multiple operations on objects of the same name would break the tool.
I'm surprised btrfs is trying to do similar things in the kernel
(though with the current send implementation there's nowhere else we
could do them).  At least for fsync we get to say "nope, too hard,
do a full commit instead" when complications arise.

> > Also I am not sure if using later clone as base for difference to
> > earlier clone will work for the same reason.

That use case can come up e.g. if you have snapshots of / and you roll
back to an earlier snapshot after a bad upgrade, but your backups are
using incremental snapshots made from '/'.  Then the last-sent-snapshot
(from the bad upgrade) is newer than the origin subvol (from an earlier
good upgrade, with new modifications on top).

Cases like these really need to work, or at least reliably throw
errors when they have failed, as the application that rolls back to
earlier snapshots might have no knowledge of the application that does
incremental send backups on a user's system if they integrated tools
from different vendors.
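
As a rough illustration of "reliably throw errors", a backup tool could
at least check that the two snapshots share an origin before trying an
incremental send.  The sketch below (Python) assumes the tool has
already collected each subvolume's parent_uuid into a dict; the
function names and the sample names are made up for the example.  It
is not what btrfs-progs or the kernel does today, the chain breaks if
an intermediate subvolume has been deleted, and passing the check says
nothing about whether send will produce a correct stream.

```python
from typing import Dict, List, Optional

# Hypothetical input: subvolume id -> id of the subvolume it was
# snapshotted from, or None if it was made with 'btrfs sub create'.
ParentMap = Dict[str, Optional[str]]


def ancestors(subvol: str, parent_of: ParentMap) -> List[str]:
    """Return subvol followed by its chain of snapshot parents."""
    chain: List[str] = []
    current: Optional[str] = subvol
    while current is not None and current not in chain:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def share_origin(a: str, b: str, parent_of: ParentMap) -> bool:
    """True if a and b descend from at least one common subvolume."""
    return bool(set(ancestors(a, parent_of)) & set(ancestors(b, parent_of)))


# The A/B test from earlier in this thread: two independent subvolumes.
independent: ParentMap = {"A": None, "B": None, "A_RO": "A", "B_RO": "B"}

# The rollback case: 'bad_snap' was already sent, then / was replaced
# by a writable snapshot of 'good_snap' and snapshotted again.
rollback: ParentMap = {
    "root": None,
    "good_snap": "root",
    "bad_snap": "root",
    "root_rolled_back": "good_snap",
    "new_snap": "root_rolled_back",
}

print(share_origin("A_RO", "B_RO", independent))      # no common origin
print(share_origin("bad_snap", "new_snap", rollback))  # common origin
```

A check like this would reject the A_RO/B_RO pair up front, while
still allowing the rollback case, where the parent is newer than the
origin of the snapshot being sent.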

> >> but something happens in one of the support functions and we get a
> >> bogus link command.  The rest of the stream is OK though:  we fill
> >> in the contents of B_RO/date, rename A_RO/t to B_RO/u, and update all
> >> the timestamps.
> >>
> >> Oh well, I didn't say send didn't have any bugs.  ;)
> >>