From: Filipe Manana <firstname.lastname@example.org> To: Zygo Blaxell <email@example.com> Cc: firstname.lastname@example.org, linux-btrfs <email@example.com> Subject: Re: [PATCH 3/4] Btrfs: check if destination root is read-only for deduplication Date: Fri, 22 Feb 2019 11:13:48 +0000 Message-ID: <CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com> (raw) In-Reply-To: <20190220171708.GG9995@hungrycats.org> On Wed, Feb 20, 2019 at 5:17 PM Zygo Blaxell <firstname.lastname@example.org> wrote: > > On Wed, Feb 20, 2019 at 04:54:09PM +0000, Filipe Manana wrote: > > On Wed, Feb 20, 2019 at 4:42 PM Zygo Blaxell > > <email@example.com> wrote: > > > > > > On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote: > > > > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <firstname.lastname@example.org> wrote: > > > > > > > > > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, email@example.com wrote: > > > > > > From: Filipe Manana <firstname.lastname@example.org> > > > > > > > > > > > > Checking if the destination root is read-only was being performed only for > > > > > > clone operations. Make deduplication check it as well, as it does not make > > > > > > sense to not do it, even if it is an operation that does not change the > > > > > > file contents (such as defrag for example, which checks first if the root > > > > > > is read-only). > > > > > > > > > > And this is also change in user-visible behaviour of dedupe, so this > > > > > needs to be verified if it's not breaking existing tools. > > > > > > > > Have you had the chance to do such verification? > > > > > > > > This actually conflicts with send. Send does not expect a root/tree to > > > > change, and with dedupe on read-only roots happening > > > > in parallel with send is going to cause all sorts of unexpected and > > > > undesired problems... > > > > > > This is a problem bees ran into. There is a workaround in bees (called > > > --workaround-btrfs-send) that avoids RO subvols as dedupe targets. > > > As the name of the option implies, it works around problems in btrfs send. > > > > > > This kernel change makes the workaround mandatory now, as the default > > > case (without workaround) will fail on every RO subvol even if that > > > behavior is desired by the user. That breaks an important use case on > > > the receiving side of sends--to dedupe the received subvols together > > > while also protecting them against modification on the target system > > > with the RO flag--and preserving that use case is why the send workaround > > > was optional (and not default) in bees. > > > > > > bees also won't handle the RO/RW/RO transition correctly, as it didn't > > > seem like a sane thing to support at the time. That is arguably something > > > to be fixed in bees. > > > > > > > This is a problem introduced by dedupe ioctl when it landed, since > > > > send existed for a longer time (when nothing else was > > > > allowed to change read-only roots, including defrag). > > > > > > Is there a reason why incremental send can't simply be fixed? > > > > This is a problem that affects both incremental and non-incremental (full) send. > > > > > As far > > > as I can tell, send is failing because of a runtime check that seems to > > > be too strict; however, I haven't tried removing that check to see if > > > it fixes the problem in send, or just hides the next problem. > > > > The problem is send was designed with the idea that read-only roots > > don't ever change. > > ...except when they're snapshotted, balanced, or deduped (to list places > where that assumption hasn't held so far). Yes, we had bugs related to that (I fixed some of them). However, the snapshotting ones [1, 2] happened to be easy to solve because only the root node is COWed. The one related to balance , which I only remember one and matches the piece of code you pointed to (which I fixed long ago), actually only solves one problem, which was hitting a BUG_ON() (if the extent item was changed by relocation we know no data changed and we can ignore the change). The other problem it doesn't solve is essentially the same problem that a concurrent dedupe can cause, which I only realized some weeks ago after noticing dedup was allowed on RO roots (subvolumes/snapshots). I'll explain below what problem it is. > > > The failures that can happen are many and unpredictable, from > > occasionally failing with some error, to invalid memory accesses, use > > after free problems, etc. > > Essentially all caused by races when the nodes/leafs from the > > read-only tree change while send is running. > > Maybe you can explain this further? As far as I can tell, all of those > are send bugs that should just be (and over the years have been) fixed. So send was always lockless and transactionless, because it's a potentially long running operation that could delay everything else if it were not (specially if it held transactions open) and because it needs to operate on a tree that never changes, so that it always sees a consistent state (often it needs to do multiple searches on a tree for many different reasons). Well this second reason is actually what matters and the first reason is more a consequence of the second reason. But well, the world is not perfect and it seems that during the design/implementation stage of send, balance and snapshotting of a snapshot being used by send were forgotten. Then after send came, dedupe came and because it allows to dedupe against files in a RO tree, brought another way of being able to modify a RO tree that can be in use by a concurrent send - this was missed during the review process obviously, otherwise it would have never allowed dedupe on RO trees or send would have been changed as well to somehow work with a concurrent dedupe. The big problem, for which I don't have a solution, is that dedupe and balance COW entire paths, and not just the root node of a tree like the snapshotting case. For example while send is doing work with a leaf from a RO snapshot tree, if that leaf is COWed, its extent is marked as free once the transaction that COWed it is committed. That means that immediately after that, the extent can be reused, allocated to some other tree leaf or node - while send is still using the respective in-memory representation of the leaf, the extent buffer. So some other task will be modifying the extent buffer while send is using it as well - this is what brings unexpected results that can result in crashes or, with luck, some random failures that make send return some error only. For a leaf we could just clone the extent buffer while preventing transaction commits, and then allow them to happen again after cloning. Now COWing a leaf also results in COWing its parent node, parent of the parent, etc, all the way up to the root node (pointers need to be changed in every parent). So if a node is COWed while send is using it, and before it finishes using it the same extent gets reused for another tree/node, send would follow an incorrect path since now the node has different pointers or it's in an undefined/inconsistent state due to the concurrent modification. Creating a copy/clone of the node while transaction commits are disabled is not an option, since when using the copy, nodes below it might have been COWed and no longer exist, so we are accessing stale pointers. Getting the extent of a node/leaf immediately reused after COWing it and after its transaction is committed is not very common and happens mostly when running close to ENOSPC, plus the time spent processing a node/leaf by send is very short. So this is effectively a hard to hit problem. The bug fixes for the snapshotting of a RO snapshot work and are simple because snapshotting only COWs the root of a tree, so we stop transactions commits temporarily, make a copy of the root for send to use and allow transaction commits to happen again, while send is happy doing what it has to using the cloned root nodes - works because snapshotting doesn't COW any other nodes/leafs of the tree, i.e., the pointers in the root node don't change. > > > I don't know what runtime check you are mentioning that is too strict. > > It's this one, in send.c: > > /* > * We may have found an extent item that has changed > * only its disk_bytenr field and the corresponding > * inode item was not updated. This case happens due to > * very specific timings during relocation when a leaf > * that contains file extent items is COWed while > * relocation is ongoing and its in the stage where it > * updates data pointers. So when this happens we can > * safely ignore it since we know it's the same extent, > * but just at different logical and physical locations > * (when an extent is fully replaced with a new one, we > * know the generation number must have changed too, > * since snapshot creation implies committing the current > * transaction, and the inode item must have been updated > * as well). > * This replacement of the disk_bytenr happens at > * relocation.c:replace_file_extents() through > * relocation.c:btrfs_reloc_cow_block(). > */ > if (btrfs_file_extent_generation(leaf_l, ei_l) == > btrfs_file_extent_generation(leaf_r, ei_r) && > btrfs_file_extent_ram_bytes(leaf_l, ei_l) == > btrfs_file_extent_ram_bytes(leaf_r, ei_r) && > btrfs_file_extent_compression(leaf_l, ei_l) == > btrfs_file_extent_compression(leaf_r, ei_r) && > btrfs_file_extent_encryption(leaf_l, ei_l) == > btrfs_file_extent_encryption(leaf_r, ei_r) && > btrfs_file_extent_other_encoding(leaf_l, ei_l) == > btrfs_file_extent_other_encoding(leaf_r, ei_r) && > btrfs_file_extent_type(leaf_l, ei_l) == > btrfs_file_extent_type(leaf_r, ei_r) && > btrfs_file_extent_disk_bytenr(leaf_l, ei_l) != > btrfs_file_extent_disk_bytenr(leaf_r, ei_r) && > btrfs_file_extent_disk_num_bytes(leaf_l, ei_l) == > btrfs_file_extent_disk_num_bytes(leaf_r, ei_r) && > btrfs_file_extent_offset(leaf_l, ei_l) == > btrfs_file_extent_offset(leaf_r, ei_r) && > btrfs_file_extent_num_bytes(leaf_l, ei_l) == > btrfs_file_extent_num_bytes(leaf_r, ei_r)) > return 0; > } > > inconsistent_snapshot_error(sctx, result, "extent"); > return -EIO; > > This is the point where bees users report send failures. That seems yo happen when dedupe is running in parallel with send. That can be fixed by ignoring file extent items that changed without the inode item having changed as well (assuming this can only be caused by balance and dedupe, and there's no other path for this). When I added tjat check to fix a BUG_ON hit due to balance changing file extent items (without updating inode items), I wasn't aware that an ongoing dedupe could cause the same problem (back then, when I added that check , I thought that dedupe was like clone and would refuse to work on RO subvolumes/snapshots). However I don't see immediately how it's possible, since dedupe updates file extent items (or inserts new ones) in the same transaction where it updates the inode item, so it's really weird how send would find the new/modified file extent items from the commit root without seeing an update inode item as well. Btw, why weren't those reports reported to the btrfs mailing list? However the bigger problem, mentioned above, I really don't have a solution for it now, and it's far from being a trivial problem (perhaps simply making send use read locks for searches on the commit root would work). It can be caused by either a concurrent dedupe or a concurrent balance (and hopefully there's nothing else that can modify RO trees), so a generic solution needs to be found. Preventing dedupe and send from running in parallel, as you suggest in another reply, is something I suggested too earlier in this thread. Preventing balance and send running in parallel, well... it starts to get messy. So not doing nothing for now, that is, not applying this patch to disable dedupe on RO roots and wait for a better solution (best case, for 5.2 merge window), is reasonable and not something I'm against. David, would you consider at least excluding it from 5.1 to allow for a different solution to pop up for another merge window?  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5e84fd8d0634d056248b67463b42f6c85896a19 > > I don't really understand what that code is trying to achieve, though. > If we are diffing two subvol trees then we should just send anything > that's different--even if we occasionally send a few redundant extents > because we happen to be sending during a balance, because that's better > than bombing out completely. > > Or is the problem more like we lose our ei pointer in one of the subvols > when the extent tree changes under it? That would be harder to solve, > it would have to keep releasing and reacquiring everything. > > > You can definitely do dedupe on a read-only root and after it finishes > > do a send (either full or incremental), and it will work. > > If there's a way to make a subvol RW temporarily _without breaking > incremental send from that subvol_ (i.e. without clearing the parent > UUIDs, maybe also allowing only dedupe) then I have no objection. > > > > More details at: > > > > > > https://github.com/Zygo/bees/issues/79#issuecomment-429039036 > > > > > > > I understand it can break some applications, but adding other solution > > > > such as preventing send and dedupe from running in parallel > > > > (erroring out or block and wait for each other, etc) is going to be > > > > really ugly. There's always the workaround for apps to set the > > > > subvolume > > > > to RW mode, do the dedupe, then switch it back to RO mode. > > > > > > > > Thanks.
next prev parent reply index Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-12-12 18:05 [PATCH 0/4] Btrfs: a few more cleanups and fixes for clone/deduplication fdmanana 2018-12-12 18:05 ` [PATCH 1/4] Btrfs: move duplicated nodatasum check into common reflink/dedupe helper fdmanana 2019-01-11 14:55 ` David Sterba 2018-12-12 18:05 ` [PATCH 2/4] Btrfs: use cross mount point check for cloning and deduplication fdmanana 2018-12-13 16:02 ` David Sterba 2019-01-11 14:38 ` David Sterba 2018-12-12 18:05 ` [PATCH 3/4] Btrfs: check if destination root is read-only for deduplication fdmanana 2018-12-13 16:07 ` David Sterba 2019-01-31 16:39 ` Filipe Manana 2019-01-31 16:44 ` Hugo Mills 2019-02-18 15:38 ` David Sterba 2019-02-18 16:55 ` Filipe Manana 2019-02-12 17:59 ` Filipe Manana 2019-02-20 16:41 ` Zygo Blaxell 2019-02-20 16:54 ` Filipe Manana 2019-02-20 17:17 ` Zygo Blaxell 2019-02-22 11:13 ` Filipe Manana [this message] 2019-02-22 17:25 ` David Sterba 2019-02-21 16:54 ` Zygo Blaxell 2019-02-18 16:01 ` David Sterba 2018-12-12 18:05 ` [PATCH 4/4] Btrfs: remove no longer needed range length checks " fdmanana 2018-12-13 12:20 ` Nikolay Borisov 2019-01-31 16:31 ` Filipe Manana 2019-02-12 17:58 ` Filipe Manana 2019-02-18 15:10 ` David Sterba 2018-12-13 12:19 ` [PATCH 0/4] Btrfs: a few more cleanups and fixes for clone/deduplication Nikolay Borisov
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Linux-BTRFS Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \ email@example.com public-inbox-index linux-btrfs Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs AGPL code for this site: git clone https://public-inbox.org/public-inbox.git