From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from [195.159.176.226] ([195.159.176.226]:45569 "EHLO blaine.gmane.org" rhost-flags-FAIL-FAIL-OK-OK) by vger.kernel.org with ESMTP id S932299AbcHIUbE (ORCPT ); Tue, 9 Aug 2016 16:31:04 -0400
Received: from list by blaine.gmane.org with local (Exim 4.84_2) (envelope-from ) id 1bXDfx-0000jD-NM for linux-btrfs@vger.kernel.org; Tue, 09 Aug 2016 22:31:01 +0200
To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: Strange behavior after "rm -rf //"
Date: Tue, 9 Aug 2016 20:30:56 +0000 (UTC)
Message-ID:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

Chris Murphy posted on Tue, 09 Aug 2016 11:10:08 -0600 as excerpted:

> On Mon, Aug 8, 2016 at 12:38 PM, Ivan Sizov wrote:
>> 2016-08-08 20:13 GMT+03:00 Chris Murphy :
>>> Just a wild guess, the deletions may be in the tree log and haven't
>>> been applied to the other trees (fs tree, extent tree, etc). So yes
>>> I'd expect they get deleted on a rw mount.
>>>
>>> This is what kernel? Because kernel 4.6 offers mount option
>>> "nologreplay" which suggests even if you do mount -r that log replay
>>> happens, so you shouldn't see these deleted files unless you mount ro
>>> *and* use nologreplay mount option.
>>
>> Live USB has kernel 4.5.7. Maybe I should try to run "btrfs rescue
>> zero-log" and then mount RW? Will the files safe in that case?
>
> Depends on what's in the log that you're zeroing out. It's entirely
> possible other things are lost, not just the incomplete deletion. And
> also I have no idea if the deletion is entirely contained in only the
> tree log.

It's worth noting a critical difference between btrfs replay logs and conventional filesystem replay logs, however, with the result being that there's a fair chance the log replay has absolutely nothing to do with this case at all, and that it's simply commit vs. crash timing.

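To make "commit vs. crash timing" concrete, here is a toy Python model. It is only a sketch under loud assumptions: ToyCowFs and all of its methods are made up for illustration and have nothing to do with the real btrfs code or on-disk format. It keeps the state as of the last commit, a list of changes made since then, and a small log holding only what was fsynced. The mechanics it imitates are explained in the paragraphs that follow.

```python
class ToyCowFs:
    """Toy model, NOT real btrfs: committed state, pending changes, fsync log."""

    def __init__(self, files):
        self.committed = set(files)  # state as of the last successful commit
        self.pending = []            # (op, name) changes since that commit
        self.fsync_log = []          # changes made durable early by fsync

    def create(self, name):
        self.pending.append(("create", name))

    def unlink(self, name):
        self.pending.append(("unlink", name))

    def fsync(self, name):
        # Only the pending changes for this one name go into the log.
        for change in self.pending:
            if change[1] == name and change not in self.fsync_log:
                self.fsync_log.append(change)

    def commit(self):
        # The periodic commit makes every pending change durable at once.
        self.committed = self._apply(self.committed, self.pending)
        self.pending = []
        self.fsync_log = []

    def crash_and_mount(self, replay_log=True):
        # Uncommitted changes are lost; the fsync log survives the crash.
        self.pending = []
        if replay_log:
            self.committed = self._apply(self.committed, self.fsync_log)
            self.fsync_log = []
        return sorted(self.committed)

    @staticmethod
    def _apply(state, changes):
        # Hypothetical helper: apply (op, name) changes to a copy of the state.
        result = set(state)
        for op, name in changes:
            if op == "create":
                result.add(name)
            else:
                result.discard(name)
        return result


fs = ToyCowFs(["a.txt", "b.txt"])
fs.create("new.txt")
fs.fsync("new.txt")   # logged, so it survives a crash
fs.unlink("a.txt")    # no fsync, no commit yet
fs.unlink("b.txt")
print(fs.crash_and_mount())  # ['a.txt', 'b.txt', 'new.txt']
```

In this model the two unlinks were never committed and never fsynced, so after the crash both files are back, while the fsynced file survives via log replay. Passing replay_log=False is the toy stand-in for mounting with nologreplay.
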
Btrfs is copy-on-write, with commits designed to be atomic -- changes work their way up the tree until a root commit finalizes them, and if a crash occurs, all changes since the last successful commit (with a commit every 30 seconds by default, and a mount option to change that) are normally lost. Because the filesystem is copy-on-write, that means the filesystem should be consistent as of that commit, and changes made after it will be in different locations that haven't made it into the tree yet, since the next commit wasn't able to happen due to the crash. Thus, the stuff that conventional filesystems log simply doesn't apply to btrfs at all.

By contrast, conventional filesystems rewrite a lot of data and metadata in-place, and logging lets them write the changes they intend to make out to a temporary area before they actually write them to the permanent location. In the event of a crash, anything only partially written to the permanent location will be replayed from the log, while if the crash happened while writing the log itself, leaving that record corrupt, the record won't be replayed and the old content will remain in place.

Tho of course writing all data twice tends to hit performance rather hard, so what most journaling filesystems do is only log the metadata, not the actual data. This lets them be much faster than if they were logging the data, and normally protects the filesystem structure, but there's some chance that files rewritten in-place will be corrupt if a crash happens at the wrong moment. Still, it limits the damage to only the file being written at the time, and does away with the requirement to fsck the entire filesystem after every crash.

So what /does/ the btrfs log do, then? Good question!
=:^)

Rather simply, keeping in mind that commits normally only trigger every 30 seconds, the btrfs log tracks fsyncs (individual file syncs, as opposed to whole filesystem syncs), recording them in a replay log. That lets the filesystem return success on the fsync, meaning the file actually reached permanent storage (often ssds these days, so not always "disk" as it used to be), without either waiting up to 30 seconds for the next root tree commit or forcing a full filesystem sync and commit, possibly including many other files, when only the one was requested.

So with btrfs, it's *only* fsyncs that are logged to the replay log, and that only to be able to truthfully return that the file was written to permanent storage. Normal filesystem operations are already atomic due to the copy-on-write semantics, and thus don't need to be logged.

So then, the question becomes one of whether rm -rf, or whatever other actual command was used to do the deletes, called fsync, or not. If the command didn't call fsync, then it would have been the normal btrfs commit mechanism, again, every 30 seconds by default, that would have been in play here, and the btrfs log replay wouldn't have anything to do with it.

Which I actually strongly suspect to be the case. It's likely that the last commit wasn't completed, so btrfs reverted to the last atomic commit. That would also explain why a read-only mount /without/ the nologreplay option still showed the files: a read-only mount does normally still replay that fsync log, so if the deletions had been caught in it, the files shouldn't have shown up at all.

Meanwhile, back to the original scenario, it's just another demonstration of what every good sysadmin knows, often from hard experience: admin fat-fingering -- the human factor -- PEBKAC -- is as much of a danger to the data and the system as device or software failure, if not more. If would-be backups can't protect from that, they're not backups.
Which is why simple RAID fails as a backup method, even if it can protect against device failure.

And of course, there are only two cases for the value of the data: it's either worth the hassle and resources to back up, or it's not. If it's not backed up, then by definition of not having that backup, you're defining it as the latter, no matter any claims to the contrary. In this case, as too many unfortunate people eventually find out, actions, or the lack of them, speak louder than words, and if the data is lost due to not having a backup, well, the only thing to do is to be happy that the thing your actions defined as worth more than that data, the time/hassle/resources necessary to make the backup, was saved.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman