From mboxrd@z Thu Jan 1 00:00:00 1970
To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: All free space eaten during defragmenting (3.14)
Date: Wed, 4 Jun 2014 09:21:41 +0000 (UTC)
References: <1703083.hLnNuPsKpY@linux-suse.hu> <538B8F76.9090500@petezilla.co.uk> <538CE4A0.9020105@petezilla.co.uk> <538E4A83.30903@petezilla.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-btrfs-owner@vger.kernel.org

Peter Chant posted on Tue, 03 Jun 2014 23:21:55 +0100 as excerpted:

> On 06/03/2014 05:46 AM, Duncan wrote:
>
>> Of course if you were using something like find and executing defrag on each found entry, then yes it would recurse, as find would recurse across filesystems and keep going (unless you told it not to using find's -xdev option).
>
> I did not know the recursive option existed. However, I'd previously cursed the tools for not having a recursive option or being recursive by default. If there is now a recursive option, it would be really perverse to use find to implement a recursive defrag.

Defrag's -r/recursive option is reasonably new, but checking the btrfs-progs git tree (since I run the git version) says it was commit c2c5353b, which git describe labels v0.19-725, so it should be in btrfs-progs v3.12. So it's not /that/ new. Anyone still running something earlier than that really should update. =:^)

But the wiki recommended using find from back before the builtin recursive option existed, and I can well imagine people with already-working scripts not wanting to fix what isn't (for them) broken. =:^) So I imagine there will be find-and-defrag users for some time, tho they should even now be on their way to becoming a rather small percentage, at least for folks following the keep-current recommendations.
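For anyone reading along who wants the concrete commands, the two approaches look roughly like this (the /mnt/data mountpoint is just a placeholder; check the options against your own btrfs-progs version):

  # old wiki-style recursion via find, kept to one filesystem with -xdev
  find /mnt/data -xdev -type f -exec btrfs filesystem defragment {} +

  # builtin recursion, btrfs-progs v3.12 or newer
  btrfs filesystem defragment -r /mnt/data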
Meanwhile, this question is bugging me, so let me just ask it. The OP was from a different email address (szotsaki@gmail), and once I noticed that I've been assuming that you and the OP are different people, tho in my first reply to you I assumed you were the OP. So just to clear things up: you are different people, and I can't assume that what he wrote about his case applies to you, correct? =:^)

>> Meanwhile, you mention the autodefrag mount option. Assuming you have it on all the time, there shouldn't be that much to defrag, *EXCEPT* if the -c/compress option is used as well. If you aren't also using the compress mount option by default, then you are effectively telling defrag to compress everything as it goes, so it will defrag-and-compress all files. Which wouldn't be a problem with snapshot-aware-defrag, as it'd compress for all snapshots at the same time too. But with snapshot-aware-defrag currently disabled, that would effectively force ALL files to be rewritten in order to compress them, thereby breaking the COW link with the other snapshots and duplicating ALL data.

> I've got compress=lzo, options from fstab:
> device=/dev/sdb,device=/dev/sdc,autodefrag,defaults,inode_cache,noatime,compress=lzo
>
> I'm running kernel 3.13.6. Not sure if snapshot-aware-defrag is enabled or disabled in this version.

A git search (Linus' mainline tree) says that was commit 8101c8db, merge commit 878a876b, with git describe labeling the merge commit as v3.14-rc1-13, so it would be in v3.14-rc2. However, the commit in question was CCed to stable@, so it should have made it into a 3.13.x stable release as well. Whether it's in 3.13.6 specifically, I couldn't say without checking the stable tree or changelog, which should be easier for you to do since you're actually running it. (Hint: I simply searched on "defrag" here; it ended up being about the third hit back from 3.14.0, I believe, so it shouldn't be horribly buried, at least.)

> Unfortunately I really don't understand how COW works here. I understand the basic idea but have no idea how it is implemented in btrfs or any other fs.

FWIW, I think only the kernel/filesystem devs, or at least developer types, /really/ understand COW, but I /think/ I have a reasonable sysadmin's-level understanding of the practical effects in terms of btrfs, simply from watching the list.

Meanwhile, not that it has any bearing on this thread, but about your mount options: FWIW, you may wish to remove that inode_cache option. I don't claim to have a full understanding, but from what I've picked up from various dev remarks, it's not necessary at all on 64-bit systems (well, unless you have really small files filling an exabyte-size filesystem!), since the inode-number space is large enough that finding free inode numbers isn't an issue. While it can be of help in specific situations on 32-bit systems, there are two problems with it that make it unsuitable for the general case: (1) on large filesystems (I'm not sure how large, but I'd guess TiB scale) there's a danger of inode-number collision due to 32-bit overflow, and (2) it must be regenerated at every mount, which at least on TiB-scale spinning rust can trigger several minutes of intense drive activity while it does so. (The btrfs wiki now says it's not recommended, but has a somewhat different explanation. While I'm not a coder and thus in no position to say for sure based on the code, I believe the wiki's explanation isn't quite correct, but either way, it's still not recommended.)

The use-cases where inode_cache might be worthwhile are thus all 32-bit, and include things like busy email servers with lots of files being constantly created and deleted. If in doubt, disable it.

Oh, and while I'm at it, I might as well mention that the "defaults" mount option is normally not necessary, except as a field-holder in fstab if no non-default options are being used. That's the whole point of "defaults": they're the default, and thus don't need to be passed. Tho (unlike inode_cache) it does no harm.
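If you do drop inode_cache (and defaults too, tho that one is only cosmetic), the fstab line you posted would end up something like this; the mountpoint and the trailing dump/pass fields here are just placeholders, keep your own:

  /dev/sdb  /mnt/data  btrfs  device=/dev/sdb,device=/dev/sdc,autodefrag,noatime,compress=lzo  0 0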
>> If the data that you're trying to defrag is snapshotted, the defrag will currently break the COW link and double usage. However, as long as you have the space to spare and are deleting the snapshots in a reasonable time (as it sounds like you are, since it seems you're doing snapshots only to enable a stable backup), once you delete all the snapshots from before the defrag, you should get the space back, so it's not a permanent issue.

> Hmm. From your previous discussion I get the impression that it is not a problem if it has always been compressed, or always not compressed, but it blows up if the compression setting is changed - e.g. the compressed file and uncompressed file are effectively completely different.

Yes. Except that I'm not /absolutely/ sure that a "null defrag", that is, one that doesn't have anything to do since everything is already defragged and the compression is the same either way, actually does nothing, thereby leaving the COW linkages intact. I /believe/ that to be the case, and if it /is/ the case, a defrag with the same compression on already-defragged (due to autodefrag) data /shouldn't/ blow up data usage against snapshots, since it wouldn't actually move anything. But I don't /know/ that for sure, and not being a coder I can't so easily just go and look at the code to see, either.

And since I basically don't use snapshots, preferring actual backups, it wouldn't be that easy for me to check by simply trying a few GiB of defrag (since I too use compress=lzo,autodefrag) and comparing before and after usage, since there are no snapshots it'd be duplicating against. But it should be a fairly easy thing to check. (This is where I wonder if you're the OP. Obviously if so, /something/ triggered that change in size.)

Assuming you're not the OP and you're more concerned about future behavior: since you're already using autodefrag and snapshots, a before-and-after btrfs filesystem df check, with a defrag of a few GiB in between (enough to tick over a couple of digits of data usage in the df output, should it be doubling things), should help prove it one way or the other. If that doesn't increase usage, the same test but with say -czlib (since you're currently using lzo) should force the COW-link breakage and double the usage for that few GiB, thereby proving both the leave-alone case of no change and the COW-link-breakage doubling. It'd cost that few GiB of extra space usage, but it should clear up the question one way or the other.
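In concrete terms, something roughly like this; the paths are placeholders, so pick a directory holding a few GiB of snapshotted data:

  btrfs filesystem df /mnt/data       # note the data used figure
  btrfs filesystem defragment -r /mnt/data/testdir
  btrfs filesystem df /mnt/data       # if the null-defrag theory holds, used should be nearly unchanged
  btrfs filesystem defragment -r -czlib /mnt/data/testdir
  btrfs filesystem df /mnt/data       # the forced recompression should roughly double usage for that data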
> My 'main' backup is to rsync to an ext4-formatted drive. I have a second backup (reminder to use it). That is btrfs and uses snapshots. However, I rsync to it; I'm assuming that if the btrfs that I am backing up is corrupted then there is a danger that send/receive could propagate errors? Without knowing any better it seems like something worth eliminating.

From what I've seen on-list, if send/receive completes on both sides (send and receive) without error, you have a pretty reliable backup. If there's anything wrong, one side or the other errors out.

Actually, due to various corner-case bugs they're still flushing out, send/receive can error out even if both sides are fine, too, just because something happened that nobody had thought of yet. One recently fixed bug, for example, was a scenario where two subdirs were originally nested with B inside A, but then switched positions so A was inside B. Until that fix, send/receive couldn't make sense of that situation and would simply error out. So as long as send/receive works, it should be reliable. The trouble isn't that it propagates filesystem corruption (it'll error out before it does that), but that at present there are still enough legitimate but strange corner cases it errors out on that it isn't a reliable backup mechanism for /that/ reason, not because it'll propagate corruption.

So you can use it with confidence as long as it's working. Just be prepared for it to quit working without notice, even after it has been working reliably for a while, and have a fallback backup method ready to go should that happen.

That said, since you're already using rsync, I'd suggest staying with that for now. There will be plenty of time to switch to the more efficient btrfs send/receive after both btrfs and send/receive have matured rather longer.

> I've amended my scripts so the toplevel subvol and snapshots are now only mounted during snapshot creation and deletion.
>
>>> This is an interesting point. The changes are not too radical, all I need to do is add code to my snapshot scripts to mount and unmount my toplevel btrfs tree when performing a snapshot. Not sure if this causes any significant time penalty, as in slowing of the system with any heavy IO. Since snapshots are run by cron, the time taken to complete is not critical, rather whether the act of mounting and unmounting causes any slowing due to heavy IO.
>>
>> [U]nless there's something strange going on, mounts shouldn't affect ongoing I/O much at all. Umounts are slightly different, in that on btrfs there can be some housekeeping that must be done before the filesystem is fully unmounted, which could in theory disrupt ongoing I/O temporarily, but that's limited to writable mounts where some serious write activity occurred, such that if you're just mounting to do a snapshot and umounting again, I don't believe that should be a problem, since in the normal case there will be only a bit of metadata to update from the process of doing the snapshot.

> This is an interesting point. When I first modified my scripts to mount/umount the top-level sub-volume I found things slowing dramatically. Heavy disk IO and usage of btrfs-cleaner, btrfs-transact and btrfs-submit for minutes on end. Brief pauses whilst the system became usable.

That /might/ be the inode_cache thing. Like I said, that's not recommended, with one of the downsides being high I/O at mount time. I definitely wasn't considering it when I said mounts shouldn't affect ongoing I/O! So try without it; I can't say that's the /entire/ problem, but it certainly won't be helping things!
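FWIW, the mount/snapshot/umount sequence being discussed would look something like this sketch (the device, mountpoint and subvolume names are made up here; the btrfs toplevel is subvolid=5):

  #!/bin/sh
  # mount the toplevel subvolume, take a read-only snapshot, unmount again
  mount -o subvolid=5 /dev/sdb /mnt/btrfs-top
  btrfs subvolume snapshot -r /mnt/btrfs-top/home /mnt/btrfs-top/snapshots/home.$(date +%Y%m%d-%H%M)
  umount /mnt/btrfs-top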
> Something else odd seems to be happening right now. I'm cleaning out some directories to free up disk space, /tmp-old out of / and also associated snapshots. This is on SSD but I can hear my traditional HDDs thrashing. Separate btrfs file systems. Presumably a coincidence.
>
> Hopefully things will settle down. Though the system is still doing a lot of disk io it is a lot more usable than earlier.

One other thing that might be part of it: currently, btrfs does a lot of re-scanning (effectively btrfs device scan), as neither userspace nor the kernel properly caches and reuses active btrfs filesystem and device information. So mounts and various btrfs userspace actions will rescan instead of using a cache, while OTOH the kernel btrfs subsystem can sometimes be oblivious to device changes that other bits of the kernel already know about. There have actually been some quite recent patches targeting that, but I think they'll hit kernel and userspace v3.16, as at least some of them were too late for kernel v3.15.

But try without inode_cache, as I suspect that may well be a good part of it right there, and I'd really like to know whether I'm right or wrong on that.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman