From mboxrd@z Thu Jan 1 00:00:00 1970
From: pg@btrfs.for.sabi.co.UK (Peter Grandi)
To: Linux fs Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Shrinking a device - performance?
Date: Tue, 28 Mar 2017 15:43:24 +0100
Message-ID: <22746.30348.324000.636753@tree.ty.sabi.co.uk>
In-Reply-To: <4E13254F-FDE8-47F7-A495-53BFED814C81@flyingcircus.io>
References: <1CCB3887-A88C-41C1-A8EA-514146828A42@flyingcircus.io>
	<20170327130730.GN11714@carfax.org.uk>
	<3558CE2F-0B8F-437B-966C-11C1392B81F2@flyingcircus.io>
	<20170327194847.5c0c5545@natsu>
	<4E13254F-FDE8-47F7-A495-53BFED814C81@flyingcircus.io>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Sender: linux-btrfs-owner@vger.kernel.org

This is going to be long, because I am writing something detailed
hoping, perhaps pointlessly, that someone in the future will find it
by searching the list archives while doing research before setting up
a new storage system, and that they will be the kind of person who
tolerates reading messages longer than a tweet. :-)

> I’m currently shrinking a device and it seems that the
> performance of shrink is abysmal.

When I read this kind of statement I am reminded of all the cases
where someone left me to decatastrophize a storage system built on
"optimistic" assumptions. The usual "optimism" is what I call the
"syntactic approach": the axiomatic belief that any syntactically
valid combination of features will not only "work", but will do so
very fast and reliably, despite slow cheap hardware and "inattentive"
configuration. Some people call that the expectation that system
developers provide, or should provide, an "O_PONIES" option.

In particular I get very saddened when people use "performance" to
mean "speed", as the difference between the two is very great.

As a general consideration, shrinking a large filetree online and
in-place is an amazingly risky, difficult, slow operation and should
be a last desperate resort (as apparently it is in this case),
regardless of the filesystem type, and expecting otherwise is
"optimistic".

My guess is that very complex, risky, slow operations like that are
provided by "clever" filesystem developers for "marketing" purposes,
to win box-ticking competitions. That applies to those system
developers who do know better; I suspect that even some filesystem
developers are "optimistic" as to what they can actually achieve.

> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
> still using LVM underneath so that I can’t just remove a device
> from the filesystem but have to use the resize command.

That is actually a very good idea, because Btrfs multi-device is not
quite as reliable as DM/LVM2 multi-device.

> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>         Total devices 1 FS bytes used 18.21TiB
>         devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Maybe 'balance' should have been used a bit more.
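For reference, "using 'balance' a bit more" would look roughly like
the sketch below, run periodically and before starting a shrink; the
mount point /mnt/backy and the usage thresholds are illustrative
assumptions only, not values tuned for this system:

  # Rewrite data/metadata chunks that are at most half (resp. 30%)
  # full, so the allocated figure ("used" in the devid line) drops
  # back towards the actual data size; thresholds are arbitrary.
  btrfs balance start -dusage=50 -musage=30 /mnt/backy

  # Compare allocation against data before and after:
  btrfs filesystem show /mnt/backy
  btrfs filesystem df /mnt/backy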
> This has been running since last Thursday, so roughly 3.5 days
> now. The “used” number in devid1 has moved about 1TiB in this
> time. The filesystem is seeing regular usage (read and write)
> and when I’m suspending any application traffic I see about
> 1GiB of movement every now and then. Maybe once every 30
> seconds or so. Does this sound fishy or normal to you?

With consistent "optimism" this is a request to assess whether the
"performance" of some operations is adequate on a filetree without
telling us what the filetree contents look like, what the regular
workload is, or what the storage layer looks like. Being one of the
few system administrators crippled by a lack of psychic powers :-), I
have to rely here on guesses and inferences, and on having read the
whole thread, which contains some belated details.

From the ~22TB total capacity my guess is that the storage layer
involves rotating hard disks; from later details the filesystem
contents seem to be heavily reflinked files of several GB in size,
and the workload seems to be backups to those files from several
source hosts. Considering the general level of "optimism" in the
situation, my wild guess is that the storage layer is based on large,
slow, cheap rotating disks in the 4TB-8TB range, with very low
IOPS-per-TB.

> Thanks for that info. The 1min per 1GiB is what I saw too -
> the “it can take longer” wasn’t really explainable to me.

A contemporary rotating disk device can do anywhere from around
0.5MB/s transfer rate with small random accesses with barriers, up to
around 80-160MB/s in purely sequential access without barriers. 1GiB
per minute of simultaneous read-write means around 16MB/s of reads
plus 16MB/s of writes, which is fairly good *performance* (even if
slow *speed*), considering that moving extents around, even across
disks, involves quite a bit of randomish same-disk metadata updates;
on any filesystem type it usually all comes down to how many
randomish metadata updates need to be done, as those must be done
with barriers.

> As I’m not using snapshots: would large files (100+gb)

Using 100GB-sized VM virtual disks (never mind with COW) seems very
unwise to me to start with, but of course a lot of other people know
better :-). Just like a lot of other people know better that large
single-pool storage systems are awesome in every respect :-): cost,
reliability, speed, flexibility, maintenance, etc.

> with long chains of CoW history (specifically reflink copies)
> also hurt?

Oh yes... They are just about the worst case for Btrfs. But it is
also very "optimistic" to think that kind of stuff can work awesomely
on *any* filesystem type.

> Something I’d like to verify: does having traffic on the
> volume have the potential to delay this infinitely? [ ... ]
> it’s just slow and we’re looking forward to about 2 months
> worth of time shrinking this volume. (And then again on the
> next bigger server probably about 3-4 months).

Those are pretty typical times for whole-filesystem operations like
that on rotating disk media. There are some reports in the list and
IRC channel archives of 'scrub' or 'balance' or 'check' times for
filetrees of that size.

> (Background info: we’re migrating large volumes from btrfs to
> xfs and can only do this step by step: copying some data,
> shrinking the btrfs volume, extending the xfs volume, rinse
> repeat.

That "extending the xfs volume" will have consequences too, but
hopefully not too bad.
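In outline, each step of that shrink-one-side / grow-the-other loop
presumably looks something like the sketch below. Only vgsys/backy
comes from the 'fi show' output above; the sizes, the XFS LV name and
the mount points are made-up placeholders, and the ordering
(filesystem before LV when shrinking, LV before filesystem when
growing) is the part that actually matters:

  # Shrink the Btrfs filesystem first, to (at most) the intended new
  # size of the LV under it; this is the slow, data-moving part.
  btrfs filesystem resize 1:20T /mnt/backy

  # Only then shrink the LV underneath it to that same size.
  lvreduce -L 20T /dev/vgsys/backy

  # Give the freed extents to the XFS volume and grow it online;
  # LV name and mount point here are hypothetical.
  lvextend -L +2T /dev/vgsys/backupxfs
  xfs_growfs /srv/backupxfs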
> If someone should have any suggestions to speed this up and
> not having to think in terms of _months_ then I’m all ears.)

High IOPS-per-TB enterprise SSDs with capacitor-backed caches :-).

> One strategy that does come to mind: we’re converting our
> backup from a system that uses reflinks to a non-reflink based
> system. We can convert this in place so this would remove all
> the reflink stuff in the existing filesystem

Do you have enough space to do that? Either your reflinks are
pointless or they are saving a lot of storage. But I guess that you
can do it one 100GB file at a time...

> and then we maybe can do the FS conversion faster when this
> isn’t an issue any longer. I think I’ll

I suspect the de-reflinking plus shrinking will take longer, but I am
not totally sure.

> Right. This is an option we can do from a software perspective
> (our own solution - https://bitbucket.org/flyingcircus/backy)

Many thanks for sharing your system, I'll have a look.

> but our systems in use can’t hold all the data twice. Even
> though we’re migrating to a backend implementation that uses
> less data than before I have to perform an “inplace” migration
> in some way. This is VM block device backup. So basically we
> migrate one VM with all its previous data and that works quite
> fine with a little headroom. However, migrating all VMs to a
> new “full” backup and then wait for the old to shrink would
> only work if we had a completely empty backup server in place,
> which we don’t.

> Also: the idea of migrating on btrfs also has its downside -
> the performance of “mkdir” and “fsync” is abysmal at the
> moment.

That *performance* is actually pretty good; it is the *speed* that
may be low, but that's obvious. Please consider looking at these
entirely typical speeds:

  http://www.sabi.co.uk/blog/17-one.html?170302#170302
  http://www.sabi.co.uk/blog/17-one.html?170228#170228

> I’m waiting for the current shrinking job to finish but this
> is likely limited to the “find free space” algorithm. We’re
> talking about a few megabytes converted per second.

Sigh. Well, if the filetree is being actively used for COW backups
while being shrunk, that involves a lot of randomish IO with
barriers.

>> I would only suggest that you reconsider XFS. You can't
>> shrink XFS, therefore you won't have the flexibility to
>> migrate in the same way to anything better that comes along
>> in the future (ZFS perhaps? or even Bcachefs?). XFS does not
>> perform that much better over Ext4, and very importantly,
>> Ext4 can be shrunk.

ZFS is a complicated mess too, with an intensely anisotropic
performance envelope, and not necessarily that good for backup
archival for various reasons. I would consider looking instead at
using a collection of smaller "silo" JFS, F2FS, NILFS2 filetrees as
well as XFS, and using MD RAID in RAID10 mode instead of DM/LVM2:

  http://www.sabi.co.uk/blog/16-two.html?161217#161217
  http://www.sabi.co.uk/blog/17-one.html?170107#170107
  http://www.sabi.co.uk/blog/12-fou.html?121223#121223
  http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
  http://www.sabi.co.uk/blog/12-fou.html?121218#121218

And yes, Bcachefs looks promising, but I am sticking with Btrfs:

  https://lwn.net/Articles/717379

> That is true. However, we have moved the expected feature set
> of the filesystem (i.e. cow)

That feature set is arguably not appropriate for VM images, but lots
of people know better :-).

> down to “store files safely and reliably” and we’ve seen too
> much breakage with ext4 in the past.

That is extremely unlikely unless your storage layer has unreliable
barriers, and then you need a lot of "optimism".
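Going back to that in-place de-reflinking: done one image at a time,
it presumably amounts to something like the sketch below, which needs
enough free space for one fully materialized copy at a time; the
mount point and file naming are hypothetical:

  # Replace each reflinked backup image with a fully materialized
  # (non-shared-extent) copy of itself, one 100GB-class file at a time.
  cd /srv/backy                 # hypothetical mount point
  for img in */*.img; do
      cp --reflink=never --sparse=always "$img" "$img.flat"
      mv "$img.flat" "$img"
  done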
> Of course “persistence means you’ll have to say I’m sorry” and
> thus with either choice we may be faced with some issue in the
> future that we might have circumvented with another solution
> and yes flexibility is worth a great deal.

Enterprise SSDs with high small-random-write IOPS-per-TB can give
both excellent speed and high flexibility :-).

> We’ve run XFS and ext4 on different (large and small)
> workloads in the last 2 years and I have to say I’m much more
> happy about XFS even with the shrinking limitation.

XFS and 'ext4' are essentially equivalent, except for the fixed-size
inode table limitation of 'ext4' (and XFS reportedly has
finer-grained locking). Btrfs is nearly as good as either on most
workloads in single-device mode, without using the more complicated
features (compression, qgroups, ...) and with appropriate use of the
'nocow' options, and it gives checksums on data too if needed.

> To us ext4 is prohibitive with its fsck performance and we do
> like the tight error checking in XFS.

It is very pleasing to see someone care about the speed of whole-tree
operations like 'fsck', a very often forgotten "little detail". But
in my experience 'ext4' checking is quite competitive with XFS
checking and repair, at least in recent years, as both have been
hugely improved. XFS checking and repair still require a lot of RAM
though.

> Thanks for the reminder though - especially in the public
> archive making this tradeoff with flexibility known is wise to
> communicate. :-)

"Flexibility" in filesystems, especially on rotating disk storage
with extremely anisotropic performance envelopes, is very expensive,
but of course lots of people know better :-).
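PS: since the 'nocow' options came up above: for directories holding
VM or backup images the usual per-directory way to apply that (rather
than filesystem-wide with the 'nodatacow' mount option) is the NOCOW
file attribute, roughly as below. The path is hypothetical, the
attribute only takes effect for files created after it is set, and
note that NOCOW files also lose data checksums and compression:

  mkdir -p /srv/vm-images
  chattr +C /srv/vm-images    # new files in here inherit NOCOW
  lsattr -d /srv/vm-images    # should now show the 'C' attribute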