From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from [195.159.176.226] ([195.159.176.226]:41410 "EHLO blaine.gmane.org" rhost-flags-FAIL-FAIL-OK-OK) by vger.kernel.org with ESMTP id S932255AbeGHIH4 (ORCPT ); Sun, 8 Jul 2018 04:07:56 -0400
Received: from list by blaine.gmane.org with local (Exim 4.84_2) (envelope-from ) id 1fc4hP-0006n4-15 for linux-btrfs@vger.kernel.org; Sun, 08 Jul 2018 10:05:39 +0200
To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: how to best segment a big block device in resizeable btrfs filesystems?
Date: Sun, 8 Jul 2018 08:05:27 +0000 (UTC)
Message-ID:
References: <20180629064354.kbaepro5ccmm6lkn@merlins.org>
 <20180701232202.vehg7amgyvz3hpxc@merlins.org>
 <5a603d3d-620b-6cb3-106c-9d38e3ca6d02@cn.fujitsu.com>
 <20180702032259.GD5567@merlins.org>
 <9fbd4b39-fa75-4c30-eea8-e789fd3e4dd5@cn.fujitsu.com>
 <20180702140527.wfbq5jenm67fvvjg@merlins.org>
 <3728d88c-29c1-332b-b698-31a0b3d36e2b@gmx.com>
 <20180702151853.mwlrinipbihq46zu@merlins.org>
 <20180702173438.7c2vhflvtncfb5gz@merlins.org>
 <8de54b29-c718-0230-09b2-f849e3ad01df@gmail.com>
 <9e79a4b4-af3c-20bf-ff8e-748b9ab46bf6@gmail.com>
 <293ab6d6-f609-0e9b-3d33-053336e43744@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan wrote:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>>
>>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>>
>> No.
>>
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.
>> But other than simply by odds not using them again immediately, btrfs
>> has no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>>
>>
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have
stopped". How soon is "immediately", and does the writes-stopped
condition account for data that has reached the device-hardware write
buffer (so is no longer being transmitted to the device across the bus)
but has not actually been written to media, or not?

On a reasonably quiescent system, multiple empty write cycles are likely
to have occurred since the last write barrier, and anything in-process
is likely to have made it to media even if software is missing a write
barrier it needs (software bug) or the hardware lies about honoring the
write barrier (hardware bug, allegedly sometimes deliberate on hardware
willing to gamble with your data that a crash won't happen in a critical
moment, a somewhat rare occurrence, in order to improve normal operation
performance metrics).

On an IO-maxed system, data and write-barriers are coming down as fast
as the system can handle them, and write-barriers become critical. If
the system crashes after something was supposed to get to media but
didn't, either because of a missing write barrier or because the
hardware/firmware lied about the barrier and said the data was on-media
when it wasn't, the btrfs atomic-cow commit guarantees of consistent
state at each commit go out the window.

At this point it becomes useful to have a number of previous "guaranteed
consistent state" roots to fall back on, with the /hope/ being that at
least /one/ of them is usably consistent. If all but the last one are
wiped due to trim...

When the system isn't write-maxed, the write will almost certainly have
made it regardless of whether the barrier is there or not, because
there's enough idle time to finish the current write before another one
comes down the pipe. So the last-written root is almost certain to be
fine regardless of barriers, and the history of past roots doesn't
matter even if there's a crash.

If "immediately after writes have stopped" is strictly defined as a
condition when all writes, including the btrfs commit updating the
current root and the superblock pointers to the current root, have
completed, with no new writes coming down the pipe in the meantime that
might have delayed a critical update if a barrier was missed, then
trimming old roots in this state should be entirely safe, and the
distinction between that state and "while writes are happening" is
clear.

But if "immediately after writes have stopped" is less strictly defined,
then the distinction between that state and "while writes are happening"
remains blurry at best, and having old roots around to fall back on in
case a write-barrier was missed (for whatever reason, hardware or
software) becomes a very good thing.
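
To make the two cases concrete, here's a toy Python model of the
argument. It's only a sketch under stated assumptions, not how btrfs is
actually implemented: the names (Root, ToyFilesystem, crash_scenario)
are made up for illustration, a missed barrier is modeled as a commit
whose root simply isn't intact, and trim is assumed to be the worst-case
kind that wipes old roots immediately.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Root:
    generation: int
    intact: bool = True    # False models a commit that did not fully reach media
    trimmed: bool = False  # True models blocks the device has already discarded


@dataclass
class ToyFilesystem:
    """Hypothetical toy model, not the real btrfs on-disk logic."""
    roots: List[Root] = field(default_factory=list)

    def commit(self, intact: bool = True) -> None:
        # Each commit writes a new root; older roots are left in place.
        generation = self.roots[-1].generation + 1 if self.roots else 1
        self.roots.append(Root(generation, intact))

    def trim(self) -> None:
        # Old roots count as unused space, so everything except the
        # current root is handed back to the device (worst case: wiped now).
        for root in self.roots[:-1]:
            root.trimmed = True

    def mount(self, use_backup_root: bool = False) -> Optional[int]:
        """Return the generation that mounts, or None if nothing is usable."""
        if use_backup_root:
            candidates = list(reversed(self.roots))  # newest first
        else:
            candidates = self.roots[-1:]             # current root only
        for root in candidates:
            if root.intact and not root.trimmed:
                return root.generation
        return None


def crash_scenario(trim_before_crash: bool) -> Optional[int]:
    """Busy system: the newest commit is damaged, then the system crashes."""
    fs = ToyFilesystem()
    for _ in range(3):
        fs.commit()              # generations 1-3 reach media intact
    fs.commit(intact=False)      # generation 4: barrier missed, root unusable
    if trim_before_crash:
        fs.trim()                # wipes generations 1-3
    return fs.mount(use_backup_root=True)
```

In this model crash_scenario(False) falls back to generation 3, while
crash_scenario(True) finds nothing mountable. On an idle system every
commit is intact, so trimming the history costs nothing.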

Of course the fact that trim/discard itself is an instruction written to
the device in the combined command/data stream complicates the picture
substantially. If those write barriers get missed, who knows what state
the new root is in, and if the old ones got erased... But again, on a
mostly idle system, it'll probably all "just work", because the writes
will likely all make it to media regardless: there's not a bunch of
other writes competing for limited write bandwidth and making ordering
critical.

>> In the context of the discard mount option, that can mean there's never
>> any old roots available ever, as they've already been cleaned up by the
>> hardware due to the discard option telling the hardware to do it.
>>
>> But even not using that mount option, and simply doing the trims
>> periodically, as done weekly by for instance the systemd fstrim timer
>> and service units, or done manually if you prefer, obviously
>> potentially wipes the old roots at that point. If the system's
>> effectively idle at the time, not much risk as the current commit is
>> likely to represent a filesystem in full stasis, but if there's lots of
>> writes going on at that moment *AND* the system happens to crash at
>> just the wrong time, before additional commits have recreated at least
>> a bit of root history, again, you'll potentially be left without any
>> old roots for the usebackuproot mount option to try to fall back to,
>> should it actually be necessary.
>>
>>
> Sorry? You are just saying that "previous state can be discarded before
> new state is committed", just more verbosely.

No, it's more that the new state gets committed before the old is
trimmed, but should it turn out to be unusable (due to missing write
barriers, etc, which is more of an issue on a write-bottlenecked
system), having a history of old roots/states around to fall back to can
be very useful.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman