From: General Zed <email@example.com>
To: Zygo Blaxell <firstname.lastname@example.org>
Cc: email@example.com
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Wed, 18 Sep 2019 00:37:42 -0400
Message-ID: <20190918003742.Horde.uCadf9qXuYdCVqBfASzDeuN@server53.web-hosting.com>
In-Reply-To: <20190917234044.GH24379@hungrycats.org>

Quoting Zygo Blaxell <firstname.lastname@example.org>:

> On Tue, Sep 17, 2019 at 06:07:24AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <email@example.com>:
>> > I doubt that on a 50TB filesystem you need to read the whole tree...are
>> > you going to globally optimize 50TB at once? That will take a while.
>>
>> I need to read the whole free-space tree to find a few regions with most
>> free space. Those will be used as destinations for defragmented data.
>
> Hmmm...I have some filesystems with 2-4 million FST entries, but
> they can be read in 30 seconds or less, even on the busy machines.
>
>> If a mostly free region of sufficient size (a few GB) can be found faster,
>> then there is no need to read the entire free-space tree. But, on a disk
>> with less than 15% free space, it would be advisable to read the entire free
>> space tree to find the less-crowded regions of the filesystem.
>
> A filesystem with 15% free space still has 6 TB of contiguous.
> Not hard to find some room! You can just look in the chunk tree, all
> the unallocated space there is multi-GB contiguous chunks.

If there is a free chunk, defrag can take it. If there isn't, then it can't.

> Every btrfs
> is guaranteed to have a chunk tree, but some don't have free space trees
> (though the ones that don't should probably be strongly encouraged to
> enable that feature). So you probably don't even need to look for free
> space if there is unallocated space.

Yes, but only in that specific case. The free space tree scan can be
skipped in that lucky situation. We are talking about a pre-defrag situation.
It has to be assumed that the free space is badly fragmented.

> On the other hand, you do need to measure fragmentation of the existing
> free space, in order to identify the highest-priority areas for defrag.
> So maybe you read the whole FST anyway, sort, and spit out a short list.
> It's not nearly as expensive as I thought.

The primary purpose of the free-space tree scan is to find an
above-average empty area in the low-free-space situation (<10% free space).

> I did notice one thing while looking at filesystem metadata vs commit
> latency the other day: btrfs's allocation performance seems to be
> correlated to the amount of free space _in the block groups_, not _on the
> filesystem_. So after deleting 2% of the files on a 50% full filesystem,
> it runs as slowly as a 98% full one. Then when you add 3% more data to fill
> the free space and allocate some new block groups, it goes fast again.
> Then you delete things and it gets slow again. Rinse and repeat.

Obviously, more work has to be done to improve that allocator.

>> > My dedupe runs continuously (well, polling with incremental scan).
>> > It doesn't shut down.
>>
>> Ah... so I suggest that the defrag should temporarily shut down dedupe, at
>> least in the initial versions of defrag. Once both defrag and dedupe are
>> working standalone, the merging effort can begin.
>
> Pausing dedupe just postpones the inevitable. The first thing dedupe
> will do when it resumes is a new-data scan that will find all the new
> defrag extents, because dedupe has to discover what's in them and update
> the now out-of-date physical location data in the hash table.

It just postpones the inevitable, but you missed the point. The point of
shutting down dedupe is to avoid nasty bugs caused by dedupe-defrag
interaction.

> When defrag
> and dedupe are integrated, the hash table gets updated by defrag in
> place--same table entries in a different location.
>
> I have that problem _right now_ when using balance to defragment free
> space in block groups. Dedupe performance drops while the old relocated
> data is rescanned--since it's old, we already found all the duplicates,
> so the iops of the rescan are just repairing the damage to the hash
> table that balance did.
>
> That said...I also have a plan to make dedupe's new-data scans about
> two orders of magnitude faster under common conditions. So maybe in the
> future dedupe won't care as much about rereading stuff, as rereading will
> add at most 1% redundant read iops. That still requires running dedupe
> first (or in the kernel so it's in the write path), or have some way for
> defrag to avoid touching recently added data before dedupe gets to it,
> due to the extent-splitting duplicate work problem.

The share-preserving defrag shouldn't interfere with dedupe because defrag
is run on-demand, and it should then shut down dedupe until it has
completed. Therefore, the issues it causes for dedupe are only temporary
(and minor, really).

>> I think that this kind of close dedupe-defrag integration should mostly be
>> left to dedupe developers.
>
> That's reasonable--dedupe creates a specific form of fragmentation
> problems. Not fixing those is bad for dedupe performance (and performance
> in general) so it's a logical extension of the dedupe function to take
> care of them as we go. I was working on it already.
>
>> First, both defrag and dedupe should work
>> perfectly on their own.
>
> You use the word "perfectly" in a strange way...

What I meant by "perfectly" is that there are no serious bugs and issues
in either of them. They can work sub-optimally, but they must work, not
crash or hang.

So, perhaps I should have said "both defrag and dedupe should work without
issues on their own". Previously, I used the word "perfectly" in a
different sense, but I thought that this time the modified meaning would
be understood from the context.
> There are lots of btrfs dedupers that are optimized for different cases:
> some are very fast for ad-hoc full-file dedupe, others are traditional
> block-oriented filesystem-tree scanners that run on multiple filesystems.
> There's an in-kernel one under development that...runs in the kernel
> (honestly, running in the kernel is not as much of an advantage as you
> might think). I wrote a deduper that was designed to merely not die when
> presented with a large filesystem and a 50%+ dupe hit rate (it ended
> up being faster and more efficient than the others purely by accident,
> but maybe that says more about the state of the btrfs dedupe art than
> about the quality of my implementation). I wouldn't call any of these
> "perfect"--there are always some subset of users for which any of them
> are unusable or there is a more suitable tool that performs better for
> some special case.

Oh, I didn't mean "perfect" in the sense "best possible results". So, just
a slight misunderstanding there. The point is that dedupe-defrag
integration should be attempted only after it is determined that both
defrag and dedupe are working *without issues* on their own.

> There is similar specialization and variation among defrag algorithms
> as well. At best, any of them is "a good result given some choice
> of constraints."
>> Then, an interface to defrag should be made
>> available to dedupe developers. In particular, I think that the batch-update
>> functionality (it takes lots of extents and an empty free space region, then
>> writes defragmented extents to the given region) is of particular interest
>> to dedupe.
>
> Yeah, I have a wishlist item for a kernel call that takes a list of
> (virtual address, length) pairs and produces a single contiguous physical
> extent containing the content of those pairs, updating all the reflinks
> in the process. Same for dedupe, but that one replaces all the extents
> with reflinks to the first entry in the list instead of making a copy.
>
> I guess the extent-merge call could be augmented with an address hint
> for allocation, but experiments so far have indicated that the possible
> gains are marginal at best given the current btrfs allocator behaviour,
> so I haven't bothered pursuing that.

The "batch-update" from defrag should certainly trump any "extent-merge".
The defrag will do it all for you; you just supply the defrag with a list
of extents that need to be defragmented.
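
To make the two ideas discussed in this message more concrete, here is a
minimal Python sketch. It is not an actual btrfs or defrag interface. It
assumes the free space has already been read into a list of (start, length)
byte ranges, and the names Extent, emptiest_regions and plan_batch_update
are hypothetical, introduced only for illustration. The first function ranks
fixed-size regions by free bytes (the "sort, and spit out a short list"
step); the second only plans where a list of extents would land back-to-back
inside a chosen destination, which is the layout half of a batch update.

```python
from collections import defaultdict
from typing import NamedTuple


class Extent(NamedTuple):
    """A byte range; both fields are hypothetical logical byte offsets."""
    start: int
    length: int


def emptiest_regions(free_extents, region_size, top_n=3):
    """Rank fixed-size regions by free bytes, most free first.

    Returns a list of (region_index, free_bytes) with at most top_n entries.
    A free extent that crosses a region boundary is split between regions.
    """
    free_per_region = defaultdict(int)
    for ext in free_extents:
        pos, end = ext.start, ext.start + ext.length
        while pos < end:
            region = pos // region_size
            region_end = (region + 1) * region_size
            chunk = min(end, region_end) - pos
            free_per_region[region] += chunk
            pos += chunk
    # Sort by free bytes descending, then by region index for stable output.
    ranked = sorted(free_per_region.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


def plan_batch_update(extents, dest_start, dest_length):
    """Plan a contiguous layout of extents inside a destination range.

    Returns a list of (extent, new_start) pairs in the given order.
    Raises ValueError if the destination cannot hold all extents.
    This only computes addresses; it does not move any data.
    """
    plan = []
    cursor = dest_start
    limit = dest_start + dest_length
    for ext in extents:
        if cursor + ext.length > limit:
            raise ValueError("destination region too small")
        plan.append((ext, cursor))
        cursor += ext.length
    return plan
```

A caller would first pick a destination from emptiest_regions() and then
pass the fragmented extents to plan_batch_update(). Updating reflinks and
keeping a dedupe hash table consistent are deliberately left out of the
sketch.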