From: email@example.com
To: Zygo Blaxell <firstname.lastname@example.org>
Cc: "Austin S. Hemmelgarn" <email@example.com>, firstname.lastname@example.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Wed, 11 Sep 2019 19:21:31 -0400
Message-ID: <20190911192131.Horde.2lTVSt-Ln94dqLGQKg_USXQ@server53.web-hosting.com>
In-Reply-To: <20190911213704.GB22121@hungrycats.org>

Quoting Zygo Blaxell <email@example.com>:

> On Wed, Sep 11, 2019 at 01:20:53PM -0400, firstname.lastname@example.org wrote:
>>
>> Quoting "Austin S. Hemmelgarn" <email@example.com>:
>>
>> > On 2019-09-10 19:32, firstname.lastname@example.org wrote:
>> > >
>> > > Quoting "Austin S. Hemmelgarn" <email@example.com>:
>> > >
>> > > === I CHALLENGE you and anyone else on this mailing list: ===
>> > >
>> > > - Show me an example where splitting an extent requires unsharing,
>> > > and this split is needed to defrag.
>> > >
>> > > Make it clear, write it yourself, I don't want any machine-made outputs.
>> > >
>> > Start with the above comment about all writes unsharing the region being
>> > written to.
>> >
>> > Now, extrapolating from there:
>> >
>> > Assume you have two files, A and B, each consisting of 64 filesystem
>> > blocks in a single shared extent. Now assume somebody writes a few bytes
>> > to the middle of file B, right around the boundary between blocks 31 and
>> > 32, and that you get similar writes to file A straddling blocks 14-15
>> > and 47-48.
>> >
>> > After all of that, file A will be 5 extents:
>> >
>> > * A reflink to blocks 0-13 of the original extent.
>> > * A single isolated extent consisting of the new blocks 14-15
>> > * A reflink to blocks 16-46 of the original extent.
>> > * A single isolated extent consisting of the new blocks 47-48
>> > * A reflink to blocks 49-63 of the original extent.
>> >
>> > And file B will be 3 extents:
>> >
>> > * A reflink to blocks 0-30 of the original extent.
>> > * A single isolated extent consisting of the new blocks 31-32.
>> > * A reflink to blocks 32-63 of the original extent.
>> >
>> > Note that there are a total of four contiguous sequences of blocks that
>> > are common between both files:
>> >
>> > * 0-13
>> > * 16-30
>> > * 32-46
>> > * 49-63
>> >
>> > There is no way to completely defragment either file without splitting
>> > the original extent (which is still there, just not fully referenced by
>> > either file) unless you rewrite the whole file to a new single extent
>> > (which would, of course, completely unshare the whole file). In fact,
>> > if you want to ensure that those shared regions stay reflinked, there's
>> > no way to defragment either file without _increasing_ the number of
>> > extents in that file (either file would need 7 extents to properly share
>> > only those 4 regions), and even then only one of the files could be
>> > fully defragmented.
>> >
>> > Such a situation generally won't happen if you're just dealing with
>> > read-only snapshots, but is not unusual when dealing with regular files
>> > that are reflinked (which is not an uncommon situation on some systems,
>> > as a lot of people have `cp` aliased to reflink things whenever
>> > possible).
>>
>> Well, thank you very much for writing this example. Your example is
>> certainly not minimal: one write to file A and one write to file B would
>> have been sufficient to prove your point, so the example contains one
>> extra write, but that's OK.
>>
>> Your example proves that I was wrong. I admit: it is impossible to
>> perfectly defrag one subvolume (in the way I imagined it should be done).
>> Why? Because, as in your example, there can be files within a SINGLE
>> subvolume which share their extents with each other. I didn't consider
>> such a case.
>>
>> On the other hand, I judge this issue to be mostly irrelevant. Why?
>> Because most of the file sharing will be between subvolumes, not within
>> a subvolume. When a user creates a reflink to a file in the same
>> subvolume, he is willingly denying himself the assurance of a perfect
>> defrag. Because, as your example proves, if there are a few writes to
>> BOTH files, it becomes impossible to defrag perfectly. So, if the user
>> creates such reflinks, it's his own wish and his own fault.
>>
>> Such situations will occur only in some specific circumstances:
>> a) when the user is reflinking manually
>> b) when a file is copied from one subvolume into a different file in a
>> different subvolume.
>>
>> The situation a) is unusual in normal use of the filesystem. Even when
>> it occurs, it is an explicit command given by the user, so he should be
>> willing to accept all the consequences, even bad ones like an imperfect
>> defrag.
>>
>> The situation b) is possible, but as far as I know copies are currently
>> not done that way in btrfs. There should probably be an option to
>> reflink-copy files from another subvolume; that would be good.

> Reflink copies across subvolumes have been working for years. They are
> an important component that makes dedupe work when snapshots are present.

I take it that what you say is true, but what I said is that when a user
(or application) makes a normal copy from one subvolume to another, it
won't be a reflink-copy. To make such a reflink-copy, you need a
btrfs-aware cp or btrfs-aware applications. So the reflink-copy is a
special case, usually explicitly requested by the user.

>> But anyway, it doesn't matter. Because most of the sharing will be
>> between subvolumes, not within a subvolume.

> Heh.
> I'd like you to meet one of my medium-sized filesystems:
>
> Physical size: 8TB
> Logical size: 16TB
> Average references per extent: 2.03 (not counting snapshots)
> Workload: CI build server, VM host
>
> That's a filesystem where over half of the logical data is reflinks to the
> other physical data, and 94% of that data is in a single subvol. 7.5TB of
> data is unique, the remaining 500GB is referenced an average of 17 times.
>
> We use ordinary applications to make ordinary copies of files, and do
> tarball unpacks and source checkouts with reckless abandon, all day long.
> Dedupe turns the copies into reflinks as we go, so every copy becomes
> a reflink no matter how it was created.
>
> For the VM filesystem image files, it's not uncommon to see a high
> reflink rate within a single file as well as reflinks to other files
> (like the binary files in the build directories that the VM images are
> constructed from). Those reference counts can go into the millions.

OK, but that cannot be helped: either you retain the sharing structure
with an imperfect defrag, or you unshare and produce a perfect defrag,
which should have somewhat better performance (and pray that the disk
doesn't fill up).

>> So, if there is some in-subvolume sharing, the defrag won't be 100%
>> perfect. That's a minor point. Unimportant.

> It's not unimportant; however, the implementation does have to take this
> into account, and make sure that defrag can efficiently skip extents that
> are too expensive to relocate. If we plan to read an extent fewer than
> 100 times, it makes no sense to update 20000 references to it--we spend
> less total time just doing the 100 slower reads.

Not necessarily. You can defrag at a time of day when there is low
pressure on disk IO, so updating 20000 references is essentially free.
You are just making those later 100 reads faster.
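The trade-off being argued here can be written down as a toy cost model
(a sketch only; the function name and unit costs are made up for
illustration and are not anything in btrfs):

```python
# Toy cost model for the defrag decision discussed above.
# All names and constants are illustrative; nothing here is a btrfs API.

def worth_defragmenting(n_refs, expected_reads,
                        ref_update_cost=1.0, frag_read_penalty=1.0):
    """True if rewriting all references to an extent costs less than the
    cumulative penalty of reading it while fragmented."""
    return n_refs * ref_update_cost < expected_reads * frag_read_penalty

# Zygo's two scenarios:
print(worth_defragmenting(20000, 100))   # 20000 refs, 100 reads -> False
print(worth_defragmenting(100, 20000))   # 100 refs, 20000 reads -> True
```

Deferring the work to a quiet time of day amounts to lowering
ref_update_cost, which is exactly why the break-even point is
workload-dependent and hard for the kernel to guess.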
OK, you are right, there is some limit, but this is such a rare case that
such heavily-referenced extents are best left untouched. I suggest
something along these lines: if there are more than XX (where XX defaults
to 1000) reflinks to an extent, then one or more copies of the extent
should be made such that each has fewer than XX reflinks to it. The
number XX should be user-configurable.

> If the numbers are
> reversed then it's better to defrag the extent--100 reference updates
> are easily outweighed by 20000 faster reads. The kernel doesn't have
> enough information to make good decisions about this.

So, just make the number XX user-provided.

> Dedupe has a similar problem--it's rarely worth doing a GB of IO to
> save 4K of space, so in practical implementations, a lot of duplicate
> blocks have to remain duplicate.
>
> There are some ways to make the kernel dedupe and defrag API process
> each reference a little more efficiently, but none will get around this
> basic physical problem: some extents are just better off where they are.

OK. If you don't touch those extents, they are still shared. That's what
I wanted.

> Userspace has access to some extra data from the user, e.g. "which
> snapshots should have their references excluded from defrag because
> the entire snapshot will be deleted in a few minutes." That will allow
> better defrag cost-benefit decisions than any in-kernel implementation
> can make by itself.

Yes, but I think we are going into too much detail, which diverts
attention from the overall picture and from the big problems. And the big
problem here is: what do we want defrag to do in the general, most common
cases? Because we still haven't agreed on that one, since many of the
people here are ardent followers of the defrag-by-unsharing ideology.

> 'btrfs fi defrag' is just one possible userspace implementation, which
> implements the "throw entire files at the legacy kernel defrag API one
> at a time" algorithm.
> Unfortunately, nobody seems to have implemented
> any other algorithms yet, other than a few toy proof-of-concept demos.

I really don't have a clue what's happening, but if I were to start
working on it (which I won't), then the first things should be:

- creating a way for btrfs to split large extents into smaller ones (for
  easier defrag, as the first phase).
- creating a way for btrfs to merge small adjacent extents shared by the
  same files into larger extents (as the last phase of defragmenting a
  file).
- creating a structure (associative array) for defrag that can track
  backlinks. Keep the structure updated with each filesystem change, by
  placing hooks in filesystem-update routines.

You can't go wrong with this. Whatever details change about the defrag
operation, these three things will be needed by defrag.

>> Now, to retain the original sharing structure, the defrag has to change
>> the reflink of extent E55 in file B to point to E70. You are telling me
>> this is not possible? Bullshit!

> This is already possible today and userspace tools can do it--not as
> efficiently as possible, but without requiring more than 128M of
> temporary space. 'btrfs fi defrag' is not one of those tools.
>
>> Please explain to me how this 'defrag has to unshare' story of yours
>> isn't an intentional attempt to mislead me.
>
> Austin is talking about the btrfs we have, not the btrfs we want.

OK, but then you agree with me that the current defrag is a joke. I mean,
something is better than nothing, and the current defrag isn't completely
useless, but in most circumstances it is either unusable or not good
enough.

I mean, snapshots are a prime feature of btrfs. If not, then why bother
with b-trees? If you wanted subvolumes, checksums and RAID, then you
should have made ext5. B-trees are in btrfs so that there can be
snapshots. But the current defrag works badly with snapshots. It doesn't
defrag them well, and it also unshares data. Bad bad bad.
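The third building block in my list above, the backlink-tracking
associative array, could look something like this in miniature (a toy
sketch only; the class, the extent names and the inode numbers are
invented for illustration, and real btrfs backrefs live in the extent
tree, not in a userspace dict):

```python
from collections import defaultdict

# Toy model of the proposed defrag-side backlink index:
# extent id -> set of (inode, file_offset) references,
# kept current by hooks placed in the filesystem-update routines.

class BackrefIndex:
    def __init__(self):
        self.refs = defaultdict(set)  # extent -> {(inode, offset), ...}

    def add_ref(self, extent, inode, offset):
        self.refs[extent].add((inode, offset))

    def drop_ref(self, extent, inode, offset):
        self.refs[extent].discard((inode, offset))
        if not self.refs[extent]:
            del self.refs[extent]     # extent became unreferenced

    def ref_count(self, extent):
        return len(self.refs.get(extent, ()))

idx = BackrefIndex()
idx.add_ref("E55", inode=101, offset=0)      # file A points at E55
idx.add_ref("E55", inode=202, offset=4096)   # file B shares E55
print(idx.ref_count("E55"))                  # 2
```

With such an index, retargeting one file's reflink (the E55-to-E70 case
below) is a drop_ref on the old extent plus an add_ref on the new one,
which is exactly the bookkeeping a sharing-preserving defrag needs.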
And if you wanted to be honest with your users, why don't you place this
info in the wiki? OK, the wiki says "defrag will unshare", but it doesn't
say that it also doesn't defragment well.

For example, let's examine the typical home user. If he is using btrfs,
it means he probably wants snapshots of his data. And, after a few
snapshots, his data is fragmented, and the current defrag can't help
because it does a terrible job in this particular case. So why don't you
write on the wiki: "defrag is practically unusable if you use snapshots".
Because that is the truth. Be honest.
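P.S. For what it's worth, the shared regions in Austin's two-file example
can be recomputed mechanically. A small sketch using half-open block
ranges, taking each file's surviving reflink ranges as listed in the
example (the helper function is mine, not anything from btrfs-progs):

```python
# Recompute the shared block ranges from Austin's example.
# Ranges are half-open [start, end); block numbers come from the thread.

def intersect(ranges_a, ranges_b):
    """Intersect two lists of half-open block ranges."""
    out = []
    for a0, a1 in ranges_a:
        for b0, b1 in ranges_b:
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:
                out.append((lo, hi))
    return sorted(out)

file_a = [(0, 14), (16, 47), (49, 64)]  # A's reflinks into the original extent
file_b = [(0, 31), (32, 64)]            # B's reflinks into the original extent

# -> [(0, 14), (16, 31), (32, 47), (49, 64)],
#    i.e. blocks 0-13, 16-30, 32-46 and 49-63, matching the example.
print(intersect(file_a, file_b))
```

Four disjoint shared runs, exactly as listed, which is why a
sharing-preserving defrag of either file needs more extents, not fewer.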