On Wed, Sep 11, 2019 at 08:02:40AM -0400, Austin S. Hemmelgarn wrote:
> On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > 
> > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > 
> > 
> > > > Defrag may break up extents. Defrag may fuse extents. But it
> > > > shouln't ever unshare extents.
> > 
> > > Actually, spitting or merging extents will unshare them in a large
> > > majority of cases.
> > 
> > Ok, this point seems to be repeated over and over without any proof, and
> > it is illogical to me.
> > 
> > About merging extents: a defrag should merge extents ONLY when both
> > extents are shared by the same files (and when those extents are
> > neighbours in both files). In other words, defrag should always merge
> > without unsharing. Let's call that operation "fusing extents", so that
> > there are no more misunderstandings.
> And I reiterate: defrag only operates on the file it's passed in.  It needs
> to for efficiency reasons (we had a reflink aware defrag for a while a few
> years back, it got removed because performance limitations meant it was
> unusable in the cases where you actually needed it). Defrag doesn't even
> know that there are reflinks to the extents it's operating on.
> 
> Now factor in that _any_ write will result in unsharing the region being
> written to, rounded to the nearest full filesystem block in both directions
> (this is mandatory, it's a side effect of the copy-on-write nature of BTRFS,
> and is why files that experience heavy internal rewrites get fragmented very
> heavily and very quickly on BTRFS).
> 
> Given this, defrag isn't willfully unsharing anything, it's just a
> side-effect of how it works (since it's rewriting the block layout of the
> file in-place).
> > 
> > === I CHALLENGE you and anyone else on this mailing list: ===
> > 
> >   - Show me an exaple where splitting an extent requires unsharing, and
> > this split is needed to defrag.
> >
> > Make it clear, write it yourself, I don't want any machine-made outputs.
> > 
> Start with the above comment about all writes unsharing the region being
> written to.
> 
> Now, extrapolating from there:
> 
> Assume you have two files, A and B, each consisting of 64 filesystem blocks
> in single shared extent.  Now assume somebody writes a few bytes to the
> middle of file B, right around the boundary between blocks 31 and 32, and
> that you get similar writes to file A straddling blocks 14-15 and 47-48.
> 
> After all of that, file A will be 5 extents:
> 
> * A reflink to blocks 0-13 of the original extent.
> * A single isolated extent consisting of the new blocks 14-15
> * A reflink to blocks 16-46 of the original extent.
> * A single isolated extent consisting of the new blocks 47-48
> * A reflink to blocks 49-63 of the original extent.
> 
> And file B will be 3 extents:
> 
> * A reflink to blocks 0-30 of the original extent.
> * A single isolated extent consisting of the new blocks 31-32.
> * A reflink to blocks 32-63 of the original extent.
> 
> Note that there are a total of four contiguous sequences of blocks that are
> common between both files:
> 
> * 0-13
> * 16-30
> * 32-46
> * 49-63
> 
> There is no way to completely defragment either file without splitting the
> original extent (which is still there, just not fully referenced by either
> file) unless you rewrite the whole file to a new single extent (which would,
> of course, completely unshare the whole file).  In fact, if you want to
> ensure that those shared regions stay reflinked, there's no way to
> defragment either file without _increasing_ the number of extents in that
> file (either file would need 7 extents to properly share only those 4
> regions), and even then only one of the files could be fully defragmented.

Arguably, the kernel's defrag ioctl should go ahead and do the extent
relocation and update all the reflinks at once, using the file given
in the argument as the "canonical" block order, i.e. the fd and offset
range you pass in is checked, and if it's not physically contiguous,
the extents in the range are copied to a single contiguous extent, then
all the other references to the old extent(s) within the offset range are
rewritten to point to the new extent, then the old extent is discarded.

It is possible to do this from userspace now using a mix of data copies
and dedupe, but it's much more efficient to use the facilities available
in the kernel:  in particular, the kernel can lock the extent in question
while all of this is going on, and the kernel can update shared snapshot
metadata pages directly instead of duplicating them and doing identical
updates on each copy.

This sort of extent reference manipulation, particularly of extents
referenced by readonly snapshots, used to break a lot of things (btrfs
send in particular) but the same issues came up again for dedupe,
and they now seem to be fixed as of 5.3 or so.  Maybe it's time to try
shared-extent-aware defrag again.

In practice, such an improved defrag ioctl would probably need some
more limit parameters, e.g.  "just skip over any extent with more than
1000 references" or "do a two-pass algorithm and relocate data only if
every reference to the data is logically contiguous" to avoid getting
bogged down on extents which require more iops to defrag in the present
than can possibly be saved by using the defrag result in the future.
That makes the defrag API even uglier, with even more magic baked-in
behavior to get in the way of users who know what they're doing, but
some stranger on a mailing list requested it, so why not...  :-P

> Such a situation generally won't happen if you're just dealing with
> read-only snapshots, but is not unusual when dealing with regular files that
> are reflinked (which is not an uncommon situation on some systems, as a lot
> of people have `cp` aliased to reflink things whenever possible).