On Wed, Jan 15, 2020 at 11:09:03PM +0000, David Howells wrote:
> Andreas Dilger <adilger@dilger.ca> wrote:
> 
> > > It would also have to say that blocks of zeros shouldn't be optimised away.
> > 
> > I don't necessarily see that as a requirement, so long as the filesystem
> > stores a "block" at that offset, but it could dedupe all zero-filled blocks
> > to the same "zero block".  That still allows saving storage space, while
> > keeping the semantics of "this block was written into the file" rather than
> > "there is a hole at this offset".
> 
> Yeah, that's more what I was thinking of.  Provided I can find out that
> something is present, it should be fine.

I'm curious how this proposal handles an application punching a hole
through the cache?  Does that get cached, or does that operation have
to be synchronous with the server?  Or is it a moot point because no
server supports hole punching, so it gets replaced with equivalent zero
block data writes?

Zero blocks are stupidly common on typical user data corpuses, and a
naive block-oriented deduper can create monster extents with millions
or even billions of references if it doesn't have some special handling
for zero blocks.  Even if they don't trigger filesystem performance bugs
or hit RAM or other implementation limits, it's still bigger and slower
to use zero-filled data blocks than just using holes for zero blocks.

In the bees deduper for btrfs, zero blocks get replaced with holes
unconditionally in uncompressed extents, and in compressed extents if the
extent consists entirely of zeros (a long run of zero bytes is compressed
to a few bits by all supported compression algorithms, and hole metdata
is much larger than a few bits, so no gain is possible if anything less
than the entire compressed extent is eliminated).  That behavior could
be adjusted to support this use case, as a non-default user option.

For defrag a similar optimization is possible:  read a long run of
consecutive zero data blocks, write a prealloc extent.  I don't know of
anyone doing that in real life, but it would play havoc with anything
trying to store information in FIEMAP data (or related ioctls like
GETFSMAP or TREE_SEARCH).

I think an explicit dirty-cache-data metadata structure is a good idea
despite implementation complexity.  It would eliminate dependencies on
non-portable filesystem behavior, and not abuse a facility that might
already be in active (ab)use by other existing things.  If you have
a writeback cache, you need to properly control write ordering with a
purpose-built metadata structure, or fsync() will be meaningless through
your caching layer, and after a crash you'll upload whatever confused,
delalloc-reordered, torn-written steaming crap is on the local disk to
the backing store.

> David
>