Re: Problems with determining data presence by examining extents?

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: David Howells <dhowells@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	hch@lst.de, tytso@mit.edu, adilger.kernel@dilger.ca,
	darrick.wong@oracle.com, clm@fb.com, josef@toxicpanda.com,
	dsterba@suse.com, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: Problems with determining data presence by examining extents?
Date: Wed, 15 Jan 2020 22:24:18 +0800	[thread overview]
Message-ID: <6330a53c-781b-83d7-8293-405787979736@gmx.com> (raw)
In-Reply-To: <23358.1579097103@warthog.procyon.org.uk>

[-- Attachment #1.1: Type: text/plain, Size: 4285 bytes --]

On 2020/1/15 下午10:05, David Howells wrote:
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
>> At least for btrfs, only unaligned extents get padding zeros.
> 
> What is "unaligned" defined as?  The revised cachefiles reads and writes 256k
> blocks, except for the last - which gets rounded up to the nearest page (which
> I'm assuming will be some multiple of the direct-I/O granularity).  The actual
> size of the data is noted in an xattr so I don't need to rely on the size of
> the cachefile.

"Unaligned" means "unaligned to fs sector size". In btrfs it's page
size, thus it shouldn't be a problem for your 256K block size.

> 
>> (c): A multi-device fs (btrfs) can have their own logical address mapping.
>> Meaning the bytenr returned makes no sense to end user, unless used for
>> that fs specific address space.
> 
> For the purpose of cachefiles, I don't care where it is, only whether or not
> it exists.  Further, if a DIO read will return a short read when it hits a
> hole, then I only really care about detecting whether the first byte exists in
> the block.
> 
> It might be cheaper, I suppose, to initiate the read and have it fail
> immediately if no data at all is present in the block than to do a separate
> ask of the filesystem.
> 
>> You won't like this case either.
>> (d): Compressed extents
>> One compressed extent can represents more data than its on-disk size.
> 
> Same answer as above.  Btw, since I'm using DIO reads and writes, would these
> get compressed?

Yes. DIO will also be compressed unless you set the inode to nocompression.

And you may not like this btrfs internal design:
Compressed extent can only be as large as 128K (uncompressed size).

So 256K block write will be split into 2 extents anyway.
And since compressed extent will cause non-continuous physical offset,
it will always be two extents to fiemap, even you're always writing in
256K block size.

Not sure if this matters though.

> 
>> And even more bad news:
>> (e): write time dedupe
>> Although no fs known has implemented it yet (btrfs used to try to
>> support that, and I guess XFS could do it in theory too), you won't
>> known when a fs could get such "awesome" feature.
> 
> I'm not sure this isn't the same answer as above either, except if this
> results in parts of the file being "filled in" with blocks of zeros that I
> haven't supplied.

The example would be, you have written 256K data, all filled with 0xaa.
And it committed to disk.
Then the next time you write another 256K data, all filled with 0xaa.
Then instead of writing this data onto disk, the fs chooses to reuse
your previous written data, doing a reflink to it.

So fiemap would report your latter 256K has the same bytenr of your
previous 256K write (since it's reflinked), and with SHARED flag.

>  Couldn't this be disabled on an inode-by-inode basis, say
> with an ioctl?

No fs has implemented yet, but for btrfs, it has a switch to disable it
in a per-inode base.

Thanks,
Qu

> 
>>> Without being able to trust the filesystem to tell me accurately what I've
>>> written into it, I have to use some other mechanism.  Currently, I've
>>> switched to storing a map in an xattr with 1 bit per 256k block, but that
>>> gets hard to use if the file grows particularly large and also has
>>> integrity consequences - though those are hopefully limited as I'm now
>>> using DIO to store data into the cache.
>>
>> Would you like to explain why you want to know such fs internal info?
> 
> As Andreas pointed out, fscache+cachefiles is used to cache data locally for
> network filesystems (9p, afs, ceph, cifs, nfs).  Cached files may be sparse,
> with unreferenced blocks not currently stored in the cache.
> 
> I'm attempting to move to a model where I don't use bmap and don't monitor
> bit-waitqueues to find out when page flags flip on backing files so that I can
> copy data out, but rather use DIO directly to/from the network filesystem
> inode pages.
> 
> Since the backing filesystem has to keep track of whether data is stored in a
> file, it would seem a shame to have to maintain a parallel copy on the same
> medium, with the coherency issues that entail.
> 
> David
> 
> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]