Re: Problems with determining data presence by examining extents?

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: David Howells <dhowells@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@lst.de>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Problems with determining data presence by examining extents?
Date: Wed, 15 Jan 2020 21:10:44 +0800
Message-ID: <afa71c13-4f99-747a-54ec-579f11f066a0@gmx.com>
In-Reply-To: <27181AE2-C63F-4932-A022-8B0563C72539@dilger.ca>

On 2020/1/15 8:46 PM, Andreas Dilger wrote:
> On Jan 14, 2020, at 8:54 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> On 2020/1/15 12:48 AM, David Howells wrote:
>>> Again with regard to my rewrite of fscache and cachefiles:
>>>
>>> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
>>>
>>> I've got rid of my use of bmap()!  Hooray!
>>>
>>> However, I'm informed that I can't trust the extent map of a backing file to
>>> tell me accurately whether content exists in a file because:
>>
>>>
>>> (b) Blocks of zeros that I write into the file may get punched out by
>>>     filesystem optimisation since a read back would be expected to read zeros
>>>     there anyway, provided it's below the EOF.  This would give me a false
>>>     negative.
>>
>> I know some qemu disk backends have such zero detection, but btrfs
>> does not. So this is another per-fs behavior.
>>
>> And problem (c):
>>
>> (c): A multi-device fs (btrfs) can have its own logical address mapping.
>> Meaning the bytenr returned makes no sense to the end user, unless it is
>> used within that fs-specific address space.
> 
> It would be useful to implement the multi-device extension for FIEMAP, adding
> the fe_device field to indicate which device the extent is resident on:
> 
> + #define FIEMAP_EXTENT_DEVICE		0x00008000 /* fe_device is valid, non-
> +						    * local with EXTENT_NET */
> + #define FIEMAP_EXTENT_NET		0x80000000 /* Data stored remotely. */
> 
>  struct fiemap_extent {
>  	__u64 fe_logical;  /* logical offset in bytes for the start of
>  			    * the extent from the beginning of the file */
>  	__u64 fe_physical; /* physical offset in bytes for the start
>  			    * of the extent from the beginning of the disk */
>  	__u64 fe_length;   /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +	__u32 fe_device;   /* device number (fs-specific if FIEMAP_EXTENT_NET)*/
> +	__u32 fe_reserved[2];
>  };
> 
> That allows userspace to distinguish fe_physical addresses that may be
> on different devices.  This isn't in the kernel yet, since it is mostly
> useful only for Btrfs and nobody has implemented it there.  I can give
> you details if working on this for Btrfs is of interest to you.

IMHO it's not good enough.

The concern is that one extent can exist on multiple devices (mirrors for
RAID1/RAID10/RAID1C3/RAID1C4, or stripes for RAID5/6).
I don't see how that can be easily implemented, even with extra fields.

And even if we implement it, filling in the per-device info can be too
complex or bug-prone.
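
To illustrate the mirror case, here is a minimal sketch with made-up types
(struct stripe_loc and map_raid1() are hypothetical, not btrfs or FIEMAP
code). It assumes a toy RAID1 layout where each mirror stores the chunk
contiguously; the point is only that one logical address resolves to
several (device, physical) pairs, which a single fe_device/fe_physical
pair cannot carry:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not a btrfs or FIEMAP structure. */
struct stripe_loc {
	uint32_t devid;    /* device holding this copy */
	uint64_t physical; /* byte offset on that device */
};

/*
 * Toy RAID1 mapping: the chunk starting at chunk_logical is stored once per
 * mirror, beginning at dev_start[i] on device devids[i].  Fills out[] with
 * one location per mirror and returns how many were written.
 */
static size_t map_raid1(uint64_t logical, uint64_t chunk_logical,
			const uint32_t *devids, const uint64_t *dev_start,
			size_t nr_mirrors, struct stripe_loc *out)
{
	uint64_t off;
	size_t i;

	assert(logical >= chunk_logical);
	off = logical - chunk_logical;
	for (i = 0; i < nr_mirrors; i++) {
		out[i].devid = devids[i];
		out[i].physical = dev_start[i] + off;
	}
	return nr_mirrors;
}
```

With two mirrors this already yields two answers for one fe_logical, and
RAID5/6 would need stripe and parity math on top of that.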

> 
>> This is even trickier when considering single-device btrfs.
>> It still uses the same logical address space, just like every
>> multi-device btrfs.
>>
>> And it is completely possible for a single 1T btrfs to have logical
>> addresses mapped beyond 10T or even more. (Any aligned bytenr in the range
>> [0, U64_MAX) is a valid btrfs logical address.)
>>
>>
>> You won't like this case either.
>> (d): Compressed extents
>> One compressed extent can represent more data than its on-disk size.
>>
>> Furthermore, the current fiemap interface doesn't consider this case, so
>> it only reports the in-memory size (aka file size), with no way to
>> represent the on-disk size.
> 
> There was a prototype patch to add compressed extent support to FIEMAP
> for btrfs, but it never landed:
> 
> [PATCH 0/5 v4] fiemap: introduce EXTENT_DATA_COMPRESSED flag David Sterba
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg35837.html
> 
> This adds a separate "fe_phys_length" field for each extent:

That would work much better.
Although we could still have some corner cases.

E.g. a compressed extent which is 128K on-disk and 1M uncompressed,
but only the last 4K of the uncompressed extent is referenced by the file.
Then the current fields are still not enough, as nothing records the offset
into the uncompressed data.
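
Putting numbers on that corner case, as a rough sketch. The struct and
function names below are hypothetical, and I'm assuming the proposed
interface would report the referenced length in fe_length and the
compressed size in fe_phys_length:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a file range backed by a compressed extent. */
struct comp_ref {
	uint64_t disk_len;   /* compressed bytes on disk */
	uint64_t ram_len;    /* bytes after decompression */
	uint64_t ref_offset; /* where the file's reference starts in the uncompressed data */
	uint64_t ref_len;    /* uncompressed bytes the file actually references */
};

/* The two lengths the proposed fiemap_extent could carry (assumption). */
struct reported {
	uint64_t fe_length;
	uint64_t fe_phys_length;
};

static struct reported report(const struct comp_ref *r)
{
	struct reported out;

	assert(r->ref_offset + r->ref_len <= r->ram_len);
	out.fe_length = r->ref_len;
	out.fe_phys_length = r->disk_len;
	return out;
}

/* Uncompressed bytes that must be decoded but never show up in the report. */
static uint64_t hidden_uncompressed(const struct comp_ref *r)
{
	return r->ram_len - r->ref_len;
}
```

For the 128K/1M/4K example this reports fe_length = 4K and
fe_phys_length = 128K, with no field left to say that the 4K sits at
offset 1M - 4K of the decompressed data.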

But if the user only cares about hole vs. non-hole, then none of these
hassles matter.
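
For that hole vs. non-hole question, a minimal userspace sketch using
lseek() with SEEK_DATA instead of parsing extents. It assumes Linux;
range_has_data() and path_range_has_data() are names made up here. It only
reports what the filesystem claims, so the caveats David listed about
punched-out zeros still apply:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3 /* Linux value; fallback if the libc headers hide it */
#endif

/*
 * Returns 1 if the filesystem reports data at or after off that begins
 * before off + len, 0 if the range is reported as hole (or lies at/after
 * EOF), and -1 on error.  lseek() moves the file offset of fd as a side
 * effect.
 */
static int range_has_data(int fd, off_t off, off_t len)
{
	off_t data;

	assert(off >= 0 && len > 0);
	data = lseek(fd, off, SEEK_DATA);
	if (data < 0)
		return errno == ENXIO ? 0 : -1; /* ENXIO: no data from off to EOF */
	return data < off + len;
}

/* Hypothetical convenience wrapper that opens the file by path. */
static int path_range_has_data(const char *path, off_t off, off_t len)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = range_has_data(fd, off, len);
	close(fd);
	return ret;
}
```

A filesystem without hole support simply treats the whole file as data, so
a 0 from this is only meaningful where holes are actually tracked.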

> 
> +#define FIEMAP_EXTENT_DATA_COMPRESSED  0x00000040 /* Data is compressed by fs.
> +                                                   * Sets EXTENT_ENCODED and
> +                                                   * the compressed size is
> +                                                   * stored in fe_phys_length */
> 
>  struct fiemap_extent {
>  	__u64 fe_physical;    /* physical offset in bytes for the start
> 			       * of the extent from the beginning of the disk */
>  	__u64 fe_length;      /* length in bytes for this extent */
> -	__u64 fe_reserved64[2];
> +	__u64 fe_phys_length; /* physical length in bytes, may be different from
> +			       * fe_length and sets additional extent flags */
> +	__u64 fe_reserved64;
>  	__u32 fe_flags;	      /* FIEMAP_EXTENT_* flags for this extent */
>  	__u32 fe_reserved[3];
>  };
> 
> 
>> And even more bad news:
>> (e): write time dedupe
>> Although no fs I know of has implemented it yet (btrfs used to try to
>> support that, and I guess XFS could do it in theory too), you can't
>> know when a fs could get such an "awesome" feature.
>>
>> Your write may then be checked and never reach disk if it matches
>> existing extents.
>>
>> This is a little like the zero-detection-auto-punch.
>>
>>> Is there some setting I can use to prevent these scenarios on a file - or can
>>> one be added?
>>
>> I guess no.
>>
>>> Without being able to trust the filesystem to tell me accurately what I've
>>> written into it, I have to use some other mechanism.  Currently, I've switched
>>> to storing a map in an xattr with 1 bit per 256k block, but that gets hard to
>>> use if the file grows particularly large and also has integrity consequences -
>>> though those are hopefully limited as I'm now using DIO to store data into the
>>> cache.
>>
>> Would you like to explain why you want to know such fs internal info?
> 
> I believe David wants it to store sparse files as a cache and use FIEMAP to
> determine if the blocks are cached locally, or if they need to be fetched from
> the server.  If the filesystem doesn't store the written blocks accurately,
> there is no way for the local cache to know whether it is holding valid data
> or not.

That looks like a hack: using the fiemap result as out-of-band info.

Although it looks very clever, I'm not sure this is the preferred way to do
it, as it depends too much on fs-internal mechanisms.
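
For comparison, the explicit map David described (1 bit per 256k block) is
simple to keep in-band. A minimal sketch, with all names (CACHE_BLOCK_SHIFT,
map_bytes(), map_set(), map_test()) made up for illustration and the storage
of the bitmap itself (xattr or otherwise) left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK_SHIFT 18 /* 256 KiB per bit, as in David's description */

/* Bytes of bitmap needed to cover a file of the given size. */
static size_t map_bytes(uint64_t size)
{
	uint64_t mask = (1ULL << CACHE_BLOCK_SHIFT) - 1;
	uint64_t blocks = (size >> CACHE_BLOCK_SHIFT) + ((size & mask) != 0);

	return (size_t)((blocks + 7) / 8);
}

/* Mark the block containing file offset pos as present in the cache. */
static void map_set(uint8_t *map, size_t nbytes, uint64_t pos)
{
	uint64_t bit = pos >> CACHE_BLOCK_SHIFT;

	assert(bit / 8 < nbytes);
	map[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

/* Returns 1 if the block containing pos is marked present, else 0. */
static int map_test(const uint8_t *map, size_t nbytes, uint64_t pos)
{
	uint64_t bit = pos >> CACHE_BLOCK_SHIFT;

	assert(bit / 8 < nbytes);
	return (map[bit / 8] >> (bit % 8)) & 1;
}
```

The size problem David mentions shows up quickly: map_bytes() for a 1 TiB
file is 512 KiB, well beyond the 64 KiB xattr value limit that, as far as I
know, the Linux VFS imposes.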

Thanks,
Qu

> 
> 
> Cheers, Andreas