Linux-ext4 Archive on lore.kernel.org
* Problems with determining data presence by examining extents?
@ 2020-01-14 16:48 David Howells
  2020-01-14 22:49 ` Theodore Y. Ts'o
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: David Howells @ 2020-01-14 16:48 UTC (permalink / raw)
  To: linux-fsdevel, viro, hch, tytso, adilger.kernel, darrick.wong,
	clm, josef, dsterba
  Cc: dhowells, linux-ext4, linux-xfs, linux-btrfs, linux-kernel

Again with regard to my rewrite of fscache and cachefiles:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter

I've got rid of my use of bmap()!  Hooray!

However, I'm informed that I can't trust the extent map of a backing file to
tell me accurately whether content exists in a file because:

 (a) Not-quite-contiguous extents may be joined by insertion of blocks of
     zeros by the filesystem optimising itself.  This would give me a false
     positive when trying to detect the presence of data.

 (b) Blocks of zeros that I write into the file may get punched out by
     filesystem optimisation since a read back would be expected to read zeros
     there anyway, provided it's below the EOF.  This would give me a false
     negative.

Is there some setting I can use to prevent these scenarios on a file - or can
one be added?

Without being able to trust the filesystem to tell me accurately what I've
written into it, I have to use some other mechanism.  Currently, I've switched
to storing a map in an xattr with 1 bit per 256k block, but that gets hard to
use if the file grows particularly large and also has integrity consequences -
though those are hopefully limited as I'm now using DIO to store data into the
cache.

If it helps, I'm downloading data in aligned 256k blocks and storing data in
those same aligned 256k blocks, so if that makes it easier...

David



* Re: Problems with determining data presence by examining extents?
  2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
@ 2020-01-14 22:49 ` Theodore Y. Ts'o
  2020-01-15  3:54 ` Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-14 22:49 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, viro, hch, adilger.kernel, darrick.wong, clm,
	josef, dsterba, linux-ext4, linux-xfs, linux-btrfs, linux-kernel

On Tue, Jan 14, 2020 at 04:48:29PM +0000, David Howells wrote:
> Again with regard to my rewrite of fscache and cachefiles:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
> 
> I've got rid of my use of bmap()!  Hooray!
> 
> However, I'm informed that I can't trust the extent map of a backing file to
> tell me accurately whether content exists in a file because:
> 
>  (a) Not-quite-contiguous extents may be joined by insertion of blocks of
>      zeros by the filesystem optimising itself.  This would give me a false
>      positive when trying to detect the presence of data.
> 
>  (b) Blocks of zeros that I write into the file may get punched out by
>      filesystem optimisation since a read back would be expected to read zeros
>      there anyway, provided it's below the EOF.  This would give me a false
>      negative.
> 
> Is there some setting I can use to prevent these scenarios on a file - or can
> one be added?

I don't think there's any way to do this in a portable way, at least
today.  There is a hack we could use that would work for ext4 today, at
least with respect to (a), but I'm not sure we would want to make any
guarantees with respect to (b).

I suspect I understand why you want this; I've fielded some requests
from people wanting to do something very like this at $WORK, for what I
assume to be the same reason you're seeking to do this: to do
incremental caching of files, letting the file system track what has
and hasn't been cached yet.

If we were going to add such a facility, what we could perhaps do is
to define a new flag indicating that a particular file should have no
extent mapping optimization applied, such that FIEMAP would return a
mapping if and only if userspace had written to a particular block, or
had requested that a block be preallocated using fallocate().  The
flag could only be set on a zero-length file, and this might disable
certain advanced file system features, such as reflink, at the file
system's discretion; and there might be unspecified performance
impacts if this flag is set on a file.

File systems which do not support this feature would not allow this
flag to be set.

				- Ted


* Re: Problems with determining data presence by examining extents?
  2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
  2020-01-14 22:49 ` Theodore Y. Ts'o
@ 2020-01-15  3:54 ` Qu Wenruo
  2020-01-15 12:46   ` Andreas Dilger
  2020-01-15 14:20   ` David Howells
  2020-01-15  8:38 ` Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-01-15  3:54 UTC (permalink / raw)
  To: David Howells, linux-fsdevel, viro, hch, tytso, adilger.kernel,
	darrick.wong, clm, josef, dsterba
  Cc: linux-ext4, linux-xfs, linux-btrfs, linux-kernel

On 2020/1/15 00:48, David Howells wrote:
> Again with regard to my rewrite of fscache and cachefiles:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
> 
> I've got rid of my use of bmap()!  Hooray!
> 
> However, I'm informed that I can't trust the extent map of a backing file to
> tell me accurately whether content exists in a file because:
> 
>  (a) Not-quite-contiguous extents may be joined by insertion of blocks of
>      zeros by the filesystem optimising itself.  This would give me a false
>      positive when trying to detect the presence of data.

At least for btrfs, only unaligned extents get padding zeros.

But I guess other fs could do whatever they want to optimize themselves.

> 
>  (b) Blocks of zeros that I write into the file may get punched out by
>      filesystem optimisation since a read back would be expected to read zeros
>      there anyway, provided it's below the EOF.  This would give me a false
>      negative.

I know some qemu disk backends have such zero detection, but btrfs
doesn't.  So this is another per-filesystem behavior.

And problem (c):

(c): A multi-device filesystem (like btrfs) can have its own logical
address mapping, meaning the bytenr returned makes no sense to the end
user unless it's interpreted within that filesystem's own address space.

This is even trickier for single-device btrfs.  It still uses the same
logical address space, just like multi-disk btrfs, and it's entirely
possible for a single 1T btrfs to have logical addresses mapped beyond
10T or even more.  (Any aligned bytenr in the range [0, U64_MAX) is a
valid btrfs logical address.)


You won't like this case either.
(d): Compressed extents.
One compressed extent can represent more data than its on-disk size.

Furthermore, the current fiemap interface hasn't considered this case,
so it only reports the in-memory size (i.e. the file size) and has no
way to represent the on-disk size.


And even more bad news:
(e): write-time dedupe.
Although no known filesystem has implemented it yet (btrfs used to try
to support it, and I guess XFS could do it in theory too), you never
know when a filesystem might gain such an "awesome" feature, where your
write is checked against existing extents and never reaches disk if it
matches one.

This is a little like the zero-detection auto-punch.

> 
> Is there some setting I can use to prevent these scenarios on a file - or can
> one be added?

I guess no.

> 
> Without being able to trust the filesystem to tell me accurately what I've
> written into it, I have to use some other mechanism.  Currently, I've switched
> to storing a map in an xattr with 1 bit per 256k block, but that gets hard to
> use if the file grows particularly large and also has integrity consequences -
> though those are hopefully limited as I'm now using DIO to store data into the
> cache.

Would you like to explain why you want to know such fs-internal info?

Thanks,
Qu
> 
> If it helps, I'm downloading data in aligned 256k blocks and storing data in
> those same aligned 256k blocks, so if that makes it easier...
> 
> David
> 




* Re: Problems with determining data presence by examining extents?
  2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
  2020-01-14 22:49 ` Theodore Y. Ts'o
  2020-01-15  3:54 ` Qu Wenruo
@ 2020-01-15  8:38 ` Christoph Hellwig
  2020-01-15 13:50 ` David Howells
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-01-15  8:38 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, viro, hch, tytso, adilger.kernel, darrick.wong,
	clm, josef, dsterba, linux-ext4, linux-xfs, linux-btrfs,
	linux-kernel

On Tue, Jan 14, 2020 at 04:48:29PM +0000, David Howells wrote:
> Again with regard to my rewrite of fscache and cachefiles:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
> 
> I've got rid of my use of bmap()!  Hooray!
> 
> However, I'm informed that I can't trust the extent map of a backing file to
> tell me accurately whether content exists in a file because:
> 
>  (a) Not-quite-contiguous extents may be joined by insertion of blocks of
>      zeros by the filesystem optimising itself.  This would give me a false
>      positive when trying to detect the presence of data.
> 
>  (b) Blocks of zeros that I write into the file may get punched out by
>      filesystem optimisation since a read back would be expected to read zeros
>      there anyway, provided it's below the EOF.  This would give me a false
>      negative.

The whole idea of an out-of-band interface is going to be racy and suffer
from implementation loss.  I think what you want is something similar to
the NFSv4.2 READ_PLUS operation - give me the data if there is any, and
otherwise tell me that there is a hole.  I think this could be a new
RWF_NOHOLE or similar flag; just how to return the hole size would be
a little awkward.  Maybe return a specific negative error code (ENODATA?)
and advance the iov anyway.


* Re: Problems with determining data presence by examining extents?
  2020-01-15  3:54 ` Qu Wenruo
@ 2020-01-15 12:46   ` Andreas Dilger
  2020-01-15 13:10     ` Qu Wenruo
  2020-01-15 14:20   ` David Howells
  1 sibling, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2020-01-15 12:46 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: David Howells, linux-fsdevel, Al Viro, Christoph Hellwig,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List


On Jan 14, 2020, at 8:54 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
> On 2020/1/15 00:48, David Howells wrote:
>> Again with regard to my rewrite of fscache and cachefiles:
>> 
>> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
>> 
>> I've got rid of my use of bmap()!  Hooray!
>> 
>> However, I'm informed that I can't trust the extent map of a backing file to
>> tell me accurately whether content exists in a file because:
> 
>> 
>> (b) Blocks of zeros that I write into the file may get punched out by
>>     filesystem optimisation since a read back would be expected to read zeros
>>     there anyway, provided it's below the EOF.  This would give me a false
>>     negative.
> 
> I know some qemu disk backends have such zero detection, but btrfs
> doesn't.  So this is another per-filesystem behavior.
> 
> And problem (c):
> 
> (c): A multi-device filesystem (like btrfs) can have its own logical
> address mapping, meaning the bytenr returned makes no sense to the end
> user unless it's interpreted within that filesystem's own address space.

It would be useful to implement the multi-device extension for FIEMAP, adding
the fe_device field to indicate which device the extent is resident on:

+ #define FIEMAP_EXTENT_DEVICE		0x00008000 /* fe_device is valid, non-
+						    * local with EXTENT_NET */
+ #define FIEMAP_EXTENT_NET		0x80000000 /* Data stored remotely. */

 struct fiemap_extent {
 	__u64 fe_logical;  /* logical offset in bytes for the start of
 			    * the extent from the beginning of the file */
 	__u64 fe_physical; /* physical offset in bytes for the start
 			    * of the extent from the beginning of the disk */
 	__u64 fe_length;   /* length in bytes for this extent */
 	__u64 fe_reserved64[2];
 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
-	__u32 fe_reserved[3];
+	__u32 fe_device;   /* device number (fs-specific if FIEMAP_EXTENT_NET)*/
+	__u32 fe_reserved[2];
 };

That allows userspace to distinguish fe_physical addresses that may be
on different devices.  This isn't in the kernel yet, since it is mostly
useful only for Btrfs and nobody has implemented it there.  I can give
you details if working on this for Btrfs is of interest to you.

> This is even trickier for single-device btrfs.  It still uses the same
> logical address space, just like multi-disk btrfs, and it's entirely
> possible for a single 1T btrfs to have logical addresses mapped beyond
> 10T or even more.  (Any aligned bytenr in the range [0, U64_MAX) is a
> valid btrfs logical address.)
> 
> 
> You won't like this case either.
> (d): Compressed extents.
> One compressed extent can represent more data than its on-disk size.
> 
> Furthermore, the current fiemap interface hasn't considered this case,
> so it only reports the in-memory size (i.e. the file size) and has no
> way to represent the on-disk size.

There was a prototype patch to add compressed extent support to FIEMAP
for btrfs, but it was never landed:

[PATCH 0/5 v4] fiemap: introduce EXTENT_DATA_COMPRESSED flag David Sterba
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg35837.html

This adds a separate "fe_phys_length" field for each extent:

+#define FIEMAP_EXTENT_DATA_COMPRESSED  0x00000040 /* Data is compressed by fs.
+                                                   * Sets EXTENT_ENCODED and
+                                                   * the compressed size is
+                                                   * stored in fe_phys_length */

 struct fiemap_extent {
 	__u64 fe_physical;    /* physical offset in bytes for the start
			       * of the extent from the beginning of the disk */
 	__u64 fe_length;      /* length in bytes for this extent */
-	__u64 fe_reserved64[2];
+	__u64 fe_phys_length; /* physical length in bytes, may be different from
+			       * fe_length and sets additional extent flags */
+	__u64 fe_reserved64;
 	__u32 fe_flags;	      /* FIEMAP_EXTENT_* flags for this extent */
 	__u32 fe_reserved[3];
 };


> And even more bad news:
> (e): write-time dedupe.
> Although no known filesystem has implemented it yet (btrfs used to try
> to support it, and I guess XFS could do it in theory too), you never
> know when a filesystem might gain such an "awesome" feature, where your
> write is checked against existing extents and never reaches disk if it
> matches one.
> 
> This is a little like the zero-detection auto-punch.
> 
>> Is there some setting I can use to prevent these scenarios on a file - or can
>> one be added?
> 
> I guess no.
> 
>> Without being able to trust the filesystem to tell me accurately what I've
>> written into it, I have to use some other mechanism.  Currently, I've switched
>> to storing a map in an xattr with 1 bit per 256k block, but that gets hard to
>> use if the file grows particularly large and also has integrity consequences -
>> though those are hopefully limited as I'm now using DIO to store data into the
>> cache.
> 
> Would you like to explain why you want to know such fs internal info?

I believe David wants to use sparse files as a cache and use FIEMAP to
determine whether blocks are cached locally, or need to be fetched from
the server.  If the filesystem doesn't track the written blocks
accurately, there is no way for the local cache to know whether it is
holding valid data or not.


Cheers, Andreas








* Re: Problems with determining data presence by examining extents?
  2020-01-15 12:46   ` Andreas Dilger
@ 2020-01-15 13:10     ` Qu Wenruo
  2020-01-15 13:31       ` Christoph Hellwig
  2020-01-15 14:35       ` David Howells
  0 siblings, 2 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-01-15 13:10 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: David Howells, linux-fsdevel, Al Viro, Christoph Hellwig,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

On 2020/1/15 20:46, Andreas Dilger wrote:
> On Jan 14, 2020, at 8:54 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> On 2020/1/15 00:48, David Howells wrote:
>>> Again with regard to my rewrite of fscache and cachefiles:
>>>
>>> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
>>>
>>> I've got rid of my use of bmap()!  Hooray!
>>>
>>> However, I'm informed that I can't trust the extent map of a backing file to
>>> tell me accurately whether content exists in a file because:
>>
>>>
>>> (b) Blocks of zeros that I write into the file may get punched out by
>>>     filesystem optimisation since a read back would be expected to read zeros
>>>     there anyway, provided it's below the EOF.  This would give me a false
>>>     negative.
>>
>> I know some qemu disk backends have such zero detection, but btrfs
>> doesn't.  So this is another per-filesystem behavior.
>>
>> And problem (c):
>>
>> (c): A multi-device filesystem (like btrfs) can have its own logical
>> address mapping, meaning the bytenr returned makes no sense to the end
>> user unless it's interpreted within that filesystem's own address space.
> 
> It would be useful to implement the multi-device extension for FIEMAP, adding
> the fe_device field to indicate which device the extent is resident on:
> 
> + #define FIEMAP_EXTENT_DEVICE		0x00008000 /* fe_device is valid, non-
> +						    * local with EXTENT_NET */
> + #define FIEMAP_EXTENT_NET		0x80000000 /* Data stored remotely. */
> 
>  struct fiemap_extent {
>  	__u64 fe_logical;  /* logical offset in bytes for the start of
>  			    * the extent from the beginning of the file */
>  	__u64 fe_physical; /* physical offset in bytes for the start
>  			    * of the extent from the beginning of the disk */
>  	__u64 fe_length;   /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +	__u32 fe_device;   /* device number (fs-specific if FIEMAP_EXTENT_NET)*/
> +	__u32 fe_reserved[2];
>  };
> 
> That allows userspace to distinguish fe_physical addresses that may be
> on different devices.  This isn't in the kernel yet, since it is mostly
> useful only for Btrfs and nobody has implemented it there.  I can give
> you details if working on this for Btrfs is of interest to you.

IMHO it's not good enough.

The concern is that one extent can exist on multiple devices (mirrors
for RAID1/RAID10/RAID1C2/RAID1C3, or stripes for RAID5/6).  I don't see
how it can be easily implemented even with extra fields.

And even if we implement it, it could be too complex or bug-prone to
fill in the per-device info.

> 
>> This is even trickier for single-device btrfs.  It still uses the same
>> logical address space, just like multi-disk btrfs, and it's entirely
>> possible for a single 1T btrfs to have logical addresses mapped beyond
>> 10T or even more.  (Any aligned bytenr in the range [0, U64_MAX) is a
>> valid btrfs logical address.)
>>
>>
>> You won't like this case either.
>> (d): Compressed extents.
>> One compressed extent can represent more data than its on-disk size.
>>
>> Furthermore, the current fiemap interface hasn't considered this case,
>> so it only reports the in-memory size (i.e. the file size) and has no
>> way to represent the on-disk size.
> 
> There was a prototype patch to add compressed extent support to FIEMAP
> for btrfs, but it was never landed:
> 
> [PATCH 0/5 v4] fiemap: introduce EXTENT_DATA_COMPRESSED flag David Sterba
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg35837.html
> 
> This adds a separate "fe_phys_length" field for each extent:

That would work much better, although we could still have some corner
cases.

E.g. a compressed extent which is 128K on-disk and 1M uncompressed, but
where only the last 4K of the uncompressed extent is referenced.  The
current fields are still not enough for that.

But if the user only cares about hole vs non-hole, then none of these
hassles apply.

> 
> +#define FIEMAP_EXTENT_DATA_COMPRESSED  0x00000040 /* Data is compressed by fs.
> +                                                   * Sets EXTENT_ENCODED and
> +                                                   * the compressed size is
> +                                                   * stored in fe_phys_length */
> 
>  struct fiemap_extent {
>  	__u64 fe_physical;    /* physical offset in bytes for the start
> 			       * of the extent from the beginning of the disk */
>  	__u64 fe_length;      /* length in bytes for this extent */
> -	__u64 fe_reserved64[2];
> +	__u64 fe_phys_length; /* physical length in bytes, may be different from
> +			       * fe_length and sets additional extent flags */
> +	__u64 fe_reserved64;
>  	__u32 fe_flags;	      /* FIEMAP_EXTENT_* flags for this extent */
>  	__u32 fe_reserved[3];
>  };
> 
> 
>> And even more bad news:
>> (e): write-time dedupe.
>> Although no known filesystem has implemented it yet (btrfs used to try
>> to support it, and I guess XFS could do it in theory too), you never
>> know when a filesystem might gain such an "awesome" feature, where your
>> write is checked against existing extents and never reaches disk if it
>> matches one.
>>
>> This is a little like the zero-detection auto-punch.
>>
>>> Is there some setting I can use to prevent these scenarios on a file - or can
>>> one be added?
>>
>> I guess no.
>>
>>> Without being able to trust the filesystem to tell me accurately what I've
>>> written into it, I have to use some other mechanism.  Currently, I've switched
>>> to storing a map in an xattr with 1 bit per 256k block, but that gets hard to
>>> use if the file grows particularly large and also has integrity consequences -
>>> though those are hopefully limited as I'm now using DIO to store data into the
>>> cache.
>>
>> Would you like to explain why you want to know such fs internal info?
> 
> I believe David wants it to store sparse files as an cache and use FIEMAP to
> determine if the blocks are cached locally, or if they need to be fetched from
> the server.  If the filesystem doesn't store the written blocks accurately,
> there is no way for the local cache to know whether it is holding valid data
> or not.

That looks like a hack, using the fiemap result as out-of-band info.

Although it looks very clever, I'm not sure it's the preferred way to do
this, as it's too specific to fs-internal mechanisms.

Thanks,
Qu

> 
> 
> Cheers, Andreas
> 
> 
> 
> 
> 




* Re: Problems with determining data presence by examining extents?
  2020-01-15 13:10     ` Qu Wenruo
@ 2020-01-15 13:31       ` Christoph Hellwig
  2020-01-15 19:48         ` Andreas Dilger
  2020-01-15 20:55         ` David Howells
  2020-01-15 14:35       ` David Howells
  1 sibling, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-01-15 13:31 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Andreas Dilger, David Howells, linux-fsdevel, Al Viro,
	Christoph Hellwig, Theodore Y. Ts'o, Darrick J. Wong,
	Chris Mason, Josef Bacik, David Sterba, linux-ext4, linux-xfs,
	linux-btrfs, Linux Kernel Mailing List

On Wed, Jan 15, 2020 at 09:10:44PM +0800, Qu Wenruo wrote:
> > That allows userspace to distinguish fe_physical addresses that may be
> > on different devices.  This isn't in the kernel yet, since it is mostly
> > useful only for Btrfs and nobody has implemented it there.  I can give
> > you details if working on this for Btrfs is of interest to you.
> 
> IMHO it's not good enough.
> 
> The concern is, one extent can exist on multiple devices (mirrors for
> RAID1/RAID10/RAID1C2/RAID1C3, or stripes for RAID5/6).
> I didn't see how it can be easily implemented even with extra fields.
> 
> And even we implement it, it can be too complex or bug prune to fill
> per-device info.

It's also completely bogus for the use cases to start with.  fiemap
is a debug tool reporting the file system layout.  Using it for anything
related to actual data storage and data integrity is a recipe for
disaster.  As said, the right thing for the use case would be something
like the NFS READ_PLUS operation.  If we can't get that easily, it can
be emulated using lseek SEEK_DATA / SEEK_HOLE, assuming no other thread
could be writing to the file, or the raciness doesn't matter.


* Re: Problems with determining data presence by examining extents?
  2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
                   ` (2 preceding siblings ...)
  2020-01-15  8:38 ` Christoph Hellwig
@ 2020-01-15 13:50 ` David Howells
  2020-01-15 14:05 ` David Howells
  2020-01-15 14:15 ` David Howells
  5 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 13:50 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: dhowells, linux-fsdevel, viro, hch, adilger.kernel, darrick.wong,
	clm, josef, dsterba, linux-ext4, linux-xfs, linux-btrfs,
	linux-kernel

Theodore Y. Ts'o <tytso@mit.edu> wrote:

> but I'm not sure we would want to make any guarantees with respect to (b).

Um.  That would potentially make disconnected operation problematic.  Now,
it's unlikely that I'll want to store a 256KiB block of zeros, but not
impossible.

> I suspect I understand why you want this; I've fielded some requests
> for people wanting to do something very like this at $WORK, for what I
> assume to be for the same reason you're seeking to do this; to create
> do incremental caching of files and letting the file system track what
> has and hasn't been cached yet.

Exactly so.  If I can't tap in to the filesystem's own map of what data is
present in a file, then I have to do it myself in parallel.  Keeping my own
list or map has a number of issues:

 (1) It's redundant.  I have to maintain a second copy of what the filesystem
     already maintains.  This uses extra space.

 (2) My map may get out of step with the filesystem after a crash.  The
     filesystem has tools to deal with this in its own structures.

 (3) If the file is very large and sparse, then keeping a bit-per-block map in
     a single xattr may not suffice or may become unmanageable.  There's a
     limit of 64k, which for bit-per-256k limits the maximum mappable size to
     128GiB (I could use multiple xattrs, but some filesystems may have total
     xattr limits) and whatever the size, I need a single buffer big enough to
     hold it.

     I could use a second file as a metadata cache - but that has worse
     coherency properties.  (As I understand it, setxattr is synchronous and
     journalled.)

> If we were going to add such a facility, what we could perhaps do is
> to define a new flag indicating that a particular file should have no
> extent mapping optimization applied, such that FIEMAP would return a
> mapping if and only if userspace had written to a particular block, or
> had requested that a block be preallocated using fallocate().  The
> flag could only be set on a zero-length file, and this might disable
> certain advanced file system features, such as reflink, at the file
> system's discretion; and there might be unspecified performance
> impacts if this flag is set on a file.

That would be fine for cachefiles.

Also, I don't need to know *where* the data is, only that the first byte
of my block exists - provided a DIO read returns short when it reaches a
hole.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
                   ` (3 preceding siblings ...)
  2020-01-15 13:50 ` David Howells
@ 2020-01-15 14:05 ` David Howells
  2020-01-15 14:24   ` Qu Wenruo
  2020-01-15 14:50   ` David Howells
  2020-01-15 14:15 ` David Howells
  5 siblings, 2 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 14:05 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dhowells, linux-fsdevel, viro, hch, tytso, adilger.kernel,
	darrick.wong, clm, josef, dsterba, linux-ext4, linux-xfs,
	linux-btrfs, linux-kernel

Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> At least for btrfs, only unaligned extents get padding zeros.

How is "unaligned" defined?  The revised cachefiles reads and writes 256k
blocks, except for the last - which gets rounded up to the nearest page (which
I'm assuming will be some multiple of the direct-I/O granularity).  The actual
size of the data is noted in an xattr so I don't need to rely on the size of
the cachefile.

> (c): A multi-device fs (btrfs) can have their own logical address mapping.
> Meaning the bytenr returned makes no sense to end user, unless used for
> that fs specific address space.

For the purpose of cachefiles, I don't care where it is, only whether or not
it exists.  Further, if a DIO read will return a short read when it hits a
hole, then I only really care about detecting whether the first byte exists in
the block.

It might be cheaper, I suppose, to initiate the read and have it fail
immediately if no data at all is present in the block than to query the
filesystem separately.

> You won't like this case either.
> (d): Compressed extents
> One compressed extent can represents more data than its on-disk size.

Same answer as above.  Btw, since I'm using DIO reads and writes, would these
get compressed?

> And even more bad news:
> (e): write time dedupe
> Although no fs known has implemented it yet (btrfs used to try to
> support that, and I guess XFS could do it in theory too), you won't
> known when a fs could get such "awesome" feature.

I'm not sure this isn't the same answer as above either, except if this
results in parts of the file being "filled in" with blocks of zeros that I
haven't supplied.  Couldn't this be disabled on an inode-by-inode basis, say
with an ioctl?

> > Without being able to trust the filesystem to tell me accurately what I've
> > written into it, I have to use some other mechanism.  Currently, I've
> > switched to storing a map in an xattr with 1 bit per 256k block, but that
> > gets hard to use if the file grows particularly large and also has
> > integrity consequences - though those are hopefully limited as I'm now
> > using DIO to store data into the cache.
> 
> Would you like to explain why you want to know such fs internal info?

As Andreas pointed out, fscache+cachefiles is used to cache data locally for
network filesystems (9p, afs, ceph, cifs, nfs).  Cached files may be sparse,
with unreferenced blocks not currently stored in the cache.

I'm attempting to move to a model where I don't use bmap and don't monitor
bit-waitqueues to find out when page flags flip on backing files so that I can
copy data out, but rather use DIO directly to/from the network filesystem
inode pages.

Since the backing filesystem has to keep track of whether data is stored in a
file, it would seem a shame to have to maintain a parallel copy on the same
medium, with the coherency issues that entails.
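
For illustration, the bitmap bookkeeping itself is trivial (names here are illustrative, not the actual cachefiles code); the hard part is keeping it coherent with the data:

```python
GRANULE = 256 * 1024  # one bit per 256 KiB block, as described above

def bit_for(offset: int) -> tuple[int, int]:
    """Map a file offset to (byte index, bit mask) in the presence map."""
    block = offset // GRANULE
    return block // 8, 1 << (block % 8)

def mark_present(m: bytearray, offset: int) -> None:
    """Record that the block containing offset has been written."""
    i, bit = bit_for(offset)
    if i >= len(m):
        m.extend(b"\0" * (i + 1 - len(m)))  # grow the map as the file grows
    m[i] |= bit

def is_present(m: bytearray, offset: int) -> bool:
    """True if the block containing offset has been recorded as written."""
    i, bit = bit_for(offset)
    return i < len(m) and bool(m[i] & bit)
```

A fully-populated 1 GiB file needs only 512 bytes of map, but as noted the map has to stay in sync with the data across crashes, which is where the integrity worry comes from.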

David



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Problems with determining data presence by examining extents?
  2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
                   ` (4 preceding siblings ...)
  2020-01-15 14:05 ` David Howells
@ 2020-01-15 14:15 ` David Howells
  5 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 14:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, linux-fsdevel, viro, tytso, adilger.kernel,
	darrick.wong, clm, josef, dsterba, linux-ext4, linux-xfs,
	linux-btrfs, linux-kernel

Christoph Hellwig <hch@lst.de> wrote:

> The whole idea of an out of band interface is going to be racy and suffer
> from implementation loss.  I think what you want is something similar to
> the NFSv4.2 READ_PLUS operation - give me that if there is any and
> otherwise tell me that there is a hole.  I think this could be a new
> RWF_NOHOLE or similar flag, just how to return the hole size would be
> a little awkward.  Maybe return a specific negative error code (ENODATA?)
> and advance the iov anyway.

Just having call_read_iter() return a short read could be made to suffice...
provided the filesystem doesn't return data I haven't written in (which could
cause apparent corruption) and does return data I have written in (otherwise I
have to go back to the server).

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15  3:54 ` Qu Wenruo
  2020-01-15 12:46   ` Andreas Dilger
@ 2020-01-15 14:20   ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 14:20 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: dhowells, Qu Wenruo, linux-fsdevel, Al Viro, Christoph Hellwig,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

Andreas Dilger <adilger@dilger.ca> wrote:

> > Would you like to explain why you want to know such fs internal info?
> 
> I believe David wants it to store sparse files as an cache and use FIEMAP to
> determine if the blocks are cached locally, or if they need to be fetched from
> the server.  If the filesystem doesn't store the written blocks accurately,
> there is no way for the local cache to know whether it is holding valid data
> or not.

More or less.  I have no particular attachment to bmap or FIEMAP as the
interface to use.  I'm just interested in finding out quickly if the data I
want is present.

If call_read_iter() will return a short read on hitting a hole, I can manage
if I can find out if just the first byte is present.

Finding out if the block is present allows me to avoid shaping read requests
from VM readahead into 256k blocks - which may require the allocation of extra
pages for bufferage.
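
For example, the first-byte check could look like this from user space (only an approximation of a kernel-side check, and note a filesystem without hole tracking reports the whole file as data):

```python
import errno
import os

def first_byte_present(fd: int, offset: int) -> bool:
    """True if the file reports data (not a hole) at exactly this offset."""
    try:
        return os.lseek(fd, offset, os.SEEK_DATA) == offset
    except OSError as e:
        if e.errno == errno.ENXIO:  # offset lies in a trailing hole or past EOF
            return False
        raise
```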

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 14:05 ` David Howells
@ 2020-01-15 14:24   ` Qu Wenruo
  2020-01-15 14:50   ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-01-15 14:24 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, viro, hch, tytso, adilger.kernel, darrick.wong,
	clm, josef, dsterba, linux-ext4, linux-xfs, linux-btrfs,
	linux-kernel


On 2020/1/15 at 10:05 PM, David Howells wrote:
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
>> At least for btrfs, only unaligned extents get padding zeros.
> 
> What is "unaligned" defined as?  The revised cachefiles reads and writes 256k
> blocks, except for the last - which gets rounded up to the nearest page (which
> I'm assuming will be some multiple of the direct-I/O granularity).  The actual
> size of the data is noted in an xattr so I don't need to rely on the size of
> the cachefile.

"Unaligned" means "unaligned to fs sector size". In btrfs it's page
size, thus it shouldn't be a problem for your 256K block size.

> 
>> (c): A multi-device fs (btrfs) can have their own logical address mapping.
>> Meaning the bytenr returned makes no sense to end user, unless used for
>> that fs specific address space.
> 
> For the purpose of cachefiles, I don't care where it is, only whether or not
> it exists.  Further, if a DIO read will return a short read when it hits a
> hole, then I only really care about detecting whether the first byte exists in
> the block.
> 
> It might be cheaper, I suppose, to initiate the read and have it fail
> immediately if no data at all is present in the block than to do a separate
> ask of the filesystem.
> 
>> You won't like this case either.
>> (d): Compressed extents
>> One compressed extent can represent more data than its on-disk size.
> 
> Same answer as above.  Btw, since I'm using DIO reads and writes, would these
> get compressed?

Yes. DIO will also be compressed unless you set the inode to nocompression.

And you may not like this btrfs internal design:
Compressed extent can only be as large as 128K (uncompressed size).

So a 256K block write will be split into 2 extents anyway.
And since compressed extents have non-contiguous physical offsets,
fiemap will always report two extents, even if you're always writing in
256K blocks.

Not sure if this matters though.
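
From user space you can observe this split with the FIEMAP ioctl; a quick sketch (ioctl number and struct layout as in fs.h/fiemap.h; returns None where FIEMAP is unsupported, e.g. tmpfs):

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B   # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1       # flush delalloc before mapping
EXTENT_SIZE = 56             # sizeof(struct fiemap_extent)

def extent_count(fd: int, max_extents: int = 64):
    """Number of extents backing the file, or None if FIEMAP is unsupported."""
    # struct fiemap header: fm_start, fm_length, fm_flags,
    # fm_mapped_extents, fm_extent_count, fm_reserved
    hdr = struct.pack("=QQIIII", 0, 0xFFFFFFFFFFFFFFFF,
                      FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    buf = bytearray(hdr + b"\0" * (max_extents * EXTENT_SIZE))
    try:
        fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)
    except OSError:
        return None
    return struct.unpack_from("=I", buf, 20)[0]  # fm_mapped_extents
```

On btrfs with compression enabled, a single 256K write should show up as two extents here.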

> 
>> And even more bad news:
>> (e): write time dedupe
>> Although no known fs has implemented it yet (btrfs used to try to
>> support it, and I guess XFS could do it in theory too), you won't
>> know when an fs might grow such an "awesome" feature.
> 
> I'm not sure this isn't the same answer as above either, except if this
> results in parts of the file being "filled in" with blocks of zeros that I
> haven't supplied.

The example would be, you have written 256K data, all filled with 0xaa.
And it committed to disk.
Then the next time you write another 256K data, all filled with 0xaa.
Then instead of writing this data onto disk, the fs chooses to reuse
your previously written data, doing a reflink to it.

So fiemap would report that your latter 256K has the same bytenr as your
previous 256K write (since it's reflinked), and with the SHARED flag.

>  Couldn't this be disabled on an inode-by-inode basis, say
> with an ioctl?

No fs has implemented it yet, but btrfs has a switch to disable it
on a per-inode basis.

Thanks,
Qu

> 
>>> Without being able to trust the filesystem to tell me accurately what I've
>>> written into it, I have to use some other mechanism.  Currently, I've
>>> switched to storing a map in an xattr with 1 bit per 256k block, but that
>>> gets hard to use if the file grows particularly large and also has
>>> integrity consequences - though those are hopefully limited as I'm now
>>> using DIO to store data into the cache.
>>
>> Would you like to explain why you want to know such fs internal info?
> 
> As Andreas pointed out, fscache+cachefiles is used to cache data locally for
> network filesystems (9p, afs, ceph, cifs, nfs).  Cached files may be sparse,
> with unreferenced blocks not currently stored in the cache.
> 
> I'm attempting to move to a model where I don't use bmap and don't monitor
> bit-waitqueues to find out when page flags flip on backing files so that I can
> copy data out, but rather use DIO directly to/from the network filesystem
> inode pages.
> 
> Since the backing filesystem has to keep track of whether data is stored in a
> file, it would seem a shame to have to maintain a parallel copy on the same
> medium, with the coherency issues that entail.
> 
> David
> 
> 




* Re: Problems with determining data presence by examining extents?
  2020-01-15 13:10     ` Qu Wenruo
  2020-01-15 13:31       ` Christoph Hellwig
@ 2020-01-15 14:35       ` David Howells
  2020-01-15 14:48         ` Christoph Hellwig
  2020-01-15 14:59         ` David Howells
  1 sibling, 2 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Qu Wenruo, Andreas Dilger, linux-fsdevel, Al Viro,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

Christoph Hellwig <hch@lst.de> wrote:

> If we can't get that easily it can be emulated using lseek SEEK_DATA /
> SEEK_HOLE assuming no other thread could be writing to the file, or the
> raciness doesn't matter.

Another thread could be writing to the file, and the raciness matters if I
want to cache the result of calling SEEK_HOLE - though it might be possible
just to mask it off.

One problem I have with SEEK_HOLE is that there's no upper bound on it.  Say
I have a 1GiB cachefile that's completely populated and I want to find out if
the first byte is present or not.  I call:

	end = vfs_llseek(file, SEEK_HOLE, 0);

It will have to scan the metadata of the entire 1GiB file and will then
presumably return the EOF position.  Now this might only be a mild irritation
as I can cache this information for later use, but it does potentially put
a performance hiccough in the case of someone only reading the first page or
so of the file (say, the file(1) program).  On the other hand, probably most of
the files in the cache are likely to be complete - in which case, it's
probably quite cheap.
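
At least the fully-populated case collapses to a single SEEK_HOLE call; e.g. from user space (with the caveat, again, that a filesystem lacking hole tracking also reports "complete"):

```python
import os

def looks_complete(fd: int) -> bool:
    """True if the first hole the fs reports is at (or after) EOF."""
    size = os.fstat(fd).st_size
    if size == 0:
        return True
    try:
        return os.lseek(fd, 0, os.SEEK_HOLE) >= size
    except OSError:
        return False  # SEEK_HOLE unsupported; fall back to other bookkeeping
```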

However, SEEK_HOLE doesn't help with the issue of the filesystem 'altering'
the content of the file by adding or removing blocks of zeros.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 14:35       ` David Howells
@ 2020-01-15 14:48         ` Christoph Hellwig
  2020-01-15 14:59         ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-01-15 14:48 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Qu Wenruo, Andreas Dilger, linux-fsdevel,
	Al Viro, Theodore Y. Ts'o, Darrick J. Wong, Chris Mason,
	Josef Bacik, David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

On Wed, Jan 15, 2020 at 02:35:22PM +0000, David Howells wrote:
> > If we can't get that easily it can be emulated using lseek SEEK_DATA /
> > SEEK_HOLE assuming no other thread could be writing to the file, or the
> > raciness doesn't matter.
> 
> Another thread could be writing to the file, and the raciness matters if I
> want to cache the result of calling SEEK_HOLE - though it might be possible
> just to mask it off.

Well, if you have other threads changing the file (writing, punching holes,
truncating, etc.) you have lost with any interface that isn't an atomic
"give me the data or tell me it's a hole".  And even then, if you allow
threads that aren't part of your fscache implementation to do the
modifications, you have lost.  If, on the other hand, they are part of
fscache, you should be able to synchronize your threads somehow.

> One problem I have with SEEK_HOLE is that there's no upper bound on it.  Say
> I have a 1GiB cachefile that's completely populated and I want to find out if
> the first byte is present or not.  I call:
> 
> 	end = vfs_llseek(file, SEEK_HOLE, 0);
> 
> It will have to scan the metadata of the entire 1GiB file and will then
> presumably return the EOF position.  Now this might only be a mild irritation
> as I can cache this information for later use, but it does potentially put
> a performance hiccough in the case of someone only reading the first page or
> so of the file (say, the file(1) program).  On the other hand, probably most of
> the files in the cache are likely to be complete - in which case, it's
> probably quite cheap.

At least for XFS all the metadata is read from disk at once anyway,
so you only spend a few more cycles walking through a pretty efficient
in-memory data structure.

> However, SEEK_HOLE doesn't help with the issue of the filesystem 'altering'
> the content of the file by adding or removing blocks of zeros.

As does any other method.  If you need that fine grained control you
need to track the information yourself.


* Re: Problems with determining data presence by examining extents?
  2020-01-15 14:05 ` David Howells
  2020-01-15 14:24   ` Qu Wenruo
@ 2020-01-15 14:50   ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 14:50 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dhowells, linux-fsdevel, viro, hch, tytso, adilger.kernel,
	darrick.wong, clm, josef, dsterba, linux-ext4, linux-xfs,
	linux-btrfs, linux-kernel

Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> "Unaligned" means "unaligned to fs sector size". In btrfs it's page
> size, thus it shouldn't be a problem for your 256K block size.

Cool.

> > Same answer as above.  Btw, since I'm using DIO reads and writes, would these
> > get compressed?
> 
> Yes. DIO will also be compressed unless you set the inode to nocompression.
> 
> And you may not like this btrfs internal design:
> Compressed extent can only be as large as 128K (uncompressed size).
> 
> So 256K block write will be split into 2 extents anyway.
> And since compressed extent will cause non-continuous physical offset,
> it will always be two extents to fiemap, even you're always writing in
> 256K block size.
> 
> Not sure if this matters though.

Not a problem, provided I can read them with a single DIO read.  I just need
to know whether the data is present.  I don't need to know where it is or what
hoops the filesystem goes through to get it.

> > I'm not sure this isn't the same answer as above either, except if this
> > results in parts of the file being "filled in" with blocks of zeros that I
> > haven't supplied.
> 
> The example would be, you have written 256K data, all filled with 0xaa.
> And it committed to disk.
> Then the next time you write another 256K data, all filled with 0xaa.
> Then instead of writing this data onto disk, the fs chooses to reuse
> your previous written data, doing a reflink to it.

That's fine as long as the filesystem says it's there when I ask for it.
Having it shared isn't a problem.

But that brings me back to the original issue and that's the potential problem
of the filesystem optimising storage by adding or removing blocks of zero
bytes.  If either of those can happen, I cannot rely on the filesystem
metadata.

> So fiemap would report your latter 256K has the same bytenr of your
> previous 256K write (since it's reflinked), and with SHARED flag.

It might be better for me to use SEEK_HOLE than fiemap - barring the slight
issues that SEEK_HOLE has no upper bound and that writes may be taking place
at the same time.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 14:35       ` David Howells
  2020-01-15 14:48         ` Christoph Hellwig
@ 2020-01-15 14:59         ` David Howells
  2020-01-16 10:13           ` Christoph Hellwig
  2020-01-17 16:43           ` David Howells
  1 sibling, 2 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 14:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Qu Wenruo, Andreas Dilger, linux-fsdevel, Al Viro,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

Christoph Hellwig <hch@lst.de> wrote:

> > Another thread could be writing to the file, and the raciness matters if I
> > want to cache the result of calling SEEK_HOLE - though it might be possible
> > just to mask it off.
> 
> Well, if you have other threads changing the file (writing, punching holes,
> truncating, etc.) you have lost with any interface that isn't an atomic
> "give me the data or tell me it's a hole".  And even then, if you allow
> threads that aren't part of your fscache implementation to do the
> modifications, you have lost.  If, on the other hand, they are part of
> fscache, you should be able to synchronize your threads somehow.

Another thread could be writing to the file at the same time, but not in the
same block.  That's managed by netfs, most likely based on the pages and page
flags attached to the netfs inode being cached in this particular file[*].

What I was more thinking of is that SEEK_HOLE might run past the block of
interest and into a block that's currently being written and see a partially
written block.

[*] For AFS, this is only true of regular files; dirs and symlinks are cached
    as monoliths and are there entirely or not at all.

> > However, SEEK_HOLE doesn't help with the issue of the filesystem 'altering'
> > the content of the file by adding or removing blocks of zeros.
> 
> As does any other method.  If you need that fine grained control you
> need to track the information yourself.

So, basically, I can't.  Okay.  I was hoping it might be possible to add an
ioctl or something to tell filesystems not to do that with particular files.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 13:31       ` Christoph Hellwig
@ 2020-01-15 19:48         ` Andreas Dilger
  2020-01-16 10:16           ` Christoph Hellwig
  2020-01-15 20:55         ` David Howells
  1 sibling, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2020-01-15 19:48 UTC (permalink / raw)
  To: David Howells, Christoph Hellwig
  Cc: Qu Wenruo, linux-fsdevel, Al Viro, Theodore Y. Ts'o,
	Darrick J. Wong, Chris Mason, Josef Bacik, David Sterba,
	linux-ext4, linux-xfs, linux-btrfs, Linux Kernel Mailing List


On Jan 15, 2020, at 6:31 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, Jan 15, 2020 at 09:10:44PM +0800, Qu Wenruo wrote:
>>> That allows userspace to distinguish fe_physical addresses that may be
>>> on different devices.  This isn't in the kernel yet, since it is mostly
>>> useful only for Btrfs and nobody has implemented it there.  I can give
>>> you details if working on this for Btrfs is of interest to you.
>> 
>> IMHO it's not good enough.
>> 
>> The concern is, one extent can exist on multiple devices (mirrors for
>> RAID1/RAID10/RAID1C2/RAID1C3, or stripes for RAID5/6).
>> I didn't see how it can be easily implemented even with extra fields.
>> 
>> And even if we implement it, it can be too complex or bug-prone to fill
>> per-device info.
> 
> It's also completely bogus for the use cases to start with.  fiemap
> is a debug tool reporting the file system layout.  Using it for anything
related to actual data storage and data integrity is a recipe for
> disaster.  As said the right thing for the use case would be something
> like the NFS READ_PLUS operation.  If we can't get that easily it can
> be emulated using lseek SEEK_DATA / SEEK_HOLE assuming no other thread
> could be writing to the file, or the raciness doesn't matter.

I don't think either of those will be any better than FIEMAP, if the reason
is that the underlying filesystem is filling in holes with actual data
blocks to optimize the IO pattern.  SEEK_HOLE would not find a hole in
the block allocation, and would happily return the block of zeroes to
the caller.  Also, it isn't clear whether SEEK_HOLE considers an allocated
but unwritten extent to be a hole or data.

I think what is needed here is an fadvise/ioctl that tells the filesystem
"don't allocate blocks unless actually written" for that file.  Storing
anything in a separate data structure is a recipe for disaster, since it
will become inconsistent after a crash, or filesystem corruption+e2fsck,
and will unnecessarily bloat the on-disk metadata for every file to hold
redundant information.

I don't see COW/reflink/compression as being a problem in this case, since
what cachefiles cares about is whether there is _any_ data for a given
logical offset, not where/how the data is stored.  IF FIEMAP was used for
a btrfs backing filesystem, it would need the "EXTENT_DATA_COMPRESSED"
feature to be implemented as well, so that it can distinguish the logical
vs. physical allocations.  I don't think that would be needed for SEEK_HOLE
and SEEK_DATA, so long as they handle unwritten extents properly (and are
correctly implemented in the first place, some filesystems fall back to
always returning the next block for SEEK_DATA).

Cheers, Andreas








* Re: Problems with determining data presence by examining extents?
  2020-01-15 13:31       ` Christoph Hellwig
  2020-01-15 19:48         ` Andreas Dilger
@ 2020-01-15 20:55         ` David Howells
  2020-01-15 22:11           ` Andreas Dilger
  2020-01-15 23:09           ` David Howells
  1 sibling, 2 replies; 24+ messages in thread
From: David Howells @ 2020-01-15 20:55 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: dhowells, Christoph Hellwig, Qu Wenruo, linux-fsdevel, Al Viro,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

Andreas Dilger <adilger@dilger.ca> wrote:

> I think what is needed here is an fadvise/ioctl that tells the filesystem
> "don't allocate blocks unless actually written" for that file.

Yeah - and it would probably need to find its way onto disk so that its effect
is persistent and visible to out-of-kernel tools.

It would also have to say that blocks of zeros shouldn't be optimised away.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 20:55         ` David Howells
@ 2020-01-15 22:11           ` Andreas Dilger
  2020-01-15 23:09           ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2020-01-15 22:11 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Qu Wenruo, linux-fsdevel, Al Viro,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List


On Jan 15, 2020, at 1:55 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andreas Dilger <adilger@dilger.ca> wrote:
> 
>> I think what is needed here is an fadvise/ioctl that tells the filesystem
>> "don't allocate blocks unless actually written" for that file.
> 
> Yeah - and it would probably need to find its way onto disk so that its effect
> is persistent and visible to out-of-kernel tools.
> 
> It would also have to say that blocks of zeros shouldn't be optimised away.

I don't necessarily see that as a requirement, so long as the filesystem
stores a "block" at that offset, but it could dedupe all zero-filled blocks
to the same "zero block".  That still allows saving storage space, while
keeping the semantics of "this block was written into the file" rather than
"there is a hole at this offset".

Cheers, Andreas








* Re: Problems with determining data presence by examining extents?
  2020-01-15 20:55         ` David Howells
  2020-01-15 22:11           ` Andreas Dilger
@ 2020-01-15 23:09           ` David Howells
  2020-01-26 18:19             ` Zygo Blaxell
  1 sibling, 1 reply; 24+ messages in thread
From: David Howells @ 2020-01-15 23:09 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: dhowells, Christoph Hellwig, Qu Wenruo, linux-fsdevel, Al Viro,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

Andreas Dilger <adilger@dilger.ca> wrote:

> > It would also have to say that blocks of zeros shouldn't be optimised away.
> 
> I don't necessarily see that as a requirement, so long as the filesystem
> stores a "block" at that offset, but it could dedupe all zero-filled blocks
> to the same "zero block".  That still allows saving storage space, while
> keeping the semantics of "this block was written into the file" rather than
> "there is a hole at this offset".

Yeah, that's more what I was thinking of.  Provided I can find out that
something is present, it should be fine.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 14:59         ` David Howells
@ 2020-01-16 10:13           ` Christoph Hellwig
  2020-01-17 16:43           ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-01-16 10:13 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Qu Wenruo, Andreas Dilger, linux-fsdevel,
	Al Viro, Theodore Y. Ts'o, Darrick J. Wong, Chris Mason,
	Josef Bacik, David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

On Wed, Jan 15, 2020 at 02:59:38PM +0000, David Howells wrote:
> Another thread could be writing to the file at the same time, but not in the
> same block.  That's managed by netfs, most likely based on the pages and page
> flags attached to the netfs inode being cached in this particular file[*].
> 
> What I was more thinking of is that SEEK_HOLE might run past the block of
> interest and into a block that's currently being written and see a partially
> written block.

But that's not a problem given that you know where to search.

> 
> [*] For AFS, this is only true of regular files; dirs and symlinks are cached
>     as monoliths and are there entirely or not at all.
> 
> > > However, SEEK_HOLE doesn't help with the issue of the filesystem 'altering'
> > > the content of the file by adding or removing blocks of zeros.
> > 
> > As does any other method.  If you need that fine grained control you
> > need to track the information yourself.
> 
> So, basically, I can't.  Okay.  I was hoping it might be possible to add an
> ioctl or something to tell filesystems not to do that with particular files.

File systems usually pad zeroes where they have to, typically for
sub-blocksize writes.   Disabling this would break data integrity.


* Re: Problems with determining data presence by examining extents?
  2020-01-15 19:48         ` Andreas Dilger
@ 2020-01-16 10:16           ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-01-16 10:16 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: David Howells, Christoph Hellwig, Qu Wenruo, linux-fsdevel,
	Al Viro, Theodore Y. Ts'o, Darrick J. Wong, Chris Mason,
	Josef Bacik, David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

On Wed, Jan 15, 2020 at 12:48:44PM -0700, Andreas Dilger wrote:
> I don't think either of those will be any better than FIEMAP, if the reason
> is that the underlying filesystem is filling in holes with actual data
> blocks to optimize the IO pattern.  SEEK_HOLE would not find a hole in
> the block allocation, and would happily return the block of zeroes to
> the caller.  Also, it isn't clear if SEEK_HOLE considers an allocated but
> unwritten extent to be a hole or a block?

It is supposed to treat unwritten extents that are not dirty as holes.
Note that fiemap can't even track the dirty state, so it will always give
you the wrong answer in some cases.  And that is by design given that it
is a debug tool to give you the file system extent layout and can't be
used for data integrity purposes.


* Re: Problems with determining data presence by examining extents?
  2020-01-15 14:59         ` David Howells
  2020-01-16 10:13           ` Christoph Hellwig
@ 2020-01-17 16:43           ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2020-01-17 16:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Qu Wenruo, Andreas Dilger, linux-fsdevel, Al Viro,
	Theodore Y. Ts'o, Darrick J. Wong, Chris Mason, Josef Bacik,
	David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List

Christoph Hellwig <hch@lst.de> wrote:

> File systems usually pad zeroes where they have to, typically for
> sub-blocksize writes.   Disabling this would break data integrity.

I understand that.  I can, however, round up the netfs I/O granule size and
alignment to a multiple of the cachefile I/O block size.  Also, I'm doing DIO,
so I have to use block size multiples.

But if the filesystem can avoid bridging large, appropriately sized and
aligned blocks, then I can use it.

David



* Re: Problems with determining data presence by examining extents?
  2020-01-15 23:09           ` David Howells
@ 2020-01-26 18:19             ` Zygo Blaxell
  0 siblings, 0 replies; 24+ messages in thread
From: Zygo Blaxell @ 2020-01-26 18:19 UTC (permalink / raw)
  To: David Howells
  Cc: Andreas Dilger, Christoph Hellwig, Qu Wenruo, linux-fsdevel,
	Al Viro, Theodore Y. Ts'o, Darrick J. Wong, Chris Mason,
	Josef Bacik, David Sterba, linux-ext4, linux-xfs, linux-btrfs,
	Linux Kernel Mailing List


On Wed, Jan 15, 2020 at 11:09:03PM +0000, David Howells wrote:
> Andreas Dilger <adilger@dilger.ca> wrote:
> 
> > > It would also have to say that blocks of zeros shouldn't be optimised away.
> > 
> > I don't necessarily see that as a requirement, so long as the filesystem
> > stores a "block" at that offset, but it could dedupe all zero-filled blocks
> > to the same "zero block".  That still allows saving storage space, while
> > keeping the semantics of "this block was written into the file" rather than
> > "there is a hole at this offset".
> 
> Yeah, that's more what I was thinking of.  Provided I can find out that
> something is present, it should be fine.

I'm curious how this proposal handles an application punching a hole
through the cache?  Does that get cached, or does that operation have
to be synchronous with the server?  Or is it a moot point because no
server supports hole punching, so it gets replaced with equivalent zero
block data writes?

Zero blocks are stupidly common on typical user data corpuses, and a
naive block-oriented deduper can create monster extents with millions
or even billions of references if it doesn't have some special handling
for zero blocks.  Even if they don't trigger filesystem performance bugs
or hit RAM or other implementation limits, it's still bigger and slower
to use zero-filled data blocks than just using holes for zero blocks.

In the bees deduper for btrfs, zero blocks get replaced with holes
unconditionally in uncompressed extents, and in compressed extents if the
extent consists entirely of zeros (a long run of zero bytes is compressed
to a few bits by all supported compression algorithms, and hole metadata
is much larger than a few bits, so no gain is possible if anything less
than the entire compressed extent is eliminated).  That behavior could
be adjusted to support this use case, as a non-default user option.

For defrag a similar optimization is possible:  read a long run of
consecutive zero data blocks, write a prealloc extent.  I don't know of
anyone doing that in real life, but it would play havoc with anything
trying to store information in FIEMAP data (or related ioctls like
GETFSMAP or TREE_SEARCH).

I think an explicit dirty-cache-data metadata structure is a good idea
despite implementation complexity.  It would eliminate dependencies on
non-portable filesystem behavior, and not abuse a facility that might
already be in active (ab)use by other existing things.  If you have
a writeback cache, you need to properly control write ordering with a
purpose-built metadata structure, or fsync() will be meaningless through
your caching layer, and after a crash you'll upload whatever confused,
delalloc-reordered, torn-written steaming crap is on the local disk to
the backing store.

> David
> 




Thread overview: 24+ messages
-- links below jump to the message on this page --
2020-01-14 16:48 Problems with determining data presence by examining extents? David Howells
2020-01-14 22:49 ` Theodore Y. Ts'o
2020-01-15  3:54 ` Qu Wenruo
2020-01-15 12:46   ` Andreas Dilger
2020-01-15 13:10     ` Qu Wenruo
2020-01-15 13:31       ` Christoph Hellwig
2020-01-15 19:48         ` Andreas Dilger
2020-01-16 10:16           ` Christoph Hellwig
2020-01-15 20:55         ` David Howells
2020-01-15 22:11           ` Andreas Dilger
2020-01-15 23:09           ` David Howells
2020-01-26 18:19             ` Zygo Blaxell
2020-01-15 14:35       ` David Howells
2020-01-15 14:48         ` Christoph Hellwig
2020-01-15 14:59         ` David Howells
2020-01-16 10:13           ` Christoph Hellwig
2020-01-17 16:43           ` David Howells
2020-01-15 14:20   ` David Howells
2020-01-15  8:38 ` Christoph Hellwig
2020-01-15 13:50 ` David Howells
2020-01-15 14:05 ` David Howells
2020-01-15 14:24   ` Qu Wenruo
2020-01-15 14:50   ` David Howells
2020-01-15 14:15 ` David Howells
