* stat() on btrfs reports the st_blocks with delay (data loss in archivers)
@ 2016-07-02 7:18 Pavel Raiskup
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
2016-07-11 14:41 ` David Sterba
0 siblings, 2 replies; 21+ messages in thread
From: Pavel Raiskup @ 2016-07-02 7:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: bug-tar
There are optimizations in archivers (tar, rsync, ...) that rely on up2date
st_blocks info. For example, in GNU tar there is optimization check [1]
whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
tar considers that file is sparse (and does additional steps).
It looks like btrfs doesn't show correct value in 'st_blocks' until the data
are synced. ATM, there happens that:
a) some "tool" creates sparse file
b) that tool does not sync explicitly and exits ..
c) tar is called immediately after that to archive the sparse file
d) tar considers [2] the file is completely sparse (because st_blocks is
zero) and archives no data. Here comes data loss.
Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
small and is in-lined (no real data blocks) -- I consider this is too bug in
btrfs worth fixing.
[1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
[2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
Tested on kernel:
kernel-4.5.7-300.fc24.x86_64
Originally reported here, reproducer available there:
https://bugzilla.redhat.com/show_bug.cgi?id=1352061
Pavel
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-02 7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup
@ 2016-07-04 19:35 ` Andreas Dilger
2016-07-05 9:28 ` Joerg Schilling
2016-07-07 8:08 ` Pavel Raiskup
2016-07-11 14:41 ` David Sterba
1 sibling, 2 replies; 21+ messages in thread
From: Andreas Dilger @ 2016-07-04 19:35 UTC (permalink / raw)
To: Pavel Raiskup; +Cc: linux-btrfs, bug-tar
[-- Attachment #1: Type: text/plain, Size: 2427 bytes --]
On Jul 2, 2016, at 1:18 AM, Pavel Raiskup <praiskup@redhat.com> wrote:
>
> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
> st_blocks info. For example, in GNU tar there is optimization check [1]
> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
> tar considers that file is sparse (and does additional steps).
>
> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
> are synced. ATM, there happens that:
>
> a) some "tool" creates sparse file
> b) that tool does not sync explicitly and exits ..
> c) tar is called immediately after that to archive the sparse file
> d) tar considers [2] the file is completely sparse (because st_blocks is
> zero) and archives no data. Here comes data loss.
>
> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
> small and is in-lined (no real data blocks) -- I consider this is too bug in
> btrfs worth fixing.
We had a similar problem with both ext4 and Lustre - the client was reporting
zero blocks due to delayed allocation until data was written to disk. While
those problems were fixed in the filesystem to report an estimate of the block
count before any blocks were actually written to disk, it seems like this may
be a problem that will come up again with other filesystems in the future.
I think in addition to fixing btrfs (because it needs to work with existing
tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
to handle this situation more robustly. One option is if st_blocks == 0 then
tar should also check if st_mtime is less than 60s in the past, and if yes
then it should call fsync() on the file to flush any unwritten data to disk,
or assume the file is not sparse and read the whole file, so that it doesn't
incorrectly assume that the file is sparse and skip archiving the file data.
Cheers, Andreas
>
> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
>
> Tested on kernel:
> kernel-4.5.7-300.fc24.x86_64
>
> Originally reported here, reproducer available there:
> https://bugzilla.redhat.com/show_bug.cgi?id=1352061
>
> Pavel
>
>
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
@ 2016-07-05 9:28 ` Joerg Schilling
2016-07-06 11:37 ` Austin S. Hemmelgarn
2016-07-07 8:08 ` Pavel Raiskup
1 sibling, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-05 9:28 UTC (permalink / raw)
To: praiskup, adilger; +Cc: linux-btrfs, bug-tar
Andreas Dilger <adilger@dilger.ca> wrote:
> I think in addition to fixing btrfs (because it needs to work with existing
> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
> to handle this situation more robustly. One option is if st_blocks == 0 then
> tar should also check if st_mtime is less than 60s in the past, and if yes
> then it should call fsync() on the file to flush any unwritten data to disk,
> or assume the file is not sparse and read the whole file, so that it doesn't
> incorrectly assume that the file is sparse and skip archiving the file data.
A broken filesystem is a broken filesystem.
If you try to change gtar to work around a specific problem, it may fail in
other situations.
Jörg
--
EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-05 9:28 ` Joerg Schilling
@ 2016-07-06 11:37 ` Austin S. Hemmelgarn
2016-07-06 11:49 ` Joerg Schilling
0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 11:37 UTC (permalink / raw)
To: Joerg Schilling, praiskup, adilger; +Cc: linux-btrfs, bug-tar
On 2016-07-05 05:28, Joerg Schilling wrote:
> Andreas Dilger <adilger@dilger.ca> wrote:
>
>> I think in addition to fixing btrfs (because it needs to work with existing
>> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
>> to handle this situation more robustly. One option is if st_blocks == 0 then
>> tar should also check if st_mtime is less than 60s in the past, and if yes
>> then it should call fsync() on the file to flush any unwritten data to disk,
>> or assume the file is not sparse and read the whole file, so that it doesn't
>> incorrectly assume that the file is sparse and skip archiving the file data.
>
> A broken filesystem is a broken filesystem.
>
> If you try to change gtar to work around a specific problem, it may fail in
> other situations.
The problem with this is that tar is assuming things that are not
guaranteed to be true. There is absolutely nothing that says that
st_blocks has to be non-zero if there's data in the file. In fact, the
behavior that BTRFS used to have of reporting st_blocks to be 0 for
files entirely inlined in the metadata is absolutely correct given the
description of the field by POSIX, because there _are_ no blocks
allocated to the file (because the metadata block is technically
equivalent to the inode, which isn't counted by st_blocks). This is yet
another example of an old interface (in this case, sparse file
detection) being short-sighted (read in this case as non-existent).
The proper fix for this is that tar (and anything else that handles
sparse files differently) should be parsing the file regardless. It has
to anyway for a normal sparse file to figure out where the sparse
regions are, and optimizing for a file that's completely sparse (and
therefore probably pre-allocated with fallocate) is not all that
reasonable considering that this is going to be a very rare case in
normal usage.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 11:37 ` Austin S. Hemmelgarn
@ 2016-07-06 11:49 ` Joerg Schilling
2016-07-06 14:43 ` Antonio Diaz Diaz
0 siblings, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 11:49 UTC (permalink / raw)
To: praiskup, ahferroin7, adilger; +Cc: linux-btrfs, bug-tar
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> > A broken filesystem is a broken filesystem.
> >
> > If you try to change gtar to work around a specific problem, it may fail in
> > other situations.
> The problem with this is that tar is assuming things that are not
> guaranteed to be true. There is absolutely nothing that says that
> st_blocks has to be non-zero if there's data in the file. In fact, the
This is not true: POSIX requires st_blocks to be != 0 in case that the file
contains data.
> behavior that BTRFS used to have of reporting st_blocks to be 0 for
> files entirely inlined in the metadata is absolutely correct given the
> description of the field by POSIX, because there _are_ no blocks
> allocated to the file (because the metadata block is technically
> equivalent to the inode, which isn't counted by st_blocks). This is yet
> another example of an old interface (in this case, sparse file
> detection) being short-sighted (read in this case as non-existent).
The internal state of a file system is irrelevant. The only thing that counts
is the user space view and if a file contains data (read succeeds in user
space), it needs to report st_blocks != 0.
> The proper fix for this is that tar (and anything else that handles
> sparse files differently) should be parsing the file regardless. It has
> to anyway for a normal sparse file to figure out where the sparse
> regions are, and optimizing for a file that's completely sparse (and
> therefore probably pre-allocated with fallocate) is not all that
> reasonable considering that this is going to be a very rare case in
> normal usage.
This does not help.
Even on a decent OS (e.g. Solaris since Summer 2005) and a decent tar
implementation (star) that supports SEEK_HOLE since Summer 2005, this method
will not work for all filesystems as there may be old filesystem
implementations and as there may be NFS...
For this reason, star still checks st_blocks in case that SEEK_HOLE did not
work.
Jörg
--
EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 11:49 ` Joerg Schilling
@ 2016-07-06 14:43 ` Antonio Diaz Diaz
2016-07-06 14:53 ` Joerg Schilling
0 siblings, 1 reply; 21+ messages in thread
From: Antonio Diaz Diaz @ 2016-07-06 14:43 UTC (permalink / raw)
To: Joerg Schilling; +Cc: ahferroin7, adilger, linux-btrfs, bug-tar
Joerg Schilling wrote:
> POSIX requires st_blocks to be != 0 in case that the file contains data.
Please, could you provide a reference? I can't find such requirement at
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
Thanks.
Antonio.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 14:43 ` Antonio Diaz Diaz
@ 2016-07-06 14:53 ` Joerg Schilling
2016-07-06 15:01 ` Paul Eggert
2016-07-06 15:12 ` Austin S. Hemmelgarn
0 siblings, 2 replies; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 14:53 UTC (permalink / raw)
To: antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger
Antonio Diaz Diaz <antonio@gnu.org> wrote:
> Joerg Schilling wrote:
> > POSIX requires st_blocks to be != 0 in case that the file contains data.
>
> Please, could you provide a reference? I can't find such requirement at
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
blkcnt_t st_blocks Number of blocks allocated for this object.
It should be obvious that a file that offers content also has allocated blocks.
Blocks are "allocated" when the OS decides whether the new data will fit on the
medium. The fact that some filesystems may have data in a cache but not yet on
the medium does not matter here. This is how UNIX worked since st_block has
been introduced nearly 40 years ago.
A new filesystem cannot introduce new rules just because people believe it would
save time.
Jörg
--
EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 14:53 ` Joerg Schilling
@ 2016-07-06 15:01 ` Paul Eggert
2016-07-06 15:09 ` Joerg Schilling
2016-07-06 15:12 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 21+ messages in thread
From: Paul Eggert @ 2016-07-06 15:01 UTC (permalink / raw)
To: Joerg Schilling, antonio; +Cc: adilger, ahferroin7, linux-btrfs, bug-tar
On 07/06/2016 04:53 PM, Joerg Schilling wrote:
> Antonio Diaz Diaz<antonio@gnu.org> wrote:
>
>> >Joerg Schilling wrote:
>>> > >POSIX requires st_blocks to be != 0 in case that the file contains data.
>> >
>> >Please, could you provide a reference? I can't find such requirement at
>> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
> blkcnt_t st_blocks Number of blocks allocated for this object.
This doesn't require that st_blocks must be nonzero if the file contains
nonzero data, any more that it requires that st_blocks must be nonzero
if the file contains zero data. In either case, metadata outside the
scope of st_blocks might contain enough information for the file system
to represent all the file's data.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 15:01 ` Paul Eggert
@ 2016-07-06 15:09 ` Joerg Schilling
2016-07-06 15:11 ` Paul Eggert
0 siblings, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 15:09 UTC (permalink / raw)
To: eggert, antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger
Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 07/06/2016 04:53 PM, Joerg Schilling wrote:
> > Antonio Diaz Diaz<antonio@gnu.org> wrote:
> >
> >> >Joerg Schilling wrote:
> >>> > >POSIX requires st_blocks to be != 0 in case that the file contains data.
> >> >
> >> >Please, could you provide a reference? I can't find such requirement at
> >> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
> > blkcnt_t st_blocks Number of blocks allocated for this object.
>
> This doesn't require that st_blocks must be nonzero if the file contains
> nonzero data, any more that it requires that st_blocks must be nonzero
> if the file contains zero data. In either case, metadata outside the
> scope of st_blocks might contain enough information for the file system
> to represent all the file's data.
In other words, you concur that a delayed assignment of the "correct" value for
st_blocks while the contend of the file does not change is not permitted.
Jörg
--
EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 15:09 ` Joerg Schilling
@ 2016-07-06 15:11 ` Paul Eggert
0 siblings, 0 replies; 21+ messages in thread
From: Paul Eggert @ 2016-07-06 15:11 UTC (permalink / raw)
To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger
On 07/06/2016 05:09 PM, Joerg Schilling wrote:
> you concur that a delayed assignment of the "correct" value for
> st_blocks while the contend of the file does not change is not permitted.
I'm not sure I agree even with that. A file system may undergo garbage
collection and compaction, for instance, in which a file's data do not
change but its internal representation does.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 14:53 ` Joerg Schilling
2016-07-06 15:01 ` Paul Eggert
@ 2016-07-06 15:12 ` Austin S. Hemmelgarn
2016-07-06 15:22 ` Joerg Schilling
1 sibling, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 15:12 UTC (permalink / raw)
To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger
On 2016-07-06 10:53, Joerg Schilling wrote:
> Antonio Diaz Diaz <antonio@gnu.org> wrote:
>
>> Joerg Schilling wrote:
>>> POSIX requires st_blocks to be != 0 in case that the file contains data.
>>
>> Please, could you provide a reference? I can't find such requirement at
>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
>
> blkcnt_t st_blocks Number of blocks allocated for this object.
>
> It should be obvious that a file that offers content also has allocated blocks.
What you mean then is that POSIX _implies_ that this is the case, but
does not say whether or not it is required. There are all kinds of
counterexamples to this too, procfs is a POSIX compliant filesystem
(every POSIX certified system has it), yet does not display the behavior
that you expect, every single file in /proc for example reports 0 for
both st_blocks and st_size, and yet all of them very obviously have content.
>
> Blocks are "allocated" when the OS decides whether the new data will fit on the
> medium. The fact that some filesystems may have data in a cache but not yet on
> the medium does not matter here. This is how UNIX worked since st_block has
> been introduced nearly 40 years ago.
Tradition is the corpse of wisdom. Backwards comparability is a problem
just as much as a good thing.
In all seriousness though, this started out because stuff wasn't cached
to anywhere near the degree it is today, and there was no such thing as
delayed allocation. When you said to write, the filesystem allocated
the blocks, regardless of when it actually wrote the data. IOW, the
behavior that GNU tar is relying on is an implementation detail, not an
API. Just like df, this breaks under modern designs, not because they
chose to break it, but because it wasn't designed for use with such
implementations.
In the case of tar and similar things though, I'd argue that it's not
sensible to special case files that are 'sparse', it should store any
long enough run of zeroes as a sparse region, then provide an option to
say to not make those files sparse when restored.
>
> A new filesystem cannot introduce new rules just because people believe it would
> save time.
Saying the file has no blocks when there are no blocks allocated for it
is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a
way to pack file data into the inode if it was small enough. In that
case, it woulod be perfectly reasonable to return 0 for st_blocks
because the inode table in UFS is a fixed pre-allocated structure, and
therefore nothing is allocated to the file itself except the inode. The
same applies in the case of a file packed into it's own metadata block
on BTRFS, nothing is allocated to that file beyond the metadata block it
has to have to store the inode. In the case of delayed allocation where
the file hasn't been flushed, there is nothing allocated, so st_blocks
based on a strict interpretation of it's description in POSIX _should_
be 0, because nothing is allocated yet.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 15:12 ` Austin S. Hemmelgarn
@ 2016-07-06 15:22 ` Joerg Schilling
2016-07-06 16:05 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 15:22 UTC (permalink / raw)
To: antonio, ahferroin7; +Cc: linux-btrfs, bug-tar, adilger
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> > It should be obvious that a file that offers content also has allocated blocks.
> What you mean then is that POSIX _implies_ that this is the case, but
> does not say whether or not it is required. There are all kinds of
> counterexamples to this too, procfs is a POSIX compliant filesystem
> (every POSIX certified system has it), yet does not display the behavior
> that you expect, every single file in /proc for example reports 0 for
> both st_blocks and st_size, and yet all of them very obviously have content.
You are mistaken.
stat /proc/$$/as
File: `/proc/6518/as'
Size: 2793472 Blocks: 5456 IO Block: 512 regular file
Device: 5440000h/88342528d Inode: 7557 Links: 1
Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200
stat /proc/$$/auxv
File: `/proc/6518/auxv'
Size: 168 Blocks: 1 IO Block: 512 regular file
Device: 5440000h/88342528d Inode: 7568 Links: 1
Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200
Any correct implementation of /proc returns the expected numbers in st_size as
well as in st_blocks.
> In all seriousness though, this started out because stuff wasn't cached
> to anywhere near the degree it is today, and there was no such thing as
> delayed allocation. When you said to write, the filesystem allocated
> the blocks, regardless of when it actually wrote the data. IOW, the
> behavior that GNU tar is relying on is an implementation detail, not an
> API. Just like df, this breaks under modern designs, not because they
> chose to break it, but because it wasn't designed for use with such
> implementations.
This seems to be a strange interpretation if what a standard is.
> > A new filesystem cannot introduce new rules just because people believe it would
> > save time.
> Saying the file has no blocks when there are no blocks allocated for it
> is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a
> way to pack file data into the inode if it was small enough. In that
> case, it woulod be perfectly reasonable to return 0 for st_blocks
> because the inode table in UFS is a fixed pre-allocated structure, and
Given that inode size is 128, such a change would not break things as the
heuristics would not imply a sparse file here.
> therefore nothing is allocated to the file itself except the inode. The
> same applies in the case of a file packed into it's own metadata block
> on BTRFS, nothing is allocated to that file beyond the metadata block it
> has to have to store the inode. In the case of delayed allocation where
> the file hasn't been flushed, there is nothing allocated, so st_blocks
> based on a strict interpretation of it's description in POSIX _should_
> be 0, because nothing is allocated yet.
Now you know why BTRFS is still an incomplete filesystem. In a few years when
it turns 10, this may change. People who implement filesystems of course need
to learn that they need to hide implementation details from the official user
space interfaces.
Jörg
--
EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 15:22 ` Joerg Schilling
@ 2016-07-06 16:05 ` Austin S. Hemmelgarn
2016-07-06 16:11 ` Austin S. Hemmelgarn
2016-07-06 16:33 ` Joerg Schilling
0 siblings, 2 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 16:05 UTC (permalink / raw)
To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger
On 2016-07-06 11:22, Joerg Schilling wrote:
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>
>>> It should be obvious that a file that offers content also has allocated blocks.
>> What you mean then is that POSIX _implies_ that this is the case, but
>> does not say whether or not it is required. There are all kinds of
>> counterexamples to this too, procfs is a POSIX compliant filesystem
>> (every POSIX certified system has it), yet does not display the behavior
>> that you expect, every single file in /proc for example reports 0 for
>> both st_blocks and st_size, and yet all of them very obviously have content.
>
> You are mistaken.
>
> stat /proc/$$/as
> File: `/proc/6518/as'
> Size: 2793472 Blocks: 5456 IO Block: 512 regular file
> Device: 5440000h/88342528d Inode: 7557 Links: 1
> Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
> Access: 2016-07-06 16:33:15.660224934 +0200
> Modify: 2016-07-06 16:33:15.660224934 +0200
> Change: 2016-07-06 16:33:15.660224934 +0200
>
> stat /proc/$$/auxv
> File: `/proc/6518/auxv'
> Size: 168 Blocks: 1 IO Block: 512 regular file
> Device: 5440000h/88342528d Inode: 7568 Links: 1
> Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
> Access: 2016-07-06 16:33:15.660224934 +0200
> Modify: 2016-07-06 16:33:15.660224934 +0200
> Change: 2016-07-06 16:33:15.660224934 +0200
>
> Any correct implementation of /proc returns the expected numbers in st_size as
> well as in st_blocks.
Odd, because I get 0 for both values on all the files in /proc/self and
all the top level files on all kernels I tested prior to sending that
e-mail, for reference, they include:
* A direct clone of HEAD on torvalds/linux
* 4.6.3 mainline
* 4.1.27 mainline
* 4.6.3 mainline with a small number of local patches on top
* 4.1.19+ from the Raspberry Pi foundation
* 4.4.6-gentoo (mainline with Gentoo patches on top)
* 4.5.5-linode69 (not certain about the patches on top)
It's probably notable that I don't see /proc/$PID/as on any of these
systems, which implies you're running some significantly different
kernel version to begin with, and therefore it's not unreasonable to
assume that what you see is because of some misguided patch that got
added to allow tar to archive /proc.
>
>> In all seriousness though, this started out because stuff wasn't cached
>> to anywhere near the degree it is today, and there was no such thing as
>> delayed allocation. When you said to write, the filesystem allocated
>> the blocks, regardless of when it actually wrote the data. IOW, the
>> behavior that GNU tar is relying on is an implementation detail, not an
>> API. Just like df, this breaks under modern designs, not because they
>> chose to break it, but because it wasn't designed for use with such
>> implementations.
>
> This seems to be a strange interpretation if what a standard is.
Except what I'm talking about is the _interpretation_ of the standard,
not the standard itself. I said nothing about the standard, all it
requires is that st_blocks be the number of 512 byte blocks allocated by
the filesystem for the file. There is nothing in there about it having
to reflect the expected size of the allocated content on disk. In fact,
there's technically nothing in there about how to handle sparse files
either.
To further explain what I'm trying to say, here's a rough description of
what happens in SVR4 UFS (and other non-delayed allocation filesystems)
when you issue a write:
1. The number of new blocks needed to fulfill the write request is
calculated.
2. If this number is greater than 0, that many new blocks are allocated,
and st_blocks for that file is functionally updated (I don't recall if
it was dynamically calculated per call or not)
3. At some indeterminate point in the future, the decision is made to
flush the cache.
4. The data is written to the appropriate place in the file.
By comparison, in a delayed allocation scenario, 3 happens before 1 and
2. 1 and 2 obviously have to be strictly ordered WRT each other and 4,
but based on the POSIX standard, 3 does not have to be strictly ordered
with regards to any of them (although it is illogical to have it between
1 and 2 or after 4). Because it is not required by the standard to have
3 be strictly ordered and the ordering isn't part of the API itself,
where it happens in the sequence is an implementation detail.
>
>>> A new filesystem cannot introduce new rules just because people believe it would
>>> save time.
>> Saying the file has no blocks when there are no blocks allocated for it
>> is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a
>> way to pack file data into the inode if it was small enough. In that
>> case, it woulod be perfectly reasonable to return 0 for st_blocks
>> because the inode table in UFS is a fixed pre-allocated structure, and
>
> Given that inode size is 128, such a change would not break things as the
> heuristics would not imply a sparse file here.
OK, so change the heuristic then so that there's a reasonable limit on
small files, or better yet, just get rid of it, as it was introduced to
save time itself. The case it optimizes for (large sparse files) is
mostly irrelevant, because you have to parse the file to figure out
what's sparse anyway, and anyone who's dealing with completely sparse
files has other potential issues to deal with (I'm actually curious if
you know of some legitimate reason to copy a completely empty file in a
backup anyway, they're almost always either pre-allocated but unused
files which will just get reallocated by whatever allocated them in the
first place, lock-files, or something similar in some other way).
Regardless, the correct way on current Linux systems to determine file
sparseness is SEEK_DATA and SEEK_HOLE. You have to read any data that's
there anyway, so alternating SEEK_DATA and SEEK_HOLE is necessary to
find where the data is that you need to read. If the first SEEK_HOLE
returns the end of the file, then you know it's not sparse.
>
>> therefore nothing is allocated to the file itself except the inode. The
>> same applies in the case of a file packed into it's own metadata block
>> on BTRFS, nothing is allocated to that file beyond the metadata block it
>> has to have to store the inode. In the case of delayed allocation where
>> the file hasn't been flushed, there is nothing allocated, so st_blocks
>> based on a strict interpretation of it's description in POSIX _should_
>> be 0, because nothing is allocated yet.
>
> Now you know why BTRFS is still an incomplete filesystem. In a few years when
> it turns 10, this may change. People who implement filesystems of course need
> to learn that they need to hide implementation details from the official user
> space interfaces.
So in other words you think we should be lying about how much is
actually allocated on disk and thus violating the standard directly (and
yes, ext4 and everyone else who does this with delayed allocation _is_
strictly speaking violating the standard, because _nothing_ is allocated
yet)?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 16:05 ` Austin S. Hemmelgarn
@ 2016-07-06 16:11 ` Austin S. Hemmelgarn
2016-07-06 16:33 ` Joerg Schilling
1 sibling, 0 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 16:11 UTC (permalink / raw)
To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger
On 2016-07-06 12:05, Austin S. Hemmelgarn wrote:
> On 2016-07-06 11:22, Joerg Schilling wrote:
>> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>>
>>>> It should be obvious that a file that offers content also has
>>>> allocated blocks.
>>> What you mean then is that POSIX _implies_ that this is the case, but
>>> does not say whether or not it is required. There are all kinds of
>>> counterexamples to this too, procfs is a POSIX compliant filesystem
>>> (every POSIX certified system has it), yet does not display the behavior
>>> that you expect, every single file in /proc for example reports 0 for
>>> both st_blocks and st_size, and yet all of them very obviously have
>>> content.
>>
>> You are mistaken.
>>
>> stat /proc/$$/as
>> File: `/proc/6518/as'
>> Size: 2793472 Blocks: 5456 IO Block: 512 regular file
>> Device: 5440000h/88342528d Inode: 7557 Links: 1
>> Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
>> Access: 2016-07-06 16:33:15.660224934 +0200
>> Modify: 2016-07-06 16:33:15.660224934 +0200
>> Change: 2016-07-06 16:33:15.660224934 +0200
>>
>> stat /proc/$$/auxv
>> File: `/proc/6518/auxv'
>> Size: 168 Blocks: 1 IO Block: 512 regular file
>> Device: 5440000h/88342528d Inode: 7568 Links: 1
>> Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
>> Access: 2016-07-06 16:33:15.660224934 +0200
>> Modify: 2016-07-06 16:33:15.660224934 +0200
>> Change: 2016-07-06 16:33:15.660224934 +0200
>>
>> Any correct implementation of /proc returns the expected numbers in
>> st_size as
>> well as in st_blocks.
> Odd, because I get 0 for both values on all the files in /proc/self and
> all the top level files on all kernels I tested prior to sending that
> e-mail, for reference, they include:
> * A direct clone of HEAD on torvalds/linux
> * 4.6.3 mainline
> * 4.1.27 mainline
> * 4.6.3 mainline with a small number of local patches on top
> * 4.1.19+ from the Raspberry Pi foundation
> * 4.4.6-gentoo (mainline with Gentoo patches on top)
> * 4.5.5-linode69 (not certain about the patches on top)
Further ones I've now tested that behave like the others listed above:
* 2.4.20-8 from RedHat 9
* 2.6.18-1.2798.fc6 from Fedora Core 6
* 3.11.10-301.fc20 from Fedora 20
IOW, it looks like whatever you're running is an exception here.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 16:05 ` Austin S. Hemmelgarn
2016-07-06 16:11 ` Austin S. Hemmelgarn
@ 2016-07-06 16:33 ` Joerg Schilling
2016-07-06 17:35 ` Andreas Dilger
1 sibling, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 16:33 UTC (permalink / raw)
To: antonio, ahferroin7; +Cc: linux-btrfs, bug-tar, adilger
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> On 2016-07-06 11:22, Joerg Schilling wrote:
> > "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> >
> >>> It should be obvious that a file that offers content also has allocated blocks.
> >> What you mean then is that POSIX _implies_ that this is the case, but
> >> does not say whether or not it is required. There are all kinds of
> >> counterexamples to this too, procfs is a POSIX compliant filesystem
> >> (every POSIX certified system has it), yet does not display the behavior
> >> that you expect, every single file in /proc for example reports 0 for
> >> both st_blocks and st_size, and yet all of them very obviously have content.
> >
> > You are mistaken.
> >
> > stat /proc/$$/as
> > File: `/proc/6518/as'
> > Size: 2793472 Blocks: 5456 IO Block: 512 regular file
> > Device: 5440000h/88342528d Inode: 7557 Links: 1
> > Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
> > Access: 2016-07-06 16:33:15.660224934 +0200
> > Modify: 2016-07-06 16:33:15.660224934 +0200
> > Change: 2016-07-06 16:33:15.660224934 +0200
> >
> > stat /proc/$$/auxv
> > File: `/proc/6518/auxv'
> > Size: 168 Blocks: 1 IO Block: 512 regular file
> > Device: 5440000h/88342528d Inode: 7568 Links: 1
> > Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
> > Access: 2016-07-06 16:33:15.660224934 +0200
> > Modify: 2016-07-06 16:33:15.660224934 +0200
> > Change: 2016-07-06 16:33:15.660224934 +0200
> >
> > Any correct implementation of /proc returns the expected numbers in st_size as
> > well as in st_blocks.
> Odd, because I get 0 for both values on all the files in /proc/self and
> all the top level files on all kernels I tested prior to sending that
I tested this with an official PROCFS-2 implementation that was written by
the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased
away last weekend.
You may have done your tests on an inofficial procfs implementation....
> > Now you know why BTRFS is still an incomplete filesystem. In a few years when
> > it turns 10, this may change. People who implement filesystems of course need
> > to learn that they need to hide implementation details from the official user
> > space interfaces.
> So in other words you think we should be lying about how much is
> actually allocated on disk and thus violating the standard directly (and
> yes, ext4 and everyone else who does this with delayed allocation _is_
> strictly speaking violating the standard, because _nothing_ is allocated
> yet)?
If it returns 0, it would be lying or it would be wrong anyway as it did not
check fpe available space.
Also note that I mentioned already that the priciple availability of SEEK_HOLE
does not help as there is e.g. NFS...
Jörg
--
EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-06 16:33 ` Joerg Schilling
@ 2016-07-06 17:35 ` Andreas Dilger
0 siblings, 0 replies; 21+ messages in thread
From: Andreas Dilger @ 2016-07-06 17:35 UTC (permalink / raw)
To: Joerg Schilling; +Cc: antonio, ahferroin7, linux-btrfs, bug-tar
[-- Attachment #1: Type: text/plain, Size: 3137 bytes --]
> On Jul 6, 2016, at 10:33 AM, Joerg Schilling <Joerg.Schilling@fokus.fraunhofer.de> wrote:
>
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>
>> On 2016-07-06 11:22, Joerg Schilling wrote:
>>>
>>>
>>> You are mistaken.
>>>
>>> stat /proc/$$/as
>>> File: `/proc/6518/as'
>>> Size: 2793472 Blocks: 5456 IO Block: 512 regular file
>>> Device: 5440000h/88342528d Inode: 7557 Links: 1
>>> Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>>
>>> stat /proc/$$/auxv
>>> File: `/proc/6518/auxv'
>>> Size: 168 Blocks: 1 IO Block: 512 regular file
>>> Device: 5440000h/88342528d Inode: 7568 Links: 1
>>> Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>>
>>> Any correct implementation of /proc returns the expected numbers in
>>> st_size as well as in st_blocks.
>>
>> Odd, because I get 0 for both values on all the files in /proc/self and
>> all the top level files on all kernels I tested prior to sending that
>
> I tested this with an official PROCFS-2 implementation that was written by
> the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased
> away last weekend.
>
> You may have done your tests on an inofficial procfs implementation....
So, what you are saying is that you don't care about star working properly
on Linux, because it has an "inofficial" procfs implementation, while Solaris
has an "official" implementation?
>>> Now you know why BTRFS is still an incomplete filesystem. In a few years
>>> when it turns 10, this may change. People who implement filesystems of
>>> course need to learn that they need to hide implementation details from
>>> the official user space interfaces.
>>
>> So in other words you think we should be lying about how much is
>> actually allocated on disk and thus violating the standard directly (and
>> yes, ext4 and everyone else who does this with delayed allocation _is_
>> strictly speaking violating the standard, because _nothing_ is allocated
>> yet)?
>
> If it returns 0, it would be lying or it would be wrong anyway as it did not
> check fpe available space.
>
> Also note that I mentioned already that the priciple availability of SEEK_HOLE
> does not help as there is e.g. NFS...
So, it's OK that NFS is not POSIX compliant in various ways, and star will
deal with it, but you aren't willing to fix a heuristic used by star for a
behaviour that is unspecified by POSIX but has caused users to lose data
when archiving from several modern filesystems?
That's fine, so long as GNU tar is fixed to use the safe fallback in such
cases (i.e. trying to archive data from files that are newly created, even
if they report st_blocks == 0).
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
2016-07-05 9:28 ` Joerg Schilling
@ 2016-07-07 8:08 ` Pavel Raiskup
1 sibling, 0 replies; 21+ messages in thread
From: Pavel Raiskup @ 2016-07-07 8:08 UTC (permalink / raw)
To: bug-tar; +Cc: Andreas Dilger, linux-btrfs
On Monday, July 4, 2016 1:35:25 PM CEST Andreas Dilger wrote:
> I think in addition to fixing btrfs (because it needs to work with existing
> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar to
> handle this situation more robustly.
What I was rather thinking about is to remove the [2] heuristic. As there is
now SEEK_HOLE implemented, the need for that check "completely sparse files"
might be considered less useful. With [1], I'm not sure -- is it that bad to
face some false positive there? (it is documented that tar shouldn't be run
concurrently with other processes writing to archived files .., and waiting for
flush here is probably a very similar race condition).
> One option is if st_blocks == 0 then tar should also check if st_mtime is
> less than 60s in the past, and if yes then it should call fsync() on the
> file to flush any unwritten data to disk , or assume the file is not sparse
> and read the whole file, so that it doesn't incorrectly assume that the file
> is sparse and skip archiving the file data.
The reported fact 'st_blocks != 0' doesn't mean that the fsync() call is not
needed, so I'm not 100% we should special-case the 'st_blocks == 0' files.
--
As this effectively breaks tar's testsuite on btrfs, could we also explicitly
sync in 'genfile'?
Pavel
>
> Cheers, Andreas
>
> > [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> > [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-02 7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
@ 2016-07-11 14:41 ` David Sterba
2016-07-11 15:00 ` Chris Mason
1 sibling, 1 reply; 21+ messages in thread
From: David Sterba @ 2016-07-11 14:41 UTC (permalink / raw)
To: Pavel Raiskup; +Cc: linux-btrfs, bug-tar
On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote:
> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
> st_blocks info. For example, in GNU tar there is optimization check [1]
> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
> tar considers that file is sparse (and does additional steps).
>
> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
> are synced. ATM, there happens that:
>
> a) some "tool" creates sparse file
> b) that tool does not sync explicitly and exits ..
> c) tar is called immediately after that to archive the sparse file
> d) tar considers [2] the file is completely sparse (because st_blocks is
> zero) and archives no data. Here comes data loss.
>
> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
> small and is in-lined (no real data blocks) -- I consider this is too bug in
> btrfs worth fixing.
>
> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
>
> Tested on kernel:
> kernel-4.5.7-300.fc24.x86_64
>
> Originally reported here, reproducer available there:
> https://bugzilla.redhat.com/show_bug.cgi?id=1352061
The reproducer works for me here. So far I found:
* the btrfs implementation of stat.st_blocks (btrfs_getattr) includes
the 'delayed allocated' bytes, so there is not a problem in principle
(http://lxr.free-electrons.com/source/fs/btrfs/inode.c#L9372)
* calling fsync on the sparsefile will produce the expected result
* a short delay between ./binary and 'stat' will also produce correct
result, 0.5 seconds worked for me -- so it IMO proves it's a race
between writing and reporting the data
* I'm not yet sure where the delay between write and synced
'inode->delalloc_bytes' comes from
* I think that st_blocks accounting can be wrong anyway, if the file is
mmap-ed and not msync-ed, I'm writing a reproducer for this case
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-11 14:41 ` David Sterba
@ 2016-07-11 15:00 ` Chris Mason
2016-07-11 15:16 ` David Sterba
0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2016-07-11 15:00 UTC (permalink / raw)
To: dsterba, Pavel Raiskup, linux-btrfs, bug-tar
On 07/11/2016 10:41 AM, David Sterba wrote:
> On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote:
>> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
>> st_blocks info. For example, in GNU tar there is optimization check [1]
>> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
>> tar considers that file is sparse (and does additional steps).
>>
>> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
>> are synced. ATM, there happens that:
>>
>> a) some "tool" creates sparse file
>> b) that tool does not sync explicitly and exits ..
>> c) tar is called immediately after that to archive the sparse file
>> d) tar considers [2] the file is completely sparse (because st_blocks is
>> zero) and archives no data. Here comes data loss.
>>
>> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
>> small and is in-lined (no real data blocks) -- I consider this is too bug in
>> btrfs worth fixing.
>>
>> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
>> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
>>
>> Tested on kernel:
>> kernel-4.5.7-300.fc24.x86_64
>>
>> Originally reported here, reproducer available there:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1352061
>
> The reproducer works for me here. So far I found:
>
> * the btrfs implementation of stat.st_blocks (btrfs_getattr) includes
> the 'delayed allocated' bytes, so there is not a problem in principle
> (https://urldefense.proofpoint.com/v2/url?u=http-3A__lxr.free-2Delectrons.com_source_fs_btrfs_inode.c-23L9372&d=CwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=9QPtTAxcitoznaWRKKHoEQ&m=Y07_PApD0zOC-eaM4Hq-oxAwNlktnY8bDo7LD2GbXRo&s=Is_ZqFiL7a4sVWAB1k1ZuAgbNMK-sZ1gcU7oLtyfKoY&e= )
>
> * calling fsync on the sparsefile will produce the expected result
>
> * a short delay between ./binary and 'stat' will also produce correct
> result, 0.5 seconds worked for me -- so it IMO proves it's a race
> between writing and reporting the data
>
> * I'm not yet sure where the delay between write and synced
> 'inode->delalloc_bytes' comes from
>
> * I think that st_blocks accounting can be wrong anyway, if the file is
> mmap-ed and not msync-ed, I'm writing a reproducer for this case
On my test box running current linux git, things work fine if I run the
reproducer once. But if I leave it running in a loop long enough for
writeback to kick in, I trigger it.
The reproducer has a loop in there where it is adding delalloc writes
and truncating them away. What should be happening is that we're
leaving some delalloc bits set past EOF, which makes us skip bumping
inode->delalloc_bytes during the new write.
I can kind of confirm this by changing the reproducer to stat directly
after the write call. Normally st_blocks is never zero. But if I leave
it running in a loop for 30 seconds or so, I eventually get st_block
zero directly after the write().
If I change the C program to unlink the file on exit, running the binary
over and over again works every time.
So, the real bug is that we're letting some delalloc stat hang around
after the truncate, probably related to IO in progress. We do already
account for delalloc in what we return to stat, but there's a corner
case involving truncate where we screw it up.
-chris
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-11 15:00 ` Chris Mason
@ 2016-07-11 15:16 ` David Sterba
2016-07-11 17:30 ` Chris Mason
0 siblings, 1 reply; 21+ messages in thread
From: David Sterba @ 2016-07-11 15:16 UTC (permalink / raw)
To: Chris Mason; +Cc: dsterba, Pavel Raiskup, linux-btrfs, bug-tar
On Mon, Jul 11, 2016 at 11:00:55AM -0400, Chris Mason wrote:
> So, the real bug is that we're letting some delalloc stat hang around
> after the truncate, probably related to IO in progress. We do already
> account for delalloc in what we return to stat, but there's a corner
> case involving truncate where we screw it up.
So the original testcase:
> >> a) some "tool" creates sparse file
> >> b) that tool does not sync explicitly and exits ..
> >> c) tar is called immediately after that to archive the sparse file
> >> d) tar considers [2] the file is completely sparse (because st_blocks is
> >> zero) and archives no data. Here comes data loss.
will not happen. The application would basically have to mimick the
provided reproducer script and do the truncate/write loop and be lucky
enough to let tar hit the short race window.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
2016-07-11 15:16 ` David Sterba
@ 2016-07-11 17:30 ` Chris Mason
0 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2016-07-11 17:30 UTC (permalink / raw)
To: dsterba, Pavel Raiskup, linux-btrfs, bug-tar
On 07/11/2016 11:16 AM, David Sterba wrote:
> On Mon, Jul 11, 2016 at 11:00:55AM -0400, Chris Mason wrote:
>> So, the real bug is that we're letting some delalloc stat hang around
>> after the truncate, probably related to IO in progress. We do already
>> account for delalloc in what we return to stat, but there's a corner
>> case involving truncate where we screw it up.
>
> So the original testcase:
>
>>>> a) some "tool" creates sparse file
>>>> b) that tool does not sync explicitly and exits ..
>>>> c) tar is called immediately after that to archive the sparse file
>>>> d) tar considers [2] the file is completely sparse (because st_blocks is
>>>> zero) and archives no data. Here comes data loss.
>
> will not happen. The application would basically have to mimick the
> provided reproducer script and do the truncate/write loop and be lucky
> enough to let tar hit the short race window.
>
Looking harder there is a race window that can trigger this without the
truncate loop:
1) application calls write(), we make the pages delalloc (in-ram
st_blocks goes up)
2) VM calls write_cache_pages, we go find a contiguous delalloc range
3) We call cow_file_range on the locked range of pages
4) cow_file_range clears the delalloc bits (in-ram st_blocks goes down)
< ----- race begins here ----->
5) The io is started
6) The IO completes and extents are inserted into the metadata
7) the on disk/in-ram st_blocks goes up
< ---- race ends here ---->
This makes a ton more sense than leaking delalloc bits.
-chris
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2016-07-11 17:31 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-02 7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
2016-07-05 9:28 ` Joerg Schilling
2016-07-06 11:37 ` Austin S. Hemmelgarn
2016-07-06 11:49 ` Joerg Schilling
2016-07-06 14:43 ` Antonio Diaz Diaz
2016-07-06 14:53 ` Joerg Schilling
2016-07-06 15:01 ` Paul Eggert
2016-07-06 15:09 ` Joerg Schilling
2016-07-06 15:11 ` Paul Eggert
2016-07-06 15:12 ` Austin S. Hemmelgarn
2016-07-06 15:22 ` Joerg Schilling
2016-07-06 16:05 ` Austin S. Hemmelgarn
2016-07-06 16:11 ` Austin S. Hemmelgarn
2016-07-06 16:33 ` Joerg Schilling
2016-07-06 17:35 ` Andreas Dilger
2016-07-07 8:08 ` Pavel Raiskup
2016-07-11 14:41 ` David Sterba
2016-07-11 15:00 ` Chris Mason
2016-07-11 15:16 ` David Sterba
2016-07-11 17:30 ` Chris Mason
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.