linux-btrfs.vger.kernel.org archive mirror
* Reproducer for "compressed data + hole data corruption bug, 2018 editiion"
@ 2018-08-23  3:11 Zygo Blaxell
  2018-08-23  5:10 ` Qu Wenruo
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  0 siblings, 2 replies; 38+ messages in thread
From: Zygo Blaxell @ 2018-08-23  3:11 UTC (permalink / raw)
  To: linux-btrfs

This is a repro script for a btrfs bug that causes corrupted data reads
when reading a mix of compressed extents and holes.  The bug is
reproducible on at least kernels v4.1..v4.18.

Some more observations and background follow, but first here is the
script and some sample output:

	root@rescue:/test# cat repro-hole-corruption-test
	#!/bin/bash

	# Write a 4096 byte block of something
	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

	# Here is some test data with holes in it:
	for y in $(seq 0 100); do
		for x in 0 1; do
			block 0;
			block 21;
			block 0;
			block 22;
			block 0;
			block 0;
			block 43;
			block 44;
			block 0;
			block 0;
			block 61;
			block 62;
			block 63;
			block 64;
			block 65;
			block 66;
		done
	done > am
	sync

	# Now replace those 101 distinct extents with 101 references to the first extent
	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

	# Punch holes into the extent refs
	fallocate -v -d am

	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done

	root@rescue:/test# ./repro-hole-corruption-test
	i: 91, status: 0, bytes_deduped: 131072
	i: 92, status: 0, bytes_deduped: 131072
	i: 93, status: 0, bytes_deduped: 131072
	i: 94, status: 0, bytes_deduped: 131072
	i: 95, status: 0, bytes_deduped: 131072
	i: 96, status: 0, bytes_deduped: 131072
	i: 97, status: 0, bytes_deduped: 131072
	i: 98, status: 0, bytes_deduped: 131072
	i: 99, status: 0, bytes_deduped: 131072
	13107200 total bytes deduped in this operation
	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	072a152355788c767b97e4e4c0e4567720988b84 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	60831f0e7ffe4b49722612c18685c09f4583b1df am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	^C
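
For reference, the layout this produces can be inspected with FIEMAP before
starting the read loop.  A minimal sketch (filefrag is from e2fsprogs and is
not part of the script above):

	# Dump the extent map of the test file.  Compressed extents are
	# reported with the "encoded" flag, and the punched blocks show up
	# as gaps in the logical offsets.
	filefrag -v am

	# The same information is available from xfs_io, if installed:
	# xfs_io -r -c "fiemap -v" am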

Corruption occurs most often when there is a sequence like this in a file:

	ref 1: hole
	ref 2: extent A, offset 0
	ref 3: hole
	ref 4: extent A, offset 8192

This scenario typically arises due to hole-punching or deduplication.
Hole-punching replaces one extent ref with two references to the same
extent with a hole between them, so:

	ref 1:  extent A, offset 0, length 16384

becomes:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent A, offset 12288, length 4096
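
A single punch-hole call is enough to produce that split.  A minimal sketch
on a throwaway file (the file name and sizes are illustrative, not part of
the reproducer):

	# One compressible 16 KiB extent...
	head -c 16384 /dev/zero | tr '\0' 'U' > one-extent
	sync
	# ...split into extent ref / 8 KiB hole / extent ref by punching the middle.
	fallocate --punch-hole --offset 4096 --length 8192 one-extent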

Deduplication replaces two distinct extent refs surrounding a hole with
two references to one of the duplicate extents, turning this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent B, offset 0, length 4096

into this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent A, offset 0, length 4096
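
The same btrfs-extent-same tool used in the script can build that minimal
case directly.  A sketch (the file name and offsets are illustrative, and
block() is the helper defined in the script above):

	# Two identical 4 KiB blocks separated by an 8 KiB hole
	# (extent A at offset 0, extent B at offset 12288).
	block 41 > two-refs
	truncate -s 12288 two-refs
	block 41 >> two-refs
	sync
	# Dedup the second block against the first, leaving two references
	# to extent A with the hole between them.
	btrfs-extent-same 4096 two-refs 0 two-refs 12288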

Compression is required (zlib, zstd, or lzo) for corruption to occur.
I am not able to reproduce the issue with an uncompressed extent nor
have I observed any such corruption in the wild.

The presence or absence of the no-holes filesystem feature has no effect.

Ordinary writes can lead to pairs of extent references to the same extent
separated by a reference to a different extent; however, in this case
there is data to be read from a real extent, instead of pages that have
to be zero filled from a hole.  If ordinary non-hole writes could trigger
this bug, every page-oriented database engine would be crashing all the
time on btrfs with compression enabled, and it's unlikely that would not
have been noticed between 2015 and now.  An ordinary write that splits
an extent ref would look like this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  extent C, offset 0, length 8192
	ref 3:  extent A, offset 12288, length 4096

Sparse writes can lead to pairs of extent references surrounding a hole;
however, in this case the extent references will point to different
extents, avoiding the bug.  If a sparse write could trigger the bug,
the rsync -S option and qemu/kvm 'raw' disk image files (among many
other tools that produce sparse files) would be unusable, and it's
unlikely that would not have been noticed between 2015 and now either.
Sparse writes look like this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent B, offset 0, length 4096

The pattern or timing of read() calls seems to be relevant.  It is very
hard to see the corruption when reading files with 'hd', but 'cat | hd'
will see the corruption just fine.  Similar problems exist with 'cmp'
but not 'sha1sum'.  Two processes reading the same file at the same time
seem to trigger the corruption very frequently.
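
A quick way to compare those read patterns on a cold cache (a sketch, not
part of the reproducer script):

	# 'hd' reads the file directly; 'cat | hd' reads it through a pipe
	# with a different read() pattern.  Drop caches before each read.
	sysctl -q vm.drop_caches=3; hd am | sha1sum
	sysctl -q vm.drop_caches=3; cat am | hd | sha1sum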

Some patterns of holes and data produce corruption faster than others.
The pattern generated by the script above is based on instances of
corruption I've found in the wild, and has a much better repro rate than
random holes.

The corruption occurs during reads, after csum verification and before
decompression, so btrfs detects no csum failures.  The data on disk
seems to be OK and could be read correctly once the kernel bug is fixed.
Repeated reads do eventually return correct data, but there is no way
for userspace to distinguish between corrupt and correct data reliably.
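
This can be cross-checked from userspace while the loop runs.  A sketch,
using the /test mount point from the session above:

	# The error counters stay at zero and no csum failures are logged,
	# even while the sha1sums above keep changing.
	btrfs device stats /test
	dmesg | grep -i 'csum failed' | tail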

The corrupted data is usually data replaced by a hole or a copy of other
blocks in the same extent.
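
One way to see which blocks are affected is to keep a copy from a read that
produced the expected sha1sum and compare later cold-cache reads against it.
A sketch ('am.good' is an arbitrary name):

	# Save a reference copy from a known-good read (check its sha1sum),
	# then count how many bytes differ in each 4 KiB block of a later
	# cold-cache read.
	cp am am.good
	sha1sum am.good
	sysctl -q vm.drop_caches=3
	cmp -l am am.good | awk '{ print int(($1 - 1) / 4096) }' | uniq -c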

The behavior is similar to some earlier bugs related to holes and
compressed data in btrfs, but it's new and not fixed yet--hence,
"2018 edition."


* Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"
  2018-08-23  3:11 Reproducer for "compressed data + hole data corruption bug, 2018 editiion" Zygo Blaxell
@ 2018-08-23  5:10 ` Qu Wenruo
  2018-08-23 16:44   ` Zygo Blaxell
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  1 sibling, 1 reply; 38+ messages in thread
From: Qu Wenruo @ 2018-08-23  5:10 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs


On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> This is a repro script for a btrfs bug that causes corrupted data reads
> when reading a mix of compressed extents and holes.  The bug is
> reproducible on at least kernels v4.1..v4.18.

This bug already sounds more serious than the previous nodatasum +
compression bug.

> 
> Some more observations and background follow, but first here is the
> script and some sample output:
> 
> 	root@rescue:/test# cat repro-hole-corruption-test
> 	#!/bin/bash
> 
> 	# Write a 4096 byte block of something
> 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> 
> 	# Here is some test data with holes in it:
> 	for y in $(seq 0 100); do
> 		for x in 0 1; do
> 			block 0;
> 			block 21;
> 			block 0;
> 			block 22;
> 			block 0;
> 			block 0;
> 			block 43;
> 			block 44;
> 			block 0;
> 			block 0;
> 			block 61;
> 			block 62;
> 			block 63;
> 			block 64;
> 			block 65;
> 			block 66;
> 		done

Does the content make any difference to this bug?
So far it's just 16 * 4K * 2 * 101 bytes of data written, *without* any holes.

This should indeed create 101 128K compressed data extents.
But I'm wondering about the description of 'holes'.

> 	done > am
> 	sync
> 
> 	# Now replace those 101 distinct extents with 101 references to the first extent
> 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

Will this bug still happen by creating one extent and then reflinking it
101 times?

> 
> 	# Punch holes into the extent refs
> 	fallocate -v -d am

Hole-punch in fact happens here.

BTW, will adding a "sync" here change the result?

> 
> 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
> 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> 
> 	root@rescue:/test# ./repro-hole-corruption-test
> 	i: 91, status: 0, bytes_deduped: 131072
> 	i: 92, status: 0, bytes_deduped: 131072
> 	i: 93, status: 0, bytes_deduped: 131072
> 	i: 94, status: 0, bytes_deduped: 131072
> 	i: 95, status: 0, bytes_deduped: 131072
> 	i: 96, status: 0, bytes_deduped: 131072
> 	i: 97, status: 0, bytes_deduped: 131072
> 	i: 98, status: 0, bytes_deduped: 131072
> 	i: 99, status: 0, bytes_deduped: 131072
> 	13107200 total bytes deduped in this operation
> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	072a152355788c767b97e4e4c0e4567720988b84 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	60831f0e7ffe4b49722612c18685c09f4583b1df am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	^C

It looks like we have something wrong when interpreting file extents, maybe
related to extent map merging.

BTW, if no read corruption happens without dropping the page cache, that
would narrow down the range of the problem we're looking for.

Thanks,
Qu

> 
> Corruption occurs most often when there is a sequence like this in a file:
> 
> 	ref 1: hole
> 	ref 2: extent A, offset 0
> 	ref 3: hole
> 	ref 4: extent A, offset 8192
> 
> This scenario typically arises due to hole-punching or deduplication.
> Hole-punching replaces one extent ref with two references to the same
> extent with a hole between them, so:
> 
> 	ref 1:  extent A, offset 0, length 16384
> 
> becomes:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
> 
> Deduplication replaces two distinct extent refs surrounding a hole with
> two references to one of the duplicate extents, turning this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
> 
> into this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 0, length 4096
> 
> Compression is required (zlib, zstd, or lzo) for corruption to occur.
> I am not able to reproduce the issue with an uncompressed extent nor
> have I observed any such corruption in the wild.
> 
> The presence or absence of the no-holes filesystem feature has no effect.
> 
> Ordinary writes can lead to pairs of extent references to the same extent
> separated by a reference to a different extent; however, in this case
> there is data to be read from a real extent, instead of pages that have
> to be zero filled from a hole.  If ordinary non-hole writes could trigger
> this bug, every page-oriented database engine would be crashing all the
> time on btrfs with compression enabled, and it's unlikely that would not
> have been noticed between 2015 and now.  An ordinary write that splits
> an extent ref would look like this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  extent C, offset 0, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
> 
> Sparse writes can lead to pairs of extent references surrounding a hole;
> however, in this case the extent references will point to different
> extents, avoiding the bug.  If a sparse write could trigger the bug,
> the rsync -S option and qemu/kvm 'raw' disk image files (among many
> other tools that produce sparse files) would be unusable, and it's
> unlikely that would not have been noticed between 2015 and now either.
> Sparse writes look like this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
> 
> The pattern or timing of read() calls seems to be relevant.  It is very
> hard to see the corruption when reading files with 'hd', but 'cat | hd'
> will see the corruption just fine.  Similar problems exist with 'cmp'
> but not 'sha1sum'.  Two processes reading the same file at the same time
> seem to trigger the corruption very frequently.
> 
> Some patterns of holes and data produce corruption faster than others.
> The pattern generated by the script above is based on instances of
> corruption I've found in the wild, and has a much better repro rate than
> random holes.
> 
> The corruption occurs during reads, after csum verification and before
> decompression, so btrfs detects no csum failures.  The data on disk
> seems to be OK and could be read correctly once the kernel bug is fixed.
> Repeated reads do eventually return correct data, but there is no way
> for userspace to distinguish between corrupt and correct data reliably.
> 
> The corrupted data is usually data replaced by a hole or a copy of other
> blocks in the same extent.
> 
> The behavior is similar to some earlier bugs related to holes and
> Compressed data in btrfs, but it's new and not fixed yet--hence,
> "2018 edition."
> 



* Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"
  2018-08-23  5:10 ` Qu Wenruo
@ 2018-08-23 16:44   ` Zygo Blaxell
  2018-08-23 23:50     ` Qu Wenruo
  0 siblings, 1 reply; 38+ messages in thread
From: Zygo Blaxell @ 2018-08-23 16:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes.  The bug is
> > reproducible on at least kernels v4.1..v4.18.
> 
> This bug already sounds more serious than previous nodatasum +
> compression bug.

Maybe.  "compression + holes corruption bug 2017" could be avoided with
the max-inline=0 mount option without disabling compression.  This time,
the workaround is more intrusive:  avoid all applications that use dedup
or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> > 
> > 	root@rescue:/test# cat repro-hole-corruption-test
> > 	#!/bin/bash
> > 
> > 	# Write a 4096 byte block of something
> > 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > 
> > 	# Here is some test data with holes in it:
> > 	for y in $(seq 0 100); do
> > 		for x in 0 1; do
> > 			block 0;
> > 			block 21;
> > 			block 0;
> > 			block 22;
> > 			block 0;
> > 			block 0;
> > 			block 43;
> > 			block 44;
> > 			block 0;
> > 			block 0;
> > 			block 61;
> > 			block 62;
> > 			block 63;
> > 			block 64;
> > 			block 65;
> > 			block 66;
> > 		done
> 
> Does the content has any difference on this bug?
> It's just 16 * 4K * 2 * 101 data write *without* any hole so far.

The content of the extents doesn't seem to matter, other than it needs to
be compressible so that the extents on disk are compressed.  The bug is
also triggered by writing non-zero data to all blocks, and then punching
the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".
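
For reference, a single punch of that form looks like this ('i' is a
hypothetical block index; the original offset calculation is elided above):

	# Punch one 4 KiB hole at block index i of the test file.
	i=2    # hypothetical example value
	fallocate -p -o $((i * 4096)) -l 4096 am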

The layout of the extents matters a lot.  I have to loop hundreds or
thousands of times to hit the bug if the first block in the pattern is
not a hole, or if the non-hole extents are different sizes or positions
than above.

I tried random patterns of holes and extent refs, and most of them have
an order of magnitude lower hit rates than the above.  This might be due
to some relationship between the alignment of read() request boundaries
with extent boundaries, but I haven't done any tests designed to detect
such a relationship.
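
A test along those lines could look something like this (a sketch only):

	# Hash the file with several read() sizes, dropping caches each time,
	# to see whether the request size changes the corruption rate.
	for bs in 512 4096 65536 131072; do
		sysctl -q vm.drop_caches=3
		echo "bs=$bs $(dd if=am bs=$bs 2>/dev/null | sha1sum)"
	done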

In the wild, corruption happens on some files much more often than others.
This seems to be correlated with the extent layout as well.

I discovered the bug by examining files that were intermittently but
repeatedly failing routine data integrity checks, and found that in every
case they had similar hole + extent patterns near the point where data
was corrupted.

I did a search on some big filesystems for the
hole-refExtentA-hole-refExtentA pattern and found several files with
this pattern that had passed previous data integrity checks, but would
fail randomly in the sha1sum/drop-caches loop.

> This should indeed cause 101 128K compressed data extent.
> But I'm wondering the description about 'holes'.

The holes are coming, wait for it... ;)

> > 	done > am
> > 	sync
> > 
> > 	# Now replace those 101 distinct extents with 101 references to the first extent
> > 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> 
> Will this bug still happen by creating one extent and then reflink it
> 101 times?

Yes.  I used btrfs-extent-same because a binary is included in the
Debian duperemove package, but I use it only for convenience.
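
The reflink variant would look something like this (a sketch reusing the
block() helper from the script above; xfs_io provides the reflink command):

	# Build one 128 KiB extent with the same block pattern, then clone it
	# 100 more times into the same file instead of deduping 101 copies.
	for x in 0 1; do
		block 0; block 21; block 0; block 22; block 0; block 0
		block 43; block 44; block 0; block 0
		block 61; block 62; block 63; block 64; block 65; block 66
	done > am2
	sync
	for x in $(seq 1 100); do
		xfs_io -c "reflink am2 0 $((x * 131072)) 131072" am2
	done
	fallocate -v -d am2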

It's not necessary to have hundreds of references to the same extent--even
two refs to a single extent plus a hole can trigger the bug sometimes.
100 references in a single file will trigger the bug so often that it
can be detected within the first 20 sha1sum loops.

When the corruption occurs, it affects around 90 of the original 101
extents.  The different sha1sum results are due to different extents
giving bad data on different runs.

> > 	# Punch holes into the extent refs
> > 	fallocate -v -d am
> 
> Hole-punch in fact happens here.
> 
> BTW, will add a "sync" here change the result?

No.  You can reboot the machine here if you like, it does not change
anything that happens during reads later.

Looking at the extent tree in btrfs-debug-tree, the data on disk
looks correct, and btrfs does read it correctly most of the time (the
correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
The corruption therefore comes from btrfs read() producing incorrect
data in some instances.
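
For reference, the extent items can be pulled out like this (a sketch;
/dev/sdX is a placeholder for the test device, and the file is assumed to
live in the default subvolume, tree 5):

	# Print the EXTENT_DATA items for the test file straight from the
	# filesystem tree.  btrfs-debug-tree reads the device directly, so
	# sync first; newer btrfs-progs call this 'btrfs inspect-internal
	# dump-tree'.
	sync
	ino=$(stat -c %i am)
	btrfs-debug-tree -t 5 /dev/sdX | grep -A4 "key ($ino EXTENT_DATA"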

> > 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > 
> > 	root@rescue:/test# ./repro-hole-corruption-test
> > 	i: 91, status: 0, bytes_deduped: 131072
> > 	i: 92, status: 0, bytes_deduped: 131072
> > 	i: 93, status: 0, bytes_deduped: 131072
> > 	i: 94, status: 0, bytes_deduped: 131072
> > 	i: 95, status: 0, bytes_deduped: 131072
> > 	i: 96, status: 0, bytes_deduped: 131072
> > 	i: 97, status: 0, bytes_deduped: 131072
> > 	i: 98, status: 0, bytes_deduped: 131072
> > 	i: 99, status: 0, bytes_deduped: 131072
> > 	13107200 total bytes deduped in this operation
> > 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	072a152355788c767b97e4e4c0e4567720988b84 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	60831f0e7ffe4b49722612c18685c09f4583b1df am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	^C
> 
> It looks like we have something wrong interpreting file extent, maybe
> related to extent map merging.
> 
> BTW, if without dropping page cache and no read corruption happens, it
> would limit the range of problem we're looking for.

The page cache drop makes reproduction easier/faster.  If you don't drop
caches, you have to wait for the data to be evicted from page cache or
the data from read() will not change.

In the wild, if I do a sha1sum loop on a few hundred GB of data known
to have the hole-extent-hole pattern (so the pages are evicted between
sha1sum runs), I see similar results without explicitly dropping caches.

If you read the file with a cold cache from two processes at once
(e.g. you run 'hd am' while the sha1sum/drop-cache loop is running)
the data changes faster (different on 90% of reads instead of just 20%).

> Thanks,
> Qu
> 
> > 
> > Corruption occurs most often when there is a sequence like this in a file:
> > 
> > 	ref 1: hole
> > 	ref 2: extent A, offset 0
> > 	ref 3: hole
> > 	ref 4: extent A, offset 8192
> > 
> > This scenario typically arises due to hole-punching or deduplication.
> > Hole-punching replaces one extent ref with two references to the same
> > extent with a hole between them, so:
> > 
> > 	ref 1:  extent A, offset 0, length 16384
> > 
> > becomes:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent A, offset 12288, length 4096
> > 
> > Deduplication replaces two distinct extent refs surrounding a hole with
> > two references to one of the duplicate extents, turning this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent B, offset 0, length 4096
> > 
> > into this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent A, offset 0, length 4096
> > 
> > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > I am not able to reproduce the issue with an uncompressed extent nor
> > have I observed any such corruption in the wild.
> > 
> > The presence or absence of the no-holes filesystem feature has no effect.
> > 
> > Ordinary writes can lead to pairs of extent references to the same extent
> > separated by a reference to a different extent; however, in this case
> > there is data to be read from a real extent, instead of pages that have
> > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > this bug, every page-oriented database engine would be crashing all the
> > time on btrfs with compression enabled, and it's unlikely that would not
> > have been noticed between 2015 and now.  An ordinary write that splits
> > an extent ref would look like this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  extent C, offset 0, length 8192
> > 	ref 3:  extent A, offset 12288, length 4096
> > 
> > Sparse writes can lead to pairs of extent references surrounding a hole;
> > however, in this case the extent references will point to different
> > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > other tools that produce sparse files) would be unusable, and it's
> > unlikely that would not have been noticed between 2015 and now either.
> > Sparse writes look like this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent B, offset 0, length 4096
> > 
> > The pattern or timing of read() calls seems to be relevant.  It is very
> > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > will see the corruption just fine.  Similar problems exist with 'cmp'
> > but not 'sha1sum'.  Two processes reading the same file at the same time
> > seem to trigger the corruption very frequently.
> > 
> > Some patterns of holes and data produce corruption faster than others.
> > The pattern generated by the script above is based on instances of
> > corruption I've found in the wild, and has a much better repro rate than
> > random holes.
> > 
> > The corruption occurs during reads, after csum verification and before
> > decompression, so btrfs detects no csum failures.  The data on disk
> > seems to be OK and could be read correctly once the kernel bug is fixed.
> > Repeated reads do eventually return correct data, but there is no way
> > for userspace to distinguish between corrupt and correct data reliably.
> > 
> > The corrupted data is usually data replaced by a hole or a copy of other
> > blocks in the same extent.
> > 
> > The behavior is similar to some earlier bugs related to holes and
> > Compressed data in btrfs, but it's new and not fixed yet--hence,
> > "2018 edition."
> > 
> 





* Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"
  2018-08-23 16:44   ` Zygo Blaxell
@ 2018-08-23 23:50     ` Qu Wenruo
  0 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2018-08-23 23:50 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs



On 2018/8/24 12:44 AM, Zygo Blaxell wrote:
> On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
>> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>> when reading a mix of compressed extents and holes.  The bug is
>>> reproducible on at least kernels v4.1..v4.18.
>>
>> This bug already sounds more serious than previous nodatasum +
>> compression bug.
> 
> Maybe.  "compression + holes corruption bug 2017" could be avoided with
> the max-inline=0 mount option without disabling compression.  This time,
> the workaround is more intrusive:  avoid all applications that use dedup
> or hole-punching.
> 
>>> Some more observations and background follow, but first here is the
>>> script and some sample output:
>>>
>>> 	root@rescue:/test# cat repro-hole-corruption-test
>>> 	#!/bin/bash
>>>
>>> 	# Write a 4096 byte block of something
>>> 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>
>>> 	# Here is some test data with holes in it:
>>> 	for y in $(seq 0 100); do
>>> 		for x in 0 1; do
>>> 			block 0;
>>> 			block 21;
>>> 			block 0;
>>> 			block 22;
>>> 			block 0;
>>> 			block 0;
>>> 			block 43;
>>> 			block 44;
>>> 			block 0;
>>> 			block 0;
>>> 			block 61;
>>> 			block 62;
>>> 			block 63;
>>> 			block 64;
>>> 			block 65;
>>> 			block 66;
>>> 		done
>>
>> Does the content has any difference on this bug?
>> It's just 16 * 4K * 2 * 101 data write *without* any hole so far.
> 
> The content of the extents doesn't seem to matter, other than it needs to
> be compressible so that the extents on disk are compressed.  The bug is
> also triggered by writing non-zero data to all blocks, and then punching
> the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".
> 
> The layout of the extents matters a lot.  I have to loop hundreds or
> thousands of times to hit the bug if the first block in the pattern is
> not a hole, or if the non-hole extents are different sizes or positions
> than above.
> 
> I tried random patterns of holes and extent refs, and most of them have
> an order of magnitude lower hit rates than the above.  This might be due
> to some relationship between the alignment of read() request boundaries
> with extent boundaries, but I haven't done any tests designed to detect
> such a relationship.
> 
> In the wild, corruption happens on some files much more often than others.
> This seems to be correlated with the extent layout as well.
> 
> I discovered the bug by examining files that were intermittently but
> repeatedly failing routine data integrity checks, and found that in every
> case they had similar hole + extent patterns near the point where data
> was corrupted.
> 
> I did a search on some big filesystems for the
> hole-refExtentA-hole-refExtentA pattern and found several files with
> this pattern that had passed previous data integrity checks, but would
> fail randomly in the sha1sum/drop-caches loop.
> 
>> This should indeed cause 101 128K compressed data extent.
>> But I'm wondering the description about 'holes'.
> 
> The holes are coming, wait for it... ;)
> 
>>> 	done > am
>>> 	sync
>>>
>>> 	# Now replace those 101 distinct extents with 101 references to the first extent
>>> 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>>
>> Will this bug still happen by creating one extent and then reflink it
>> 101 times?
> 
> Yes.  I used btrfs-extent-same because a binary is included in the
> Debian duperemove package, but I use it only for convenience.
> 
> It's not necessary to have hundreds of references to the same extent--even
> two refs to a single extent plus a hole can trigger the bug sometimes.
> 100 references in a single file will trigger the bug so often that it
> can be detected within the first 20 sha1sum loops.
> 
> When the corruption occurs, it affects around 90 of the original 101
> extents.  The different sha1sum results are due to different extents
> giving bad data on different runs.
> 
>>> 	# Punch holes into the extent refs
>>> 	fallocate -v -d am
>>
>> Hole-punch in fact happens here.
>>
>> BTW, will add a "sync" here change the result?
> 
> No.  You can reboot the machine here if you like, it does not change
> anything that happens during reads later.

So it looks like my assumption of a bad file extent interpreter is getting
more and more valid.

It has nothing to do with a race against hole punching/writes, but only
with the file layout and the extent map cache.

> 
> Looking at the extent tree in btrfs-debug-tree, the data on disk
> looks correct, and btrfs does read it correctly most of the time (the
> correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
> The corruption therefore comes from btrfs read() producing incorrect
> data in some instances.
> 
>>> 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
>>> 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>>>
>>> 	root@rescue:/test# ./repro-hole-corruption-test
>>> 	i: 91, status: 0, bytes_deduped: 131072
>>> 	i: 92, status: 0, bytes_deduped: 131072
>>> 	i: 93, status: 0, bytes_deduped: 131072
>>> 	i: 94, status: 0, bytes_deduped: 131072
>>> 	i: 95, status: 0, bytes_deduped: 131072
>>> 	i: 96, status: 0, bytes_deduped: 131072
>>> 	i: 97, status: 0, bytes_deduped: 131072
>>> 	i: 98, status: 0, bytes_deduped: 131072
>>> 	i: 99, status: 0, bytes_deduped: 131072
>>> 	13107200 total bytes deduped in this operation
>>> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	072a152355788c767b97e4e4c0e4567720988b84 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	60831f0e7ffe4b49722612c18685c09f4583b1df am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	^C
>>
>> It looks like we have something wrong interpreting file extent, maybe
>> related to extent map merging.
>>
>> BTW, if without dropping page cache and no read corruption happens, it
>> would limit the range of problem we're looking for.
> 
> The page cache drop makes reproduction easier/faster.  If you don't drop
> caches, you have to wait for the data to be evicted from page cache or
> the data from read() will not change.

So it's highly possible that the file extent interpreter is causing the problem.

Thanks,
Qu

> 
> In the wild, if I do a sha1sum loop on a few hundred GB of data known
> to have the hole-extent-hole pattern (so the pages are evicted between
> sha1sum runs), I see similar results without explicitly dropping caches.
> 
> If you read the file with a cold cache from two processes at once
> (e.g. you run 'hd am' while the sha1sum/drop-cache loop is running)
> the data changes faster (different on 90% of reads instead of just 20%).
> 
>> Thanks,
>> Qu
>>
>>>
>>> Corruption occurs most often when there is a sequence like this in a file:
>>>
>>> 	ref 1: hole
>>> 	ref 2: extent A, offset 0
>>> 	ref 3: hole
>>> 	ref 4: extent A, offset 8192
>>>
>>> This scenario typically arises due to hole-punching or deduplication.
>>> Hole-punching replaces one extent ref with two references to the same
>>> extent with a hole between them, so:
>>>
>>> 	ref 1:  extent A, offset 0, length 16384
>>>
>>> becomes:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent A, offset 12288, length 4096
>>>
>>> Deduplication replaces two distinct extent refs surrounding a hole with
>>> two references to one of the duplicate extents, turning this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent B, offset 0, length 4096
>>>
>>> into this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent A, offset 0, length 4096
>>>
>>> Compression is required (zlib, zstd, or lzo) for corruption to occur.
>>> I am not able to reproduce the issue with an uncompressed extent nor
>>> have I observed any such corruption in the wild.
>>>
>>> The presence or absence of the no-holes filesystem feature has no effect.
>>>
>>> Ordinary writes can lead to pairs of extent references to the same extent
>>> separated by a reference to a different extent; however, in this case
>>> there is data to be read from a real extent, instead of pages that have
>>> to be zero filled from a hole.  If ordinary non-hole writes could trigger
>>> this bug, every page-oriented database engine would be crashing all the
>>> time on btrfs with compression enabled, and it's unlikely that would not
>>> have been noticed between 2015 and now.  An ordinary write that splits
>>> an extent ref would look like this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  extent C, offset 0, length 8192
>>> 	ref 3:  extent A, offset 12288, length 4096
>>>
>>> Sparse writes can lead to pairs of extent references surrounding a hole;
>>> however, in this case the extent references will point to different
>>> extents, avoiding the bug.  If a sparse write could trigger the bug,
>>> the rsync -S option and qemu/kvm 'raw' disk image files (among many
>>> other tools that produce sparse files) would be unusable, and it's
>>> unlikely that would not have been noticed between 2015 and now either.
>>> Sparse writes look like this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent B, offset 0, length 4096
>>>
>>> The pattern or timing of read() calls seems to be relevant.  It is very
>>> hard to see the corruption when reading files with 'hd', but 'cat | hd'
>>> will see the corruption just fine.  Similar problems exist with 'cmp'
>>> but not 'sha1sum'.  Two processes reading the same file at the same time
>>> seem to trigger the corruption very frequently.
>>>
>>> Some patterns of holes and data produce corruption faster than others.
>>> The pattern generated by the script above is based on instances of
>>> corruption I've found in the wild, and has a much better repro rate than
>>> random holes.
>>>
>>> The corruption occurs during reads, after csum verification and before
>>> decompression, so btrfs detects no csum failures.  The data on disk
>>> seems to be OK and could be read correctly once the kernel bug is fixed.
>>> Repeated reads do eventually return correct data, but there is no way
>>> for userspace to distinguish between corrupt and correct data reliably.
>>>
>>> The corrupted data is usually data replaced by a hole or a copy of other
>>> blocks in the same extent.
>>>
>>> The behavior is similar to some earlier bugs related to holes and
>>> Compressed data in btrfs, but it's new and not fixed yet--hence,
>>> "2018 edition."
>>>
>>
> 
> 
> 



* Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2018-08-23  3:11 Reproducer for "compressed data + hole data corruption bug, 2018 editiion" Zygo Blaxell
  2018-08-23  5:10 ` Qu Wenruo
@ 2019-02-12  3:09 ` Zygo Blaxell
  2019-02-12 15:33   ` Christoph Anton Mitterer
                     ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-12  3:09 UTC (permalink / raw)
  To: linux-btrfs

Still reproducible on 4.20.7.

The behavior is slightly different on current kernels (4.20.7, 4.14.96),
which makes the problem a bit more difficult to detect.

	# repro-hole-corruption-test
	i: 91, status: 0, bytes_deduped: 131072
	i: 92, status: 0, bytes_deduped: 131072
	i: 93, status: 0, bytes_deduped: 131072
	i: 94, status: 0, bytes_deduped: 131072
	i: 95, status: 0, bytes_deduped: 131072
	i: 96, status: 0, bytes_deduped: 131072
	i: 97, status: 0, bytes_deduped: 131072
	i: 98, status: 0, bytes_deduped: 131072
	i: 99, status: 0, bytes_deduped: 131072
	13107200 total bytes deduped in this operation
	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
	94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

The sha1sum seems stable after the first drop_caches--until a second
process tries to read the test file:

	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	# cat am > /dev/null              (in another shell)
	19294e695272c42edb89ceee24bb08c13473140a am                                                            
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
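
To trigger that without a second shell, the competing reader can be folded
into the loop (a sketch, not part of the original script):

	# Run a background cold-cache reader against the same file while the
	# sha1sum/drop-caches loop runs in the foreground.
	( while :; do cat am > /dev/null; sleep 1; done ) &
	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
	# interrupt with ^C as before, then: kill %1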

On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> This is a repro script for a btrfs bug that causes corrupted data reads
> when reading a mix of compressed extents and holes.  The bug is
> reproducible on at least kernels v4.1..v4.18.
>
> Some more observations and background follow, but first here is the
> script and some sample output:
>
> 	root@rescue:/test# cat repro-hole-corruption-test
> 	#!/bin/bash
>
> 	# Write a 4096 byte block of something
> 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>
> 	# Here is some test data with holes in it:
> 	for y in $(seq 0 100); do
> 		for x in 0 1; do
> 			block 0;
> 			block 21;
> 			block 0;
> 			block 22;
> 			block 0;
> 			block 0;
> 			block 43;
> 			block 44;
> 			block 0;
> 			block 0;
> 			block 61;
> 			block 62;
> 			block 63;
> 			block 64;
> 			block 65;
> 			block 66;
> 		done
> 	done > am
> 	sync
>
> 	# Now replace those 101 distinct extents with 101 references to the first extent
> 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>
> 	# Punch holes into the extent refs
> 	fallocate -v -d am
>
> 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
> 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>
> 	root@rescue:/test# ./repro-hole-corruption-test
> 	i: 91, status: 0, bytes_deduped: 131072
> 	i: 92, status: 0, bytes_deduped: 131072
> 	i: 93, status: 0, bytes_deduped: 131072
> 	i: 94, status: 0, bytes_deduped: 131072
> 	i: 95, status: 0, bytes_deduped: 131072
> 	i: 96, status: 0, bytes_deduped: 131072
> 	i: 97, status: 0, bytes_deduped: 131072
> 	i: 98, status: 0, bytes_deduped: 131072
> 	i: 99, status: 0, bytes_deduped: 131072
> 	13107200 total bytes deduped in this operation
> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	072a152355788c767b97e4e4c0e4567720988b84 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	60831f0e7ffe4b49722612c18685c09f4583b1df am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	^C
>
> Corruption occurs most often when there is a sequence like this in a file:
>
> 	ref 1: hole
> 	ref 2: extent A, offset 0
> 	ref 3: hole
> 	ref 4: extent A, offset 8192
>
> This scenario typically arises due to hole-punching or deduplication.
> Hole-punching replaces one extent ref with two references to the same
> extent with a hole between them, so:
>
> 	ref 1:  extent A, offset 0, length 16384
>
> becomes:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
>
> Deduplication replaces two distinct extent refs surrounding a hole with
> two references to one of the duplicate extents, turning this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
>
> into this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 0, length 4096
>
> Compression is required (zlib, zstd, or lzo) for corruption to occur.
> I am not able to reproduce the issue with an uncompressed extent nor
> have I observed any such corruption in the wild.
>
> The presence or absence of the no-holes filesystem feature has no effect.
>
> Ordinary writes can lead to pairs of extent references to the same extent
> separated by a reference to a different extent; however, in this case
> there is data to be read from a real extent, instead of pages that have
> to be zero filled from a hole.  If ordinary non-hole writes could trigger
> this bug, every page-oriented database engine would be crashing all the
> time on btrfs with compression enabled, and it's unlikely that would not
> have been noticed between 2015 and now.  An ordinary write that splits
> an extent ref would look like this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  extent C, offset 0, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
>
> Sparse writes can lead to pairs of extent references surrounding a hole;
> however, in this case the extent references will point to different
> extents, avoiding the bug.  If a sparse write could trigger the bug,
> the rsync -S option and qemu/kvm 'raw' disk image files (among many
> other tools that produce sparse files) would be unusable, and it's
> unlikely that would not have been noticed between 2015 and now either.
> Sparse writes look like this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
>
> The pattern or timing of read() calls seems to be relevant.  It is very
> hard to see the corruption when reading files with 'hd', but 'cat | hd'
> will see the corruption just fine.  Similar problems exist with 'cmp'
> but not 'sha1sum'.  Two processes reading the same file at the same time
> seem to trigger the corruption very frequently.
>
> Some patterns of holes and data produce corruption faster than others.
> The pattern generated by the script above is based on instances of
> corruption I've found in the wild, and has a much better repro rate than
> random holes.
>
> The corruption occurs during reads, after csum verification and before
> decompression, so btrfs detects no csum failures.  The data on disk
> seems to be OK and could be read correctly once the kernel bug is fixed.
> Repeated reads do eventually return correct data, but there is no way
> for userspace to distinguish between corrupt and correct data reliably.
>
> The corrupted data is usually data replaced by a hole or a copy of other
> blocks in the same extent.
>
> The behavior is similar to some earlier bugs related to holes and
> Compressed data in btrfs, but it's new and not fixed yet--hence,
> "2018 edition."




* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
@ 2019-02-12 15:33   ` Christoph Anton Mitterer
  2019-02-12 15:35   ` Filipe Manana
  2019-02-13  7:47   ` Roman Mamedov
  2 siblings, 0 replies; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-02-12 15:33 UTC (permalink / raw)
  To: linux-btrfs

Hey.

Sounds like a highly severe (and long-standing) bug?

Is anyone doing anything about it?


Cheers,
Chris.



* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  2019-02-12 15:33   ` Christoph Anton Mitterer
@ 2019-02-12 15:35   ` Filipe Manana
  2019-02-12 17:01     ` Zygo Blaxell
  2019-02-13  7:47   ` Roman Mamedov
  2 siblings, 1 reply; 38+ messages in thread
From: Filipe Manana @ 2019-02-12 15:35 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> Still reproducible on 4.20.7.

I tried your reproducer when you first reported it, on different
machines with different kernel versions.
I never managed to reproduce it, nor did I see anything obviously wrong
in the relevant code paths.

>
> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> which makes the problem a bit more difficult to detect.
>
>         # repro-hole-corruption-test
>         i: 91, status: 0, bytes_deduped: 131072
>         i: 92, status: 0, bytes_deduped: 131072
>         i: 93, status: 0, bytes_deduped: 131072
>         i: 94, status: 0, bytes_deduped: 131072
>         i: 95, status: 0, bytes_deduped: 131072
>         i: 96, status: 0, bytes_deduped: 131072
>         i: 97, status: 0, bytes_deduped: 131072
>         i: 98, status: 0, bytes_deduped: 131072
>         i: 99, status: 0, bytes_deduped: 131072
>         13107200 total bytes deduped in this operation
>         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>
> The sha1sum seems stable after the first drop_caches--until a second
> process tries to read the test file:
>
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         # cat am > /dev/null              (in another shell)
>         19294e695272c42edb89ceee24bb08c13473140a am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>
> On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes.  The bug is
> > reproducible on at least kernels v4.1..v4.18.
> >
> > Some more observations and background follow, but first here is the
> > script and some sample output:
> >
> >       root@rescue:/test# cat repro-hole-corruption-test
> >       #!/bin/bash
> >
> >       # Write a 4096 byte block of something
> >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> >
> >       # Here is some test data with holes in it:
> >       for y in $(seq 0 100); do
> >               for x in 0 1; do
> >                       block 0;
> >                       block 21;
> >                       block 0;
> >                       block 22;
> >                       block 0;
> >                       block 0;
> >                       block 43;
> >                       block 44;
> >                       block 0;
> >                       block 0;
> >                       block 61;
> >                       block 62;
> >                       block 63;
> >                       block 64;
> >                       block 65;
> >                       block 66;
> >               done
> >       done > am
> >       sync
> >
> >       # Now replace those 101 distinct extents with 101 references to the first extent
> >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> >
> >       # Punch holes into the extent refs
> >       fallocate -v -d am
> >
> >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> >
> >       root@rescue:/test# ./repro-hole-corruption-test
> >       i: 91, status: 0, bytes_deduped: 131072
> >       i: 92, status: 0, bytes_deduped: 131072
> >       i: 93, status: 0, bytes_deduped: 131072
> >       i: 94, status: 0, bytes_deduped: 131072
> >       i: 95, status: 0, bytes_deduped: 131072
> >       i: 96, status: 0, bytes_deduped: 131072
> >       i: 97, status: 0, bytes_deduped: 131072
> >       i: 98, status: 0, bytes_deduped: 131072
> >       i: 99, status: 0, bytes_deduped: 131072
> >       13107200 total bytes deduped in this operation
> >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       072a152355788c767b97e4e4c0e4567720988b84 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       ^C
> >
> > Corruption occurs most often when there is a sequence like this in a file:
> >
> >       ref 1: hole
> >       ref 2: extent A, offset 0
> >       ref 3: hole
> >       ref 4: extent A, offset 8192
> >
> > This scenario typically arises due to hole-punching or deduplication.
> > Hole-punching replaces one extent ref with two references to the same
> > extent with a hole between them, so:
> >
> >       ref 1:  extent A, offset 0, length 16384
> >
> > becomes:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent A, offset 12288, length 4096
> >
> > Deduplication replaces two distinct extent refs surrounding a hole with
> > two references to one of the duplicate extents, turning this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent B, offset 0, length 4096
> >
> > into this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent A, offset 0, length 4096
> >
> > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > I am not able to reproduce the issue with an uncompressed extent nor
> > have I observed any such corruption in the wild.
> >
> > The presence or absence of the no-holes filesystem feature has no effect.
> >
> > Ordinary writes can lead to pairs of extent references to the same extent
> > separated by a reference to a different extent; however, in this case
> > there is data to be read from a real extent, instead of pages that have
> > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > this bug, every page-oriented database engine would be crashing all the
> > time on btrfs with compression enabled, and it's unlikely that would not
> > have been noticed between 2015 and now.  An ordinary write that splits
> > an extent ref would look like this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  extent C, offset 0, length 8192
> >       ref 3:  extent A, offset 12288, length 4096
> >
> > Sparse writes can lead to pairs of extent references surrounding a hole;
> > however, in this case the extent references will point to different
> > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > other tools that produce sparse files) would be unusable, and it's
> > unlikely that would not have been noticed between 2015 and now either.
> > Sparse writes look like this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent B, offset 0, length 4096
> >
> > The pattern or timing of read() calls seems to be relevant.  It is very
> > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > will see the corruption just fine.  Similar problems exist with 'cmp'
> > but not 'sha1sum'.  Two processes reading the same file at the same time
> > seem to trigger the corruption very frequently.
> >
> > Some patterns of holes and data produce corruption faster than others.
> > The pattern generated by the script above is based on instances of
> > corruption I've found in the wild, and has a much better repro rate than
> > random holes.
> >
> > The corruption occurs during reads, after csum verification and before
> > decompression, so btrfs detects no csum failures.  The data on disk
> > seems to be OK and could be read correctly once the kernel bug is fixed.
> > Repeated reads do eventually return correct data, but there is no way
> > for userspace to distinguish between corrupt and correct data reliably.
> >
> > The corrupted data is usually data replaced by a hole or a copy of other
> > blocks in the same extent.
> >
> > The behavior is similar to some earlier bugs related to holes and
> > compressed data in btrfs, but it's new and not fixed yet--hence,
> > "2018 edition."
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 15:35   ` Filipe Manana
@ 2019-02-12 17:01     ` Zygo Blaxell
  2019-02-12 17:56       ` Filipe Manana
  2019-02-12 18:58       ` Andrei Borzenkov
  0 siblings, 2 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-12 17:01 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11371 bytes --]

On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > Still reproducible on 4.20.7.
> 
> I tried your reproducer when you first reported it, on different
> machines with different kernel versions.

That would have been useful to know last August...  :-/

> Never managed to reproduce it, nor see anything obviously wrong in
> relevant code paths.

I built a fresh VM running Debian stretch and
reproduced the issue immediately.  Mount options are
"rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
probably doesn't matter.

I don't have any configuration that can't reproduce this issue, so I don't
know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
hardware ranging in age from 0 to 9 years.  Locally built kernels from
4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
All of these reproduce the issue immediately--wrong sha1sum appears in
the first 10 loops.

What is your test environment?  I can try that here.

> >
> > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > which makes the problem a bit more difficult to detect.
> >
> >         # repro-hole-corruption-test
> >         i: 91, status: 0, bytes_deduped: 131072
> >         i: 92, status: 0, bytes_deduped: 131072
> >         i: 93, status: 0, bytes_deduped: 131072
> >         i: 94, status: 0, bytes_deduped: 131072
> >         i: 95, status: 0, bytes_deduped: 131072
> >         i: 96, status: 0, bytes_deduped: 131072
> >         i: 97, status: 0, bytes_deduped: 131072
> >         i: 98, status: 0, bytes_deduped: 131072
> >         i: 99, status: 0, bytes_deduped: 131072
> >         13107200 total bytes deduped in this operation
> >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >
> > The sha1sum seems stable after the first drop_caches--until a second
> > process tries to read the test file:
> >
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         # cat am > /dev/null              (in another shell)
> >         19294e695272c42edb89ceee24bb08c13473140a am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >
> > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > when reading a mix of compressed extents and holes.  The bug is
> > > reproducible on at least kernels v4.1..v4.18.
> > >
> > > Some more observations and background follow, but first here is the
> > > script and some sample output:
> > >
> > >       root@rescue:/test# cat repro-hole-corruption-test
> > >       #!/bin/bash
> > >
> > >       # Write a 4096 byte block of something
> > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > >
> > >       # Here is some test data with holes in it:
> > >       for y in $(seq 0 100); do
> > >               for x in 0 1; do
> > >                       block 0;
> > >                       block 21;
> > >                       block 0;
> > >                       block 22;
> > >                       block 0;
> > >                       block 0;
> > >                       block 43;
> > >                       block 44;
> > >                       block 0;
> > >                       block 0;
> > >                       block 61;
> > >                       block 62;
> > >                       block 63;
> > >                       block 64;
> > >                       block 65;
> > >                       block 66;
> > >               done
> > >       done > am
> > >       sync
> > >
> > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > >
> > >       # Punch holes into the extent refs
> > >       fallocate -v -d am
> > >
> > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > >
> > >       root@rescue:/test# ./repro-hole-corruption-test
> > >       i: 91, status: 0, bytes_deduped: 131072
> > >       i: 92, status: 0, bytes_deduped: 131072
> > >       i: 93, status: 0, bytes_deduped: 131072
> > >       i: 94, status: 0, bytes_deduped: 131072
> > >       i: 95, status: 0, bytes_deduped: 131072
> > >       i: 96, status: 0, bytes_deduped: 131072
> > >       i: 97, status: 0, bytes_deduped: 131072
> > >       i: 98, status: 0, bytes_deduped: 131072
> > >       i: 99, status: 0, bytes_deduped: 131072
> > >       13107200 total bytes deduped in this operation
> > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >       ^C
> > >
> > > Corruption occurs most often when there is a sequence like this in a file:
> > >
> > >       ref 1: hole
> > >       ref 2: extent A, offset 0
> > >       ref 3: hole
> > >       ref 4: extent A, offset 8192
> > >
> > > This scenario typically arises due to hole-punching or deduplication.
> > > Hole-punching replaces one extent ref with two references to the same
> > > extent with a hole between them, so:
> > >
> > >       ref 1:  extent A, offset 0, length 16384
> > >
> > > becomes:
> > >
> > >       ref 1:  extent A, offset 0, length 4096
> > >       ref 2:  hole, length 8192
> > >       ref 3:  extent A, offset 12288, length 4096
> > >
> > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > two references to one of the duplicate extents, turning this:
> > >
> > >       ref 1:  extent A, offset 0, length 4096
> > >       ref 2:  hole, length 8192
> > >       ref 3:  extent B, offset 0, length 4096
> > >
> > > into this:
> > >
> > >       ref 1:  extent A, offset 0, length 4096
> > >       ref 2:  hole, length 8192
> > >       ref 3:  extent A, offset 0, length 4096
> > >
> > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > I am not able to reproduce the issue with an uncompressed extent nor
> > > have I observed any such corruption in the wild.
> > >
> > > The presence or absence of the no-holes filesystem feature has no effect.
> > >
> > > Ordinary writes can lead to pairs of extent references to the same extent
> > > separated by a reference to a different extent; however, in this case
> > > there is data to be read from a real extent, instead of pages that have
> > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > this bug, every page-oriented database engine would be crashing all the
> > > time on btrfs with compression enabled, and it's unlikely that would not
> > > have been noticed between 2015 and now.  An ordinary write that splits
> > > an extent ref would look like this:
> > >
> > >       ref 1:  extent A, offset 0, length 4096
> > >       ref 2:  extent C, offset 0, length 8192
> > >       ref 3:  extent A, offset 12288, length 4096
> > >
> > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > however, in this case the extent references will point to different
> > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > other tools that produce sparse files) would be unusable, and it's
> > > unlikely that would not have been noticed between 2015 and now either.
> > > Sparse writes look like this:
> > >
> > >       ref 1:  extent A, offset 0, length 4096
> > >       ref 2:  hole, length 8192
> > >       ref 3:  extent B, offset 0, length 4096
> > >
> > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > seem to trigger the corruption very frequently.
> > >
> > > Some patterns of holes and data produce corruption faster than others.
> > > The pattern generated by the script above is based on instances of
> > > corruption I've found in the wild, and has a much better repro rate than
> > > random holes.
> > >
> > > The corruption occurs during reads, after csum verification and before
> > > decompression, so btrfs detects no csum failures.  The data on disk
> > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > Repeated reads do eventually return correct data, but there is no way
> > > for userspace to distinguish between corrupt and correct data reliably.
> > >
> > > The corrupted data is usually data replaced by a hole or a copy of other
> > > blocks in the same extent.
> > >
> > > The behavior is similar to some earlier bugs related to holes and
> > > compressed data in btrfs, but it's new and not fixed yet--hence,
> > > "2018 edition."
> >
> >
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 17:01     ` Zygo Blaxell
@ 2019-02-12 17:56       ` Filipe Manana
  2019-02-12 18:13         ` Zygo Blaxell
  2019-02-12 18:58       ` Andrei Borzenkov
  1 sibling, 1 reply; 38+ messages in thread
From: Filipe Manana @ 2019-02-12 17:56 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > Still reproducible on 4.20.7.
> >
> > I tried your reproducer when you first reported it, on different
> > machines with different kernel versions.
>
> That would have been useful to know last August...  :-/
>
> > Never managed to reproduce it, nor see anything obviously wrong in
> > relevant code paths.
>
> I built a fresh VM running Debian stretch and
> reproduced the issue immediately.  Mount options are
> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> probably doesn't matter.
>
> I don't have any configuration that can't reproduce this issue, so I don't
> know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> hardware ranging in age from 0 to 9 years.  Locally built kernels from
> 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> All of these reproduce the issue immediately--wrong sha1sum appears in
> the first 10 loops.
>
> What is your test environment?  I can try that here.

Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. Always built
from source kernels.
I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
that kept running the test in an infinite loop during those weeks.
Don't recall what the kernel versions were (whatever was the latest at
the time), but that shouldn't matter according to what you say.

>
> > >
> > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > which makes the problem a bit more difficult to detect.
> > >
> > >         # repro-hole-corruption-test
> > >         i: 91, status: 0, bytes_deduped: 131072
> > >         i: 92, status: 0, bytes_deduped: 131072
> > >         i: 93, status: 0, bytes_deduped: 131072
> > >         i: 94, status: 0, bytes_deduped: 131072
> > >         i: 95, status: 0, bytes_deduped: 131072
> > >         i: 96, status: 0, bytes_deduped: 131072
> > >         i: 97, status: 0, bytes_deduped: 131072
> > >         i: 98, status: 0, bytes_deduped: 131072
> > >         i: 99, status: 0, bytes_deduped: 131072
> > >         13107200 total bytes deduped in this operation
> > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >
> > > The sha1sum seems stable after the first drop_caches--until a second
> > > process tries to read the test file:
> > >
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >         # cat am > /dev/null              (in another shell)
> > >         19294e695272c42edb89ceee24bb08c13473140a am
> > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > >
> > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > when reading a mix of compressed extents and holes.  The bug is
> > > > reproducible on at least kernels v4.1..v4.18.
> > > >
> > > > Some more observations and background follow, but first here is the
> > > > script and some sample output:
> > > >
> > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > >       #!/bin/bash
> > > >
> > > >       # Write a 4096 byte block of something
> > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > >
> > > >       # Here is some test data with holes in it:
> > > >       for y in $(seq 0 100); do
> > > >               for x in 0 1; do
> > > >                       block 0;
> > > >                       block 21;
> > > >                       block 0;
> > > >                       block 22;
> > > >                       block 0;
> > > >                       block 0;
> > > >                       block 43;
> > > >                       block 44;
> > > >                       block 0;
> > > >                       block 0;
> > > >                       block 61;
> > > >                       block 62;
> > > >                       block 63;
> > > >                       block 64;
> > > >                       block 65;
> > > >                       block 66;
> > > >               done
> > > >       done > am
> > > >       sync
> > > >
> > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > >
> > > >       # Punch holes into the extent refs
> > > >       fallocate -v -d am
> > > >
> > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > >
> > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > >       i: 91, status: 0, bytes_deduped: 131072
> > > >       i: 92, status: 0, bytes_deduped: 131072
> > > >       i: 93, status: 0, bytes_deduped: 131072
> > > >       i: 94, status: 0, bytes_deduped: 131072
> > > >       i: 95, status: 0, bytes_deduped: 131072
> > > >       i: 96, status: 0, bytes_deduped: 131072
> > > >       i: 97, status: 0, bytes_deduped: 131072
> > > >       i: 98, status: 0, bytes_deduped: 131072
> > > >       i: 99, status: 0, bytes_deduped: 131072
> > > >       13107200 total bytes deduped in this operation
> > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >       ^C
> > > >
> > > > Corruption occurs most often when there is a sequence like this in a file:
> > > >
> > > >       ref 1: hole
> > > >       ref 2: extent A, offset 0
> > > >       ref 3: hole
> > > >       ref 4: extent A, offset 8192
> > > >
> > > > This scenario typically arises due to hole-punching or deduplication.
> > > > Hole-punching replaces one extent ref with two references to the same
> > > > extent with a hole between them, so:
> > > >
> > > >       ref 1:  extent A, offset 0, length 16384
> > > >
> > > > becomes:
> > > >
> > > >       ref 1:  extent A, offset 0, length 4096
> > > >       ref 2:  hole, length 8192
> > > >       ref 3:  extent A, offset 12288, length 4096
> > > >
> > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > two references to one of the duplicate extents, turning this:
> > > >
> > > >       ref 1:  extent A, offset 0, length 4096
> > > >       ref 2:  hole, length 8192
> > > >       ref 3:  extent B, offset 0, length 4096
> > > >
> > > > into this:
> > > >
> > > >       ref 1:  extent A, offset 0, length 4096
> > > >       ref 2:  hole, length 8192
> > > >       ref 3:  extent A, offset 0, length 4096
> > > >
> > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > have I observed any such corruption in the wild.
> > > >
> > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > >
> > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > separated by a reference to a different extent; however, in this case
> > > > there is data to be read from a real extent, instead of pages that have
> > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > this bug, every page-oriented database engine would be crashing all the
> > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > an extent ref would look like this:
> > > >
> > > >       ref 1:  extent A, offset 0, length 4096
> > > >       ref 2:  extent C, offset 0, length 8192
> > > >       ref 3:  extent A, offset 12288, length 4096
> > > >
> > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > however, in this case the extent references will point to different
> > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > other tools that produce sparse files) would be unusable, and it's
> > > > unlikely that would not have been noticed between 2015 and now either.
> > > > Sparse writes look like this:
> > > >
> > > >       ref 1:  extent A, offset 0, length 4096
> > > >       ref 2:  hole, length 8192
> > > >       ref 3:  extent B, offset 0, length 4096
> > > >
> > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > seem to trigger the corruption very frequently.
> > > >
> > > > Some patterns of holes and data produce corruption faster than others.
> > > > The pattern generated by the script above is based on instances of
> > > > corruption I've found in the wild, and has a much better repro rate than
> > > > random holes.
> > > >
> > > > The corruption occurs during reads, after csum verification and before
> > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > Repeated reads do eventually return correct data, but there is no way
> > > > for userspace to distinguish between corrupt and correct data reliably.
> > > >
> > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > blocks in the same extent.
> > > >
> > > > The behavior is similar to some earlier bugs related to holes and
> > > > compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > "2018 edition."
> > >
> > >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 17:56       ` Filipe Manana
@ 2019-02-12 18:13         ` Zygo Blaxell
  2019-02-13  7:24           ` Qu Wenruo
  2019-02-13 17:36           ` Filipe Manana
  0 siblings, 2 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-12 18:13 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 13720 bytes --]

On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > >
> > > > Still reproducible on 4.20.7.
> > >
> > > I tried your reproducer when you first reported it, on different
> > > machines with different kernel versions.
> >
> > That would have been useful to know last August...  :-/
> >
> > > Never managed to reproduce it, nor see anything obviously wrong in
> > > relevant code paths.
> >
> > I built a fresh VM running Debian stretch and
> > reproduced the issue immediately.  Mount options are
> > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > probably doesn't matter.
> >
> > I don't have any configuration that can't reproduce this issue, so I don't
> > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > All of these reproduce the issue immediately--wrong sha1sum appears in
> > the first 10 loops.
> >
> > What is your test environment?  I can try that here.
> 
> Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. 

I have several environments like that...

> Always built from source kernels.

...that could be a relevant difference.  Have you tried a stock
Debian kernel?

> I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> that kept running the test in an infinite loop during those weeks.
> Don't recall what the kernel versions were (whatever was the latest at
> the time), but that shouldn't matter according to what you say.

That's an extremely long time compared to the rate of occurrence
of this bug.  It should appear in only a few seconds of testing.
Some data-hole-data patterns reproduce much slower (change the position
of "block 0" lines in the setup script), but "slower" is minutes,
not machine-months.
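
(For illustration only: a variant of the inner loop with the "block 0"
lines in different positions--an untested sketch of the kind of change
meant above, not a measured repro rate:)

	# variant pattern: same block helper as the repro script, with the
	# holes landing in different positions relative to the data blocks
	for y in $(seq 0 100); do
		for x in 0 1; do
			block 21; block 0; block 22; block 0;
			block 43; block 44; block 0; block 0;
			block 61; block 62; block 63; block 64;
			block 65; block 66; block 0; block 0;
		done
	done > am
	sync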

Is your filesystem compressed?  Does compsize show the test
file 'am' is compressed during the test?  Is the sha1sum you get
6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
when a second process reads the file while the sha1sum/drop_caches loop
is running?
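
(For reference, one quick way to check all of that--assuming the test
file is still called 'am' and the separate compsize tool is available;
the expected sha1sum is the one quoted above:)

	# is the file actually stored compressed on disk?
	compsize am
	# correct data reads back as this checksum:
	sha1sum am	# 6926a34e0ab3e0a023e8ea85a650f5b4217acab4
	# in a second shell, read the file concurrently while the
	# sha1sum/drop_caches loop from the repro script is running:
	while :; do cat am > /dev/null; done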

> > > >
> > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > > which makes the problem a bit more difficult to detect.
> > > >
> > > >         # repro-hole-corruption-test
> > > >         i: 91, status: 0, bytes_deduped: 131072
> > > >         i: 92, status: 0, bytes_deduped: 131072
> > > >         i: 93, status: 0, bytes_deduped: 131072
> > > >         i: 94, status: 0, bytes_deduped: 131072
> > > >         i: 95, status: 0, bytes_deduped: 131072
> > > >         i: 96, status: 0, bytes_deduped: 131072
> > > >         i: 97, status: 0, bytes_deduped: 131072
> > > >         i: 98, status: 0, bytes_deduped: 131072
> > > >         i: 99, status: 0, bytes_deduped: 131072
> > > >         13107200 total bytes deduped in this operation
> > > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >
> > > > The sha1sum seems stable after the first drop_caches--until a second
> > > > process tries to read the test file:
> > > >
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >         # cat am > /dev/null              (in another shell)
> > > >         19294e695272c42edb89ceee24bb08c13473140a am
> > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > >
> > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > > when reading a mix of compressed extents and holes.  The bug is
> > > > > reproducible on at least kernels v4.1..v4.18.
> > > > >
> > > > > Some more observations and background follow, but first here is the
> > > > > script and some sample output:
> > > > >
> > > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > > >       #!/bin/bash
> > > > >
> > > > >       # Write a 4096 byte block of something
> > > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > > >
> > > > >       # Here is some test data with holes in it:
> > > > >       for y in $(seq 0 100); do
> > > > >               for x in 0 1; do
> > > > >                       block 0;
> > > > >                       block 21;
> > > > >                       block 0;
> > > > >                       block 22;
> > > > >                       block 0;
> > > > >                       block 0;
> > > > >                       block 43;
> > > > >                       block 44;
> > > > >                       block 0;
> > > > >                       block 0;
> > > > >                       block 61;
> > > > >                       block 62;
> > > > >                       block 63;
> > > > >                       block 64;
> > > > >                       block 65;
> > > > >                       block 66;
> > > > >               done
> > > > >       done > am
> > > > >       sync
> > > > >
> > > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > > >
> > > > >       # Punch holes into the extent refs
> > > > >       fallocate -v -d am
> > > > >
> > > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > > >
> > > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > > >       i: 91, status: 0, bytes_deduped: 131072
> > > > >       i: 92, status: 0, bytes_deduped: 131072
> > > > >       i: 93, status: 0, bytes_deduped: 131072
> > > > >       i: 94, status: 0, bytes_deduped: 131072
> > > > >       i: 95, status: 0, bytes_deduped: 131072
> > > > >       i: 96, status: 0, bytes_deduped: 131072
> > > > >       i: 97, status: 0, bytes_deduped: 131072
> > > > >       i: 98, status: 0, bytes_deduped: 131072
> > > > >       i: 99, status: 0, bytes_deduped: 131072
> > > > >       13107200 total bytes deduped in this operation
> > > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >       ^C
> > > > >
> > > > > Corruption occurs most often when there is a sequence like this in a file:
> > > > >
> > > > >       ref 1: hole
> > > > >       ref 2: extent A, offset 0
> > > > >       ref 3: hole
> > > > >       ref 4: extent A, offset 8192
> > > > >
> > > > > This scenario typically arises due to hole-punching or deduplication.
> > > > > Hole-punching replaces one extent ref with two references to the same
> > > > > extent with a hole between them, so:
> > > > >
> > > > >       ref 1:  extent A, offset 0, length 16384
> > > > >
> > > > > becomes:
> > > > >
> > > > >       ref 1:  extent A, offset 0, length 4096
> > > > >       ref 2:  hole, length 8192
> > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > >
> > > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > > two references to one of the duplicate extents, turning this:
> > > > >
> > > > >       ref 1:  extent A, offset 0, length 4096
> > > > >       ref 2:  hole, length 8192
> > > > >       ref 3:  extent B, offset 0, length 4096
> > > > >
> > > > > into this:
> > > > >
> > > > >       ref 1:  extent A, offset 0, length 4096
> > > > >       ref 2:  hole, length 8192
> > > > >       ref 3:  extent A, offset 0, length 4096
> > > > >
> > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > > have I observed any such corruption in the wild.
> > > > >
> > > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > > >
> > > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > > separated by a reference to a different extent; however, in this case
> > > > > there is data to be read from a real extent, instead of pages that have
> > > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > > this bug, every page-oriented database engine would be crashing all the
> > > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > > an extent ref would look like this:
> > > > >
> > > > >       ref 1:  extent A, offset 0, length 4096
> > > > >       ref 2:  extent C, offset 0, length 8192
> > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > >
> > > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > > however, in this case the extent references will point to different
> > > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > > other tools that produce sparse files) would be unusable, and it's
> > > > > unlikely that would not have been noticed between 2015 and now either.
> > > > > Sparse writes look like this:
> > > > >
> > > > >       ref 1:  extent A, offset 0, length 4096
> > > > >       ref 2:  hole, length 8192
> > > > >       ref 3:  extent B, offset 0, length 4096
> > > > >
> > > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > > seem to trigger the corruption very frequently.
> > > > >
> > > > > Some patterns of holes and data produce corruption faster than others.
> > > > > The pattern generated by the script above is based on instances of
> > > > > corruption I've found in the wild, and has a much better repro rate than
> > > > > random holes.
> > > > >
> > > > > The corruption occurs during reads, after csum verification and before
> > > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > > Repeated reads do eventually return correct data, but there is no way
> > > > > for userspace to distinguish between corrupt and correct data reliably.
> > > > >
> > > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > > blocks in the same extent.
> > > > >
> > > > > The behavior is similar to some earlier bugs related to holes and
> > > > > compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > > "2018 edition."
> > > >
> > > >
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> > >
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 17:01     ` Zygo Blaxell
  2019-02-12 17:56       ` Filipe Manana
@ 2019-02-12 18:58       ` Andrei Borzenkov
  2019-02-12 21:48         ` Chris Murphy
  1 sibling, 1 reply; 38+ messages in thread
From: Andrei Borzenkov @ 2019-02-12 18:58 UTC (permalink / raw)
  To: Zygo Blaxell, Filipe Manana; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 13325 bytes --]

12.02.2019 20:01, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>
>>> Still reproducible on 4.20.7.
>>
>> I tried your reproducer when you first reported it, on different
>> machines with different kernel versions.
> 
> That would have been useful to know last August...  :-/
> 
>> Never managed to reproduce it, nor see anything obviously wrong in
>> relevant code paths.
> 
> I built a fresh VM running Debian stretch and
> reproduced the issue immediately.  Mount options are
> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> probably doesn't matter.
> 
> I don't have any configuration that can't reproduce this issue, so I don't
> know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> hardware ranging in age from 0 to 9 years.  Locally built kernels from
> 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> All of these reproduce the issue immediately--wrong sha1sum appears in
> the first 10 loops.
> 
> What is your test environment?  I can try that here.
> 
>>>
>>> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
>>> which makes the problem a bit more difficult to detect.
>>>
>>>         # repro-hole-corruption-test
>>>         i: 91, status: 0, bytes_deduped: 131072
>>>         i: 92, status: 0, bytes_deduped: 131072
>>>         i: 93, status: 0, bytes_deduped: 131072
>>>         i: 94, status: 0, bytes_deduped: 131072
>>>         i: 95, status: 0, bytes_deduped: 131072
>>>         i: 96, status: 0, bytes_deduped: 131072
>>>         i: 97, status: 0, bytes_deduped: 131072
>>>         i: 98, status: 0, bytes_deduped: 131072
>>>         i: 99, status: 0, bytes_deduped: 131072
>>>         13107200 total bytes deduped in this operation
>>>         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>>         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>


I get the same result on Ubuntu 18.04 using distro packages and the 4.18
HWE kernel.

root@bor-Latitude-E5450:/var/tmp# dd if=/dev/zero of=loop bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0,125205 s, 1,7 GB/s
root@bor-Latitude-E5450:/var/tmp# mkfs.btrfs loop
btrfs-progs v4.15.1
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               b1f1111e-2d65-484a-9ab3-e00feaac2048
Node size:          16384
Sector size:        4096
Filesystem size:    200.00MiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP              32.00MiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   200.00MiB  loop

root@bor-Latitude-E5450:/var/tmp# mount -t btrfs -o
loop,rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/ ./loop
./loopmnt
root@bor-Latitude-E5450:/var/tmp# cd -
/var/tmp/loopmnt
root@bor-Latitude-E5450:/var/tmp/loopmnt# ../repro-hole-corruption-test
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4,8 MiB (4964352 bytes) converted to sparse holes.
94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
^Croot@bor-Latitude-E5450:/var/tmp/loopmnt#


>>> The sha1sum seems stable after the first drop_caches--until a second
>>> process tries to read the test file:
>>>
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         # cat am > /dev/null              (in another shell)
>>>         19294e695272c42edb89ceee24bb08c13473140a am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>
>>> On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
>>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>>> when reading a mix of compressed extents and holes.  The bug is
>>>> reproducible on at least kernels v4.1..v4.18.
>>>>
>>>> Some more observations and background follow, but first here is the
>>>> script and some sample output:
>>>>
>>>>       root@rescue:/test# cat repro-hole-corruption-test
>>>>       #!/bin/bash
>>>>
>>>>       # Write a 4096 byte block of something
>>>>       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>>
>>>>       # Here is some test data with holes in it:
>>>>       for y in $(seq 0 100); do
>>>>               for x in 0 1; do
>>>>                       block 0;
>>>>                       block 21;
>>>>                       block 0;
>>>>                       block 22;
>>>>                       block 0;
>>>>                       block 0;
>>>>                       block 43;
>>>>                       block 44;
>>>>                       block 0;
>>>>                       block 0;
>>>>                       block 61;
>>>>                       block 62;
>>>>                       block 63;
>>>>                       block 64;
>>>>                       block 65;
>>>>                       block 66;
>>>>               done
>>>>       done > am
>>>>       sync
>>>>
>>>>       # Now replace those 101 distinct extents with 101 references to the first extent
>>>>       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>>>>
>>>>       # Punch holes into the extent refs
>>>>       fallocate -v -d am
>>>>
>>>>       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
>>>>       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>>>>
>>>>       root@rescue:/test# ./repro-hole-corruption-test
>>>>       i: 91, status: 0, bytes_deduped: 131072
>>>>       i: 92, status: 0, bytes_deduped: 131072
>>>>       i: 93, status: 0, bytes_deduped: 131072
>>>>       i: 94, status: 0, bytes_deduped: 131072
>>>>       i: 95, status: 0, bytes_deduped: 131072
>>>>       i: 96, status: 0, bytes_deduped: 131072
>>>>       i: 97, status: 0, bytes_deduped: 131072
>>>>       i: 98, status: 0, bytes_deduped: 131072
>>>>       i: 99, status: 0, bytes_deduped: 131072
>>>>       13107200 total bytes deduped in this operation
>>>>       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       072a152355788c767b97e4e4c0e4567720988b84 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       60831f0e7ffe4b49722612c18685c09f4583b1df am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>       ^C
>>>>
>>>> Corruption occurs most often when there is a sequence like this in a file:
>>>>
>>>>       ref 1: hole
>>>>       ref 2: extent A, offset 0
>>>>       ref 3: hole
>>>>       ref 4: extent A, offset 8192
>>>>
>>>> This scenario typically arises due to hole-punching or deduplication.
>>>> Hole-punching replaces one extent ref with two references to the same
>>>> extent with a hole between them, so:
>>>>
>>>>       ref 1:  extent A, offset 0, length 16384
>>>>
>>>> becomes:
>>>>
>>>>       ref 1:  extent A, offset 0, length 4096
>>>>       ref 2:  hole, length 8192
>>>>       ref 3:  extent A, offset 12288, length 4096
>>>>
>>>> Deduplication replaces two distinct extent refs surrounding a hole with
>>>> two references to one of the duplicate extents, turning this:
>>>>
>>>>       ref 1:  extent A, offset 0, length 4096
>>>>       ref 2:  hole, length 8192
>>>>       ref 3:  extent B, offset 0, length 4096
>>>>
>>>> into this:
>>>>
>>>>       ref 1:  extent A, offset 0, length 4096
>>>>       ref 2:  hole, length 8192
>>>>       ref 3:  extent A, offset 0, length 4096
>>>>
>>>> Compression is required (zlib, zstd, or lzo) for corruption to occur.
>>>> I am not able to reproduce the issue with an uncompressed extent nor
>>>> have I observed any such corruption in the wild.
>>>>
>>>> The presence or absence of the no-holes filesystem feature has no effect.
>>>>
>>>> Ordinary writes can lead to pairs of extent references to the same extent
>>>> separated by a reference to a different extent; however, in this case
>>>> there is data to be read from a real extent, instead of pages that have
>>>> to be zero filled from a hole.  If ordinary non-hole writes could trigger
>>>> this bug, every page-oriented database engine would be crashing all the
>>>> time on btrfs with compression enabled, and it's unlikely that would not
>>>> have been noticed between 2015 and now.  An ordinary write that splits
>>>> an extent ref would look like this:
>>>>
>>>>       ref 1:  extent A, offset 0, length 4096
>>>>       ref 2:  extent C, offset 0, length 8192
>>>>       ref 3:  extent A, offset 12288, length 4096
>>>>
>>>> Sparse writes can lead to pairs of extent references surrounding a hole;
>>>> however, in this case the extent references will point to different
>>>> extents, avoiding the bug.  If a sparse write could trigger the bug,
>>>> the rsync -S option and qemu/kvm 'raw' disk image files (among many
>>>> other tools that produce sparse files) would be unusable, and it's
>>>> unlikely that would not have been noticed between 2015 and now either.
>>>> Sparse writes look like this:
>>>>
>>>>       ref 1:  extent A, offset 0, length 4096
>>>>       ref 2:  hole, length 8192
>>>>       ref 3:  extent B, offset 0, length 4096
>>>>
>>>> The pattern or timing of read() calls seems to be relevant.  It is very
>>>> hard to see the corruption when reading files with 'hd', but 'cat | hd'
>>>> will see the corruption just fine.  Similar problems exist with 'cmp'
>>>> but not 'sha1sum'.  Two processes reading the same file at the same time
>>>> seem to trigger the corruption very frequently.
>>>>
>>>> Some patterns of holes and data produce corruption faster than others.
>>>> The pattern generated by the script above is based on instances of
>>>> corruption I've found in the wild, and has a much better repro rate than
>>>> random holes.
>>>>
>>>> The corruption occurs during reads, after csum verification and before
>>>> decompression, so btrfs detects no csum failures.  The data on disk
>>>> seems to be OK and could be read correctly once the kernel bug is fixed.
>>>> Repeated reads do eventually return correct data, but there is no way
>>>> for userspace to distinguish between corrupt and correct data reliably.
>>>>
>>>> The corrupted data is usually data replaced by a hole or a copy of other
>>>> blocks in the same extent.
>>>>
>>>> The behavior is similar to some earlier bugs related to holes and
>>>> compressed data in btrfs, but it's new and not fixed yet--hence,
>>>> "2018 edition."
>>>
>>>
>>
>>
>> -- 
>> Filipe David Manana,
>>
>> “Whether you think you can, or you think you can't — you're right.”
>>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 18:58       ` Andrei Borzenkov
@ 2019-02-12 21:48         ` Chris Murphy
  2019-02-12 22:11           ` Zygo Blaxell
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Murphy @ 2019-02-12 21:48 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Zygo Blaxell, Filipe Manana, linux-btrfs

Is it possibly related to the zlib library being used on
Debian/Ubuntu? That you've got even one reproducer with the exact same
hash for the transient error case means it's not hardware or random
error; let alone two independent reproducers.

And then what happens if you do the exact same test but change to zstd
or lzo? No error? Strictly zlib?

--
Chris Murphy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 21:48         ` Chris Murphy
@ 2019-02-12 22:11           ` Zygo Blaxell
  2019-02-12 22:53             ` Chris Murphy
  0 siblings, 1 reply; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-12 22:11 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Andrei Borzenkov, Filipe Manana, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4033 bytes --]

On Tue, Feb 12, 2019 at 02:48:38PM -0700, Chris Murphy wrote:
> Is it possibly related to the zlib library being used on
> Debian/Ubuntu? That you've got even one reproducer with the exact same
> hash for the transient error case means it's not hardware or random
> error; let alone two independent reproducers.

The errors are not consistent between runs.  The above pattern is quite
common, but it is not the only possible output.  Add in other processes
reading the 'am' file at the same time and it gets very random.

The bad data tends to have entire extents missing, replaced with zeros.
That leads to a small number of possible outputs (the choices seem to be
only to have the data or have the zeros).  It does seem to be a lot more
consistent in recent (post 4.14.80) kernels, which may be interesting.

Here is an example of a diff between two copies of the 'am' file copied
while the repro script was running, filtered through hd:

	# diff -u /tmp/f1 /tmp/f2
	--- /tmp/f1     2019-02-12 17:05:14.861844871 -0500
	+++ /tmp/f2     2019-02-12 17:05:16.883868402 -0500
	@@ -56,10 +56,6 @@
	 *
	 00020000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-00021000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-00022000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 00023000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 00024000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -268,10 +264,6 @@
	 *
	 000a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-000a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-000a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 000a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 000a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -688,10 +680,6 @@
	 *
	 001a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-001a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-001a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 001a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 001a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -1524,10 +1512,6 @@
	 *
	 003a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-003a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-003a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 003a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 003a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -3192,10 +3176,6 @@
	 *
	 007a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-007a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-007a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 007a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 007a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -5016,10 +4996,6 @@
	 *
	 00c00000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-00c01000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-00c02000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	[etc...you get the idea]
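
For reference, one way to capture such a pair while the repro loop is
running is just two timed reads; the two-second gap and the use of 'hd'
here are my own choice, not something taken from the thread:

	hd am > /tmp/f1
	sleep 2		# let the loop drop caches and read again
	hd am > /tmp/f2
	diff -u /tmp/f1 /tmp/f2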

I'm not sure how the zlib library is involved--sha1sum doesn't use one.

> And then what happens if you do the exact same test but change to zstd
> or lzo? No error? Strictly zlib?

Same errors on all three btrfs compression algorithms (as mentioned in
the original post from August 2018).
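
(For anyone who wants to pin down a single algorithm: compression is a
remountable option, so something like the following works; the mount point
is a placeholder, and the test file is rewritten by each run of the script:)

	mount -o remount,compress=zstd /mnt/test
	./repro-hole-corruption-test	# 'am' is recreated with zstd
	mount -o remount,compress=lzo /mnt/test
	./repro-hole-corruption-test	# and again with lzo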

> --
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 22:11           ` Zygo Blaxell
@ 2019-02-12 22:53             ` Chris Murphy
  2019-02-13  2:46               ` Zygo Blaxell
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Murphy @ 2019-02-12 22:53 UTC (permalink / raw)
  To: Zygo Blaxell, Btrfs BTRFS

On Tue, Feb 12, 2019 at 3:11 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Feb 12, 2019 at 02:48:38PM -0700, Chris Murphy wrote:
> > Is it possibly related to the zlib library being used on
> > Debian/Ubuntu? That you've got even one reproducer with the exact same
> > hash for the transient error case means it's not hardware or random
> > error; let alone two independent reproducers.
>
> The errors are not consistent between runs.  The above pattern is quite
> common, but it is not the only possible output.  Add in other processes
> reading the 'am' file at the same time and it gets very random.
>
> The bad data tends to have entire extents missing, replaced with zeros.
> That leads to a small number of possible outputs (the choices seem to be
> only to have the data or have the zeros).  It does seem to be a lot more
> consistent in recent (post 4.14.80) kernels, which may be interesting.
>
> Here is an example of a diff between two copies of the 'am' file copied
> while the repro script was running, filtered through hd:
>
>         # diff -u /tmp/f1 /tmp/f2
>         --- /tmp/f1     2019-02-12 17:05:14.861844871 -0500
>         +++ /tmp/f2     2019-02-12 17:05:16.883868402 -0500
>         @@ -56,10 +56,6 @@
>          *
>          00020000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>          *
>         -00021000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
>         -*
>         -00022000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         -*
>          00023000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
>          *
>          00024000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         @@ -268,10 +264,6 @@
>          *
>          000a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>          *
>         -000a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
>         -*
>         -000a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         -*
>          000a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
>          *
>          000a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         @@ -688,10 +680,6 @@
>          *
>          001a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>          *
>         -001a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
>         -*
>         -001a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         -*
>          001a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
>          *
>          001a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         @@ -1524,10 +1512,6 @@
>          *
>          003a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>          *
>         -003a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
>         -*
>         -003a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         -*
>          003a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
>          *
>          003a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         @@ -3192,10 +3176,6 @@
>          *
>          007a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>          *
>         -007a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
>         -*
>         -007a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         -*
>          007a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
>          *
>          007a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         @@ -5016,10 +4996,6 @@
>          *
>          00c00000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>          *
>         -00c01000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
>         -*
>         -00c02000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>         -*
>         [etc...you get the idea]

And yet the file is delivered to user space, despite the changes, as
if it's immune to checksum computation or matching. The data is
clearly different, so how is it bypassing checksumming? Data csums are
based on the original uncompressed data, correct? So any holes are zeros;
are there still csums for those holes?

>
> I'm not sure how the zlib library is involved--sha1sum doesn't use one.
>
> > And then what happens if you do the exact same test but change to zstd
> > or lzo? No error? Strictly zlib?
>
> Same errors on all three btrfs compression algorithms (as mentioned in
> the original post from August 2018).

Obviously there is a pattern. It's not random. I just don't know what
it looks like. I use compression, for years now, mostly zstd lately
and a mix of lzo and zlib before that, but never any errors or
corruptions. But I also never use holes, no punched holes, and rarely
use fallocated files which I guess isn't quite the same thing as hole
punching.

So the bug you're reproducing is for sure 100% not on the media
itself; it's somehow transiently being interpreted differently on roughly
1 in 10 reads, but with a pattern. What about scrub? Do you get errors
every 1 in 10 scrubs? Or how does it manifest? No scrub errors?

I know very little about what parts of the kernel a file system
depends on outside of its own code (e.g. page cache) but I wonder if
there's something outside of Btrfs that's the source but it never gets
triggered because no other file systems use compression. Huh - what
file system uses compression *and* hole punching? squashfs? Is sparse
file support different than hole punching?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 22:53             ` Chris Murphy
@ 2019-02-13  2:46               ` Zygo Blaxell
  0 siblings, 0 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-13  2:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 5093 bytes --]

On Tue, Feb 12, 2019 at 03:53:53PM -0700, Chris Murphy wrote:
> And yet the file is delivered to user space, despite the changes, as
> if it's immune to checksum computation or matching. The data is
> clearly different, so how is it bypassing checksumming? Data csums are
> based on the original uncompressed data, correct? So any holes are zeros;
> are there still csums for those holes?

csums in btrfs protect data blocks.  Holes are the absence of data blocks,
so there are no csums for holes.

There are no csums for extent references either--only csums on the extent
data that is referenced.  Since this bug affects processing of extent
refs, it must occur long after all the csums are verified.
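
(One way to see that distinction is to dump the trees of a scratch,
ideally unmounted, filesystem; the device name below is a placeholder:)

	# csum tree: one EXTENT_CSUM item per on-disk data extent
	btrfs inspect-internal dump-tree -t csum /dev/vdb | grep EXTENT_CSUM
	# fs tree: EXTENT_DATA items are the per-file extent references;
	# holes either have no item at all (no-holes feature) or point at
	# disk byte 0, and in neither case is there a csum item for them
	btrfs inspect-internal dump-tree -t fs /dev/vdb | grep -A2 EXTENT_DATA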

> > I'm not sure how the zlib library is involved--sha1sum doesn't use one.
> >
> > > And then what happens if you do the exact same test but change to zstd
> > > or lzo? No error? Strictly zlib?
> >
> > Same errors on all three btrfs compression algorithms (as mentioned in
> > the original post from August 2018).
> 
> Obviously there is a pattern. It's not random. I just don't know what
> it looks like. 

Without knowing the root cause I can only speculate, but it does seem to
be random, just very heavily biased to some outcomes.  It will produce
more distinct sha1sum values the longer you run it, especially if there
is other activity on the system to perturb the kernel a bit.  If you make
the test file bigger you can have more combinations of outputs.

I also note that since the big batch of btrfs bug fixes that landed
near 4.14.80, the variation between runs seems to be a lot less than
with earlier kernels; however, the full range of random output values
(i.e. which extents of the file disappear) still seems to be possible, it
just takes longer to get distinct values.  I'm not sure that information
helps to form a theory of how the bug operates.

> I use compression, for years now, mostly zstd lately
> and a mix of lzo and zlib before that, but never any errors or
> corruptions. But I also never use holes, no punched holes, and rarely
> use fallocated files which I guess isn't quite the same thing as hole
> punching.

I covered this in August.  The original thread was:

	https://www.spinics.net/lists/linux-btrfs/msg81293.html

TL;DR you won't see this problem unless you have a single compressed
extent that is split by a hole--an artifact that can only be produced by
punching holes, cloning, or dedupe.  The cases users are most likely to
encounter are dedupe and hole-punching--I don't know of any applications
in real-world use that do cloning the right way to trigger this problem.
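
(A minimal by-hand version of that artifact, as a sketch rather than the
exact script used in this thread; it assumes the current directory is on
a btrfs mounted with compression:)

	# one 128K compressed extent...
	head -c 131072 /dev/zero | tr '\0' 'a' > split-extent
	sync
	# ...then punch an 8K hole in the middle, so both halves are refs
	# to the same compressed extent with a hole between them
	fallocate -p -o 61440 -l 8192 split-extent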

Also, you haven't mentioned whether you've successfully reproduced this
yourself yet (or not).

> So the bug you're reproducing is for sure 100% not on the media
> itself; it's somehow transiently being interpreted differently on roughly
> 1 in 10 reads, but with a pattern. What about scrub? Do you get errors
> every 1 in 10 scrubs? Or how does it manifest? No scrub errors?

No errors in scrub--nor should there be.  The data is correct on disk,
and it can be read reliably if you don't use the kernel btrfs code to
read it through extent refs (scrub reads the data items directly, so
scrub never looks at data through extent refs).
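
(i.e. a foreground scrub over the same filesystem comes back clean; the
mount point is a placeholder:)

	btrfs scrub start -B /mnt/test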

btrfs just drops some of the data when reading it to userspace.

> I know very little about what parts of the kernel a file system
> depends on outside of its own code (e.g. page cache) but I wonder if
> there's something outside of Btrfs that's the source but it never gets
> triggered because no other file systems use compression. Huh - what
> file system uses compression *and* hole punching? squashfs? Is sparse
> file support different than hole punching?

Traditional sparse file support leaves blocks in a file unallocated until
they are written to, i.e. you do something like:

	write(64K)
	seek(80K)
	write(48K)

and you get a 16K hole between two extents (or contiguous block ranges
if your filesystem doesn't have a formal extent concept per se):

	data(64k)
	hole(16k)
	data(48k)
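
(A quick way to produce exactly that layout from a shell, as a sketch;
the file name and data source are arbitrary:)

	# 64K of data at offset 0
	head -c 65536 /dev/urandom > sparsefile
	# 48K of data at offset 80K; the 16K in between is never written
	head -c 49152 /dev/urandom | dd of=sparsefile bs=4096 seek=20 conv=notrunc
	# the gap shows up as a hole in the extent map
	filefrag -v sparsefile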

Traditional POSIX sparse files don't have any way to release any extents
in the middle of a file without changing the length of the file.  You can
fill in the holes with data later, but you can't delete existing data and
replace it with holes.  If you want to punch holes in a file, you used to
do it by making a copy of the file, omitting any of the data blocks that
contained all zero, then renaming the copy over the original file.

The hole punch operation adds the capability to delete existing data
in place, e.g. you can say "punch a hole at 24K, length 8K" and the
filesystem will look like:

	data(24k) (originally part of first 64K extent)
	hole(8k)
	data(32k) (originally part of first 64K extent)
	hole(16k)
	data(48k)

On btrfs, the first 32k and 24k chunks of the file are both references
to pieces of the original 64k extent, which is not modified on disk,
but 8K of it is no longer accessible.
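
(For completeness, the operation described here is the FALLOC_FL_PUNCH_HOLE
case of fallocate(2); from a shell, using the offsets of the example above:)

	# punch an 8K hole at offset 24K; the file length does not change
	fallocate --punch-hole --offset 24576 --length 8192 somefile
	# short form: fallocate -p -o 24576 -l 8192 somefile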

> -- 
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 18:13         ` Zygo Blaxell
@ 2019-02-13  7:24           ` Qu Wenruo
  2019-02-13 17:36           ` Filipe Manana
  1 sibling, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2019-02-13  7:24 UTC (permalink / raw)
  To: Zygo Blaxell, Filipe Manana; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3282 bytes --]



On 2019/2/13 2:13 AM, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>
>>> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
>>>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
>>>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>>>
>>>>> Still reproducible on 4.20.7.
>>>>
>>>> I tried your reproducer when you first reported it, on different
>>>> machines with different kernel versions.
>>>
>>> That would have been useful to know last August...  :-/
>>>
>>>> Never managed to reproduce it, nor see anything obviously wrong in
>>>> relevant code paths.
>>>
>>> I built a fresh VM running Debian stretch and
>>> reproduced the issue immediately.  Mount options are
>>> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
>>> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
>>> probably doesn't matter.
>>>
>>> I don't have any configuration that can't reproduce this issue, so I don't
>>> know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
>>> hardware ranging in age from 0 to 9 years.  Locally built kernels from
>>> 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
>>> All of these reproduce the issue immediately--wrong sha1sum appears in
>>> the first 10 loops.
>>>
>>> What is your test environment?  I can try that here.
>>
>> Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. 
> 
> I have several environments like that...
> 
>> Always built from source kernels.
> 
> ...that could be a relevant difference.  Have you tried a stock
> Debian kernel?

I'm afraid you may need to use an upstream vanilla kernel rather than a
distro kernel, especially for distros that carry heavy backports.

I also ran the test, with both the Arch stock kernel (pretty vanilla) and
an upstream kernel, on both my host and a VM.
No reproduction either way.

The upstream community is mostly focused on the upstream vanilla kernel.
Bugs in distro kernels can sometimes be a good clue to existing upstream
bugs, but when digging deeper, a vanilla kernel is always necessary.

Would you mind reproducing it in as vanilla an environment as possible,
e.g. a vanilla kernel and vanilla user space programs?

Thanks,
Qu

> 
>> I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
>> that kept running the test in an infinite loop during those weeks.
>> Don't recall what were the kernel versions (whatever was the latest at
>> the time), but that shouldn't matter according to what you say.
> 
> That's an extremely long time compared to the rate of occurrence
> of this bug.  It should appear in only a few seconds of testing.
> Some data-hole-data patterns reproduce much slower (change the position
> of "block 0" lines in the setup script), but "slower" is minutes,
> not machine-months.
> 
> Is your filesystem compressed?  Does compsize show the test
> file 'am' is compressed during the test?  Is the sha1sum you get
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> when a second process reads the file while the sha1sum/drop_caches loop
> is running?
> 
[snip]


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  2019-02-12 15:33   ` Christoph Anton Mitterer
  2019-02-12 15:35   ` Filipe Manana
@ 2019-02-13  7:47   ` Roman Mamedov
  2019-02-13  8:04     ` Qu Wenruo
  2 siblings, 1 reply; 38+ messages in thread
From: Roman Mamedov @ 2019-02-13  7:47 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Mon, 11 Feb 2019 22:09:02 -0500
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:

> Still reproducible on 4.20.7.
> 
> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> which makes the problem a bit more difficult to detect.
> 
> 	# repro-hole-corruption-test
> 	i: 91, status: 0, bytes_deduped: 131072
> 	i: 92, status: 0, bytes_deduped: 131072
> 	i: 93, status: 0, bytes_deduped: 131072
> 	i: 94, status: 0, bytes_deduped: 131072
> 	i: 95, status: 0, bytes_deduped: 131072
> 	i: 96, status: 0, bytes_deduped: 131072
> 	i: 97, status: 0, bytes_deduped: 131072
> 	i: 98, status: 0, bytes_deduped: 131072
> 	i: 99, status: 0, bytes_deduped: 131072
> 	13107200 total bytes deduped in this operation
> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 	94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

Seems like I can reproduce it as well. Vanilla 4.14.97 with .config loosely
based on Debian's.

$ sudo ./repro-hole-corruption-test 
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4.8 MiB (4964352 bytes) converted to sparse holes.
c5f25fc2b88eaab504a403465658c67f4669261e am
1d9aacd4ee38ab7db46c44e0d74cee163222e105 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

The above is on a 3TB spinning disk. But on a 512GB NVMe SSD I even got the
same checksums as you did.

$ sudo ./repro-hole-corruption-test 
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4.8 MiB (4964352 bytes) converted to sparse holes.
94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

In my case both filesystems are not mounted with compression, just chattr +c of
the directory with the script is enough to see the issue.
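
(i.e. something like the following, with no compress= option on the mount;
the directory name is arbitrary:)

	mkdir compr && chattr +c compr
	cp repro-hole-corruption-test compr/ && cd compr
	./repro-hole-corruption-test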

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-13  7:47   ` Roman Mamedov
@ 2019-02-13  8:04     ` Qu Wenruo
  0 siblings, 0 replies; 38+ messages in thread
From: Qu Wenruo @ 2019-02-13  8:04 UTC (permalink / raw)
  To: Roman Mamedov, Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3627 bytes --]



On 2019/2/13 3:47 PM, Roman Mamedov wrote:
> On Mon, 11 Feb 2019 22:09:02 -0500
> Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> 
>> Still reproducible on 4.20.7.
>>
>> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
>> which makes the problem a bit more difficult to detect.
>>
>> 	# repro-hole-corruption-test
>> 	i: 91, status: 0, bytes_deduped: 131072
>> 	i: 92, status: 0, bytes_deduped: 131072
>> 	i: 93, status: 0, bytes_deduped: 131072
>> 	i: 94, status: 0, bytes_deduped: 131072
>> 	i: 95, status: 0, bytes_deduped: 131072
>> 	i: 96, status: 0, bytes_deduped: 131072
>> 	i: 97, status: 0, bytes_deduped: 131072
>> 	i: 98, status: 0, bytes_deduped: 131072
>> 	i: 99, status: 0, bytes_deduped: 131072
>> 	13107200 total bytes deduped in this operation
>> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>> 	94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 
> Seems like I can reproduce it as well. Vanilla 4.14.97 with .config loosely
> based on Debian's.
> 
> $ sudo ./repro-hole-corruption-test 
> i: 91, status: 0, bytes_deduped: 131072
> i: 92, status: 0, bytes_deduped: 131072
> i: 93, status: 0, bytes_deduped: 131072
> i: 94, status: 0, bytes_deduped: 131072
> i: 95, status: 0, bytes_deduped: 131072
> i: 96, status: 0, bytes_deduped: 131072
> i: 97, status: 0, bytes_deduped: 131072
> i: 98, status: 0, bytes_deduped: 131072
> i: 99, status: 0, bytes_deduped: 131072
> 13107200 total bytes deduped in this operation
> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> c5f25fc2b88eaab504a403465658c67f4669261e am
> 1d9aacd4ee38ab7db46c44e0d74cee163222e105 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 
> The above is on a 3TB spinning disk. But on a 512GB NVMe SSD I even got the
> same checksums as you did.
> 
> $ sudo ./repro-hole-corruption-test 
> i: 91, status: 0, bytes_deduped: 131072
> i: 92, status: 0, bytes_deduped: 131072
> i: 93, status: 0, bytes_deduped: 131072
> i: 94, status: 0, bytes_deduped: 131072
> i: 95, status: 0, bytes_deduped: 131072
> i: 96, status: 0, bytes_deduped: 131072
> i: 97, status: 0, bytes_deduped: 131072
> i: 98, status: 0, bytes_deduped: 131072
> i: 99, status: 0, bytes_deduped: 131072
> 13107200 total bytes deduped in this operation
> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 
> In my case both filesystems are not mounted with compression,

OK, I forgot the compression mount option.

Now I can reproduce it too, on both host and VM.
I'll try to make the test case minimal enough to avoid too much noise
during testing.

Thanks,
Qu

> just chattr +c of
> the directory with the script is enough to see the issue.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 18:13         ` Zygo Blaxell
  2019-02-13  7:24           ` Qu Wenruo
@ 2019-02-13 17:36           ` Filipe Manana
  2019-02-13 18:14             ` Filipe Manana
  1 sibling, 1 reply; 38+ messages in thread
From: Filipe Manana @ 2019-02-13 17:36 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > >
> > > > > Still reproducible on 4.20.7.
> > > >
> > > > I tried your reproducer when you first reported it, on different
> > > > machines with different kernel versions.
> > >
> > > That would have been useful to know last August...  :-/
> > >
> > > > Never managed to reproduce it, nor see anything obviously wrong in
> > > > relevant code paths.
> > >
> > > I built a fresh VM running Debian stretch and
> > > reproduced the issue immediately.  Mount options are
> > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > > probably doesn't matter.
> > >
> > > I don't have any configuration that can't reproduce this issue, so I don't
> > > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > > All of these reproduce the issue immediately--wrong sha1sum appears in
> > > the first 10 loops.
> > >
> > > What is your test environment?  I can try that here.
> >
> > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc.
>
> I have several environments like that...
>
> > Always built from source kernels.
>
> ...that could be a relevant difference.  Have you tried a stock
> Debian kernel?
>
> > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> > that kept running the test in an infinite loop during those weeks.
> > Don't recall what were the kernel versions (whatever was the latest at
> > the time), but that shouldn't matter according to what you say.
>
> That's an extremely long time compared to the rate of occurrence
> of this bug.  It should appear in only a few seconds of testing.
> Some data-hole-data patterns reproduce much slower (change the position
> of "block 0" lines in the setup script), but "slower" is minutes,
> not machine-months.
>
> Is your filesystem compressed?  Does compsize show the test
> file 'am' is compressed during the test?  Is the sha1sum you get
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> when a second process reads the file while the sha1sum/drop_caches loop
> is running?

Tried it today and I got it reproduced (different vm, but still debian
and kernel built from source).
Not sure what was different last time. Yes, I had compression enabled.

I'll look into it.

>
> > > > >
> > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > > > which makes the problem a bit more difficult to detect.
> > > > >
> > > > >         # repro-hole-corruption-test
> > > > >         i: 91, status: 0, bytes_deduped: 131072
> > > > >         i: 92, status: 0, bytes_deduped: 131072
> > > > >         i: 93, status: 0, bytes_deduped: 131072
> > > > >         i: 94, status: 0, bytes_deduped: 131072
> > > > >         i: 95, status: 0, bytes_deduped: 131072
> > > > >         i: 96, status: 0, bytes_deduped: 131072
> > > > >         i: 97, status: 0, bytes_deduped: 131072
> > > > >         i: 98, status: 0, bytes_deduped: 131072
> > > > >         i: 99, status: 0, bytes_deduped: 131072
> > > > >         13107200 total bytes deduped in this operation
> > > > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >
> > > > > The sha1sum seems stable after the first drop_caches--until a second
> > > > > process tries to read the test file:
> > > > >
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >         # cat am > /dev/null              (in another shell)
> > > > >         19294e695272c42edb89ceee24bb08c13473140a am
> > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > >
> > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > > > when reading a mix of compressed extents and holes.  The bug is
> > > > > > reproducible on at least kernels v4.1..v4.18.
> > > > > >
> > > > > > Some more observations and background follow, but first here is the
> > > > > > script and some sample output:
> > > > > >
> > > > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > > > >       #!/bin/bash
> > > > > >
> > > > > >       # Write a 4096 byte block of something
> > > > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > > > >
> > > > > >       # Here is some test data with holes in it:
> > > > > >       for y in $(seq 0 100); do
> > > > > >               for x in 0 1; do
> > > > > >                       block 0;
> > > > > >                       block 21;
> > > > > >                       block 0;
> > > > > >                       block 22;
> > > > > >                       block 0;
> > > > > >                       block 0;
> > > > > >                       block 43;
> > > > > >                       block 44;
> > > > > >                       block 0;
> > > > > >                       block 0;
> > > > > >                       block 61;
> > > > > >                       block 62;
> > > > > >                       block 63;
> > > > > >                       block 64;
> > > > > >                       block 65;
> > > > > >                       block 66;
> > > > > >               done
> > > > > >       done > am
> > > > > >       sync
> > > > > >
> > > > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > > > >
> > > > > >       # Punch holes into the extent refs
> > > > > >       fallocate -v -d am
> > > > > >
> > > > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > > > >
> > > > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > > > >       i: 91, status: 0, bytes_deduped: 131072
> > > > > >       i: 92, status: 0, bytes_deduped: 131072
> > > > > >       i: 93, status: 0, bytes_deduped: 131072
> > > > > >       i: 94, status: 0, bytes_deduped: 131072
> > > > > >       i: 95, status: 0, bytes_deduped: 131072
> > > > > >       i: 96, status: 0, bytes_deduped: 131072
> > > > > >       i: 97, status: 0, bytes_deduped: 131072
> > > > > >       i: 98, status: 0, bytes_deduped: 131072
> > > > > >       i: 99, status: 0, bytes_deduped: 131072
> > > > > >       13107200 total bytes deduped in this operation
> > > > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >       ^C
> > > > > >
> > > > > > Corruption occurs most often when there is a sequence like this in a file:
> > > > > >
> > > > > >       ref 1: hole
> > > > > >       ref 2: extent A, offset 0
> > > > > >       ref 3: hole
> > > > > >       ref 4: extent A, offset 8192
> > > > > >
> > > > > > This scenario typically arises due to hole-punching or deduplication.
> > > > > > Hole-punching replaces one extent ref with two references to the same
> > > > > > extent with a hole between them, so:
> > > > > >
> > > > > >       ref 1:  extent A, offset 0, length 16384
> > > > > >
> > > > > > becomes:
> > > > > >
> > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > >       ref 2:  hole, length 8192
> > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > >
> > > > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > > > two references to one of the duplicate extents, turning this:
> > > > > >
> > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > >       ref 2:  hole, length 8192
> > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > >
> > > > > > into this:
> > > > > >
> > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > >       ref 2:  hole, length 8192
> > > > > >       ref 3:  extent A, offset 0, length 4096
> > > > > >
> > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > > > have I observed any such corruption in the wild.
> > > > > >
> > > > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > > > >
> > > > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > > > separated by a reference to a different extent; however, in this case
> > > > > > there is data to be read from a real extent, instead of pages that have
> > > > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > > > this bug, every page-oriented database engine would be crashing all the
> > > > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > > > an extent ref would look like this:
> > > > > >
> > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > >       ref 2:  extent C, offset 0, length 8192
> > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > >
> > > > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > > > however, in this case the extent references will point to different
> > > > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > > > other tools that produce sparse files) would be unusable, and it's
> > > > > > unlikely that would not have been noticed between 2015 and now either.
> > > > > > Sparse writes look like this:
> > > > > >
> > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > >       ref 2:  hole, length 8192
> > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > >
> > > > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > > > seem to trigger the corruption very frequently.
> > > > > >
> > > > > > Some patterns of holes and data produce corruption faster than others.
> > > > > > The pattern generated by the script above is based on instances of
> > > > > > corruption I've found in the wild, and has a much better repro rate than
> > > > > > random holes.
> > > > > >
> > > > > > The corruption occurs during reads, after csum verification and before
> > > > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > > > Repeated reads do eventually return correct data, but there is no way
> > > > > > for userspace to distinguish between corrupt and correct data reliably.
> > > > > >
> > > > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > > > blocks in the same extent.
> > > > > >
> > > > > > The behavior is similar to some earlier bugs related to holes and
> > > > > > compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > > > "2018 edition."
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Filipe David Manana,
> > > >
> > > > “Whether you think you can, or you think you can't — you're right.”
> > > >
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-13 17:36           ` Filipe Manana
@ 2019-02-13 18:14             ` Filipe Manana
  2019-02-14  1:22               ` Filipe Manana
  0 siblings, 1 reply; 38+ messages in thread
From: Filipe Manana @ 2019-02-13 18:14 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > >
> > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > > >
> > > > > > Still reproducible on 4.20.7.
> > > > >
> > > > > I tried your reproducer when you first reported it, on different
> > > > > machines with different kernel versions.
> > > >
> > > > That would have been useful to know last August...  :-/
> > > >
> > > > > Never managed to reproduce it, nor see anything obviously wrong in
> > > > > relevant code paths.
> > > >
> > > > I built a fresh VM running Debian stretch and
> > > > reproduced the issue immediately.  Mount options are
> > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > > > probably doesn't matter.
> > > >
> > > > I don't have any configuration that can't reproduce this issue, so I don't
> > > > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > > > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > > > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > > > All of these reproduce the issue immediately--wrong sha1sum appears in
> > > > the first 10 loops.
> > > >
> > > > What is your test environment?  I can try that here.
> > >
> > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc.
> >
> > I have several environments like that...
> >
> > > Always built from source kernels.
> >
> > ...that could be a relevant difference.  Have you tried a stock
> > Debian kernel?
> >
> > > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> > > that kept running the test in an infinite loop during those weeks.
> > > Don't recall what were the kernel versions (whatever was the latest at
> > > the time), but that shouldn't matter according to what you say.
> >
> > That's an extremely long time compared to the rate of occurrence
> > of this bug.  It should appear in only a few seconds of testing.
> > Some data-hole-data patterns reproduce much slower (change the position
> > of "block 0" lines in the setup script), but "slower" is minutes,
> > not machine-months.
> >
> > Is your filesystem compressed?  Does compsize show the test
> > file 'am' is compressed during the test?  Is the sha1sum you get
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> > when a second process reads the file while the sha1sum/drop_caches loop
> > is running?
>
> Tried it today and I got it reproduced (different vm, but still debian
> and kernel built from source).
> Not sure what was different last time. Yes, I had compression enabled.
>
> I'll look into it.

So the problem is caused by hole punching. The script can be reduced
to the following:

https://friendpaste.com/22t4OdktHQTl0aMGxckc86

file size: 384K am
digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
262144 total bytes deduped in this operation
digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
am: 24 KiB (24576 bytes) converted to sparse holes.
digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am

So hole punching is screwing things up, and the bug only becomes visible
after dropping the page cache.
I'll send a fix likely tomorrow.
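
(The "... 2" digests above are presumably re-reads after dropping the page
cache; the effect can be checked with the same trick the original repro
loop uses:)

	sha1sum am			# served from the page cache: looks fine
	sysctl -q vm.drop_caches=3	# force the next read back to disk
	sha1sum am			# the digest can now differ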

>
> >
> > > > > >
> > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > > > > which makes the problem a bit more difficult to detect.
> > > > > >
> > > > > >         # repro-hole-corruption-test
> > > > > >         i: 91, status: 0, bytes_deduped: 131072
> > > > > >         i: 92, status: 0, bytes_deduped: 131072
> > > > > >         i: 93, status: 0, bytes_deduped: 131072
> > > > > >         i: 94, status: 0, bytes_deduped: 131072
> > > > > >         i: 95, status: 0, bytes_deduped: 131072
> > > > > >         i: 96, status: 0, bytes_deduped: 131072
> > > > > >         i: 97, status: 0, bytes_deduped: 131072
> > > > > >         i: 98, status: 0, bytes_deduped: 131072
> > > > > >         i: 99, status: 0, bytes_deduped: 131072
> > > > > >         13107200 total bytes deduped in this operation
> > > > > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >
> > > > > > The sha1sum seems stable after the first drop_caches--until a second
> > > > > > process tries to read the test file:
> > > > > >
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >         # cat am > /dev/null              (in another shell)
> > > > > >         19294e695272c42edb89ceee24bb08c13473140a am
> > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > >
> > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > > > > when reading a mix of compressed extents and holes.  The bug is
> > > > > > > reproducible on at least kernels v4.1..v4.18.
> > > > > > >
> > > > > > > Some more observations and background follow, but first here is the
> > > > > > > script and some sample output:
> > > > > > >
> > > > > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > > > > >       #!/bin/bash
> > > > > > >
> > > > > > >       # Write a 4096 byte block of something
> > > > > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > > > > >
> > > > > > >       # Here is some test data with holes in it:
> > > > > > >       for y in $(seq 0 100); do
> > > > > > >               for x in 0 1; do
> > > > > > >                       block 0;
> > > > > > >                       block 21;
> > > > > > >                       block 0;
> > > > > > >                       block 22;
> > > > > > >                       block 0;
> > > > > > >                       block 0;
> > > > > > >                       block 43;
> > > > > > >                       block 44;
> > > > > > >                       block 0;
> > > > > > >                       block 0;
> > > > > > >                       block 61;
> > > > > > >                       block 62;
> > > > > > >                       block 63;
> > > > > > >                       block 64;
> > > > > > >                       block 65;
> > > > > > >                       block 66;
> > > > > > >               done
> > > > > > >       done > am
> > > > > > >       sync
> > > > > > >
> > > > > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > > > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > > > > >
> > > > > > >       # Punch holes into the extent refs
> > > > > > >       fallocate -v -d am
> > > > > > >
> > > > > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > > > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > > > > >
> > > > > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > > > > >       i: 91, status: 0, bytes_deduped: 131072
> > > > > > >       i: 92, status: 0, bytes_deduped: 131072
> > > > > > >       i: 93, status: 0, bytes_deduped: 131072
> > > > > > >       i: 94, status: 0, bytes_deduped: 131072
> > > > > > >       i: 95, status: 0, bytes_deduped: 131072
> > > > > > >       i: 96, status: 0, bytes_deduped: 131072
> > > > > > >       i: 97, status: 0, bytes_deduped: 131072
> > > > > > >       i: 98, status: 0, bytes_deduped: 131072
> > > > > > >       i: 99, status: 0, bytes_deduped: 131072
> > > > > > >       13107200 total bytes deduped in this operation
> > > > > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >       ^C
> > > > > > >
> > > > > > > Corruption occurs most often when there is a sequence like this in a file:
> > > > > > >
> > > > > > >       ref 1: hole
> > > > > > >       ref 2: extent A, offset 0
> > > > > > >       ref 3: hole
> > > > > > >       ref 4: extent A, offset 8192
> > > > > > >
> > > > > > > This scenario typically arises due to hole-punching or deduplication.
> > > > > > > Hole-punching replaces one extent ref with two references to the same
> > > > > > > extent with a hole between them, so:
> > > > > > >
> > > > > > >       ref 1:  extent A, offset 0, length 16384
> > > > > > >
> > > > > > > becomes:
> > > > > > >
> > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > >       ref 2:  hole, length 8192
> > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > >
> > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > > > > two references to one of the duplicate extents, turning this:
> > > > > > >
> > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > >       ref 2:  hole, length 8192
> > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > >
> > > > > > > into this:
> > > > > > >
> > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > >       ref 2:  hole, length 8192
> > > > > > >       ref 3:  extent A, offset 0, length 4096
> > > > > > >
> > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > > > > have I observed any such corruption in the wild.
> > > > > > >
> > > > > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > > > > >
> > > > > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > > > > separated by a reference to a different extent; however, in this case
> > > > > > > there is data to be read from a real extent, instead of pages that have
> > > > > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > > > > this bug, every page-oriented database engine would be crashing all the
> > > > > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > > > > an extent ref would look like this:
> > > > > > >
> > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > >       ref 2:  extent C, offset 0, length 8192
> > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > >
> > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > > > > however, in this case the extent references will point to different
> > > > > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > > > > other tools that produce sparse files) would be unusable, and it's
> > > > > > > unlikely that would not have been noticed between 2015 and now either.
> > > > > > > Sparse writes look like this:
> > > > > > >
> > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > >       ref 2:  hole, length 8192
> > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > >
> > > > > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > > > > seem to trigger the corruption very frequently.
> > > > > > >
> > > > > > > Some patterns of holes and data produce corruption faster than others.
> > > > > > > The pattern generated by the script above is based on instances of
> > > > > > > corruption I've found in the wild, and has a much better repro rate than
> > > > > > > random holes.
> > > > > > >
> > > > > > > The corruption occurs during reads, after csum verification and before
> > > > > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > > > > Repeated reads do eventually return correct data, but there is no way
> > > > > > > for userspace to distinguish between corrupt and correct data reliably.
> > > > > > >
> > > > > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > > > > blocks in the same extent.
> > > > > > >
> > > > > > > The behavior is similar to some earlier bugs related to holes and
> > > > > > > Compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > > > > "2018 edition."
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Filipe David Manana,
> > > > >
> > > > > “Whether you think you can, or you think you can't — you're right.”
> > > > >
> > >
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> > >
>
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-13 18:14             ` Filipe Manana
@ 2019-02-14  1:22               ` Filipe Manana
  2019-02-14  5:00                 ` Zygo Blaxell
  2019-02-14 12:21                 ` Christoph Anton Mitterer
  0 siblings, 2 replies; 38+ messages in thread
From: Filipe Manana @ 2019-02-14  1:22 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote:
> >
> > On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> > > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > >
> > > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > > > >
> > > > > > > Still reproducible on 4.20.7.
> > > > > >
> > > > > > I tried your reproducer when you first reported it, on different
> > > > > > machines with different kernel versions.
> > > > >
> > > > > That would have been useful to know last August...  :-/
> > > > >
> > > > > > Never managed to reproduce it, nor see anything obviously wrong in
> > > > > > relevant code paths.
> > > > >
> > > > > I built a fresh VM running Debian stretch and
> > > > > reproduced the issue immediately.  Mount options are
> > > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > > > > probably doesn't matter.
> > > > >
> > > > > I don't have any configuration that can't reproduce this issue, so I don't
> > > > > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > > > > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > > > > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > > > > All of these reproduce the issue immediately--wrong sha1sum appears in
> > > > > the first 10 loops.
> > > > >
> > > > > What is your test environment?  I can try that here.
> > > >
> > > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc.
> > >
> > > I have several environments like that...
> > >
> > > > Always built from source kernels.
> > >
> > > ...that could be a relevant difference.  Have you tried a stock
> > > Debian kernel?
> > >
> > > > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> > > > that kept running the test in an infinite loop during those weeks.
> > > > Don't recall what were the kernel versions (whatever was the latest at
> > > > the time), but that shouldn't matter according to what you say.
> > >
> > > That's an extremely long time compared to the rate of occurrence
> > > of this bug.  It should appear in only a few seconds of testing.
> > > Some data-hole-data patterns reproduce much slower (change the position
> > > of "block 0" lines in the setup script), but "slower" is minutes,
> > > not machine-months.
> > >
> > > Is your filesystem compressed?  Does compsize show the test
> > > file 'am' is compressed during the test?  Is the sha1sum you get
> > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> > > when a second process reads the file while the sha1sum/drop_caches loop
> > > is running?
> >
> > Tried it today and I got it reproduced (different vm, but still debian
> > and kernel built from source).
> > Not sure what was different last time. Yes, I had compression enabled.
> >
> > I'll look into it.
>
> So the problem is caused by hole punching. The script can be reduced
> to the following:
>
> https://friendpaste.com/22t4OdktHQTl0aMGxckc86
>
> file size: 384K am
> digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> 262144 total bytes deduped in this operation
> digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> am: 24 KiB (24576 bytes) converted to sparse holes.
> digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am
>
> So hole punching is screwing things, and only after dropping the page
> cache we can see the bug.
> I'll send a fix likely tomorrow.

So it turns out it's a problem in the compressed extent read path,
a variant of a bug I found back in 2015:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7

The following one liner fixes it:
https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3

While you test it there (if you want/can), I'll write a change log and
a proper test case for fstests and submit them later.

Thanks!
>
> >
> > >
> > > > > > >
> > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > > > > > which makes the problem a bit more difficult to detect.
> > > > > > >
> > > > > > >         # repro-hole-corruption-test
> > > > > > >         i: 91, status: 0, bytes_deduped: 131072
> > > > > > >         i: 92, status: 0, bytes_deduped: 131072
> > > > > > >         i: 93, status: 0, bytes_deduped: 131072
> > > > > > >         i: 94, status: 0, bytes_deduped: 131072
> > > > > > >         i: 95, status: 0, bytes_deduped: 131072
> > > > > > >         i: 96, status: 0, bytes_deduped: 131072
> > > > > > >         i: 97, status: 0, bytes_deduped: 131072
> > > > > > >         i: 98, status: 0, bytes_deduped: 131072
> > > > > > >         i: 99, status: 0, bytes_deduped: 131072
> > > > > > >         13107200 total bytes deduped in this operation
> > > > > > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >
> > > > > > > The sha1sum seems stable after the first drop_caches--until a second
> > > > > > > process tries to read the test file:
> > > > > > >
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >         # cat am > /dev/null              (in another shell)
> > > > > > >         19294e695272c42edb89ceee24bb08c13473140a am
> > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > >
> > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > > > > > when reading a mix of compressed extents and holes.  The bug is
> > > > > > > > reproducible on at least kernels v4.1..v4.18.
> > > > > > > >
> > > > > > > > Some more observations and background follow, but first here is the
> > > > > > > > script and some sample output:
> > > > > > > >
> > > > > > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > > > > > >       #!/bin/bash
> > > > > > > >
> > > > > > > >       # Write a 4096 byte block of something
> > > > > > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > > > > > >
> > > > > > > >       # Here is some test data with holes in it:
> > > > > > > >       for y in $(seq 0 100); do
> > > > > > > >               for x in 0 1; do
> > > > > > > >                       block 0;
> > > > > > > >                       block 21;
> > > > > > > >                       block 0;
> > > > > > > >                       block 22;
> > > > > > > >                       block 0;
> > > > > > > >                       block 0;
> > > > > > > >                       block 43;
> > > > > > > >                       block 44;
> > > > > > > >                       block 0;
> > > > > > > >                       block 0;
> > > > > > > >                       block 61;
> > > > > > > >                       block 62;
> > > > > > > >                       block 63;
> > > > > > > >                       block 64;
> > > > > > > >                       block 65;
> > > > > > > >                       block 66;
> > > > > > > >               done
> > > > > > > >       done > am
> > > > > > > >       sync
> > > > > > > >
> > > > > > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > > > > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > > > > > >
> > > > > > > >       # Punch holes into the extent refs
> > > > > > > >       fallocate -v -d am
> > > > > > > >
> > > > > > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > > > > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > > > > > >
> > > > > > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > > > > > >       i: 91, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 92, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 93, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 94, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 95, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 96, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 97, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 98, status: 0, bytes_deduped: 131072
> > > > > > > >       i: 99, status: 0, bytes_deduped: 131072
> > > > > > > >       13107200 total bytes deduped in this operation
> > > > > > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >       ^C
> > > > > > > >
> > > > > > > > Corruption occurs most often when there is a sequence like this in a file:
> > > > > > > >
> > > > > > > >       ref 1: hole
> > > > > > > >       ref 2: extent A, offset 0
> > > > > > > >       ref 3: hole
> > > > > > > >       ref 4: extent A, offset 8192
> > > > > > > >
> > > > > > > > This scenario typically arises due to hole-punching or deduplication.
> > > > > > > > Hole-punching replaces one extent ref with two references to the same
> > > > > > > > extent with a hole between them, so:
> > > > > > > >
> > > > > > > >       ref 1:  extent A, offset 0, length 16384
> > > > > > > >
> > > > > > > > becomes:
> > > > > > > >
> > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > >       ref 2:  hole, length 8192
> > > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > > >
> > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > > > > > two references to one of the duplicate extents, turning this:
> > > > > > > >
> > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > >       ref 2:  hole, length 8192
> > > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > > >
> > > > > > > > into this:
> > > > > > > >
> > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > >       ref 2:  hole, length 8192
> > > > > > > >       ref 3:  extent A, offset 0, length 4096
> > > > > > > >
> > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > > > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > > > > > have I observed any such corruption in the wild.
> > > > > > > >
> > > > > > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > > > > > >
> > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > > > > > separated by a reference to a different extent; however, in this case
> > > > > > > > there is data to be read from a real extent, instead of pages that have
> > > > > > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > > > > > this bug, every page-oriented database engine would be crashing all the
> > > > > > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > > > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > > > > > an extent ref would look like this:
> > > > > > > >
> > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > >       ref 2:  extent C, offset 0, length 8192
> > > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > > >
> > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > > > > > however, in this case the extent references will point to different
> > > > > > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > > > > > other tools that produce sparse files) would be unusable, and it's
> > > > > > > > unlikely that would not have been noticed between 2015 and now either.
> > > > > > > > Sparse writes look like this:
> > > > > > > >
> > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > >       ref 2:  hole, length 8192
> > > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > > >
> > > > > > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > > > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > > > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > > > > > seem to trigger the corruption very frequently.
> > > > > > > >
> > > > > > > > Some patterns of holes and data produce corruption faster than others.
> > > > > > > > The pattern generated by the script above is based on instances of
> > > > > > > > corruption I've found in the wild, and has a much better repro rate than
> > > > > > > > random holes.
> > > > > > > >
> > > > > > > > The corruption occurs during reads, after csum verification and before
> > > > > > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > > > > > Repeated reads do eventually return correct data, but there is no way
> > > > > > > > for userspace to distinguish between corrupt and correct data reliably.
> > > > > > > >
> > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > > > > > blocks in the same extent.
> > > > > > > >
> > > > > > > > The behavior is similar to some earlier bugs related to holes and
> > > > > > > > Compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > > > > > "2018 edition."
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Filipe David Manana,
> > > > > >
> > > > > > “Whether you think you can, or you think you can't — you're right.”
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Filipe David Manana,
> > > >
> > > > “Whether you think you can, or you think you can't — you're right.”
> > > >
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
>
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14  1:22               ` Filipe Manana
@ 2019-02-14  5:00                 ` Zygo Blaxell
  2019-02-14 12:21                 ` Christoph Anton Mitterer
  1 sibling, 0 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-14  5:00 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 15389 bytes --]

On Thu, Feb 14, 2019 at 01:22:49AM +0000, Filipe Manana wrote:
> On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote:
[...]
> > > Tried it today and I got it reproduced (different vm, but still debian
> > > and kernel built from source).
> > > Not sure what was different last time. Yes, I had compression enabled.
> > >
> > > I'll look into it.
> >
> > So the problem is caused by hole punching. The script can be reduced
> > to the following:
> >
> > https://friendpaste.com/22t4OdktHQTl0aMGxckc86
> >
> > file size: 384K am
> > digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > 262144 total bytes deduped in this operation
> > digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > am: 24 KiB (24576 bytes) converted to sparse holes.
> > digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am
> >
> > So hole punching is screwing things, and only after dropping the page
> > cache we can see the bug.
> > I'll send a fix likely tomorrow.
> 
> So it turns out it's a problem in the read of compressed extents part,
> a variant of a bug I found back in 2015:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7
> 
> The following one liner fixes it:
> https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
> 
> While you test it there (if you want/can), I'll write a change log and
> a proper test case for fstests and submit them later.

Works here (and produces the correct sha1sum, which turns out to be
dae78e303edfb8b8ad64ecae01dc1bf233770cfd).

Nice work!

> Thanks!
> >
> > >
> > > >
> > > > > > > >
> > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > > > > > > > which makes the problem a bit more difficult to detect.
> > > > > > > >
> > > > > > > >         # repro-hole-corruption-test
> > > > > > > >         i: 91, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 92, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 93, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 94, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 95, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 96, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 97, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 98, status: 0, bytes_deduped: 131072
> > > > > > > >         i: 99, status: 0, bytes_deduped: 131072
> > > > > > > >         13107200 total bytes deduped in this operation
> > > > > > > >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > > >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >
> > > > > > > > The sha1sum seems stable after the first drop_caches--until a second
> > > > > > > > process tries to read the test file:
> > > > > > > >
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >         # cat am > /dev/null              (in another shell)
> > > > > > > >         19294e695272c42edb89ceee24bb08c13473140a am
> > > > > > > >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > >
> > > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads
> > > > > > > > > when reading a mix of compressed extents and holes.  The bug is
> > > > > > > > > reproducible on at least kernels v4.1..v4.18.
> > > > > > > > >
> > > > > > > > > Some more observations and background follow, but first here is the
> > > > > > > > > script and some sample output:
> > > > > > > > >
> > > > > > > > >       root@rescue:/test# cat repro-hole-corruption-test
> > > > > > > > >       #!/bin/bash
> > > > > > > > >
> > > > > > > > >       # Write a 4096 byte block of something
> > > > > > > > >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > > > > > > > >
> > > > > > > > >       # Here is some test data with holes in it:
> > > > > > > > >       for y in $(seq 0 100); do
> > > > > > > > >               for x in 0 1; do
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 21;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 22;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 43;
> > > > > > > > >                       block 44;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 0;
> > > > > > > > >                       block 61;
> > > > > > > > >                       block 62;
> > > > > > > > >                       block 63;
> > > > > > > > >                       block 64;
> > > > > > > > >                       block 65;
> > > > > > > > >                       block 66;
> > > > > > > > >               done
> > > > > > > > >       done > am
> > > > > > > > >       sync
> > > > > > > > >
> > > > > > > > >       # Now replace those 101 distinct extents with 101 references to the first extent
> > > > > > > > >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> > > > > > > > >
> > > > > > > > >       # Punch holes into the extent refs
> > > > > > > > >       fallocate -v -d am
> > > > > > > > >
> > > > > > > > >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > > > > > > > >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > > > > > > > >
> > > > > > > > >       root@rescue:/test# ./repro-hole-corruption-test
> > > > > > > > >       i: 91, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 92, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 93, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 94, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 95, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 96, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 97, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 98, status: 0, bytes_deduped: 131072
> > > > > > > > >       i: 99, status: 0, bytes_deduped: 131072
> > > > > > > > >       13107200 total bytes deduped in this operation
> > > > > > > > >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       072a152355788c767b97e4e4c0e4567720988b84 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > > > > > > > >       ^C
> > > > > > > > >
> > > > > > > > > Corruption occurs most often when there is a sequence like this in a file:
> > > > > > > > >
> > > > > > > > >       ref 1: hole
> > > > > > > > >       ref 2: extent A, offset 0
> > > > > > > > >       ref 3: hole
> > > > > > > > >       ref 4: extent A, offset 8192
> > > > > > > > >
> > > > > > > > > This scenario typically arises due to hole-punching or deduplication.
> > > > > > > > > Hole-punching replaces one extent ref with two references to the same
> > > > > > > > > extent with a hole between them, so:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 16384
> > > > > > > > >
> > > > > > > > > becomes:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > > > >
> > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with
> > > > > > > > > two references to one of the duplicate extents, turning this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > > > >
> > > > > > > > > into this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent A, offset 0, length 4096
> > > > > > > > >
> > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > > > > > > > > I am not able to reproduce the issue with an uncompressed extent nor
> > > > > > > > > have I observed any such corruption in the wild.
> > > > > > > > >
> > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect.
> > > > > > > > >
> > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent
> > > > > > > > > separated by a reference to a different extent; however, in this case
> > > > > > > > > there is data to be read from a real extent, instead of pages that have
> > > > > > > > > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > > > > > > > > this bug, every page-oriented database engine would be crashing all the
> > > > > > > > > time on btrfs with compression enabled, and it's unlikely that would not
> > > > > > > > > have been noticed between 2015 and now.  An ordinary write that splits
> > > > > > > > > an extent ref would look like this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  extent C, offset 0, length 8192
> > > > > > > > >       ref 3:  extent A, offset 12288, length 4096
> > > > > > > > >
> > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole;
> > > > > > > > > however, in this case the extent references will point to different
> > > > > > > > > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > > > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > > > > > > > > other tools that produce sparse files) would be unusable, and it's
> > > > > > > > > unlikely that would not have been noticed between 2015 and now either.
> > > > > > > > > Sparse writes look like this:
> > > > > > > > >
> > > > > > > > >       ref 1:  extent A, offset 0, length 4096
> > > > > > > > >       ref 2:  hole, length 8192
> > > > > > > > >       ref 3:  extent B, offset 0, length 4096
> > > > > > > > >
> > > > > > > > > The pattern or timing of read() calls seems to be relevant.  It is very
> > > > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > > > > > > > > will see the corruption just fine.  Similar problems exist with 'cmp'
> > > > > > > > > but not 'sha1sum'.  Two processes reading the same file at the same time
> > > > > > > > > seem to trigger the corruption very frequently.
> > > > > > > > >
> > > > > > > > > Some patterns of holes and data produce corruption faster than others.
> > > > > > > > > The pattern generated by the script above is based on instances of
> > > > > > > > > corruption I've found in the wild, and has a much better repro rate than
> > > > > > > > > random holes.
> > > > > > > > >
> > > > > > > > > The corruption occurs during reads, after csum verification and before
> > > > > > > > > decompression, so btrfs detects no csum failures.  The data on disk
> > > > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed.
> > > > > > > > > Repeated reads do eventually return correct data, but there is no way
> > > > > > > > > for userspace to distinguish between corrupt and correct data reliably.
> > > > > > > > >
> > > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other
> > > > > > > > > blocks in the same extent.
> > > > > > > > >
> > > > > > > > > The behavior is similar to some earlier bugs related to holes and
> > > > > > > > > Compressed data in btrfs, but it's new and not fixed yet--hence,
> > > > > > > > > "2018 edition."
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Filipe David Manana,
> > > > > > >
> > > > > > > “Whether you think you can, or you think you can't — you're right.”
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Filipe David Manana,
> > > > >
> > > > > “Whether you think you can, or you think you can't — you're right.”
> > > > >
> > >
> > >
> > >
> > > --
> > > Filipe David Manana,
> > >
> > > “Whether you think you can, or you think you can't — you're right.”
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14  1:22               ` Filipe Manana
  2019-02-14  5:00                 ` Zygo Blaxell
@ 2019-02-14 12:21                 ` Christoph Anton Mitterer
  2019-02-15  5:40                   ` Zygo Blaxell
  2019-02-15 12:02                   ` Filipe Manana
  1 sibling, 2 replies; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-02-14 12:21 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> The following one liner fixes it:
> https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3

Great to see that fixed... is there any advice that can be given for
users/admins?


Like whether and how any corruption that has occurred can be detected
(right now, people may still have backups)?


Or under which exact circumstances did the corruption happen? And under
which was one safe?
E.g. only on specific compression algos (I've been using -o compress
(which should be zlib) for quite a while but never found any
compression),... or only when specific file operations were done (I did
e.g. cp with refcopy, but I think none of the standard tools does hole-
punching)?


Cheers,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14 12:21                 ` Christoph Anton Mitterer
@ 2019-02-15  5:40                   ` Zygo Blaxell
  2019-03-04 15:34                     ` Christoph Anton Mitterer
  2019-02-15 12:02                   ` Filipe Manana
  1 sibling, 1 reply; 38+ messages in thread
From: Zygo Blaxell @ 2019-02-15  5:40 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4815 bytes --]

On Thu, Feb 14, 2019 at 01:21:29PM +0100, Christoph Anton Mitterer wrote:
> On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> > The following one liner fixes it:
> > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
> 
> Great to see that fixed... is there any advise that can be given for
> users/admins?
> 
> 
> Like whether and how any occurred corruptions can be detected (right
> now, people may still have backups)?

The problem occurs only on reads.  Data that is written to disk will
be OK, and can be read correctly by a fixed kernel.

A kernel without the fix will give corrupt data on reads with no
indication of corruption other than the changes to the data itself.

Applications that copy data may read corrupted data and write it back
to the filesystem.  This will make the corruption permanent in the
copied data.

Given the age of the bug, backups that can be corrupted by this bug
probably already are.  Verify files against internal CRC/hashes where
possible.  The original files are likely to be OK, since the bug does
not affect writes.  If your situation has the risk factors listed below,
it may be worthwhile to create a fresh set of non-incremental backups
after applying the kernel fix.
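
One rough way to check a suspect file for this kind of read instability
on an unfixed kernel is to hash it twice with the page cache dropped in
between; differing digests point to the bug (or to bad hardware).  A
matching pair does not prove the file is unaffected, since the corruption
is intermittent.  A minimal sketch, run as root, with the file name in $f:

	# read the file twice, dropping the page cache in between;
	# differing digests mean reads of this file are not stable
	a="$(sha1sum "$f" | cut -d' ' -f1)"
	sysctl -q vm.drop_caches=3
	b="$(sha1sum "$f" | cut -d' ' -f1)"
	[ "$a" = "$b" ] || echo "$f: unstable reads detected"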

> Or under which exact circumstances did the corruption happen? And under
> which was one safe?

Compression is required to trigger the bug, so you are safe if you (or
the applications you run) never enabled filesystem compression.  Even if
compression is enabled, the file data must be compressed for the bug to
corrupt it.  Incompressible data extents will never be affected by
this bug.
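
A quick way to check whether a mount currently has compression enabled
(not exhaustive--compression can also be set per file with chattr +c or
btrfs property, and mount options may have changed over time; /mnt/data
below is just a placeholder):

	# list any compress / compress-force option active on the mount
	findmnt -no OPTIONS /mnt/data | tr ',' '\n' | grep '^compress'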

If you do use compression, you are still safe if:

	- you never punch holes in files

	- you never dedupe or clone files

If you do use compression and do the other things, the probability of
corruption by this particular bug is non-zero.  Whether you get corruption
and how often depends on the technical details of what you're doing.

To get corruption you have to have one data extent that is split in
two parts by punching a hole, or an extent that is cloned/deduped in
two parts to adjacent logical offsets in the same file.  Both of these
methods create the pattern on disk which triggers the bug.

Files that consist entirely of unique data will not be affected by dedupe
so will not trigger the bug that way.  Files that consist partially of
unique data may or may not be affected depending on the dedupe tool,
data alignment, etc.

> E.g. only on specific compression algos (I've been using -o compress
> (which should be zlib) for quite a while but never found any

All decompress algorithms are affected.  The bug is in the generic btrfs
decompression handling, so it is not limited to any single algorithm.

Compression (i.e. writing) is not affected--whatever data is written to
disk should be readable correctly with a fixed kernel.

> compression),... or only when specific file operations were done (I did
> e.g. cp with refcopy, but I think none of the standard tools does hole-
> punching)?

That depends on whether you consider fallocate or qemu to be standard
tools.  The hole-punching function has been a feature of several Linux
filesystems for some years now, so we can expect it to be more widely
adopted over time.  You'd have to do an audit to be sure none of the
tools you use are punching holes.
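
A crude way to audit a tool, assuming strace is available (the tool name
and arguments below are placeholders):

	# any output mentioning FALLOC_FL_PUNCH_HOLE means the tool punches holes
	strace -f -e trace=fallocate some_tool its_args 2>&1 | grep FALLOC_FL_PUNCH_HOLE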

"Ordinary" sparse files (made by seeking forward while writing, as done
by older Unix utilities including cp, tar, rsync, cpio, binutils) do not
trigger this bug.  An ordinary sparse file has two distinct data extents
from two different writes separated by a hole which has never contained
file data.  A punched hole splits an existing single data extent into two
pieces with a newly created hole between them that replaces previously
existing file data.  These actions create different extent reference
patterns and only the hole-punching one is affected by the bug.

Files that contain no blocks full of zeros will not be affected by
fallocate-d-style hole punching (it searches for existing zeros and
punches holes over them--no zeros, no holes).  If the hole punching
intentionally introduces zeros where zeros did not exist before (e.g. qemu
discard operations on raw image files) then it may trigger the bug.
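
For illustration, a minimal sketch of the two layouts (assuming GNU
coreutils and util-linux fallocate; file names and sizes are arbitrary):

	# ordinary sparse file: seek forward while writing; the hole never held data
	dd if=/dev/urandom of=sparse bs=4096 count=1 2>/dev/null
	dd if=/dev/urandom of=sparse bs=4096 count=1 seek=3 conv=notrunc 2>/dev/null

	# punched hole: write one compressible extent, then replace its middle
	# with a hole, leaving two references to the same extent
	yes | head -c 16384 > punched
	sync
	fallocate --punch-hole --keep-size --offset 4096 --length 8192 punched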

btrfs send and receive may be affected, but I don't use them so I don't
have any experience of the bug related to these tools.  It seems from
reading the btrfs receive code that it lacks any code capable of punching
a hole, but I'm only doing a quick search for words like "punch", not
a detailed code analysis.

bees continues to be an awesome tool for discovering btrfs kernel bugs.
It compresses, dedupes, *and* punches holes.

> 
> Cheers,
> Chris.
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14 12:21                 ` Christoph Anton Mitterer
  2019-02-15  5:40                   ` Zygo Blaxell
@ 2019-02-15 12:02                   ` Filipe Manana
  2019-03-04 15:46                     ` Christoph Anton Mitterer
  1 sibling, 1 reply; 38+ messages in thread
From: Filipe Manana @ 2019-02-15 12:02 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Thu, Feb 14, 2019 at 11:10 PM Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
>
> On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> > The following one liner fixes it:
> > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
>
> Great to see that fixed... is there any advise that can be given for
> users/admins?

Upgrade to a kernel with the patch (none yet) or build it from source?
Not sure what kind of advice you are looking for.

>
>
> Like whether and how any occurred corruptions can be detected (right
> now, people may still have backups)?
>
>
> Or under which exact circumstances did the corruption happen? And under
> which was one safe?
> E.g. only on specific compression algos (I've been using -o compress
> (which should be zlib) for quite a while but never found any
> compression),... or only when specific file operations were done (I did
> e.g. cp with refcopy, but I think none of the standard tools does hole-
> punching)?

As I said in the previous reply, and in the patch's changelog [1], the
corruption happens at read time.
That means nothing stored on disk is corrupted. It's not the end of the world.

[1] https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/

>
>
> Cheers,
> Chris.
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-15  5:40                   ` Zygo Blaxell
@ 2019-03-04 15:34                     ` Christoph Anton Mitterer
  2019-03-07 20:07                       ` Zygo Blaxell
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-04 15:34 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Hey.


Thanks for your elaborate explanations :-)


On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote:
> The problem occurs only on reads.  Data that is written to disk will
> be OK, and can be read correctly by a fixed kernel.
> 
> A kernel without the fix will give corrupt data on reads with no
> indication of corruption other than the changes to the data itself.
> 
> Applications that copy data may read corrupted data and write it back
> to the filesystem.  This will make the corruption permanent in the
> copied data.

So that basically means even a cp (without refcopy) or a btrfs
send/receive could already cause permanent silent data corruption.
Of course, only if the conditions you've described below are met.


> Given the age of the bug

Since when was it in the kernel?


> Even
> if
> compression is enabled, the file data must be compressed for the bug
> to
> corrupt it.

Is there a simple way to find files (i.e. pathnames) that were actually
compressed?


> 	- you never punch holes in files

Is there any "standard application" (like cp, tar, etc.) that would do
this?


> 	- you never dedupe or clone files

What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
send/receive be affected?


Or is there anything in btrfs itself which does either of the two by
default or on a typical system (i.e. I didn't use dedupe).


Also, did the bug only affect data, or could metadata also be
affected... basically should such filesystems be re-created since they
may also hold corruptions in the meta-data like trees and so on?


> > compression),... or only when specific file operations were done (I
> > did
> > e.g. cp with refcopy, but I think none of the standard tools does
> > hole-
> > punching)?
> That depends on whether you consider fallocate or qemu to be standard
> tools.

I assume you mean the fallocate(1) program... because I wouldn't know
whether any of cp/mv/etc. use the fallocate(2) system call by
default.


My scenario looks about the following, and given your explanations, I'd
assume I should probably be safe:

- my normal laptop doesn't use compress, so it's safe anyway

- my cp has an alias to always have --reflink=auto

- two 8TB data archive disks, each with two backup disks to which the
  data of the two master disks is btrfs sent/received,... which were
  all mounted with compress


- typically I either cp or mv data from the laptop to these disks,
  => should then be safe as the laptop fs didn't use compress,...

- or I directly create the files on the data disks (which use compress)
  by means of wget, scp or similar from other sources
  => should be safe, too, as they probably don't do dedupe/hole
     punching by default

- or I cp/mv from camera SD cards, which use some *FAT
  => so again I'd expect that to be fine

- on vacation I had the case that I put large amount of picture/videos
  from SD cards to some btrfs-with-compress mobile HDDs, and back home
  from these HDDs to my actual data HDDs.
  => here I do have the read / re-write pattern, so data could have
     been corrupted if it was compressed + deduped/hole-punched
     I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
     well)... and AFAIU there would be no deduping/hole-punching 
     involved here


- on my main data disks, I do snapshots... and these snapshots I 
  send/receive to the other (also compress-mounted) btrfs disks.
  => could these operations involve deduping/hole-punching and thus the
     corruption?


Another thing:
I always store SHA512 hashsums of files as an XATTR of them (like
"directly after" creating such files).
I assume there would be no deduping/hole-punching involved till then,
so the sums should be from correct data, right?
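
(For reference, the kind of check I mean--a sketch assuming the digest
is stored in a user.sha512 xattr and getfattr from the attr package is
available:)

	# compare the stored xattr digest against a fresh read of the file
	stored="$(getfattr --only-values -n user.sha512 "$f" 2>/dev/null)"
	actual="$(sha512sum "$f" | cut -d' ' -f1)"
	[ "$stored" = "$actual" ] || echo "$f: digest mismatch (or no stored digest)"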

But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the
final archive HDD... corruption could in principle occur when copying
from mobile HDD to archive HDD.
In that case, would a diff between the two show me the corruption? I
guess not because the diff would likely get the same corruption on
read?


> "Ordinary" sparse files (made by seeking forward while writing, as
> done
> by older Unix utilities including cp, tar, rsync, cpio, binutils) do
> not
> trigger this bug.  An ordinary sparse file has two distinct data
> extents
> from two different writes separated by a hole which has never
> contained
> file data.  A punched hole splits an existing single data extent into
> two
> pieces with a newly created hole between them that replaces
> previously
> existing file data.  These actions create different extent reference
> patterns and only the hole-punching one is affected by the bug.
> Files that contain no blocks full of zeros will not be affected by
> fallocate-d-style hole punching (it searches for existing zeros and
> punches holes over them--no zeros, no holes).  If the the hole
> punching
> intentionally introduces zeros where zeros did not exist before (e.g.
> qemu
> discard operations on raw image files) then it may trigger the bug.

So long story short, "normal" file operations (cp/mv, etc.) should not
trigger the bug.


qemu with discard would be a prominent example of triggering the bug,
but luckily for me, I only use this on an fs with compress disabled :-D
Any other such prominent examples?

I assume a normal mv or refcopy (i.e. cp --reflink=auto) would not punch
holes and thus not be affected?

Further, I'd assume XATTRs couldn't be affected?


So what remains unanswered is send/receive:

> btrfs send and receive may be affected, but I don't use them so I
> don't
> have any experience of the bug related to these tools.  It seems from
> reading the btrfs receive code that it lacks any code capable of
> punching
> a hole, but I'm only doing a quick search for words like "punch", not
> a detailed code analysis.

Is there some other developer who possibly knows whether send/receive
would have been vulnerable to the issue?


But since I use send/receive anyway in just one direction from the
master to the backup disks... only the latter could be affected.


Thanks,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-15 12:02                   ` Filipe Manana
@ 2019-03-04 15:46                     ` Christoph Anton Mitterer
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-04 15:46 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, 2019-02-15 at 12:02 +0000, Filipe Manana wrote:
> Upgrade to a kernel with the patch (none yet) or build it from
> source?
> Not sure what kind of advice you are looking for.

Well, more something of the kind that Zygo wrote in his mail, i.e. some
explanation of the whole issue in order to find out whether one might
be affected or not.


> As I said in the previous reply, and in the patch's changelog [1],
> the
> corruption happens at read time.
> That means nothing stored on disk is corrupted. It's not the end of
> the world.

Well but there are many cases where data is read and then written
again... and while Zygo's mail already answers a lot, at least the
question of whether it could happen on btrfs send/receive is still
open.


My understanding was that btrfs is considered "stable" for the normal
use cases (so e.g. perhaps without special features like raid56).

Data corruption is always quite serious, even if it's just on reads and
people may have workloads where data is read (possibly with corruption)
and (permanently) written again... so the whole thing *could* be quite
serious and IMO justifies a more thorough explanation for end-users and
not just a small commit message for developers.


Also, while it was really great to see how fast this got fixed in
the end... it's also a bit worrying that Zygo apparently reported it
already some time ago and it somehow got lost.



Cheers,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-04 15:34                     ` Christoph Anton Mitterer
@ 2019-03-07 20:07                       ` Zygo Blaxell
  2019-03-08 10:37                         ` Filipe Manana
                                           ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-03-07 20:07 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11976 bytes --]

On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote:
> Hey.
> 
> 
> Thanks for your elaborate explanations :-)
> 
> 
> On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote:
> > The problem occurs only on reads.  Data that is written to disk will
> > be OK, and can be read correctly by a fixed kernel.
> > 
> > A kernel without the fix will give corrupt data on reads with no
> > indication of corruption other than the changes to the data itself.
> > 
> > Applications that copy data may read corrupted data and write it back
> > to the filesystem.  This will make the corruption permanent in the
> > copied data.
> 
> So that basically means even a cp (without refcopy) or a btrfs
> send/receive could already cause permanent silent data corruption.
> Of course, only if the conditions you've described below are met.
> 
> 
> > Given the age of the bug
> 
> Since when was it in the kernel?

Since at least 2015.  Note that if you are looking for an end date for
"clean" data, you may be disappointed.

In 2016 there were two kernel bugs that silently corrupted reads of
compressed data.  In 2015 there were...4?  5?  Before 2015 the problems
were worse, also damaging on-disk compressed data and crashing the kernel.
The bugs that were present in 2014 were present since compression was
introduced in 2008.

With this last fix, as far as I know, we have a kernel that can read
compressed data without corruption for the first time--at least for a
subset of use cases that doesn't include direct IO.  Of course I thought
the same thing in 2017, too, but I have since proven myself wrong.

When btrfs gets to the point where it doesn't fail backup verification for
some contiguous years, then I'll be satisfied btrfs (or any filesystem)
is properly debugged.  I'll still run backup verification then, of
course--hardware breaks all the time, and broken hardware can corrupt
any data it touches.  Verification failures point to broken hardware
much more often than btrfs data corruption bugs.

> > Even
> > if
> > compression is enabled, the file data must be compressed for the bug
> > to
> > corrupt it.
> 
> Is there a simple way to find files (i.e. pathnames) that were actually
> compressed?

Run compsize (sometimes the package is named btrfs-compsize) and see if
there are any lines referring to zlib, zstd, or lzo in the output.
If it's all "total" and "none" then there's no compression in that file.

filefrag -v reports non-inline compressed data extents with the "encoded"
flag, so

	if filefrag -v "$file" | grep -qw encoded; then
		echo "$file" is compressed, do something here
	fi

might also be a solution (assuming your filename doesn't include the
string 'encoded').
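
To scan a whole tree that way, something like this should work (a
sketch; the mount point is a placeholder):

	# print every regular file that has at least one compressed (encoded) extent
	find /mnt/data -xdev -type f -print0 |
	while IFS= read -r -d '' f; do
		filefrag -v "$f" | grep -qw encoded && printf '%s\n' "$f"
	done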

> > 	- you never punch holes in files
> 
> Is there any "standard application" (like cp, tar, etc.) that would do
> this?

Legacy POSIX doesn't have the hole-punching concept, so legacy
tools won't do it; however, people add features to GNU tools all the
time, so it's hard to be 100% sure without downloading the code and
reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.

> What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
> send/receive be affected?

clone is part of some file operation syscalls (e.g. clone_file_range,
dedupe_range) which make two different files, or two different offsets in
the same file, refer to the same physical extent.  This is the basis of
deduplication (replacing separate copies with references to a single
copy) and also of punching holes (a single reference is split into
two references to the original extent with a hole object inserted in
the middle).

"reflink copy" is a synonym for "cp --reflink", which is clone_file_range
using 0 as the start of range and EOF as the end.  The term 'reflink'
is sometimes used to refer to any extent shared between files that is
not the result of a snapshot.  reflink is to extents what a hardlink is
to inodes, if you ignore some details.

To trigger the bug you need to clone the same compressed source range
to two nearly adjacent locations in the destination file (i.e. two or
more ranges in the source overlap).  cp --reflink never overlaps ranges,
so it can't create the extent pattern that triggers this bug *by itself*.
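
A hypothetical sketch of that pattern using xfs_io (assuming xfsprogs is
installed; its reflink command takes source file, source offset,
destination offset, and length, and writes into the file given to xfs_io):

	# src must already contain a compressed extent (written earlier on a
	# compress-mounted filesystem and synced); clone the same 4K range to
	# two nearby offsets in dst, leaving a hole between the two references
	xfs_io -f -c "reflink src 0 0 4096" -c "reflink src 0 12288 4096" dst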

If the source file already has extent references arranged in a way
that triggers the bug, then the copy made with cp --reflink will copy
the arrangement to the new file (i.e. if you upgrade the kernel, you
can correctly read both copies, and if you don't upgrade the kernel,
both copies will appear to be corrupted, probably the same way).

I would expect btrfs receive may be affected, but I did not find any
code in receive that would be affected.  There are a number of different
ways to make a file with a hole in it, and btrfs receive could use a
different one not affected by this bug.  I don't use send/receive myself,
so I don't have historical corruption data to guess from.

> Or is there anything in btrfs itself which does any of the two per
> default or on a typical system (i.e. I didn't use dedupe).

'btrfs' (the command-line utility) doesn't do these operations as far
as I can tell.  The kernel only does these when requested by applications.

> Also, did the bug only affect data, or could metadata also be
> affected... basically should such filesystems be re-created since they
> may also hold corruptions in the meta-data like trees and so on?

Metadata is not affected by this bug.  The bug only corrupts btrfs data
(specifically, the contents of files) in memory, not on disk.

> My scenario looks about the following, and given your explanations, I'd
> assume I should probably be safe:
> 
> - my normal laptop doesn't use compress, so it's safe anyway
> 
> - my cp has an alias to always have --reflink=auto
> 
> - two 8TB data archive disks, each with two backup disks to which the
>   data of the two master disks is btrfs sent/received,... which were
>   all mounted with compress
> 
> 
> - typically I either cp or mv data from the laptop to these disks,
>   => should then be safe as the laptop fs didn't use compress,...
> 
> - or I directly create the files on the data disks (which use compress)
>   by means of wget, scp or similar from other sources
>   => should be safe, too, as they probably don't do dedupe/hole
>      punching by default
> 
> - or I cp/mv from them camera SD cards, which use some *FAT
>   => so again I'd expect that to be fine
> 
> - on vacation I had the case that I put large amount of picture/videos
>   from SD cards to some btrfs-with-compress mobile HDDs, and back home
>   from these HDDs to my actual data HDDs.
>   => here I do have the read / re-write pattern, so data could have
>      been corrupted if it was compressed + deduped/hole-punched
>      I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
>      well)... and AFAIU there would be no deduping/hole-punching 
>      involved here

dedupe doesn't happen by itself on btrfs.  You have to run dedupe
userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup,
etc...) or build a kernel with dedupe patches.

> - on my main data disks, I do snapshots... and these snapshots I 
>   send/receive to the other (also compress-mounted) btrfs disks.
>   => could these operations involve deduping/hole-punching and thus the
>      corruption?

Snapshots won't interact with the bug--they are not affected by it
and will not trigger it.  Send could transmit incorrect data (if it
uses the kernel's readpages path internally, I don't know if it does).
Receive seems not to be affected (though it will not detect incorrect
data from send).

> Another thing:
> I always store SHA512 hashsums of files as an XATTR of them (like
> "directly after" creating such files).
> I assume there would be no deduping/hole-punching involved till then,
> so the sums should be from correct data, right?

There's no assurance of that with this method.  It's highly likely that
the hashes match the input data, because the file will usually be cached
in host RAM from when it was written, so the bug has no opportunity to
appear.  It's not impossible for other system activity to evict those
cached pages between the copy and hash, so the hash function might reread
the data from disk again and thus be exposed to the bug.

Contrast with a copy tool which integrates the SHA512 function, so
the SHA hash and the copy consume their data from the same RAM buffers.
This reduces the risk of undetected error but still does not eliminate it.
A DRAM access failure could corrupt either the data or SHA hash but not
both, so the hash will fail verification later, but you won't know if
the hash is incorrect or the data.

If the source filesystem is not btrfs (and therefore cannot have this
btrfs bug), you can calculate the SHA512 from the source filesystem and
copy that to the xattr on the btrfs filesystem.  That reduces the risk
pool for data errors to the host RAM and CPU, the source filesystem,
and the storage stack below the source filesystem (i.e.  the generic
set of problems that can occur on any system at any time and corrupt
data during copy and hash operations).
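
A minimal sketch of that approach (the user.sha512 xattr name and the
paths are just examples):

	src=/mnt/sdcard/IMG_0001.JPG               # non-btrfs source
	dst=/mnt/btrfs-archive/IMG_0001.JPG
	hash=$(sha512sum "$src" | cut -d' ' -f1)   # hash read from the source fs
	cp --reflink=auto "$src" "$dst"
	setfattr -n user.sha512 -v "$hash" "$dst"  # record the hash on the btrfs copy
	# Later verification (drop caches or reboot first so the data is
	# really read from disk):
	echo "$(getfattr --only-values -n user.sha512 "$dst")  $dst" | sha512sum -c -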

> But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the
> final archive HDD... corruption could in principle occur when copying
> from mobile HDD to archive HDD.
> In that case, would a diff between the two show me the corruption? I
> guess not because the diff would likely get the same corruption on
> read?

Upgrade your kernel before doing any verification activity; otherwise
you'll just get false results.

If you try to replace the data before upgrading the kernel, you're more
likely to introduce new corruption where corruption did not exist before,
or convert transient corruption events into permanent data corruption.
You might even miss corrupted data because the bug tends to corrupt data
in a consistent way.

Once you have a kernel with the fix applied, diff will show any corruption
in file copies, though 'cmp -l' might be much faster than diff on large
binary files.  Use just 'cmp' if you only want to know if any difference
exists but don't need detailed information, or 'cmp -s' in a shell script.
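
For example (paths are placeholders):

	cmp -s master/file backup/file && echo identical || echo differs
	cmp -l master/file backup/file | head   # byte offsets and values that differ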

>[...]
> I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch
> holes and thus be not affected?
> 
> Further, I'd assume XATTRs couldn't be affected?

XATTRs aren't compressed file data, so they aren't affected by this bug
which only affects compressed file data.

> So what remains unanswered is send/receive:
> 
> > btrfs send and receive may be affected, but I don't use them so I
> > don't
> > have any experience of the bug related to these tools.  It seems from
> > reading the btrfs receive code that it lacks any code capable of
> > punching
> > a hole, but I'm only doing a quick search for words like "punch", not
> > a detailed code analysis.
> 
> Is there some other developer who possibly knows whether send/receive
> would have been vulnerable to the issue?
> 
> 
> But since I use send/receive anyway in just one direction from the
> master to the backup disks... only the later could be affected.

I presume from this line of questioning that you are not in the habit
of verifying the SHA512 hashes on your data every few weeks or months.
If you had that step in your scheduled backup routine, then you would
already be aware of data corruption bugs that affect you--or you'd
already be reasonably confident that this bug has no impact on your setup.

If you had asked questions like "is this bug the reason why I've been
seeing random SHA hash verification failures for several years?" then
you should worry about this bug; otherwise, it probably didn't affect you.

> Thanks,
> Chris.
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-07 20:07                       ` Zygo Blaxell
@ 2019-03-08 10:37                         ` Filipe Manana
  2019-03-14 18:58                           ` Christoph Anton Mitterer
  2019-03-14 20:22                           ` Christoph Anton Mitterer
  2019-03-08 12:20                         ` Austin S. Hemmelgarn
  2019-03-14 18:58                         ` Christoph Anton Mitterer
  2 siblings, 2 replies; 38+ messages in thread
From: Filipe Manana @ 2019-03-08 10:37 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, linux-btrfs

On Thu, Mar 7, 2019 at 8:14 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote:
> > Hey.
> >
> >
> > Thanks for your elaborate explanations :-)
> >
> >
> > On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote:
> > > The problem occurs only on reads.  Data that is written to disk will
> > > be OK, and can be read correctly by a fixed kernel.
> > >
> > > A kernel without the fix will give corrupt data on reads with no
> > > indication of corruption other than the changes to the data itself.
> > >
> > > Applications that copy data may read corrupted data and write it back
> > > to the filesystem.  This will make the corruption permanent in the
> > > copied data.
> >
> > So that basically means even a cp (without refcopy) or a btrfs
> > send/receive could already cause permanent silent data corruption.
> > Of course, only if the conditions you've described below are met.
> >
> >
> > > Given the age of the bug
> >
> > Since when was it in the kernel?
>
> Since at least 2015.  Note that if you are looking for an end date for
> "clean" data, you may be disappointed.

It's been around since compression was introduced (October 2008).
The read ahead path was buggy for the case where the same compressed extent
is shared consecutively. I fixed 2 bugs there back in 2015 but missed the case
where there's a hole that makes the compressed extent be shared with a non-zero
start offset, which is the case that was fixed recently.

>
> In 2016 there were two kernel bugs that silently corrupted reads of
> compressed data.  In 2015 there were...4?  5?  Before 2015 the problems
> are worse, also damaging on-disk compressed data and crashing the kernel.
> The bugs that were present in 2014 were present since compression was
> introduced in 2008.
>
> With this last fix, as far as I know, we have a kernel that can read
> compressed data without corruption for the first time--at least for a
> subset of use cases that doesn't include direct IO.  Of course I thought
> the same thing in 2017, too, but I have since proven myself wrong.
>
> When btrfs gets to the point where it doesn't fail backup verification for
> some contiguous years, then I'll be satisfied btrfs (or any filesystem)
> is properly debugged.  I'll still run backup verification then, of
> course--hardware breaks all the time, and broken hardware can corrupt
> any data it touches.  Verification failures point to broken hardware
> much more often than btrfs data corruption bugs.
>
> > > Even
> > > if
> > > compression is enabled, the file data must be compressed for the bug
> > > to
> > > corrupt it.
> >
> > Is there a simple way to find files (i.e. pathnames) that were actually
> > compressed?
>
> Run compsize (sometimes the package is named btrfs-compsize) and see if
> there are any lines referring to zlib, zstd, or lzo in the output.
> If it's all "total" and "none" then there's no compression in that file.
>
> filefrag -v reports non-inline compressed data extents with the "encoded"
> flag, so
>
>         if filefrag -v "$file" | grep -qw encoded; then
>                 echo "$file" is compressed, do something here
>         fi
>
> might also be a solution (assuming your filename doesn't include the
> string 'encoded').
>
> > >     - you never punch holes in files
> >
> > Is there any "standard application" (like cp, tar, etc.) that would do
> > this?
>
> Legacy POSIX doesn't have the hole-punching concept, so legacy
> tools won't do it; however, people add features to GNU tools all the
> time, so it's hard to be 100% sure without downloading the code and
> reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.
>
> > What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
> > send/receive be affected?
>
> clone is part of some file operation syscalls (e.g. clone_file_range,
> dedupe_range) which make two different files, or two different offsets in
> the same file, refer to the same physical extent.  This is the basis of
> deduplication (replacing separate copies with references to a single
> copy) and also of punching holes (a single reference is split into
> two references to the original extent with a hole object inserted in
> the middle).
>
> "reflink copy" is a synonym for "cp --reflink", which is clone_file_range
> using 0 as the start of range and EOF as the end.  The term 'reflink'
> is sometimes used to refer to any extent shared between files that is
> not the result of a snapshot.  reflink is to extents what a hardlink is
> to inodes, if you ignore some details.
>
> To trigger the bug you need to clone the same compressed source range
> to two nearly adjacent locations in the destination file (i.e. two or
> more ranges in the source overlap).  cp --reflink never overlaps ranges,
> so it can't create the extent pattern that triggers this bug *by itself*.
>
> If the source file already has extent references arranged in a way
> that triggers the bug, then the copy made with cp --reflink will copy
> the arrangement to the new file (i.e. if you upgrade the kernel, you
> can correctly read both copies, and if you don't upgrade the kernel,
> both copies will appear to be corrupted, probably the same way).
>
> I would expect btrfs receive may be affected, but I did not find any
> code in receive that would be affected.  There are a number of different
> ways to make a file with a hole in it, and btrfs receive could use a
> different one not affected by this bug.  I don't use send/receive myself,
> so I don't have historical corruption data to guess from.
>
> > Or is there anything in btrfs itself which does any of the two per
> > default or on a typical system (i.e. I didn't use dedupe).
>
> 'btrfs' (the command-line utility) doesn't do these operations as far
> as I can tell.  The kernel only does these when requested by applications.
>
> > Also, did the bug only affect data, or could metadata also be
> > affected... basically should such filesystems be re-created since they
> > may also hold corruptions in the meta-data like trees and so on?
>
> Metadata is not affected by this bug.  The bug only corrupts btrfs data
> (specificially, the contents of files) in memory, not disk.
>
> > My scenario looks about the following, and given your explanations, I'd
> > assume I should probably be safe:
> >
> > - my normal laptop doesn't use compress, so it's safe anyway
> >
> > - my cp has an alias to always have --reflink=auto
> >
> > - two 8TB data archive disks, each with two backup disks to which the
> >   data of the two master disks is btrfs sent/received,... which were
> >   all mounted with compress
> >
> >
> > - typically I either cp or mv data from the laptop to these disks,
> >   => should then be safe as the laptop fs didn't use compress,...
> >
> > - or I directly create the files on the data disks (which use compress)
> >   by means of wget, scp or similar from other sources
> >   => should be safe, too, as they probably don't do dedupe/hole
> >      punching by default
> >
> > - or I cp/mv from them camera SD cards, which use some *FAT
> >   => so again I'd expect that to be fine
> >
> > - on vacation I had the case that I put large amount of picture/videos
> >   from SD cards to some btrfs-with-compress mobile HDDs, and back home
> >   from these HDDs to my actual data HDDs.
> >   => here I do have the read / re-write pattern, so data could have
> >      been corrupted if it was compressed + deduped/hole-punched
> >      I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
> >      well)... and AFAIU there would be no deduping/hole-punching
> >      involved here
>
> dedupe doesn't happen by itself on btrfs.  You have to run dedupe
> userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup,
> etc...) or build a kernel with dedupe patches.
>
> > - on my main data disks, I do snapshots... and these snapshots I
> >   send/receive to the other (also compress-mounted) btrfs disks.
> >   => could these operations involve deduping/hole-punching and thus the
> >      corruption?
>
> Snapshots won't interact with the bug--they are not affected by it
> and will not trigger it.  Send could transmit incorrect data (if it
> uses the kernel's readpages path internally, I don't know if it does).
> Receive seems not to be affected (though it will not detect incorrect
> data from send).
>
> > Another thing:
> > I always store SHA512 hashsums of files as an XATTR of them (like
> > "directly after" creating such files).
> > I assume there would be no deduping/hole-punching involved till then,
> > so the sums should be from correct data, right?
>
> There's no assurance of that with this method.  It's highly likely that
> the hashes match the input data, because the file will usually be cached
> in host RAM from when it was written, so the bug has no opportunity to
> appear.  It's not impossible for other system activity to evict those
> cached pages between the copy and hash, so the hash function might reread
> the data from disk again and thus be exposed to the bug.
>
> Contrast with a copy tool which integrates the SHA512 function, so
> the SHA hash and the copy consume their data from the same RAM buffers.
> This reduces the risk of undetected error but still does not eliminate it.
> A DRAM access failure could corrupt either the data or SHA hash but not
> both, so the hash will fail verification later, but you won't know if
> the hash is incorrect or the data.
>
> If the source filesystem is not btrfs (and therefore cannot have this
> btrfs bug), you can calculate the SHA512 from the source filesystem and
> copy that to the xattr on the btrfs filesystem.  That reduces the risk
> pool for data errors to the host RAM and CPU, the source filesystem,
> and the storage stack below the source filesystem (i.e.  the generic
> set of problems that can occur on any system at any time and corrupt
> data during copy and hash operations).
>
> > But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the
> > final archive HDD... corruption could in principle occur when copying
> > from mobile HDD to archive HDD.
> > In that case, would a diff between the two show me the corruption? I
> > guess not because the diff would likely get the same corruption on
> > read?
>
> Upgrade your kernel before doing any verification activity; otherwise
> you'll just get false results.
>
> If you try to replace the data before upgrading the kernel, you're more
> likely to introduce new corruption where corruption did not exist before,
> or convert transient corruption events into permanent data corruption.
> You might even miss corrupted data because the bug tends to corrupt data
> in a consistent way.
>
> Once you have a kernel with the fix applied, diff will show any corruption
> in file copies, though 'cmp -l' might be much faster than diff on large
> binary files.  Use just 'cmp' if you only want to know if any difference
> exists but don't need detailed information, or 'cmp -s' in a shell script.
>
> >[...]
> > I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch
> > holes and thus be not affected?
> >
> > Further, I'd assume XATTRs couldn't be affected?
>
> XATTRs aren't compressed file data, so they aren't affected by this bug
> which only affects compressed file data.
>
> > So what remains unanswered is send/receive:
> >
> > > btrfs send and receive may be affected, but I don't use them so I
> > > don't
> > > have any experience of the bug related to these tools.  It seems from
> > > reading the btrfs receive code that it lacks any code capable of
> > > punching
> > > a hole, but I'm only doing a quick search for words like "punch", not
> > > a detailed code analysis.
> >
> > Is there some other developer who possibly knows whether send/receive
> > would have been vulnerable to the issue?
> >
> >
> > But since I use send/receive anyway in just one direction from the
> > master to the backup disks... only the later could be affected.
>
> I presume from this line of questioning that you are not in the habit
> of verifying the SHA512 hashes on your data every few weeks or months.
> If you had that step in your scheduled backup routine, then you would
> already be aware of data corruption bugs that affect you--or you'd
> already be reasonably confident that this bug has no impact on your setup.
>
> If you had asked questions like "is this bug the reason why I've been
> seeing random SHA hash verification failures for several years?" then
> you should worry about this bug; otherwise, it probably didn't affect you.
>
> > Thanks,
> > Chris.
> >
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-07 20:07                       ` Zygo Blaxell
  2019-03-08 10:37                         ` Filipe Manana
@ 2019-03-08 12:20                         ` Austin S. Hemmelgarn
  2019-03-14 18:58                           ` Christoph Anton Mitterer
  2019-03-14 18:58                         ` Christoph Anton Mitterer
  2 siblings, 1 reply; 38+ messages in thread
From: Austin S. Hemmelgarn @ 2019-03-08 12:20 UTC (permalink / raw)
  To: Zygo Blaxell, Christoph Anton Mitterer; +Cc: linux-btrfs

On 2019-03-07 15:07, Zygo Blaxell wrote:
> On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote:
>> Hey.
>>
>>
>> Thanks for your elaborate explanations :-)
>>
>>
>> On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote:
>>> The problem occurs only on reads.  Data that is written to disk will
>>> be OK, and can be read correctly by a fixed kernel.
>>>
>>> A kernel without the fix will give corrupt data on reads with no
>>> indication of corruption other than the changes to the data itself.
>>>
>>> Applications that copy data may read corrupted data and write it back
>>> to the filesystem.  This will make the corruption permanent in the
>>> copied data.
>>
>> So that basically means even a cp (without refcopy) or a btrfs
>> send/receive could already cause permanent silent data corruption.
>> Of course, only if the conditions you've described below are met.
>>
>>
>>> Given the age of the bug
>>
>> Since when was it in the kernel?
> 
> Since at least 2015.  Note that if you are looking for an end date for
> "clean" data, you may be disappointed.
> 
> In 2016 there were two kernel bugs that silently corrupted reads of
> compressed data.  In 2015 there were...4?  5?  Before 2015 the problems
> are worse, also damaging on-disk compressed data and crashing the kernel.
> The bugs that were present in 2014 were present since compression was
> introduced in 2008.
> 
> With this last fix, as far as I know, we have a kernel that can read
> compressed data without corruption for the first time--at least for a
> subset of use cases that doesn't include direct IO.  Of course I thought
> the same thing in 2017, too, but I have since proven myself wrong.
> 
> When btrfs gets to the point where it doesn't fail backup verification for
> some contiguous years, then I'll be satisfied btrfs (or any filesystem)
> is properly debugged.  I'll still run backup verification then, of
> course--hardware breaks all the time, and broken hardware can corrupt
> any data it touches.  Verification failures point to broken hardware
> much more often than btrfs data corruption bugs.
> 
>>> Even
>>> if
>>> compression is enabled, the file data must be compressed for the bug
>>> to
>>> corrupt it.
>>
>> Is there a simple way to find files (i.e. pathnames) that were actually
>> compressed?
> 
> Run compsize (sometimes the package is named btrfs-compsize) and see if
> there are any lines referring to zlib, zstd, or lzo in the output.
> If it's all "total" and "none" then there's no compression in that file.
> 
> filefrag -v reports non-inline compressed data extents with the "encoded"
> flag, so
> 
> 	if filefrag -v "$file" | grep -qw encoded; then
> 		echo "$file" is compressed, do something here
> 	fi
> 
> might also be a solution (assuming your filename doesn't include the
> string 'encoded').
> 
>>> 	- you never punch holes in files
>>
>> Is there any "standard application" (like cp, tar, etc.) that would do
>> this?
> 
> Legacy POSIX doesn't have the hole-punching concept, so legacy
> tools won't do it; however, people add features to GNU tools all the
> time, so it's hard to be 100% sure without downloading the code and
> reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.
They are; the only thing they do with sparse files is create new
ones from scratch using the standard seek-then-write method.  The same
is true of the vast majority of applications as well.  The stuff most
people would have to worry about largely comes down to:
* VM software.  Some hypervisors such as QEMU can be configured to
translate discard commands issued against the emulated block devices
into fallocate calls that punch holes in the VM disk image file (and
QEMU can be configured to translate block writes of null bytes to this
too), though I know of none that do this by default (see the sketch
after this list).
* Database software.  This is what stuff like punching holes originated 
for, so it's obviously a potential source of this issue.
* FUSE filesystem drivers.  Most of them that support the required 
fallocate flag to punch holes pass it down directly.  Some make use of 
it themselves too.
* Userspace distributed storage systems.  Stuff like Ceph or Gluster. 
Same arguments as above for FUSE filesystem drivers.
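
As an illustration of the QEMU case (the option names are QEMU's own,
but the image path and the rest of the command line are made up):

	# Guest discards (and, with detect-zeroes=unmap, writes of all-zero
	# blocks) become hole punches in guest.img:
	qemu-system-x86_64 -m 2048 \
		-drive file=/var/lib/vm/guest.img,format=raw,discard=unmap,detect-zeroes=unmap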
> 
>> What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
>> send/receive be affected?
> 
> clone is part of some file operation syscalls (e.g. clone_file_range,
> dedupe_range) which make two different files, or two different offsets in
> the same file, refer to the same physical extent.  This is the basis of
> deduplication (replacing separate copies with references to a single
> copy) and also of punching holes (a single reference is split into
> two references to the original extent with a hole object inserted in
> the middle).
> 
> "reflink copy" is a synonym for "cp --reflink", which is clone_file_range
> using 0 as the start of range and EOF as the end.  The term 'reflink'
> is sometimes used to refer to any extent shared between files that is
> not the result of a snapshot.  reflink is to extents what a hardlink is
> to inodes, if you ignore some details.
> 
> To trigger the bug you need to clone the same compressed source range
> to two nearly adjacent locations in the destination file (i.e. two or
> more ranges in the source overlap).  cp --reflink never overlaps ranges,
> so it can't create the extent pattern that triggers this bug *by itself*.
> 
> If the source file already has extent references arranged in a way
> that triggers the bug, then the copy made with cp --reflink will copy
> the arrangement to the new file (i.e. if you upgrade the kernel, you
> can correctly read both copies, and if you don't upgrade the kernel,
> both copies will appear to be corrupted, probably the same way).
> 
> I would expect btrfs receive may be affected, but I did not find any
> code in receive that would be affected.  There are a number of different
> ways to make a file with a hole in it, and btrfs receive could use a
> different one not affected by this bug.  I don't use send/receive myself,
> so I don't have historical corruption data to guess from.
> 
>> Or is there anything in btrfs itself which does any of the two per
>> default or on a typical system (i.e. I didn't use dedupe).
> 
> 'btrfs' (the command-line utility) doesn't do these operations as far
> as I can tell.  The kernel only does these when requested by applications.
The receive command will issue clone operations if the sent subvolume 
requires it to get the correct block layout, so there is a 'regular' 
BTRFS operation that can in theory set things up such that the required 
patterns are more likely to happen.
> 
>> Also, did the bug only affect data, or could metadata also be
>> affected... basically should such filesystems be re-created since they
>> may also hold corruptions in the meta-data like trees and so on?
> 
> Metadata is not affected by this bug.  The bug only corrupts btrfs data
> (specificially, the contents of files) in memory, not disk.
> 
>> My scenario looks about the following, and given your explanations, I'd
>> assume I should probably be safe:
>>
>> - my normal laptop doesn't use compress, so it's safe anyway
>>
>> - my cp has an alias to always have --reflink=auto
>>
>> - two 8TB data archive disks, each with two backup disks to which the
>>    data of the two master disks is btrfs sent/received,... which were
>>    all mounted with compress
>>
>>
>> - typically I either cp or mv data from the laptop to these disks,
>>    => should then be safe as the laptop fs didn't use compress,...
>>
>> - or I directly create the files on the data disks (which use compress)
>>    by means of wget, scp or similar from other sources
>>    => should be safe, too, as they probably don't do dedupe/hole
>>       punching by default
>>
>> - or I cp/mv from them camera SD cards, which use some *FAT
>>    => so again I'd expect that to be fine
>>
>> - on vacation I had the case that I put large amount of picture/videos
>>    from SD cards to some btrfs-with-compress mobile HDDs, and back home
>>    from these HDDs to my actual data HDDs.
>>    => here I do have the read / re-write pattern, so data could have
>>       been corrupted if it was compressed + deduped/hole-punched
>>       I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
>>       well)... and AFAIU there would be no deduping/hole-punching
>>       involved here
> 
> dedupe doesn't happen by itself on btrfs.  You have to run dedupe
> userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup,
> etc...) or build a kernel with dedupe patches.
> 
>> - on my main data disks, I do snapshots... and these snapshots I
>>    send/receive to the other (also compress-mounted) btrfs disks.
>>    => could these operations involve deduping/hole-punching and thus the
>>       corruption?
> 
> Snapshots won't interact with the bug--they are not affected by it
> and will not trigger it.  Send could transmit incorrect data (if it
> uses the kernel's readpages path internally, I don't know if it does).
> Receive seems not to be affected (though it will not detect incorrect
> data from send).
> 
>> Another thing:
>> I always store SHA512 hashsums of files as an XATTR of them (like
>> "directly after" creating such files).
>> I assume there would be no deduping/hole-punching involved till then,
>> so the sums should be from correct data, right?
> 
> There's no assurance of that with this method.  It's highly likely that
> the hashes match the input data, because the file will usually be cached
> in host RAM from when it was written, so the bug has no opportunity to
> appear.  It's not impossible for other system activity to evict those
> cached pages between the copy and hash, so the hash function might reread
> the data from disk again and thus be exposed to the bug.
> 
> Contrast with a copy tool which integrates the SHA512 function, so
> the SHA hash and the copy consume their data from the same RAM buffers.
> This reduces the risk of undetected error but still does not eliminate it.
> A DRAM access failure could corrupt either the data or SHA hash but not
> both, so the hash will fail verification later, but you won't know if
> the hash is incorrect or the data.
> 
> If the source filesystem is not btrfs (and therefore cannot have this
> btrfs bug), you can calculate the SHA512 from the source filesystem and
> copy that to the xattr on the btrfs filesystem.  That reduces the risk
> pool for data errors to the host RAM and CPU, the source filesystem,
> and the storage stack below the source filesystem (i.e.  the generic
> set of problems that can occur on any system at any time and corrupt
> data during copy and hash operations).
> 
>> But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the
>> final archive HDD... corruption could in principle occur when copying
>> from mobile HDD to archive HDD.
>> In that case, would a diff between the two show me the corruption? I
>> guess not because the diff would likely get the same corruption on
>> read?
> 
> Upgrade your kernel before doing any verification activity; otherwise
> you'll just get false results.
> 
> If you try to replace the data before upgrading the kernel, you're more
> likely to introduce new corruption where corruption did not exist before,
> or convert transient corruption events into permanent data corruption.
> You might even miss corrupted data because the bug tends to corrupt data
> in a consistent way.
> 
> Once you have a kernel with the fix applied, diff will show any corruption
> in file copies, though 'cmp -l' might be much faster than diff on large
> binary files.  Use just 'cmp' if you only want to know if any difference
> exists but don't need detailed information, or 'cmp -s' in a shell script.
> 
>> [...]
>> I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch
>> holes and thus be not affected?
>>
>> Further, I'd assume XATTRs couldn't be affected?
> 
> XATTRs aren't compressed file data, so they aren't affected by this bug
> which only affects compressed file data.
> 
>> So what remains unanswered is send/receive:
>>
>>> btrfs send and receive may be affected, but I don't use them so I
>>> don't
>>> have any experience of the bug related to these tools.  It seems from
>>> reading the btrfs receive code that it lacks any code capable of
>>> punching
>>> a hole, but I'm only doing a quick search for words like "punch", not
>>> a detailed code analysis.
>>
>> Is there some other developer who possibly knows whether send/receive
>> would have been vulnerable to the issue?
>>
>>
>> But since I use send/receive anyway in just one direction from the
>> master to the backup disks... only the later could be affected.
> 
> I presume from this line of questioning that you are not in the habit
> of verifying the SHA512 hashes on your data every few weeks or months.
> If you had that step in your scheduled backup routine, then you would
> already be aware of data corruption bugs that affect you--or you'd
> already be reasonably confident that this bug has no impact on your setup.
> 
> If you had asked questions like "is this bug the reason why I've been
> seeing random SHA hash verification failures for several years?" then
> you should worry about this bug; otherwise, it probably didn't affect you.
> 
>> Thanks,
>> Chris.
>>
>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-07 20:07                       ` Zygo Blaxell
  2019-03-08 10:37                         ` Filipe Manana
  2019-03-08 12:20                         ` Austin S. Hemmelgarn
@ 2019-03-14 18:58                         ` Christoph Anton Mitterer
  2019-03-15  5:28                           ` Zygo Blaxell
  2 siblings, 1 reply; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-14 18:58 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Hey again.

And again thanks for your time and further elaborate explanations :-)


On Thu, 2019-03-07 at 15:07 -0500, Zygo Blaxell wrote:
> In 2016 there were two kernel bugs that silently corrupted reads of
> compressed data.  In 2015 there were...4?  5?  Before 2015 the
> problems
> are worse, also damaging on-disk compressed data and crashing the
> kernel.
> The bugs that were present in 2014 were present since compression was
> introduced in 2008.

Phew... too many [silent] corruption bugs in btrfs... :-(

Actually I didn't even notice the others (which unfortunately doesn't
mean I'm definitely not affected), so there's probably not much I can
do or check about them now... only about the "recent" one that has just
been fixed.

But maybe there should be something like a btrfs-announce list, i.e. a
low-volume mailing list on which (interested) users are informed about
the more serious issues.
Such things can happen and there's no one to blame for that... but when
they do happen it would be good for users to get notified, so that they
can check their systems and possibly recover data from (still existing)
other sources.


> Run compsize (sometimes the package is named btrfs-compsize) and see
> if
> there are any lines referring to zlib, zstd, or lzo in the output.
> If it's all "total" and "none" then there's no compression in that
> file.
> 
> filefrag -v reports non-inline compressed data extents with the
> "encoded"
> flag, so
> 
> 	if filefrag -v "$file" | grep -qw encoded; then
> 		echo "$file" is compressed, do something here
> 	fi
> 
> might also be a solution (assuming your filename doesn't include the
> string 'encoded').

Will have a look at this.


As for all the following:

> > > 	- you never punch holes in files
> > 
> > Is there any "standard application" (like cp, tar, etc.) that would
> > do
> > this?
> 
> Legacy POSIX doesn't have the hole-punching concept, so legacy
> tools won't do it; however, people add features to GNU tools all the
> time, so it's hard to be 100% sure without downloading the code and
> reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.
> 
> > What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
> > send/receive be affected?
> 
> clone is part of some file operation syscalls (e.g. clone_file_range,
> dedupe_range) which make two different files, or two different
> offsets in
> the same file, refer to the same physical extent.  This is the basis
> of
> deduplication (replacing separate copies with references to a single
> copy) and also of punching holes (a single reference is split into
> two references to the original extent with a hole object inserted in
> the middle).
> 
> "reflink copy" is a synonym for "cp --reflink", which is
> clone_file_range
> using 0 as the start of range and EOF as the end.  The term 'reflink'
> is sometimes used to refer to any extent shared between files that is
> not the result of a snapshot.  reflink is to extents what a hardlink
> is
> to inodes, if you ignore some details.
> 
> To trigger the bug you need to clone the same compressed source range
> to two nearly adjacent locations in the destination file (i.e. two or
> more ranges in the source overlap).  cp --reflink never overlaps
> ranges,
> so it can't create the extent pattern that triggers this bug *by
> itself*.
> 
> If the source file already has extent references arranged in a way
> that triggers the bug, then the copy made with cp --reflink will copy
> the arrangement to the new file (i.e. if you upgrade the kernel, you
> can correctly read both copies, and if you don't upgrade the kernel,
> both copies will appear to be corrupted, probably the same way).
> 
> I would expect btrfs receive may be affected, but I did not find any
> code in receive that would be affected.  There are a number of
> different
> ways to make a file with a hole in it, and btrfs receive could use a
> different one not affected by this bug.  I don't use send/receive
> myself,
> so I don't have historical corruption data to guess from.
> 
> > Or is there anything in btrfs itself which does any of the two per
> > default or on a typical system (i.e. I didn't use dedupe).
> 
> 'btrfs' (the command-line utility) doesn't do these operations as far
> as I can tell.  The kernel only does these when requested by
> applications.
> 
> > Also, did the bug only affect data, or could metadata also be
> > affected... basically should such filesystems be re-created since
> > they
> > may also hold corruptions in the meta-data like trees and so on?
> 
> Metadata is not affected by this bug.  The bug only corrupts btrfs
> data
> (specificially, the contents of files) in memory, not disk.

So all the above, AFAIU, basically boils down to the following:


Unless such hole-punched files were brought into the filesystem by one
of the rather special things like:

- dedupe
- an application that punches holes itself, of which most users will
  probably only have qemu

...a normal user should probably not have encountered the issue, as
it's not triggered by typical end-user operations (cp, mv, tar, btrfs
send/receive, cp --reflink=always/auto).

The exception is that cp --reflink=always/auto will duplicate (but
by itself not corrupt) a file that *ALREADY* has a reflink/hole
pattern that is prone to the issue.
So, AFAIU, such a file would be correctly copied, but on read it would
also suffer from the corruption, just like the original.
But again, if nothing like qemu was used in the first place, such a file
shouldn't be in the filesystem.

Further, I'd expect that if users followed the advice and used
nodatacow on their qemu images... compression would be disabled for
these as well, and they'd be safe again, right?
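
(If I understand it correctly, that would usually be done by setting the
C attribute on the still-empty image directory before the images are
created, e.g.:

	mkdir -p /var/lib/libvirt/images
	chattr +C /var/lib/libvirt/images   # new files inherit No_COW, so no compression

...but please correct me if that assumption is wrong.)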


=> Summarising... the issue is (with the exception of qemu and dedupe
users) likely not that much of an issue for normal end-users.




What about the direct IO issues that may still be present and which
you've mentioned above... is direct IO used anywhere by default / under
normal circumstances?



> > - or I directly create the files on the data disks (which use
> > compress)
> >   by means of wget, scp or similar from other sources
> >   => should be safe, too, as they probably don't do dedupe/hole
> >      punching by default
> > 
> > - or I cp/mv from them camera SD cards, which use some *FAT
> >   => so again I'd expect that to be fine
> > 
> > - on vacation I had the case that I put large amount of
> > picture/videos
> >   from SD cards to some btrfs-with-compress mobile HDDs, and back
> > home
> >   from these HDDs to my actual data HDDs.
> >   => here I do have the read / re-write pattern, so data could have
> >      been corrupted if it was compressed + deduped/hole-punched
> >      I'd guess that's anyway not the case (JPEGs/MPEGs don't
> > compress
> >      well)... and AFAIU there would be no deduping/hole-punching 
> >      involved here
> 
> dedupe doesn't happen by itself on btrfs.  You have to run dedupe
> userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes,
> bedup,
> etc...) or build a kernel with dedupe patches.

Neither of which I have, so I should be fine.


> It's highly likely
> that
> the hashes match the input data, because the file will usually be
> cached
> in host RAM from when it was written, so the bug has no opportunity
> to
> appear.

That's what I had in mind.


> It's not impossible for other system activity to evict those
> cached pages between the copy and hash, so the hash function might
> reread
> the data from disk again and thus be exposed to the bug.

Sure... which is especially likely to be the case for any larger
amounts of data that I've copied.
But anything larger is typically pictures/videos, which I would
guess/assume not to be compressed at all.
But even then I should still be safe, as cp --reflink=auto/always
doesn't introduce the bug by itself, as you've said above.
Right?


> Contrast with a copy tool which integrates the SHA512 function, so
> the SHA hash and the copy consume their data from the same RAM
> buffers.
> This reduces the risk of undetected error but still does not
> eliminate it.

Hehe, I'd like to see that in GNU coreutils ;-)


> A DRAM access failure could corrupt either the data or SHA hash but
> not
> both

Unless, against all odds in the universe... you get that one special
hash collision where corrupted file and/or hash match again :D


>  so the hash will fail verification later, but you won't know if
> the hash is incorrect or the data.

Sure, but at least I would notice and could try to recover from some
backup then.




> > But when I e.g. copy data from SD, to mobile btrfs-HDD and then to
> > the
> > final archive HDD... corruption could in principle occur when
> > copying
> > from mobile HDD to archive HDD.
> > In that case, would a diff between the two show me the corruption?
> > I
> > guess not because the diff would likely get the same corruption on
> > read?
> 
> Upgrade your kernel before doing any verification activity; otherwise
> you'll just get false results.

Well, that's clear if I do the verification *now*... I rather meant:
would a diff have noticed it in the past (when I still had the
originals)... for which the answer seems to be: possibly not.


> > But since I use send/receive anyway in just one direction from the
> > master to the backup disks... only the later could be affected.
> 
> I presume from this line of questioning that you are not in the habit
> of verifying the SHA512 hashes on your data every few weeks or
> months.

Actually I do, about every half year... my main point in the
"investigation" of my typical usage scenarios above was whether any of
them could have introduced corruption that my hashes wouldn't have
noticed.

I guess all of my patterns of moving/copying data to these main data
HDDs that used btrfs+compression should be safe (since you said cp/mv
is safe even with --reflink=always)...


The only questionable one is, where I copied data from some SD card to
an intermediate btrfs (that also used compression) and from there to
the final location on the main data HDDs.

Over time, I've used different ways to calculate the XATTRs there:
In earlier times I did it on the intermediate btrfs (which would in
principle make it susceptible to not noticing corruption - if(!) I had
not used only cp, which should be safe as you say)... followed (after
clearing the kernel cache) by a recursive diff between SD and
intermediate btrfs (assuming that btrfs' checksumming would show me any
corruption error when re-reading from disk).

Later I did it similarly to what you suggested above:
Creating hash lists from the data on the SD... also creating the hashes
for the XATTR on the intermediate btrfs (which would have again been in
principle prone to the bug)... but then diffing the two, which should
have shown me any corruption.


> If you had that step in your scheduled backup routine, then you would
> already be aware of data corruption bugs that affect you--or you'd
> already be reasonably confident that this bug has no impact on your
> setup.

I think by now I'm pretty confident that I, personally, am safe.

The main points for this were:
- XATTRs not being affected
- cp (with any value for --reflink=) never creating the corruption
(as you've said both above)

and with
- send/receive likely being safe
- snapshots not being affected
my backup disks are likely unaffected as well.
But obviously I'll check this (by verifying all hashes on the master
disks... and by diffing the masters with the copies) on a fixed kernel,
which I think has just landed in Debian unstable.


Some time ago I had to split what was previously a single 8TiB master
disk into two (both using compress) as it ran out of space.
But this should also be safe, as I used just cp --reflink=auto, which
shouldn't introduce the bug by itself AFAIU, followed by extensive
diff-ing... so especially the XATTRs should still be safe, too.

Also, I always create a list of all hash+pathname pairs from the XATTRs
(basically in sha512sum(1) format), and if I do another snapshot, I
compare the previous lists with the fresh one... so I'd have noticed any
corruption there.
So for me the main point was really, whether data could have been
already corrupted when "introduced" to the filesystem via (especially)
cp or a series of cp.


> If you had asked questions like "is this bug the reason why I've been
> seeing random SHA hash verification failures for several years?" then
> you should worry about this bug; otherwise, it probably didn't affect
> you.

I think you're right... but my data, with many thousands of pictures
etc. from all my life, is really precious to me, so I wanted to
understand the issue in "depth"... and I think these questions and your
answers may still benefit others who also want to find out whether
they could have been silently affected :-)


Cheers and thanks,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-08 10:37                         ` Filipe Manana
@ 2019-03-14 18:58                           ` Christoph Anton Mitterer
  2019-03-14 20:22                           ` Christoph Anton Mitterer
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-14 18:58 UTC (permalink / raw)
  To: fdmanana, Zygo Blaxell; +Cc: linux-btrfs

Hey again.

Just wondered about the inclusion status of this patch?

The first merge I could find from Linus was 2 days ago for the upcoming
5.1.
It doesn't seem to be in any of the stable kernels yet, not even in
5.0.x?

Is this still coming to the stable kernels for distros or could it have
gotten missed there?

Debian has it in unstable since 4.19.28-1 (see 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922306)


Cheers,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-08 12:20                         ` Austin S. Hemmelgarn
@ 2019-03-14 18:58                           ` Christoph Anton Mitterer
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-14 18:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Zygo Blaxell; +Cc: linux-btrfs

On Fri, 2019-03-08 at 07:20 -0500, Austin S. Hemmelgarn wrote:
> On 2019-03-07 15:07, Zygo Blaxell wrote:
> > Legacy POSIX doesn't have the hole-punching concept, so legacy
> > tools won't do it; however, people add features to GNU tools all
> > the
> > time, so it's hard to be 100% sure without downloading the code and
> > reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.
> > 
> They are, the only things they do with sparse files are creating new 
> ones from scratch using the standard seek then write method.  The
> same 
> is true of a vast majority of applications as well.

Thanks for your confirmation.


>   The stuff most 
> people would have to worry about largely comes down to:
> 
> * VM software.  Some hypervisors such as QEMU can be configured to 
> translate discard commands issued against the emulated block devices
> to 
> fallocate calls to punch holes in the VM disk image file (and QEMU
> can 
> be configured to translate block writes of null bytes to this too), 
> though I know of none that do this by default.
> * Database software.  This is what stuff like punching holes
> originated 
> for, so it's obviously a potential source of this issue.
> * FUSE filesystem drivers.  Most of them that support the required 
> fallocate flag to punch holes pass it down directly.  Some make use
> of 
> it themselves too.
> * Userspace distributed storage systems.  Stuff like Ceph or
> Gluster. 
> Same arguments as above for FUSE filesystem drivers.

At least these don't affect me personally, though only because I
didn't use compress where I use qemu (which I have configured to pass
the TRIMs through).


> > 'btrfs' (the command-line utility) doesn't do these operations as
> > far
> > as I can tell.  The kernel only does these when requested by
> > applications.
> The receive command will issue clone operations if the sent
> subvolume 
> requires it to get the correct block layout, so there is a 'regular' 
> BTRFS operation that can in theory set things up such that the
> required 
> patterns are more likely to happen.

As long as snapshotting itself doesn't create the issue, I should
still be safe at least on my master disks (which were always only the
source of send/receive), which I'll now compare to the backup disks.


Thanks,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-08 10:37                         ` Filipe Manana
  2019-03-14 18:58                           ` Christoph Anton Mitterer
@ 2019-03-14 20:22                           ` Christoph Anton Mitterer
  2019-03-14 22:39                             ` Filipe Manana
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-14 20:22 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

Oh and just for double checking:

In the original patch you posted, and which Zygo tested, AFAIU you
had replaced one line.
( https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 )

In the one you submitted, em->orig_start is replaced with em->start
in two places.
( https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/ )

I assume that's on purpose?

Cheers,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-14 20:22                           ` Christoph Anton Mitterer
@ 2019-03-14 22:39                             ` Filipe Manana
  0 siblings, 0 replies; 38+ messages in thread
From: Filipe Manana @ 2019-03-14 22:39 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Thu, Mar 14, 2019 at 8:22 PM Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
>
> Oh and just for double checking:
>
> In the original patch you've posted and which Zygo tested, AFAIU, you
> had one line replaced.
> ( https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 )
>
> In the one submitted there were two occasions of replacing
> em->orig_start with em->start.
> ( https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/ )
>
> I assume that's on purpose?

Yes.

>
> Cheers,
> Chris.
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-14 18:58                         ` Christoph Anton Mitterer
@ 2019-03-15  5:28                           ` Zygo Blaxell
  2019-03-16 22:11                             ` Christoph Anton Mitterer
  0 siblings, 1 reply; 38+ messages in thread
From: Zygo Blaxell @ 2019-03-15  5:28 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3384 bytes --]

On Thu, Mar 14, 2019 at 07:58:45PM +0100, Christoph Anton Mitterer wrote:
> Phew... too much [silent] corruption bugs in btrfs... :-(
> 
> Actually I didn't even notice the others (which unfortunately doesn't
> mean I'm definitely not affected), so I probably cannot much do/check
> about them now... but only about the "recent" one that was fixed now.
> 
> But maybe there should be something like a btrfs-announce list, i.e. a
> low volume mailing list, in which (interested) users are informed about
> more grave issues.
> Such things can happen and there's no one to blame about that... but if
> they happen it would be good for users to get notified so that they can
> check their systems and possibly recover data from (still existing)
> other sources.

I don't know if it would be a low-volume list...every kernel release
includes fixes for _some_ exotic corner case.

> What about the direct IO issues that may be still present and which
> you've mentioned above... is this used somewhere per default / under
> normal circumstances?

Direct IO is an odd case because it's not all that well understood
what the correct behavior is.  You can't prevent the kernel from making
copies of data and also expect full data integrity and also lock-free
performance, all at the same time.  Pick any two, and pay for it with
losses in the third.

The bug fixes here are more along the lines of "OK so you're using direct
IO which means you've basically admitted you don't care about *your* data,
let's try not to corrupt *other* data on the filesystem at the same time."

> I think by now I'm pretty confident that I, personally, am safe.

It took me two years to find this bug, and I had to write a tool to
encounter it often enough to notice.  A lot of people are safe.

> > If you had asked questions like "is this bug the reason why I've been
> > seeing random SHA hash verification failures for several years?" then
> > you should worry about this bug; otherwise, it probably didn't affect
> > you.
> 
> I think you're right... but my data with many thousands of pictures,
> etc. from all my life is really precious to me, so I wanted to better
> understand the issue in "depth"... and I think these questions and your
> answers may still benefit others who may also want to find out whether
> they could have been silently affected :-)

I found the 2017 compression bug in a lot of digital photographs.
It turns out that several popular cameras (including some of the ones I
own) put a big chunk of zeros near the beginnings of JPG files, and when
rsync copies those it will insert a hole instead of copying the zeros.
The 2017 bug affected "ordinary" holes so standard tools like cp and
rsync could trigger it.  Most photo tools ignore this data completely,
so when garbage appears there, nobody notices.

A similar thing happens to .o files:  ld aligns things to 4K block
boundaries, triggering the 2017 compressed read bug.  Nobody reads that
data either--it's just alignment padding.

I don't think I found an application that cared about the 2017 bug at all.
Only backup verifications.

The 2018 bug is a different story--when it hits, it's obvious, and
ordinary applications break--but it won't happen to typical photo
image files, even with aggressive dedupe.

> 
> Cheers and thanks,
> Chris.
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-15  5:28                           ` Zygo Blaxell
@ 2019-03-16 22:11                             ` Christoph Anton Mitterer
  2019-03-17  2:54                               ` Zygo Blaxell
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Anton Mitterer @ 2019-03-16 22:11 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Fri, 2019-03-15 at 01:28 -0400, Zygo Blaxell wrote:
> > But maybe there should be something like a btrfs-announce list,
> > i.e. a low volume mailing list, in which (interested) users are
> > informed about more grave issues.
> > …
> I don't know if it would be a low-volume list...every kernel release
> includes fixes for _some_ exotic corner case.

Well this one *may* be exotic for many users, but we have at least the
use case of qemu which seems to be not that exotic at all.

And the ones you outline below seem even more common?


Also the other means for end-users to know whether something is stable
or not like https://btrfs.wiki.kernel.org/index.php/Status don't seem
to really work out.

There is a known silent data corruption bug which seems so far only
fixed in 5.1rc* ... and the page still says stable since 4.14.
Even now with the fix, one would probably need to wait a year or so
before marking it stable again, if nothing else had been found by then.



> > What about the direct IO issues that may be still present and which
> > you've mentioned above... is this used somewhere per default / under
> > normal circumstances?
> 
> Direct IO is an odd case because it's not all that well understood
> what the correct behavior is.  You can't prevent the kernel from
> making copies of data and also expect full data integrity and also
> lock-free performance, all at the same time.  Pick any two, and pay
> for it with losses in the third.
> 
> The bug fixes here are more along the lines of "OK so you're using
> direct IO which means you've basically admitted you don't care about
> *your* data, let's try not to corrupt *other* data on the filesystem
> at the same time."

So... if btrfs allows for direct IO... and if this isn't stable in some
situations,... what can one do about it? I mean there doesn't seem to
be an option to disallow it... and any program can do O_DIRECT (without
even knowing btrfs is below).
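
For illustration only (hypothetical mount point and file name): even a
plain dd invocation can ask for O_DIRECT, and btrfs just sees the
resulting open() flags:

	# dd requesting direct IO writes; the filesystem has no say in it
	dd if=/dev/zero of=/mnt/btrfs/vm-image.raw bs=1M count=16 oflag=direct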




Guess I have to go deeper down the rabbit hole now for the other
compression bugs...


> I found the 2017 compression bug in a lot of digital photographs.

Is there any way (apart from having correct checksums) to find out
whether a file was affected by the 2017-bug?
Like, I don't know,.. looking for large chunks of zeros?


And is there any more detailed information available on the 2017-bug,
in the sense under which occasions it occurred?

Like also only on reads (which would mean again that I'd be mostly
safe, because my checksums should mostly catch this)?

Or just on dedupe or hole punching? Or did it only affect sparse files
(and there only the holes (blocks of zeros) as in your camera JPG
example)?


> It turns out that several popular cameras (including some of the ones
> I own) put a big chunk of zeros near the beginnings of JPG files, and
> when rsync copies those it will insert a hole instead of copying the
> zeros.

Many other types of files may have such bigger chunks of zeros too...
basically everything that leaves room for metadata.


> The 2017 bug affected "ordinary" holes so standard tools like cp and
> rsync could trigger it.

AFAIU, both cp and rsync (--sparse) don't actively create sparse files
by default,... cp (by default) only creates sparse files when it
detects that the source file is already sparse.
The same seems to be the case for tar, which only stores a file sparse
(inside the archive) when --sparse is used.

So would one be safe from the 2017 bug if one hadn't had sparse files
and hadn't activated sparse handling in any of these tools?


>   Most photo tools ignore this data completely,
> so when garbage appears there, nobody notices.

So the 2017-bug meant that areas that should be zero were filled with
garbage but everything else was preserved correctly?



> I don't think I found an application that cared about the 2017 bug at
> all.

Well for me it would be still helpful to know how to find out whether I
might have been affected or not... I do have some really old backups so
recovery would be possible in many cases.



> The 2018 bug is a different story--when it hits, it's obvious, and
> ordinary applications break

Which one do you mean now? The one recently fixed on
reads+holepunching/dedupe/clone? Cause I thought that one was not
that obvious as it was silent...


Anything still known about the even older compression related
corruption bugs that Filipe mentioned, in the sense when they occurred
and how to find out whether one was affected?


Thanks,
Chris.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-03-16 22:11                             ` Christoph Anton Mitterer
@ 2019-03-17  2:54                               ` Zygo Blaxell
  0 siblings, 0 replies; 38+ messages in thread
From: Zygo Blaxell @ 2019-03-17  2:54 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 8858 bytes --]

On Sat, Mar 16, 2019 at 11:11:10PM +0100, Christoph Anton Mitterer wrote:
> On Fri, 2019-03-15 at 01:28 -0400, Zygo Blaxell wrote:
> > > But maybe there should be something like a btrfs-announce list,
> > > i.e. a low volume mailing list, in which (interested) users are
> > > informed about more grave issues.
> > > …
> > I don't know if it would be a low-volume list...every kernel release
> > includes fixes for _some_ exotic corner case.
> 
> Well this one *may* be exotic for many users, but we have at least the
> use case of qemu which seems to be not that exotic at all.
> 
> And the ones you outline below seem even more common?
> 
> Also the other means for end-users to know whether something is stable
> or not like https://btrfs.wiki.kernel.org/index.php/Status don't seem
> to really work out.

It's hard to separate the signal from the noise.  I first detected
the 2018 bug in 2016, but didn't know it was a distinct bug until
after eliminating all the other corruption causes that occurred during
that time.  I am still tracking issue(s) in btrfs that bring servers
down multiple times a week, so I'm not in a hurry to declare any part
of btrfs stable yet.

When could we ever confidently say btrfs is stable?  Some filesystems
are 30 years old and still fixing bugs.  See you in 2037?

Now, that specific wiki page should probably be updated, since at least
one outstanding bug is now known.

> There is a known silent data corruption bug which seems so far only
> fixed in 5.1rc* ... and the page still says stable since 4.14.
> Even now with the fix, one would probably need to wait a year or so
> before marking it stable again, if nothing else had been found by then.

I sometimes use "it has been $N days since the last bug fix in $Y" as a
crude metric of how trustworthy code is.  adfs is 2913 days and counting!
ext2 is only 106 days.  btrfs and xfs seem to be competing for the lowest
value of N, never rising above a few dozen except around holidays and
conferences, with ext4 not far behind.
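
A rough sketch of that metric, assuming a local kernel git checkout at
~/linux (this counts days since the newest commit touching the
subsystem, which only approximates "days since the last bug fix"):

	# days since the last commit that touched fs/btrfs
	last=$(git -C ~/linux log -1 --format=%ct -- fs/btrfs)
	echo "$(( ( $(date +%s) - last ) / 86400 )) days since the last fs/btrfs commit"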

> So... if btrfs allows for direct IO... and if this isn't stable in some
> situations,... what can one do about it? I mean there doesn't seem to
> be an option to disallow it... 

Sure, but O_DIRECT is a performance/risk tradeoff.  If you ask someone who
uses csums or snapshots, they'll tell you btrfs should always put correct
data and checksums on disk, even if the application does something weird
and undefined like O_DIRECT.  If you ask someone who wants the O_DIRECT
performance, they'll tell you O_DIRECT should not waste time computing,
verifying, reading, or writing csums, nor should users expect correct
behavior from applications that don't follow the filesystem-specific
rules correctly (for some implied definition of how correct applications
should behave, because O_DIRECT is not a concrete specification), and that
includes permitting undetected data corruption to be persisted on disk.

> and any program can do O_DIRECT (without even knowing btrfs is below).

Most filesystems permit silent data corruption all of the time, so btrfs
is weird for disallowing silent data corruption some of the time.

> Guess I have to go deeper down the rabbit hole now for the other
> compression bugs...
> 
> 
> > I found the 2017 compression bug in a lot of digital photographs.
> 
> Is there any way (apart from having correct checksums) to find out
> whether a file was affected by the 2017-bug?
> Like, I don't know,.. looking for large chunks of zeros?

You need to have an inline extent in the first 4096 bytes of the file
and data starting at 4096 bytes.  Normally that never happens, but it
is possible to construct files that way with the right sequences of
write(), seek(), and fsync().  They occur naturally in about one out
of every 100,000 files copied with 'rsync -S', which triggers a similar
sequence of operations internally in the kernel.
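
A minimal sketch of constructing that layout by hand (hypothetical file
name; whether the first write really ends up as an inline extent depends
on the mount options, e.g. compression and max_inline, and the kernel):

	# small initial write, committed before anything else touches the file,
	# so it is a candidate for an inline extent
	head -c 100 /dev/urandom > testfile
	sync
	# write one block at offset 4096 without truncating, leaving bytes
	# 100..4095 as a hole
	dd if=/dev/urandom of=testfile bs=4096 seek=1 count=1 conv=notrunc
	sync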

The symptom is that the corrupted file has uninitialized kernel memory in
the last bytes of the first 4096 byte block, when the correct file has
0 bytes there.  It turns out that uninitialized kernel memory is often
full of zeros anyway, so even "corrupted" files come out unchanged most
of the time.

If you don't know what is supposed to be in those bytes (either from
the file format, an uncorrupted copy of the file, or unexpected behavior
when the file is used) then there's no way to know they're wrong.

> And is there any more detailed information available on the 2017-bug,
> in the sense under which occasions it occurred?

The kernel commit message for the fix is quite detailed.

> Like also only on reads (which would mean again that I'd be mostly
> safe, because my checksums should mostly catch this)?

Only reads, and only files with a specific structure, and only at a
single specific location in the file.

> Or just on dedupe or hole punching? Or did it only affect sparse files
> (and there only the holes (blocks of zeros) as in your camera JPG
> example)?

You can't get the 2017 bug with dedupe--inline extents are not dedupable.
You do need a sparse file.

I didn't find the 2017 bug because of bees--I found it because of
rsync -S.

> > It turns out that several popular cameras (including some of the
> > ones I own) put a big chunk of zeros near the beginnings of JPG
> > files, and when rsync copies those it will insert a hole instead of
> > copying the zeros.
> 
> Many other types of files may have such bigger chunks of zeros too...
> basically everything that leaves room for metadata.

Only contiguous chunks of 0 that end at byte 4096 can be affected.
0 anywhere else in the file is the domain of the 2018 bug.  Also 2017
replaces 0 with invalid data, while 2018 replaces valid data with 0.
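
A crude way to narrow down which files even had the susceptible layout
(hypothetical path; this only checks for zeros ending at byte 4096 and
ignores whether the first extent is actually inline, so it over-reports):

	# flag files >4096 bytes whose bytes 4032..4095 are all zero; copies of
	# these made through an affected kernel are the ones worth re-verifying
	find /data -type f -size +4096c -print0 |
	while IFS= read -r -d '' f; do
		nonzero=$(dd if="$f" bs=1 skip=4032 count=64 2>/dev/null | tr -d '\0' | wc -c)
		[ "$nonzero" -eq 0 ] && echo "candidate: $f"
	done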

> AFAIU, both cp and rsync (--sparse) don't actively create sparse files
> by default,... cp (by default) only creates sparse files when it
> detects that the source file is already sparse.
> The same seems to be the case for tar, which only stores a file sparse
> (inside the archive) when --sparse is used.
> 
> So would one be safe from the 2017 bug if one hadn't had sparse files
> and hadn't activated sparse handling in any of these tools?

Probably.  Even "unsafe" is less than a 1 in 100,000 event, so you're
often safe even when using triggering tools (especially if the system
is lightly loaded).  Lots of tools make sparse files.
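
If you want to see how many sparse files you already have, GNU find can
report sparseness directly (example path; %S is allocated size divided
by apparent size):

	# list files whose allocated blocks cover less than their apparent size
	find /data -type f -printf '%S\t%p\n' | awk -F'\t' '$1 < 1.0 {print $2}'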

> >   Most photo tools ignore this data completely,
> > so when garbage appears there, nobody notices.
> 
> So the 2017-bug meant that areas that should be zero were filled with
> garbage but everything else was preserved correctly?

Yep.

> > I don't think I found an application that cared about the 2017 bug at
> > all.
> 
> Well for me it would be still helpful to know how to find out whether I
> might have been affected or not... I do have some really old backups so
> recovery would be possible in many cases.

You could compare those backups to current copies before discarding them.
Or build a SHA table and keep a copy of it on online media for verification.
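
Something along these lines works for the SHA-table idea (paths are
hypothetical):

	# record checksums of the backup tree...
	( cd /mnt/backup && find . -type f -print0 | xargs -0 sha256sum ) > "$HOME/backup.sha256"
	# ...later, verify the live copy against it, printing only mismatches
	( cd /data && sha256sum --check --quiet "$HOME/backup.sha256" )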

> > The 2018 bug is a different story--when it hits, it's obvious, and
> > ordinary applications break
> 
> Which one do you mean now? The one recently fixed on
> reads+holepunching/dedupe/clone? Cause I thought that one was not
> that obvious as it was silent...

Many applications will squawk if you delete 32K of data randomly from
the middle of their data files.  There are crashes, garbage output,
error messages, corrupted VM filesystem images (i.e. the guest's fsck
complains).  A lot of issues magically disappear after applying the
"2018" fix.

> Anything still known about the even older compression related
> corruption bugs that Filipe mentioned, in the sense when they occurred
> and how to find out whether one was affected?

Kernels from 2015 and earlier had assorted problems with compressed data.
It's difficult to distinguish between them, or isolate specific syndromes
to specific bug fixes.  Not all of them were silent--there was a bug
in 2014 that returned EIO instead of data when reading files affected
by the 2017 bug (that change in behavior was a good clue about where
to look for the 2017 fix).  One of the bugs eventually manifests itself
as a broken filesystem or a kernel panic when you write to an affected
area of a file.

It's more practical to just assume anything stored on btrfs with
compression on a kernel prior to 2015 is suspect until proven otherwise.
In 2014 and earlier, you have to start suspecting uncompressed data too.
Kernels between 2012 and 2014 crashed so often it was difficult to run
data integrity verification tests with a significant corpus size.

> 
> Thanks,
> Chris.
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2019-03-17  2:54 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-23  3:11 Reproducer for "compressed data + hole data corruption bug, 2018 editiion" Zygo Blaxell
2018-08-23  5:10 ` Qu Wenruo
2018-08-23 16:44   ` Zygo Blaxell
2018-08-23 23:50     ` Qu Wenruo
2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
2019-02-12 15:33   ` Christoph Anton Mitterer
2019-02-12 15:35   ` Filipe Manana
2019-02-12 17:01     ` Zygo Blaxell
2019-02-12 17:56       ` Filipe Manana
2019-02-12 18:13         ` Zygo Blaxell
2019-02-13  7:24           ` Qu Wenruo
2019-02-13 17:36           ` Filipe Manana
2019-02-13 18:14             ` Filipe Manana
2019-02-14  1:22               ` Filipe Manana
2019-02-14  5:00                 ` Zygo Blaxell
2019-02-14 12:21                 ` Christoph Anton Mitterer
2019-02-15  5:40                   ` Zygo Blaxell
2019-03-04 15:34                     ` Christoph Anton Mitterer
2019-03-07 20:07                       ` Zygo Blaxell
2019-03-08 10:37                         ` Filipe Manana
2019-03-14 18:58                           ` Christoph Anton Mitterer
2019-03-14 20:22                           ` Christoph Anton Mitterer
2019-03-14 22:39                             ` Filipe Manana
2019-03-08 12:20                         ` Austin S. Hemmelgarn
2019-03-14 18:58                           ` Christoph Anton Mitterer
2019-03-14 18:58                         ` Christoph Anton Mitterer
2019-03-15  5:28                           ` Zygo Blaxell
2019-03-16 22:11                             ` Christoph Anton Mitterer
2019-03-17  2:54                               ` Zygo Blaxell
2019-02-15 12:02                   ` Filipe Manana
2019-03-04 15:46                     ` Christoph Anton Mitterer
2019-02-12 18:58       ` Andrei Borzenkov
2019-02-12 21:48         ` Chris Murphy
2019-02-12 22:11           ` Zygo Blaxell
2019-02-12 22:53             ` Chris Murphy
2019-02-13  2:46               ` Zygo Blaxell
2019-02-13  7:47   ` Roman Mamedov
2019-02-13  8:04     ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).