Linux-BTRFS Archive on lore.kernel.org
* Reproducer for "compressed data + hole data corruption bug, 2018 edition"
@ 2018-08-23  3:11 Zygo Blaxell
  2018-08-23  5:10 ` Qu Wenruo
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  0 siblings, 2 replies; 25+ messages in thread
From: Zygo Blaxell @ 2018-08-23  3:11 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6482 bytes --]

This is a repro script for a btrfs bug that causes corrupted data reads
when reading a mix of compressed extents and holes.  The bug is
reproducible on at least kernels v4.1..v4.18.

Some more observations and background follow, but first here is the
script and some sample output:

	root@rescue:/test# cat repro-hole-corruption-test
	#!/bin/bash

	# Write a 4096 byte block of something
	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

	# Here is some test data with holes in it:
	for y in $(seq 0 100); do
		for x in 0 1; do
			block 0;
			block 21;
			block 0;
			block 22;
			block 0;
			block 0;
			block 43;
			block 44;
			block 0;
			block 0;
			block 61;
			block 62;
			block 63;
			block 64;
			block 65;
			block 66;
		done
	done > am
	sync

	# Now replace those 101 distinct extents with 101 references to the first extent
	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

	# Punch holes into the extent refs
	fallocate -v -d am

	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done

	root@rescue:/test# ./repro-hole-corruption-test
	i: 91, status: 0, bytes_deduped: 131072
	i: 92, status: 0, bytes_deduped: 131072
	i: 93, status: 0, bytes_deduped: 131072
	i: 94, status: 0, bytes_deduped: 131072
	i: 95, status: 0, bytes_deduped: 131072
	i: 96, status: 0, bytes_deduped: 131072
	i: 97, status: 0, bytes_deduped: 131072
	i: 98, status: 0, bytes_deduped: 131072
	i: 99, status: 0, bytes_deduped: 131072
	13107200 total bytes deduped in this operation
	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	072a152355788c767b97e4e4c0e4567720988b84 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	60831f0e7ffe4b49722612c18685c09f4583b1df am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	^C

Corruption occurs most often when there is a sequence like this in a file:

	ref 1: hole
	ref 2: extent A, offset 0
	ref 3: hole
	ref 4: extent A, offset 8192

This scenario typically arises due to hole-punching or deduplication.
Hole-punching replaces one extent ref with two references to the same
extent with a hole between them, so:

	ref 1:  extent A, offset 0, length 16384

becomes:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent A, offset 12288, length 4096
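
For illustration, here is a minimal sketch of that transformation
(assuming a compress-enabled btrfs mount; the file name and offsets are
illustrative, not taken from the reproducer above):

	# one compressible 16384-byte extent A
	yes | head -c 16384 > punched
	sync
	# punch out the middle 8192 bytes: extent A, hole, extent A
	fallocate -p -o 4096 -l 8192 punched
	filefrag -v punched    # inspect the resulting refs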

Deduplication replaces two distinct extent refs surrounding a hole with
two references to one of the duplicate extents, turning this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent B, offset 0, length 4096

into this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent A, offset 0, length 4096
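
A minimal sketch of that dedup transformation, assuming xfs_io is
available (the script above uses btrfs-extent-same for the same
operation; the file name is illustrative):

	# extent A at offset 0, hole at 4096..12288, duplicate extent B at 12288
	yes | head -c 4096 > dup
	yes | head -c 4096 | dd of=dup bs=4096 seek=3 conv=notrunc
	sync
	# dedupe extent B against extent A: both refs now point at extent A
	xfs_io -c "dedupe dup 0 12288 4096" dup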

Compression is required (zlib, zstd, or lzo) for corruption to occur.
I am not able to reproduce the issue with an uncompressed extent nor
have I observed any such corruption in the wild.

The presence or absence of the no-holes filesystem feature has no effect.
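
(For anyone retrying both configurations, a sketch of the setup; the
device name is illustrative:)

	mkfs.btrfs -f -O no-holes /dev/vdb   # with the no-holes feature
	mkfs.btrfs -f /dev/vdb               # without it
	mount -o compress /dev/vdb /test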

Ordinary writes can lead to pairs of extent references to the same extent
separated by a reference to a different extent; however, in this case
there is data to be read from a real extent, instead of pages that have
to be zero filled from a hole.  If ordinary non-hole writes could trigger
this bug, every page-oriented database engine would be crashing all the
time on btrfs with compression enabled, and it's unlikely that this would not
have been noticed between 2015 and now.  An ordinary write that splits
an extent ref would look like this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  extent C, offset 0, length 8192
	ref 3:  extent A, offset 12288, length 4096
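
A sketch of producing that layout with ordinary writes (file name
illustrative; since there is no hole, this layout does not trigger
the bug):

	yes | head -c 16384 > plain    # extent A, 16384 bytes
	sync
	# overwrite the middle 8192 bytes; the new data becomes extent C
	yes C | head -c 8192 | dd of=plain bs=4096 seek=1 conv=notrunc
	sync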

Sparse writes can lead to pairs of extent references surrounding a hole;
however, in this case the extent references will point to different
extents, avoiding the bug.  If a sparse write could trigger the bug,
the rsync -S option and qemu/kvm 'raw' disk image files (among many
other tools that produce sparse files) would be unusable, and it's
unlikely that this would not have been noticed between 2015 and now either.
Sparse writes look like this:

	ref 1:  extent A, offset 0, length 4096
	ref 2:  hole, length 8192
	ref 3:  extent B, offset 0, length 4096

The pattern or timing of read() calls seems to be relevant.  It is very
hard to see the corruption when reading files with 'hd', but 'cat | hd'
will see the corruption just fine.  Similar problems exist with 'cmp'
but not 'sha1sum'.  Two processes reading the same file at the same time
seem to trigger the corruption very frequently.
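
Based on those observations, a detection loop might look like this
(a sketch that assumes the known-good sha1sum was captured from a
clean first read):

	good=$(sha1sum am | cut -d' ' -f1)
	for i in $(seq 1 100); do
		sysctl -q vm.drop_caches=3
		cat am > /dev/null &    # second reader raises the hit rate
		sum=$(sha1sum am | cut -d' ' -f1)
		wait
		[ "$sum" = "$good" ] || echo "bad read #$i: $sum"
	done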

Some patterns of holes and data produce corruption faster than others.
The pattern generated by the script above is based on instances of
corruption I've found in the wild, and has a much better repro rate than
random holes.

The corruption occurs during reads, after csum verification and before
decompression, so btrfs detects no csum failures.  The data on disk
seems to be OK and could be read correctly once the kernel bug is fixed.
Repeated reads do eventually return correct data, but there is no way
for userspace to distinguish between corrupt and correct data reliably.

The corrupted data usually consists of zeroes, as if the data had been
replaced by a hole, or of copies of other blocks in the same extent.

The behavior is similar to some earlier bugs related to holes and
compressed data in btrfs, but it's new and not fixed yet--hence,
"2018 edition."

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
  2018-08-23  3:11 Reproducer for "compressed data + hole data corruption bug, 2018 edition" Zygo Blaxell
@ 2018-08-23  5:10 ` Qu Wenruo
  2018-08-23 16:44   ` Zygo Blaxell
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  1 sibling, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2018-08-23  5:10 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 7798 bytes --]



On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> This is a repro script for a btrfs bug that causes corrupted data reads
> when reading a mix of compressed extents and holes.  The bug is
> reproducible on at least kernels v4.1..v4.18.

This bug already sounds more serious than the previous nodatasum +
compression bug.

> 
> Some more observations and background follow, but first here is the
> script and some sample output:
> 
> 	root@rescue:/test# cat repro-hole-corruption-test
> 	#!/bin/bash
> 
> 	# Write a 4096 byte block of something
> 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> 
> 	# Here is some test data with holes in it:
> 	for y in $(seq 0 100); do
> 		for x in 0 1; do
> 			block 0;
> 			block 21;
> 			block 0;
> 			block 22;
> 			block 0;
> 			block 0;
> 			block 43;
> 			block 44;
> 			block 0;
> 			block 0;
> 			block 61;
> 			block 62;
> 			block 63;
> 			block 64;
> 			block 65;
> 			block 66;
> 		done

Does the content make any difference to this bug?
So far it's just 16 * 4K * 2 * 101 of data written *without* any holes.

This should indeed create 101 128K compressed data extents.
But I'm wondering about the description of 'holes'.

> 	done > am
> 	sync
> 
> 	# Now replace those 101 distinct extents with 101 references to the first extent
> 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

Will this bug still happen if you create one extent and then reflink it
101 times?

> 
> 	# Punch holes into the extent refs
> 	fallocate -v -d am

The hole-punching in fact happens here.

BTW, will adding a "sync" here change the result?

> 
> 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
> 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> 
> 	root@rescue:/test# ./repro-hole-corruption-test
> 	i: 91, status: 0, bytes_deduped: 131072
> 	i: 92, status: 0, bytes_deduped: 131072
> 	i: 93, status: 0, bytes_deduped: 131072
> 	i: 94, status: 0, bytes_deduped: 131072
> 	i: 95, status: 0, bytes_deduped: 131072
> 	i: 96, status: 0, bytes_deduped: 131072
> 	i: 97, status: 0, bytes_deduped: 131072
> 	i: 98, status: 0, bytes_deduped: 131072
> 	i: 99, status: 0, bytes_deduped: 131072
> 	13107200 total bytes deduped in this operation
> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	072a152355788c767b97e4e4c0e4567720988b84 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	60831f0e7ffe4b49722612c18685c09f4583b1df am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	^C

It looks like we have something wrong in how file extents are
interpreted, maybe related to extent map merging.

BTW, if no read corruption happens without dropping the page cache, that
would narrow down the range of problems we're looking for.

Thanks,
Qu

> 
> Corruption occurs most often when there is a sequence like this in a file:
> 
> 	ref 1: hole
> 	ref 2: extent A, offset 0
> 	ref 3: hole
> 	ref 4: extent A, offset 8192
> 
> This scenario typically arises due to hole-punching or deduplication.
> Hole-punching replaces one extent ref with two references to the same
> extent with a hole between them, so:
> 
> 	ref 1:  extent A, offset 0, length 16384
> 
> becomes:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
> 
> Deduplication replaces two distinct extent refs surrounding a hole with
> two references to one of the duplicate extents, turning this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
> 
> into this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 0, length 4096
> 
> Compression is required (zlib, zstd, or lzo) for corruption to occur.
> I am not able to reproduce the issue with an uncompressed extent nor
> have I observed any such corruption in the wild.
> 
> The presence or absence of the no-holes filesystem feature has no effect.
> 
> Ordinary writes can lead to pairs of extent references to the same extent
> separated by a reference to a different extent; however, in this case
> there is data to be read from a real extent, instead of pages that have
> to be zero filled from a hole.  If ordinary non-hole writes could trigger
> this bug, every page-oriented database engine would be crashing all the
> time on btrfs with compression enabled, and it's unlikely that this would not
> have been noticed between 2015 and now.  An ordinary write that splits
> an extent ref would look like this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  extent C, offset 0, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
> 
> Sparse writes can lead to pairs of extent references surrounding a hole;
> however, in this case the extent references will point to different
> extents, avoiding the bug.  If a sparse write could trigger the bug,
> the rsync -S option and qemu/kvm 'raw' disk image files (among many
> other tools that produce sparse files) would be unusable, and it's
> unlikely that this would not have been noticed between 2015 and now either.
> Sparse writes look like this:
> 
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
> 
> The pattern or timing of read() calls seems to be relevant.  It is very
> hard to see the corruption when reading files with 'hd', but 'cat | hd'
> will see the corruption just fine.  Similar problems exist with 'cmp'
> but not 'sha1sum'.  Two processes reading the same file at the same time
> seem to trigger the corruption very frequently.
> 
> Some patterns of holes and data produce corruption faster than others.
> The pattern generated by the script above is based on instances of
> corruption I've found in the wild, and has a much better repro rate than
> random holes.
> 
> The corruption occurs during reads, after csum verification and before
> decompression, so btrfs detects no csum failures.  The data on disk
> seems to be OK and could be read correctly once the kernel bug is fixed.
> Repeated reads do eventually return correct data, but there is no way
> for userspace to distinguish between corrupt and correct data reliably.
> 
> The corrupted data usually consists of zeroes, as if the data had been
> replaced by a hole, or of copies of other blocks in the same extent.
> 
> The behavior is similar to some earlier bugs related to holes and
> compressed data in btrfs, but it's new and not fixed yet--hence,
> "2018 edition."
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
  2018-08-23  5:10 ` Qu Wenruo
@ 2018-08-23 16:44   ` Zygo Blaxell
  2018-08-23 23:50     ` Qu Wenruo
  0 siblings, 1 reply; 25+ messages in thread
From: Zygo Blaxell @ 2018-08-23 16:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11603 bytes --]

On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes.  The bug is
> > reproducible on at least kernels v4.1..v4.18.
> 
> This bug already sounds more serious than the previous nodatasum +
> compression bug.

Maybe.  "compression + holes corruption bug 2017" could be avoided with
the max-inline=0 mount option without disabling compression.  This time,
the workaround is more intrusive:  avoid all applications that use dedup
or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> > 
> > 	root@rescue:/test# cat repro-hole-corruption-test
> > 	#!/bin/bash
> > 
> > 	# Write a 4096 byte block of something
> > 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > 
> > 	# Here is some test data with holes in it:
> > 	for y in $(seq 0 100); do
> > 		for x in 0 1; do
> > 			block 0;
> > 			block 21;
> > 			block 0;
> > 			block 22;
> > 			block 0;
> > 			block 0;
> > 			block 43;
> > 			block 44;
> > 			block 0;
> > 			block 0;
> > 			block 61;
> > 			block 62;
> > 			block 63;
> > 			block 64;
> > 			block 65;
> > 			block 66;
> > 		done
> 
> Does the content make any difference to this bug?
> So far it's just 16 * 4K * 2 * 101 of data written *without* any holes.

The content of the extents doesn't seem to matter, other than it needs to
be compressible so that the extents on disk are compressed.  The bug is
also triggered by writing non-zero data to all blocks, and then punching
the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".
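
As a sketch, that variant looks something like this (the exact hole
offsets above are elided, so these offsets are illustrative; they follow
the same 16K grouping as the script):

	# non-zero, compressible data over the whole file, then punch holes
	yes ABCDEFGH | head -c $((101 * 131072)) > am2
	sync
	for (( off = 0; off < 101 * 131072; off += 16384 )); do
		fallocate -p -o "$off" -l 4096 am2
	done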

The layout of the extents matters a lot.  I have to loop hundreds or
thousands of times to hit the bug if the first block in the pattern is
not a hole, or if the non-hole extents are different sizes or positions
than above.

I tried random patterns of holes and extent refs, and most of them have
an order of magnitude lower hit rates than the above.  This might be due
to some relationship between the alignment of read() request boundaries
with extent boundaries, but I haven't done any tests designed to detect
such a relationship.

In the wild, corruption happens on some files much more often than others.
This seems to be correlated with the extent layout as well.

I discovered the bug by examining files that were intermittently but
repeatedly failing routine data integrity checks, and found that in every
case they had similar hole + extent patterns near the point where data
was corrupted.

I did a search on some big filesystems for the
hole-refExtentA-hole-refExtentA pattern and found several files with
this pattern that had passed previous data integrity checks, but would
fail randomly in the sha1sum/drop-caches loop.

> This should indeed create 101 128K compressed data extents.
> But I'm wondering about the description of 'holes'.

The holes are coming, wait for it... ;)

> > 	done > am
> > 	sync
> > 
> > 	# Now replace those 101 distinct extents with 101 references to the first extent
> > 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> 
> Will this bug still happen if you create one extent and then reflink it
> 101 times?

Yes.  I used btrfs-extent-same because a binary is included in the
Debian duperemove package, but I use it only for convenience.
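
For reference, a reflink-based version of the dedup step might look
like this (untested sketch; assumes xfs_io is available):

	# clone the first 128K extent over the other 100 positions
	for x in $(seq 1 100); do
		xfs_io -c "reflink am 0 $((x * 131072)) 131072" am
	done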

It's not necessary to have hundreds of references to the same extent--even
two refs to a single extent plus a hole can trigger the bug sometimes.
100 references in a single file will trigger the bug so often that it
can be detected within the first 20 sha1sum loops.

When the corruption occurs, it affects around 90 of the original 101
extents.  The different sha1sum results are due to different extents
giving bad data on different runs.

> > 	# Punch holes into the extent refs
> > 	fallocate -v -d am
> 
> The hole-punching in fact happens here.
> 
> BTW, will adding a "sync" here change the result?

No.  You can reboot the machine here if you like, it does not change
anything that happens during reads later.

Looking at the extent tree in btrfs-debug-tree, the data on disk
looks correct, and btrfs does read it correctly most of the time (the
correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
The corruption therefore comes from btrfs read() producing incorrect
data in some instances.
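
For anyone who wants to check the layout themselves, something like
this should show the refs (device name is illustrative; tree 5 is the
default subvolume's FS tree, and recent btrfs-progs spell
btrfs-debug-tree as 'btrfs inspect-internal dump-tree'):

	inum=$(stat -c %i am)
	btrfs inspect-internal dump-tree -t 5 /dev/vdb |
		grep -A3 "($inum EXTENT_DATA"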

> > 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> > 
> > 	root@rescue:/test# ./repro-hole-corruption-test
> > 	i: 91, status: 0, bytes_deduped: 131072
> > 	i: 92, status: 0, bytes_deduped: 131072
> > 	i: 93, status: 0, bytes_deduped: 131072
> > 	i: 94, status: 0, bytes_deduped: 131072
> > 	i: 95, status: 0, bytes_deduped: 131072
> > 	i: 96, status: 0, bytes_deduped: 131072
> > 	i: 97, status: 0, bytes_deduped: 131072
> > 	i: 98, status: 0, bytes_deduped: 131072
> > 	i: 99, status: 0, bytes_deduped: 131072
> > 	13107200 total bytes deduped in this operation
> > 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	072a152355788c767b97e4e4c0e4567720988b84 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	60831f0e7ffe4b49722612c18685c09f4583b1df am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 	^C
> 
> It looks like we have something wrong in how file extents are
> interpreted, maybe related to extent map merging.
> 
> BTW, if no read corruption happens without dropping the page cache, that
> would narrow down the range of problems we're looking for.

The page cache drop makes reproduction easier/faster.  If you don't drop
caches, you have to wait for the data to be evicted from page cache or
the data from read() will not change.

In the wild, if I do a sha1sum loop on a few hundred GB of data known
to have the hole-extent-hole pattern (so the pages are evicted between
sha1sum runs), I see similar results without explicitly dropping caches.

If you read the file with a cold cache from two processes at once
(e.g. you run 'hd am' while the sha1sum/drop-cache loop is running)
the data changes faster (different on 90% of reads instead of just 20%).

> Thanks,
> Qu
> 
> > 
> > Corruption occurs most often when there is a sequence like this in a file:
> > 
> > 	ref 1: hole
> > 	ref 2: extent A, offset 0
> > 	ref 3: hole
> > 	ref 4: extent A, offset 8192
> > 
> > This scenario typically arises due to hole-punching or deduplication.
> > Hole-punching replaces one extent ref with two references to the same
> > extent with a hole between them, so:
> > 
> > 	ref 1:  extent A, offset 0, length 16384
> > 
> > becomes:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent A, offset 12288, length 4096
> > 
> > Deduplication replaces two distinct extent refs surrounding a hole with
> > two references to one of the duplicate extents, turning this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent B, offset 0, length 4096
> > 
> > into this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent A, offset 0, length 4096
> > 
> > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > I am not able to reproduce the issue with an uncompressed extent nor
> > have I observed any such corruption in the wild.
> > 
> > The presence or absence of the no-holes filesystem feature has no effect.
> > 
> > Ordinary writes can lead to pairs of extent references to the same extent
> > separated by a reference to a different extent; however, in this case
> > there is data to be read from a real extent, instead of pages that have
> > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > this bug, every page-oriented database engine would be crashing all the
> > time on btrfs with compression enabled, and it's unlikely that this would not
> > have been noticed between 2015 and now.  An ordinary write that splits
> > an extent ref would look like this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  extent C, offset 0, length 8192
> > 	ref 3:  extent A, offset 12288, length 4096
> > 
> > Sparse writes can lead to pairs of extent references surrounding a hole;
> > however, in this case the extent references will point to different
> > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > other tools that produce sparse files) would be unusable, and it's
> > unlikely that this would not have been noticed between 2015 and now either.
> > Sparse writes look like this:
> > 
> > 	ref 1:  extent A, offset 0, length 4096
> > 	ref 2:  hole, length 8192
> > 	ref 3:  extent B, offset 0, length 4096
> > 
> > The pattern or timing of read() calls seems to be relevant.  It is very
> > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > will see the corruption just fine.  Similar problems exist with 'cmp'
> > but not 'sha1sum'.  Two processes reading the same file at the same time
> > seem to trigger the corruption very frequently.
> > 
> > Some patterns of holes and data produce corruption faster than others.
> > The pattern generated by the script above is based on instances of
> > corruption I've found in the wild, and has a much better repro rate than
> > random holes.
> > 
> > The corruption occurs during reads, after csum verification and before
> > decompression, so btrfs detects no csum failures.  The data on disk
> > seems to be OK and could be read correctly once the kernel bug is fixed.
> > Repeated reads do eventually return correct data, but there is no way
> > for userspace to distinguish between corrupt and correct data reliably.
> > 
> > The corrupted data usually consists of zeroes, as if the data had been
> > replaced by a hole, or of copies of other blocks in the same extent.
> > 
> > The behavior is similar to some earlier bugs related to holes and
> > compressed data in btrfs, but it's new and not fixed yet--hence,
> > "2018 edition."
> > 
> 




[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
  2018-08-23 16:44   ` Zygo Blaxell
@ 2018-08-23 23:50     ` Qu Wenruo
  0 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2018-08-23 23:50 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 12116 bytes --]



On 2018/8/24 12:44 AM, Zygo Blaxell wrote:
> On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
>> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>> when reading a mix of compressed extents and holes.  The bug is
>>> reproducible on at least kernels v4.1..v4.18.
>>
>> This bug already sounds more serious than the previous nodatasum +
>> compression bug.
> 
> Maybe.  "compression + holes corruption bug 2017" could be avoided with
> the max-inline=0 mount option without disabling compression.  This time,
> the workaround is more intrusive:  avoid all applications that use dedup
> or hole-punching.
> 
>>> Some more observations and background follow, but first here is the
>>> script and some sample output:
>>>
>>> 	root@rescue:/test# cat repro-hole-corruption-test
>>> 	#!/bin/bash
>>>
>>> 	# Write a 4096 byte block of something
>>> 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>
>>> 	# Here is some test data with holes in it:
>>> 	for y in $(seq 0 100); do
>>> 		for x in 0 1; do
>>> 			block 0;
>>> 			block 21;
>>> 			block 0;
>>> 			block 22;
>>> 			block 0;
>>> 			block 0;
>>> 			block 43;
>>> 			block 44;
>>> 			block 0;
>>> 			block 0;
>>> 			block 61;
>>> 			block 62;
>>> 			block 63;
>>> 			block 64;
>>> 			block 65;
>>> 			block 66;
>>> 		done
>>
>> Does the content make any difference to this bug?
>> So far it's just 16 * 4K * 2 * 101 of data written *without* any holes.
> 
> The content of the extents doesn't seem to matter, other than it needs to
> be compressible so that the extents on disk are compressed.  The bug is
> also triggered by writing non-zero data to all blocks, and then punching
> the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".
> 
> The layout of the extents matters a lot.  I have to loop hundreds or
> thousands of times to hit the bug if the first block in the pattern is
> not a hole, or if the non-hole extents are different sizes or positions
> than above.
> 
> I tried random patterns of holes and extent refs, and most of them have
> an order of magnitude lower hit rates than the above.  This might be due
> to some relationship between the alignment of read() request boundaries
> with extent boundaries, but I haven't done any tests designed to detect
> such a relationship.
> 
> In the wild, corruption happens on some files much more often than others.
> This seems to be correlated with the extent layout as well.
> 
> I discovered the bug by examining files that were intermittently but
> repeatedly failing routine data integrity checks, and found that in every
> case they had similar hole + extent patterns near the point where data
> was corrupted.
> 
> I did a search on some big filesystems for the
> hole-refExtentA-hole-refExtentA pattern and found several files with
> this pattern that had passed previous data integrity checks, but would
> fail randomly in the sha1sum/drop-caches loop.
> 
>> This should indeed create 101 128K compressed data extents.
>> But I'm wondering about the description of 'holes'.
> 
> The holes are coming, wait for it... ;)
> 
>>> 	done > am
>>> 	sync
>>>
>>> 	# Now replace those 101 distinct extents with 101 references to the first extent
>>> 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>>
>> Will this bug still happen if you create one extent and then reflink it
>> 101 times?
> 
> Yes.  I used btrfs-extent-same because a binary is included in the
> Debian duperemove package, but I use it only for convenience.
> 
> It's not necessary to have hundreds of references to the same extent--even
> two refs to a single extent plus a hole can trigger the bug sometimes.
> 100 references in a single file will trigger the bug so often that it
> can be detected within the first 20 sha1sum loops.
> 
> When the corruption occurs, it affects around 90 of the original 101
> extents.  The different sha1sum results are due to different extents
> giving bad data on different runs.
> 
>>> 	# Punch holes into the extent refs
>>> 	fallocate -v -d am
>>
>> The hole-punching in fact happens here.
>>
>> BTW, will adding a "sync" here change the result?
> 
> No.  You can reboot the machine here if you like, it does not change
> anything that happens during reads later.

So it looks like my assumption of a bad file extent interpreter is
getting more and more valid.

It has nothing to do with a race against hole punching or writes, but
only with the file layout and the extent map cache.

> 
> Looking at the extent tree in btrfs-debug-tree, the data on disk
> looks correct, and btrfs does read it correctly most of the time (the
> correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
> The corruption therefore comes from btrfs read() producing incorrect
> data in some instances.
> 
>>> 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
>>> 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>>>
>>> 	root@rescue:/test# ./repro-hole-corruption-test
>>> 	i: 91, status: 0, bytes_deduped: 131072
>>> 	i: 92, status: 0, bytes_deduped: 131072
>>> 	i: 93, status: 0, bytes_deduped: 131072
>>> 	i: 94, status: 0, bytes_deduped: 131072
>>> 	i: 95, status: 0, bytes_deduped: 131072
>>> 	i: 96, status: 0, bytes_deduped: 131072
>>> 	i: 97, status: 0, bytes_deduped: 131072
>>> 	i: 98, status: 0, bytes_deduped: 131072
>>> 	i: 99, status: 0, bytes_deduped: 131072
>>> 	13107200 total bytes deduped in this operation
>>> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	072a152355788c767b97e4e4c0e4567720988b84 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	60831f0e7ffe4b49722612c18685c09f4583b1df am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 	^C
>>
>> It looks like we have something wrong in how file extents are
>> interpreted, maybe related to extent map merging.
>>
>> BTW, if no read corruption happens without dropping the page cache, that
>> would narrow down the range of problems we're looking for.
> 
> The page cache drop makes reproduction easier/faster.  If you don't drop
> caches, you have to wait for the data to be evicted from page cache or
> the data from read() will not change.

So it's highly likely that the file extent interpreter is causing the problem.

Thanks,
Qu

> 
> In the wild, if I do a sha1sum loop on a few hundred GB of data known
> to have the hole-extent-hole pattern (so the pages are evicted between
> sha1sum runs), I see similar results without explicitly dropping caches.
> 
> If you read the file with a cold cache from two processes at once
> (e.g. you run 'hd am' while the sha1sum/drop-cache loop is running)
> the data changes faster (different on 90% of reads instead of just 20%).
> 
>> Thanks,
>> Qu
>>
>>>
>>> Corruption occurs most often when there is a sequence like this in a file:
>>>
>>> 	ref 1: hole
>>> 	ref 2: extent A, offset 0
>>> 	ref 3: hole
>>> 	ref 4: extent A, offset 8192
>>>
>>> This scenario typically arises due to hole-punching or deduplication.
>>> Hole-punching replaces one extent ref with two references to the same
>>> extent with a hole between them, so:
>>>
>>> 	ref 1:  extent A, offset 0, length 16384
>>>
>>> becomes:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent A, offset 12288, length 4096
>>>
>>> Deduplication replaces two distinct extent refs surrounding a hole with
>>> two references to one of the duplicate extents, turning this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent B, offset 0, length 4096
>>>
>>> into this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent A, offset 0, length 4096
>>>
>>> Compression is required (zlib, zstd, or lzo) for corruption to occur.
>>> I am not able to reproduce the issue with an uncompressed extent nor
>>> have I observed any such corruption in the wild.
>>>
>>> The presence or absence of the no-holes filesystem feature has no effect.
>>>
>>> Ordinary writes can lead to pairs of extent references to the same extent
>>> separated by a reference to a different extent; however, in this case
>>> there is data to be read from a real extent, instead of pages that have
>>> to be zero filled from a hole.  If ordinary non-hole writes could trigger
>>> this bug, every page-oriented database engine would be crashing all the
>>> time on btrfs with compression enabled, and it's unlikely that this would not
>>> have been noticed between 2015 and now.  An ordinary write that splits
>>> an extent ref would look like this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  extent C, offset 0, length 8192
>>> 	ref 3:  extent A, offset 12288, length 4096
>>>
>>> Sparse writes can lead to pairs of extent references surrounding a hole;
>>> however, in this case the extent references will point to different
>>> extents, avoiding the bug.  If a sparse write could trigger the bug,
>>> the rsync -S option and qemu/kvm 'raw' disk image files (among many
>>> other tools that produce sparse files) would be unusable, and it's
>>> unlikely that this would not have been noticed between 2015 and now either.
>>> Sparse writes look like this:
>>>
>>> 	ref 1:  extent A, offset 0, length 4096
>>> 	ref 2:  hole, length 8192
>>> 	ref 3:  extent B, offset 0, length 4096
>>>
>>> The pattern or timing of read() calls seems to be relevant.  It is very
>>> hard to see the corruption when reading files with 'hd', but 'cat | hd'
>>> will see the corruption just fine.  Similar problems exist with 'cmp'
>>> but not 'sha1sum'.  Two processes reading the same file at the same time
>>> seem to trigger the corruption very frequently.
>>>
>>> Some patterns of holes and data produce corruption faster than others.
>>> The pattern generated by the script above is based on instances of
>>> corruption I've found in the wild, and has a much better repro rate than
>>> random holes.
>>>
>>> The corruption occurs during reads, after csum verification and before
>>> decompression, so btrfs detects no csum failures.  The data on disk
>>> seems to be OK and could be read correctly once the kernel bug is fixed.
>>> Repeated reads do eventually return correct data, but there is no way
>>> for userspace to distinguish between corrupt and correct data reliably.
>>>
>>> The corrupted data usually consists of zeroes, as if the data had been
>>> replaced by a hole, or of copies of other blocks in the same extent.
>>>
>>> The behavior is similar to some earlier bugs related to holes and
>>> compressed data in btrfs, but it's new and not fixed yet--hence,
>>> "2018 edition."
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2018-08-23  3:11 Reproducer for "compressed data + hole data corruption bug, 2018 edition" Zygo Blaxell
  2018-08-23  5:10 ` Qu Wenruo
@ 2019-02-12  3:09 ` Zygo Blaxell
  2019-02-12 15:33   ` Christoph Anton Mitterer
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-12  3:09 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 8454 bytes --]

Still reproducible on 4.20.7.

The behavior is slightly different on current kernels (4.20.7, 4.14.96),
which makes the problem a bit more difficult to detect.

	# repro-hole-corruption-test
	i: 91, status: 0, bytes_deduped: 131072
	i: 92, status: 0, bytes_deduped: 131072
	i: 93, status: 0, bytes_deduped: 131072
	i: 94, status: 0, bytes_deduped: 131072
	i: 95, status: 0, bytes_deduped: 131072
	i: 96, status: 0, bytes_deduped: 131072
	i: 97, status: 0, bytes_deduped: 131072
	i: 98, status: 0, bytes_deduped: 131072
	i: 99, status: 0, bytes_deduped: 131072
	13107200 total bytes deduped in this operation
	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
	94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

The sha1sum seems stable after the first drop_caches--until a second
process tries to read the test file:

	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	# cat am > /dev/null              (in another shell)
	19294e695272c42edb89ceee24bb08c13473140a am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
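
A sketch of folding that second reader into the test loop, for kernels
with this newer behavior:

	while :; do
		cat am > /dev/null &    # concurrent reader in the background
		echo $(sha1sum am)
		wait
		sysctl -q vm.drop_caches=3
		sleep 1
	done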

On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> This is a repro script for a btrfs bug that causes corrupted data reads
> when reading a mix of compressed extents and holes.  The bug is
> reproducible on at least kernels v4.1..v4.18.
>
> Some more observations and background follow, but first here is the
> script and some sample output:
>
> 	root@rescue:/test# cat repro-hole-corruption-test
> 	#!/bin/bash
>
> 	# Write a 4096 byte block of something
> 	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>
> 	# Here is some test data with holes in it:
> 	for y in $(seq 0 100); do
> 		for x in 0 1; do
> 			block 0;
> 			block 21;
> 			block 0;
> 			block 22;
> 			block 0;
> 			block 0;
> 			block 43;
> 			block 44;
> 			block 0;
> 			block 0;
> 			block 61;
> 			block 62;
> 			block 63;
> 			block 64;
> 			block 65;
> 			block 66;
> 		done
> 	done > am
> 	sync
>
> 	# Now replace those 101 distinct extents with 101 references to the first extent
> 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>
> 	# Punch holes into the extent refs
> 	fallocate -v -d am
>
> 	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
> 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>
> 	root@rescue:/test# ./repro-hole-corruption-test
> 	i: 91, status: 0, bytes_deduped: 131072
> 	i: 92, status: 0, bytes_deduped: 131072
> 	i: 93, status: 0, bytes_deduped: 131072
> 	i: 94, status: 0, bytes_deduped: 131072
> 	i: 95, status: 0, bytes_deduped: 131072
> 	i: 96, status: 0, bytes_deduped: 131072
> 	i: 97, status: 0, bytes_deduped: 131072
> 	i: 98, status: 0, bytes_deduped: 131072
> 	i: 99, status: 0, bytes_deduped: 131072
> 	13107200 total bytes deduped in this operation
> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	072a152355788c767b97e4e4c0e4567720988b84 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	60831f0e7ffe4b49722612c18685c09f4583b1df am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	^C
>
> Corruption occurs most often when there is a sequence like this in a file:
>
> 	ref 1: hole
> 	ref 2: extent A, offset 0
> 	ref 3: hole
> 	ref 4: extent A, offset 8192
>
> This scenario typically arises due to hole-punching or deduplication.
> Hole-punching replaces one extent ref with two references to the same
> extent with a hole between them, so:
>
> 	ref 1:  extent A, offset 0, length 16384
>
> becomes:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
>
> Deduplication replaces two distinct extent refs surrounding a hole with
> two references to one of the duplicate extents, turning this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
>
> into this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent A, offset 0, length 4096
>
> Compression is required (zlib, zstd, or lzo) for corruption to occur.
> I am not able to reproduce the issue with an uncompressed extent nor
> have I observed any such corruption in the wild.
>
> The presence or absence of the no-holes filesystem feature has no effect.
>
> Ordinary writes can lead to pairs of extent references to the same extent
> separated by a reference to a different extent; however, in this case
> there is data to be read from a real extent, instead of pages that have
> to be zero filled from a hole.  If ordinary non-hole writes could trigger
> this bug, every page-oriented database engine would be crashing all the
> time on btrfs with compression enabled, and it's unlikely that this would not
> have been noticed between 2015 and now.  An ordinary write that splits
> an extent ref would look like this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  extent C, offset 0, length 8192
> 	ref 3:  extent A, offset 12288, length 4096
>
> Sparse writes can lead to pairs of extent references surrounding a hole;
> however, in this case the extent references will point to different
> extents, avoiding the bug.  If a sparse write could trigger the bug,
> the rsync -S option and qemu/kvm 'raw' disk image files (among many
> other tools that produce sparse files) would be unusable, and it's
> unlikely that this would not have been noticed between 2015 and now either.
> Sparse writes look like this:
>
> 	ref 1:  extent A, offset 0, length 4096
> 	ref 2:  hole, length 8192
> 	ref 3:  extent B, offset 0, length 4096
>
> The pattern or timing of read() calls seems to be relevant.  It is very
> hard to see the corruption when reading files with 'hd', but 'cat | hd'
> will see the corruption just fine.  Similar problems exist with 'cmp'
> but not 'sha1sum'.  Two processes reading the same file at the same time
> seem to trigger the corruption very frequently.
>
> Some patterns of holes and data produce corruption faster than others.
> The pattern generated by the script above is based on instances of
> corruption I've found in the wild, and has a much better repro rate than
> random holes.
>
> The corruption occurs during reads, after csum verification and before
> decompression, so btrfs detects no csum failures.  The data on disk
> seems to be OK and could be read correctly once the kernel bug is fixed.
> Repeated reads do eventually return correct data, but there is no way
> for userspace to distinguish between corrupt and correct data reliably.
>
> The corrupted data usually consists of zeroes, as if the data had been
> replaced by a hole, or of copies of other blocks in the same extent.
>
> The behavior is similar to some earlier bugs related to holes and
> compressed data in btrfs, but it's new and not fixed yet--hence,
> "2018 edition."



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
@ 2019-02-12 15:33   ` Christoph Anton Mitterer
  2019-02-12 15:35   ` Filipe Manana
  2019-02-13  7:47   ` Roman Mamedov
  2 siblings, 0 replies; 25+ messages in thread
From: Christoph Anton Mitterer @ 2019-02-12 15:33 UTC (permalink / raw)
  To: linux-btrfs

Hey.

Sounds like a highly severe (and long-standing) bug?

Is anyone doing anything about it?


Cheers,
Chris.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  2019-02-12 15:33   ` Christoph Anton Mitterer
@ 2019-02-12 15:35   ` Filipe Manana
  2019-02-12 17:01     ` Zygo Blaxell
  2019-02-13  7:47   ` Roman Mamedov
  2 siblings, 1 reply; 25+ messages in thread
From: Filipe Manana @ 2019-02-12 15:35 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> Still reproducible on 4.20.7.

I tried your reproducer when you first reported it, on different
machines with different kernel versions.
I never managed to reproduce it, nor did I see anything obviously wrong
in the relevant code paths.

>
> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> which makes the problem a bit more difficult to detect.
>
>         # repro-hole-corruption-test
>         i: 91, status: 0, bytes_deduped: 131072
>         i: 92, status: 0, bytes_deduped: 131072
>         i: 93, status: 0, bytes_deduped: 131072
>         i: 94, status: 0, bytes_deduped: 131072
>         i: 95, status: 0, bytes_deduped: 131072
>         i: 96, status: 0, bytes_deduped: 131072
>         i: 97, status: 0, bytes_deduped: 131072
>         i: 98, status: 0, bytes_deduped: 131072
>         i: 99, status: 0, bytes_deduped: 131072
>         13107200 total bytes deduped in this operation
>         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>
> The sha1sum seems stable after the first drop_caches--until a second
> process tries to read the test file:
>
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>         # cat am > /dev/null              (in another shell)
>         19294e695272c42edb89ceee24bb08c13473140a am
>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>
> On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes.  The bug is
> > reproducible on at least kernels v4.1..v4.18.
> >
> > Some more observations and background follow, but first here is the
> > script and some sample output:
> >
> >       root@rescue:/test# cat repro-hole-corruption-test
> >       #!/bin/bash
> >
> >       # Write a 4096 byte block of something
> >       block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> >
> >       # Here is some test data with holes in it:
> >       for y in $(seq 0 100); do
> >               for x in 0 1; do
> >                       block 0;
> >                       block 21;
> >                       block 0;
> >                       block 22;
> >                       block 0;
> >                       block 0;
> >                       block 43;
> >                       block 44;
> >                       block 0;
> >                       block 0;
> >                       block 61;
> >                       block 62;
> >                       block 63;
> >                       block 64;
> >                       block 65;
> >                       block 66;
> >               done
> >       done > am
> >       sync
> >
> >       # Now replace those 101 distinct extents with 101 references to the first extent
> >       btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> >
> >       # Punch holes into the extent refs
> >       fallocate -v -d am
> >
> >       # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> >       while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> >
> >       root@rescue:/test# ./repro-hole-corruption-test
> >       i: 91, status: 0, bytes_deduped: 131072
> >       i: 92, status: 0, bytes_deduped: 131072
> >       i: 93, status: 0, bytes_deduped: 131072
> >       i: 94, status: 0, bytes_deduped: 131072
> >       i: 95, status: 0, bytes_deduped: 131072
> >       i: 96, status: 0, bytes_deduped: 131072
> >       i: 97, status: 0, bytes_deduped: 131072
> >       i: 98, status: 0, bytes_deduped: 131072
> >       i: 99, status: 0, bytes_deduped: 131072
> >       13107200 total bytes deduped in this operation
> >       am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       072a152355788c767b97e4e4c0e4567720988b84 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       60831f0e7ffe4b49722612c18685c09f4583b1df am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >       ^C
> >
> > Corruption occurs most often when there is a sequence like this in a file:
> >
> >       ref 1: hole
> >       ref 2: extent A, offset 0
> >       ref 3: hole
> >       ref 4: extent A, offset 8192
> >
> > This scenario typically arises due to hole-punching or deduplication.
> > Hole-punching replaces one extent ref with two references to the same
> > extent with a hole between them, so:
> >
> >       ref 1:  extent A, offset 0, length 16384
> >
> > becomes:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent A, offset 12288, length 4096
> >
> > Deduplication replaces two distinct extent refs surrounding a hole with
> > two references to one of the duplicate extents, turning this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent B, offset 0, length 4096
> >
> > into this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent A, offset 0, length 4096
> >
> > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > I am not able to reproduce the issue with an uncompressed extent nor
> > have I observed any such corruption in the wild.
> >
> > The presence or absence of the no-holes filesystem feature has no effect.
> >
> > Ordinary writes can lead to pairs of extent references to the same extent
> > separated by a reference to a different extent; however, in this case
> > there is data to be read from a real extent, instead of pages that have
> > to be zero filled from a hole.  If ordinary non-hole writes could trigger
> > this bug, every page-oriented database engine would be crashing all the
> > time on btrfs with compression enabled, and it's unlikely that would not
> > have been noticed between 2015 and now.  An ordinary write that splits
> > an extent ref would look like this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  extent C, offset 0, length 8192
> >       ref 3:  extent A, offset 12288, length 4096
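
A sketch of producing such a split with ordinary writes, assuming a
fresh file 'f':

	# Write one 16 KiB extent A, then overwrite the middle 8 KiB in
	# place; the overwrite lands in a new extent C and splits the
	# refs to A.
	head -c 16384 /dev/urandom > f; sync
	dd if=/dev/urandom of=f bs=4096 seek=1 count=2 conv=notrunc
	sync
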
> >
> > Sparse writes can lead to pairs of extent references surrounding a hole;
> > however, in this case the extent references will point to different
> > extents, avoiding the bug.  If a sparse write could trigger the bug,
> > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > other tools that produce sparse files) would be unusable, and it's
> > unlikely that would not have been noticed between 2015 and now either.
> > Sparse writes look like this:
> >
> >       ref 1:  extent A, offset 0, length 4096
> >       ref 2:  hole, length 8192
> >       ref 3:  extent B, offset 0, length 4096
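
A sketch of producing that layout with a sparse write, assuming a
fresh file 'sparse':

	# Blocks 0 and 3 are written, blocks 1-2 never are: data, hole,
	# data, with the two data blocks in different extents (A and B).
	dd if=/dev/urandom of=sparse bs=4096 count=1
	dd if=/dev/urandom of=sparse bs=4096 seek=3 count=1 conv=notrunc
	sync
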
> >
> > The pattern or timing of read() calls seems to be relevant.  It is very
> > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > will see the corruption just fine.  Similar problems exist with 'cmp'
> > but not 'sha1sum'.  Two processes reading the same file at the same time
> > seem to trigger the corruption very frequently.
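
Based on that observation, a second reader can be added to the watch
loop to raise the hit rate; a minimal sketch, assuming the test file
'am' from the script above:

	# Two concurrent readers plus a cache drop between iterations.
	while :; do
		cat am > /dev/null &
		echo $(sha1sum am)
		wait
		sysctl -q vm.drop_caches=3
		sleep 1
	done
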
> >
> > Some patterns of holes and data produce corruption faster than others.
> > The pattern generated by the script above is based on instances of
> > corruption I've found in the wild, and has a much better repro rate than
> > random holes.
> >
> > The corruption occurs during reads, after csum verification and before
> > decompression, so btrfs detects no csum failures.  The data on disk
> > seems to be OK and could be read correctly once the kernel bug is fixed.
> > Repeated reads do eventually return correct data, but there is no way
> > for userspace to distinguish between corrupt and correct data reliably.
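
The closest userspace can get is a statistical check: read the file
several times across cache drops and treat any disagreement as
corruption. A minimal sketch, assuming most reads return good data:

	# More than one distinct line of output means at least one of
	# the cold-cache reads was corrupt.
	for i in $(seq 1 5); do
		sysctl -q vm.drop_caches=3
		sha1sum < am
	done | sort -u
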
> >
> > The corrupted data is usually data replaced by a hole or a copy of other
> > blocks in the same extent.
> >
> > The behavior is similar to some earlier bugs related to holes and
> > compressed data in btrfs, but it's new and not fixed yet--hence,
> > "2018 edition."
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 15:35   ` Filipe Manana
@ 2019-02-12 17:01     ` Zygo Blaxell
  2019-02-12 17:56       ` Filipe Manana
  2019-02-12 18:58       ` Andrei Borzenkov
  0 siblings, 2 replies; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-12 17:01 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11371 bytes --]

On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > Still reproducible on 4.20.7.
> 
> I tried your reproducer when you first reported it, on different
> machines with different kernel versions.

That would have been useful to know last August...  :-/

> Never managed to reproduce it, nor see anything obviously wrong in
> relevant code paths.

I built a fresh VM running Debian stretch and
reproduced the issue immediately.  Mount options are
"rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
probably doesn't matter.

I don't have any configuration that can't reproduce this issue, so I don't
know how to help you.  I've tested AMD and Intel CPUs, VMs and bare metal,
hardware ranging in age from 0 to 9 years.  Locally built kernels from
4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
All of these reproduce the issue immediately--wrong sha1sum appears in
the first 10 loops.

What is your test environment?  I can try that here.

> >
> > The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> > which makes the problem a bit more difficult to detect.
> >
> >         # repro-hole-corruption-test
> >         i: 91, status: 0, bytes_deduped: 131072
> >         i: 92, status: 0, bytes_deduped: 131072
> >         i: 93, status: 0, bytes_deduped: 131072
> >         i: 94, status: 0, bytes_deduped: 131072
> >         i: 95, status: 0, bytes_deduped: 131072
> >         i: 96, status: 0, bytes_deduped: 131072
> >         i: 97, status: 0, bytes_deduped: 131072
> >         i: 98, status: 0, bytes_deduped: 131072
> >         i: 99, status: 0, bytes_deduped: 131072
> >         13107200 total bytes deduped in this operation
> >         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> >         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >
> > The sha1sum seems stable after the first drop_caches--until a second
> > process tries to read the test file:
> >
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >         # cat am > /dev/null              (in another shell)
> >         19294e695272c42edb89ceee24bb08c13473140a am
> >         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> >
> > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
> > > [... original reproducer script and analysis snipped; quoted in full earlier in the thread ...]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 17:01     ` Zygo Blaxell
@ 2019-02-12 17:56       ` Filipe Manana
  2019-02-12 18:13         ` Zygo Blaxell
  2019-02-12 18:58       ` Andrei Borzenkov
  1 sibling, 1 reply; 25+ messages in thread
From: Filipe Manana @ 2019-02-12 17:56 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > Still reproducible on 4.20.7.
> >
> > I tried your reproducer when you first reported it, on different
> > machines with different kernel versions.
>
> That would have been useful to know last August...  :-/
>
> > Never managed to reproduce it, nor see anything obviously wrong in
> > relevant code paths.
>
> I built a fresh VM running Debian stretch and
> reproduced the issue immediately.  Mount options are
> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> probably doesn't matter.
>
> I don't have any configuration that can't reproduce this issue, so I don't
> know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> hardware ranging in age from 0 to 9 years.  Locally built kernels from
> 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> All of these reproduce the issue immediately--wrong sha1sum appears in
> the first 10 loops.
>
> What is your test environment?  I can try that here.

Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. Always built
from source kernels.
I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
that kept running the test in an infinite loop during those weeks.
Don't recall what the kernel versions were (whatever was the latest at
the time), but that shouldn't matter according to what you say.

> [... rest of quoted message snipped ...]



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 17:56       ` Filipe Manana
@ 2019-02-12 18:13         ` Zygo Blaxell
  2019-02-13  7:24           ` Qu Wenruo
  2019-02-13 17:36           ` Filipe Manana
  0 siblings, 2 replies; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-12 18:13 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 13720 bytes --]

On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > >
> > > > Still reproducible on 4.20.7.
> > >
> > > I tried your reproducer when you first reported it, on different
> > > machines with different kernel versions.
> >
> > That would have been useful to know last August...  :-/
> >
> > > Never managed to reproduce it, nor see anything obviously wrong in
> > > relevant code paths.
> >
> > I built a fresh VM running Debian stretch and
> > reproduced the issue immediately.  Mount options are
> > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > probably doesn't matter.
> >
> > I don't have any configuration that can't reproduce this issue, so I don't
> > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > All of these reproduce the issue immediately--wrong sha1sum appears in
> > the first 10 loops.
> >
> > What is your test environment?  I can try that here.
> 
> Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. 

I have several environments like that...

> Always built from source kernels.

...that could be a relevant difference.  Have you tried a stock
Debian kernel?

> I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> that kept running the test in an infinite loop during those weeks.
> Don't recall what the kernel versions were (whatever was the latest at
> the time), but that shouldn't matter according to what you say.

That's an extremely long time compared to the rate of occurrence
of this bug.  It should appear in only a few seconds of testing.
Some data-hole-data patterns reproduce much more slowly (change the position
of "block 0" lines in the setup script), but "slower" is minutes,
not machine-months.

Is your filesystem compressed?  Does compsize show the test
file 'am' is compressed during the test?  Is the sha1sum you get
6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
when a second process reads the file while the sha1sum/drop_caches loop
is running?
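
For what it's worth, those checks can be scripted; a sketch assuming
the compsize tool is installed and using the expected hash from the
runs above:

	compsize am            # most extents should show as zlib/zstd/lzo
	cat am > /dev/null &   # second reader, per the trigger above
	sha1sum am             # expect 6926a34e0ab3e0a023e8ea85a650f5b4217acab4
	wait
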

> [... rest of quoted message snipped ...]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 17:01     ` Zygo Blaxell
  2019-02-12 17:56       ` Filipe Manana
@ 2019-02-12 18:58       ` Andrei Borzenkov
  2019-02-12 21:48         ` Chris Murphy
  1 sibling, 1 reply; 25+ messages in thread
From: Andrei Borzenkov @ 2019-02-12 18:58 UTC (permalink / raw)
  To: Zygo Blaxell, Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 13325 bytes --]

12.02.2019 20:01, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>
>>> Still reproducible on 4.20.7.
>>
>> I tried your reproducer when you first reported it, on different
>> machines with different kernel versions.
> 
> That would have been useful to know last August...  :-/
> 
>> Never managed to reproduce it, nor see anything obviously wrong in
>> relevant code paths.
> 
> I built a fresh VM running Debian stretch and
> reproduced the issue immediately.  Mount options are
> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> probably doesn't matter.
> 
> I don't have any configuration that can't reproduce this issue, so I don't
> know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> hardware ranging in age from 0 to 9 years.  Locally built kernels from
> 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> All of these reproduce the issue immediately--wrong sha1sum appears in
> the first 10 loops.
> 
> What is your test environment?  I can try that here.
> 
>>>
>>> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
>>> which makes the problem a bit more difficult to detect.
>>>
>>>         # repro-hole-corruption-test
>>>         i: 91, status: 0, bytes_deduped: 131072
>>>         i: 92, status: 0, bytes_deduped: 131072
>>>         i: 93, status: 0, bytes_deduped: 131072
>>>         i: 94, status: 0, bytes_deduped: 131072
>>>         i: 95, status: 0, bytes_deduped: 131072
>>>         i: 96, status: 0, bytes_deduped: 131072
>>>         i: 97, status: 0, bytes_deduped: 131072
>>>         i: 98, status: 0, bytes_deduped: 131072
>>>         i: 99, status: 0, bytes_deduped: 131072
>>>         13107200 total bytes deduped in this operation
>>>         am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>>         94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>         6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>


I get the same result on Ubuntu 18.04 using distro packages and the
4.18 HWE kernel.

root@bor-Latitude-E5450:/var/tmp# dd if=/dev/zero of=loop bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0,125205 s, 1,7 GB/s
root@bor-Latitude-E5450:/var/tmp# mkfs.btrfs loop
btrfs-progs v4.15.1
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               b1f1111e-2d65-484a-9ab3-e00feaac2048
Node size:          16384
Sector size:        4096
Filesystem size:    200.00MiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP              32.00MiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   200.00MiB  loop

root@bor-Latitude-E5450:/var/tmp# mount -t btrfs -o
loop,rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/ ./loop
./loopmnt
root@bor-Latitude-E5450:/var/tmp# cd -
/var/tmp/loopmnt
root@bor-Latitude-E5450:/var/tmp/loopmnt# ../repro-hole-corruption-test
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4,8 MiB (4964352 bytes) converted to sparse holes.
94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
^Croot@bor-Latitude-E5450:/var/tmp/loopmnt#


>>> [... rest of quoted message snipped ...]



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 18:58       ` Andrei Borzenkov
@ 2019-02-12 21:48         ` Chris Murphy
  2019-02-12 22:11           ` Zygo Blaxell
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Murphy @ 2019-02-12 21:48 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Zygo Blaxell, Filipe Manana, linux-btrfs

Is it possibly related to the zlib library being used on
Debian/Ubuntu? The fact that you've got even one reproducer with the
exact same hash for the transient error case means it's not hardware
or random error, let alone two independent reproducers.

And then what happens if you do the exact same test but change to zstd
or lzo? No error? Strictly zlib?
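
For reference, switching algorithms should only need a remount and a
fresh run of the reproducer; a sketch assuming the filesystem is
mounted at /test (zstd needs a 4.14+ kernel):

	for alg in zlib zstd lzo; do
		mount -o remount,compress=$alg /test
		(cd /test && rm -f am && ./repro-hole-corruption-test)
	done
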

--
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 21:48         ` Chris Murphy
@ 2019-02-12 22:11           ` Zygo Blaxell
  2019-02-12 22:53             ` Chris Murphy
  0 siblings, 1 reply; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-12 22:11 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Andrei Borzenkov, Filipe Manana, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4033 bytes --]

On Tue, Feb 12, 2019 at 02:48:38PM -0700, Chris Murphy wrote:
> Is it possibly related to the zlib library being used on
> Debian/Ubuntu? That you've got even one reproducer with the exact same
> hash for the transient error case means it's not hardware or random
> error; let alone two independent reproducers.

The errors are not consistent between runs.  The above pattern is quite
common, but it is not the only possible output.  Add in other processes
reading the 'am' file at the same time and it gets very random.

The bad data tends to have entire extents missing, replaced with zeros.
That leads to a small number of possible outputs (each extent seems
either to have its data or to be zero filled).  It does seem to be a lot more
consistent in recent (post 4.14.80) kernels, which may be interesting.

Here is an example of a diff between two copies of the 'am' file copied
while the repro script was running, filtered through hd:

	# diff -u /tmp/f1 /tmp/f2
	--- /tmp/f1     2019-02-12 17:05:14.861844871 -0500
	+++ /tmp/f2     2019-02-12 17:05:16.883868402 -0500
	@@ -56,10 +56,6 @@
	 *
	 00020000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-00021000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-00022000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 00023000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 00024000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -268,10 +264,6 @@
	 *
	 000a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-000a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-000a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 000a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 000a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -688,10 +680,6 @@
	 *
	 001a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-001a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-001a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 001a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 001a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -1524,10 +1512,6 @@
	 *
	 003a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-003a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-003a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 003a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 003a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -3192,10 +3176,6 @@
	 *
	 007a0000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-007a1000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-007a2000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	 007a3000  12 12 12 12 12 12 12 12  12 12 12 12 12 12 12 12  |................|
	 *
	 007a4000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	@@ -5016,10 +4996,6 @@
	 *
	 00c00000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	 *
	-00c01000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
	-*
	-00c02000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
	-*
	[etc...you get the idea]

I'm not sure how the zlib library is involved--sha1sum doesn't use one.

> And then what happens if you do the exact same test but change to zstd
> or lzo? No error? Strictly zlib?

Same errors on all three btrfs compression algorithms (as mentioned in
the original post from August 2018).
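
For anyone re-running that comparison, a sketch (this assumes the test
filesystem is mounted at /test with the repro script in it; compress=
set on remount only affects newly written data, which is fine here
because the script rewrites 'am' from scratch):

	for alg in zlib zstd lzo; do
		mount -o remount,compress=$alg /test
		rm -f /test/am
		(cd /test && ./repro-hole-corruption-test)
	done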

> --
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 22:11           ` Zygo Blaxell
@ 2019-02-12 22:53             ` Chris Murphy
  2019-02-13  2:46               ` Zygo Blaxell
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Murphy @ 2019-02-12 22:53 UTC (permalink / raw)
  To: Zygo Blaxell, Btrfs BTRFS

On Tue, Feb 12, 2019 at 3:11 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Feb 12, 2019 at 02:48:38PM -0700, Chris Murphy wrote:
> > Is it possibly related to the zlib library being used on
> > Debian/Ubuntu? That you've got even one reproducer with the exact same
> > hash for the transient error case means it's not hardware or random
> > error; let alone two independent reproducers.
>
> The errors are not consistent between runs.  The above pattern is quite
> common, but it is not the only possible output.  Add in other processes
> reading the 'am' file at the same time and it gets very random.
>
> The bad data tends to have entire extents missing, replaced with zeros.
> That leads to a small number of possible outputs (the choices seem to be
> only to have the data or have the zeros).  It does seem to be a lot more
> consistent in recent (post 4.14.80) kernels, which may be interesting.
>
> Here is an example of a diff between two copies of the 'am' file copied
> while the repro script was running, filtered through hd:
>
>         # diff -u /tmp/f1 /tmp/f2
>         [snip hexdump diff]

And yet the file is delivered to user space, despite the changes, as
if it's immune to checksum computation or matching. The data is
clearly different, so how is it bypassing checksumming? Data csums are
based on original uncompressed data, correct? So any holes are zeros,
there are still csums for those holes?

>
> I'm not sure how the zlib library is involved--sha1sum doesn't use one.
>
> > And then what happens if you do the exact same test but change to zstd
> > or lzo? No error? Strictly zlib?
>
> Same errors on all three btrfs compression algorithms (as mentioned in
> the original post from August 2018).

Obviously there is a pattern. It's not random. I just don't know what
it looks like. I've used compression for years now, mostly zstd lately
and a mix of lzo and zlib before that, but have never seen any errors
or corruption. But I also never use holes, never punch holes, and
rarely use fallocated files, which I guess isn't quite the same thing
as hole punching.

So the bug you're reproducing is for sure 100% not on the media
itself; it's somehow transiently being interpreted differently on
roughly 1 in 10 reads, but with a pattern. What about scrub? Do you
get errors every 1 in 10 scrubs? Or how does it manifest? No scrub
errors?

I know very little about what parts of the kernel a file system
depends on outside of its own code (e.g. the page cache), but I wonder
if there's something outside of Btrfs that's the source but never gets
triggered because no other file systems use compression. Huh - what
file system uses compression *and* hole punching? squashfs? Is sparse
file support different from hole punching?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 22:53             ` Chris Murphy
@ 2019-02-13  2:46               ` Zygo Blaxell
  0 siblings, 0 replies; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-13  2:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 5093 bytes --]

On Tue, Feb 12, 2019 at 03:53:53PM -0700, Chris Murphy wrote:
> And yet the file is delivered to user space, despite the changes, as
> if it's immune to checksum computation or matching. The data is
> clearly different, so how is it bypassing checksumming? Data csums are
> based on original uncompressed data, correct? So any holes are zeros,
> there are still csums for those holes?

csums in btrfs protect data blocks.  Holes are the absence of data blocks,
so there are no csums for holes.

There are no csums for extent references either--only csums on the extent
data that is referenced.  Since this bug affects processing of extent
refs, it must occur long after all the csums are verified.

> > I'm not sure how the zlib library is involved--sha1sum doesn't use one.
> >
> > > And then what happens if you do the exact same test but change to zstd
> > > or lzo? No error? Strictly zlib?
> >
> > Same errors on all three btrfs compression algorithms (as mentioned in
> > the original post from August 2018).
> 
> Obviously there is a pattern. It's not random. I just don't know what
> it looks like. 

Without knowing the root cause I can only speculate, but it does seem to
be random, just very heavily biased toward some outcomes.  It will produce
more distinct sha1sum values the longer you run it, especially if there
is other activity on the system to perturb the kernel a bit.  If you make
the test file bigger you can have more combinations of outputs.

I also note that since the big batch of btrfs bug fixes that landed
near 4.14.80, the variation between runs seems to be a lot less than
with earlier kernels; however, the full range of random output values
(i.e. which extents of the file disappear) still seems to be possible; it
just takes longer to get distinct values.  I'm not sure that information
helps to form a theory of how the bug operates.

> I've used compression for years now, mostly zstd lately and a mix
> of lzo and zlib before that, but have never seen any errors or
> corruption. But I also never use holes, never punch holes, and
> rarely use fallocated files, which I guess isn't quite the same
> thing as hole punching.

I covered this in August.  The original thread was:

	https://www.spinics.net/lists/linux-btrfs/msg81293.html

TL;DR you won't see this problem unless you have a single compressed
extent that is split by a hole--an artifact that can only be produced by
punching holes, cloning, or dedupe.  The cases users are most likely to
encounter are dedupe and hole-punching--I don't know of any applications
in real-world use that do cloning the right way to trigger this problem.

Also, you haven't mentioned whether you've successfully reproduced this
yourself yet.

> So the bug you're reproducing is for sure 100% not on the media
> itself; it's somehow transiently being interpreted differently on
> roughly 1 in 10 reads, but with a pattern. What about scrub? Do you
> get errors every 1 in 10 scrubs? Or how does it manifest? No scrub
> errors?

No errors in scrub--nor should there be.  The data is correct on disk,
and it can be read reliably if you don't use the kernel btrfs code to
read it through extent refs (scrub reads the data items directly, so
scrub never looks at data through extent refs).

btrfs just drops some of the data when reading it to userspace.

> I know very little about what parts of the kernel a file system
> depends on outside of its own code (e.g. the page cache), but I wonder
> if there's something outside of Btrfs that's the source but never gets
> triggered because no other file systems use compression. Huh - what
> file system uses compression *and* hole punching? squashfs? Is sparse
> file support different from hole punching?

Traditional sparse file support leaves blocks in a file unallocated until
they are written to, i.e. you do something like:

	write(64K)
	seek(80K)
	write(48K)

and you get a 16K hole between two extents (or contiguous block ranges
if your filesystem doesn't have a formal extent concept per se):

	data(64k)
	hole(16k)
	data(48k)
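
A minimal shell sketch of the same sequence (the file name is made up;
dd's seek= skips over the 64K..80K range without writing it, so that
range becomes a hole):

	head -c 65536 /dev/urandom > sparsefile            # write(64K)
	dd if=/dev/urandom of=sparsefile bs=4096 seek=20 \
		count=12 conv=notrunc                      # seek(80K); write(48K)
	du -h --apparent-size sparsefile                   # 128K apparent size
	du -h sparsefile                                   # ~112K allocated; the hole takes no space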

Traditional POSIX sparse files have no way to release extents in the
middle of a file without changing the length of the file.  You can
fill in the holes with data later, but you can't delete existing data
and replace it with holes.  If you wanted to punch holes in a file, you
used to do it by making a copy of the file, omitting any of the data
blocks that contained all zeros, then renaming the copy over the
original file.

The hole punch operation adds the capability to delete existing data
in place, e.g. you can say "punch a hole at 24K, length 8K" and the
filesystem will look like:

	data(24k) (originally part of first 64K extent)
	hole(8k)
	data(32k) (originally part of first 64K extent)
	hole(16k)
	data(48k)

On btrfs, the 24k and 32k data chunks of the file are both references
to pieces of the original 64k extent, which is not modified on disk,
but 8K of it is no longer accessible.
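
A sketch of that operation from the shell, using util-linux fallocate
(the repro script's "fallocate -d" instead digs holes automatically
wherever it finds runs of zero blocks; 'sparsefile' is the made-up
file from the sketch above):

	# punch an 8K hole at offset 24K without changing the file length
	fallocate --punch-hole --offset $((24*1024)) --length $((8*1024)) sparsefile
	filefrag -v sparsefile    # the extent map now shows the hole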

> -- 
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 18:13         ` Zygo Blaxell
@ 2019-02-13  7:24           ` Qu Wenruo
  2019-02-13 17:36           ` Filipe Manana
  1 sibling, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2019-02-13  7:24 UTC (permalink / raw)
  To: Zygo Blaxell, Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 3282 bytes --]



On 2019/2/13 2:13 AM, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>> [snip]
>>>
>>> What is your test environment?  I can try that here.
>>
>> Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. 
> 
> I have several environments like that...
> 
>> Always built from source kernels.
> 
> ...that could be a relevant difference.  Have you tried a stock
> Debian kernel?

I'm afraid you may need to use an upstream vanilla kernel rather than
a distro kernel, especially for distros that may carry heavy backports.
I also ran my own tests, using the Arch stock kernel (pretty vanilla)
and an upstream kernel, on both my host and a VM.
No reproduction either.

The upstream community is mostly focused on the upstream vanilla
kernel.  Bugs from distro kernels can sometimes be a good clue to
existing upstream bugs, but when digging deeper, a vanilla kernel is
always necessary.

Would you mind reproducing it in an environment that is as vanilla as
possible, e.g. a vanilla kernel and vanilla user space programs?

Thanks,
Qu



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
  2019-02-12 15:33   ` Christoph Anton Mitterer
  2019-02-12 15:35   ` Filipe Manana
@ 2019-02-13  7:47   ` Roman Mamedov
  2019-02-13  8:04     ` Qu Wenruo
  2 siblings, 1 reply; 25+ messages in thread
From: Roman Mamedov @ 2019-02-13  7:47 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Mon, 11 Feb 2019 22:09:02 -0500
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:

> Still reproducible on 4.20.7.
> 
> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
> which makes the problem a bit more difficult to detect.
> 
> 	# repro-hole-corruption-test
> 	i: 91, status: 0, bytes_deduped: 131072
> 	i: 92, status: 0, bytes_deduped: 131072
> 	i: 93, status: 0, bytes_deduped: 131072
> 	i: 94, status: 0, bytes_deduped: 131072
> 	i: 95, status: 0, bytes_deduped: 131072
> 	i: 96, status: 0, bytes_deduped: 131072
> 	i: 97, status: 0, bytes_deduped: 131072
> 	i: 98, status: 0, bytes_deduped: 131072
> 	i: 99, status: 0, bytes_deduped: 131072
> 	13107200 total bytes deduped in this operation
> 	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 	94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

Seems like I can reproduce it as well. Vanilla 4.14.97 with .config loosely
based on Debian's.

$ sudo ./repro-hole-corruption-test 
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4.8 MiB (4964352 bytes) converted to sparse holes.
c5f25fc2b88eaab504a403465658c67f4669261e am
1d9aacd4ee38ab7db46c44e0d74cee163222e105 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

The above is on a 3TB spinning disk. But on a 512GB NVMe SSD I even got the
same checksums as you did.

$ sudo ./repro-hole-corruption-test 
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4.8 MiB (4964352 bytes) converted to sparse holes.
94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

In my case neither filesystem is mounted with compression; just chattr +c on
the directory with the script is enough to see the issue.
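
For reference, a minimal sketch of that setup (the /mnt/test mount
point is hypothetical; compsize is the tool mentioned elsewhere in
this thread):

	mkdir /mnt/test/cdir
	chattr +c /mnt/test/cdir    # new files inside inherit the compress attribute
	cp repro-hole-corruption-test /mnt/test/cdir/
	cd /mnt/test/cdir && ./repro-hole-corruption-test
	compsize am                 # confirm 'am' really is compressed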

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-13  7:47   ` Roman Mamedov
@ 2019-02-13  8:04     ` Qu Wenruo
  0 siblings, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2019-02-13  8:04 UTC (permalink / raw)
  To: Roman Mamedov, Zygo Blaxell; +Cc: linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 3627 bytes --]



On 2019/2/13 3:47 PM, Roman Mamedov wrote:
> On Mon, 11 Feb 2019 22:09:02 -0500
> Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> 
>> Still reproducible on 4.20.7.
>>
>> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
>> which makes the problem a bit more difficult to detect.
>>
>> [snip]
> 
> Seems like I can reproduce it as well. Vanilla 4.14.97 with .config loosely
> based on Debian's.
> 
> [snip]
> 
> The above is on a 3TB spinning disk. But on a 512GB NVMe SSD I even got the
> same checksums as you did.
> 
> [snip]
> 
> In my case neither filesystem is mounted with compression;

OK, I forgot the compression mount option.

Now I can reproduce it too, on both host and VM.
I'll try to make the test case minimal enough to avoid too much noise
during testing.

Thanks,
Qu

> just chattr +c on
> the directory with the script is enough to see the issue.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-12 18:13         ` Zygo Blaxell
  2019-02-13  7:24           ` Qu Wenruo
@ 2019-02-13 17:36           ` Filipe Manana
  2019-02-13 18:14             ` Filipe Manana
  1 sibling, 1 reply; 25+ messages in thread
From: Filipe Manana @ 2019-02-13 17:36 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > >
> > > > > Still reproducible on 4.20.7.
> > > >
> > > > I tried your reproducer when you first reported it, on different
> > > > machines with different kernel versions.
> > >
> > > That would have been useful to know last August...  :-/
> > >
> > > > Never managed to reproduce it, nor see anything obviously wrong in
> > > > relevant code paths.
> > >
> > > I built a fresh VM running Debian stretch and
> > > reproduced the issue immediately.  Mount options are
> > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > > probably doesn't matter.
> > >
> > > I don't have any configuration that can't reproduce this issue, so I don't
> > > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > > All of these reproduce the issue immediately--wrong sha1sum appears in
> > > the first 10 loops.
> > >
> > > What is your test environment?  I can try that here.
> >
> > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc.
>
> I have several environments like that...
>
> > Always built from source kernels.
>
> ...that could be a relevant difference.  Have you tried a stock
> Debian kernel?
>
> > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> > that kept running the test in an infinite loop during those weeks.
> > Don't recall what were the kernel versions (whatever was the latest at
> > the time), but that shouldn't matter according to what you say.
>
> That's an extremely long time compared to the rate of occurrence
> of this bug.  It should appear in only a few seconds of testing.
> Some data-hole-data patterns reproduce much slower (change the position
> of "block 0" lines in the setup script), but "slower" is minutes,
> not machine-months.
>
> Is your filesystem compressed?  Does compsize show the test
> file 'am' is compressed during the test?  Is the sha1sum you get
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> when a second process reads the file while the sha1sum/drop_caches loop
> is running?

Tried it today and I reproduced it (different VM, but still Debian
and a kernel built from source).
Not sure what was different last time. Yes, I had compression enabled.

I'll look into it.

> [snip]



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-13 17:36           ` Filipe Manana
@ 2019-02-13 18:14             ` Filipe Manana
  2019-02-14  1:22               ` Filipe Manana
  0 siblings, 1 reply; 25+ messages in thread
From: Filipe Manana @ 2019-02-13 18:14 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > >
> > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > > >
> > > > > > Still reproducible on 4.20.7.
> > > > >
> > > > > I tried your reproducer when you first reported it, on different
> > > > > machines with different kernel versions.
> > > >
> > > > That would have been useful to know last August...  :-/
> > > >
> > > > > Never managed to reproduce it, nor see anything obviously wrong in
> > > > > relevant code paths.
> > > >
> > > > I built a fresh VM running Debian stretch and
> > > > reproduced the issue immediately.  Mount options are
> > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > > > probably doesn't matter.
> > > >
> > > > I don't have any configuration that can't reproduce this issue, so I don't
> > > > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > > > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > > > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > > > All of these reproduce the issue immediately--wrong sha1sum appears in
> > > > the first 10 loops.
> > > >
> > > > What is your test environment?  I can try that here.
> > >
> > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc.
> >
> > I have several environments like that...
> >
> > > Always built from source kernels.
> >
> > ...that could be a relevant difference.  Have you tried a stock
> > Debian kernel?
> >
> > > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms
> > > that kept running the test in an infinite loop during those weeks.
> > > Don't recall what were the kernel versions (whatever was the latest at
> > > the time), but that shouldn't matter according to what you say.
> >
> > That's an extremely long time compared to the rate of occurrence
> > of this bug.  It should appear in only a few seconds of testing.
> > Some data-hole-data patterns reproduce much slower (change the position
> > of "block 0" lines in the setup script), but "slower" is minutes,
> > not machine-months.
> >
> > Is your filesystem compressed?  Does compsize show the test
> > file 'am' is compressed during the test?  Is the sha1sum you get
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> > when a second process reads the file while the sha1sum/drop_caches loop
> > is running?
>
> Tried it today and I reproduced it (different VM, but still Debian
> and a kernel built from source).
> Not sure what was different last time. Yes, I had compression enabled.
>
> I'll look into it.

So the problem is caused by hole punching. The script can be reduced
to the following:

https://friendpaste.com/22t4OdktHQTl0aMGxckc86

file size: 384K am
digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
262144 total bytes deduped in this operation
digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
am: 24 KiB (24576 bytes) converted to sparse holes.
digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am

So hole punching is screwing things up, and we can only see the bug
after dropping the page cache.
I'll likely send a fix tomorrow.

> [snip]



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-13 18:14             ` Filipe Manana
@ 2019-02-14  1:22               ` Filipe Manana
  2019-02-14  5:00                 ` Zygo Blaxell
  2019-02-14 12:21                 ` Christoph Anton Mitterer
  0 siblings, 2 replies; 25+ messages in thread
From: Filipe Manana @ 2019-02-14  1:22 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote:
> >
> > On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote:
> > > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell
> > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > >
> > > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
> > > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
> > > > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > > > >
> > > > > > > Still reproducible on 4.20.7.
> > > > > >
> > > > > > I tried your reproducer when you first reported it, on different
> > > > > > machines with different kernel versions.
> > > > >
> > > > > That would have been useful to know last August...  :-/
> > > > >
> > > > > > Never managed to reproduce it, nor see anything obviously wrong in
> > > > > > relevant code paths.
> > > > >
> > > > > I built a fresh VM running Debian stretch and
> > > > > reproduced the issue immediately.  Mount options are
> > > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/".  Kernel is
> > > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> > > > > probably doesn't matter.
> > > > >
> > > > > I don't have any configuration that can't reproduce this issue, so I don't
> > > > > know how to help you.  I've tested AMD and Intel CPUs, VM, baremetal,
> > > > > hardware ranging in age from 0 to 9 years.  Locally built kernels from
> > > > > 4.1 to 4.20 and the stock Debian kernel (4.9).  SSDs and spinning rust.
> > > > > All of these reproduce the issue immediately--wrong sha1sum appears in
> > > > > the first 10 loops.
> > > > >
> > > > > What is your test environment?  I can try that here.
> > > >
> > > > Debian unstable, all QEMU VMs, 4 CPUs, 4G to 8G RAM, IIRC.
> > >
> > > I have several environments like that...
> > >
> > > > Always kernels built from source.
> > >
> > > ...that could be a relevant difference.  Have you tried a stock
> > > Debian kernel?
> > >
> > > > When you reported it, I tested this for 1 to 2 weeks in 2 or 3 VMs
> > > > that kept running the test in an infinite loop during those weeks.
> > > > Don't recall what the kernel versions were (whatever was the latest
> > > > at the time), but that shouldn't matter according to what you say.
> > >
> > > That's an extremely long time compared to the rate of occurrence
> > > of this bug.  It should appear in only a few seconds of testing.
> > > Some data-hole-data patterns reproduce much slower (change the position
> > > of "block 0" lines in the setup script), but "slower" is minutes,
> > > not machine-months.
> > >
> > > Is your filesystem compressed?  Does compsize show the test
> > > file 'am' is compressed during the test?  Is the sha1sum you get
> > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4?  Does the sha1sum change
> > > when a second process reads the file while the sha1sum/drop_caches loop
> > > is running?
> >
> > Tried it today and reproduced it (different VM, but still Debian and
> > a kernel built from source).
> > Not sure what was different last time. Yes, I had compression enabled.
> >
> > I'll look into it.
>
> So the problem is caused by hole punching. The script can be reduced
> to the following:
>
> https://friendpaste.com/22t4OdktHQTl0aMGxckc86
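
(A minimal sketch of what the reduced script might look like, since the
paste link may not outlive the archive.  The exact block pattern is an
assumption; the sizes below are chosen to match the quoted output that
follows: a 384K file, 262144 bytes deduped, 24 KiB punched out.)

        #!/bin/bash
        # Write one 4096-byte block filled with octal byte \NN (0 = zeros)
        block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

        # One 128K unit: 32 blocks, two of them zero, so that fallocate -d
        # can punch 2 x 4096 bytes per unit (3 units -> 24 KiB of holes)
        unit () {
                block 0; block 21; block 0; block 22
                for x in $(seq 1 28); do block 44; done
        }

        # 3 x 128K = 384K test file on a compressed btrfs mount
        for i in 1 2 3; do unit; done > am
        sync
        echo "digests after file creation:   $(sha1sum am)"

        # Dedupe units 2 and 3 against unit 1 (2 x 131072 = 262144 bytes)
        btrfs-extent-same 131072 am 0 am 131072 am 262144 2>&1 | tail -1

        # Punch holes over the zero blocks, then hash before and after
        # dropping the page cache
        fallocate -v -d am
        echo "digests after hole punching:   $(sha1sum am)"
        sysctl -q vm.drop_caches=3
        echo "digests after hole punching 2: $(sha1sum am)"
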
>
> file size: 384K am
> digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> 262144 total bytes deduped in this operation
> digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> am: 24 KiB (24576 bytes) converted to sparse holes.
> digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am
>
> > So hole punching is screwing things up, and only after dropping the
> > page cache can we see the bug.
> I'll send a fix likely tomorrow.

So it turns out it's a problem in the compressed extent read path,
a variant of a bug I found back in 2015:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7

The following one-liner fixes it:
https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3

While you test it there (if you want/can), I'll write a change log and
a proper test case for fstests and submit them later.

Thanks!
[...]



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14  1:22               ` Filipe Manana
@ 2019-02-14  5:00                 ` Zygo Blaxell
  2019-02-14 12:21                 ` Christoph Anton Mitterer
  1 sibling, 0 replies; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-14  5:00 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 15389 bytes --]

On Thu, Feb 14, 2019 at 01:22:49AM +0000, Filipe Manana wrote:
> On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote:
[...]
> > > Tried it today and reproduced it (different VM, but still Debian and
> > > a kernel built from source).
> > > Not sure what was different last time. Yes, I had compression enabled.
> > >
> > > I'll look into it.
> >
> > So the problem is caused by hole punching. The script can be reduced
> > to the following:
> >
> > https://friendpaste.com/22t4OdktHQTl0aMGxckc86
> >
> > file size: 384K am
> > digests after file creation:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > 262144 total bytes deduped in this operation
> > digests after dedupe:          7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after dedupe 2:        7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > am: 24 KiB (24576 bytes) converted to sparse holes.
> > digests after hole punching:   7c8349cc657fbe61af53fbc5cfacae6e9a402e83  am
> > digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da  am
> >
> > So hole punching is screwing things up, and only after dropping the
> > page cache can we see the bug.
> > I'll send a fix likely tomorrow.
> 
> So it turns out it's a problem in the compressed extent read path,
> a variant of a bug I found back in 2015:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7
> 
> The following one-liner fixes it:
> https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
> 
> While you test it there (if you want/can), I'll write a change log and
> a proper test case for fstests and submit them later.

Works here (and produces the correct sha1sum, which turns out to be
dae78e303edfb8b8ad64ecae01dc1bf233770cfd).

Nice work!

> Thanks!
[...]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]


* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14  1:22               ` Filipe Manana
  2019-02-14  5:00                 ` Zygo Blaxell
@ 2019-02-14 12:21                 ` Christoph Anton Mitterer
  2019-02-15  5:40                   ` Zygo Blaxell
  2019-02-15 12:02                   ` Filipe Manana
  1 sibling, 2 replies; 25+ messages in thread
From: Christoph Anton Mitterer @ 2019-02-14 12:21 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> The following one-liner fixes it:
> https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3

Great to see that fixed... is there any advice that can be given for
users/admins?


Like whether and how any corruption that has occurred can be detected
(right now, people may still have backups)?


Or under which exact circumstances did the corruption happen? And under
which was one safe?
E.g. only with specific compression algorithms (I've been using -o
compress (which should be zlib) for quite a while but never found any
corruption),... or only when specific file operations were done (I did
e.g. cp with reflink copies, but I think none of the standard tools does
hole-punching)?


Cheers,
Chris.



* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14 12:21                 ` Christoph Anton Mitterer
@ 2019-02-15  5:40                   ` Zygo Blaxell
  2019-02-15 12:02                   ` Filipe Manana
  1 sibling, 0 replies; 25+ messages in thread
From: Zygo Blaxell @ 2019-02-15  5:40 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4815 bytes --]

On Thu, Feb 14, 2019 at 01:21:29PM +0100, Christoph Anton Mitterer wrote:
> On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> > The following one-liner fixes it:
> > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
> 
> Great to see that fixed... is there any advice that can be given for
> users/admins?
> 
> 
> Like whether and how any corruption that has occurred can be detected
> (right now, people may still have backups)?

The problem occurs only on reads.  Data that is written to disk will
be OK, and can be read correctly by a fixed kernel.

A kernel without the fix will give corrupt data on reads with no
indication of corruption other than the changes to the data itself.

Applications that copy data may read corrupted data and write it back
to the filesystem.  This will make the corruption permanent in the
copied data.

Given the age of the bug, backups that can be corrupted by this bug
probably already are.  Verify files against internal CRC/hashes where
possible.  The original files are likely to be OK, since the bug does
not affect writes.  If your situation has the risk factors listed below,
it may be worthwhile to create a fresh set of non-incremental backups
after applying the kernel fix.
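
One crude check for unstable reads on an unpatched kernel is the same
technique the reproducer uses: hash the file repeatedly with a cold page
cache and look for more than one distinct digest.  A sketch (the path is
a placeholder):

        for i in $(seq 1 10); do
                sysctl -q vm.drop_caches=3      # force reads from disk
                sha1sum /path/to/suspect/file
        done | sort -u
        # More than one line of output means reads are unstable.  A
        # single line is not proof of absence--the corruption is racy.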

> Or under which exact circumstances did the corruption happen? And under
> which was one safe?

Compression is required to trigger the bug, so you are safe if you (or
the applications you run) never enabled filesystem compression.  Even if
compression is enabled, the file data must be compressed for the bug to
corrupt it.  Incompressible data extents will never be affected by
this bug.
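
Whether a file actually contains compressed extents can be checked with
compsize, mentioned earlier in this thread (a sketch, assuming compsize
is installed and the paths are placeholders):

        compsize /path/to/file   # per-algorithm breakdown of disk usage
        compsize /path/to/dir    # recurses; if only "none" shows up,
                                 # there are no compressed extents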

If you do use compression, you are still safe if:

	- you never punch holes in files

	- you never dedupe or clone files

If you do use compression and do the other things, the probability of
corruption by this particular bug is non-zero.  Whether you get corruption
and how often depends on the technical details of what you're doing.

To get corruption you have to have one data extent that is split into
two parts by punching a hole, or an extent that is cloned/deduped into
two parts at adjacent logical offsets in the same file.  Both of these
methods create the on-disk pattern that triggers the bug.
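
For concreteness, both layouts can be produced like this on a compressed
btrfs mount (a sketch; the sizes are illustrative, and xfs_io is just one
of several tools that can clone a range):

        # 1) Hole punch: one compressed extent split in two by a new hole
        head -c 16384 /dev/zero | tr '\0' 'a' > f    # one 16K extent
        sync
        fallocate -p -o 4096 -l 8192 f    # punch 8K out of the middle

        # 2) Clone/dedupe: one extent referenced on both sides of a hole
        head -c 4096 /dev/zero | tr '\0' 'a' > g
        truncate -s 16384 g               # hole from 4K to 16K
        sync
        xfs_io -c "reflink g 0 12288 4096" g    # block 0 again at 12K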

Files that consist entirely of unique data will not be affected by dedupe
so will not trigger the bug that way.  Files that consist partially of
unique data may or may not be affected depending on the dedupe tool,
data alignment, etc.

> E.g. only with specific compression algorithms (I've been using -o
> compress (which should be zlib) for quite a while but never found any

All decompression algorithms are affected.  The bug is in the generic btrfs
decompression handling, so it is not limited to any single algorithm.

Compression (i.e. writing) is not affected--whatever data is written to
disk should be readable correctly with a fixed kernel.

> corruption),... or only when specific file operations were done (I did
> e.g. cp with reflink copies, but I think none of the standard tools does
> hole-punching)?

That depends on whether you consider fallocate or qemu to be standard
tools.  The hole-punching function has been a feature of several Linux
filesystems for some years now, so we can expect it to be more widely
adopted over time.  You'd have to do an audit to be sure none of the
tools you use are punching holes.

"Ordinary" sparse files (made by seeking forward while writing, as done
by older Unix utilities including cp, tar, rsync, cpio, binutils) do not
trigger this bug.  An ordinary sparse file has two distinct data extents
from two different writes separated by a hole which has never contained
file data.  A punched hole splits an existing single data extent into two
pieces with a newly created hole between them that replaces previously
existing file data.  These actions create different extent reference
patterns and only the hole-punching one is affected by the bug.
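
The safe layout, for contrast, can be produced like this (a sketch): the
hole comes from extending the file without writing, and the data on each
side comes from two separate writes, hence two distinct extents:

        head -c 4096 /dev/zero | tr '\0' 'a' > sparse    # extent A at 0
        truncate -s 12288 sparse     # seek-style hole, never held data
        head -c 4096 /dev/zero | tr '\0' 'b' >> sparse   # extent B at 12K
        sync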

Files that contain no blocks full of zeros will not be affected by
fallocate -d style hole punching (it searches for existing zeros and
punches holes over them--no zeros, no holes).  If the hole punching
intentionally introduces zeros where zeros did not exist before (e.g. qemu
discard operations on raw image files) then it may trigger the bug.
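
A quick illustration of that distinction (a sketch):

        head -c 16384 /dev/zero | tr '\0' 'x' > nozeros
        sync
        fallocate -v -d nozeros      # no zero blocks: punches nothing

        { head -c 4096 /dev/zero | tr '\0' 'x'
          head -c 8192 /dev/zero                # 8K of literal zeros
          head -c 4096 /dev/zero | tr '\0' 'x'; } > somezeros
        sync
        fallocate -v -d somezeros    # punches the middle 8 KiB: the
                                     # extent-hole-extent trigger pattern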

btrfs send and receive may be affected, but I don't use them so I don't
have any experience of the bug related to these tools.  It seems from
reading the btrfs receive code that it lacks any code capable of punching
a hole, but I'm only doing a quick search for words like "punch", not
a detailed code analysis.

bees continues to be an awesome tool for discovering btrfs kernel bugs.
It compresses, dedupes, *and* punches holes.

> 
> Cheers,
> Chris.
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]


* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
  2019-02-14 12:21                 ` Christoph Anton Mitterer
  2019-02-15  5:40                   ` Zygo Blaxell
@ 2019-02-15 12:02                   ` Filipe Manana
  1 sibling, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2019-02-15 12:02 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Thu, Feb 14, 2019 at 11:10 PM Christoph Anton Mitterer
<calestyo@scientia.net> wrote:
>
> On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> > The following one-liner fixes it:
> > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
>
> Great to see that fixed... is there any advice that can be given for
> users/admins?

Upgrade to a kernel with the patch (no release has it yet) or build one
from source? Not sure what kind of advice you are looking for.

>
>
> Like whether and how any corruption that has occurred can be detected
> (right now, people may still have backups)?
>
>
> Or under which exact circumstances did the corruption happen? And under
> which was one safe?
> E.g. only with specific compression algorithms (I've been using -o
> compress (which should be zlib) for quite a while but never found any
> corruption),... or only when specific file operations were done (I did
> e.g. cp with reflink copies, but I think none of the standard tools does
> hole-punching)?

As I said in the previous reply, and in the patch's changelog [1], the
corruption happens at read time.
That means nothing stored on disk is corrupted. It's not the end of the world.

[1] https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/

>
>
> Cheers,
> Chris.
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2018-08-23  3:11 Reproducer for "compressed data + hole data corruption bug, 2018 editiion" Zygo Blaxell
2018-08-23  5:10 ` Qu Wenruo
2018-08-23 16:44   ` Zygo Blaxell
2018-08-23 23:50     ` Qu Wenruo
2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell
2019-02-12 15:33   ` Christoph Anton Mitterer
2019-02-12 15:35   ` Filipe Manana
2019-02-12 17:01     ` Zygo Blaxell
2019-02-12 17:56       ` Filipe Manana
2019-02-12 18:13         ` Zygo Blaxell
2019-02-13  7:24           ` Qu Wenruo
2019-02-13 17:36           ` Filipe Manana
2019-02-13 18:14             ` Filipe Manana
2019-02-14  1:22               ` Filipe Manana
2019-02-14  5:00                 ` Zygo Blaxell
2019-02-14 12:21                 ` Christoph Anton Mitterer
2019-02-15  5:40                   ` Zygo Blaxell
2019-02-15 12:02                   ` Filipe Manana
2019-02-12 18:58       ` Andrei Borzenkov
2019-02-12 21:48         ` Chris Murphy
2019-02-12 22:11           ` Zygo Blaxell
2019-02-12 22:53             ` Chris Murphy
2019-02-13  2:46               ` Zygo Blaxell
2019-02-13  7:47   ` Roman Mamedov
2019-02-13  8:04     ` Qu Wenruo
