* Reproducer for "compressed data + hole data corruption bug, 2018 edition" @ 2018-08-23  3:11  Zygo Blaxell
  2018-08-23  5:10 ` Qu Wenruo
  2019-02-12  3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7  Zygo Blaxell

From: Zygo Blaxell
To: linux-btrfs

This is a repro script for a btrfs bug that causes corrupted data reads
when reading a mix of compressed extents and holes. The bug is
reproducible on at least kernels v4.1..v4.18.

Some more observations and background follow, but first here is the
script and some sample output:

root@rescue:/test# cat repro-hole-corruption-test
#!/bin/bash

# Write a 4096 byte block of something
block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

# Here is some test data with holes in it:
for y in $(seq 0 100); do
	for x in 0 1; do
		block 0;
		block 21;
		block 0;
		block 22;
		block 0;
		block 0;
		block 43;
		block 44;
		block 0;
		block 0;
		block 61;
		block 62;
		block 63;
		block 64;
		block 65;
		block 66;
	done
done > am
sync

# Now replace those 101 distinct extents with 101 references to the first extent
btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

# Punch holes into the extent refs
fallocate -v -d am

# Do some other stuff on the machine while this runs, and watch the sha1sums change!
while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done

root@rescue:/test# ./repro-hole-corruption-test
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4.8 MiB (4964352 bytes) converted to sparse holes.
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
072a152355788c767b97e4e4c0e4567720988b84 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
bf00d862c6ad436a1be2be606a8ab88d22166b89 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
0d44cdf030fb149e103cfdc164da3da2b7474c17 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
60831f0e7ffe4b49722612c18685c09f4583b1df am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
a19662b294a3ccdf35dbb18fdd72c62018526d7d am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
^C

Corruption occurs most often when there is a sequence like this in a file:

	ref 1: hole
	ref 2: extent A, offset 0
	ref 3: hole
	ref 4: extent A, offset 8192

This scenario typically arises due to hole-punching or deduplication.
Hole-punching replaces one extent ref with two references to the same
extent with a hole between them, so:

	ref 1: extent A, offset 0, length 16384

becomes:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent A, offset 12288, length 4096

Deduplication replaces two distinct extent refs surrounding a hole with
two references to one of the duplicate extents, turning this:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent B, offset 0, length 4096

into this:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent A, offset 0, length 4096

Compression is required (zlib, zstd, or lzo) for corruption to occur.
I am not able to reproduce the issue with an uncompressed extent, nor
have I observed any such corruption in the wild.

The presence or absence of the no-holes filesystem feature has no effect.

Ordinary writes can lead to pairs of extent references to the same extent
separated by a reference to a different extent; however, in this case
there is data to be read from a real extent, instead of pages that have
to be zero filled from a hole. If ordinary non-hole writes could trigger
this bug, every page-oriented database engine would be crashing all the
time on btrfs with compression enabled, and it's unlikely that would not
have been noticed between 2015 and now. An ordinary write that splits
an extent ref would look like this:

	ref 1: extent A, offset 0, length 4096
	ref 2: extent C, offset 0, length 8192
	ref 3: extent A, offset 12288, length 4096

Sparse writes can lead to pairs of extent references surrounding a hole;
however, in this case the extent references will point to different
extents, avoiding the bug. If a sparse write could trigger the bug,
the rsync -S option and qemu/kvm 'raw' disk image files (among many
other tools that produce sparse files) would be unusable, and it's
unlikely that would not have been noticed between 2015 and now either.
Sparse writes look like this:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent B, offset 0, length 4096

The pattern or timing of read() calls seems to be relevant. It is very
hard to see the corruption when reading files with 'hd', but 'cat | hd'
will see the corruption just fine. Similar problems exist with 'cmp'
but not 'sha1sum'. Two processes reading the same file at the same time
seem to trigger the corruption very frequently.

Some patterns of holes and data produce corruption faster than others.
The pattern generated by the script above is based on instances of
corruption I've found in the wild, and has a much better repro rate than
random holes.

The corruption occurs during reads, after csum verification and before
decompression, so btrfs detects no csum failures. The data on disk
seems to be OK and could be read correctly once the kernel bug is fixed.
Repeated reads do eventually return correct data, but there is no way
for userspace to distinguish between corrupt and correct data reliably.

The corrupted data is usually data replaced by a hole or a copy of other
blocks in the same extent.

The behavior is similar to some earlier bugs related to holes and
compressed data in btrfs, but it's new and not fixed yet--hence,
"2018 edition."
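The data generator in the script above can be sanity-checked on any filesystem, without btrfs; a minimal sketch (POSIX sh and coreutils only, temp path arbitrary) confirming the file size that the btrfs-extent-same offsets (multiples of 131072) rely on, namely 101 iterations of 2 x 16 x 4096 bytes:

```shell
#!/bin/sh
# Sanity-check sketch for the data generator above: build the same
# pattern on an ordinary filesystem and confirm the total size is
# 101 * 131072 = 13238272 bytes, matching the dedupe offsets.
block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

f=$(mktemp)
for y in $(seq 0 100); do
	for x in 0 1; do
		block 0;  block 21; block 0;  block 22;
		block 0;  block 0;  block 43; block 44;
		block 0;  block 0;  block 61; block 62;
		block 63; block 64; block 65; block 66;
	done
done > "$f"

size=$(wc -c < "$f")
echo "size=$size expected=$((101 * 131072))"
rm -f "$f"
```

Each `block N` call emits 4096 bytes of the octal character N, so the data is highly compressible, which is what makes the extents compress on btrfs.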
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
From: Qu Wenruo @ 2018-08-23  5:10
To: Zygo Blaxell, linux-btrfs

On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> This is a repro script for a btrfs bug that causes corrupted data reads
> when reading a mix of compressed extents and holes. The bug is
> reproducible on at least kernels v4.1..v4.18.

This bug already sounds more serious than the previous nodatasum +
compression bug.

> Some more observations and background follow, but first here is the
> script and some sample output:
>
> root@rescue:/test# cat repro-hole-corruption-test
> #!/bin/bash
>
> # Write a 4096 byte block of something
> block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>
> # Here is some test data with holes in it:
> for y in $(seq 0 100); do
> 	for x in 0 1; do
> 		block 0;
> 		block 21;
> 		block 0;
> 		block 22;
> 		block 0;
> 		block 0;
> 		block 43;
> 		block 44;
> 		block 0;
> 		block 0;
> 		block 61;
> 		block 62;
> 		block 63;
> 		block 64;
> 		block 65;
> 		block 66;
> 	done

Does the content make any difference to this bug?
It's just a 16 * 4K * 2 * 101 data write *without* any holes so far.

This should indeed create 101 128K compressed data extents.
But I'm wondering about the description of 'holes'.

> done > am
> sync
>
> # Now replace those 101 distinct extents with 101 references to the first extent
> btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

Will this bug still happen by creating one extent and then reflinking it
101 times?

> # Punch holes into the extent refs
> fallocate -v -d am

The hole-punching in fact happens here.

BTW, will adding a "sync" here change the result?

> # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>
> root@rescue:/test# ./repro-hole-corruption-test
> i: 91, status: 0, bytes_deduped: 131072
> i: 92, status: 0, bytes_deduped: 131072
> i: 93, status: 0, bytes_deduped: 131072
> i: 94, status: 0, bytes_deduped: 131072
> i: 95, status: 0, bytes_deduped: 131072
> i: 96, status: 0, bytes_deduped: 131072
> i: 97, status: 0, bytes_deduped: 131072
> i: 98, status: 0, bytes_deduped: 131072
> i: 99, status: 0, bytes_deduped: 131072
> 13107200 total bytes deduped in this operation
> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 072a152355788c767b97e4e4c0e4567720988b84 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 60831f0e7ffe4b49722612c18685c09f4583b1df am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> ^C

It looks like we have something wrong in interpreting file extents,
maybe related to extent map merging.

BTW, if no read corruption happens without dropping the page cache, that
would narrow down the range of the problem we're looking for.

Thanks,
Qu

> Corruption occurs most often when there is a sequence like this in a file:
>
> 	ref 1: hole
> 	ref 2: extent A, offset 0
> 	ref 3: hole
> 	ref 4: extent A, offset 8192
>
> This scenario typically arises due to hole-punching or deduplication.
> Hole-punching replaces one extent ref with two references to the same
> extent with a hole between them, so:
>
> 	ref 1: extent A, offset 0, length 16384
>
> becomes:
>
> 	ref 1: extent A, offset 0, length 4096
> 	ref 2: hole, length 8192
> 	ref 3: extent A, offset 12288, length 4096
>
> Deduplication replaces two distinct extent refs surrounding a hole with
> two references to one of the duplicate extents, turning this:
>
> 	ref 1: extent A, offset 0, length 4096
> 	ref 2: hole, length 8192
> 	ref 3: extent B, offset 0, length 4096
>
> into this:
>
> 	ref 1: extent A, offset 0, length 4096
> 	ref 2: hole, length 8192
> 	ref 3: extent A, offset 0, length 4096
>
> Compression is required (zlib, zstd, or lzo) for corruption to occur.
> I am not able to reproduce the issue with an uncompressed extent, nor
> have I observed any such corruption in the wild.
>
> The presence or absence of the no-holes filesystem feature has no effect.
>
> Ordinary writes can lead to pairs of extent references to the same extent
> separated by a reference to a different extent; however, in this case
> there is data to be read from a real extent, instead of pages that have
> to be zero filled from a hole. If ordinary non-hole writes could trigger
> this bug, every page-oriented database engine would be crashing all the
> time on btrfs with compression enabled, and it's unlikely that would not
> have been noticed between 2015 and now. An ordinary write that splits
> an extent ref would look like this:
>
> 	ref 1: extent A, offset 0, length 4096
> 	ref 2: extent C, offset 0, length 8192
> 	ref 3: extent A, offset 12288, length 4096
>
> Sparse writes can lead to pairs of extent references surrounding a hole;
> however, in this case the extent references will point to different
> extents, avoiding the bug. If a sparse write could trigger the bug,
> the rsync -S option and qemu/kvm 'raw' disk image files (among many
> other tools that produce sparse files) would be unusable, and it's
> unlikely that would not have been noticed between 2015 and now either.
> Sparse writes look like this:
>
> 	ref 1: extent A, offset 0, length 4096
> 	ref 2: hole, length 8192
> 	ref 3: extent B, offset 0, length 4096
>
> The pattern or timing of read() calls seems to be relevant. It is very
> hard to see the corruption when reading files with 'hd', but 'cat | hd'
> will see the corruption just fine. Similar problems exist with 'cmp'
> but not 'sha1sum'. Two processes reading the same file at the same time
> seem to trigger the corruption very frequently.
>
> Some patterns of holes and data produce corruption faster than others.
> The pattern generated by the script above is based on instances of
> corruption I've found in the wild, and has a much better repro rate than
> random holes.
>
> The corruption occurs during reads, after csum verification and before
> decompression, so btrfs detects no csum failures. The data on disk
> seems to be OK and could be read correctly once the kernel bug is fixed.
> Repeated reads do eventually return correct data, but there is no way
> for userspace to distinguish between corrupt and correct data reliably.
>
> The corrupted data is usually data replaced by a hole or a copy of other
> blocks in the same extent.
>
> The behavior is similar to some earlier bugs related to holes and
> compressed data in btrfs, but it's new and not fixed yet--hence,
> "2018 edition."
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
From: Zygo Blaxell @ 2018-08-23 16:44
To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes. The bug is
> > reproducible on at least kernels v4.1..v4.18.
>
> This bug already sounds more serious than the previous nodatasum +
> compression bug.

Maybe. "compression + holes corruption bug 2017" could be avoided with
the max-inline=0 mount option without disabling compression. This time,
the workaround is more intrusive: avoid all applications that use dedup
or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> >
> > root@rescue:/test# cat repro-hole-corruption-test
> > #!/bin/bash
> >
> > # Write a 4096 byte block of something
> > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> >
> > # Here is some test data with holes in it:
> > for y in $(seq 0 100); do
> > 	for x in 0 1; do
> > 		block 0;
> > 		block 21;
> > 		block 0;
> > 		block 22;
> > 		block 0;
> > 		block 0;
> > 		block 43;
> > 		block 44;
> > 		block 0;
> > 		block 0;
> > 		block 61;
> > 		block 62;
> > 		block 63;
> > 		block 64;
> > 		block 65;
> > 		block 66;
> > 	done
>
> Does the content make any difference to this bug?
> It's just a 16 * 4K * 2 * 101 data write *without* any holes so far.

The content of the extents doesn't seem to matter, other than it needs to
be compressible so that the extents on disk are compressed. The bug is
also triggered by writing non-zero data to all blocks, and then punching
the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".

The layout of the extents matters a lot. I have to loop hundreds or
thousands of times to hit the bug if the first block in the pattern is
not a hole, or if the non-hole extents are different sizes or positions
than above.

I tried random patterns of holes and extent refs, and most of them have
an order of magnitude lower hit rates than the above. This might be due
to some relationship between the alignment of read() request boundaries
with extent boundaries, but I haven't done any tests designed to detect
such a relationship.

In the wild, corruption happens on some files much more often than others.
This seems to be correlated with the extent layout as well.

I discovered the bug by examining files that were intermittently but
repeatedly failing routine data integrity checks, and found that in every
case they had similar hole + extent patterns near the point where data
was corrupted.

I did a search on some big filesystems for the
hole-refExtentA-hole-refExtentA pattern and found several files with
this pattern that had passed previous data integrity checks, but would
fail randomly in the sha1sum/drop-caches loop.

> This should indeed create 101 128K compressed data extents.
> But I'm wondering about the description of 'holes'.

The holes are coming, wait for it... ;)

> > done > am
> > sync
> >
> > # Now replace those 101 distinct extents with 101 references to the first extent
> > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>
> Will this bug still happen by creating one extent and then reflinking it
> 101 times?

Yes. I used btrfs-extent-same because a binary is included in the
Debian duperemove package, but I use it only for convenience.
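The reflink variant asked about above can be expressed with xfs_io's reflink command (FICLONERANGE, from xfsprogs) in place of btrfs-extent-same; a hypothetical, untested substitution using the same offsets as the repro script:

```shell
#!/bin/sh
# Hypothetical substitution for the btrfs-extent-same step: clone the
# first 128 KiB extent of "am" over each of the other 100 positions.
# xfs_io syntax: reflink <src_file> <src_off> <dst_off> <len>.
# Untested as a reproducer for this bug; "am" is the script's file.
for x in $(seq 1 100); do
	xfs_io -c "reflink am 0 $((x * 131072)) 131072" am
done
```

Dedupe (FIDEDUPERANGE) compares the ranges before sharing them, while reflink clones unconditionally; either way the result is many references to one extent, which is the layout the bug needs.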
It's not necessary to have hundreds of references to the same extent--even
two refs to a single extent plus a hole can trigger the bug sometimes.
100 references in a single file will trigger the bug so often that it
can be detected within the first 20 sha1sum loops.

When the corruption occurs, it affects around 90 of the original 101
extents. The different sha1sum results are due to different extents
giving bad data on different runs.

> > # Punch holes into the extent refs
> > fallocate -v -d am
>
> The hole-punching in fact happens here.
>
> BTW, will adding a "sync" here change the result?

No. You can reboot the machine here if you like; it does not change
anything that happens during reads later.

Looking at the extent tree in btrfs-debug-tree, the data on disk
looks correct, and btrfs does read it correctly most of the time (the
correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
The corruption therefore comes from btrfs read() producing incorrect
data in some instances.

> > # Do some other stuff on the machine while this runs, and watch the sha1sums change!
> > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
> >
> > root@rescue:/test# ./repro-hole-corruption-test
> > i: 91, status: 0, bytes_deduped: 131072
> > i: 92, status: 0, bytes_deduped: 131072
> > i: 93, status: 0, bytes_deduped: 131072
> > i: 94, status: 0, bytes_deduped: 131072
> > i: 95, status: 0, bytes_deduped: 131072
> > i: 96, status: 0, bytes_deduped: 131072
> > i: 97, status: 0, bytes_deduped: 131072
> > i: 98, status: 0, bytes_deduped: 131072
> > i: 99, status: 0, bytes_deduped: 131072
> > 13107200 total bytes deduped in this operation
> > am: 4.8 MiB (4964352 bytes) converted to sparse holes.
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 072a152355788c767b97e4e4c0e4567720988b84 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > bf00d862c6ad436a1be2be606a8ab88d22166b89 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 60831f0e7ffe4b49722612c18685c09f4583b1df am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > a19662b294a3ccdf35dbb18fdd72c62018526d7d am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> > ^C
>
> It looks like we have something wrong in interpreting file extents,
> maybe related to extent map merging.
>
> BTW, if no read corruption happens without dropping the page cache, that
> would narrow down the range of the problem we're looking for.

The page cache drop makes reproduction easier/faster. If you don't drop
caches, you have to wait for the data to be evicted from page cache or
the data from read() will not change.

In the wild, if I do a sha1sum loop on a few hundred GB of data known
to have the hole-extent-hole pattern (so the pages are evicted between
sha1sum runs), I see similar results without explicitly dropping caches.

If you read the file with a cold cache from two processes at once
(e.g. you run 'hd am' while the sha1sum/drop-cache loop is running)
the data changes faster (different on 90% of reads instead of just 20%).

> Thanks,
> Qu
>
> > Corruption occurs most often when there is a sequence like this in a file:
> >
> > 	ref 1: hole
> > 	ref 2: extent A, offset 0
> > 	ref 3: hole
> > 	ref 4: extent A, offset 8192
> >
> > This scenario typically arises due to hole-punching or deduplication.
> > Hole-punching replaces one extent ref with two references to the same
> > extent with a hole between them, so:
> >
> > 	ref 1: extent A, offset 0, length 16384
> >
> > becomes:
> >
> > 	ref 1: extent A, offset 0, length 4096
> > 	ref 2: hole, length 8192
> > 	ref 3: extent A, offset 12288, length 4096
> >
> > Deduplication replaces two distinct extent refs surrounding a hole with
> > two references to one of the duplicate extents, turning this:
> >
> > 	ref 1: extent A, offset 0, length 4096
> > 	ref 2: hole, length 8192
> > 	ref 3: extent B, offset 0, length 4096
> >
> > into this:
> >
> > 	ref 1: extent A, offset 0, length 4096
> > 	ref 2: hole, length 8192
> > 	ref 3: extent A, offset 0, length 4096
> >
> > Compression is required (zlib, zstd, or lzo) for corruption to occur.
> > I am not able to reproduce the issue with an uncompressed extent, nor
> > have I observed any such corruption in the wild.
> >
> > The presence or absence of the no-holes filesystem feature has no effect.
> >
> > Ordinary writes can lead to pairs of extent references to the same extent
> > separated by a reference to a different extent; however, in this case
> > there is data to be read from a real extent, instead of pages that have
> > to be zero filled from a hole. If ordinary non-hole writes could trigger
> > this bug, every page-oriented database engine would be crashing all the
> > time on btrfs with compression enabled, and it's unlikely that would not
> > have been noticed between 2015 and now. An ordinary write that splits
> > an extent ref would look like this:
> >
> > 	ref 1: extent A, offset 0, length 4096
> > 	ref 2: extent C, offset 0, length 8192
> > 	ref 3: extent A, offset 12288, length 4096
> >
> > Sparse writes can lead to pairs of extent references surrounding a hole;
> > however, in this case the extent references will point to different
> > extents, avoiding the bug. If a sparse write could trigger the bug,
> > the rsync -S option and qemu/kvm 'raw' disk image files (among many
> > other tools that produce sparse files) would be unusable, and it's
> > unlikely that would not have been noticed between 2015 and now either.
> > Sparse writes look like this:
> >
> > 	ref 1: extent A, offset 0, length 4096
> > 	ref 2: hole, length 8192
> > 	ref 3: extent B, offset 0, length 4096
> >
> > The pattern or timing of read() calls seems to be relevant. It is very
> > hard to see the corruption when reading files with 'hd', but 'cat | hd'
> > will see the corruption just fine. Similar problems exist with 'cmp'
> > but not 'sha1sum'. Two processes reading the same file at the same time
> > seem to trigger the corruption very frequently.
> >
> > Some patterns of holes and data produce corruption faster than others.
> > The pattern generated by the script above is based on instances of
> > corruption I've found in the wild, and has a much better repro rate than
> > random holes.
> >
> > The corruption occurs during reads, after csum verification and before
> > decompression, so btrfs detects no csum failures. The data on disk
> > seems to be OK and could be read correctly once the kernel bug is fixed.
> > Repeated reads do eventually return correct data, but there is no way
> > for userspace to distinguish between corrupt and correct data reliably.
> >
> > The corrupted data is usually data replaced by a hole or a copy of other
> > blocks in the same extent.
> >
> > The behavior is similar to some earlier bugs related to holes and
> > compressed data in btrfs, but it's new and not fixed yet--hence,
> > "2018 edition."
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
From: Qu Wenruo @ 2018-08-23 23:50
To: Zygo Blaxell; +Cc: linux-btrfs

On 2018/8/24 12:44 AM, Zygo Blaxell wrote:
> On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
>> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>> when reading a mix of compressed extents and holes. The bug is
>>> reproducible on at least kernels v4.1..v4.18.
>>
>> This bug already sounds more serious than the previous nodatasum +
>> compression bug.
>
> Maybe. "compression + holes corruption bug 2017" could be avoided with
> the max-inline=0 mount option without disabling compression. This time,
> the workaround is more intrusive: avoid all applications that use dedup
> or hole-punching.
>
>>> Some more observations and background follow, but first here is the
>>> script and some sample output:
>>>
>>> root@rescue:/test# cat repro-hole-corruption-test
>>> #!/bin/bash
>>>
>>> # Write a 4096 byte block of something
>>> block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>
>>> # Here is some test data with holes in it:
>>> for y in $(seq 0 100); do
>>> 	for x in 0 1; do
>>> 		block 0;
>>> 		block 21;
>>> 		block 0;
>>> 		block 22;
>>> 		block 0;
>>> 		block 0;
>>> 		block 43;
>>> 		block 44;
>>> 		block 0;
>>> 		block 0;
>>> 		block 61;
>>> 		block 62;
>>> 		block 63;
>>> 		block 64;
>>> 		block 65;
>>> 		block 66;
>>> 	done
>>
>> Does the content make any difference to this bug?
>> It's just a 16 * 4K * 2 * 101 data write *without* any holes so far.
>
> The content of the extents doesn't seem to matter, other than it needs to
> be compressible so that the extents on disk are compressed. The bug is
> also triggered by writing non-zero data to all blocks, and then punching
> the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".
>
> The layout of the extents matters a lot. I have to loop hundreds or
> thousands of times to hit the bug if the first block in the pattern is
> not a hole, or if the non-hole extents are different sizes or positions
> than above.
>
> I tried random patterns of holes and extent refs, and most of them have
> an order of magnitude lower hit rates than the above. This might be due
> to some relationship between the alignment of read() request boundaries
> with extent boundaries, but I haven't done any tests designed to detect
> such a relationship.
>
> In the wild, corruption happens on some files much more often than others.
> This seems to be correlated with the extent layout as well.
>
> I discovered the bug by examining files that were intermittently but
> repeatedly failing routine data integrity checks, and found that in every
> case they had similar hole + extent patterns near the point where data
> was corrupted.
>
> I did a search on some big filesystems for the
> hole-refExtentA-hole-refExtentA pattern and found several files with
> this pattern that had passed previous data integrity checks, but would
> fail randomly in the sha1sum/drop-caches loop.
>
>> This should indeed create 101 128K compressed data extents.
>> But I'm wondering about the description of 'holes'.
>
> The holes are coming, wait for it... ;)
>
>>> done > am
>>> sync
>>>
>>> # Now replace those 101 distinct extents with 101 references to the first extent
>>> btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>>
>> Will this bug still happen by creating one extent and then reflinking it
>> 101 times?
>
> Yes. I used btrfs-extent-same because a binary is included in the
> Debian duperemove package, but I use it only for convenience.
>
> It's not necessary to have hundreds of references to the same extent--even
> two refs to a single extent plus a hole can trigger the bug sometimes.
> 100 references in a single file will trigger the bug so often that it
> can be detected within the first 20 sha1sum loops.
>
> When the corruption occurs, it affects around 90 of the original 101
> extents. The different sha1sum results are due to different extents
> giving bad data on different runs.
>
>>> # Punch holes into the extent refs
>>> fallocate -v -d am
>>
>> The hole-punching in fact happens here.
>>
>> BTW, will adding a "sync" here change the result?
>
> No. You can reboot the machine here if you like; it does not change
> anything that happens during reads later.

So it looks like my assumption of a bad file extent interpreter is
getting more and more valid.

It has nothing to do with a race against hole punching or writes, but
only with the file layout and the extent map cache.

> Looking at the extent tree in btrfs-debug-tree, the data on disk
> looks correct, and btrfs does read it correctly most of the time (the
> correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
> The corruption therefore comes from btrfs read() producing incorrect
> data in some instances.
>
>>> # Do some other stuff on the machine while this runs, and watch the sha1sums change!
>>> while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>>>
>>> root@rescue:/test# ./repro-hole-corruption-test
>>> i: 91, status: 0, bytes_deduped: 131072
>>> i: 92, status: 0, bytes_deduped: 131072
>>> i: 93, status: 0, bytes_deduped: 131072
>>> i: 94, status: 0, bytes_deduped: 131072
>>> i: 95, status: 0, bytes_deduped: 131072
>>> i: 96, status: 0, bytes_deduped: 131072
>>> i: 97, status: 0, bytes_deduped: 131072
>>> i: 98, status: 0, bytes_deduped: 131072
>>> i: 99, status: 0, bytes_deduped: 131072
>>> 13107200 total bytes deduped in this operation
>>> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 072a152355788c767b97e4e4c0e4567720988b84 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> bf00d862c6ad436a1be2be606a8ab88d22166b89 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 0d44cdf030fb149e103cfdc164da3da2b7474c17 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 60831f0e7ffe4b49722612c18685c09f4583b1df am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> a19662b294a3ccdf35dbb18fdd72c62018526d7d am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> ^C
>>
>> It looks like we have something wrong in interpreting file extents,
>> maybe related to extent map merging.
>>
>> BTW, if no read corruption happens without dropping the page cache, that
>> would narrow down the range of the problem we're looking for.
>
> The page cache drop makes reproduction easier/faster. If you don't drop
> caches, you have to wait for the data to be evicted from page cache or
> the data from read() will not change.

So it's highly possible that the file extent interpreter is causing the
problem.

Thanks,
Qu

> In the wild, if I do a sha1sum loop on a few hundred GB of data known
> to have the hole-extent-hole pattern (so the pages are evicted between
> sha1sum runs), I see similar results without explicitly dropping caches.
>
> If you read the file with a cold cache from two processes at once
> (e.g. you run 'hd am' while the sha1sum/drop-cache loop is running)
> the data changes faster (different on 90% of reads instead of just 20%).
>
>> Thanks,
>> Qu
>>
>>> Corruption occurs most often when there is a sequence like this in a file:
>>>
>>> 	ref 1: hole
>>> 	ref 2: extent A, offset 0
>>> 	ref 3: hole
>>> 	ref 4: extent A, offset 8192
>>>
>>> This scenario typically arises due to hole-punching or deduplication.
>>> Hole-punching replaces one extent ref with two references to the same
>>> extent with a hole between them, so:
>>>
>>> 	ref 1: extent A, offset 0, length 16384
>>>
>>> becomes:
>>>
>>> 	ref 1: extent A, offset 0, length 4096
>>> 	ref 2: hole, length 8192
>>> 	ref 3: extent A, offset 12288, length 4096
>>>
>>> Deduplication replaces two distinct extent refs surrounding a hole with
>>> two references to one of the duplicate extents, turning this:
>>>
>>> 	ref 1: extent A, offset 0, length 4096
>>> 	ref 2: hole, length 8192
>>> 	ref 3: extent B, offset 0, length 4096
>>>
>>> into this:
>>>
>>> 	ref 1: extent A, offset 0, length 4096
>>> 	ref 2: hole, length 8192
>>> 	ref 3: extent A, offset 0, length 4096
>>>
>>> Compression is required (zlib, zstd, or lzo) for corruption to occur.
>>> I am not able to reproduce the issue with an uncompressed extent, nor
>>> have I observed any such corruption in the wild.
>>>
>>> The presence or absence of the no-holes filesystem feature has no effect.
>>>
>>> Ordinary writes can lead to pairs of extent references to the same extent
>>> separated by a reference to a different extent; however, in this case
>>> there is data to be read from a real extent, instead of pages that have
>>> to be zero filled from a hole.
If ordinary non-hole writes could trigger >>> this bug, every page-oriented database engine would be crashing all the >>> time on btrfs with compression enabled, and it's unlikely that would not >>> have been noticed between 2015 and now. An ordinary write that splits >>> an extent ref would look like this: >>> >>> ref 1: extent A, offset 0, length 4096 >>> ref 2: extent C, offset 0, length 8192 >>> ref 3: extent A, offset 12288, length 4096 >>> >>> Sparse writes can lead to pairs of extent references surrounding a hole; >>> however, in this case the extent references will point to different >>> extents, avoiding the bug. If a sparse write could trigger the bug, >>> the rsync -S option and qemu/kvm 'raw' disk image files (among many >>> other tools that produce sparse files) would be unusable, and it's >>> unlikely that would not have been noticed between 2015 and now either. >>> Sparse writes look like this: >>> >>> ref 1: extent A, offset 0, length 4096 >>> ref 2: hole, length 8192 >>> ref 3: extent B, offset 0, length 4096 >>> >>> The pattern or timing of read() calls seems to be relevant. It is very >>> hard to see the corruption when reading files with 'hd', but 'cat | hd' >>> will see the corruption just fine. Similar problems exist with 'cmp' >>> but not 'sha1sum'. Two processes reading the same file at the same time >>> seem to trigger the corruption very frequently. >>> >>> Some patterns of holes and data produce corruption faster than others. >>> The pattern generated by the script above is based on instances of >>> corruption I've found in the wild, and has a much better repro rate than >>> random holes. >>> >>> The corruption occurs during reads, after csum verification and before >>> decompression, so btrfs detects no csum failures. The data on disk >>> seems to be OK and could be read correctly once the kernel bug is fixed. 
>>> Repeated reads do eventually return correct data, but there is no way >>> for userspace to distinguish between corrupt and correct data reliably. >>> >>> The corrupted data is usually data replaced by a hole or a copy of other >>> blocks in the same extent. >>> >>> The behavior is similar to some earlier bugs related to holes and >>> compressed data in btrfs, but it's new and not fixed yet--hence, >>> "2018 edition." >>> >> > > > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
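The hole-punching transformation described in the message above (one 16384-byte ref becoming ref + hole + ref) is plain interval arithmetic. As a toy model only (the helper name and layout strings are illustrative, not btrfs code):

```shell
# Toy model of the hole-punch split described in the message: punching a
# hole [beg, beg+len) into a single ref covering extent A at [0, reflen)
# leaves up to two refs to the same extent with a hole between them.
punch_hole() {  # args: hole_offset hole_length ref_length (bytes)
    beg=$1; len=$2; reflen=$3
    [ "$beg" -gt 0 ] && echo "ref: extent A, offset 0, length $beg"
    echo "ref: hole, length $len"
    end=$((beg + len))
    [ "$end" -lt "$reflen" ] && echo "ref: extent A, offset $end, length $((reflen - end))"
    return 0
}
```

Running `punch_hole 4096 8192 16384` reproduces the three-ref layout quoted in the message: offsets 0 and 12288, with an 8192-byte hole between them.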
* Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2018-08-23 3:11 Reproducer for "compressed data + hole data corruption bug, 2018 edition" Zygo Blaxell 2018-08-23 5:10 ` Qu Wenruo @ 2019-02-12 3:09 ` Zygo Blaxell 2019-02-12 15:33 ` Christoph Anton Mitterer ` (2 more replies) 1 sibling, 3 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-02-12 3:09 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 8454 bytes --] Still reproducible on 4.20.7. The behavior is slightly different on current kernels (4.20.7, 4.14.96) which makes the problem a bit more difficult to detect. # repro-hole-corruption-test i: 91, status: 0, bytes_deduped: 131072 i: 92, status: 0, bytes_deduped: 131072 i: 93, status: 0, bytes_deduped: 131072 i: 94, status: 0, bytes_deduped: 131072 i: 95, status: 0, bytes_deduped: 131072 i: 96, status: 0, bytes_deduped: 131072 i: 97, status: 0, bytes_deduped: 131072 i: 98, status: 0, bytes_deduped: 131072 i: 99, status: 0, bytes_deduped: 131072 13107200 total bytes deduped in this operation am: 4.8 MiB (4964352 bytes) converted to sparse holes. 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am The sha1sum seems stable after the first drop_caches--until a second process tries to read the test file: 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am # cat am > /dev/null (in another shell) 19294e695272c42edb89ceee24bb08c13473140a am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > This is a repro script for a btrfs bug that causes corrupted data reads > when reading a mix of compressed extents and holes.
The bug is > reproducible on at least kernels v4.1..v4.18. > > Some more observations and background follow, but first here is the > script and some sample output: > > root@rescue:/test# cat repro-hole-corruption-test > #!/bin/bash > > # Write a 4096 byte block of something > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > # Here is some test data with holes in it: > for y in $(seq 0 100); do > for x in 0 1; do > block 0; > block 21; > block 0; > block 22; > block 0; > block 0; > block 43; > block 44; > block 0; > block 0; > block 61; > block 62; > block 63; > block 64; > block 65; > block 66; > done > done > am > sync > > # Now replace those 101 distinct extents with 101 references to the first extent > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > # Punch holes into the extent refs > fallocate -v -d am > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > root@rescue:/test# ./repro-hole-corruption-test > i: 91, status: 0, bytes_deduped: 131072 > i: 92, status: 0, bytes_deduped: 131072 > i: 93, status: 0, bytes_deduped: 131072 > i: 94, status: 0, bytes_deduped: 131072 > i: 95, status: 0, bytes_deduped: 131072 > i: 96, status: 0, bytes_deduped: 131072 > i: 97, status: 0, bytes_deduped: 131072 > i: 98, status: 0, bytes_deduped: 131072 > i: 99, status: 0, bytes_deduped: 131072 > 13107200 total bytes deduped in this operation > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 072a152355788c767b97e4e4c0e4567720988b84 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 60831f0e7ffe4b49722612c18685c09f4583b1df am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > ^C > > Corruption occurs most often when there is a sequence like this in a file: > > ref 1: hole > ref 2: extent A, offset 0 > ref 3: hole > ref 4: extent A, offset 8192 > > This scenario typically arises due to hole-punching or deduplication. 
> Hole-punching replaces one extent ref with two references to the same > extent with a hole between them, so: > > ref 1: extent A, offset 0, length 16384 > > becomes: > > ref 1: extent A, offset 0, length 4096 > ref 2: hole, length 8192 > ref 3: extent A, offset 12288, length 4096 > > Deduplication replaces two distinct extent refs surrounding a hole with > two references to one of the duplicate extents, turning this: > > ref 1: extent A, offset 0, length 4096 > ref 2: hole, length 8192 > ref 3: extent B, offset 0, length 4096 > > into this: > > ref 1: extent A, offset 0, length 4096 > ref 2: hole, length 8192 > ref 3: extent A, offset 0, length 4096 > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > I am not able to reproduce the issue with an uncompressed extent nor > have I observed any such corruption in the wild. > > The presence or absence of the no-holes filesystem feature has no effect. > > Ordinary writes can lead to pairs of extent references to the same extent > separated by a reference to a different extent; however, in this case > there is data to be read from a real extent, instead of pages that have > to be zero filled from a hole. If ordinary non-hole writes could trigger > this bug, every page-oriented database engine would be crashing all the > time on btrfs with compression enabled, and it's unlikely that would not > have been noticed between 2015 and now. An ordinary write that splits > an extent ref would look like this: > > ref 1: extent A, offset 0, length 4096 > ref 2: extent C, offset 0, length 8192 > ref 3: extent A, offset 12288, length 4096 > > Sparse writes can lead to pairs of extent references surrounding a hole; > however, in this case the extent references will point to different > extents, avoiding the bug. 
If a sparse write could trigger the bug, > the rsync -S option and qemu/kvm 'raw' disk image files (among many > other tools that produce sparse files) would be unusable, and it's > unlikely that would not have been noticed between 2015 and now either. > Sparse writes look like this: > > ref 1: extent A, offset 0, length 4096 > ref 2: hole, length 8192 > ref 3: extent B, offset 0, length 4096 > > The pattern or timing of read() calls seems to be relevant. It is very > hard to see the corruption when reading files with 'hd', but 'cat | hd' > will see the corruption just fine. Similar problems exist with 'cmp' > but not 'sha1sum'. Two processes reading the same file at the same time > seem to trigger the corruption very frequently. > > Some patterns of holes and data produce corruption faster than others. > The pattern generated by the script above is based on instances of > corruption I've found in the wild, and has a much better repro rate than > random holes. > > The corruption occurs during reads, after csum verification and before > decompression, so btrfs detects no csum failures. The data on disk > seems to be OK and could be read correctly once the kernel bug is fixed. > Repeated reads do eventually return correct data, but there is no way > for userspace to distinguish between corrupt and correct data reliably. > > The corrupted data is usually data replaced by a hole or a copy of other > blocks in the same extent. > > The behavior is similar to some earlier bugs related to holes and > compressed data in btrfs, but it's new and not fixed yet--hence, > "2018 edition." [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
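As a quick cross-check of the figures the repro script prints (my arithmetic, not from the thread): the inner 16-block pattern contains 6 zero blocks, each y-iteration writes one 131072-byte extent, 100 of the 101 extents get deduped against the first, and the totals line up with the reported 13107200 bytes deduped and 4964352 bytes converted to holes:

```shell
# Sanity-check the figures printed by the repro script (4096-byte blocks).
BLOCK=4096
ZEROS_PER_16=6                         # pattern 0,21,0,22,0,0,43,44,0,0,61..66
EXTENT=$((2 * 16 * BLOCK))             # 131072 bytes written per y-iteration
FILE_SIZE=$((101 * EXTENT))            # size of 'am'
DEDUPED=$((100 * EXTENT))              # refs 1..100 deduped against extent 0
HOLES=$((101 * 2 * ZEROS_PER_16 * BLOCK))  # zero bytes fallocate -d can punch
echo "file=$FILE_SIZE deduped=$DEDUPED holes=$HOLES"
```

This prints file=13238272 deduped=13107200 holes=4964352, matching the script output quoted above.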
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell @ 2019-02-12 15:33 ` Christoph Anton Mitterer 2019-02-12 15:35 ` Filipe Manana 2019-02-13 7:47 ` Roman Mamedov 2 siblings, 0 replies; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-02-12 15:33 UTC (permalink / raw) To: linux-btrfs Hey. Sounds like a highly severe (and long-standing) bug? Is anyone doing anything about it? Cheers, Chris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell 2019-02-12 15:33 ` Christoph Anton Mitterer @ 2019-02-12 15:35 ` Filipe Manana 2019-02-12 17:01 ` Zygo Blaxell 2019-02-13 7:47 ` Roman Mamedov 2 siblings, 1 reply; 38+ messages in thread From: Filipe Manana @ 2019-02-12 15:35 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > Still reproducible on 4.20.7. I tried your reproducer when you first reported it, on different machines with different kernel versions. Never managed to reproduce it, nor see anything obviously wrong in relevant code paths. > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > which makes the problem a bit more difficult to detect. > > # repro-hole-corruption-test > i: 91, status: 0, bytes_deduped: 131072 > i: 92, status: 0, bytes_deduped: 131072 > i: 93, status: 0, bytes_deduped: 131072 > i: 94, status: 0, bytes_deduped: 131072 > i: 95, status: 0, bytes_deduped: 131072 > i: 96, status: 0, bytes_deduped: 131072 > i: 97, status: 0, bytes_deduped: 131072 > i: 98, status: 0, bytes_deduped: 131072 > i: 99, status: 0, bytes_deduped: 131072 > 13107200 total bytes deduped in this operation > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > The sha1sum seems stable after the first drop_caches--until a second > process tries to read the test file: > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > # cat am > /dev/null (in another shell) > 19294e695272c42edb89ceee24bb08c13473140a am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > This is a repro script for a btrfs bug that causes corrupted data reads > > when reading a mix of compressed extents and holes. The bug is > > reproducible on at least kernels v4.1..v4.18. > > > > Some more observations and background follow, but first here is the > > script and some sample output: > > > > root@rescue:/test# cat repro-hole-corruption-test > > #!/bin/bash > > > > # Write a 4096 byte block of something > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > # Here is some test data with holes in it: > > for y in $(seq 0 100); do > > for x in 0 1; do > > block 0; > > block 21; > > block 0; > > block 22; > > block 0; > > block 0; > > block 43; > > block 44; > > block 0; > > block 0; > > block 61; > > block 62; > > block 63; > > block 64; > > block 65; > > block 66; > > done > > done > am > > sync > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > # Punch holes into the extent refs > > fallocate -v -d am > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! 
> > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > root@rescue:/test# ./repro-hole-corruption-test > > i: 91, status: 0, bytes_deduped: 131072 > > i: 92, status: 0, bytes_deduped: 131072 > > i: 93, status: 0, bytes_deduped: 131072 > > i: 94, status: 0, bytes_deduped: 131072 > > i: 95, status: 0, bytes_deduped: 131072 > > i: 96, status: 0, bytes_deduped: 131072 > > i: 97, status: 0, bytes_deduped: 131072 > > i: 98, status: 0, bytes_deduped: 131072 > > i: 99, status: 0, bytes_deduped: 131072 > > 13107200 total bytes deduped in this operation > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 072a152355788c767b97e4e4c0e4567720988b84 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 60831f0e7ffe4b49722612c18685c09f4583b1df am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > ^C > 
> > > Corruption occurs most often when there is a sequence like this in a file: > > > > ref 1: hole > > ref 2: extent A, offset 0 > > ref 3: hole > > ref 4: extent A, offset 8192 > > > > This scenario typically arises due to hole-punching or deduplication. > > Hole-punching replaces one extent ref with two references to the same > > extent with a hole between them, so: > > > > ref 1: extent A, offset 0, length 16384 > > > > becomes: > > > > ref 1: extent A, offset 0, length 4096 > > ref 2: hole, length 8192 > > ref 3: extent A, offset 12288, length 4096 > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > two references to one of the duplicate extents, turning this: > > > > ref 1: extent A, offset 0, length 4096 > > ref 2: hole, length 8192 > > ref 3: extent B, offset 0, length 4096 > > > > into this: > > > > ref 1: extent A, offset 0, length 4096 > > ref 2: hole, length 8192 > > ref 3: extent A, offset 0, length 4096 > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > > I am not able to reproduce the issue with an uncompressed extent nor > > have I observed any such corruption in the wild. > > > > The presence or absence of the no-holes filesystem feature has no effect. > > > > Ordinary writes can lead to pairs of extent references to the same extent > > separated by a reference to a different extent; however, in this case > > there is data to be read from a real extent, instead of pages that have > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > this bug, every page-oriented database engine would be crashing all the > > time on btrfs with compression enabled, and it's unlikely that would not > > have been noticed between 2015 and now. 
An ordinary write that splits > > an extent ref would look like this: > > > > ref 1: extent A, offset 0, length 4096 > > ref 2: extent C, offset 0, length 8192 > > ref 3: extent A, offset 12288, length 4096 > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > however, in this case the extent references will point to different > > extents, avoiding the bug. If a sparse write could trigger the bug, > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > other tools that produce sparse files) would be unusable, and it's > > unlikely that would not have been noticed between 2015 and now either. > > Sparse writes look like this: > > > > ref 1: extent A, offset 0, length 4096 > > ref 2: hole, length 8192 > > ref 3: extent B, offset 0, length 4096 > > > > The pattern or timing of read() calls seems to be relevant. It is very > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > will see the corruption just fine. Similar problems exist with 'cmp' > > but not 'sha1sum'. Two processes reading the same file at the same time > > seem to trigger the corruption very frequently. > > > > Some patterns of holes and data produce corruption faster than others. > > The pattern generated by the script above is based on instances of > > corruption I've found in the wild, and has a much better repro rate than > > random holes. > > > > The corruption occurs during reads, after csum verification and before > > decompression, so btrfs detects no csum failures. The data on disk > > seems to be OK and could be read correctly once the kernel bug is fixed. > > Repeated reads do eventually return correct data, but there is no way > > for userspace to distinguish between corrupt and correct data reliably. > > > > The corrupted data is usually data replaced by a hole or a copy of other > > blocks in the same extent. 
> > > > The behavior is similar to some earlier bugs related to holes and > > compressed data in btrfs, but it's new and not fixed yet--hence, > > "2018 edition." > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 15:35 ` Filipe Manana @ 2019-02-12 17:01 ` Zygo Blaxell 2019-02-12 17:56 ` Filipe Manana 2019-02-12 18:58 ` Andrei Borzenkov 0 siblings, 2 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-02-12 17:01 UTC (permalink / raw) To: Filipe Manana; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 11371 bytes --] On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > Still reproducible on 4.20.7. > > I tried your reproducer when you first reported it, on different > machines with different kernel versions. That would have been useful to know last August... :-/ > Never managed to reproduce it, nor see anything obviously wrong in > relevant code paths. I built a fresh VM running Debian stretch and reproduced the issue immediately. Mount options are "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version probably doesn't matter. I don't have any configuration that can't reproduce this issue, so I don't know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, hardware ranging in age from 0 to 9 years. Locally built kernels from 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. All of these reproduce the issue immediately--wrong sha1sum appears in the first 10 loops. What is your test environment? I can try that here. > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > which makes the problem a bit more difficult to detect. 
> > > > # repro-hole-corruption-test > > i: 91, status: 0, bytes_deduped: 131072 > > i: 92, status: 0, bytes_deduped: 131072 > > i: 93, status: 0, bytes_deduped: 131072 > > i: 94, status: 0, bytes_deduped: 131072 > > i: 95, status: 0, bytes_deduped: 131072 > > i: 96, status: 0, bytes_deduped: 131072 > > i: 97, status: 0, bytes_deduped: 131072 > > i: 98, status: 0, bytes_deduped: 131072 > > i: 99, status: 0, bytes_deduped: 131072 > > 13107200 total bytes deduped in this operation > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > The sha1sum seems stable after the first drop_caches--until a second > > process tries to read the test file: > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > # cat am > /dev/null (in another shell) > > 19294e695272c42edb89ceee24bb08c13473140a am > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > > This is a repro script for a btrfs bug that causes corrupted data reads > > > when reading a mix of compressed extents and holes. The bug is > > > reproducible on at least kernels v4.1..v4.18. 
> > > > > > Some more observations and background follow, but first here is the > > > script and some sample output: > > > > > > root@rescue:/test# cat repro-hole-corruption-test > > > #!/bin/bash > > > > > > # Write a 4096 byte block of something > > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > > > # Here is some test data with holes in it: > > > for y in $(seq 0 100); do > > > for x in 0 1; do > > > block 0; > > > block 21; > > > block 0; > > > block 22; > > > block 0; > > > block 0; > > > block 43; > > > block 44; > > > block 0; > > > block 0; > > > block 61; > > > block 62; > > > block 63; > > > block 64; > > > block 65; > > > block 66; > > > done > > > done > am > > > sync > > > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > > > # Punch holes into the extent refs > > > fallocate -v -d am > > > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! > > > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > > > root@rescue:/test# ./repro-hole-corruption-test > > > i: 91, status: 0, bytes_deduped: 131072 > > > i: 92, status: 0, bytes_deduped: 131072 > > > i: 93, status: 0, bytes_deduped: 131072 > > > i: 94, status: 0, bytes_deduped: 131072 > > > i: 95, status: 0, bytes_deduped: 131072 > > > i: 96, status: 0, bytes_deduped: 131072 > > > i: 97, status: 0, bytes_deduped: 131072 > > > i: 98, status: 0, bytes_deduped: 131072 > > > i: 99, status: 0, bytes_deduped: 131072 > > > 13107200 total bytes deduped in this operation > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 072a152355788c767b97e4e4c0e4567720988b84 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 60831f0e7ffe4b49722612c18685c09f4583b1df am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > ^C > > > > > > Corruption occurs most often when there is a sequence like this in a file: > > > > > > ref 1: hole > > > ref 2: extent A, offset 0 > > > ref 3: hole > > > ref 4: extent A, offset 8192 > > > > > > This scenario typically arises due to hole-punching or deduplication. 
> > > Hole-punching replaces one extent ref with two references to the same > > > extent with a hole between them, so: > > > > > > ref 1: extent A, offset 0, length 16384 > > > > > > becomes: > > > > > > ref 1: extent A, offset 0, length 4096 > > > ref 2: hole, length 8192 > > > ref 3: extent A, offset 12288, length 4096 > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > > two references to one of the duplicate extents, turning this: > > > > > > ref 1: extent A, offset 0, length 4096 > > > ref 2: hole, length 8192 > > > ref 3: extent B, offset 0, length 4096 > > > > > > into this: > > > > > > ref 1: extent A, offset 0, length 4096 > > > ref 2: hole, length 8192 > > > ref 3: extent A, offset 0, length 4096 > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > > > I am not able to reproduce the issue with an uncompressed extent nor > > > have I observed any such corruption in the wild. > > > > > > The presence or absence of the no-holes filesystem feature has no effect. > > > > > > Ordinary writes can lead to pairs of extent references to the same extent > > > separated by a reference to a different extent; however, in this case > > > there is data to be read from a real extent, instead of pages that have > > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > > this bug, every page-oriented database engine would be crashing all the > > > time on btrfs with compression enabled, and it's unlikely that would not > > > have been noticed between 2015 and now. An ordinary write that splits > > > an extent ref would look like this: > > > > > > ref 1: extent A, offset 0, length 4096 > > > ref 2: extent C, offset 0, length 8192 > > > ref 3: extent A, offset 12288, length 4096 > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > > however, in this case the extent references will point to different > > > extents, avoiding the bug. 
If a sparse write could trigger the bug, > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > > other tools that produce sparse files) would be unusable, and it's > > > unlikely that would not have been noticed between 2015 and now either. > > > Sparse writes look like this: > > > > > > ref 1: extent A, offset 0, length 4096 > > > ref 2: hole, length 8192 > > > ref 3: extent B, offset 0, length 4096 > > > > > > The pattern or timing of read() calls seems to be relevant. It is very > > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > > will see the corruption just fine. Similar problems exist with 'cmp' > > > but not 'sha1sum'. Two processes reading the same file at the same time > > > seem to trigger the corruption very frequently. > > > > > > Some patterns of holes and data produce corruption faster than others. > > > The pattern generated by the script above is based on instances of > > > corruption I've found in the wild, and has a much better repro rate than > > > random holes. > > > > > > The corruption occurs during reads, after csum verification and before > > > decompression, so btrfs detects no csum failures. The data on disk > > > seems to be OK and could be read correctly once the kernel bug is fixed. > > > Repeated reads do eventually return correct data, but there is no way > > > for userspace to distinguish between corrupt and correct data reliably. > > > > > > The corrupted data is usually data replaced by a hole or a copy of other > > > blocks in the same extent. > > > > > > The behavior is similar to some earlier bugs related to holes and > > > compressed data in btrfs, but it's new and not fixed yet--hence, > > > "2018 edition." > > > > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
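[Editorial note on the report above: since deduplication points every 128 KiB range at the byte-identical first extent, and fallocate -d only digs holes where the data is already zero, the logical contents of 'am' never change; the known-good digest can therefore be recomputed off-btrfs. A sketch (not from the thread; the function name is invented) that reuses the script's own generator:]

```shell
# reference_sha1: rebuild the exact byte stream the setup loop writes to
# 'am' and hash it independently of any filesystem, so a corrupt read of
# 'am' can be recognized by comparing digests.
reference_sha1() {
    # Same 4096-byte fill blocks as the repro script (octal tr escapes).
    blk() { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
    chunk=$(mktemp)
    # One 128 KiB unit of the pattern (the inner x-loop of the script);
    # all 101 outer iterations write this identical unit.
    for x in 0 1; do
        blk 0; blk 21; blk 0; blk 22; blk 0; blk 0; blk 43; blk 44
        blk 0; blk 0; blk 61; blk 62; blk 63; blk 64; blk 65; blk 66
    done > "$chunk"
    for y in $(seq 0 100); do cat "$chunk"; done | sha1sum | awk '{print $1}'
    rm -f "$chunk"
}
```

[In the logs above the stable digest is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4; a correct read of 'am' should hash to whatever this prints, and any other sum indicates a corrupt read.]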
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 17:01 ` Zygo Blaxell @ 2019-02-12 17:56 ` Filipe Manana 2019-02-12 18:13 ` Zygo Blaxell 2019-02-12 18:58 ` Andrei Borzenkov 1 sibling, 1 reply; 38+ messages in thread From: Filipe Manana @ 2019-02-12 17:56 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > Still reproducible on 4.20.7. > > > > I tried your reproducer when you first reported it, on different > > machines with different kernel versions. > > That would have been useful to know last August... :-/ > > > Never managed to reproduce it, nor see anything obviously wrong in > > relevant code paths. > > I built a fresh VM running Debian stretch and > reproduced the issue immediately. Mount options are > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version > probably doesn't matter. > > I don't have any configuration that can't reproduce this issue, so I don't > know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, > hardware ranging in age from 0 to 9 years. Locally built kernels from > 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. > All of these reproduce the issue immediately--wrong sha1sum appears in > the first 10 loops. > > What is your test environment? I can try that here. Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. Always built from source kernels. I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms that kept running the test in an infinite loop during those weeks. 
Don't recall what were the kernel versions (whatever was the latest at the time), but that shouldn't matter according to what you say. > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > > which makes the problem a bit more difficult to detect. > > > > > > # repro-hole-corruption-test > > > i: 91, status: 0, bytes_deduped: 131072 > > > i: 92, status: 0, bytes_deduped: 131072 > > > i: 93, status: 0, bytes_deduped: 131072 > > > i: 94, status: 0, bytes_deduped: 131072 > > > i: 95, status: 0, bytes_deduped: 131072 > > > i: 96, status: 0, bytes_deduped: 131072 > > > i: 97, status: 0, bytes_deduped: 131072 > > > i: 98, status: 0, bytes_deduped: 131072 > > > i: 99, status: 0, bytes_deduped: 131072 > > > 13107200 total bytes deduped in this operation > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > The sha1sum seems stable after the first drop_caches--until a second > > > process tries to read the test file: > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > # cat am > /dev/null (in another shell) > > > 19294e695272c42edb89ceee24bb08c13473140a am > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > > > This is a repro script for a btrfs bug that causes corrupted data reads > > > > when reading a mix of compressed extents and holes. The bug is > > > > reproducible on at least kernels v4.1..v4.18. 
> > > > > > > > Some more observations and background follow, but first here is the > > > > script and some sample output: > > > > > > > > root@rescue:/test# cat repro-hole-corruption-test > > > > #!/bin/bash > > > > > > > > # Write a 4096 byte block of something > > > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > > > > > # Here is some test data with holes in it: > > > > for y in $(seq 0 100); do > > > > for x in 0 1; do > > > > block 0; > > > > block 21; > > > > block 0; > > > > block 22; > > > > block 0; > > > > block 0; > > > > block 43; > > > > block 44; > > > > block 0; > > > > block 0; > > > > block 61; > > > > block 62; > > > > block 63; > > > > block 64; > > > > block 65; > > > > block 66; > > > > done > > > > done > am > > > > sync > > > > > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > > > > > # Punch holes into the extent refs > > > > fallocate -v -d am > > > > > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! > > > > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > > > > > root@rescue:/test# ./repro-hole-corruption-test > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > 13107200 total bytes deduped in this operation > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 072a152355788c767b97e4e4c0e4567720988b84 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 60831f0e7ffe4b49722612c18685c09f4583b1df am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > ^C > > > > > > > > Corruption occurs most often when there is a sequence like this in a file: > > > > > > > > ref 1: hole > > > > ref 2: extent A, offset 0 > > > > ref 3: hole > > > > ref 4: extent A, offset 8192 > > > > > > > > This scenario typically arises due to hole-punching or deduplication. 
> > > > Hole-punching replaces one extent ref with two references to the same > > > > extent with a hole between them, so: > > > > > > > > ref 1: extent A, offset 0, length 16384 > > > > > > > > becomes: > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > ref 2: hole, length 8192 > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > > > two references to one of the duplicate extents, turning this: > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > ref 2: hole, length 8192 > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > into this: > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > ref 2: hole, length 8192 > > > > ref 3: extent A, offset 0, length 4096 > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > > > > I am not able to reproduce the issue with an uncompressed extent nor > > > > have I observed any such corruption in the wild. > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect. > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent > > > > separated by a reference to a different extent; however, in this case > > > > there is data to be read from a real extent, instead of pages that have > > > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > > > this bug, every page-oriented database engine would be crashing all the > > > > time on btrfs with compression enabled, and it's unlikely that would not > > > > have been noticed between 2015 and now. 
An ordinary write that splits > > > > an extent ref would look like this: > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > ref 2: extent C, offset 0, length 8192 > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > > > however, in this case the extent references will point to different > > > > extents, avoiding the bug. If a sparse write could trigger the bug, > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > > > other tools that produce sparse files) would be unusable, and it's > > > > unlikely that would not have been noticed between 2015 and now either. > > > > Sparse writes look like this: > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > ref 2: hole, length 8192 > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > The pattern or timing of read() calls seems to be relevant. It is very > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > > > will see the corruption just fine. Similar problems exist with 'cmp' > > > > but not 'sha1sum'. Two processes reading the same file at the same time > > > > seem to trigger the corruption very frequently. > > > > > > > > Some patterns of holes and data produce corruption faster than others. > > > > The pattern generated by the script above is based on instances of > > > > corruption I've found in the wild, and has a much better repro rate than > > > > random holes. > > > > > > > > The corruption occurs during reads, after csum verification and before > > > > decompression, so btrfs detects no csum failures. The data on disk > > > > seems to be OK and could be read correctly once the kernel bug is fixed. > > > > Repeated reads do eventually return correct data, but there is no way > > > > for userspace to distinguish between corrupt and correct data reliably. 
> > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other > > > > blocks in the same extent. > > > > > > > > The behavior is similar to some earlier bugs related to holes and > > > > compressed data in btrfs, but it's new and not fixed yet--hence, > > > > "2018 edition." > > > > > > > > > > > > -- > > Filipe David Manana, > > > > “Whether you think you can, or you think you can't — you're right.” > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 17:56 ` Filipe Manana @ 2019-02-12 18:13 ` Zygo Blaxell 2019-02-13 7:24 ` Qu Wenruo 2019-02-13 17:36 ` Filipe Manana 0 siblings, 2 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-02-12 18:13 UTC (permalink / raw) To: Filipe Manana; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 13720 bytes --] On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote: > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > > Still reproducible on 4.20.7. > > > > > > I tried your reproducer when you first reported it, on different > > > machines with different kernel versions. > > > > That would have been useful to know last August... :-/ > > > > > Never managed to reproduce it, nor see anything obviously wrong in > > > relevant code paths. > > > > I built a fresh VM running Debian stretch and > > reproduced the issue immediately. Mount options are > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version > > probably doesn't matter. > > > > I don't have any configuration that can't reproduce this issue, so I don't > > know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, > > hardware ranging in age from 0 to 9 years. Locally built kernels from > > 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. > > All of these reproduce the issue immediately--wrong sha1sum appears in > > the first 10 loops. > > > > What is your test environment? I can try that here. > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. I have several environments like that... > Always built from source kernels. 
...that could be a relevant difference. Have you tried a stock Debian kernel? > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms > that kept running the test in an infinite loop during those weeks. > Don't recall what were the kernel versions (whatever was the latest at > the time), but that shouldn't matter according to what you say. That's an extremely long time compared to the rate of occurrence of this bug. It should appear in only a few seconds of testing. Some data-hole-data patterns reproduce much slower (change the position of "block 0" lines in the setup script), but "slower" is minutes, not machine-months. Is your filesystem compressed? Does compsize show the test file 'am' is compressed during the test? Is the sha1sum you get 6926a34e0ab3e0a023e8ea85a650f5b4217acab4? Does the sha1sum change when a second process reads the file while the sha1sum/drop_caches loop is running? > > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > > > which makes the problem a bit more difficult to detect. > > > > > > > > # repro-hole-corruption-test > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > 13107200 total bytes deduped in this operation > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> > > > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > The sha1sum seems stable after the first drop_caches--until a second > > > > process tries to read the test file: > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > # cat am > /dev/null (in another shell) > > > > 19294e695272c42edb89ceee24bb08c13473140a am > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > > > > This is a repro script for a btrfs bug that causes corrupted data reads > > > > > when reading a mix of compressed extents and holes. The bug is > > > > > reproducible on at least kernels v4.1..v4.18. 
> > > > > > > > > > Some more observations and background follow, but first here is the > > > > > script and some sample output: > > > > > > > > > > root@rescue:/test# cat repro-hole-corruption-test > > > > > #!/bin/bash > > > > > > > > > > # Write a 4096 byte block of something > > > > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > > > > > > > # Here is some test data with holes in it: > > > > > for y in $(seq 0 100); do > > > > > for x in 0 1; do > > > > > block 0; > > > > > block 21; > > > > > block 0; > > > > > block 22; > > > > > block 0; > > > > > block 0; > > > > > block 43; > > > > > block 44; > > > > > block 0; > > > > > block 0; > > > > > block 61; > > > > > block 62; > > > > > block 63; > > > > > block 64; > > > > > block 65; > > > > > block 66; > > > > > done > > > > > done > am > > > > > sync > > > > > > > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > > > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > > > > > > > # Punch holes into the extent refs > > > > > fallocate -v -d am > > > > > > > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! > > > > > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > > > > > > > root@rescue:/test# ./repro-hole-corruption-test > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > 13107200 total bytes deduped in this operation > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 072a152355788c767b97e4e4c0e4567720988b84 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 60831f0e7ffe4b49722612c18685c09f4583b1df am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > ^C > > > > > > > > > > Corruption occurs most often when there is a sequence like this in a file: > > > > > > > > > > ref 1: hole > > > > > ref 2: extent A, offset 0 > > > > > ref 3: hole > > > > > ref 4: extent A, offset 8192 > > > > > > > > > > This scenario typically arises due to hole-punching or deduplication. 
> > > > > Hole-punching replaces one extent ref with two references to the same > > > > > extent with a hole between them, so: > > > > > > > > > > ref 1: extent A, offset 0, length 16384 > > > > > > > > > > becomes: > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > ref 2: hole, length 8192 > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > > > > two references to one of the duplicate extents, turning this: > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > ref 2: hole, length 8192 > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > into this: > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > ref 2: hole, length 8192 > > > > > ref 3: extent A, offset 0, length 4096 > > > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > > > > > I am not able to reproduce the issue with an uncompressed extent nor > > > > > have I observed any such corruption in the wild. > > > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect. > > > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent > > > > > separated by a reference to a different extent; however, in this case > > > > > there is data to be read from a real extent, instead of pages that have > > > > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > > > > this bug, every page-oriented database engine would be crashing all the > > > > > time on btrfs with compression enabled, and it's unlikely that would not > > > > > have been noticed between 2015 and now. 
An ordinary write that splits > > > > > an extent ref would look like this: > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > ref 2: extent C, offset 0, length 8192 > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > > > > however, in this case the extent references will point to different > > > > > extents, avoiding the bug. If a sparse write could trigger the bug, > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > > > > other tools that produce sparse files) would be unusable, and it's > > > > > unlikely that would not have been noticed between 2015 and now either. > > > > > Sparse writes look like this: > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > ref 2: hole, length 8192 > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > The pattern or timing of read() calls seems to be relevant. It is very > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > > > > will see the corruption just fine. Similar problems exist with 'cmp' > > > > > but not 'sha1sum'. Two processes reading the same file at the same time > > > > > seem to trigger the corruption very frequently. > > > > > > > > > > Some patterns of holes and data produce corruption faster than others. > > > > > The pattern generated by the script above is based on instances of > > > > > corruption I've found in the wild, and has a much better repro rate than > > > > > random holes. > > > > > > > > > > The corruption occurs during reads, after csum verification and before > > > > > decompression, so btrfs detects no csum failures. The data on disk > > > > > seems to be OK and could be read correctly once the kernel bug is fixed. > > > > > Repeated reads do eventually return correct data, but there is no way > > > > > for userspace to distinguish between corrupt and correct data reliably. 
> > > > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other > > > > > blocks in the same extent. > > > > > > > > > > The behavior is similar to some earlier bugs related to holes and > > > > > compressed data in btrfs, but it's new and not fixed yet--hence, > > > > > "2018 edition." > > > > > > > > > > > > > > > > > -- > > > Filipe David Manana, > > > > > > “Whether you think you can, or you think you can't — you're right.” > > > > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
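[Editorial note: the two-reader trigger discussed in the message above -- digests that change when a second process reads the file during the checksum loop -- can be wrapped in a small helper for repeated runs. A sketch, not from the thread; the function name is invented, and the drop_caches step needs root:]

```shell
# check_stability FILE [ITERS]: sha1 FILE repeatedly while a second reader
# races it, then print the number of distinct digests seen (1 = stable).
check_stability() {
    f=$1
    iters=${2:-30}
    cat "$f" >/dev/null &                   # the concurrent reader
    i=0
    while [ "$i" -lt "$iters" ]; do
        sha1sum "$f" | awk '{print $1}'
        sysctl -q vm.drop_caches=3 2>/dev/null || true  # needs root; no-op otherwise
        i=$((i + 1))
    done | sort -u | wc -l
    wait                                    # reap the background cat
}
```

[On an affected kernel with a compressed test file, `check_stability am` should print a count above 1 within a few iterations; on an unaffected kernel it stays at 1.]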
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 18:13 ` Zygo Blaxell @ 2019-02-13 7:24 ` Qu Wenruo 2019-02-13 17:36 ` Filipe Manana 1 sibling, 0 replies; 38+ messages in thread From: Qu Wenruo @ 2019-02-13 7:24 UTC (permalink / raw) To: Zygo Blaxell, Filipe Manana; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 3282 bytes --] On 2019/2/13 上午2:13, Zygo Blaxell wrote: > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote: >> On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell >> <ce3g8jdj@umail.furryterror.org> wrote: >>> >>> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: >>>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell >>>> <ce3g8jdj@umail.furryterror.org> wrote: >>>>> >>>>> Still reproducible on 4.20.7. >>>> >>>> I tried your reproducer when you first reported it, on different >>>> machines with different kernel versions. >>> >>> That would have been useful to know last August... :-/ >>> >>>> Never managed to reproduce it, nor see anything obviously wrong in >>>> relevant code paths. >>> >>> I built a fresh VM running Debian stretch and >>> reproduced the issue immediately. Mount options are >>> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is >>> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version >>> probably doesn't matter. >>> >>> I don't have any configuration that can't reproduce this issue, so I don't >>> know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, >>> hardware ranging in age from 0 to 9 years. Locally built kernels from >>> 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. >>> All of these reproduce the issue immediately--wrong sha1sum appears in >>> the first 10 loops. >>> >>> What is your test environment? I can try that here. >> >> Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. > > I have several environments like that... > >> Always built from source kernels. 
> > ...that could be a relevant difference. Have you tried a stock > Debian kernel? I'm afraid you may need to use upstream vanilla kernel other than kernel from distro, especially for distros who may have heavy backports. I also tried my test runs, using Arch stock kernel (pretty vanilla) and upstream kernel. Both my host and VM tested. No reproduce either. Upstream community is mostly focused on upstream vanilla kernel. Bugs from distro kernel can sometimes be a good clue of existing upstream bugs, but when dig deeper, vanilla kernel is always necessary. Would you mind to reproduce it in a as vanilla as possible environment? E.g. vanilla kernel and vanilla user space progs? Thanks, Qu > >> I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms >> that kept running the test in an infinite loop during those weeks. >> Don't recall what were the kernel versions (whatever was the latest at >> the time), but that shouldn't matter according to what you say. > > That's an extremely long time compared to the rate of occurrence > of this bug. It should appear in only a few seconds of testing. > Some data-hole-data patterns reproduce much slower (change the position > of "block 0" lines in the setup script), but "slower" is minutes, > not machine-months. > > Is your filesystem compressed? Does compsize show the test > file 'am' is compressed during the test? Is the sha1sum you get > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4? Does the sha1sum change > when a second process reads the file while the sha1sum/drop_caches loop > is running? > [snip] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
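[Editorial note: the environment questions quoted above -- which kernel, which mount options, whether 'am' is actually compressed, which digest -- can be gathered in one shot. A sketch with an invented name; findmnt and compsize are optional tools and are skipped if not installed:]

```shell
# env_report FILE: print the environment facts asked for in the thread.
env_report() {
    f=$1
    uname -r                                            # kernel under test
    command -v findmnt >/dev/null 2>&1 &&
        findmnt -no OPTIONS -T "$f"                     # look for compress= here
    command -v compsize >/dev/null 2>&1 &&
        compsize "$f"                                   # on-disk compression stats
    sha1sum "$f" | awk '{print $1}'                     # compare with the stable digest
}
```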
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 18:13 ` Zygo Blaxell 2019-02-13 7:24 ` Qu Wenruo @ 2019-02-13 17:36 ` Filipe Manana 2019-02-13 18:14 ` Filipe Manana 1 sibling, 1 reply; 38+ messages in thread From: Filipe Manana @ 2019-02-13 17:36 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote: > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell > > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > > > > Still reproducible on 4.20.7. > > > > > > > > I tried your reproducer when you first reported it, on different > > > > machines with different kernel versions. > > > > > > That would have been useful to know last August... :-/ > > > > > > > Never managed to reproduce it, nor see anything obviously wrong in > > > > relevant code paths. > > > > > > I built a fresh VM running Debian stretch and > > > reproduced the issue immediately. Mount options are > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version > > > probably doesn't matter. > > > > > > I don't have any configuration that can't reproduce this issue, so I don't > > > know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, > > > hardware ranging in age from 0 to 9 years. Locally built kernels from > > > 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. > > > All of these reproduce the issue immediately--wrong sha1sum appears in > > > the first 10 loops. > > > > > > What is your test environment? I can try that here. > > > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. 
> > I have several environments like that... > > > Always built from source kernels. > > ...that could be a relevant difference. Have you tried a stock > Debian kernel? > > > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms > > that kept running the test in an infinite loop during those weeks. > > Don't recall what were the kernel versions (whatever was the latest at > > the time), but that shouldn't matter according to what you say. > > That's an extremely long time compared to the rate of occurrence > of this bug. It should appear in only a few seconds of testing. > Some data-hole-data patterns reproduce much slower (change the position > of "block 0" lines in the setup script), but "slower" is minutes, > not machine-months. > > Is your filesystem compressed? Does compsize show the test > file 'am' is compressed during the test? Is the sha1sum you get > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4? Does the sha1sum change > when a second process reads the file while the sha1sum/drop_caches loop > is running? Tried it today and I got it reproduced (different vm, but still debian and kernel built from source). Not sure what was different last time. Yes, I had compression enabled. I'll look into it. > > > > > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > > > > which makes the problem a bit more difficult to detect. 
> > > > > > > > > > # repro-hole-corruption-test > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > 13107200 total bytes deduped in this operation > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > > > > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > > > The sha1sum seems stable after the first drop_caches--until a second > > > > > process tries to read the test file: > > > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > # cat am > /dev/null (in another shell) > > > > > 19294e695272c42edb89ceee24bb08c13473140a am > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads > > > > > > when reading a mix of compressed extents and holes. The bug is > > > > > > reproducible on at least kernels v4.1..v4.18. 
> > > > > > > > > > > > Some more observations and background follow, but first here is the > > > > > > script and some sample output: > > > > > > > > > > > > root@rescue:/test# cat repro-hole-corruption-test > > > > > > #!/bin/bash > > > > > > > > > > > > # Write a 4096 byte block of something > > > > > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > > > > > > > > > # Here is some test data with holes in it: > > > > > > for y in $(seq 0 100); do > > > > > > for x in 0 1; do > > > > > > block 0; > > > > > > block 21; > > > > > > block 0; > > > > > > block 22; > > > > > > block 0; > > > > > > block 0; > > > > > > block 43; > > > > > > block 44; > > > > > > block 0; > > > > > > block 0; > > > > > > block 61; > > > > > > block 62; > > > > > > block 63; > > > > > > block 64; > > > > > > block 65; > > > > > > block 66; > > > > > > done > > > > > > done > am > > > > > > sync > > > > > > > > > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > > > > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > > > > > > > > > # Punch holes into the extent refs > > > > > > fallocate -v -d am > > > > > > > > > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! 
> > > > > > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > > > > > > > > > root@rescue:/test# ./repro-hole-corruption-test > > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > > 13107200 total bytes deduped in this operation > > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 072a152355788c767b97e4e4c0e4567720988b84 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 60831f0e7ffe4b49722612c18685c09f4583b1df am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > 
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > ^C > > > > > > > > > > > > Corruption occurs most often when there is a sequence like this in a file: > > > > > > > > > > > > ref 1: hole > > > > > > ref 2: extent A, offset 0 > > > > > > ref 3: hole > > > > > > ref 4: extent A, offset 8192 > > > > > > > > > > > > This scenario typically arises due to hole-punching or deduplication. > > > > > > Hole-punching replaces one extent ref with two references to the same > > > > > > extent with a hole between them, so: > > > > > > > > > > > > ref 1: extent A, offset 0, length 16384 > > > > > > > > > > > > becomes: > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > ref 2: hole, length 8192 > > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > > > > > two references to one of the duplicate extents, turning this: > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > ref 2: hole, length 8192 > > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > > > into this: > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > ref 2: hole, length 8192 > > > > > > ref 3: extent A, offset 0, length 4096 > > > > > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > > > > > > I am not able to reproduce the issue with an uncompressed extent nor > > > > > > have I observed any such corruption in the wild. > > > > > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect. 
> > > > > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent > > > > > > separated by a reference to a different extent; however, in this case > > > > > > there is data to be read from a real extent, instead of pages that have > > > > > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > > > > > this bug, every page-oriented database engine would be crashing all the > > > > > > time on btrfs with compression enabled, and it's unlikely that would not > > > > > > have been noticed between 2015 and now. An ordinary write that splits > > > > > > an extent ref would look like this: > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > ref 2: extent C, offset 0, length 8192 > > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > > > > > however, in this case the extent references will point to different > > > > > > extents, avoiding the bug. If a sparse write could trigger the bug, > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > > > > > other tools that produce sparse files) would be unusable, and it's > > > > > > unlikely that would not have been noticed between 2015 and now either. > > > > > > Sparse writes look like this: > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > ref 2: hole, length 8192 > > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > > > The pattern or timing of read() calls seems to be relevant. It is very > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > > > > > will see the corruption just fine. Similar problems exist with 'cmp' > > > > > > but not 'sha1sum'. Two processes reading the same file at the same time > > > > > > seem to trigger the corruption very frequently. 
> > > > > > > > > > > > Some patterns of holes and data produce corruption faster than others. > > > > > > The pattern generated by the script above is based on instances of > > > > > > corruption I've found in the wild, and has a much better repro rate than > > > > > > random holes. > > > > > > > > > > > > The corruption occurs during reads, after csum verification and before > > > > > > decompression, so btrfs detects no csum failures. The data on disk > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed. > > > > > > Repeated reads do eventually return correct data, but there is no way > > > > > > for userspace to distinguish between corrupt and correct data reliably. > > > > > > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other > > > > > > blocks in the same extent. > > > > > > > > > > > > The behavior is similar to some earlier bugs related to holes and > > > > > > Compressed data in btrfs, but it's new and not fixed yet--hence, > > > > > > "2018 edition." > > > > > > > > > > > > > > > > > > > > > > -- > > > > Filipe David Manana, > > > > > > > > “Whether you think you can, or you think you can't — you're right.” > > > > > > > > > > > > -- > > Filipe David Manana, > > > > “Whether you think you can, or you think you can't — you're right.” > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-13 17:36 ` Filipe Manana @ 2019-02-13 18:14 ` Filipe Manana 2019-02-14 1:22 ` Filipe Manana 0 siblings, 1 reply; 38+ messages in thread From: Filipe Manana @ 2019-02-13 18:14 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote: > > On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote: > > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: > > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell > > > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > > > > > > Still reproducible on 4.20.7. > > > > > > > > > > I tried your reproducer when you first reported it, on different > > > > > machines with different kernel versions. > > > > > > > > That would have been useful to know last August... :-/ > > > > > > > > > Never managed to reproduce it, nor see anything obviously wrong in > > > > > relevant code paths. > > > > > > > > I built a fresh VM running Debian stretch and > > > > reproduced the issue immediately. Mount options are > > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is > > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version > > > > probably doesn't matter. > > > > > > > > I don't have any configuration that can't reproduce this issue, so I don't > > > > know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, > > > > hardware ranging in age from 0 to 9 years. Locally built kernels from > > > > 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. > > > > All of these reproduce the issue immediately--wrong sha1sum appears in > > > > the first 10 loops. 
> > > > > > > > What is your test environment? I can try that here. > > > > > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. > > > > I have several environments like that... > > > > > Always built from source kernels. > > > > ...that could be a relevant difference. Have you tried a stock > > Debian kernel? > > > > > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms > > > that kept running the test in an infinite loop during those weeks. > > > Don't recall what were the kernel versions (whatever was the latest at > > > the time), but that shouldn't matter according to what you say. > > > > That's an extremely long time compared to the rate of occurrence > > of this bug. It should appear in only a few seconds of testing. > > Some data-hole-data patterns reproduce much slower (change the position > > of "block 0" lines in the setup script), but "slower" is minutes, > > not machine-months. > > > > Is your filesystem compressed? Does compsize show the test > > file 'am' is compressed during the test? Is the sha1sum you get > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4? Does the sha1sum change > > when a second process reads the file while the sha1sum/drop_caches loop > > is running? > > Tried it today and I got it reproduced (different vm, but still debian > and kernel built from source). > Not sure what was different last time. Yes, I had compression enabled. > > I'll look into it. So the problem is caused by hole punching. The script can be reduced to the following: https://friendpaste.com/22t4OdktHQTl0aMGxckc86 file size: 384K am digests after file creation: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am 262144 total bytes deduped in this operation digests after dedupe: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am digests after dedupe 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am am: 24 KiB (24576 bytes) converted to sparse holes. 
digests after hole punching: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da am So hole punching is screwing things, and only after dropping the page cache we can see the bug. I'll send a fix likely tomorrow. > > > > > > > > > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > > > > > which makes the problem a bit more difficult to detect. > > > > > > > > > > > > # repro-hole-corruption-test > > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > > 13107200 total bytes deduped in this operation > > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> > > > > > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > > > > > The sha1sum seems stable after the first drop_caches--until a second > > > > > > process tries to read the test file: > > > > > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > # cat am > /dev/null (in another shell) > > > > > > 19294e695272c42edb89ceee24bb08c13473140a am > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads > > > > > > > when reading a mix of compressed extents and holes. The bug is > > > > > > > reproducible on at least kernels v4.1..v4.18. 
> > > > > > > > > > > > > > Some more observations and background follow, but first here is the > > > > > > > script and some sample output: > > > > > > > > > > > > > > root@rescue:/test# cat repro-hole-corruption-test > > > > > > > #!/bin/bash > > > > > > > > > > > > > > # Write a 4096 byte block of something > > > > > > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > > > > > > > > > > > # Here is some test data with holes in it: > > > > > > > for y in $(seq 0 100); do > > > > > > > for x in 0 1; do > > > > > > > block 0; > > > > > > > block 21; > > > > > > > block 0; > > > > > > > block 22; > > > > > > > block 0; > > > > > > > block 0; > > > > > > > block 43; > > > > > > > block 44; > > > > > > > block 0; > > > > > > > block 0; > > > > > > > block 61; > > > > > > > block 62; > > > > > > > block 63; > > > > > > > block 64; > > > > > > > block 65; > > > > > > > block 66; > > > > > > > done > > > > > > > done > am > > > > > > > sync > > > > > > > > > > > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > > > > > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > > > > > > > > > > > # Punch holes into the extent refs > > > > > > > fallocate -v -d am > > > > > > > > > > > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! 
> > > > > > > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > > > > > > > > > > > root@rescue:/test# ./repro-hole-corruption-test > > > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > > > 13107200 total bytes deduped in this operation > > > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 072a152355788c767b97e4e4c0e4567720988b84 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 
60831f0e7ffe4b49722612c18685c09f4583b1df am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > ^C > > > > > > > > > > > > > > Corruption occurs most often when there is a sequence like this in a file: > > > > > > > > > > > > > > ref 1: hole > > > > > > > ref 2: extent A, offset 0 > > > > > > > ref 3: hole > > > > > > > ref 4: extent A, offset 8192 > > > > > > > > > > > > > > This scenario typically arises due to hole-punching or deduplication. > > > > > > > Hole-punching replaces one extent ref with two references to the same > > > > > > > extent with a hole between them, so: > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 16384 > > > > > > > > > > > > > > becomes: > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > ref 2: hole, length 8192 > > > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > > > > > > two references to one of the duplicate extents, turning this: > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > ref 2: hole, length 8192 > > > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > > > > > into this: > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > ref 2: hole, length 8192 > > > > > > > ref 3: extent A, offset 0, length 4096 > > > > > > > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. > > > > > > > I am not able to reproduce the issue with an uncompressed extent nor > > > > > > > have I observed any such corruption in the wild. 
> > > > > > > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect. > > > > > > > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent > > > > > > > separated by a reference to a different extent; however, in this case > > > > > > > there is data to be read from a real extent, instead of pages that have > > > > > > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > > > > > > this bug, every page-oriented database engine would be crashing all the > > > > > > > time on btrfs with compression enabled, and it's unlikely that would not > > > > > > > have been noticed between 2015 and now. An ordinary write that splits > > > > > > > an extent ref would look like this: > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > ref 2: extent C, offset 0, length 8192 > > > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > > > > > > however, in this case the extent references will point to different > > > > > > > extents, avoiding the bug. If a sparse write could trigger the bug, > > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > > > > > > other tools that produce sparse files) would be unusable, and it's > > > > > > > unlikely that would not have been noticed between 2015 and now either. > > > > > > > Sparse writes look like this: > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > ref 2: hole, length 8192 > > > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > > > > > The pattern or timing of read() calls seems to be relevant. It is very > > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > > > > > > will see the corruption just fine. Similar problems exist with 'cmp' > > > > > > > but not 'sha1sum'. 
Two processes reading the same file at the same time > > > > > > > seem to trigger the corruption very frequently. > > > > > > > > > > > > > > Some patterns of holes and data produce corruption faster than others. > > > > > > > The pattern generated by the script above is based on instances of > > > > > > > corruption I've found in the wild, and has a much better repro rate than > > > > > > > random holes. > > > > > > > > > > > > > > The corruption occurs during reads, after csum verification and before > > > > > > > decompression, so btrfs detects no csum failures. The data on disk > > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed. > > > > > > > Repeated reads do eventually return correct data, but there is no way > > > > > > > for userspace to distinguish between corrupt and correct data reliably. > > > > > > > > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other > > > > > > > blocks in the same extent. > > > > > > > > > > > > > > The behavior is similar to some earlier bugs related to holes and > > > > > > > Compressed data in btrfs, but it's new and not fixed yet--hence, > > > > > > > "2018 edition." > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Filipe David Manana, > > > > > > > > > > “Whether you think you can, or you think you can't — you're right.” > > > > > > > > > > > > > > > > > -- > > > Filipe David Manana, > > > > > > “Whether you think you can, or you think you can't — you're right.” > > > > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
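[Editor's note: the "fallocate -v -d am" (dig holes) step quoted above can be observed on any filesystem that supports hole punching (btrfs, ext4, xfs, tmpfs). A sketch with an illustrative path; note that -d only deallocates ranges that are already all zeros, and the file's logical length is unchanged.]

```shell
# Build a 16 KiB file as data / 8 KiB of zeros / data, then dig holes.
f=/tmp/dig-holes-demo
{ head -c 4096 /dev/urandom   # data block
  head -c 8192 /dev/zero      # all-zero region, eligible to become a hole
  head -c 4096 /dev/urandom   # data block
} > "$f"

fallocate -d "$f"   # dig holes: deallocate the all-zero middle region

stat -c %s "$f"     # logical size is unchanged: 16384
```

After this, the extent layout is data ref / hole / data ref -- on btrfs, when both data refs point into the same compressed extent, this is exactly the ref/hole/ref-to-same-extent pattern the bug report describes.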
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-13 18:14 ` Filipe Manana @ 2019-02-14 1:22 ` Filipe Manana 2019-02-14 5:00 ` Zygo Blaxell 2019-02-14 12:21 ` Christoph Anton Mitterer 0 siblings, 2 replies; 38+ messages in thread From: Filipe Manana @ 2019-02-14 1:22 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdmanana@gmail.com> wrote: > > On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote: > > > > On Tue, Feb 12, 2019 at 6:14 PM Zygo Blaxell > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > On Tue, Feb 12, 2019 at 05:56:24PM +0000, Filipe Manana wrote: > > > > On Tue, Feb 12, 2019 at 5:01 PM Zygo Blaxell > > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > > > > On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote: > > > > > > On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell > > > > > > <ce3g8jdj@umail.furryterror.org> wrote: > > > > > > > > > > > > > > Still reproducible on 4.20.7. > > > > > > > > > > > > I tried your reproducer when you first reported it, on different > > > > > > machines with different kernel versions. > > > > > > > > > > That would have been useful to know last August... :-/ > > > > > > > > > > > Never managed to reproduce it, nor see anything obviously wrong in > > > > > > relevant code paths. > > > > > > > > > > I built a fresh VM running Debian stretch and > > > > > reproduced the issue immediately. Mount options are > > > > > "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is > > > > > Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version > > > > > probably doesn't matter. > > > > > > > > > > I don't have any configuration that can't reproduce this issue, so I don't > > > > > know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, > > > > > hardware ranging in age from 0 to 9 years. 
Locally built kernels from > > > > > 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. > > > > > All of these reproduce the issue immediately--wrong sha1sum appears in > > > > > the first 10 loops. > > > > > > > > > > What is your test environment? I can try that here. > > > > > > > > Debian unstable, all qemu vms, 4 cpus 4G to 8G ram iirc. > > > > > > I have several environments like that... > > > > > > > Always built from source kernels. > > > > > > ...that could be a relevant difference. Have you tried a stock > > > Debian kernel? > > > > > > > I have tested this when you reported it for 1 to 2 weeks in 2 or 3 vms > > > > that kept running the test in an infinite loop during those weeks. > > > > Don't recall what were the kernel versions (whatever was the latest at > > > > the time), but that shouldn't matter according to what you say. > > > > > > That's an extremely long time compared to the rate of occurrence > > > of this bug. It should appear in only a few seconds of testing. > > > Some data-hole-data patterns reproduce much slower (change the position > > > of "block 0" lines in the setup script), but "slower" is minutes, > > > not machine-months. > > > > > > Is your filesystem compressed? Does compsize show the test > > > file 'am' is compressed during the test? Is the sha1sum you get > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4? Does the sha1sum change > > > when a second process reads the file while the sha1sum/drop_caches loop > > > is running? > > > > Tried it today and I got it reproduced (different vm, but still debian > > and kernel built from source). > > Not sure what was different last time. Yes, I had compression enabled. > > > > I'll look into it. > > So the problem is caused by hole punching. 
The script can be reduced > to the following: > > https://friendpaste.com/22t4OdktHQTl0aMGxckc86 > > file size: 384K am > digests after file creation: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > 262144 total bytes deduped in this operation > digests after dedupe: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > digests after dedupe 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > am: 24 KiB (24576 bytes) converted to sparse holes. > digests after hole punching: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da am > > So hole punching is screwing things, and only after dropping the page > cache we can see the bug. > I'll send a fix likely tomorrow. So it turns out it's a problem in the read of compressed extents part, a variant of a bug I found back in 2015: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7 The following one liner fixes it: https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 While you test it there (if you want/can), I'll write a change log and a proper test case for fstests and submit them later. Thanks! > > > > > > > > > > > > > > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > > > > > > which makes the problem a bit more difficult to detect. 
> > > > > > > > > > > > > > # repro-hole-corruption-test > > > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > > > 13107200 total bytes deduped in this operation > > > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > > > > > > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > > > > > > > The sha1sum seems stable after the first drop_caches--until a second > > > > > > > process tries to read the test file: > > > > > > > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > # cat am > /dev/null (in another shell) > > > > > > > 19294e695272c42edb89ceee24bb08c13473140a am > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > > > > > > > On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: > > > > > > > > This is a repro script for a btrfs bug that causes corrupted data reads > > > > > > > > when reading a mix of compressed extents and holes. The bug is > > > > > > > > reproducible on at least kernels v4.1..v4.18. 
> > > > > > > > > > > > > > > > Some more observations and background follow, but first here is the > > > > > > > > script and some sample output: > > > > > > > > > > > > > > > > root@rescue:/test# cat repro-hole-corruption-test > > > > > > > > #!/bin/bash > > > > > > > > > > > > > > > > # Write a 4096 byte block of something > > > > > > > > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } > > > > > > > > > > > > > > > > # Here is some test data with holes in it: > > > > > > > > for y in $(seq 0 100); do > > > > > > > > for x in 0 1; do > > > > > > > > block 0; > > > > > > > > block 21; > > > > > > > > block 0; > > > > > > > > block 22; > > > > > > > > block 0; > > > > > > > > block 0; > > > > > > > > block 43; > > > > > > > > block 44; > > > > > > > > block 0; > > > > > > > > block 0; > > > > > > > > block 61; > > > > > > > > block 62; > > > > > > > > block 63; > > > > > > > > block 64; > > > > > > > > block 65; > > > > > > > > block 66; > > > > > > > > done > > > > > > > > done > am > > > > > > > > sync > > > > > > > > > > > > > > > > # Now replace those 101 distinct extents with 101 references to the first extent > > > > > > > > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail > > > > > > > > > > > > > > > > # Punch holes into the extent refs > > > > > > > > fallocate -v -d am > > > > > > > > > > > > > > > > # Do some other stuff on the machine while this runs, and watch the sha1sums change! 
> > > > > > > > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done > > > > > > > > > > > > > > > > root@rescue:/test# ./repro-hole-corruption-test > > > > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > > > > 13107200 total bytes deduped in this operation > > > > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 072a152355788c767b97e4e4c0e4567720988b84 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > bf00d862c6ad436a1be2be606a8ab88d22166b89 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 0d44cdf030fb149e103cfdc164da3da2b7474c17 am > > > > > > > > 
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 60831f0e7ffe4b49722612c18685c09f4583b1df am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > a19662b294a3ccdf35dbb18fdd72c62018526d7d am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > > > > > > > ^C > > > > > > > > > > > > > > > > Corruption occurs most often when there is a sequence like this in a file: > > > > > > > > > > > > > > > > ref 1: hole > > > > > > > > ref 2: extent A, offset 0 > > > > > > > > ref 3: hole > > > > > > > > ref 4: extent A, offset 8192 > > > > > > > > > > > > > > > > This scenario typically arises due to hole-punching or deduplication. > > > > > > > > Hole-punching replaces one extent ref with two references to the same > > > > > > > > extent with a hole between them, so: > > > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 16384 > > > > > > > > > > > > > > > > becomes: > > > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > > ref 2: hole, length 8192 > > > > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > > > > > > > Deduplication replaces two distinct extent refs surrounding a hole with > > > > > > > > two references to one of the duplicate extents, turning this: > > > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > > ref 2: hole, length 8192 > > > > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > > > > > > > into this: > > > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > > ref 2: hole, length 8192 > > > > > > > > ref 3: extent A, offset 0, length 4096 > > > > > > > > > > > > > > > > Compression is required (zlib, zstd, or lzo) for corruption to occur. 
> > > > > > > > I am not able to reproduce the issue with an uncompressed extent nor > > > > > > > > have I observed any such corruption in the wild. > > > > > > > > > > > > > > > > The presence or absence of the no-holes filesystem feature has no effect. > > > > > > > > > > > > > > > > Ordinary writes can lead to pairs of extent references to the same extent > > > > > > > > separated by a reference to a different extent; however, in this case > > > > > > > > there is data to be read from a real extent, instead of pages that have > > > > > > > > to be zero filled from a hole. If ordinary non-hole writes could trigger > > > > > > > > this bug, every page-oriented database engine would be crashing all the > > > > > > > > time on btrfs with compression enabled, and it's unlikely that would not > > > > > > > > have been noticed between 2015 and now. An ordinary write that splits > > > > > > > > an extent ref would look like this: > > > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > > ref 2: extent C, offset 0, length 8192 > > > > > > > > ref 3: extent A, offset 12288, length 4096 > > > > > > > > > > > > > > > > Sparse writes can lead to pairs of extent references surrounding a hole; > > > > > > > > however, in this case the extent references will point to different > > > > > > > > extents, avoiding the bug. If a sparse write could trigger the bug, > > > > > > > > the rsync -S option and qemu/kvm 'raw' disk image files (among many > > > > > > > > other tools that produce sparse files) would be unusable, and it's > > > > > > > > unlikely that would not have been noticed between 2015 and now either. > > > > > > > > Sparse writes look like this: > > > > > > > > > > > > > > > > ref 1: extent A, offset 0, length 4096 > > > > > > > > ref 2: hole, length 8192 > > > > > > > > ref 3: extent B, offset 0, length 4096 > > > > > > > > > > > > > > > > The pattern or timing of read() calls seems to be relevant. 
It is very > > > > > > > > hard to see the corruption when reading files with 'hd', but 'cat | hd' > > > > > > > > will see the corruption just fine. Similar problems exist with 'cmp' > > > > > > > > but not 'sha1sum'. Two processes reading the same file at the same time > > > > > > > > seem to trigger the corruption very frequently. > > > > > > > > > > > > > > > > Some patterns of holes and data produce corruption faster than others. > > > > > > > > The pattern generated by the script above is based on instances of > > > > > > > > corruption I've found in the wild, and has a much better repro rate than > > > > > > > > random holes. > > > > > > > > > > > > > > > > The corruption occurs during reads, after csum verification and before > > > > > > > > decompression, so btrfs detects no csum failures. The data on disk > > > > > > > > seems to be OK and could be read correctly once the kernel bug is fixed. > > > > > > > > Repeated reads do eventually return correct data, but there is no way > > > > > > > > for userspace to distinguish between corrupt and correct data reliably. > > > > > > > > > > > > > > > > The corrupted data is usually data replaced by a hole or a copy of other > > > > > > > > blocks in the same extent. > > > > > > > > > > > > > > > > The behavior is similar to some earlier bugs related to holes and > > > > > > > > Compressed data in btrfs, but it's new and not fixed yet--hence, > > > > > > > > "2018 edition." 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Filipe David Manana, > > > > > > > > > > > > “Whether you think you can, or you think you can't — you're right.” > > > > > > > > > > > > > > > > > > > > > > -- > > > > Filipe David Manana, > > > > > > > > “Whether you think you can, or you think you can't — you're right.” > > > > > > > > > > > > -- > > Filipe David Manana, > > > > “Whether you think you can, or you think you can't — you're right.” > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-14 1:22 ` Filipe Manana @ 2019-02-14 5:00 ` Zygo Blaxell 2019-02-14 12:21 ` Christoph Anton Mitterer 1 sibling, 0 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-02-14 5:00 UTC (permalink / raw) To: Filipe Manana; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 15389 bytes --] On Thu, Feb 14, 2019 at 01:22:49AM +0000, Filipe Manana wrote: > On Wed, Feb 13, 2019 at 6:14 PM Filipe Manana <fdmanana@gmail.com> wrote: > > On Wed, Feb 13, 2019 at 5:36 PM Filipe Manana <fdmanana@gmail.com> wrote: [...] > > > Tried it today and I got it reproduced (different vm, but still debian > > > and kernel built from source). > > > Not sure what was different last time. Yes, I had compression enabled. > > > > > > I'll look into it. > > > > So the problem is caused by hole punching. The script can be reduced > > to the following: > > > > https://friendpaste.com/22t4OdktHQTl0aMGxckc86 > > > > file size: 384K am > > digests after file creation: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > > digests after file creation 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > > 262144 total bytes deduped in this operation > > digests after dedupe: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > > digests after dedupe 2: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > > am: 24 KiB (24576 bytes) converted to sparse holes. > > digests after hole punching: 7c8349cc657fbe61af53fbc5cfacae6e9a402e83 am > > digests after hole punching 2: 5a357b64f4004ea38dbc7058c64a5678668420da am > > > > So hole punching is screwing things, and only after dropping the page > > cache we can see the bug. > > I'll send a fix likely tomorrow. 
> > So it turns out it's a problem in the read of compressed extents part, > a variant of a bug I found back in 2015: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=005efedf2c7d0a270ffbe28d8997b03844f3e3e7 > > The following one liner fixes it: > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 > > While you test it there (if you want/can), I'll write a change log and > a proper test case for fstests and submit them later. Works here (and produces the correct sha1sum, which turns out to be dae78e303edfb8b8ad64ecae01dc1bf233770cfd). Nice work! > Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > > > > > > > > which makes the problem a bit more difficult to detect. > > > > > > > > > > > > > > > > # repro-hole-corruption-test > > > > > > > > i: 91, status: 0, bytes_deduped: 131072 > > > > > > > > i: 92, status: 0, bytes_deduped: 131072 > > > > > > > > i: 93, status: 0, bytes_deduped: 131072 > > > > > > > > i: 94, status: 0, bytes_deduped: 131072 > > > > > > > > i: 95, status: 0, bytes_deduped: 131072 > > > > > > > > i: 96, status: 0, bytes_deduped: 131072 > > > > > > > > i: 97, status: 0, bytes_deduped: 131072 > > > > > > > > i: 98, status: 0, bytes_deduped: 131072 > > > > > > > > i: 99, status: 0, bytes_deduped: 131072 > > > > > > > > 13107200 total bytes deduped in this operation > > > > > > > > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> [...] > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
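For reference, the data pattern written by the repro script quoted earlier in the thread can be reconstructed in Python. This is a sketch of the generator logic only (the shell's `block 21` writes 4096 bytes of octal 021, via `tr '\0' "\\21"`); it shows why the file dedupes cleanly into 131072-byte references:

```python
# Reconstruction of the repro script's data generator (illustrative sketch).
def block(n):
    # "block 21" in the shell script = 4096 bytes of the octal value 021.
    return bytes([int(str(n), 8)]) * 4096

PATTERN = [0, 21, 0, 22, 0, 0, 43, 44, 0, 0, 61, 62, 63, 64, 65, 66]

# for y in 0..100: for x in 0..1: write the 16-block pattern
data = b"".join(block(n) for n in PATTERN) * 2 * 101

# Each y iteration contributes 2 * 16 * 4096 = 131072 bytes: exactly one
# 128 KiB unit, the maximum size of a btrfs compressed extent. That
# alignment is what lets the script replace the file with 101 references
# to the first 131072-byte extent; deduping refs 1..100 against ref 0
# gives 100 * 131072 = 13107200, matching "total bytes deduped" above.
print(len(data))  # 101 * 131072 = 13238272
```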
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-14 1:22 ` Filipe Manana 2019-02-14 5:00 ` Zygo Blaxell @ 2019-02-14 12:21 ` Christoph Anton Mitterer 2019-02-15 5:40 ` Zygo Blaxell 2019-02-15 12:02 ` Filipe Manana 1 sibling, 2 replies; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-02-14 12:21 UTC (permalink / raw) To: linux-btrfs On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote: > The following one liner fixes it: > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 Great to see that fixed... is there any advice that can be given for users/admins? Like whether and how any corruption that occurred can be detected (right now, people may still have backups)? Or under which exact circumstances did the corruption happen? And under which was one safe? E.g. only on specific compression algos (I've been using -o compress (which should be zlib) for quite a while but never found any compression),... or only when specific file operations were done (I did e.g. cp with refcopy, but I think none of the standard tools does hole-punching)? Cheers, Chris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-14 12:21 ` Christoph Anton Mitterer @ 2019-02-15 5:40 ` Zygo Blaxell 2019-03-04 15:34 ` Christoph Anton Mitterer 2019-02-15 12:02 ` Filipe Manana 1 sibling, 1 reply; 38+ messages in thread From: Zygo Blaxell @ 2019-02-15 5:40 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4815 bytes --] On Thu, Feb 14, 2019 at 01:21:29PM +0100, Christoph Anton Mitterer wrote: > On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote: > > The following one liner fixes it: > > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 > > Great to see that fixed... is there any advise that can be given for > users/admins? > > > Like whether and how any occurred corruptions can be detected (right > now, people may still have backups)? The problem occurs only on reads. Data that is written to disk will be OK, and can be read correctly by a fixed kernel. A kernel without the fix will give corrupt data on reads with no indication of corruption other than the changes to the data itself. Applications that copy data may read corrupted data and write it back to the filesystem. This will make the corruption permanent in the copied data. Given the age of the bug, backups that can be corrupted by this bug probably already are. Verify files against internal CRC/hashes where possible. The original files are likely to be OK, since the bug does not affect writes. If your situation has the risk factors listed below, it may be worthwhile to create a fresh set of non-incremental backups after applying the kernel fix. > Or under which exact circumstances did the corruption happen? And under > which was one safe? Compression is required to trigger the bug, so you are safe if you (or the applications you run) never enabled filesystem compression. Even if compression is enabled, the file data must be compressed for the bug to corrupt it. 
Incompressible data extents will never be affected by this bug. If you do use compression, you are still safe if: - you never punch holes in files - you never dedupe or clone files If you do use compression and do the other things, the probability of corruption by this particular bug is non-zero. Whether you get corruption and how often depends on the technical details of what you're doing. To get corruption you have to have one data extent that is split in two parts by punching a hole, or an extent that is cloned/deduped in two parts to adjacent logical offsets in the same file. Both of these methods create the pattern on disk which triggers the bug. Files that consist entirely of unique data will not be affected by dedupe so will not trigger the bug that way. Files that consist partially of unique data may or may not be affected depending on the dedupe tool, data alignment, etc. > E.g. only on specific compression algos (I've been using -o compress > (which should be zlib) for quite a while but never found any All decompress algorithms are affected. The bug is in the generic btrfs decompression handling, so it is not limited to any single algorithm. Compression (i.e. writing) is not affected--whatever data is written to disk should be readable correctly with a fixed kernel. > compression),... or only when specific file operations were done (I did > e.g. cp with refcopy, but I think none of the standard tools does hole- > punching)? That depends on whether you consider fallocate or qemu to be standard tools. The hole-punching function has been a feature of several Linux filesystems for some years now, so we can expect it to be more widely adopted over time. You'd have to do an audit to be sure none of the tools you use are punching holes. "Ordinary" sparse files (made by seeking forward while writing, as done by older Unix utilities including cp, tar, rsync, cpio, binutils) do not trigger this bug. 
An ordinary sparse file has two distinct data extents from two different writes separated by a hole which has never contained file data. A punched hole splits an existing single data extent into two pieces with a newly created hole between them that replaces previously existing file data. These actions create different extent reference patterns, and only the hole-punching one is affected by the bug. Files that contain no blocks full of zeros will not be affected by "fallocate -d"-style hole punching (it searches for existing zeros and punches holes over them--no zeros, no holes). If the hole punching intentionally introduces zeros where zeros did not exist before (e.g. qemu discard operations on raw image files) then it may trigger the bug. btrfs send and receive may be affected, but I don't use them so I don't have any experience of the bug related to these tools. It seems from reading the btrfs receive code that it lacks any code capable of punching a hole, but I'm only doing a quick search for words like "punch", not a detailed code analysis. bees continues to be an awesome tool for discovering btrfs kernel bugs. It compresses, dedupes, *and* punches holes. > > Cheers, > Chris. > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
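The "ordinary sparse file" pattern described above (safe, because the refs point at different extents) can be demonstrated with plain file operations. A minimal Python sketch — it only shows the seek-forward write pattern; punching a hole into *existing* data would instead require fallocate(2) with FALLOC_FL_PUNCH_HOLE, which splits one extent into two references to the same extent, the risky pattern:

```python
import os
import tempfile

# An "ordinary" sparse file: write, seek forward over a region that has
# never contained data, write again. The gap reads back as zeros, but the
# two data regions come from two separate writes (two distinct extents).
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"A" * 4096)          # first write -> one data extent
    os.lseek(fd, 8192, os.SEEK_CUR)    # skip 8 KiB without writing: a hole
    os.write(fd, b"B" * 4096)          # second write -> a different extent
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 16384)
finally:
    os.close(fd)
    os.unlink(path)

# The never-written gap is zero-filled on read:
assert data[4096:12288] == b"\0" * 8192
```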
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-15 5:40 ` Zygo Blaxell @ 2019-03-04 15:34 ` Christoph Anton Mitterer 2019-03-07 20:07 ` Zygo Blaxell 0 siblings, 1 reply; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-04 15:34 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs Hey. Thanks for your elaborate explanations :-) On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote: > The problem occurs only on reads. Data that is written to disk will > be OK, and can be read correctly by a fixed kernel. > > A kernel without the fix will give corrupt data on reads with no > indication of corruption other than the changes to the data itself. > > Applications that copy data may read corrupted data and write it back > to the filesystem. This will make the corruption permanent in the > copied data. So that basically means even a cp (without refcopy) or a btrfs send/receive could already cause permanent silent data corruption. Of course, only if the conditions you've described below are met. > Given the age of the bug Since when was it in the kernel? > Even > if > compression is enabled, the file data must be compressed for the bug > to > corrupt it. Is there a simple way to find files (i.e. pathnames) that were actually compressed? > - you never punch holes in files Is there any "standard application" (like cp, tar, etc.) that would do this? > - you never dedupe or clone files What do you mean by clone? refcopy? Would btrfs snapshots or btrfs send/receive be affected? Or is there anything in btrfs itself which does any of the two per default or on a typical system (i.e. I didn't use dedupe). Also, did the bug only affect data, or could metadata also be affected... basically should such filesystems be re-created since they may also hold corruptions in the meta-data like trees and so on? > > compression),... or only when specific file operations were done (I > > did > > e.g. 
cp with refcopy, but I think none of the standard tools does
> > hole-
> > punching)?
> That depends on whether you consider fallocate or qemu to be standard
> tools.

I assume you mean the fallocate(1) program,... cause I wouldn't know
whether any of cp/mv/etc. does the system call fallocate(2) per default.

My scenario looks about the following, and given your explanations, I'd
assume I should probably be safe:

- my normal laptop doesn't use compress, so it's safe anyway

- my cp has an alias to always have --reflink=auto

- two 8TB data archive disks, each with two backup disks to which the
  data of the two master disks is btrfs sent/received,... which were
  all mounted with compress

- typically I either cp or mv data from the laptop to these disks,
  => should then be safe as the laptop fs didn't use compress,...

- or I directly create the files on the data disks (which use compress)
  by means of wget, scp or similar from other sources
  => should be safe, too, as they probably don't do dedupe/hole
  punching by default

- or I cp/mv from them camera SD cards, which use some *FAT
  => so again I'd expect that to be fine

- on vacation I had the case that I put large amount of picture/videos
  from SD cards to some btrfs-with-compress mobile HDDs, and back home
  from these HDDs to my actual data HDDs.
  => here I do have the read / re-write pattern, so data could have
  been corrupted if it was compressed + deduped/hole-punched
  I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
  well)... and AFAIU there would be no deduping/hole-punching
  involved here

- on my main data disks, I do snapshots... and these snapshots I
  send/receive to the other (also compress-mounted) btrfs disks.
  => could these operations involve deduping/hole-punching and thus the
  corruption?

Another thing:
I always store SHA512 hashsums of files as an XATTR of them (like
"directly after" creating such files).
I assume there would be no deduping/hole-punching involved till then, so the sums should be from correct data, right? But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the final archive HDD... corruption could in principle occur when copying from mobile HDD to archive HDD. In that case, would a diff between the two show me the corruption? I guess not because the diff would likely get the same corruption on read? > "Ordinary" sparse files (made by seeking forward while writing, as > done > by older Unix utilities including cp, tar, rsync, cpio, binutils) do > not > trigger this bug. An ordinary sparse file has two distinct data > extents > from two different writes separated by a hole which has never > contained > file data. A punched hole splits an existing single data extent into > two > pieces with a newly created hole between them that replaces > previously > existing file data. These actions create different extent reference > patterns and only the hole-punching one is affected by the bug. > Files that contain no blocks full of zeros will not be affected by > fallocate-d-style hole punching (it searches for existing zeros and > punches holes over them--no zeros, no holes). If the the hole > punching > intentionally introduces zeros where zeros did not exist before (e.g. > qemu > discard operations on raw image files) then it may trigger the bug. So long story short, "normal" file operations (cp/mv, etc.) should not trigger the bug. qemu with discard would be a prominent example of triggering the bug, but luckily for me, I only use this on an fs with compress disabled :-D Any other such prominent examples? I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch holes and thus be not affected? Further, I'd assume XATTRs couldn't be affected? So what remains unanswered is send/receive: > btrfs send and receive may be affected, but I don't use them so I > don't > have any experience of the bug related to these tools. 
> It seems from reading the btrfs receive code that it lacks any code
> capable of punching a hole, but I'm only doing a quick search for
> words like "punch", not a detailed code analysis.

Is there some other developer who possibly knows whether send/receive
would have been vulnerable to the issue?

But since I use send/receive anyway in just one direction from the
master to the backup disks... only the latter could be affected.

Thanks,
Chris.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-04 15:34 ` Christoph Anton Mitterer @ 2019-03-07 20:07 ` Zygo Blaxell 2019-03-08 10:37 ` Filipe Manana ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-03-07 20:07 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 11976 bytes --] On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote: > Hey. > > > Thanks for your elaborate explanations :-) > > > On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote: > > The problem occurs only on reads. Data that is written to disk will > > be OK, and can be read correctly by a fixed kernel. > > > > A kernel without the fix will give corrupt data on reads with no > > indication of corruption other than the changes to the data itself. > > > > Applications that copy data may read corrupted data and write it back > > to the filesystem. This will make the corruption permanent in the > > copied data. > > So that basically means even a cp (without refcopy) or a btrfs > send/receive could already cause permanent silent data corruption. > Of course, only if the conditions you've described below are met. > > > > Given the age of the bug > > Since when was it in the kernel? Since at least 2015. Note that if you are looking for an end date for "clean" data, you may be disappointed. In 2016 there were two kernel bugs that silently corrupted reads of compressed data. In 2015 there were...4? 5? Before 2015 the problems are worse, also damaging on-disk compressed data and crashing the kernel. The bugs that were present in 2014 were present since compression was introduced in 2008. With this last fix, as far as I know, we have a kernel that can read compressed data without corruption for the first time--at least for a subset of use cases that doesn't include direct IO. 
Of course I thought the same thing in 2017, too, but I have since proven
myself wrong.

When btrfs gets to the point where it doesn't fail backup verification
for some contiguous years, then I'll be satisfied btrfs (or any
filesystem) is properly debugged. I'll still run backup verification
then, of course--hardware breaks all the time, and broken hardware can
corrupt any data it touches. Verification failures point to broken
hardware much more often than btrfs data corruption bugs.

> > Even if compression is enabled, the file data must be compressed for
> > the bug to corrupt it.
>
> Is there a simple way to find files (i.e. pathnames) that were actually
> compressed?

Run compsize (sometimes the package is named btrfs-compsize) and see if
there are any lines referring to zlib, zstd, or lzo in the output.
If it's all "total" and "none" then there's no compression in that file.

filefrag -v reports non-inline compressed data extents with the "encoded"
flag, so

	if filefrag -v "$file" | grep -qw encoded; then
		echo "$file" is compressed, do something here
	fi

might also be a solution (assuming your filename doesn't include the
string 'encoded').

> > - you never punch holes in files
>
> Is there any "standard application" (like cp, tar, etc.) that would do
> this?

Legacy POSIX doesn't have the hole-punching concept, so legacy tools
won't do it; however, people add features to GNU tools all the time, so
it's hard to be 100% sure without downloading the code and
reading/auditing/scanning it. I'm 99% sure cp and tar are OK.

> What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
> send/receive be affected?

clone is part of some file operation syscalls (e.g. clone_file_range,
dedupe_range) which make two different files, or two different offsets in
the same file, refer to the same physical extent.
This is the basis of deduplication (replacing separate copies with references to a single copy) and also of punching holes (a single reference is split into two references to the original extent with a hole object inserted in the middle). "reflink copy" is a synonym for "cp --reflink", which is clone_file_range using 0 as the start of range and EOF as the end. The term 'reflink' is sometimes used to refer to any extent shared between files that is not the result of a snapshot. reflink is to extents what a hardlink is to inodes, if you ignore some details. To trigger the bug you need to clone the same compressed source range to two nearly adjacent locations in the destination file (i.e. two or more ranges in the source overlap). cp --reflink never overlaps ranges, so it can't create the extent pattern that triggers this bug *by itself*. If the source file already has extent references arranged in a way that triggers the bug, then the copy made with cp --reflink will copy the arrangement to the new file (i.e. if you upgrade the kernel, you can correctly read both copies, and if you don't upgrade the kernel, both copies will appear to be corrupted, probably the same way). I would expect btrfs receive may be affected, but I did not find any code in receive that would be affected. There are a number of different ways to make a file with a hole in it, and btrfs receive could use a different one not affected by this bug. I don't use send/receive myself, so I don't have historical corruption data to guess from. > Or is there anything in btrfs itself which does any of the two per > default or on a typical system (i.e. I didn't use dedupe). 'btrfs' (the command-line utility) doesn't do these operations as far as I can tell. The kernel only does these when requested by applications. > Also, did the bug only affect data, or could metadata also be > affected... 
> basically should such filesystems be re-created since they may also
> hold corruptions in the meta-data like trees and so on?

Metadata is not affected by this bug. The bug only corrupts btrfs data
(specifically, the contents of files) in memory, not disk.

> My scenario looks about the following, and given your explanations, I'd
> assume I should probably be safe:
>
> - my normal laptop doesn't use compress, so it's safe anyway
>
> - my cp has an alias to always have --reflink=auto
>
> - two 8TB data archive disks, each with two backup disks to which the
>   data of the two master disks is btrfs sent/received,... which were
>   all mounted with compress
>
> - typically I either cp or mv data from the laptop to these disks,
>   => should then be safe as the laptop fs didn't use compress,...
>
> - or I directly create the files on the data disks (which use compress)
>   by means of wget, scp or similar from other sources
>   => should be safe, too, as they probably don't do dedupe/hole
>   punching by default
>
> - or I cp/mv from them camera SD cards, which use some *FAT
>   => so again I'd expect that to be fine
>
> - on vacation I had the case that I put large amount of picture/videos
>   from SD cards to some btrfs-with-compress mobile HDDs, and back home
>   from these HDDs to my actual data HDDs.
>   => here I do have the read / re-write pattern, so data could have
>   been corrupted if it was compressed + deduped/hole-punched
>   I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
>   well)... and AFAIU there would be no deduping/hole-punching
>   involved here

dedupe doesn't happen by itself on btrfs. You have to run dedupe
userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup,
etc...) or build a kernel with dedupe patches.

> - on my main data disks, I do snapshots... and these snapshots I
>   send/receive to the other (also compress-mounted) btrfs disks.
>   => could these operations involve deduping/hole-punching and thus the
>   corruption?
Snapshots won't interact with the bug--they are not affected by it and will not trigger it. Send could transmit incorrect data (if it uses the kernel's readpages path internally, I don't know if it does). Receive seems not to be affected (though it will not detect incorrect data from send). > Another thing: > I always store SHA512 hashsums of files as an XATTR of them (like > "directly after" creating such files). > I assume there would be no deduping/hole-punching involved till then, > so the sums should be from correct data, right? There's no assurance of that with this method. It's highly likely that the hashes match the input data, because the file will usually be cached in host RAM from when it was written, so the bug has no opportunity to appear. It's not impossible for other system activity to evict those cached pages between the copy and hash, so the hash function might reread the data from disk again and thus be exposed to the bug. Contrast with a copy tool which integrates the SHA512 function, so the SHA hash and the copy consume their data from the same RAM buffers. This reduces the risk of undetected error but still does not eliminate it. A DRAM access failure could corrupt either the data or SHA hash but not both, so the hash will fail verification later, but you won't know if the hash is incorrect or the data. If the source filesystem is not btrfs (and therefore cannot have this btrfs bug), you can calculate the SHA512 from the source filesystem and copy that to the xattr on the btrfs filesystem. That reduces the risk pool for data errors to the host RAM and CPU, the source filesystem, and the storage stack below the source filesystem (i.e. the generic set of problems that can occur on any system at any time and corrupt data during copy and hash operations). > But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the > final archive HDD... corruption could in principle occur when copying > from mobile HDD to archive HDD. 
> In that case, would a diff between the two show me the corruption? I > guess not because the diff would likely get the same corruption on > read? Upgrade your kernel before doing any verification activity; otherwise you'll just get false results. If you try to replace the data before upgrading the kernel, you're more likely to introduce new corruption where corruption did not exist before, or convert transient corruption events into permanent data corruption. You might even miss corrupted data because the bug tends to corrupt data in a consistent way. Once you have a kernel with the fix applied, diff will show any corruption in file copies, though 'cmp -l' might be much faster than diff on large binary files. Use just 'cmp' if you only want to know if any difference exists but don't need detailed information, or 'cmp -s' in a shell script. >[...] > I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch > holes and thus be not affected? > > Further, I'd assume XATTRs couldn't be affected? XATTRs aren't compressed file data, so they aren't affected by this bug which only affects compressed file data. > So what remains unanswered is send/receive: > > > btrfs send and receive may be affected, but I don't use them so I > > don't > > have any experience of the bug related to these tools. It seems from > > reading the btrfs receive code that it lacks any code capable of > > punching > > a hole, but I'm only doing a quick search for words like "punch", not > > a detailed code analysis. > > Is there some other developer who possibly knows whether send/receive > would have been vulnerable to the issue? > > > But since I use send/receive anyway in just one direction from the > master to the backup disks... only the later could be affected. I presume from this line of questioning that you are not in the habit of verifying the SHA512 hashes on your data every few weeks or months. 
If you had that step in your scheduled backup routine, then you would already be aware of data corruption bugs that affect you--or you'd already be reasonably confident that this bug has no impact on your setup. If you had asked questions like "is this bug the reason why I've been seeing random SHA hash verification failures for several years?" then you should worry about this bug; otherwise, it probably didn't affect you. > Thanks, > Chris. > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
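The scheduled verification routine described above can be sketched as
follows. This minimal stand-alone illustration uses a plain checksum
manifest (an xattr-based variant as discussed in the thread would store
the hash with `setfattr -n user.sha512` and read it back with
`getfattr`); the file names and contents are arbitrary:

```shell
#!/bin/bash
# Sketch of a periodic hash-verification pass. Hashes are recorded right
# after the files are written, while the data is still known-good, and
# re-checked later: silent corruption introduced by a later read/rewrite
# cycle shows up as a FAILED line and a nonzero exit status.
dir=$(mktemp -d)
printf 'some archived data\n' > "$dir/a"
printf 'more archived data\n' > "$dir/b"

# Record the hashes once, directly after creating the files...
( cd "$dir" && sha512sum a b > MANIFEST )

# ...verify later: clean data passes (exit status 0)...
( cd "$dir" && sha512sum -c --quiet MANIFEST ); clean=$?

# ...and a single corrupted byte is caught on the next scheduled check.
printf 'some archivEd data\n' > "$dir/a"
( cd "$dir" && sha512sum -c --quiet MANIFEST ); corrupt=$?

echo "clean=$clean corrupt=$corrupt"
rm -rf "$dir"
```

Run from cron or a systemd timer every few weeks, a pass like this turns
silent read corruption into a visible verification failure.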
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-07 20:07 ` Zygo Blaxell @ 2019-03-08 10:37 ` Filipe Manana 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-14 20:22 ` Christoph Anton Mitterer 2019-03-08 12:20 ` Austin S. Hemmelgarn 2019-03-14 18:58 ` Christoph Anton Mitterer 2 siblings, 2 replies; 38+ messages in thread From: Filipe Manana @ 2019-03-08 10:37 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, linux-btrfs On Thu, Mar 7, 2019 at 8:14 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote: > > Hey. > > > > > > Thanks for your elaborate explanations :-) > > > > > > On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote: > > > The problem occurs only on reads. Data that is written to disk will > > > be OK, and can be read correctly by a fixed kernel. > > > > > > A kernel without the fix will give corrupt data on reads with no > > > indication of corruption other than the changes to the data itself. > > > > > > Applications that copy data may read corrupted data and write it back > > > to the filesystem. This will make the corruption permanent in the > > > copied data. > > > > So that basically means even a cp (without refcopy) or a btrfs > > send/receive could already cause permanent silent data corruption. > > Of course, only if the conditions you've described below are met. > > > > > > > Given the age of the bug > > > > Since when was it in the kernel? > > Since at least 2015. Note that if you are looking for an end date for > "clean" data, you may be disappointed. It's been around since compression was introduced (October 2008). The read ahead path was buggy for the case where the same compressed extent is shared consecutively. 
I fixed 2 bugs there back in 2015 but missed the case where there's a hole that makes the compressed extent be shared with a non-zero start offset, which is the case that was fixed recently. > > In 2016 there were two kernel bugs that silently corrupted reads of > compressed data. In 2015 there were...4? 5? Before 2015 the problems > are worse, also damaging on-disk compressed data and crashing the kernel. > The bugs that were present in 2014 were present since compression was > introduced in 2008. > > With this last fix, as far as I know, we have a kernel that can read > compressed data without corruption for the first time--at least for a > subset of use cases that doesn't include direct IO. Of course I thought > the same thing in 2017, too, but I have since proven myself wrong. > > When btrfs gets to the point where it doesn't fail backup verification for > some contiguous years, then I'll be satisfied btrfs (or any filesystem) > is properly debugged. I'll still run backup verification then, of > course--hardware breaks all the time, and broken hardware can corrupt > any data it touches. Verification failures point to broken hardware > much more often than btrfs data corruption bugs. > > > > Even > > > if > > > compression is enabled, the file data must be compressed for the bug > > > to > > > corrupt it. > > > > Is there a simple way to find files (i.e. pathnames) that were actually > > compressed? > > Run compsize (sometimes the package is named btrfs-compsize) and see if > there are any lines referring to zlib, zstd, or lzo in the output. > If it's all "total" and "none" then there's no compression in that file. > > filefrag -v reports non-inline compressed data extents with the "encoded" > flag, so > > if filefrag -v "$file" | grep -qw encoded; then > echo "$file" is compressed, do something here > fi > > might also be a solution (assuming your filename doesn't include the > string 'encoded'). 
> > > > - you never punch holes in files > > > > Is there any "standard application" (like cp, tar, etc.) that would do > > this? > > Legacy POSIX doesn't have the hole-punching concept, so legacy > tools won't do it; however, people add features to GNU tools all the > time, so it's hard to be 100% sure without downloading the code and > reading/auditing/scanning it. I'm 99% sure cp and tar are OK. > > > What do you mean by clone? refcopy? Would btrfs snapshots or btrfs > > send/receive be affected? > > clone is part of some file operation syscalls (e.g. clone_file_range, > dedupe_range) which make two different files, or two different offsets in > the same file, refer to the same physical extent. This is the basis of > deduplication (replacing separate copies with references to a single > copy) and also of punching holes (a single reference is split into > two references to the original extent with a hole object inserted in > the middle). > > "reflink copy" is a synonym for "cp --reflink", which is clone_file_range > using 0 as the start of range and EOF as the end. The term 'reflink' > is sometimes used to refer to any extent shared between files that is > not the result of a snapshot. reflink is to extents what a hardlink is > to inodes, if you ignore some details. > > To trigger the bug you need to clone the same compressed source range > to two nearly adjacent locations in the destination file (i.e. two or > more ranges in the source overlap). cp --reflink never overlaps ranges, > so it can't create the extent pattern that triggers this bug *by itself*. > > If the source file already has extent references arranged in a way > that triggers the bug, then the copy made with cp --reflink will copy > the arrangement to the new file (i.e. if you upgrade the kernel, you > can correctly read both copies, and if you don't upgrade the kernel, > both copies will appear to be corrupted, probably the same way). 
> > I would expect btrfs receive may be affected, but I did not find any > code in receive that would be affected. There are a number of different > ways to make a file with a hole in it, and btrfs receive could use a > different one not affected by this bug. I don't use send/receive myself, > so I don't have historical corruption data to guess from. > > > Or is there anything in btrfs itself which does any of the two per > > default or on a typical system (i.e. I didn't use dedupe). > > 'btrfs' (the command-line utility) doesn't do these operations as far > as I can tell. The kernel only does these when requested by applications. > > > Also, did the bug only affect data, or could metadata also be > > affected... basically should such filesystems be re-created since they > > may also hold corruptions in the meta-data like trees and so on? > > Metadata is not affected by this bug. The bug only corrupts btrfs data > (specificially, the contents of files) in memory, not disk. > > > My scenario looks about the following, and given your explanations, I'd > > assume I should probably be safe: > > > > - my normal laptop doesn't use compress, so it's safe anyway > > > > - my cp has an alias to always have --reflink=auto > > > > - two 8TB data archive disks, each with two backup disks to which the > > data of the two master disks is btrfs sent/received,... which were > > all mounted with compress > > > > > > - typically I either cp or mv data from the laptop to these disks, > > => should then be safe as the laptop fs didn't use compress,... 
> > > > - or I directly create the files on the data disks (which use compress) > > by means of wget, scp or similar from other sources > > => should be safe, too, as they probably don't do dedupe/hole > > punching by default > > > > - or I cp/mv from them camera SD cards, which use some *FAT > > => so again I'd expect that to be fine > > > > - on vacation I had the case that I put large amount of picture/videos > > from SD cards to some btrfs-with-compress mobile HDDs, and back home > > from these HDDs to my actual data HDDs. > > => here I do have the read / re-write pattern, so data could have > > been corrupted if it was compressed + deduped/hole-punched > > I'd guess that's anyway not the case (JPEGs/MPEGs don't compress > > well)... and AFAIU there would be no deduping/hole-punching > > involved here > > dedupe doesn't happen by itself on btrfs. You have to run dedupe > userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup, > etc...) or build a kernel with dedupe patches. > > > - on my main data disks, I do snapshots... and these snapshots I > > send/receive to the other (also compress-mounted) btrfs disks. > > => could these operations involve deduping/hole-punching and thus the > > corruption? > > Snapshots won't interact with the bug--they are not affected by it > and will not trigger it. Send could transmit incorrect data (if it > uses the kernel's readpages path internally, I don't know if it does). > Receive seems not to be affected (though it will not detect incorrect > data from send). > > > Another thing: > > I always store SHA512 hashsums of files as an XATTR of them (like > > "directly after" creating such files). > > I assume there would be no deduping/hole-punching involved till then, > > so the sums should be from correct data, right? > > There's no assurance of that with this method. 
It's highly likely that > the hashes match the input data, because the file will usually be cached > in host RAM from when it was written, so the bug has no opportunity to > appear. It's not impossible for other system activity to evict those > cached pages between the copy and hash, so the hash function might reread > the data from disk again and thus be exposed to the bug. > > Contrast with a copy tool which integrates the SHA512 function, so > the SHA hash and the copy consume their data from the same RAM buffers. > This reduces the risk of undetected error but still does not eliminate it. > A DRAM access failure could corrupt either the data or SHA hash but not > both, so the hash will fail verification later, but you won't know if > the hash is incorrect or the data. > > If the source filesystem is not btrfs (and therefore cannot have this > btrfs bug), you can calculate the SHA512 from the source filesystem and > copy that to the xattr on the btrfs filesystem. That reduces the risk > pool for data errors to the host RAM and CPU, the source filesystem, > and the storage stack below the source filesystem (i.e. the generic > set of problems that can occur on any system at any time and corrupt > data during copy and hash operations). > > > But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the > > final archive HDD... corruption could in principle occur when copying > > from mobile HDD to archive HDD. > > In that case, would a diff between the two show me the corruption? I > > guess not because the diff would likely get the same corruption on > > read? > > Upgrade your kernel before doing any verification activity; otherwise > you'll just get false results. > > If you try to replace the data before upgrading the kernel, you're more > likely to introduce new corruption where corruption did not exist before, > or convert transient corruption events into permanent data corruption. 
> You might even miss corrupted data because the bug tends to corrupt data > in a consistent way. > > Once you have a kernel with the fix applied, diff will show any corruption > in file copies, though 'cmp -l' might be much faster than diff on large > binary files. Use just 'cmp' if you only want to know if any difference > exists but don't need detailed information, or 'cmp -s' in a shell script. > > >[...] > > I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch > > holes and thus be not affected? > > > > Further, I'd assume XATTRs couldn't be affected? > > XATTRs aren't compressed file data, so they aren't affected by this bug > which only affects compressed file data. > > > So what remains unanswered is send/receive: > > > > > btrfs send and receive may be affected, but I don't use them so I > > > don't > > > have any experience of the bug related to these tools. It seems from > > > reading the btrfs receive code that it lacks any code capable of > > > punching > > > a hole, but I'm only doing a quick search for words like "punch", not > > > a detailed code analysis. > > > > Is there some other developer who possibly knows whether send/receive > > would have been vulnerable to the issue? > > > > > > But since I use send/receive anyway in just one direction from the > > master to the backup disks... only the later could be affected. > > I presume from this line of questioning that you are not in the habit > of verifying the SHA512 hashes on your data every few weeks or months. > If you had that step in your scheduled backup routine, then you would > already be aware of data corruption bugs that affect you--or you'd > already be reasonably confident that this bug has no impact on your setup. > > If you had asked questions like "is this bug the reason why I've been > seeing random SHA hash verification failures for several years?" then > you should worry about this bug; otherwise, it probably didn't affect you. > > > Thanks, > > Chris. 
> > > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
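The filefrag hint quoted above can be hardened against its stated caveat
(a filename containing the string "encoded") by matching only the flags
field of the extent lines. A rough sketch; it assumes the GNU/e2fsprogs
`filefrag -v` output format where per-extent flags (e.g.
"last,encoded,eof") appear in the final column, and the directory
argument is illustrative:

```shell
#!/bin/bash
# Sketch of scanning a tree for btrfs-compressed files by looking for the
# "encoded" extent flag, matched only in the flags column so that
# filenames containing "encoded" don't produce false positives.
is_compressed () {
    filefrag -v "$1" 2>/dev/null |
        awk '$NF ~ /(^|,)encoded(,|$)/ { found = 1 } END { exit !found }'
}

# Scan only when a directory is given, e.g.: ./scan.sh /mnt/data
if [ -n "$1" ]; then
    find "$1" -xdev -type f -print0 |
    while IFS= read -r -d '' f; do
        is_compressed "$f" && printf '%s\n' "$f"
    done
fi
```

Files on filesystems where filefrag fails, and files with no compressed
extents, are simply not reported.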
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-08 10:37 ` Filipe Manana @ 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-14 20:22 ` Christoph Anton Mitterer 1 sibling, 0 replies; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-14 18:58 UTC (permalink / raw) To: fdmanana, Zygo Blaxell; +Cc: linux-btrfs Hey again. Just wondered about the inclusion status of this patch? The first merge I could find from Linus was 2 days ago for the upcoming 5.1. It doesn't seem to be in any of the stable kernels yet, neither in 5.0.x? Is this still coming to the stable kernels for distros or could it have gotten missed there? Debian has it in unstable since 4.19.28-1 (see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922306) Cheers, Chris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-08 10:37 ` Filipe Manana 2019-03-14 18:58 ` Christoph Anton Mitterer @ 2019-03-14 20:22 ` Christoph Anton Mitterer 2019-03-14 22:39 ` Filipe Manana 1 sibling, 1 reply; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-14 20:22 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs Oh and just for double checking: In the original patch you've posted and which Zygo tested, AFAIU, you had one line replaced. ( https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 ) In the one submitted there were two occasions of replacing em->orig_start with em->start. ( https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/ ) I assume that's on purpose? Cheers, Chris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-14 20:22 ` Christoph Anton Mitterer @ 2019-03-14 22:39 ` Filipe Manana 0 siblings, 0 replies; 38+ messages in thread From: Filipe Manana @ 2019-03-14 22:39 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs On Thu, Mar 14, 2019 at 8:22 PM Christoph Anton Mitterer <calestyo@scientia.net> wrote: > > Oh and just for double checking: > > In the original patch you've posted and which Zygo tested, AFAIU, you > had one line replaced. > ( https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3 ) > > In the one submitted there were two occasions of replacing > em->orig_start with em->start. > ( https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/ ) > > I assume that's on purpose? Yes. > > Cheers, > Chris. > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-07 20:07 ` Zygo Blaxell 2019-03-08 10:37 ` Filipe Manana @ 2019-03-08 12:20 ` Austin S. Hemmelgarn 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-14 18:58 ` Christoph Anton Mitterer 2 siblings, 1 reply; 38+ messages in thread From: Austin S. Hemmelgarn @ 2019-03-08 12:20 UTC (permalink / raw) To: Zygo Blaxell, Christoph Anton Mitterer; +Cc: linux-btrfs On 2019-03-07 15:07, Zygo Blaxell wrote: > On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote: >> Hey. >> >> >> Thanks for your elaborate explanations :-) >> >> >> On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote: >>> The problem occurs only on reads. Data that is written to disk will >>> be OK, and can be read correctly by a fixed kernel. >>> >>> A kernel without the fix will give corrupt data on reads with no >>> indication of corruption other than the changes to the data itself. >>> >>> Applications that copy data may read corrupted data and write it back >>> to the filesystem. This will make the corruption permanent in the >>> copied data. >> >> So that basically means even a cp (without refcopy) or a btrfs >> send/receive could already cause permanent silent data corruption. >> Of course, only if the conditions you've described below are met. >> >> >>> Given the age of the bug >> >> Since when was it in the kernel? > > Since at least 2015. Note that if you are looking for an end date for > "clean" data, you may be disappointed. > > In 2016 there were two kernel bugs that silently corrupted reads of > compressed data. In 2015 there were...4? 5? Before 2015 the problems > are worse, also damaging on-disk compressed data and crashing the kernel. > The bugs that were present in 2014 were present since compression was > introduced in 2008. 
> > With this last fix, as far as I know, we have a kernel that can read > compressed data without corruption for the first time--at least for a > subset of use cases that doesn't include direct IO. Of course I thought > the same thing in 2017, too, but I have since proven myself wrong. > > When btrfs gets to the point where it doesn't fail backup verification for > some contiguous years, then I'll be satisfied btrfs (or any filesystem) > is properly debugged. I'll still run backup verification then, of > course--hardware breaks all the time, and broken hardware can corrupt > any data it touches. Verification failures point to broken hardware > much more often than btrfs data corruption bugs. > >>> Even >>> if >>> compression is enabled, the file data must be compressed for the bug >>> to >>> corrupt it. >> >> Is there a simple way to find files (i.e. pathnames) that were actually >> compressed? > > Run compsize (sometimes the package is named btrfs-compsize) and see if > there are any lines referring to zlib, zstd, or lzo in the output. > If it's all "total" and "none" then there's no compression in that file. > > filefrag -v reports non-inline compressed data extents with the "encoded" > flag, so > > if filefrag -v "$file" | grep -qw encoded; then > echo "$file" is compressed, do something here > fi > > might also be a solution (assuming your filename doesn't include the > string 'encoded'). > >>> - you never punch holes in files >> >> Is there any "standard application" (like cp, tar, etc.) that would do >> this? > > Legacy POSIX doesn't have the hole-punching concept, so legacy > tools won't do it; however, people add features to GNU tools all the > time, so it's hard to be 100% sure without downloading the code and > reading/auditing/scanning it. I'm 99% sure cp and tar are OK. They are, the only things they do with sparse files are creating new ones from scratch using the standard seek then write method. 
The same is true of a vast majority of applications as well. The stuff most people would have to worry about largely comes down to: * VM software. Some hypervisors such as QEMU can be configured to translate discard commands issued against the emulated block devices to fallocate calls to punch holes in the VM disk image file (and QEMU can be configured to translate block writes of null bytes to this too), though I know of none that do this by default. * Database software. This is what stuff like punching holes originated for, so it's obviously a potential source of this issue. * FUSE filesystem drivers. Most of them that support the required fallocate flag to punch holes pass it down directly. Some make use of it themselves too. * Userspace distributed storage systems. Stuff like Ceph or Gluster. Same arguments as above for FUSE filesystem drivers. > >> What do you mean by clone? refcopy? Would btrfs snapshots or btrfs >> send/receive be affected? > > clone is part of some file operation syscalls (e.g. clone_file_range, > dedupe_range) which make two different files, or two different offsets in > the same file, refer to the same physical extent. This is the basis of > deduplication (replacing separate copies with references to a single > copy) and also of punching holes (a single reference is split into > two references to the original extent with a hole object inserted in > the middle). > > "reflink copy" is a synonym for "cp --reflink", which is clone_file_range > using 0 as the start of range and EOF as the end. The term 'reflink' > is sometimes used to refer to any extent shared between files that is > not the result of a snapshot. reflink is to extents what a hardlink is > to inodes, if you ignore some details. > > To trigger the bug you need to clone the same compressed source range > to two nearly adjacent locations in the destination file (i.e. two or > more ranges in the source overlap). 
cp --reflink never overlaps ranges, > so it can't create the extent pattern that triggers this bug *by itself*. > > If the source file already has extent references arranged in a way > that triggers the bug, then the copy made with cp --reflink will copy > the arrangement to the new file (i.e. if you upgrade the kernel, you > can correctly read both copies, and if you don't upgrade the kernel, > both copies will appear to be corrupted, probably the same way). > > I would expect btrfs receive may be affected, but I did not find any > code in receive that would be affected. There are a number of different > ways to make a file with a hole in it, and btrfs receive could use a > different one not affected by this bug. I don't use send/receive myself, > so I don't have historical corruption data to guess from. > >> Or is there anything in btrfs itself which does any of the two per >> default or on a typical system (i.e. I didn't use dedupe). > > 'btrfs' (the command-line utility) doesn't do these operations as far > as I can tell. The kernel only does these when requested by applications. The receive command will issue clone operations if the sent subvolume requires it to get the correct block layout, so there is a 'regular' BTRFS operation that can in theory set things up such that the required patterns are more likely to happen. > >> Also, did the bug only affect data, or could metadata also be >> affected... basically should such filesystems be re-created since they >> may also hold corruptions in the meta-data like trees and so on? > > Metadata is not affected by this bug. The bug only corrupts btrfs data > (specifically, the contents of files) in memory, not disk.
> >> My scenario looks about the following, and given your explanations, I'd >> assume I should probably be safe: >> >> - my normal laptop doesn't use compress, so it's safe anyway >> >> - my cp has an alias to always have --reflink=auto >> >> - two 8TB data archive disks, each with two backup disks to which the >> data of the two master disks is btrfs sent/received,... which were >> all mounted with compress >> >> >> - typically I either cp or mv data from the laptop to these disks, >> => should then be safe as the laptop fs didn't use compress,... >> >> - or I directly create the files on the data disks (which use compress) >> by means of wget, scp or similar from other sources >> => should be safe, too, as they probably don't do dedupe/hole >> punching by default >> >> - or I cp/mv from the camera SD cards, which use some *FAT >> => so again I'd expect that to be fine >> >> - on vacation I had the case that I put large amounts of pictures/videos >> from SD cards to some btrfs-with-compress mobile HDDs, and back home >> from these HDDs to my actual data HDDs. >> => here I do have the read / re-write pattern, so data could have >> been corrupted if it was compressed + deduped/hole-punched >> I'd guess that's anyway not the case (JPEGs/MPEGs don't compress >> well)... and AFAIU there would be no deduping/hole-punching >> involved here > > dedupe doesn't happen by itself on btrfs. You have to run dedupe > userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup, > etc...) or build a kernel with dedupe patches. > >> - on my main data disks, I do snapshots... and these snapshots I >> send/receive to the other (also compress-mounted) btrfs disks. >> => could these operations involve deduping/hole-punching and thus the >> corruption? > > Snapshots won't interact with the bug--they are not affected by it > and will not trigger it. Send could transmit incorrect data (if it > uses the kernel's readpages path internally, I don't know if it does).
> Receive seems not to be affected (though it will not detect incorrect > data from send). > >> Another thing: >> I always store SHA512 hashsums of files as an XATTR of them (like >> "directly after" creating such files). >> I assume there would be no deduping/hole-punching involved till then, >> so the sums should be from correct data, right? > > There's no assurance of that with this method. It's highly likely that > the hashes match the input data, because the file will usually be cached > in host RAM from when it was written, so the bug has no opportunity to > appear. It's not impossible for other system activity to evict those > cached pages between the copy and hash, so the hash function might reread > the data from disk again and thus be exposed to the bug. > > Contrast with a copy tool which integrates the SHA512 function, so > the SHA hash and the copy consume their data from the same RAM buffers. > This reduces the risk of undetected error but still does not eliminate it. > A DRAM access failure could corrupt either the data or SHA hash but not > both, so the hash will fail verification later, but you won't know if > the hash is incorrect or the data. > > If the source filesystem is not btrfs (and therefore cannot have this > btrfs bug), you can calculate the SHA512 from the source filesystem and > copy that to the xattr on the btrfs filesystem. That reduces the risk > pool for data errors to the host RAM and CPU, the source filesystem, > and the storage stack below the source filesystem (i.e. the generic > set of problems that can occur on any system at any time and corrupt > data during copy and hash operations). > >> But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the >> final archive HDD... corruption could in principle occur when copying >> from mobile HDD to archive HDD. >> In that case, would a diff between the two show me the corruption? I >> guess not because the diff would likely get the same corruption on >> read? 
> > Upgrade your kernel before doing any verification activity; otherwise > you'll just get false results. > > If you try to replace the data before upgrading the kernel, you're more > likely to introduce new corruption where corruption did not exist before, > or convert transient corruption events into permanent data corruption. > You might even miss corrupted data because the bug tends to corrupt data > in a consistent way. > > Once you have a kernel with the fix applied, diff will show any corruption > in file copies, though 'cmp -l' might be much faster than diff on large > binary files. Use just 'cmp' if you only want to know if any difference > exists but don't need detailed information, or 'cmp -s' in a shell script. > >> [...] >> I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch >> holes and thus be not affected? >> >> Further, I'd assume XATTRs couldn't be affected? > > XATTRs aren't compressed file data, so they aren't affected by this bug > which only affects compressed file data. > >> So what remains unanswered is send/receive: >> >>> btrfs send and receive may be affected, but I don't use them so I >>> don't >>> have any experience of the bug related to these tools. It seems from >>> reading the btrfs receive code that it lacks any code capable of >>> punching >>> a hole, but I'm only doing a quick search for words like "punch", not >>> a detailed code analysis. >> >> Is there some other developer who possibly knows whether send/receive >> would have been vulnerable to the issue? >> >> >> But since I use send/receive anyway in just one direction from the >> master to the backup disks... only the latter could be affected. > > I presume from this line of questioning that you are not in the habit > of verifying the SHA512 hashes on your data every few weeks or months.
> If you had that step in your scheduled backup routine, then you would > already be aware of data corruption bugs that affect you--or you'd > already be reasonably confident that this bug has no impact on your setup. > > If you had asked questions like "is this bug the reason why I've been > seeing random SHA hash verification failures for several years?" then > you should worry about this bug; otherwise, it probably didn't affect you. > >> Thanks, >> Chris. >> >> ^ permalink raw reply [flat|nested] 38+ messages in thread
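[Editorial sketch] Zygo's detection and verification advice above can be collected into a small script. The filefrag "encoded" flag and the cmp invocations come from the thread; the helper names and example paths are illustrative, and parsing only the flags column with awk is an assumed workaround for the caveat about filenames containing the string "encoded".

```shell
#!/bin/bash
# Sketch of the workflow described above: find files with compressed
# extents, then verify each file against its backup copy with cmp,
# which is faster than diff on large binaries.

# True if $1 contains compressed (non-inline) extents.  filefrag -v
# marks them with the "encoded" flag; looking only at the last (flags)
# field avoids false matches on filenames containing "encoded".
is_compressed() {
    filefrag -v "$1" 2>/dev/null | awk '$NF ~ /encoded/ { found=1 } END { exit !found }'
}

# True if the two files are byte-identical; cmp -s prints nothing and
# reports the result via exit status only (handy in scripts).
verify_copy() {
    cmp -s -- "$1" "$2"
}

# Example pass over a tree and its backup (hypothetical mount points):
# find /data -type f | while read -r f; do
#     verify_copy "$f" "/backup${f#/data}" || echo "DIFFERS: $f"
# done
```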
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-08 12:20 ` Austin S. Hemmelgarn @ 2019-03-14 18:58 ` Christoph Anton Mitterer 0 siblings, 0 replies; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-14 18:58 UTC (permalink / raw) To: Austin S. Hemmelgarn, Zygo Blaxell; +Cc: linux-btrfs On Fri, 2019-03-08 at 07:20 -0500, Austin S. Hemmelgarn wrote: > On 2019-03-07 15:07, Zygo Blaxell wrote: > > Legacy POSIX doesn't have the hole-punching concept, so legacy > > tools won't do it; however, people add features to GNU tools all > > the > > time, so it's hard to be 100% sure without downloading the code and > > reading/auditing/scanning it. I'm 99% sure cp and tar are OK. > > > They are, the only things they do with sparse files are creating new > ones from scratch using the standard seek then write method. The > same > is true of a vast majority of applications as well. Thanks for your confirmation. > The stuff most > people would have to worry about largely comes down to: > > * VM software. Some hypervisors such as QEMU can be configured to > translate discard commands issued against the emulated block devices > to > fallocate calls to punch holes in the VM disk image file (and QEMU > can > be configured to translate block writes of null bytes to this too), > though I know of none that do this by default. > * Database software. This is what stuff like punching holes > originated > for, so it's obviously a potential source of this issue. > * FUSE filesystem drivers. Most of them that support the required > fallocate flag to punch holes pass it down directly. Some make use > of > it themselves too. > * Userspace distributed storage systems. Stuff like Ceph or > Gluster. > Same arguments as above for FUSE filesystem drivers. These do at least not affect me personally, though only because I didn't use compress, where I use qemu (which I have configured to pass on the TRIMs). 
> > 'btrfs' (the command-line utility) doesn't do these operations as > > far > > as I can tell. The kernel only does these when requested by > > applications. > The receive command will issue clone operations if the sent > subvolume > requires it to get the correct block layout, so there is a 'regular' > BTRFS operation that can in theory set things up such that the > required > patterns are more likely to happen. As long as snapshotting itself doesn't create the issue, I should be still safe at least on my master disks (which were always only the source of send/receive), which I'll now compare to the backup disks. Thanks, Chris. ^ permalink raw reply [flat|nested] 38+ messages in thread
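[Editorial sketch] The hole punching Austin describes above (qemu translating guest discards, databases) is the fallocate(2) operation FALLOC_FL_PUNCH_HOLE, exposed as fallocate -p in util-linux. A minimal sketch follows; the temp file stands in for a VM image, the offsets are arbitrary, and the filesystem must support hole punching (btrfs, ext4, xfs, tmpfs, ...).

```shell
#!/bin/bash
# Sketch of hole punching: deallocate a range in the middle of an
# existing file.  The range then reads back as zeros while the
# logical file size stays the same.
f=$(mktemp)
head -c $((128 * 1024)) /dev/urandom > "$f"

# Punch a 4 KiB hole at offset 8192, as qemu might do when a guest
# discards blocks inside a raw image file.
fallocate -p -o 8192 -l 4096 "$f"

# Logical size unchanged; fewer blocks allocated on disk.
stat -c 'size: %s bytes, %b blocks allocated' "$f"

# The punched range now reads as zeros.
dd if="$f" bs=4096 skip=2 count=1 2>/dev/null | cmp -s - <(head -c 4096 /dev/zero) \
    && echo "punched range reads as zeros"
rm -f "$f"
```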
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-07 20:07 ` Zygo Blaxell 2019-03-08 10:37 ` Filipe Manana 2019-03-08 12:20 ` Austin S. Hemmelgarn @ 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-15 5:28 ` Zygo Blaxell 2 siblings, 1 reply; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-14 18:58 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs Hey again. And again thanks for your time and further elaborate explanations :-) On Thu, 2019-03-07 at 15:07 -0500, Zygo Blaxell wrote: > In 2016 there were two kernel bugs that silently corrupted reads of > compressed data. In 2015 there were...4? 5? Before 2015 the > problems > are worse, also damaging on-disk compressed data and crashing the > kernel. > The bugs that were present in 2014 were present since compression was > introduced in 2008. Phew... too many [silent] corruption bugs in btrfs... :-( Actually I didn't even notice the others (which unfortunately doesn't mean I'm definitely not affected), so I probably cannot do/check much about them now... but only about the "recent" one that was fixed now. But maybe there should be something like a btrfs-announce list, i.e. a low volume mailing list, in which (interested) users are informed about more grave issues. Such things can happen and there's no one to blame for that... but if they happen it would be good for users to get notified so that they can check their systems and possibly recover data from (still existing) other sources. > Run compsize (sometimes the package is named btrfs-compsize) and see > if > there are any lines referring to zlib, zstd, or lzo in the output. > If it's all "total" and "none" then there's no compression in that > file.
> > filefrag -v reports non-inline compressed data extents with the > "encoded" > flag, so > > if filefrag -v "$file" | grep -qw encoded; then > echo "$file" is compressed, do something here > fi > > might also be a solution (assuming your filename doesn't include the > string 'encoded'). Will have a look at this. As for all the following: > > > - you never punch holes in files > > > > Is there any "standard application" (like cp, tar, etc.) that would > > do > > this? > > Legacy POSIX doesn't have the hole-punching concept, so legacy > tools won't do it; however, people add features to GNU tools all the > time, so it's hard to be 100% sure without downloading the code and > reading/auditing/scanning it. I'm 99% sure cp and tar are OK. > > > What do you mean by clone? refcopy? Would btrfs snapshots or btrfs > > send/receive be affected? > > clone is part of some file operation syscalls (e.g. clone_file_range, > dedupe_range) which make two different files, or two different > offsets in > the same file, refer to the same physical extent. This is the basis > of > deduplication (replacing separate copies with references to a single > copy) and also of punching holes (a single reference is split into > two references to the original extent with a hole object inserted in > the middle). > > "reflink copy" is a synonym for "cp --reflink", which is > clone_file_range > using 0 as the start of range and EOF as the end. The term 'reflink' > is sometimes used to refer to any extent shared between files that is > not the result of a snapshot. reflink is to extents what a hardlink > is > to inodes, if you ignore some details. > > To trigger the bug you need to clone the same compressed source range > to two nearly adjacent locations in the destination file (i.e. two or > more ranges in the source overlap). cp --reflink never overlaps > ranges, > so it can't create the extent pattern that triggers this bug *by > itself*. 
> > If the source file already has extent references arranged in a way > that triggers the bug, then the copy made with cp --reflink will copy > the arrangement to the new file (i.e. if you upgrade the kernel, you > can correctly read both copies, and if you don't upgrade the kernel, > both copies will appear to be corrupted, probably the same way). > > I would expect btrfs receive may be affected, but I did not find any > code in receive that would be affected. There are a number of > different > ways to make a file with a hole in it, and btrfs receive could use a > different one not affected by this bug. I don't use send/receive > myself, > so I don't have historical corruption data to guess from. > > > Or is there anything in btrfs itself which does any of the two per > > default or on a typical system (i.e. I didn't use dedupe). > > 'btrfs' (the command-line utility) doesn't do these operations as far > as I can tell. The kernel only does these when requested by > applications. > > > Also, did the bug only affect data, or could metadata also be > > affected... basically should such filesystems be re-created since > > they > > may also hold corruptions in the meta-data like trees and so on? > > Metadata is not affected by this bug. The bug only corrupts btrfs > data > (specifically, the contents of files) in memory, not disk. So all the above, AFAIU, basically boils down to the following: Unless such hole-punched files were brought into the filesystem by one of the rather special things like: - dedupe - an application that by itself does the hole-punching of which most users will probably only have qemu which can do it ...a normal user should probably not have encountered the issue, as it's not triggered by typical end-user operations (cp, mv, tar, btrfs send/receive, cp --reflink=always/auto).
With the exception that cp --reflink=always/auto will duplicate (but by itself not corrupt) a file that *ALREADY* has a reflink/hole pattern that is prone to the issue. So, AFAIU, such a file would be correctly copied, but on read it would also suffer from the corruption, just like the original. But again, if nothing like qemu was used in the first place, such a file shouldn't be in the filesystem. Further, I'd expect that if users followed the advice and used nodatacow on their qemu images,... compression would be disabled for these as well, and they'd be safe again, right? => Summarising... the issue is (with the exception of qemu and dedupe users) likely not that much of an issue for normal end-users. What about the direct IO issues that may be still present and which you've mentioned above... is this used somewhere per default / under normal circumstances? > > - or I directly create the files on the data disks (which use > > compress) > > by means of wget, scp or similar from other sources > > => should be safe, too, as they probably don't do dedupe/hole > > punching by default > > > > - or I cp/mv from the camera SD cards, which use some *FAT > > => so again I'd expect that to be fine > > > > - on vacation I had the case that I put large amounts of > > pictures/videos > > from SD cards to some btrfs-with-compress mobile HDDs, and back > > home > > from these HDDs to my actual data HDDs. > > => here I do have the read / re-write pattern, so data could have > > been corrupted if it was compressed + deduped/hole-punched > > I'd guess that's anyway not the case (JPEGs/MPEGs don't > > compress > > well)... and AFAIU there would be no deduping/hole-punching > > involved here > > dedupe doesn't happen by itself on btrfs. You have to run dedupe > userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, > bedup, > etc...) or build a kernel with dedupe patches. Neither of which I have, so I should be fine.
> It's highly likely > that > the hashes match the input data, because the file will usually be > cached > in host RAM from when it was written, so the bug has no opportunity > to > appear. That's what I had in mind. > It's not impossible for other system activity to evict those > cached pages between the copy and hash, so the hash function might > reread > the data from disk again and thus be exposed to the bug. Sure... which is especially likely to be the case for any bigger amounts of data that I've copied. But anything bigger is typically pictures/videos, which I would guess/assume not to be compressed at all. But even then I should be still safe, as cp --reflink=auto/always doesn't introduce the bug by itself, as you've said above. Right? > Contrast with a copy tool which integrates the SHA512 function, so > the SHA hash and the copy consume their data from the same RAM > buffers. > This reduces the risk of undetected error but still does not > eliminate it. Hehe, I'd like to see that in GNU coreutils ;-) > A DRAM access failure could corrupt either the data or SHA hash but > not > both Unless, against all odds in the universe... you get that one special hash collision where corrupted file and/or hash match again :D > so the hash will fail verification later, but you won't know if > the hash is incorrect or the data. Sure, but at least I would notice and could try to recover from some backup then.
I rather meant here: would a diff have noticed it the past (where I still had the originals)... for which the answer seems to be: possibly not > > But since I use send/receive anyway in just one direction from the > > master to the backup disks... only the later could be affected. > > I presume from this line of questioning that you are not in the habit > of verifying the SHA512 hashes on your data every few weeks or > months. Actually I do about every half year... my main point in the "investigation" of my typical usage scenarios above was, whether any of them could have introduced corruption in which my hashes wouldn't have noticed it. I guess all of my patterns of moving/copying data to these main data HDDs that used btrfs+compressions should be safe (since you said cp/mv is even with --reflink=always)... The only questionable one is, where I copied data from some SD card to an intermediate btrfs (that also used compression) and from there to the final location on the main data HDDs. Over time, I've used different ways to calc the XATTRs there: In earlier times I did it on the intermediate btrfs (which would make it in principle suspicious to not noticing corruption - if(!) I had not used cp only, which should be safe as you say)... followed (after clearing the kernel cache) by a recursive diff between SD and intermediate btrfs (assuming that btrfs' checksuming would show me any corruption error when re-reading from disk). Later I did it similarly to what you suggested above: Creating hash lists from the data on the SD... also creating the hashes for the XATTR on the intermediate btrfs (which would have again been in principle prone to the bug)... but then diffing the two, which should have shown me any corruption. > If you had that step in your scheduled backup routine, then you would > already be aware of data corruption bugs that affect you--or you'd > already be reasonably confident that this bug has no impact on your > setup. 
I think by now I'm pretty confident that I, personally, am safe. The main points for this were: - XATTRs not being affected - cp (with any value for --reflink=) never creating the corruption (as you've said both above) and with - send/receive likely being safe - snapshots being not affected means that my backup disks are likely unaffected as well. But obviously I'll check this (by verifying all hashes on the master disks... and by diffing the masters with the copies) on a fixed kernel, which I think has just landed in Debian unstable. Some time ago I had to split the previously one 8TiB master disk into two (both using compress) as the one ran out of space. But this should be also safe, as I've used just cp --reflink=auto which shouldn't introduce the bug by itself AFAIU, followed by extensive diff-ing... so especially the XATTRs should be still safe, too. Also, I always create a list of all hash+pathname from the XATTRs (basically in sha512sum(1) format and if I do another snapshot, I compare previous lists with the fresh one... so I'd have noticed any corruption there. So for me the main point was really, whether data could have been already corrupted when "introduced" to the filesystem via (especially) cp or a series of cp. > If you had asked questions like "is this bug the reason why I've been > seeing random SHA hash verification failures for several years?" then > you should worry about this bug; otherwise, it probably didn't affect > you. I think you're right... but my data with many thousands of pictures, etc. from all life is really precious to me, so I better wanted to understand the issue in "depth"... and I think these questions and your answers may still benefit others who may also want to find out whether they could have been silently affected :-) Cheers and thanks, Chris. ^ permalink raw reply [flat|nested] 38+ messages in thread
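[Editorial sketch] The hash-list routine described above (one hash+pathname line per file in sha512sum(1) format, re-checked against later states of the tree or a backup copy) can be sketched as follows; the directory and list names are examples, not taken from the thread:

```shell
#!/bin/bash
# Sketch of the verification routine described above: write one
# "HASH  PATH" line per file in sha512sum(1) format, then check a
# later state of the same tree (or a copy of it) against the list.
tree=$(mktemp -d)
printf 'precious pixels' > "$tree/img.jpg"

# Build the hash list.  Relative paths make the list reusable on a
# backup disk mounted somewhere else.
( cd "$tree" && find . -type f -print0 | xargs -0 -r sha512sum ) > hashes.sha512

# Later, on the master or on the backup: verify.  --quiet prints only
# failures; non-zero exit means at least one file changed or is unreadable.
if ( cd "$tree" && sha512sum --quiet -c ) < hashes.sha512; then
    echo "all hashes verified"
else
    echo "CORRUPTION OR CHANGE DETECTED"
fi
rm -rf "$tree" hashes.sha512
```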
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-14 18:58 ` Christoph Anton Mitterer @ 2019-03-15 5:28 ` Zygo Blaxell 2019-03-16 22:11 ` Christoph Anton Mitterer 0 siblings, 1 reply; 38+ messages in thread From: Zygo Blaxell @ 2019-03-15 5:28 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3384 bytes --] On Thu, Mar 14, 2019 at 07:58:45PM +0100, Christoph Anton Mitterer wrote: > Phew... too much [silent] corruption bugs in btrfs... :-( > > Actually I didn't even notice the others (which unfortunately doesn't > mean I'm definitely not affected), so I probably cannot much do/check > about them now... but only about the "recent" one that was fixed now. > > But maybe there should be something like a btrfs-announce list, i.e. a > low volume mailing list, in which (interested) users are informed about > more grave issues. > Such things can happen and there's no one to blame about that... but if > they happen it would be good for users to get notified so that they can > check their systems and possibly recover data from (still existing) > other sources. I don't know if it would be a low-volume list...every kernel release includes fixes for _some_ exotic corner case. > What about the direct IO issues that may be still present and which > you've mentioned above... is this used somewhere per default / under > normal circumstances? Direct IO is an odd case because it's not all that well understood what the correct behavior is. You can't prevent the kernel from making copies of data and also expect full data integrity and also lock-free performance, all at the same time. Pick any two, and pay for it with losses in the third. The bug fixes here are more along the lines of "OK so you're using direct IO which means you've basically admitted you don't care about *your* data, let's try not to corrupt *other* data on the filesystem at the same time." 
> I think by now I'm pretty confident that I, personally, am safe. It took me two years to find this bug, and I had to write a tool to encounter it often enough to notice. A lot of people are safe. > > If you had asked questions like "is this bug the reason why I've been > > seeing random SHA hash verification failures for several years?" then > > you should worry about this bug; otherwise, it probably didn't affect > > you. > > I think you're right... but my data with many thousands of pictures, > etc. from all life is really precious to me, so I better wanted to > understand the issue in "depth"... and I think these questions and your > answers may still benefit others who may also want to find out whether > they could have been silently affected :-) I found the 2017 compression bug in a lot of digital photographs. It turns out that several popular cameras (including some of the ones I own) put a big chunk of zeros near the beginnings of JPG files, and when rsync copies those it will insert a hole instead of copying the zeros. The 2017 bug affected "ordinary" holes so standard tools like cp and rsync could trigger it. Most photo tools ignore this data completely, so when garbage appears there, nobody notices. A similar thing happens to .o files: ld aligns things to 4K block boundaries, triggering the 2017 compressed read bug. Nobody reads that data either--it's just alignment padding. I don't think I found an application that cared about the 2017 bug at all. Only backup verifications. The 2018 bug is a different story--when it hits, it's obvious, and ordinary application things break--but it won't happen to typical photo image files, even with aggressive dedupe. > > Cheers and thanks, > Chris. > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-15 5:28 ` Zygo Blaxell @ 2019-03-16 22:11 ` Christoph Anton Mitterer 2019-03-17 2:54 ` Zygo Blaxell 0 siblings, 1 reply; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-16 22:11 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs

On Fri, 2019-03-15 at 01:28 -0400, Zygo Blaxell wrote:
> > But maybe there should be something like a btrfs-announce list, i.e. a low-volume mailing list, in which (interested) users are informed about more grave issues.
> > …
> I don't know if it would be a low-volume list... every kernel release includes fixes for _some_ exotic corner case.

Well, this one *may* be exotic for many users, but we have at least the use case of qemu, which seems to be not that exotic at all. And the ones you outline below seem even more common?

Also, the other means for end users to know whether something is stable or not, like https://btrfs.wiki.kernel.org/index.php/Status, don't seem to really work out. There is a known silent data corruption bug which seems so far only fixed in 5.1rc*... and the page still says stable since 4.14. Even now with the fix, one would probably need to wait a year or so until one could mark it stable again if nothing had been found by then.

> > What about the direct IO issues that may still be present and which you've mentioned above... is this used somewhere by default / under normal circumstances?
>
> Direct IO is an odd case because it's not all that well understood what the correct behavior is. You can't prevent the kernel from making copies of data and also expect full data integrity and also lock-free performance, all at the same time. Pick any two, and pay for it with losses in the third.
> The bug fixes here are more along the lines of "OK so you're using direct IO which means you've basically admitted you don't care about *your* data, let's try not to corrupt *other* data on the filesystem at the same time."

So... if btrfs allows for direct IO... and if this isn't stable in some situations... what can one do about it? I mean there doesn't seem to be an option to disallow it... and any program can do O_DIRECT (without even knowing btrfs is below).

Guess I have to go deeper down the rabbit hole now for the other compression bugs...

> I found the 2017 compression bug in a lot of digital photographs.

Is there any way (apart from having correct checksums) to find out whether a file was affected by the 2017 bug? Like, I don't know... looking for large chunks of zeros?

And is there any more detailed information available on the 2017 bug, in the sense of under which circumstances it occurred? Like also only on reads (which would mean again that I'd be mostly safe, because my checksums should mostly catch this)? Or just on dedupe or hole punching? Or did it only affect sparse files (and there only the holes (blocks of zeros), as in your camera JPG example)?

> It turns out that several popular cameras (including some of the ones I own) put a big chunk of zeros near the beginnings of JPG files, and when rsync copies those it will insert a hole instead of copying the zeros.

Many other types of files may have such big chunks of zeros too... basically everything that leaves room for meta-data.

> The 2017 bug affected "ordinary" holes so standard tools like cp and rsync could trigger it.

AFAIU, both cp and rsync (--sparse) don't create sparse files actively by default... cp (by default) only creates sparse files when it detects the source file to be already sparse. Same seems to be the case for tar, which only stores a file sparse (inside the archive) when --sparse is used.
So would one be safe from the 2017 bug if one hadn't had sparse files and hadn't activated sparse handling in any of these tools?

> Most photo tools ignore this data completely, so when garbage appears there, nobody notices.

So the 2017 bug meant that areas that should be zero were filled with garbage, but everything else was preserved correctly.

> I don't think I found an application that cared about the 2017 bug at all.

Well, for me it would still be helpful to know how to find out whether I might have been affected or not... I do have some really old backups, so recovery would be possible in many cases.

> The 2018 bug is a different story--when it hits, it's obvious, and ordinary application things break

Which one do you mean now? The one recently fixed on reads+holepunching/dedupe/clone? Cause I thought that one was not that obvious, as it was silent...

Anything still known about the even older compression-related corruption bugs that Filipe mentioned, in the sense of when they occurred and how to find out whether one was affected?

Thanks, Chris.

^ permalink raw reply [flat|nested] 38+ messages in thread
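Whether any sparse files ever existed can be checked mechanically: GNU find can print a file's "sparseness" (allocated blocks relative to apparent size). A minimal sketch, demonstrated on a scratch directory for safety; on a real system one would point the find at the filesystem root instead:

```shell
#!/bin/bash
set -e
# Sketch: detect files containing holes, i.e. files whose allocated
# blocks cover less than their apparent size. GNU find's %S directive
# prints sparseness = (st_blocks * 512) / st_size; a value below 1
# means the file has holes. Demonstrated on a scratch directory.
scan=$(mktemp -d)
truncate -s 1M "$scan/holey"             # 1 MiB apparent size, no data blocks
head -c 65536 /dev/zero > "$scan/solid"  # fully written file
sync "$scan/holey" "$scan/solid"         # make st_blocks reflect reality
find "$scan" -type f ! -size 0 -printf '%S\t%p\n' |
    awk -F'\t' '$1 < 1.0 { print $2 }'   # lists the sparse file only
```

Files this prints are the ones the sparse-related bugs could even have touched; a tree with no output here never held holes at scan time (though a file rewritten non-sparsely since would no longer show up).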
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-03-16 22:11 ` Christoph Anton Mitterer @ 2019-03-17 2:54 ` Zygo Blaxell 0 siblings, 0 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-03-17 2:54 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 8858 bytes --] On Sat, Mar 16, 2019 at 11:11:10PM +0100, Christoph Anton Mitterer wrote: > On Fri, 2019-03-15 at 01:28 -0400, Zygo Blaxell wrote: > > But maybe there should be something like a btrfs-announce list, > > > i.e. a > > > low volume mailing list, in which (interested) users are informed > > > about > > > more grave issues. > > > … > > I don't know if it would be a low-volume list...every kernel release > > includes fixes for _some_ exotic corner case. > > Well this one *may* be exotic for many users, but we have at least the > use case of qemu which seems to be not that exotic at all. > > And the ones you outline below seem even more common? > > Also the other means for end-users to know whether something is stable > or not like https://btrfs.wiki.kernel.org/index.php/Status don't seem > to really work out. It's hard to separate the signal from the noise. I first detected the 2018 bug in 2016, but didn't know it was a distinct bug until after eliminating all the other corruption causes that occurred during that time. I am still tracking issue(s) in btrfs that bring servers down multiple times a week, so I'm not in a hurry to declare any part of btrfs stable yet. When could we ever confidently say btrfs is stable? Some filesystems are 30 years old and still fixing bugs. See you in 2037? Now, that specific wiki page should probably be updated, since at least one outstanding bug is now known. > There is a known silent data corruption bug which seems so far only > fixed in 5.1rc* ... and the page still says stable since 4.14. 
> Even now with the fix, one would probably need to wait a year or so until one could mark it stable again if nothing had been found by then.

I sometimes use "it has been $N days since the last bug fix in $Y" as a crude metric of how trustworthy code is. adfs is 2913 days and counting! ext2 is only 106 days. btrfs and xfs seem to be competing for the lowest value of N, never rising above a few dozen except around holidays and conferences, with ext4 not far behind.

> So... if btrfs allows for direct IO... and if this isn't stable in some situations... what can one do about it? I mean there doesn't seem to be an option to disallow it...

Sure, but O_DIRECT is a performance/risk tradeoff. If you ask someone who uses csums or snapshots, they'll tell you btrfs should always put correct data and checksums on disk, even if the application does something weird and undefined like O_DIRECT. If you ask someone who wants the O_DIRECT performance, they'll tell you O_DIRECT should not waste time computing, verifying, reading, or writing csums, nor should users expect correct behavior from applications that don't follow the filesystem-specific rules correctly (for some implied definition of how correct applications should behave, because O_DIRECT is not a concrete specification), and that includes permitting undetected data corruption to be persisted on disk.

> and any program can do O_DIRECT (without even knowing btrfs is below).

Most filesystems permit silent data corruption all of the time, so btrfs is weird for disallowing silent data corruption some of the time.

> Guess I have to go deeper down the rabbit hole now for the other compression bugs...
>
> > I found the 2017 compression bug in a lot of digital photographs.
>
> Is there any way (apart from having correct checksums) to find out whether a file was affected by the 2017 bug? Like, I don't know... looking for large chunks of zeros?
You need to have an inline extent in the first 4096 bytes of the file and data starting at 4096 bytes. Normally that never happens, but it is possible to construct files that way with the right sequences of write(), seek(), and fsync(). They occur naturally in about one out of every 100,000 'rsync -S' files, which trigger a similar sequence of operations internally in the kernel.

The symptom is that the corrupted file has uninitialized kernel memory in the last bytes of the first 4096-byte block, where the correct file has 0 bytes. It turns out that uninitialized kernel memory is often full of zeros anyway, so even "corrupted" files come out unchanged most of the time. If you don't know what is supposed to be in those bytes (either from the file format, an uncorrupted copy of the file, or unexpected behavior when the file is used) then there's no way to know they're wrong.

> And is there any more detailed information available on the 2017 bug, in the sense of under which circumstances it occurred?

The kernel commit message for the fix is quite detailed.

> Like also only on reads (which would mean again that I'd be mostly safe, because my checksums should mostly catch this)?

Only reads, and only files with a specific structure, and only at a single specific location in the file.

> Or just on dedupe or hole punching? Or did it only affect sparse files (and there only the holes (blocks of zeros), as in your camera JPG example)?

You can't get the 2017 bug with dedupe--inline extents are not dedupable. You do need a sparse file. I didn't find the 2017 bug because of bees--I found it because of rsync -S.

> > It turns out that several popular cameras (including some of the ones I own) put a big chunk of zeros near the beginnings of JPG files, and when rsync copies those it will insert a hole instead of copying the zeros.
>
> Many other types of files may have such big chunks of zeros too... basically everything that leaves room for meta-data.
Only contiguous chunks of 0 that end at byte 4096 can be affected. 0 anywhere else in the file is the domain of the 2018 bug. Also, 2017 replaces 0 with invalid data, while 2018 replaces valid data with 0.

> AFAIU, both cp and rsync (--sparse) don't create sparse files actively by default... cp (by default) only creates sparse files when it detects the source file to be already sparse. Same seems to be the case for tar, which only stores a file sparse (inside the archive) when --sparse is used.
>
> So would one be safe from the 2017 bug if one hadn't had sparse files and hadn't activated sparse handling in any of these tools?

Probably. Even "unsafe" is less than a 1 in 100,000 event, so you're often safe even when using triggering tools (especially if the system is lightly loaded). Lots of tools make sparse files.

> > Most photo tools ignore this data completely, so when garbage appears there, nobody notices.
>
> So the 2017 bug meant that areas that should be zero were filled with garbage, but everything else was preserved correctly.

Yep.

> > I don't think I found an application that cared about the 2017 bug at all.
>
> Well, for me it would still be helpful to know how to find out whether I might have been affected or not... I do have some really old backups, so recovery would be possible in many cases.

You could compare those backups to current copies before discarding them. Or build a SHA table and keep a copy of it on online media for verification.

> > The 2018 bug is a different story--when it hits, it's obvious, and ordinary application things break
>
> Which one do you mean now? The one recently fixed on reads+holepunching/dedupe/clone? Cause I thought that one was not that obvious, as it was silent...

Many applications will squawk if you delete 32K of data randomly from the middle of their data files. There are crashes, garbage output, error messages, corrupted VM filesystem images (i.e. the guest's fsck complains).
A lot of issues magically disappear after applying the "2018" fix.

> Anything still known about the even older compression-related corruption bugs that Filipe mentioned, in the sense of when they occurred and how to find out whether one was affected?

Kernels from 2015 and earlier had assorted problems with compressed data. It's difficult to distinguish between them, or isolate specific syndromes to specific bug fixes. Not all of them were silent--there was a bug in 2014 that returned EIO instead of data when reading files affected by the 2017 bug (that change in behavior was a good clue about where to look for the 2017 fix). One of the bugs eventually manifests itself as a broken filesystem or a kernel panic when you write to an affected area of a file.

It's more practical to just assume anything stored on btrfs with compression on a kernel prior to 2015 is suspect until proven otherwise. In 2014 and earlier, you have to start suspecting uncompressed data too. Kernels between 2012 and 2014 crashed so often it was difficult to run data integrity verification tests with a significant corpus size.

> Thanks, > Chris.

[-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply [flat|nested] 38+ messages in thread
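The "inline extent followed by data at byte 4096" shape discussed above can be produced with ordinary tools; a minimal sketch, harmless on fixed kernels and on non-btrfs filesystems (where it merely yields a small file with a hole). Whether the first write actually becomes an inline extent depends on the filesystem, mount options, and kernel:

```shell
#!/bin/bash
set -e
# Sketch of the file shape associated with the 2017 bug: a small,
# flushed write at the start of the file (an inline-extent candidate on
# btrfs with compression), then data beginning exactly at byte 4096.
f=$(mktemp)
head -c 100 /dev/zero | tr '\0' '\021' > "$f"   # 100 bytes of 0x11
sync "$f"                                        # commit the small write
head -c 4096 /dev/zero | tr '\0' '\042' |        # 4096 bytes of 0x22...
    dd of="$f" bs=4096 seek=1 conv=notrunc iflag=fullblock status=none
sync "$f"                                        # ...written at offset 4096
stat -c '%n: %s bytes' "$f"   # 8192 bytes: 100 data, a hole, 4096 data
```

On an affected kernel, re-reading the tail of the first 4096-byte block of such a file (after dropping caches) is where the uninitialized-memory symptom would appear; everywhere else this is just an ordinary sparse file.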
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-14 12:21 ` Christoph Anton Mitterer 2019-02-15 5:40 ` Zygo Blaxell @ 2019-02-15 12:02 ` Filipe Manana 2019-03-04 15:46 ` Christoph Anton Mitterer 1 sibling, 1 reply; 38+ messages in thread From: Filipe Manana @ 2019-02-15 12:02 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Thu, Feb 14, 2019 at 11:10 PM Christoph Anton Mitterer <calestyo@scientia.net> wrote:
>
> On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> > The following one liner fixes it:
> > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
>
> Great to see that fixed... is there any advice that can be given for users/admins?

Upgrade to a kernel with the patch (none yet) or build it from source? Not sure what kind of advice you are looking for.

> Like whether and how any corruption that occurred can be detected (right now, people may still have backups)?
>
> Or under which exact circumstances did the corruption happen? And under which was one safe? E.g. only on specific compression algos (I've been using -o compress (which should be zlib) for quite a while but never found any corruption)... or only when specific file operations were done (I did e.g. cp with reflink copies, but I think none of the standard tools does hole-punching)?

As I said in the previous reply, and in the patch's changelog [1], the corruption happens at read time. That means nothing stored on disk is corrupted. It's not the end of the world.

[1] https://lore.kernel.org/linux-btrfs/20190214151720.23563-1-fdmanana@kernel.org/

> Cheers, > Chris.

-- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-15 12:02 ` Filipe Manana @ 2019-03-04 15:46 ` Christoph Anton Mitterer 0 siblings, 0 replies; 38+ messages in thread From: Christoph Anton Mitterer @ 2019-03-04 15:46 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs

On Fri, 2019-02-15 at 12:02 +0000, Filipe Manana wrote:
> Upgrade to a kernel with the patch (none yet) or build it from source?
> Not sure what kind of advice you are looking for.

Well, more something of the kind that Zygo wrote in his mail, i.e. some explanation of the whole issue in order to find out whether one might be affected or not.

> As I said in the previous reply, and in the patch's changelog [1], the corruption happens at read time.
> That means nothing stored on disk is corrupted. It's not the end of the world.

Well, but there are many cases where data is read and then written again... and while Zygo's mail already answers a lot, at least the question of whether it could happen on btrfs send/receive is still open.

My understanding was that btrfs is considered "stable" for the normal use cases (so e.g. perhaps without special features like raid56). Data corruption is always quite serious, even if it's just on reads, and people may have workloads where data is read (possibly with corruption) and (permanently) written again... so the whole thing *could* be quite serious and IMO justifies a more thorough explanation for end users and not just a small commit message for developers.

Also, while it was really great to see how fast this got fixed in the end... it's also a bit worrying that Zygo apparently reported it some time ago already and it somehow got lost.

Cheers, Chris.

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 17:01 ` Zygo Blaxell 2019-02-12 17:56 ` Filipe Manana @ 2019-02-12 18:58 ` Andrei Borzenkov 2019-02-12 21:48 ` Chris Murphy 1 sibling, 1 reply; 38+ messages in thread From: Andrei Borzenkov @ 2019-02-12 18:58 UTC (permalink / raw) To: Zygo Blaxell, Filipe Manana; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 13325 bytes --]

On 12.02.2019 20:01, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
>>>
>>> Still reproducible on 4.20.7.
>>
>> I tried your reproducer when you first reported it, on different machines with different kernel versions.
>
> That would have been useful to know last August... :-/
>
>> Never managed to reproduce it, nor see anything obviously wrong in relevant code paths.
>
> I built a fresh VM running Debian stretch and reproduced the issue immediately. Mount options are "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version probably doesn't matter.
>
> I don't have any configuration that can't reproduce this issue, so I don't know how to help you. I've tested AMD and Intel CPUs, VM, baremetal, hardware ranging in age from 0 to 9 years. Locally built kernels from 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust. All of these reproduce the issue immediately--wrong sha1sum appears in the first 10 loops.
>
> What is your test environment? I can try that here.
>
>>> The behavior is slightly different on current kernels (4.20.7, 4.14.96) which makes the problem a bit more difficult to detect.
>>>
>>> # repro-hole-corruption-test
>>> i: 91, status: 0, bytes_deduped: 131072
>>> i: 92, status: 0, bytes_deduped: 131072
>>> i: 93, status: 0, bytes_deduped: 131072
>>> i: 94, status: 0, bytes_deduped: 131072
>>> i: 95, status: 0, bytes_deduped: 131072
>>> i: 96, status: 0, bytes_deduped: 131072
>>> i: 97, status: 0, bytes_deduped: 131072
>>> i: 98, status: 0, bytes_deduped: 131072
>>> i: 99, status: 0, bytes_deduped: 131072
>>> 13107200 total bytes deduped in this operation
>>> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>> 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am

I get the same result on Ubuntu 18.04 using distro packages and the 4.18 hwe kernel.

root@bor-Latitude-E5450:/var/tmp# dd if=/dev/zero of=loop bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.125205 s, 1.7 GB/s
root@bor-Latitude-E5450:/var/tmp# mkfs.btrfs loop
btrfs-progs v4.15.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null) UUID: b1f1111e-2d65-484a-9ab3-e00feaac2048 Node size: 16384 Sector size: 4096 Filesystem size: 200.00MiB Block group profiles: Data: single 8.00MiB Metadata: DUP 32.00MiB System: DUP 8.00MiB SSD detected: no Incompat features: extref, skinny-metadata Number of devices: 1 Devices: ID SIZE PATH 1 200.00MiB loop root@bor-Latitude-E5450:/var/tmp# mount -t btrfs -o loop,rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/ ./loop ./loopmnt root@bor-Latitude-E5450:/var/tmp# cd - /var/tmp/loopmnt root@bor-Latitude-E5450:/var/tmp/loopmnt# ../repro-hole-corruption-test i: 91, status: 0, bytes_deduped: 131072 i: 92, status: 0, bytes_deduped: 131072 i: 93, status: 0, bytes_deduped: 131072 i: 94, status: 0, bytes_deduped: 131072 i: 95, status: 0, bytes_deduped: 131072 i: 96, status: 0, bytes_deduped: 131072 i: 97, status: 0, bytes_deduped: 131072 i: 98, status: 0, bytes_deduped: 131072 i: 99, status: 0, bytes_deduped: 131072 13107200 total bytes deduped in this operation am: 4,8 MiB (4964352 bytes) converted to sparse holes. 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am ^Croot@bor-Latitude-E5450:/var/tmp/loopmnt# >>> The sha1sum seems stable after the first drop_caches--until a second >>> process tries to read the test file: >>> >>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>> # cat am > /dev/null (in another shell) >>> 19294e695272c42edb89ceee24bb08c13473140a am >>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>> >>> On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote: >>>> This is a repro script for a btrfs bug that causes corrupted data reads >>>> when reading a mix of compressed extents and holes. 
The bug is >>>> reproducible on at least kernels v4.1..v4.18. >>>> >>>> Some more observations and background follow, but first here is the >>>> script and some sample output: >>>> >>>> root@rescue:/test# cat repro-hole-corruption-test >>>> #!/bin/bash >>>> >>>> # Write a 4096 byte block of something >>>> block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; } >>>> >>>> # Here is some test data with holes in it: >>>> for y in $(seq 0 100); do >>>> for x in 0 1; do >>>> block 0; >>>> block 21; >>>> block 0; >>>> block 22; >>>> block 0; >>>> block 0; >>>> block 43; >>>> block 44; >>>> block 0; >>>> block 0; >>>> block 61; >>>> block 62; >>>> block 63; >>>> block 64; >>>> block 65; >>>> block 66; >>>> done >>>> done > am >>>> sync >>>> >>>> # Now replace those 101 distinct extents with 101 references to the first extent >>>> btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail >>>> >>>> # Punch holes into the extent refs >>>> fallocate -v -d am >>>> >>>> # Do some other stuff on the machine while this runs, and watch the sha1sums change! >>>> while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done >>>> >>>> root@rescue:/test# ./repro-hole-corruption-test >>>> i: 91, status: 0, bytes_deduped: 131072 >>>> i: 92, status: 0, bytes_deduped: 131072 >>>> i: 93, status: 0, bytes_deduped: 131072 >>>> i: 94, status: 0, bytes_deduped: 131072 >>>> i: 95, status: 0, bytes_deduped: 131072 >>>> i: 96, status: 0, bytes_deduped: 131072 >>>> i: 97, status: 0, bytes_deduped: 131072 >>>> i: 98, status: 0, bytes_deduped: 131072 >>>> i: 99, status: 0, bytes_deduped: 131072 >>>> 13107200 total bytes deduped in this operation >>>> am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 072a152355788c767b97e4e4c0e4567720988b84 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> bf00d862c6ad436a1be2be606a8ab88d22166b89 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 0d44cdf030fb149e103cfdc164da3da2b7474c17 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 60831f0e7ffe4b49722612c18685c09f4583b1df am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> a19662b294a3ccdf35dbb18fdd72c62018526d7d am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >>>> ^C >>>> >>>> Corruption occurs most often when there is a sequence like this in a file: >>>> >>>> ref 1: hole >>>> ref 2: extent A, offset 0 >>>> ref 3: hole >>>> ref 4: extent A, offset 8192 >>>> >>>> This scenario typically arises due to hole-punching or deduplication. 
>>>> Hole-punching replaces one extent ref with two references to the same >>>> extent with a hole between them, so: >>>> >>>> ref 1: extent A, offset 0, length 16384 >>>> >>>> becomes: >>>> >>>> ref 1: extent A, offset 0, length 4096 >>>> ref 2: hole, length 8192 >>>> ref 3: extent A, offset 12288, length 4096 >>>> >>>> Deduplication replaces two distinct extent refs surrounding a hole with >>>> two references to one of the duplicate extents, turning this: >>>> >>>> ref 1: extent A, offset 0, length 4096 >>>> ref 2: hole, length 8192 >>>> ref 3: extent B, offset 0, length 4096 >>>> >>>> into this: >>>> >>>> ref 1: extent A, offset 0, length 4096 >>>> ref 2: hole, length 8192 >>>> ref 3: extent A, offset 0, length 4096 >>>> >>>> Compression is required (zlib, zstd, or lzo) for corruption to occur. >>>> I am not able to reproduce the issue with an uncompressed extent nor >>>> have I observed any such corruption in the wild. >>>> >>>> The presence or absence of the no-holes filesystem feature has no effect. >>>> >>>> Ordinary writes can lead to pairs of extent references to the same extent >>>> separated by a reference to a different extent; however, in this case >>>> there is data to be read from a real extent, instead of pages that have >>>> to be zero filled from a hole. If ordinary non-hole writes could trigger >>>> this bug, every page-oriented database engine would be crashing all the >>>> time on btrfs with compression enabled, and it's unlikely that would not >>>> have been noticed between 2015 and now. An ordinary write that splits >>>> an extent ref would look like this: >>>> >>>> ref 1: extent A, offset 0, length 4096 >>>> ref 2: extent C, offset 0, length 8192 >>>> ref 3: extent A, offset 12288, length 4096 >>>> >>>> Sparse writes can lead to pairs of extent references surrounding a hole; >>>> however, in this case the extent references will point to different >>>> extents, avoiding the bug. 
If a sparse write could trigger the bug, >>>> the rsync -S option and qemu/kvm 'raw' disk image files (among many >>>> other tools that produce sparse files) would be unusable, and it's >>>> unlikely that would not have been noticed between 2015 and now either. >>>> Sparse writes look like this: >>>> >>>> ref 1: extent A, offset 0, length 4096 >>>> ref 2: hole, length 8192 >>>> ref 3: extent B, offset 0, length 4096 >>>> >>>> The pattern or timing of read() calls seems to be relevant. It is very >>>> hard to see the corruption when reading files with 'hd', but 'cat | hd' >>>> will see the corruption just fine. Similar problems exist with 'cmp' >>>> but not 'sha1sum'. Two processes reading the same file at the same time >>>> seem to trigger the corruption very frequently. >>>> >>>> Some patterns of holes and data produce corruption faster than others. >>>> The pattern generated by the script above is based on instances of >>>> corruption I've found in the wild, and has a much better repro rate than >>>> random holes. >>>> >>>> The corruption occurs during reads, after csum verification and before >>>> decompression, so btrfs detects no csum failures. The data on disk >>>> seems to be OK and could be read correctly once the kernel bug is fixed. >>>> Repeated reads do eventually return correct data, but there is no way >>>> for userspace to distinguish between corrupt and correct data reliably. >>>> >>>> The corrupted data is usually data replaced by a hole or a copy of other >>>> blocks in the same extent. >>>> >>>> The behavior is similar to some earlier bugs related to holes and >>>> Compressed data in btrfs, but it's new and not fixed yet--hence, >>>> "2018 edition." >>> >>> >> >> >> -- >> Filipe David Manana, >> >> “Whether you think you can, or you think you can't — you're right.” >> [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 18:58 ` Andrei Borzenkov @ 2019-02-12 21:48 ` Chris Murphy 2019-02-12 22:11 ` Zygo Blaxell 0 siblings, 1 reply; 38+ messages in thread From: Chris Murphy @ 2019-02-12 21:48 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Zygo Blaxell, Filipe Manana, linux-btrfs Is it possibly related to the zlib library being used on Debian/Ubuntu? That you've got even one reproducer with the exact same hash for the transient error case means it's not hardware or random error; let alone two independent reproducers. And then what happens if you do the exact same test but change to zstd or lzo? No error? Strictly zlib? -- Chris Murphy ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 21:48 ` Chris Murphy @ 2019-02-12 22:11 ` Zygo Blaxell 2019-02-12 22:53 ` Chris Murphy 0 siblings, 1 reply; 38+ messages in thread From: Zygo Blaxell @ 2019-02-12 22:11 UTC (permalink / raw) To: Chris Murphy; +Cc: Andrei Borzenkov, Filipe Manana, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4033 bytes --] On Tue, Feb 12, 2019 at 02:48:38PM -0700, Chris Murphy wrote: > Is it possibly related to the zlib library being used on > Debian/Ubuntu? That you've got even one reproducer with the exact same > hash for the transient error case means it's not hardware or random > error; let alone two independent reproducers. The errors are not consistent between runs. The above pattern is quite common, but it is not the only possible output. Add in other processes reading the 'am' file at the same time and it gets very random. The bad data tends to have entire extents missing, replaced with zeros. That leads to a small number of possible outputs (the choices seem to be only to have the data or have the zeros). It does seem to be a lot more consistent in recent (post 4.14.80) kernels, which may be interesting. 
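The observation that concurrent readers make the output more random suggests a simple detector: hash the same file from two processes at once, several times over, and count distinct results. A hypothetical harness along those lines (the fallback test file is an assumption; on a healthy filesystem the count is always 1, while on an affected btrfs file it can exceed 1):

```shell
# Hash the same file from two concurrent readers and count distinct hashes.
check_file=am            # the repro script's test file (assumption)
if [ ! -f "$check_file" ]; then
        # fallback so the sketch is self-contained off the repro setup
        check_file=$(mktemp)
        head -c 131072 /dev/urandom > "$check_file"
fi
distinct=$(
        for i in 1 2 3 4 5; do
                sha1sum "$check_file" &
                sha1sum "$check_file" &
                wait
        done | awk '{print $1}' | sort -u | wc -l
)
echo "distinct hashes seen: $distinct"   # 1 on a healthy filesystem
```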
Here is an example of a diff between two copies of the 'am' file copied
while the repro script was running, filtered through hd:

# diff -u /tmp/f1 /tmp/f2
--- /tmp/f1 2019-02-12 17:05:14.861844871 -0500
+++ /tmp/f2 2019-02-12 17:05:16.883868402 -0500
@@ -56,10 +56,6 @@
 *
 00020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
-00021000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
-*
-00022000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
 00023000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................|
 *
 00024000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
@@ -268,10 +264,6 @@
 *
 000a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
-000a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
-*
-000a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
 000a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................|
 *
 000a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
@@ -688,10 +680,6 @@
 *
 001a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
-001a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
-*
-001a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
 001a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................|
 *
 001a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
@@ -1524,10 +1512,6 @@
 *
 003a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
-003a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
-*
-003a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
 003a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................|
 *
 003a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
@@ -3192,10 +3176,6 @@
 *
 007a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
-007a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
-*
-007a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
 007a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................|
 *
 007a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
@@ -5016,10 +4996,6 @@
 *
 00c00000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
 *
-00c01000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
-*
-00c02000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
[etc...you get the idea]

I'm not sure how the zlib library is involved--sha1sum doesn't use one.

> And then what happens if you do the exact same test but change to zstd
> or lzo? No error? Strictly zlib?

Same errors on all three btrfs compression algorithms (as mentioned in
the original post from August 2018).

> --
> Chris Murphy
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply [flat|nested] 38+ messages in thread
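Userspace can at least check the invariant the bug violates: `fallocate -d` (as used in the repro script) only replaces zero-filled blocks with holes, so the file must read back bit-identical afterwards. A standalone sketch of that check; the file layout, sizes, and temp file are illustrative, and it will only expose corruption when run on a btrfs mount with compression enabled:

```shell
# Build a file that alternates zero runs (hole candidates) with data runs.
f=$(mktemp)
for i in 1 2 3 4; do
        head -c 4096 /dev/zero                    # zero run
        head -c 4096 /dev/zero | tr '\0' 'x'      # data run
done > "$f"
sync
before=$(sha1sum < "$f")
# Dig holes through the zero runs; must never change file content.
# (Silently skipped on filesystems without hole-punch support.)
fallocate -d "$f" 2>/dev/null || true
after=$(sha1sum < "$f")
if [ "$before" = "$after" ]; then
        echo "content unchanged"
fi
```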
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 22:11 ` Zygo Blaxell @ 2019-02-12 22:53 ` Chris Murphy 2019-02-13 2:46 ` Zygo Blaxell 0 siblings, 1 reply; 38+ messages in thread From: Chris Murphy @ 2019-02-12 22:53 UTC (permalink / raw) To: Zygo Blaxell, Btrfs BTRFS On Tue, Feb 12, 2019 at 3:11 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > On Tue, Feb 12, 2019 at 02:48:38PM -0700, Chris Murphy wrote: > > Is it possibly related to the zlib library being used on > > Debian/Ubuntu? That you've got even one reproducer with the exact same > > hash for the transient error case means it's not hardware or random > > error; let alone two independent reproducers. > > The errors are not consistent between runs. The above pattern is quite > common, but it is not the only possible output. Add in other processes > reading the 'am' file at the same time and it gets very random. > > The bad data tends to have entire extents missing, replaced with zeros. > That leads to a small number of possible outputs (the choices seem to be > only to have the data or have the zeros). It does seem to be a lot more > consistent in recent (post 4.14.80) kernels, which may be interesting. 
> > Here is an example of a diff between two copies of the 'am' file copied > while the repro script was running, filtered through hd: > > # diff -u /tmp/f1 /tmp/f2 > --- /tmp/f1 2019-02-12 17:05:14.861844871 -0500 > +++ /tmp/f2 2019-02-12 17:05:16.883868402 -0500 > @@ -56,10 +56,6 @@ > * > 00020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > -00021000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................| > -* > -00022000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > -* > 00023000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................| > * > 00024000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > @@ -268,10 +264,6 @@ > * > 000a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > -000a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................| > -* > -000a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > -* > 000a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................| > * > 000a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > @@ -688,10 +680,6 @@ > * > 001a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > -001a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................| > -* > -001a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > -* > 001a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................| > * > 001a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > @@ -1524,10 +1512,6 @@ > * > 003a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > -003a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................| > -* > -003a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > -* > 003a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................| > * > 003a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
|................|
> @@ -3192,10 +3176,6 @@
> *
> 007a0000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> -007a1000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
> -*
> -007a2000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> -*
> 007a3000 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 |................|
> *
> 007a4000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> @@ -5016,10 +4996,6 @@
> *
> 00c00000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> -00c01000 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 |................|
> -*
> -00c02000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> -*
> [etc...you get the idea]

And yet the file is delivered to user space, despite the changes, as
if it's immune to checksum computation or matching. The data is
clearly different, so how is it bypassing checksumming? Data csums are
based on the original uncompressed data, correct? So any holes are
zeros, and there are still csums for those holes?

>
> I'm not sure how the zlib library is involved--sha1sum doesn't use one.
>
> > And then what happens if you do the exact same test but change to zstd
> > or lzo? No error? Strictly zlib?
>
> Same errors on all three btrfs compression algorithms (as mentioned in
> the original post from August 2018).

Obviously there is a pattern. It's not random. I just don't know what
it looks like. I use compression, for years now, mostly zstd lately
and a mix of lzo and zlib before that, but never any errors or
corruptions. But I also never use holes, no punched holes, and rarely
use fallocated files, which I guess isn't quite the same thing as hole
punching.

So the bug you're reproducing is for sure 100% not on the media itself;
it's somehow transiently being interpreted differently, roughly 1 in 10
reads, but with a pattern. What about scrub? Do you get errors every
1 in 10 scrubs? Or how does it manifest?
No scrub errors? I know very little about what parts of the kernel a file system depends on outside of its own code (e.g. page cache) but I wonder if there's something outside of Btrfs that's the source but it never gets triggered because no other file systems use compression. Huh - what file system uses compression *and* hole punching? squashfs? Is sparse file support different than hole punching? -- Chris Murphy ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 22:53 ` Chris Murphy @ 2019-02-13 2:46 ` Zygo Blaxell 0 siblings, 0 replies; 38+ messages in thread From: Zygo Blaxell @ 2019-02-13 2:46 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 5093 bytes --] On Tue, Feb 12, 2019 at 03:53:53PM -0700, Chris Murphy wrote: > And yet the file is delivered to user space, despite the changes, as > if it's immune to checksum computation or matching. The data is > clearly difference so how is it bypassing checksumming? Data csums are > based on original uncompressed data, correct? So any holes are zeros, > there are still csums for those holes? csums in btrfs protect data blocks. Holes are the absence of data blocks, so there are no csums for holes. There are no csums for extent references either--only csums on the extent data that is referenced. Since this bug affects processing of extent refs, it must occur long after all the csums are verified. > > I'm not sure how the zlib library is involved--sha1sum doesn't use one. > > > > > And then what happens if you do the exact same test but change to zstd > > > or lzo? No error? Strictly zlib? > > > > Same errors on all three btrfs compression algorithms (as mentioned in > > the original post from August 2018). > > Obviously there is a pattern. It's not random. I just don't know what > it looks like. Without knowing the root cause I can only speculate, but it does seem to be random, just very heavily biased to some outcomes. It will produce more distinct sha1sum values the longer you run it, especially if there is other activity on the system to perturb the kernel a bit. If you make the test file bigger you can have more combinations of outputs. 
I also note that since the big batch of btrfs bug fixes that landed near 4.14.80, the variation between runs seems to be a lot less than with earlier kernels; however, the full range of random output values (i.e. which extents of the file disappear) still seems to be possible, it just takes longer to get distinct values. I'm not sure that information helps to form a theory of how the bug operates. > I use compression, for years now, mostly zstd lately > and a mix of lzo and zlib before that, but never any errors or > corruptions. But I also never use holes, no punched holes, and rarely > use fallocated files which I guess isn't quite the same thing as hole > punching. I covered this in August. The original thread was: https://www.spinics.net/lists/linux-btrfs/msg81293.html TL;DR you won't see this problem unless you have a single compressed extent that is split by a hole--an artifact that can only be produced by punching holes, cloning, or dedupe. The cases users are most likely to encounter are dedupe and hole-punching--I don't know of any applications in real-world use that do cloning the right way to trigger this problem. Also, you haven't mentioned whether you've successfully reproduced this yourself yet (or not). > So the bug you're reproducing is for sure 100% not on the media > itself, it's somehow transiently being interpreted differently roughly > 1 in 10 reads, but with a pattern. What about scrub? Do you get errors > every 1 in 10 scrubs? Or how does it manifest? No scrub errors? No errors in scrub--nor should there be. The data is correct on disk, and it can be read reliably if you don't use the kernel btrfs code to read it through extent refs (scrub reads the data items directly, so scrub never looks at data through extent refs). btrfs just drops some of the data when reading it to userspace. > I know very little about what parts of the kernel a file system > depends on outside of its own code (e.g. 
page cache) but I wonder if
> there's something outside of Btrfs that's the source but it never gets
> triggered because no other file systems use compression. Huh - what
> file system uses compression *and* hole punching? squashfs? Is sparse
> file support different than hole punching?

Traditional sparse file support leaves blocks in a file unallocated
until they are written to, i.e. you do something like:

	write(64K)
	seek(80K)
	write(48K)

and you get a 16K hole between two extents (or contiguous block ranges
if your filesystem doesn't have a formal extent concept per se):

	data(64k) hole(16k) data(48k)

Traditional POSIX sparse files don't have any way to release extents in
the middle of a file without changing the length of the file. You can
fill in the holes with data later, but you can't delete existing data
and replace it with holes. If you want to punch holes in a file, you
used to do it by making a copy of the file, omitting any of the data
blocks that contained all zero, then renaming the copy over the
original file.

The hole punch operation adds the capability to delete existing data in
place, e.g. you can say "punch a hole at 24K, length 8K" and the file
will look like:

	data(24k)  (originally part of first 64K extent)
	hole(8k)
	data(32k)  (originally part of first 64K extent)
	hole(16k)
	data(48k)

On btrfs, the 24k and 32k chunks of the file are both references to
pieces of the original 64k extent, which is not modified on disk, but
8K of it is no longer accessible.

> --
> Chris Murphy
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply [flat|nested] 38+ messages in thread
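The walkthrough above can be replayed with ordinary shell tools. A sketch assuming GNU coreutils and util-linux fallocate; the sizes follow the text, while the temp file and the dd fallback (for filesystems without hole-punch support, where literal zeros are written instead and the range still reads back as zeros) are illustrative additions:

```shell
# data(64k) hole(16k) data(48k): write(64K), seek(80K), write(48K)
f=$(mktemp)
head -c 65536 /dev/zero | tr '\0' 'D' > "$f"
head -c 49152 /dev/zero | tr '\0' 'E' |
        dd of="$f" bs=1024 seek=80 conv=notrunc status=none
# Punch a hole at 24K, length 8K, in the middle of the first extent.
fallocate -p -o 24576 -l 8192 "$f" 2>/dev/null ||
        dd if=/dev/zero of="$f" bs=4096 seek=6 count=2 conv=notrunc status=none
# Unlike truncation, punching leaves the file length unchanged...
stat -c %s "$f"        # 131072
# ...and the punched range reads back as zeros.
if cmp -s <(dd if="$f" bs=4096 skip=6 count=2 status=none) \
          <(head -c 8192 /dev/zero); then
        echo "punched range reads as zeros"
fi
```

On btrfs this is the layout of interest: both remaining data chunks still reference the original extent, with part of it no longer reachable.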
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-12 3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell 2019-02-12 15:33 ` Christoph Anton Mitterer 2019-02-12 15:35 ` Filipe Manana @ 2019-02-13 7:47 ` Roman Mamedov 2019-02-13 8:04 ` Qu Wenruo 2 siblings, 1 reply; 38+ messages in thread From: Roman Mamedov @ 2019-02-13 7:47 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On Mon, 11 Feb 2019 22:09:02 -0500 Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > Still reproducible on 4.20.7. > > The behavior is slightly different on current kernels (4.20.7, 4.14.96) > which makes the problem a bit more difficult to detect. > > # repro-hole-corruption-test > i: 91, status: 0, bytes_deduped: 131072 > i: 92, status: 0, bytes_deduped: 131072 > i: 93, status: 0, bytes_deduped: 131072 > i: 94, status: 0, bytes_deduped: 131072 > i: 95, status: 0, bytes_deduped: 131072 > i: 96, status: 0, bytes_deduped: 131072 > i: 97, status: 0, bytes_deduped: 131072 > i: 98, status: 0, bytes_deduped: 131072 > i: 99, status: 0, bytes_deduped: 131072 > 13107200 total bytes deduped in this operation > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am Seems like I can reproduce it as well. Vanilla 4.14.97 with .config loosely based on Debian's. 
$ sudo ./repro-hole-corruption-test i: 91, status: 0, bytes_deduped: 131072 i: 92, status: 0, bytes_deduped: 131072 i: 93, status: 0, bytes_deduped: 131072 i: 94, status: 0, bytes_deduped: 131072 i: 95, status: 0, bytes_deduped: 131072 i: 96, status: 0, bytes_deduped: 131072 i: 97, status: 0, bytes_deduped: 131072 i: 98, status: 0, bytes_deduped: 131072 i: 99, status: 0, bytes_deduped: 131072 13107200 total bytes deduped in this operation am: 4.8 MiB (4964352 bytes) converted to sparse holes. c5f25fc2b88eaab504a403465658c67f4669261e am 1d9aacd4ee38ab7db46c44e0d74cee163222e105 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am The above is on a 3TB spinning disk. But on a 512GB NVMe SSD I even got the same checksums as you did. $ sudo ./repro-hole-corruption-test i: 91, status: 0, bytes_deduped: 131072 i: 92, status: 0, bytes_deduped: 131072 i: 93, status: 0, bytes_deduped: 131072 i: 94, status: 0, bytes_deduped: 131072 i: 95, status: 0, bytes_deduped: 131072 i: 96, status: 0, bytes_deduped: 131072 i: 97, status: 0, bytes_deduped: 131072 i: 98, status: 0, bytes_deduped: 131072 i: 99, status: 0, bytes_deduped: 131072 13107200 total bytes deduped in this operation am: 4.8 MiB (4964352 bytes) converted to sparse holes. 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am In my case both filesystems are not mounted with compression, just chattr +c of the directory with the script is enough to see the issue. 
-- With respect, Roman ^ permalink raw reply [flat|nested] 38+ messages in thread
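Roman's point about chattr +c means the repro doesn't need a special mount: per-directory compression inheritance is enough. A hedged sketch (only meaningful on btrfs, where newly created files under the directory inherit the attribute; on other filesystems it just reports lack of support):

```shell
# Enable per-directory compression instead of the compress mount option.
d=$(mktemp -d)
if chattr +c "$d" 2>/dev/null; then
        result="compression attribute set"
else
        # non-btrfs filesystem, or chattr unavailable
        result="not supported on this filesystem"
fi
echo "$result"
```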
* Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 2019-02-13 7:47 ` Roman Mamedov @ 2019-02-13 8:04 ` Qu Wenruo 0 siblings, 0 replies; 38+ messages in thread From: Qu Wenruo @ 2019-02-13 8:04 UTC (permalink / raw) To: Roman Mamedov, Zygo Blaxell; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 3627 bytes --] On 2019/2/13 下午3:47, Roman Mamedov wrote: > On Mon, 11 Feb 2019 22:09:02 -0500 > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > >> Still reproducible on 4.20.7. >> >> The behavior is slightly different on current kernels (4.20.7, 4.14.96) >> which makes the problem a bit more difficult to detect. >> >> # repro-hole-corruption-test >> i: 91, status: 0, bytes_deduped: 131072 >> i: 92, status: 0, bytes_deduped: 131072 >> i: 93, status: 0, bytes_deduped: 131072 >> i: 94, status: 0, bytes_deduped: 131072 >> i: 95, status: 0, bytes_deduped: 131072 >> i: 96, status: 0, bytes_deduped: 131072 >> i: 97, status: 0, bytes_deduped: 131072 >> i: 98, status: 0, bytes_deduped: 131072 >> i: 99, status: 0, bytes_deduped: 131072 >> 13107200 total bytes deduped in this operation >> am: 4.8 MiB (4964352 bytes) converted to sparse holes. >> 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am >> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am >> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > Seems like I can reproduce it as well. Vanilla 4.14.97 with .config loosely > based on Debian's. 
> > $ sudo ./repro-hole-corruption-test > i: 91, status: 0, bytes_deduped: 131072 > i: 92, status: 0, bytes_deduped: 131072 > i: 93, status: 0, bytes_deduped: 131072 > i: 94, status: 0, bytes_deduped: 131072 > i: 95, status: 0, bytes_deduped: 131072 > i: 96, status: 0, bytes_deduped: 131072 > i: 97, status: 0, bytes_deduped: 131072 > i: 98, status: 0, bytes_deduped: 131072 > i: 99, status: 0, bytes_deduped: 131072 > 13107200 total bytes deduped in this operation > am: 4.8 MiB (4964352 bytes) converted to sparse holes. > c5f25fc2b88eaab504a403465658c67f4669261e am > 1d9aacd4ee38ab7db46c44e0d74cee163222e105 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am > > The above is on a 3TB spinning disk. But on a 512GB NVMe SSD I even got the > same checksums as you did. > > $ sudo ./repro-hole-corruption-test > i: 91, status: 0, bytes_deduped: 131072 > i: 92, status: 0, bytes_deduped: 131072 > i: 93, status: 0, bytes_deduped: 131072 > i: 94, status: 0, bytes_deduped: 131072 > i: 95, status: 0, bytes_deduped: 131072 > i: 96, status: 0, bytes_deduped: 131072 > i: 97, status: 0, bytes_deduped: 131072 > i: 98, status: 0, bytes_deduped: 131072 > i: 99, status: 0, bytes_deduped: 131072 > 13107200 total bytes deduped in this operation > am: 4.8 MiB (4964352 bytes) converted to sparse holes. 
> 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>
> In my case both filesystems are not mounted with compression,

OK, I forgot the compression mount option.

Now I can reproduce it too, on both the host and the VM.

I'll try to make the test case minimal enough to avoid too much noise
during testing.

Thanks,
Qu

> just chattr +c of
> the directory with the script is enough to see the issue.
>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2019-03-17 2:54 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-08-23 3:11 Reproducer for "compressed data + hole data corruption bug, 2018 editiion" Zygo Blaxell 2018-08-23 5:10 ` Qu Wenruo 2018-08-23 16:44 ` Zygo Blaxell 2018-08-23 23:50 ` Qu Wenruo 2019-02-12 3:09 ` Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7 Zygo Blaxell 2019-02-12 15:33 ` Christoph Anton Mitterer 2019-02-12 15:35 ` Filipe Manana 2019-02-12 17:01 ` Zygo Blaxell 2019-02-12 17:56 ` Filipe Manana 2019-02-12 18:13 ` Zygo Blaxell 2019-02-13 7:24 ` Qu Wenruo 2019-02-13 17:36 ` Filipe Manana 2019-02-13 18:14 ` Filipe Manana 2019-02-14 1:22 ` Filipe Manana 2019-02-14 5:00 ` Zygo Blaxell 2019-02-14 12:21 ` Christoph Anton Mitterer 2019-02-15 5:40 ` Zygo Blaxell 2019-03-04 15:34 ` Christoph Anton Mitterer 2019-03-07 20:07 ` Zygo Blaxell 2019-03-08 10:37 ` Filipe Manana 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-14 20:22 ` Christoph Anton Mitterer 2019-03-14 22:39 ` Filipe Manana 2019-03-08 12:20 ` Austin S. Hemmelgarn 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-14 18:58 ` Christoph Anton Mitterer 2019-03-15 5:28 ` Zygo Blaxell 2019-03-16 22:11 ` Christoph Anton Mitterer 2019-03-17 2:54 ` Zygo Blaxell 2019-02-15 12:02 ` Filipe Manana 2019-03-04 15:46 ` Christoph Anton Mitterer 2019-02-12 18:58 ` Andrei Borzenkov 2019-02-12 21:48 ` Chris Murphy 2019-02-12 22:11 ` Zygo Blaxell 2019-02-12 22:53 ` Chris Murphy 2019-02-13 2:46 ` Zygo Blaxell 2019-02-13 7:47 ` Roman Mamedov 2019-02-13 8:04 ` Qu Wenruo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).