12.02.2019 20:01, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell wrote:
>>>
>>> Still reproducible on 4.20.7.
>>
>> I tried your reproducer when you first reported it, on different
>> machines with different kernel versions.
>
> That would have been useful to know last August... :-/
>
>> Never managed to reproduce it, nor see anything obviously wrong in
>> relevant code paths.
>
> I built a fresh VM running Debian stretch and
> reproduced the issue immediately. Mount options are
> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is
> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> probably doesn't matter.
>
> I don't have any configuration that can't reproduce this issue, so I don't
> know how to help you. I've tested AMD and Intel CPUs, VM, baremetal,
> hardware ranging in age from 0 to 9 years. Locally built kernels from
> 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust.
> All of these reproduce the issue immediately--wrong sha1sum appears in
> the first 10 loops.
>
> What is your test environment? I can try that here.
>
>>>
>>> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
>>> which makes the problem a bit more difficult to detect.
>>>
>>>     # repro-hole-corruption-test
>>>     i: 91, status: 0, bytes_deduped: 131072
>>>     i: 92, status: 0, bytes_deduped: 131072
>>>     i: 93, status: 0, bytes_deduped: 131072
>>>     i: 94, status: 0, bytes_deduped: 131072
>>>     i: 95, status: 0, bytes_deduped: 131072
>>>     i: 96, status: 0, bytes_deduped: 131072
>>>     i: 97, status: 0, bytes_deduped: 131072
>>>     i: 98, status: 0, bytes_deduped: 131072
>>>     i: 99, status: 0, bytes_deduped: 131072
>>>     13107200 total bytes deduped in this operation
>>>     am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>>     94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>

I get the same result on Ubuntu 18.04 using distro packages and the 4.18
hwe kernel.

root@bor-Latitude-E5450:/var/tmp# dd if=/dev/zero of=loop bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0,125205 s, 1,7 GB/s
root@bor-Latitude-E5450:/var/tmp# mkfs.btrfs loop
btrfs-progs v4.15.1
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               b1f1111e-2d65-484a-9ab3-e00feaac2048
Node size:          16384
Sector size:        4096
Filesystem size:    200.00MiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP              32.00MiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   200.00MiB  loop

root@bor-Latitude-E5450:/var/tmp# mount -t btrfs -o loop,rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/ ./loop ./loopmnt
root@bor-Latitude-E5450:/var/tmp# cd -
/var/tmp/loopmnt
root@bor-Latitude-E5450:/var/tmp/loopmnt# ../repro-hole-corruption-test
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4,8 MiB (4964352 bytes) converted to sparse holes.
94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
^Croot@bor-Latitude-E5450:/var/tmp/loopmnt#

>>> The sha1sum seems stable after the first drop_caches--until a second
>>> process tries to read the test file:
>>>
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>     # cat am > /dev/null (in another shell)
>>>     19294e695272c42edb89ceee24bb08c13473140a am
>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>
>>> On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
>>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>>> when reading a mix of compressed extents and holes. The bug is
>>>> reproducible on at least kernels v4.1..v4.18.
>>>>
>>>> Some more observations and background follow, but first here is the
>>>> script and some sample output:
>>>>
>>>>     root@rescue:/test# cat repro-hole-corruption-test
>>>>     #!/bin/bash
>>>>
>>>>     # Write a 4096 byte block of something
>>>>     block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>>
>>>>     # Here is some test data with holes in it:
>>>>     for y in $(seq 0 100); do
>>>>         for x in 0 1; do
>>>>             block 0;
>>>>             block 21;
>>>>             block 0;
>>>>             block 22;
>>>>             block 0;
>>>>             block 0;
>>>>             block 43;
>>>>             block 44;
>>>>             block 0;
>>>>             block 0;
>>>>             block 61;
>>>>             block 62;
>>>>             block 63;
>>>>             block 64;
>>>>             block 65;
>>>>             block 66;
>>>>         done
>>>>     done > am
>>>>     sync
>>>>
>>>>     # Now replace those 101 distinct extents with 101 references to the first extent
>>>>     btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>>>>
>>>>     # Punch holes into the extent refs
>>>>     fallocate -v -d am
>>>>
>>>>     # Do some other stuff on the machine while this runs, and watch the sha1sums change!
>>>>     while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>>>>
>>>>     root@rescue:/test# ./repro-hole-corruption-test
>>>>     i: 91, status: 0, bytes_deduped: 131072
>>>>     i: 92, status: 0, bytes_deduped: 131072
>>>>     i: 93, status: 0, bytes_deduped: 131072
>>>>     i: 94, status: 0, bytes_deduped: 131072
>>>>     i: 95, status: 0, bytes_deduped: 131072
>>>>     i: 96, status: 0, bytes_deduped: 131072
>>>>     i: 97, status: 0, bytes_deduped: 131072
>>>>     i: 98, status: 0, bytes_deduped: 131072
>>>>     i: 99, status: 0, bytes_deduped: 131072
>>>>     13107200 total bytes deduped in this operation
>>>>     am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     072a152355788c767b97e4e4c0e4567720988b84 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     bf00d862c6ad436a1be2be606a8ab88d22166b89 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     0d44cdf030fb149e103cfdc164da3da2b7474c17 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     60831f0e7ffe4b49722612c18685c09f4583b1df am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     a19662b294a3ccdf35dbb18fdd72c62018526d7d am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>>     ^C
>>>>
>>>> Corruption occurs most often when there is a sequence like this in a file:
>>>>
>>>>     ref 1: hole
>>>>     ref 2: extent A, offset 0
>>>>     ref 3: hole
>>>>     ref 4: extent A, offset 8192
>>>>
>>>> This scenario typically arises due to hole-punching or deduplication.
>>>> Hole-punching replaces one extent ref with two references to the same
>>>> extent with a hole between them, so:
>>>>
>>>>     ref 1: extent A, offset 0, length 16384
>>>>
>>>> becomes:
>>>>
>>>>     ref 1: extent A, offset 0, length 4096
>>>>     ref 2: hole, length 8192
>>>>     ref 3: extent A, offset 12288, length 4096
>>>>
>>>> Deduplication replaces two distinct extent refs surrounding a hole with
>>>> two references to one of the duplicate extents, turning this:
>>>>
>>>>     ref 1: extent A, offset 0, length 4096
>>>>     ref 2: hole, length 8192
>>>>     ref 3: extent B, offset 0, length 4096
>>>>
>>>> into this:
>>>>
>>>>     ref 1: extent A, offset 0, length 4096
>>>>     ref 2: hole, length 8192
>>>>     ref 3: extent A, offset 0, length 4096
>>>>
>>>> Compression is required (zlib, zstd, or lzo) for corruption to occur.
>>>> I am not able to reproduce the issue with an uncompressed extent nor
>>>> have I observed any such corruption in the wild.
>>>>
>>>> The presence or absence of the no-holes filesystem feature has no effect.
>>>>
>>>> Ordinary writes can lead to pairs of extent references to the same extent
>>>> separated by a reference to a different extent; however, in this case
>>>> there is data to be read from a real extent, instead of pages that have
>>>> to be zero filled from a hole. If ordinary non-hole writes could trigger
>>>> this bug, every page-oriented database engine would be crashing all the
>>>> time on btrfs with compression enabled, and it's unlikely that would not
>>>> have been noticed between 2015 and now. An ordinary write that splits
>>>> an extent ref would look like this:
>>>>
>>>>     ref 1: extent A, offset 0, length 4096
>>>>     ref 2: extent C, offset 0, length 8192
>>>>     ref 3: extent A, offset 12288, length 4096
>>>>
>>>> Sparse writes can lead to pairs of extent references surrounding a hole;
>>>> however, in this case the extent references will point to different
>>>> extents, avoiding the bug. If a sparse write could trigger the bug,
>>>> the rsync -S option and qemu/kvm 'raw' disk image files (among many
>>>> other tools that produce sparse files) would be unusable, and it's
>>>> unlikely that would not have been noticed between 2015 and now either.
>>>> Sparse writes look like this:
>>>>
>>>>     ref 1: extent A, offset 0, length 4096
>>>>     ref 2: hole, length 8192
>>>>     ref 3: extent B, offset 0, length 4096
>>>>
>>>> The pattern or timing of read() calls seems to be relevant. It is very
>>>> hard to see the corruption when reading files with 'hd', but 'cat | hd'
>>>> will see the corruption just fine. Similar problems exist with 'cmp'
>>>> but not 'sha1sum'. Two processes reading the same file at the same time
>>>> seem to trigger the corruption very frequently.
>>>>
>>>> Some patterns of holes and data produce corruption faster than others.
>>>> The pattern generated by the script above is based on instances of
>>>> corruption I've found in the wild, and has a much better repro rate than
>>>> random holes.
>>>>
>>>> The corruption occurs during reads, after csum verification and before
>>>> decompression, so btrfs detects no csum failures. The data on disk
>>>> seems to be OK and could be read correctly once the kernel bug is fixed.
>>>> Repeated reads do eventually return correct data, but there is no way
>>>> for userspace to distinguish between corrupt and correct data reliably.
>>>>
>>>> The corrupted data is usually data replaced by a hole or a copy of other
>>>> blocks in the same extent.
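
Since a single read cannot tell corrupt data from correct data, one way to
watch for the behavior described above is to hash the file several times and
count how many distinct results come back. The following is only a minimal
sketch, assuming GNU coreutils; the function name count_distinct_hashes and
the DROP_CACHES variable are hypothetical and not part of the reproducer.
On an affected system the page cache probably has to be dropped between
rounds (as the reproducer does with sysctl), which requires root.

```shell
# Hypothetical helper, not part of the original reproducer.
# Hash FILE ROUNDS times and print the number of distinct SHA-1 values seen.
# A result greater than 1 means at least one read returned different data.
count_distinct_hashes () (
    file=$1
    rounds=${2:-10}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Read through stdin so every output line has the same "HASH  -" form.
        sha1sum < "$file"
        # Optional: drop the page cache so the next round rereads from disk
        # (needs root; same sysctl key the reproducer uses).
        if [ "${DROP_CACHES:-0}" = 1 ]; then
            sysctl -q vm.drop_caches=3
        fi
        i=$((i + 1))
    done | sort -u | awk 'END { print NR }'
)
```

For example, "DROP_CACHES=1 count_distinct_hashes am 20" hashes the test file
20 times. A result of 1 only shows that the reads agreed with each other, not
that the data is correct.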
>>>>
>>>> The behavior is similar to some earlier bugs related to holes and
>>>> compressed data in btrfs, but it's new and not fixed yet--hence,
>>>> "2018 edition."
>>>
>>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> “Whether you think you can, or you think you can't — you're right.”
>>