From: Andrei Borzenkov <arvidjaar@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
Filipe Manana <fdmanana@gmail.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
Date: Tue, 12 Feb 2019 21:58:22 +0300
Message-ID: <b7d76392-6414-804f-c383-a43dc2d81c1c@gmail.com>
In-Reply-To: <20190212165916.GA23918@hungrycats.org>
12.02.2019 20:01, Zygo Blaxell wrote:
> On Tue, Feb 12, 2019 at 03:35:37PM +0000, Filipe Manana wrote:
>> On Tue, Feb 12, 2019 at 3:11 AM Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>>
>>> Still reproducible on 4.20.7.
>>
>> I tried your reproducer when you first reported it, on different
>> machines with different kernel versions.
>
> That would have been useful to know last August... :-/
>
>> Never managed to reproduce it, nor see anything obviously wrong in
>> relevant code paths.
>
> I built a fresh VM running Debian stretch and
> reproduced the issue immediately. Mount options are
> "rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/". Kernel is
> Debian's "4.9.0-8-amd64" but the bug is old enough that kernel version
> probably doesn't matter.
>
> I don't have any configuration that can't reproduce this issue, so I don't
> know how to help you. I've tested AMD and Intel CPUs, VM, baremetal,
> hardware ranging in age from 0 to 9 years. Locally built kernels from
> 4.1 to 4.20 and the stock Debian kernel (4.9). SSDs and spinning rust.
> All of these reproduce the issue immediately--wrong sha1sum appears in
> the first 10 loops.
>
> What is your test environment? I can try that here.
>
>>>
>>> The behavior is slightly different on current kernels (4.20.7, 4.14.96)
>>> which makes the problem a bit more difficult to detect.
>>>
>>> # repro-hole-corruption-test
>>> i: 91, status: 0, bytes_deduped: 131072
>>> i: 92, status: 0, bytes_deduped: 131072
>>> i: 93, status: 0, bytes_deduped: 131072
>>> i: 94, status: 0, bytes_deduped: 131072
>>> i: 95, status: 0, bytes_deduped: 131072
>>> i: 96, status: 0, bytes_deduped: 131072
>>> i: 97, status: 0, bytes_deduped: 131072
>>> i: 98, status: 0, bytes_deduped: 131072
>>> i: 99, status: 0, bytes_deduped: 131072
>>> 13107200 total bytes deduped in this operation
>>> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>> 94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>
I get the same result on Ubuntu 18.04 using distro packages and the 4.18 HWE
kernel.
root@bor-Latitude-E5450:/var/tmp# dd if=/dev/zero of=loop bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0,125205 s, 1,7 GB/s
root@bor-Latitude-E5450:/var/tmp# mkfs.btrfs loop
btrfs-progs v4.15.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: b1f1111e-2d65-484a-9ab3-e00feaac2048
Node size: 16384
Sector size: 4096
Filesystem size: 200.00MiB
Block group profiles:
Data: single 8.00MiB
Metadata: DUP 32.00MiB
System: DUP 8.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
ID SIZE PATH
1 200.00MiB loop
root@bor-Latitude-E5450:/var/tmp# mount -t btrfs -o loop,rw,noatime,compress=zlib,space_cache,subvolid=5,subvol=/ ./loop ./loopmnt
root@bor-Latitude-E5450:/var/tmp# cd -
/var/tmp/loopmnt
root@bor-Latitude-E5450:/var/tmp/loopmnt# ../repro-hole-corruption-test
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4,8 MiB (4964352 bytes) converted to sparse holes.
94a8acd3e1f6e14272f3262a8aa73ab6b25c9ce8 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
^Croot@bor-Latitude-E5450:/var/tmp/loopmnt#
>>> The sha1sum seems stable after the first drop_caches--until a second
>>> process tries to read the test file:
>>>
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>> # cat am > /dev/null (in another shell)
>>> 19294e695272c42edb89ceee24bb08c13473140a am
>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>
>>> On Wed, Aug 22, 2018 at 11:11:25PM -0400, Zygo Blaxell wrote:
>>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>>> when reading a mix of compressed extents and holes. The bug is
>>>> reproducible on at least kernels v4.1..v4.18.
>>>>
>>>> Some more observations and background follow, but first here is the
>>>> script and some sample output:
>>>>
>>>> root@rescue:/test# cat repro-hole-corruption-test
>>>> #!/bin/bash
>>>>
>>>> # Write a 4096 byte block of something
>>>> block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>>
>>>> # Here is some test data with holes in it:
>>>> for y in $(seq 0 100); do
>>>>     for x in 0 1; do
>>>>         block 0;
>>>>         block 21;
>>>>         block 0;
>>>>         block 22;
>>>>         block 0;
>>>>         block 0;
>>>>         block 43;
>>>>         block 44;
>>>>         block 0;
>>>>         block 0;
>>>>         block 61;
>>>>         block 62;
>>>>         block 63;
>>>>         block 64;
>>>>         block 65;
>>>>         block 66;
>>>>     done
>>>> done > am
>>>> sync
>>>>
>>>> # Now replace those 101 distinct extents with 101 references to the first extent
>>>> btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>>>>
>>>> # Punch holes into the extent refs
>>>> fallocate -v -d am
>>>>
>>>> # Do some other stuff on the machine while this runs, and watch the sha1sums change!
>>>> while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
>>>>
>>>> root@rescue:/test# ./repro-hole-corruption-test
>>>> i: 91, status: 0, bytes_deduped: 131072
>>>> i: 92, status: 0, bytes_deduped: 131072
>>>> i: 93, status: 0, bytes_deduped: 131072
>>>> i: 94, status: 0, bytes_deduped: 131072
>>>> i: 95, status: 0, bytes_deduped: 131072
>>>> i: 96, status: 0, bytes_deduped: 131072
>>>> i: 97, status: 0, bytes_deduped: 131072
>>>> i: 98, status: 0, bytes_deduped: 131072
>>>> i: 99, status: 0, bytes_deduped: 131072
>>>> 13107200 total bytes deduped in this operation
>>>> am: 4.8 MiB (4964352 bytes) converted to sparse holes.
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 072a152355788c767b97e4e4c0e4567720988b84 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> bf00d862c6ad436a1be2be606a8ab88d22166b89 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 0d44cdf030fb149e103cfdc164da3da2b7474c17 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 60831f0e7ffe4b49722612c18685c09f4583b1df am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> a19662b294a3ccdf35dbb18fdd72c62018526d7d am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> 6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
>>>> ^C
>>>>
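As a sanity check on the numbers in the output above, here is a minimal sketch that recomputes them from the block layout in the quoted script. The variable names are mine. It also assumes the dedupe total counts only the 100 destination ranges, with the first 128 KiB range acting as the source, which is what the reported total suggests.

```shell
#!/bin/bash
# Sizes taken from the quoted script: 4096-byte blocks, 16 blocks per
# inner pass, 2 passes per outer iteration, 101 outer iterations.
block=4096
blocks_per_pass=16
zero_blocks_per_pass=6   # the six "block 0" calls in each pass
passes=2
iterations=101

# One outer iteration writes one 128 KiB chunk.
chunk=$((block * blocks_per_pass * passes))
file_size=$((chunk * iterations))
# Assumption: the first chunk is the dedupe source, the rest are targets.
deduped=$((chunk * (iterations - 1)))
# Every all-zero block is what "fallocate -d" can turn into a hole.
holes=$((block * zero_blocks_per_pass * passes * iterations))

echo "chunk=$chunk file_size=$file_size deduped=$deduped holes=$holes"
```

The chunk size matches the 131072 passed to btrfs-extent-same, and the last two values match the "13107200 total bytes deduped" and "4964352 bytes converted to sparse holes" lines in the output.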
>>>> Corruption occurs most often when there is a sequence like this in a file:
>>>>
>>>> ref 1: hole
>>>> ref 2: extent A, offset 0
>>>> ref 3: hole
>>>> ref 4: extent A, offset 8192
>>>>
>>>> This scenario typically arises due to hole-punching or deduplication.
>>>> Hole-punching replaces one extent ref with two references to the same
>>>> extent with a hole between them, so:
>>>>
>>>> ref 1: extent A, offset 0, length 16384
>>>>
>>>> becomes:
>>>>
>>>> ref 1: extent A, offset 0, length 4096
>>>> ref 2: hole, length 8192
>>>> ref 3: extent A, offset 12288, length 4096
>>>>
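The offset arithmetic in that split can be sketched as follows. This is only an illustration of the description above, not btrfs code: punch_hole is a hypothetical helper, and it assumes the hole lies entirely inside a single ref.

```shell
#!/bin/bash
# punch_hole EXT_OFF LEN HOLE_START HOLE_LEN
# Prints the refs left after punching a hole into a single ref of
# extent A. HOLE_START is relative to the start of the ref. Purely
# illustrative arithmetic; it does not touch a filesystem.
punch_hole() {
    local ext_off=$1 len=$2 hole_start=$3 hole_len=$4
    local hole_end=$((hole_start + hole_len))
    # Data before the hole keeps the original extent offset.
    if (( hole_start > 0 )); then
        echo "extent A, offset $ext_off, length $hole_start"
    fi
    echo "hole, length $hole_len"
    # Data after the hole points further into the same extent.
    if (( hole_end < len )); then
        echo "extent A, offset $((ext_off + hole_end)), length $((len - hole_end))"
    fi
}

punch_hole 0 16384 4096 8192
```

With the arguments shown, the three lines printed correspond to ref 1, ref 2 and ref 3 in the example above: two references to the same extent with a hole between them.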
>>>> Deduplication replaces two distinct extent refs surrounding a hole with
>>>> two references to one of the duplicate extents, turning this:
>>>>
>>>> ref 1: extent A, offset 0, length 4096
>>>> ref 2: hole, length 8192
>>>> ref 3: extent B, offset 0, length 4096
>>>>
>>>> into this:
>>>>
>>>> ref 1: extent A, offset 0, length 4096
>>>> ref 2: hole, length 8192
>>>> ref 3: extent A, offset 0, length 4096
>>>>
>>>> Compression is required (zlib, zstd, or lzo) for corruption to occur.
>>>> I am not able to reproduce the issue with an uncompressed extent nor
>>>> have I observed any such corruption in the wild.
>>>>
>>>> The presence or absence of the no-holes filesystem feature has no effect.
>>>>
>>>> Ordinary writes can lead to pairs of extent references to the same extent
>>>> separated by a reference to a different extent; however, in this case
>>>> there is data to be read from a real extent, instead of pages that have
>>>> to be zero filled from a hole. If ordinary non-hole writes could trigger
>>>> this bug, every page-oriented database engine would be crashing all the
>>>> time on btrfs with compression enabled, and it's unlikely that would not
>>>> have been noticed between 2015 and now. An ordinary write that splits
>>>> an extent ref would look like this:
>>>>
>>>> ref 1: extent A, offset 0, length 4096
>>>> ref 2: extent C, offset 0, length 8192
>>>> ref 3: extent A, offset 12288, length 4096
>>>>
>>>> Sparse writes can lead to pairs of extent references surrounding a hole;
>>>> however, in this case the extent references will point to different
>>>> extents, avoiding the bug. If a sparse write could trigger the bug,
>>>> the rsync -S option and qemu/kvm 'raw' disk image files (among many
>>>> other tools that produce sparse files) would be unusable, and it's
>>>> unlikely that would not have been noticed between 2015 and now either.
>>>> Sparse writes look like this:
>>>>
>>>> ref 1: extent A, offset 0, length 4096
>>>> ref 2: hole, length 8192
>>>> ref 3: extent B, offset 0, length 4096
>>>>
>>>> The pattern or timing of read() calls seems to be relevant. It is very
>>>> hard to see the corruption when reading files with 'hd', but 'cat | hd'
>>>> will see the corruption just fine. Similar problems exist with 'cmp'
>>>> but not 'sha1sum'. Two processes reading the same file at the same time
>>>> seem to trigger the corruption very frequently.
>>>>
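One way to exercise the two-reader case is a small helper like the one below. It is a rough sketch, not part of the original reproducer: distinct_hashes is a hypothetical name, and unlike the script above it does not drop caches, so on a real test you would probably still want the drop_caches step between rounds to force reads from disk.

```shell
#!/bin/bash
# distinct_hashes FILE COUNT
# Hashes FILE COUNT times while a background reader streams the same
# file, then prints how many distinct sha1sums were seen. Anything
# above 1 means the reads disagreed with each other.
distinct_hashes() {
    local file=$1 count=$2 i
    for ((i = 0; i < count; i++)); do
        cat "$file" > /dev/null &          # second, concurrent reader
        sha1sum < "$file" | cut -d' ' -f1  # hash seen by the first reader
        wait                               # let the background reader finish
    done | sort -u | wc -l
}
```

For example, "distinct_hashes am 20" run in the test directory should print 1 on a healthy file; a larger number would indicate that some reads returned different data.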
>>>> Some patterns of holes and data produce corruption faster than others.
>>>> The pattern generated by the script above is based on instances of
>>>> corruption I've found in the wild, and has a much better repro rate than
>>>> random holes.
>>>>
>>>> The corruption occurs during reads, after csum verification and before
>>>> decompression, so btrfs detects no csum failures. The data on disk
>>>> seems to be OK and could be read correctly once the kernel bug is fixed.
>>>> Repeated reads do eventually return correct data, but there is no way
>>>> for userspace to distinguish between corrupt and correct data reliably.
>>>>
>>>> The corrupted data is usually data replaced by a hole or a copy of other
>>>> blocks in the same extent.
>>>>
>>>> The behavior is similar to some earlier bugs related to holes and
>>>> compressed data in btrfs, but it's new and not fixed yet--hence,
>>>> "2018 edition."
>>>
>>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> “Whether you think you can, or you think you can't — you're right.”
>>