* [general question] rare silent data corruption when writing data
@ 2020-05-07 17:30 Michal Soltys
2020-05-07 18:24 ` Roger Heflin
2020-05-13 6:31 ` Chris Dunlop
0 siblings, 2 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 17:30 UTC (permalink / raw)
To: linux-raid
Note: this is just a general question - if anyone has experienced something similar or could suggest how to pinpoint / verify the actual cause.
Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply contain a precise 4 KiB (one page) of incorrect data. The incorrect pieces of data look fine on their own - like something that was previously in that place, or that was written from the wrong source.
The hardware is (can provide more detailed info of course):
- Supermicro X9DR7-LN4F
- onboard LSI SAS2308 controller (2 SFF-8087 connectors, 1 connected to backplane)
- 96 GB RAM (ECC)
- 24-disk backplane
- 1 array connected directly to the LSI controller (4 disks, mdraid5, internal bitmap, 512 KiB chunk)
- 1 array on the backplane (4 disks, mdraid5, journaled)
- journal for the above array: mdraid1 over 2 SSDs (Micron 5300 Pro)
- 1 btrfs raid1 boot array on the motherboard's SATA ports (older but still fine Intel SSDs from the DC 3500 series)
Raid 5 arrays are in an LVM volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some use thin pools (with metadata on the aforementioned Intel SSDs, in a mirrored config). LVM
uses large extent sizes (120 MiB) and the chunk size of the thin pools is set to 1.5 MiB to match the underlying raid stripe. Everything is cleanly aligned as well.
With a fair dose of testing we managed to roughly rule out the following elements as the cause:
- qemu/kvm (the issue occurred directly on the host)
- backplane (the issue occurred on disks connected directly via the LSI's 2nd connector)
- cable (as above, with two different cables)
- memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
- mdadm journaling (the issue occurred on a plain mdraid configuration as well)
- the disks themselves (the issue occurred on two separate mdadm arrays)
- filesystem (the issue occurred on both btrfs and ext4 (checksummed manually))
We did not manage to rule out (though all seem _highly_ unlikely):
- lvm thin (the issue has always - so far - occurred on lvm thin pools)
- mdraid (the issue has always - so far - occurred on mdraid-managed arrays)
- kernel (tested with - in this case - debian's 5.2 and 5.4 kernels; it happened with both, which would imply an already longstanding bug somewhere)
And finally - so far - the issue has never occurred:
- directly on a disk
- directly on mdraid
- on linear lvm volume on top of mdraid
As far as the issue goes:
- it's always a 4 KiB chunk that is incorrect - in a ~1 TB file there can be from a few to a few dozen such chunks
- we also found (or rather btrfs scrub did) a few small damaged files as well
- the chunks look like a correct piece of different or previous data
The 4 KiB size is, well, weird. It doesn't match any chunk/stripe size anywhere across the stack (lvm - 120 MiB extents, 1.5 MiB chunks on thin pools; mdraid - default 512 KiB chunks). It does nicely fit a page, though ...
Anyway, if anyone has any ideas or suggestions as to what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.
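For reference, a minimal sketch of the write-then-verify check we run; the target path and size are placeholders to point at the suspect volume:

```shell
# Write random data, checksum it, drop the page cache (if we can, needs
# root), re-read from disk and compare. Paths/sizes are placeholders.
TARGET="${1:-/tmp/corruption-test.bin}"
SIZE_MB="${2:-8}"

set -e
dd if=/dev/urandom of="$TARGET" bs=1M count="$SIZE_MB" conv=fsync 2>/dev/null
sum_written=$(sha256sum "$TARGET" | cut -d' ' -f1)

sync
# Only root can drop caches; skip silently otherwise.
[ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches

sum_read=$(sha256sum "$TARGET" | cut -d' ' -f1)
if [ "$sum_written" = "$sum_read" ]; then
    echo "OK"
else
    echo "MISMATCH: wrote $sum_written, read back $sum_read"
fi
```

Run in a loop against a file on the suspect LV; on healthy hardware every pass should print OK.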
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
@ 2020-05-07 18:24 ` Roger Heflin
2020-05-07 21:01 ` John Stoffel
2020-05-07 22:13 ` Michal Soltys
2020-05-13 6:31 ` Chris Dunlop
1 sibling, 2 replies; 20+ messages in thread
From: Roger Heflin @ 2020-05-07 18:24 UTC (permalink / raw)
To: Michal Soltys; +Cc: Linux RAID
Have you tried the same file 2x and verified the corruption is in the
same places and looks the same?
I have not as of yet seen write corruption, except when a vendor's disk
was resetting and lying about having written the data prior to the
crash (these were SSDs; if your disk write cache is on and you have a
disk reset, this can also happen). I have not seen "lost writes"
otherwise, but I would expect the two read corruptions I have seen to
also be able to cause write issues. So for that, look for SCSI
notifications about disk resets that should not happen.
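A quick sketch of what to grep the kernel log for; the pattern is an assumption to adjust for the driver's actual wording (mpt2sas/mpt3sas for an LSI SAS2308):

```shell
# Look for disk reset / abort / link error messages in the kernel log.
# The pattern is a guess - extend it for your driver's message wording.
pat='reset|task abort|attempting device reset|log_info|I/O error|link.*down'
dmesg | grep -iE "$pat" || echo "no matching messages"
```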
I have had a "bad" controller cause read corruptions; those
corruptions would move around, and replacing the controller resolved
it, so there may be a lack of error checking "inside" some paths in
the card. Luckily I had a number of these controllers and had cold
spares for them. The giveaway there was 2 separate buses with almost
identical load, 6 separate disks on each, and all 12 disks on the 2
buses had between 47-52 SCSI errors - which points to the only shared
component (the controller).
The backplane and cables are unlikely in general to cause this; there
is too much error checking between the controller and the disk, from
what I know.
I have had a pre-PCIe bus (PCI-X, 2 slots shared, both set to 133)
cause random read corruptions; lowering the speed to 100 fixed it.
This one was duplicated on multiple identical pieces of hw with all
different parts on the duplicating machine.
I have also seen lost writes (from software) because someone did a
seek without doing a flush, which in some versions of the libs loses
the unfilled block when the seek happens (this is noted in the man
page; I saw it 20 years ago and it is still noted there, so no idea
if it was ever fixed). So has more than one application been noted to
see the corruption?
So one question: have you seen the corruption on a path that relies
on only one controller, or do all the corruptions you have seen
involve more than one controller? Isolate and test each controller if
you can, or if you can afford to, replace it and see if the issue
continues.
* Re: [general question] rare silent data corruption when writing data
2020-05-07 18:24 ` Roger Heflin
@ 2020-05-07 21:01 ` John Stoffel
2020-05-07 22:33 ` Michal Soltys
2020-05-07 22:13 ` Michal Soltys
1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-07 21:01 UTC (permalink / raw)
To: Roger Heflin; +Cc: Michal Soltys, Linux RAID
>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
Are these 1 TB files VMDK or COW images of VMs? How are these files
made? And does it ever happen with *smaller* files? What about if
you just use a sparse 2 TB file and write blocks out past 1 TB to see
if there's a problem?
Are the LVs split across RAID5 PVs by any chance?
It's not clear if you can replicate the problem without using
lvm-thin, but that's what I suspect you might be having problems with.
Can you give us the versions of your tools, and exactly how you set
up your test cases? How long does it take to find the problem?
Can you compile the newest kernel and newest thin tools and try them
out?
How long does it take to replicate the corruption?
Sorry for all the questions, but until there's a test case which is
repeatable, it's going to be hard to chase this down.
I wonder if running 'fio' tests would be something to try?
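Something like this fio job would write data with embedded checksums and verify it on re-read; it's only a sketch, and the filename, size and ioengine are placeholders to adjust:

```ini
; sketch of a data-integrity fio job - point filename at the suspect LV
[verify-job]
filename=/mnt/test/fio-verify.bin
size=16g
bs=4k
rw=write
direct=1
ioengine=libaio
verify=crc32c
verify_fatal=1
do_verify=1
```

Save as e.g. verify.fio and run `fio verify.fio`; with verify_fatal=1 it stops at the first mismatching block and reports its offset.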
And also changing your RAID5 setup to use the default stride and
stripe widths, instead of the large values you're using.
Good luck!
* Re: [general question] rare silent data corruption when writing data
2020-05-07 18:24 ` Roger Heflin
2020-05-07 21:01 ` John Stoffel
@ 2020-05-07 22:13 ` Michal Soltys
1 sibling, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 22:13 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid
On 20/05/07 20:24, Roger Heflin wrote:
> Have you tried the same file 2x and verified the corruption is in the
> same places and looks the same?
Yes, both with direct tests on the host and with btrfs scrub failing on
the same files in exactly the same places. Always full 4 KiB chunks.
>
> I have not as of yet seen write corruption (except when a vendors disk
> was resetting and it was lying about having written the data prior to
> the crash, these were ssds, if your disk write cache is on and you
> have a disk reset this can also happen), but have not seen "lost
> writes" otherwise, but would expect the 2 read corruption I have seen
> to also be able to cause write issues. So for that look for scsi
> notifications for disk resets that should not happen.
>
When I was doing a simple test that basically was:

while .....; do
    rng=$(hexdump ..... /dev/urandom)
    dcfldd hash=md5 textpattern=$rng of=/dst/test bs=262144 count=$((16*4096))
    sync
    echo 1 > /proc/sys/vm/drop_caches
    dcfldd hash=md5 if=/dst/test of=/dev/null .....
    compare_hashes_and_stop_if_different
done
There were no worrisome reset or similar entries in dmesg.
> I have had a "bad" controller cause read corruptions; those
> corruptions would move around, and replacing the controller resolved
> it, so there may be a lack of error checking "inside" some paths in
> the card. Luckily I had a number of these controllers and had cold
> spares for them. The giveaway there was 2 separate buses with almost
> identical load, 6 separate disks on each, and all 12 disks on the 2
> buses had between 47-52 SCSI errors - which points to the only shared
> component (the controller).
That doesn't seem to be the case here - the bad reads are always the
same, both in content and position.
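Since the bad data is deterministic, the differing pages can be localized by comparing a known-good copy against a corrupted read; a small sketch (the file names in the usage are placeholders):

```shell
# List the distinct 4 KiB block indices that differ between a
# known-good copy and a corrupted read of the same file.
diff_pages() {
    # cmp -l prints one line per differing byte ("offset byte1 byte2",
    # offsets 1-based); dividing by 4096 gives the 4 KiB page index.
    cmp -l "$1" "$2" | awk '{ print int(($1 - 1) / 4096) }' | uniq
}
```

Usage: `diff_pages good.bin bad.bin` - handy for checking whether the corrupt pages land on any chunk or stripe boundary.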
>
> I have had a pre-PCIe bus (PCI-X, 2 slots shared, both set to 133)
> cause random read corruptions; lowering the speed to 100 fixed it.
> This one was duplicated on multiple identical pieces of hw with all
> different parts on the duplicating machine.
>
> I have also seen lost writes (from software) because someone did a
> seek without doing a flush, which in some versions of the libs loses
> the unfilled block when the seek happens (this is noted in the man
> page; I saw it 20 years ago and it is still noted there, so no idea
> if it was ever fixed). So has more than one application been noted to
> see the corruption?
>
> So one question: have you seen the corruption on a path that relies
> on only one controller, or do all the corruptions you have seen
> involve more than one controller? Isolate and test each controller if
> you can, or if you can afford to, replace it and see if the issue
> continues.
>
So far only on one (LSI 2308) controller - although the thin volumes'
metadata is on the SSDs connected to the chipset's SATA controller.
Still, if hypothetically that were the case (metadata disks), wouldn't
I rather see corruptions that are a multiple of the thin volume's chunk
size (so multiples of 1.5 MiB in this case)?
As for the controller, I have ordered another one that we plan to test
in the near future.
* Re: [general question] rare silent data corruption when writing data
2020-05-07 21:01 ` John Stoffel
@ 2020-05-07 22:33 ` Michal Soltys
2020-05-08 0:54 ` John Stoffel
2020-05-08 3:44 ` Chris Murphy
0 siblings, 2 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 22:33 UTC (permalink / raw)
To: John Stoffel, Roger Heflin; +Cc: Linux RAID
On 20/05/07 23:01, John Stoffel wrote:
>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>
> Roger> Have you tried the same file 2x and verified the corruption is in the
> Roger> same places and looks the same?
>
> Are these 1 TB files VMDK or COW images of VMs? How are these files
> made? And does it ever happen with *smaller* files? What about if
> you just use a sparse 2 TB file and write blocks out past 1 TB to
> see if there's a problem?
The VMs sit directly on lvm volumes (e.g. /dev/mapper/vg0-gitlab). The
guest (btrfs inside the guest) detected the errors after we ran scrub
on the filesystem.
Yes, the errors were also found in small files.
Since then we have recreated the issue directly on the host, just by
making an ext4 filesystem on some LV, then doing a write with a
checksum, sync, drop_caches, read, and checksum check. The errors are,
as I mentioned, always full 4 KiB chunks (always the same content,
always the same position).
>
> Are the LVs split across RAID5 PVs by any chance?
raid5s are used as PVs, but a single logical volume always uses only
one physical volume underneath (if that's what you meant by split across).
>
> It's not clear if you can replicate the problem without using
> lvm-thin, but that's what I suspect you might be having problems with.
>
I'll be trying to do that, though the heavier tests will have to wait
until I move all VMs to other hosts (as that is/was our production machine).
> Can you give us the versions of your tools, and exactly how you set
> up your test cases? How long does it take to find the problem?
Will get all the details tomorrow (the host is on up-to-date debian
buster; the VMs are a mix of archlinuxes and debians (and the issue
happened on both)).
As for how long, it's hit and miss. Sometimes writing and reading back
a ~16 GB file fails (the checksum read back differs from what was
written) after 2-3 tries. That's on the host.
On the guest, it has been (so far) a guaranteed thing when we were
creating a very large tar file (900 GB+). For the past two weeks we
were unable to create that file without errors even once.
>
> Can you compile the newest kernel and newest thin tools and try them
> out?
I can, but a bit later (once we move VMs out of the host).
>
> How long does it take to replicate the corruption?
>
When it happens, it's usually a few tries of writing a 16 GB file with
random patterns and reading it back (directly on the host). The
irritating thing is that it can be somewhat hard to reproduce (e.g.
after a machine reboot).
> Sorry for all the questions, but until there's a test case which is
> repeatable, it's going to be hard to chase this down.
>
> I wonder if running 'fio' tests would be something to try?
>
> And also changing your RAID5 setup to use the default stride and
> stripe widths, instead of the large values you're using.
The raid5 is using mdadm's defaults (which is 512 KiB these days for a
chunk). LVM on top uses much longer extents (as we don't really need
4 MiB granularity) and the lvm-thin chunks were set to match (and align
with) the raid's stripe.
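The arithmetic behind that alignment, for clarity:

```shell
# Stripe-width arithmetic for the setup above: a 4-disk raid5 has
# 3 data chunks per stripe, so with the mdadm default 512 KiB chunk
# the full data stripe is 1536 KiB (1.5 MiB) - hence the thin-pool
# chunk size of 1.5 MiB.
chunk_kib=512
disks=4
data_disks=$((disks - 1))
stripe_kib=$((chunk_kib * data_disks))
echo "full data stripe: ${stripe_kib} KiB"    # 1536 KiB = 1.5 MiB
```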
>
> Good luck!
>
> Roger> I have not as of yet seen write corruption (except when a vendors disk
> Roger> was resetting and it was lying about having written the data prior to
> Roger> the crash, these were ssds, if your disk write cache is on and you
> Roger> have a disk reset this can also happen), but have not seen "lost
> Roger> writes" otherwise, but would expect the 2 read corruption I have seen
> Roger> to also be able to cause write issues. So for that look for scsi
> Roger> notifications for disk resets that should not happen.
>
> Roger> I have had a "bad" controller cause read corruptions, those
> Roger> corruptions would move around, replacing the controller resolved it,
> Roger> so there may be lack of error checking "inside" some paths in the
> Roger> card. Lucky I had a number of these controllers and had cold spares
> Roger> for them. The give away here was 2 separate buses with almost
> Roger> identical load with 6 separate disks each and all12 disks on 2 buses
> Roger> had between 47-52 scsi errors, which points to the only component
> Roger> shared (the controller).
>
> Roger> The backplane and cables are unlikely in general cause this, there is
> Roger> too much error checking between the controller and the disk from what
> Roger> I know.
>
> Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133
> Roger> cause random read corruptions, lowering speed to 100 fixed it), this
> Roger> one was duplicated on multiple identical pieces of hw with all
> Roger> different parts on the duplication machine.
>
> Roger> I have also seen lost writes (from software) because someone did a
> Roger> seek without doing a flush which in some versions of the libs loses
> Roger> the unfilled block when the seek happens (this is noted in the man
> Roger> page, and I saw it 20years ago, it is still noted in the man page, so
> Roger> no idea if it was ever fixed). So has more than one application been
> Roger> noted to see the corruption?
>
> Roger> So one question, have you seen the corruption in a path that would
> Roger> rely on one controller, or all corruptions you have seen involving
> Roger> more than one controller? Isolate and test each controller if you
> Roger> can, or if you can afford to replace it and see if it continues.
>
>
> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>
>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>
>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.
>>>
>>> The hardware is (can provide more detailed info of course):
>>>
>>> - Supermicro X9DR7-LN4F
>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>> - 96 gb ram (ecc)
>>> - 24 disk backplane
>>>
>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>
>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>
>>> With a doze of testing we managed to roughly rule out the following elements as being the cause:
>>>
>>> - qemu/kvm (issue occured directly on host)
>>> - backplane (issue occured on disks directly connected via LSI's 2nd connector)
>>> - cable (as a above, two different cables)
>>> - memory (unlikely - ECC for once, thoroughly tested, no errors ever reported via edac-util or memtest)
>>> - mdadm journaling (issue occured on plain mdraid configuration as well)
>>> - disks themselves (issue occured on two separate mdadm arrays)
>>> - filesystem (issue occured on both btrfs and ext4 (checksumed manually) )
>>>
>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>
>>> - lvm thin (issue always - so far - occured on lvm thin pools)
>>> - mdraid (issue always - so far - on mdraid managed arrays)
>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply rather already longstanding bug somewhere)
>>>
>>> And finally - so far - the issue never occured:
>>>
>>> - directly on a disk
>>> - directly on mdraid
>>> - on linear lvm volume on top of mdraid
>>>
>>> As far as the issue goes it's:
>>>
>>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to few dozens of such chunks
>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>> - the chunks look like a correct piece of different or previous data
>>>
>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>
>>> Anyway, if anyone has any ideas or suggestions as to what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for anything.
* Re: [general question] rare silent data corruption when writing data
2020-05-07 22:33 ` Michal Soltys
@ 2020-05-08 0:54 ` John Stoffel
2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
2020-05-08 3:44 ` Chris Murphy
1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-08 0:54 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>
>> Are these 1tb files VMDK or COW images of VMs? How are these files
>> made. And does it ever happen with *smaller* files? What about if
>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>> there's a problem?
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
Michal> Yes, the errors were also found on small files.
Those errors are in small files inside the VM, which is running btrfs
on top of block storage provided by your thin-lv, right?
disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
filesystem -> corruption
Michal> Since then we recreated the issue directly on the host, just
Michal> by making ext4 filesystem on some LV, then doing write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always full 4KiB chunks (always
Michal> same content, always same position).
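That write/sync/drop-caches/read cycle can be sketched roughly as follows (a minimal sketch, not the exact test used here; the file path and the scaled-down size are placeholders, the real runs used ~16 GiB files):

```shell
# Round-trip integrity probe: write random data, checksum it, flush,
# then read it back and compare. Path and size are placeholders.
roundtrip() {
    f="$1"; size_mb="$2"
    dd if=/dev/urandom of="$f" bs=1M count="$size_mb" 2>/dev/null
    written=$(sha256sum "$f" | awk '{print $1}')
    sync
    # On the real host, also drop the page cache (needs root) so the
    # read-back goes through the full storage stack, not RAM:
    #   echo 3 > /proc/sys/vm/drop_caches
    readback=$(sha256sum "$f" | awk '{print $1}')
    if [ "$written" = "$readback" ]; then
        echo "OK $f"
    else
        echo "MISMATCH $f"
    fi
}

roundtrip /tmp/lv-probe.bin 4
```

Repeating that in a loop on a filesystem sitting on a thin LV is essentially the host-side reproducer described above.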
What position? Is it a 4k, 1.5m or some other consistent offset? And
how far into the file? And this LV is a plain LV or a thin-lv? I'm
running a debian box at home with RAID1 and I haven't seen this, but
I'm not nearly as careful as you. Can you provide the output of:
/sbin/lvs --version
too?
Can you post your:
/sbin/dmsetup status
output too? There's a better command to use here, but I'm not an
expert. You might really want to copy this over to the
linux-lvm@redhat.com mailing list as well.
>> Are the LVs split across RAID5 PVs by any chance?
Michal> raid5s are used as PVs, but a single logical volume always uses only
Michal> one physical volume underneath (if that's what you meant by split across).
Ok, that's what I was asking about. It shouldn't matter... but just
trying to chase down the details.
>> It's not clear if you can replicate the problem without using
>> lvm-thin, but that's what I suspect you might be having problems with.
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machine).
Sure, makes sense.
>> Can you give us the versions of the your tools, and exactly how you
>> setup your test cases? How long does it take to find the problem?
Michal> Will get all the details tomorrow (the host is on up to date debian
Michal> buster, the VMs are mix of archlinuxes and debians (and the issue
Michal> happened on both)).
Michal> As for how long, it's a hit and miss. Sometimes writing and reading back
Michal> ~16gb file fails (the checksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating a very large tar file (900gb+). For the past two weeks we were
Michal> unable to create that file without errors even once.
Ouch! That's not good. Just to confirm, these corruptions are all in
a thin-lv based filesystem, right? I'd be interested to know if you
can create another plain LV and cause the same error. Trying to
simplify the potential problems.
>> Can you compile the newest kernel and newest thin tools and try them
>> out?
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>> How long does it take to replicate the corruption?
>>
Michal> When it happens, it's usually a few tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after machine's reboot).
>> Sorry for all the questions, but until there's a test case which is
>> repeatable, it's going to be hard to chase this down.
>>
>> I wonder if running 'fio' tests would be something to try?
>>
>> And also changing your RAID5 setup to use the default stride and
>> stripe widths, instead of the large values you're using.
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
Michal> to raid's stripe.
>>
>> Good luck!
>>
Roger> I have not as of yet seen write corruption (except when a vendor's disk
Roger> was resetting and it was lying about having written the data prior to
Roger> the crash; these were ssds, and if your disk write cache is on and you
Roger> have a disk reset this can also happen), and have not seen "lost
Roger> writes" otherwise, but would expect the 2 read corruptions I have seen
Roger> to also be able to cause write issues. So for that, look for scsi
Roger> notifications for disk resets that should not happen.
>>
Roger> I have had a "bad" controller cause read corruptions, those
Roger> corruptions would move around, replacing the controller resolved it,
Roger> so there may be lack of error checking "inside" some paths in the
Roger> card. Luckily I had a number of these controllers and had cold spares
Roger> for them. The giveaway here was 2 separate buses with almost
Roger> identical load with 6 separate disks each and all 12 disks on 2 buses
Roger> had between 47-52 scsi errors, which points to the only component
Roger> shared (the controller).
>>
Roger> The backplane and cables are unlikely in general to cause this, there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>
Roger> I have had a pre-PCIe bus (PCI-X bus, 2 slots shared, both set to 133)
Roger> cause random read corruptions (lowering the speed to 100 fixed it); this
Roger> one was duplicated on multiple identical pieces of hw with all
Roger> different parts on the duplication machine.
>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush, which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page, and I saw it 20 years ago; it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>
Roger> So one question, have you seen the corruption in a path that would
Roger> rely on one controller, or do all the corruptions you have seen involve
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to replace it and see if it continues.
>>
>>
* Re: [general question] rare silent data corruption when writing data
2020-05-07 22:33 ` Michal Soltys
2020-05-08 0:54 ` John Stoffel
@ 2020-05-08 3:44 ` Chris Murphy
2020-05-10 19:05 ` Sarah Newman
2020-05-20 21:40 ` Michal Soltys
1 sibling, 2 replies; 20+ messages in thread
From: Chris Murphy @ 2020-05-08 3:44 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
On Thu, May 7, 2020 at 4:34 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
> Since then we recreated the issue directly on the host, just by making
> ext4 filesystem on some LV, then doing write with checksum, sync,
> drop_caches, read and check checksum. The errors are, as I mentioned -
> always full 4KiB chunks (always same content, always same position).
The 4KiB chunk. What are the contents? Is it definitely guest VM data?
Or is it sometimes file system metadata? How many corruptions have
happened? The file system metadata is quite small compared to data.
But if there have been many errors, we'd expect if it's caused on the
host, that eventually file system metadata is corrupted. If it's
definitely only data, that's curious and maybe implicates something
going on in the guest.
Btrfs, whether normal reads or scrubs, will report the path to the
affected file, for data corruption. Metadata corruption errors
sometimes have inode references, but not a path to a file.
> >
> > Are the LVs split across RAID5 PVs by any chance?
>
> raid5s are used as PVs, but a single logical volume always uses only
> one physical volume underneath (if that's what you meant by split across).
It might be a bit suboptimal. A single 4KiB block write in the guest
turns into a 4KiB block write in the host's LV. That in turn trickles
down to md, which has a 512KiB x 4 drive stripe. So a single 4KiB
write translates into a 2M stripe write. There is an optimization for
raid5 in the RMW case, where it should be that only 4KiB of data plus
4KiB of parity is written (a partial strip/chunk write); I'm not sure about
reads.
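For reference, the geometry being described works out like this (just arithmetic on the numbers from this thread, nothing more):

```shell
# mdadm default chunk and the 4-disk RAID5 layout from this thread:
chunk_kib=512
disks=4
data_disks=$((disks - 1))          # one disk's worth of each stripe is parity

data_stripe_kib=$((chunk_kib * data_disks))   # data portion of a full stripe
full_stripe_kib=$((chunk_kib * disks))        # data + parity touched by a full-stripe write

echo "data stripe: ${data_stripe_kib} KiB"    # matches the 1.5m thin-pool chunk
echo "full stripe: ${full_stripe_kib} KiB"    # the 2M write mentioned above
```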
> > It's not clear if you can replicate the problem without using
> > lvm-thin, but that's what I suspect you might be having problems with.
> >
>
> I'll be trying to do that, though the heavier tests will have to wait
> until I move all VMs to other hosts (as that is/was our production machine).
By default, Btrfs uses a 16KiB block size for leaves and nodes. It's
still a tiny footprint compared to data writes, but if LVM thin is a
suspect, it really should just be a matter of time before file system
corruption happens. If it doesn't, that's useful information. It
probably means it's not LVM thin. But then what?
> As for how long, it's a hit and miss. Sometimes writing and reading back
> ~16gb file fails (the checksum read back differs from what was written)
> after 2-3 tries. That's on the host.
>
> On the guest, it's been (so far) a guaranteed thing when we were
> creating a very large tar file (900gb+). For the past two weeks we were
> unable to create that file without errors even once.
It's very useful to have a consistent reproducer. You can do metadata
only writes on Btrfs by doing multiple back to back metadata only
balance. If the problem really is in the write path somewhere, this
would eventually corrupt the metadata - it would be detected during
any subsequent balance or scrub. 'btrfs balance start -musage=100
/mountpoint' will do it.
This reproducer. It only reproduces in the guest VM? If you do it in
the host, otherwise exactly the same way with all the exact same
versions of everything, and it does not reproduce?
>
> >
> > Can you compile the newst kernel and newest thin tools and try them
> > out?
>
> I can, but a bit later (once we move VMs out of the host).
>
> >
> > How long does it take to replicate the corruption?
> >
>
> When it happens, it's usually a few tries of writing a 16gb file
> with random patterns and reading it back (directly on host). The
> irritating thing is that it can be somewhat hard to reproduce (e.g.
> after machine's reboot).
Reading it back on the host. So you've shut down the VM, and you're
mounting what was the guest VM's backing disk on the host to do the
verification. There's never a case of concurrent usage between guest
and host?
>
> > Sorry for all the questions, but until there's a test case which is
> > repeatable, it's going to be hard to chase this down.
> >
> > I wonder if running 'fio' tests would be something to try?
> >
> > And also changing your RAID5 setup to use the default stride and
> > stripe widths, instead of the large values you're using.
>
> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> chunk). LVM on top is using much longer extents (as we don't really need
> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> to raid's stripe.
I would change very little until you track this down, if the goal is
to track it down and get it fixed.
I'm not sure if LVM thinp is supported on top of LVM raid yet; if
it's not, then I can understand using mdadm raid5 instead
of LVM raid5.
--
Chris Murphy
* Re: [general question] rare silent data corruption when writing data
2020-05-08 0:54 ` John Stoffel
@ 2020-05-08 11:10 ` Michal Soltys
0 siblings, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-08 11:10 UTC (permalink / raw)
To: John Stoffel; +Cc: Roger Heflin, Linux RAID, linux-lvm
note: as suggested, I'm also CCing this to linux-lvm; the full context with replies starts at:
https://www.spinics.net/lists/raid/msg64364.html
There is also the initial post at the bottom as well.
On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>
> Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>
> Roger> Have you tried the same file 2x and verified the corruption is in the
> Roger> same places and looks the same?
>>>
>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>> made. And does it ever happen with *smaller* files? What about if
>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>> there's a problem?
>
> Michal> The VMs are always directly on lvm volumes. (e.g.
> Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
> Michal> errors after we ran scrub on the filesystem.
>
> Michal> Yes, the errors were also found on small files.
>
> Those errors are in small files inside the VM, which is running btrfs
> on top of block storage provided by your thin-lv, right?
>
Yea, the small files were in this case on that thin-lv.
We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on (so far) 2 layers).
>
>
> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
> filesystem -> corruption
Those particular guests, yea. In the host case it's just without the "guest" step.
But (so far) all corruption ended up going via one of the lv-thin layers (and via one of the md raids).
>
>
> Michal> Since then we recreated the issue directly on the host, just
> Michal> by making ext4 filesystem on some LV, then doing write with
> Michal> checksum, sync, drop_caches, read and check checksum. The
> Michal> errors are, as I mentioned - always full 4KiB chunks (always
> Michal> same content, always same position).
>
> What position? Is it a 4k, 1.5m or some other consistent offset? And
> how far into the file? And this LV is a plain LV or a thin-lv? I'm
> running a debian box at home with RAID1 and I haven't seen this, but
> I'm not nearly as careful as you. Can you provide the output of:
>
What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) that look like correct pieces belonging somewhere else.
> /sbin/lvs --version
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>
> too?
>
> Can you post your:
>
> /sbin/dmsetup status
>
> output too? There's a better command to use here, but I'm not an
> expert. You might really want to copy this over to the
> linux-lvm@redhat.com mailing list as well.
x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_ssd_tdata: 0 2147696640 linear
x22v0-tp_ssd_tdata: 2147696640 429588480 linear
x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
x22v0-btrfsnopool: 0 134430720 linear
x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
x22v0-micron_vol: 0 268615680 linear
x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
x22v0-btrfs_ssd_linear: 0 268615680 linear
x22v0-btrfstest: 0 268615680 thin 40734720 268615679
x22v0-tp_ssd: 0 2577285120 linear
x22v0-tp_big: 0 22164602880 linear
x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
x22v0-tmob2: 0 1048657920 linear
x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_big_tdata: 0 4295147520 linear
x22v0-tp_big_tdata: 4295147520 17869455360 linear
x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
x22v0-tmob1: 0 1048657920 linear
x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
x22v0-touk--elk4: 0 839024640 linear
x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
x22v0-k8sdkr: 0 268615680 linear
x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
x22v0-rocket: 0 167854080 thin 163602432 167854079
x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
x22v0-roger2: 0 134430720 thin 33014784 134430719
x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647
Also lvs -a with segment ranges:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
[lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
[tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
[tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
[tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
[tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
[tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
[tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
[tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
[tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
[tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
[tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
>
>>> Are the LVs split across RAID5 PVs by any chance?
>
> Michal> raid5s are used as PVs, but a single logical volume always uses only
> Michal> one physical volume underneath (if that's what you meant by split across).
>
> Ok, that's what I was asking about. It shouldn't matter... but just
> trying to chase down the details.
>
>
>>> It's not clear if you can replicate the problem without using
>>> lvm-thin, but that's what I suspect you might be having problems with.
>
> Michal> I'll be trying to do that, though the heavier tests will have to wait
> Michal> until I move all VMs to other hosts (as that is/was our production machine).
>
> Sure, makes sense.
>
>>> Can you give us the versions of the your tools, and exactly how you
>>> setup your test cases? How long does it take to find the problem?
Regarding this, currently:
kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
mdadm - v4.1 - 2018-10-01
>
> Michal> Will get all the details tomorrow (the host is on up to date debian
> Michal> buster, the VMs are mix of archlinuxes and debians (and the issue
> Michal> happened on both)).
>
> Michal> As for how long, it's a hit and miss. Sometimes writing and reading back
> Michal> ~16gb file fails (the checksum read back differs from what was written)
> Michal> after 2-3 tries. That's on the host.
>
> Michal> On the guest, it's been (so far) a guaranteed thing when we were
> Michal> creating a very large tar file (900gb+). For the past two weeks we were
> Michal> unable to create that file without errors even once.
>
> Ouch! That's not good. Just to confirm, these corruptions are all in
> a thin-lv based filesystem, right? I'd be interested to know if you
> can create another plain LV and cause the same error. Trying to
> simplify the potential problems.
I have been trying to - but so far didn't manage to replicate this with:
- a physical partition
- filesystem directly on a physical partition
- filesystem directly on mdraid
- filesystem directly on a linear volume
Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I have also had lengthy periods of attempts to cause corruption on some thin volume without any success either. But the ones that failed had this in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect.
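To make that layer-by-layer comparison systematic, the same kind of write/verify probe can be pointed at one mount per configuration (the mount points below are hypothetical placeholders for the setups listed above: raw partition, plain mdraid, linear LV, thin LV):

```shell
# Run one write/verify probe per layer; any MISMATCH narrows the suspect
# list to the layers unique to that mount's stack.
probe() {
    f="$1/probe.bin"
    dd if=/dev/urandom of="$f" bs=1M count=4 2>/dev/null
    w=$(sha256sum "$f" | awk '{print $1}')
    sync
    r=$(sha256sum "$f" | awk '{print $1}')
    rm -f "$f"
    if [ "$w" = "$r" ]; then echo "$1: clean"; else echo "$1: MISMATCH"; fi
}

for mnt in /mnt/partition /mnt/md-direct /mnt/lv-linear /mnt/lv-thin; do
    if [ -d "$mnt" ]; then probe "$mnt"; fi
done
```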
>
>
>>> Can you compile the newst kernel and newest thin tools and try them
>>> out?
>
> Michal> I can, but a bit later (once we move VMs out of the host).
>
>>>
>>> How long does it take to replicate the corruption?
>>>
>
> Michal> When it happens, it's usually few tries tries of writing a 16gb file
> Michal> with random patterns and reading it back (directly on host). The
> Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
> Michal> after machine's reboot).
>
>>> Sorry for all the questions, but until there's a test case which is
>>> repeatable, it's going to be hard to chase this down.
>>>
>>> I wonder if running 'fio' tests would be something to try?
>>>
>>> And also changing your RAID5 setup to use the default stride and
>>> stripe widths, instead of the large values you're using.
>
> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> Michal> chunk). LVM on top is using much longer extents (as we don't really need
> Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> Michal> to raid's stripe.
>
>>>
>>> Good luck!
>>>
> Roger> I have not as of yet seen write corruption (except when a vendors disk
> Roger> was resetting and it was lying about having written the data prior to
> Roger> the crash, these were ssds, if your disk write cache is on and you
> Roger> have a disk reset this can also happen), but have not seen "lost
> Roger> writes" otherwise, but would expect the 2 read corruption I have seen
> Roger> to also be able to cause write issues. So for that look for scsi
> Roger> notifications for disk resets that should not happen.
>>>
> Roger> I have had a "bad" controller cause read corruptions, those
> Roger> corruptions would move around, replacing the controller resolved it,
> Roger> so there may be lack of error checking "inside" some paths in the
> Roger> card. Lucky I had a number of these controllers and had cold spares
> Roger> for them. The give away here was 2 separate buses with almost
> Roger> identical load with 6 separate disks each and all12 disks on 2 buses
> Roger> had between 47-52 scsi errors, which points to the only component
> Roger> shared (the controller).
>>>
> Roger> The backplane and cables are unlikely in general cause this, there is
> Roger> too much error checking between the controller and the disk from what
> Roger> I know.
>>>
> Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133
> Roger> cause random read corruptions, lowering speed to 100 fixed it), this
> Roger> one was duplicated on multiple identical pieces of hw with all
> Roger> different parts on the duplication machine.
>>>
> Roger> I have also seen lost writes (from software) because someone did a
> Roger> seek without doing a flush which in some versions of the libs loses
> Roger> the unfilled block when the seek happens (this is noted in the man
> Roger> page, and I saw it 20years ago, it is still noted in the man page, so
> Roger> no idea if it was ever fixed). So has more than one application been
> Roger> noted to see the corruption?
>>>
> Roger> So one question, have you seen the corruption in a path that would
> Roger> rely on one controller, or all corruptions you have seen involving
> Roger> more than one controller? Isolate and test each controller if you
> Roger> can, or if you can afford to replace it and see if it continues.
>>>
>>>
> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>
>>>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>
>>>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.
>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>>>
>>>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following elements as being the cause:
>>>>>
>>>>> - qemu/kvm (issue occurred directly on the host)
>>>>> - backplane (issue occurred on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occurred on a plain mdraid configuration as well)
>>>>> - disks themselves (issue occurred on two separate mdadm arrays)
>>>>> - filesystem (issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>
>>>>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>>>>> - mdraid (issue always - so far - occurred on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply a rather longstanding bug somewhere)
>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on linear lvm volume on top of mdraid
>>>>>
>>>>> As far as the issue goes it's:
>>>>>
>>>>> - it's always a 4kb chunk that is incorrect - in a ~1 tb file there can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>
>>>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any of them.
>>>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [linux-lvm] [general question] rare silent data corruption when writing data
@ 2020-05-08 11:10 ` Michal Soltys
0 siblings, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-08 11:10 UTC (permalink / raw)
To: John Stoffel; +Cc: Linux RAID, Roger Heflin, linux-lvm
note: as suggested, I'm also CCing this to linux-lvm; the full context with replies starts at:
https://www.spinics.net/lists/raid/msg64364.html
The initial post is at the bottom as well.
On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>
> Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>
> Roger> Have you tried the same file 2x and verified the corruption is in the
> Roger> same places and looks the same?
>>>
>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>> made? And does it ever happen with *smaller* files? What about if
>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>> there's a problem?
>
> Michal> The VMs are always directly on lvm volumes. (e.g.
> Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
> Michal> errors after we ran scrub on the filesystem.
>
> Michal> Yes, the errors were also found on small files.
>
> Those errors are in small files inside the VM, which is running btrfs
> ontop of block storage provided by your thin-lv, right?
>
Yea, the small files were in this case on that thin-lv.
We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on - so far - 2 layers).
>
>
> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
> filesystem -> corruption
Those particular guests, yea. In the host case it's just without the "guest" step.
But (so far) all corruption has gone via one of the lv-thin layers (and via one of the md raids).
>
>
> Michal> Since then we recreated the issue directly on the host, just
> Michal> by making ext4 filesystem on some LV, then doing write with
> Michal> checksum, sync, drop_caches, read and check checksum. The
> Michal> errors are, as I mentioned - always a full 4KiB chunks (always
> Michal> same content, always same position).
>
> What position? Is it a 4k, 1.5m or some other consistent offset? And
> how far into the file? And this LV is a plain LV or a thin-lv? I'm
> running a debian box at home with RAID1 and I haven't seen this, but
> I'm not nearly as careful as you. Can you provide the output of:
>
What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) that look like correct pieces belonging somewhere else.
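For reference, the write / sync / drop_caches / read-back / verify loop described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used here - the path, file size, and data generator are placeholder assumptions:

```python
import hashlib
import os

PAGE = 4096

def write_and_verify(path, size, seed=0):
    """Write deterministic pseudo-random data, fsync it, drop the page
    cache, read it back, and return byte offsets of differing 4 KiB pages."""
    # Generate reproducible test data by chaining sha256 digests.
    rnd = hashlib.sha256(str(seed).encode()).digest()
    chunks, total = [], 0
    while total < size:
        rnd = hashlib.sha256(rnd).digest()
        chunks.append(rnd)
        total += len(rnd)
    data = b"".join(chunks)[:size]

    # Write and force the data out to the device.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Drop the page cache so the read-back actually goes through the
    # storage stack instead of being served from memory (needs root;
    # without it the check still runs but is far less meaningful).
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    except OSError:
        pass

    # Read back and report any 4 KiB pages that differ.
    with open(path, "rb") as f:
        readback = f.read()
    return [off for off in range(0, size, PAGE)
            if readback[off:off + PAGE] != data[off:off + PAGE]]
```

On a healthy stack this returns an empty list; on the affected host the idea is that it would occasionally return a handful of page-aligned offsets for a large enough file.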
> /sbin/lvs --version
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>
> too?
>
> Can you post your:
>
> /sbin/dmsetup status
>
> output too? There's a better command to use here, but I'm not an
> expert. You might really want to copy this over to the
> linux-lvm@redhat.com mailing list as well.
x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_ssd_tdata: 0 2147696640 linear
x22v0-tp_ssd_tdata: 2147696640 429588480 linear
x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
x22v0-btrfsnopool: 0 134430720 linear
x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
x22v0-micron_vol: 0 268615680 linear
x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
x22v0-btrfs_ssd_linear: 0 268615680 linear
x22v0-btrfstest: 0 268615680 thin 40734720 268615679
x22v0-tp_ssd: 0 2577285120 linear
x22v0-tp_big: 0 22164602880 linear
x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
x22v0-tmob2: 0 1048657920 linear
x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_big_tdata: 0 4295147520 linear
x22v0-tp_big_tdata: 4295147520 17869455360 linear
x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
x22v0-tmob1: 0 1048657920 linear
x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
x22v0-touk--elk4: 0 839024640 linear
x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
x22v0-k8sdkr: 0 268615680 linear
x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
x22v0-rocket: 0 167854080 thin 163602432 167854079
x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
x22v0-roger2: 0 134430720 thin 33014784 134430719
x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647
Also lvs -a with segment ranges:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
[lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
[tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
[tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
[tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
[tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
[tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
[tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
[tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
[tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
[tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
[tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
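As a side note, the Data%/Meta% columns in the lvs output can be cross-checked against the block counts in the dmsetup thin-pool status lines - e.g. for tp_ssd, 629535/838960 data blocks is the 75.04% that lvs reports. A small sketch of that cross-check; the field positions are inferred from the status lines shown here:

```python
def thin_pool_usage(status_line):
    """Parse a `dmsetup status` thin-pool line such as:
    "x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw ..."
    and return (data_used_pct, meta_used_pct)."""
    fields = status_line.split()
    # fields: name, start sector, length, "thin-pool", transaction id,
    #         used/total metadata blocks, used/total data blocks, ...
    assert fields[3] == "thin-pool"
    meta_used, meta_total = (int(x) for x in fields[5].split("/"))
    data_used, data_total = (int(x) for x in fields[6].split("/"))
    return (100.0 * data_used / data_total, 100.0 * meta_used / meta_total)
```

Feeding it the tp_ssd line above yields roughly (75.04, 1.61), matching the lvs Data%/Meta% figures.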
>
>>> Are the LVs split across RAID5 PVs by any chance?
>
> Michal> raid5s are used as PVs, but a single logical volume always uses only
> Michal> one physical volume underneath (if that's what you meant by split across).
>
> Ok, that's what I was asking about. It shouldn't matter... but just
> trying to chase down the details.
>
>
>>> It's not clear if you can replicate the problem without using
>>> lvm-thin, but that's what I suspect you might be having problems with.
>
> Michal> I'll be trying to do that, though the heavier tests will have to wait
> Michal> until I move all VMs to other hosts (as that is/was our production machine).
>
> Sure, makes sense.
>
>>> Can you give us the versions of your tools, and exactly how you
>>> set up your test cases? How long does it take to find the problem?
Regarding this, currently:
kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
mdadm - v4.1 - 2018-10-01
>
> Michal> Will get all the details tomorrow (the host is on up to date debian
> Michal> buster, the VMs are a mix of archlinuxes and debians (and the issue
> Michal> happened on both)).
>
> Michal> As for how long, it's hit and miss. Sometimes writing and reading back
> Michal> a ~16gb file fails (the checksum read back differs from what was written)
> Michal> after 2-3 tries. That's on the host.
>
> Michal> On the guest, it's been (so far) a guaranteed thing when we were
> Michal> creating a very large tar file (900gb+). For the past two weeks we were
> Michal> unable to create that file without errors even once.
>
> Ouch! That's not good. Just to confirm, these corruptions are all in
> a thin-lv based filesystem, right? I'd be interested to know if you
> can create another plain LV and cause the same error. Trying to
> simplify the potential problems.
I have been trying to - but so far I haven't managed to replicate this with:
- a physical partition
- filesystem directly on a physical partition
- filesystem directly on mdraid
- filesystem directly on a linear volume
Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I have also had lengthy periods of attempts to cause the corruption on some thin volume without any success. But the runs that did fail had this in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect.
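Since the 4 KiB size doesn't line up with anything in the stack, one thing worth checking is whether the bad pages cluster at any particular position within a thin chunk or raid stripe. A sketch, assuming the geometry from this thread (1.5 MiB thin chunks, 512 KiB raid chunks, 4-disk raid5 i.e. 3 data chunks per stripe) and ignoring the LV's own start offset on the md device (which the seg ranges in lvs would give):

```python
PAGE = 4096

def locate_page(byte_offset, thin_chunk=1536 * 1024,
                raid_chunk=512 * 1024, data_disks=3):
    """Map a corrupted byte offset within an LV to its position relative
    to the thin-pool chunk and md raid5 stripe geometry."""
    stripe = raid_chunk * data_disks  # 1.5 MiB with these defaults
    in_stripe = byte_offset % stripe
    return {
        "page_index": byte_offset // PAGE,
        "thin_chunk_index": byte_offset // thin_chunk,
        "offset_in_thin_chunk": byte_offset % thin_chunk,
        "stripe_index": byte_offset // stripe,
        "offset_in_stripe": in_stripe,
        "chunk_in_stripe": in_stripe // raid_chunk,
    }
```

Running this over the offsets of all bad pages from several failed runs would show whether they fall, say, only at chunk boundaries or are spread uniformly - either answer narrows down the layer.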
>
>
>>> Can you compile the newst kernel and newest thin tools and try them
>>> out?
>
> Michal> I can, but a bit later (once we move VMs out of the host).
>
>>>
>>> How long does it take to replicate the corruption?
>>>
>
> Michal> When it happens, it's usually a few tries of writing a 16gb file
> Michal> with random patterns and reading it back (directly on the host). The
> Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
> Michal> after the machine's reboot).
>
>>> Sorry for all the questions, but until there's a test case which is
>>> repeatable, it's going to be hard to chase this down.
>>>
>>> I wonder if running 'fio' tests would be something to try?
>>>
>>> And also changing your RAID5 setup to use the default stride and
>>> stripe widths, instead of the large values you're using.
>
> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> Michal> chunk). LVM on top is using much longer extents (as we don't really need
> Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> Michal> to raid's stripe.
>
>>>
>>> Good luck!
>>>
> Roger> I have not as of yet seen write corruption (except when a vendor's disk
> Roger> was resetting and it was lying about having written the data prior to
> Roger> the crash - these were ssds; if your disk write cache is on and you
> Roger> have a disk reset this can also happen), and have not seen "lost
> Roger> writes" otherwise, but I would expect the 2 read corruptions I have seen
> Roger> to also be able to cause write issues. So for that, look for scsi
> Roger> notifications for disk resets that should not happen.
>>>
> Roger> I have had a "bad" controller cause read corruptions; those
> Roger> corruptions would move around, and replacing the controller resolved it,
> Roger> so there may be a lack of error checking "inside" some paths in the
> Roger> card. Luckily I had a number of these controllers and had cold spares
> Roger> for them. The giveaway here was 2 separate buses with almost
> Roger> identical load with 6 separate disks each, and all 12 disks on 2 buses
> Roger> had between 47-52 scsi errors, which points to the only component
> Roger> shared (the controller).
>>>
> [...]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
@ 2020-05-08 16:10 ` John Stoffel
-1 siblings, 0 replies; 20+ messages in thread
From: John Stoffel @ 2020-05-08 16:10 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID, linux-lvm, dm-devel
>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
And of course it should also go to dm-devel@redhat.com, my fault for
not including that as well. I strongly suspect it's a thin-lv
problem somewhere, but I don't know enough to help chase down the
problem in detail.
John
Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.
Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>>
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>>
>>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>>> made. And does it ever happen with *smaller* files? What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>>
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>>
Michal> Yes, the errors were also found on small files.
>>
>> Those errors are in small files inside the VM, which is running btrfs
>> ontop of block storage provided by your thin-lv, right?
>>
Michal> Yea, the small files were in this case on that thin-lv.
Michal> We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on - so far - 2 layers).
>>
>>
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption
Michal> Those particular guests, yea. In the host case it's just without the "guest" step.
Michal> But (so far) all corruption has gone via one of the lv-thin layers (and via one of the md raids).
>>
>>
Michal> Since then we recreated the issue directly on the host, just
Michal> by making ext4 filesystem on some LV, then doing write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always a full 4KiB chunks (always
Michal> same content, always same position).
>>
>> What position? Is it a 4k, 1.5m or some other consistent offset? And
>> how far into the file? And this LV is a plain LV or a thin-lv? I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you. Can you provide the output of:
>>
Michal> What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) that look like correct pieces belonging somewhere else.
>> /sbin/lvs --version
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>>
>> too?
>>
>> Can you post your:
>>
>> /sbin/dmsetup status
>>
>> output too? There's a better command to use here, but I'm not an
>> expert. You might really want to copy this over to the
>> linux-lvm@redhat.com mailing list as well.
Michal> [...]
>>
>>>> Are the LVs split across RAID5 PVs by any chance?
>>
Michal> raid5s are used as PVs, but a single logical volume always uses one only
Michal> one physical volume underneath (if that's what you meant by split across).
>>
>> Ok, that's what I was asking about. It shouldn't matter... but just
>> trying to chase down the details.
>>
>>
>>>> It's not clear if you can replicate the problem without using
>>>> lvm-thin, but that's what I suspect you might be having problems with.
>>
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machnie).
>>
>> Sure, makes sense.
>>
>>>> Can you give us the versions of the your tools, and exactly how you
>>>> setup your test cases? How long does it take to find the problem?
Michal> Regarding this, currently:
Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> mdadm - v4.1 - 2018-10-01
>>
Michal> Will get all the details tommorow (the host is on up to date debian
Michal> buster, the VMs are mix of archlinuxes and debians (and the issue
Michal> happened on both)).
>>
Michal> As for how long, it's a hit and miss. Sometimes writing and reading back
Michal> ~16gb file fails (the cheksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
>>
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating very large tar file (900gb+). As for past two weeks we were
Michal> unable to create that file without errors even once.
>>
>> Ouch! That's not good. Just to confirm, these corruptions are all in
>> a thin-lv based filesystem, right? I'd be interested to know if you
>> can create another plain LV and cause the same error. Trying to
>> simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with:
Michal> - a physical partition
Michal> - filesystem directly on a physical partition
Michal> - filesystem directly on mdraid
Michal> - filesystem directly on a linear volume
Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - as I also had lengthy period of attempts to cause corruption on some thin volume w/o any successes either. But the ones that failed had those in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect
>>
>>
>>>> Can you compile the newst kernel and newest thin tools and try them
>>>> out?
>>
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>>>>
>>>> How long does it take to replicate the corruption?
>>>>
>>
Michal> When it happens, it's usually few tries tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after machine's reboot).
>>
>>>> Sorry for all the questions, but until there's a test case which is
>>>> repeatable, it's going to be hard to chase this down.
>>>>
>>>> I wonder if running 'fio' tests would be something to try?
>>>>
>>>> And also changing your RAID5 setup to use the default stride and
>>>> stripe widths, instead of the large values you're using.
>>
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
Michal> to raid's stripe.
>>
>>>>
>>>> Good luck!
>>>>
Roger> I have not as of yet seen write corruption (except when a vendors disk
Roger> was resetting and it was lying about having written the data prior to
Roger> the crash, these were ssds, if your disk write cache is on and you
Roger> have a disk reset this can also happen), but have not seen "lost
Roger> writes" otherwise, but would expect the 2 read corruption I have seen
Roger> to also be able to cause write issues. So for that look for scsi
Roger> notifications for disk resets that should not happen.
>>>>
Roger> I have had a "bad" controller cause read corruptions, those
Roger> corruptions would move around, replacing the controller resolved it,
Roger> so there may be lack of error checking "inside" some paths in the
Roger> card. Lucky I had a number of these controllers and had cold spares
Roger> for them. The give away here was 2 separate buses with almost
Roger> identical load with 6 separate disks each and all12 disks on 2 buses
Roger> had between 47-52 scsi errors, which points to the only component
Roger> shared (the controller).
>>>>
Roger> The backplane and cables are unlikely in general cause this, there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>>>
Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133
Roger> cause random read corruptions, lowering speed to 100 fixed it), this
Roger> one was duplicated on multiple identical pieces of hw with all
Roger> different parts on the duplication machine.
>>>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page, and I saw it 20years ago, it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>>>
Roger> So one question, have you seen the corruption in a path that would
Roger> rely on one controller, or all corruptions you have seen involving
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to replace it and see if it continues.
>>>>
>>>>
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>>
>>>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>>
>>>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.
>>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>>>>
>>>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>>>>
>>>>> With a doze of testing we managed to roughly rule out the following elements as being the cause:
>>>>>>
>>>>> - qemu/kvm (issue occured directly on host)
>>>>> - backplane (issue occured on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as a above, two different cables)
>>>>> - memory (unlikely - ECC for once, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occured on plain mdraid configuration as well)
>>>>> - disks themselves (issue occured on two separate mdadm arrays)
>>>>> - filesystem (issue occured on both btrfs and ext4 (checksumed manually) )
>>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>>
>>>>> - lvm thin (issue always - so far - occured on lvm thin pools)
>>>>> - mdraid (issue always - so far - on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply rather already longstanding bug somewhere)
>>>>>>
>>>>> And finally - so far - the issue never occured:
>>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on linear lvm volume on top of mdraid
>>>>>>
>>>>> As far as the issue goes it's:
>>>>>>
>>>>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to few dozens of such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>>
>>>>> The 4kb is well, weird ? Doesn't really matter any chunk/stripes sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.
>>>>
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [linux-lvm] [general question] rare silent data corruption when writing data
@ 2020-05-08 16:10 ` John Stoffel
0 siblings, 0 replies; 20+ messages in thread
From: John Stoffel @ 2020-05-08 16:10 UTC (permalink / raw)
To: Michal Soltys; +Cc: Linux RAID, Roger Heflin, dm-devel, linux-lvm
>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
And of course it should also go to dm-devel@redhat.com, my fault for
not including that as well. I strongly suspect it's a thin-lv
problem somewhere, but I don't know enough to help chase down the
problem in detail.
John
Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.
Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>>
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>>
>>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>>> made. And does it ever happen with *smaller* files? What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>>
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>>
Michal> Yes, the errors were also found on small files.
>>
>> Those errors are in small files inside the VM, which is running btrfs
>> ontop of block storage provided by your thin-lv, right?
>>
Michal> Yea, the small files were in this case on that thin-lv.
Michal> We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on (so far) 2 layers).
>>
>>
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption
Michal> Those particular guests, yea. In the host case it's just without the "guest" step.
Michal> But (so far) all corruption ended up going through one of the lv-thin layers (and one of the md raids).
>>
>>
Michal> Since then we recreated the issue directly on the host, just
Michal> by making an ext4 filesystem on some LV, then doing a write with
Michal> checksum, sync, drop_caches, read and checksum check. The
Michal> errors are, as I mentioned - always full 4KiB chunks (always
Michal> same content, always same position).
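The write/sync/read-back procedure described above can be sketched roughly as follows. This is a hypothetical reconstruction of the test, not the exact script used in the thread; the drop_caches step needs root (`echo 3 > /proc/sys/vm/drop_caches`) and is deliberately omitted here, and the path is a placeholder.

```python
import hashlib
import os

def roundtrip_check(path, size_mb=16 * 1024, block=1 << 20):
    """Write random data with a running checksum, fsync, re-read,
    and compare checksums. Returns True when the data read back
    matches what was written (the thread used ~16 GB files)."""
    h_write = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb):          # size_mb blocks of 1 MiB
            buf = os.urandom(block)
            h_write.update(buf)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    # On the real host, caches would be dropped here before re-reading.
    h_read = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h_read.update(chunk)
    return h_write.hexdigest() == h_read.hexdigest()
```

Run against a file on the suspect thin LV (e.g. a mount of it), a False result reproduces the corruption.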
>>
>> What position? Is it a 4k, 1.5m or some other consistent offset? And
>> how far into the file? And this LV is a plain LV or a thin-lv? I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you. Can you provide the output of:
>>
Michal> What I meant is that it doesn't "move" when verifying the same file (aka different reads of the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) - that look like correct pieces belonging somewhere else.
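A small helper along these lines (hypothetical, not from the thread) can confirm that the mismatching pieces really are page-sized and page-aligned, by comparing the file read back against a known-good copy in 4 KiB steps:

```python
def diff_pages(path_a, path_b, page=4096):
    """Compare two files page by page and return the byte offsets of
    4 KiB pages that differ - useful for checking whether the corrupt
    pieces are page-aligned, as reported above."""
    bad = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(page), fb.read(page)
            if not a and not b:
                break
            if a != b:
                bad.append(offset)
            offset += page
    return bad
```

If the returned offsets are all multiples of 4096 and the damaged runs are exactly one page long, that supports the "always a 4 KiB piece" observation.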
>> /sbin/lvs --version
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>>
>> too?
>>
>> Can you post your:
>>
>> /sbin/dmsetup status
>>
>> output too?  There's a better command to use here, but I'm not an
>> expert.  You might really want to copy this over to the
>> linux-lvm@redhat.com mailing list as well.
Michal> x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
Michal> x22v0-tp_ssd_tdata: 0 2147696640 linear
Michal> x22v0-tp_ssd_tdata: 2147696640 429588480 linear
Michal> x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
Michal> x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
Michal> x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
Michal> x22v0-btrfsnopool: 0 134430720 linear
Michal> x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
Michal> x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
Michal> x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
Michal> x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
Michal> x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
Michal> x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
Michal> x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
Michal> x22v0-micron_vol: 0 268615680 linear
Michal> x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
Michal> x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
Michal> x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
Michal> x22v0-btrfs_ssd_linear: 0 268615680 linear
Michal> x22v0-btrfstest: 0 268615680 thin 40734720 268615679
Michal> x22v0-tp_ssd: 0 2577285120 linear
Michal> x22v0-tp_big: 0 22164602880 linear
Michal> x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
Michal> x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
Michal> x22v0-tmob2: 0 1048657920 linear
Michal> x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
Michal> x22v0-tp_big_tdata: 0 4295147520 linear
Michal> x22v0-tp_big_tdata: 4295147520 17869455360 linear
Michal> x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
Michal> x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
Michal> x22v0-tmob1: 0 1048657920 linear
Michal> x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
Michal> x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
Michal> x22v0-touk--elk4: 0 839024640 linear
Michal> x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
Michal> x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
Michal> x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
Michal> x22v0-k8sdkr: 0 268615680 linear
Michal> x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
Michal> x22v0-rocket: 0 167854080 thin 163602432 167854079
Michal> x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
Michal> x22v0-roger2: 0 134430720 thin 33014784 134430719
Michal> x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647
Michal> Also lvs -a with segment ranges:
Michal> LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
Michal> btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
Michal> btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
Michal> btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
Michal> btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
Michal> gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
Michal> gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
Michal> gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
Michal> gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
Michal> gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
Michal> gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
Michal> gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
Michal> gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
Michal> k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
Michal> [lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
Michal> micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
Michal> nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
Michal> nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
Michal> nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
Michal> nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
Michal> nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
Michal> nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
Michal> nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
Michal> openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
Michal> rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
Michal> roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
Michal> tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
Michal> tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
Michal> touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
Michal> tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
Michal> [tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
Michal> [tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
Michal> [tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
Michal> [tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
Michal> [tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
Michal> tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
Michal> [tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
Michal> [tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
Michal> [tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
Michal> [tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
Michal> [tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
>>
>>>> Are the LVs split across RAID5 PVs by any chance?
>>
Michal> raid5s are used as PVs, but a single logical volume always uses
Michal> only one physical volume underneath (if that's what you meant by split across).
>>
>> Ok, that's what I was asking about. It shouldn't matter... but just
>> trying to chase down the details.
>>
>>
>>>> It's not clear if you can replicate the problem without using
>>>> lvm-thin, but that's what I suspect you might be having problems with.
>>
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machine).
>>
>> Sure, makes sense.
>>
>>>> Can you give us the versions of your tools, and exactly how you
>>>> setup your test cases? How long does it take to find the problem?
Michal> Regarding this, currently:
Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> mdadm - v4.1 - 2018-10-01
>>
Michal> Will get all the details tomorrow (the host is on up to date debian
Michal> buster, the VMs are a mix of archlinuxes and debians (and the issue
Michal> happened on both)).
>>
Michal> As for how long, it's hit and miss. Sometimes writing and reading back
Michal> a ~16gb file fails (the checksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
>>
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating a very large tar file (900gb+). For the past two weeks we were
Michal> unable to create that file without errors even once.
>>
>> Ouch! That's not good. Just to confirm, these corruptions are all in
>> a thin-lv based filesystem, right? I'd be interested to know if you
>> can create another plain LV and cause the same error. Trying to
>> simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with:
Michal> - a physical partition
Michal> - filesystem directly on a physical partition
Michal> - filesystem directly on mdraid
Michal> - filesystem directly on a linear volume
Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I also had lengthy periods of attempting to cause corruption on some thin volume without any success. But the cases that failed had these in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect
>>
>>
>>>> Can you compile the newest kernel and newest thin tools and try them
>>>> out?
>>
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>>>>
>>>> How long does it take to replicate the corruption?
>>>>
>>
Michal> When it happens, it's usually a few tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on the host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after the machine's reboot).
>>
>>>> Sorry for all the questions, but until there's a test case which is
>>>> repeatable, it's going to be hard to chase this down.
>>>>
>>>> I wonder if running 'fio' tests would be something to try?
>>>>
>>>> And also changing your RAID5 setup to use the default stride and
>>>> stripe widths, instead of the large values you're using.
>>
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align
Michal> with) the raid's stripe.
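For reference, the alignment arithmetic behind those numbers works out as follows (assuming the 4-disk RAID5 layout described earlier in the thread):

```python
# Alignment check for the stack described above (values from the thread)
chunk_kib = 512                  # mdadm raid5 chunk (current default)
data_disks = 4 - 1               # 4-disk raid5 -> 3 data disks per stripe
stripe_kib = chunk_kib * data_disks
assert stripe_kib == 1536        # 1536 KiB = 1.5 MiB, the thin-pool chunk size

extent_kib = 120 * 1024          # 120m LVM extent size
assert extent_kib % stripe_kib == 0   # each extent is exactly 80 full stripes
```

So both the thin-pool chunk and the LVM extent are whole multiples of the raid stripe, consistent with "everything is cleanly aligned".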
>>
>>>>
>>>> Good luck!
>>>>
Roger> I have not as of yet seen write corruption (except when a vendor's disk
Roger> was resetting and it was lying about having written the data prior to
Roger> the crash - these were ssds; if your disk write cache is on and you
Roger> have a disk reset this can also happen). I have not seen "lost
Roger> writes" otherwise, but I would expect the 2 read corruptions I have seen
Roger> to also be able to cause write issues. So for that, look for scsi
Roger> notifications for disk resets that should not happen.
>>>>
Roger> I have had a "bad" controller cause read corruptions; those
Roger> corruptions would move around, and replacing the controller resolved it,
Roger> so there may be a lack of error checking "inside" some paths in the
Roger> card. Luckily I had a number of these controllers and had cold spares
Roger> for them. The giveaway here was 2 separate buses with almost
Roger> identical load, with 6 separate disks each, and all 12 disks on the 2 buses
Roger> had between 47-52 scsi errors, which points to the only component
Roger> shared (the controller).
>>>>
Roger> The backplane and cables are unlikely in general to cause this; there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>>>
Roger> I have had a pre-pcie bus (PCI-X, 2 slots shared, both set to 133)
Roger> cause random read corruptions - lowering the speed to 100 fixed it. This
Roger> one was duplicated on multiple identical pieces of hw, with all
Roger> different parts on the duplication machine.
>>>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush, which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page; I saw it 20 years ago and it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>>>
Roger> So one question: have you seen the corruption in a path that would
Roger> rely on one controller, or do all the corruptions you have seen involve
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to, replace it and see if it continues.
>>>>
>>>>
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>>
>>>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>>
>>>>> Thanks to btrfs's checksumming we discovered a somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have a precise 4kb (1 page) of incorrect data. The incorrect pieces of data look fine on their own - like something that was previously in that place, or written from the wrong source.
>>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>>>>
>>>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following elements as being the cause:
>>>>>>
>>>>> - qemu/kvm (issue occurred directly on host)
>>>>> - backplane (issue occurred on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occurred on plain mdraid configuration as well)
>>>>> - disks themselves (issue occurred on two separate mdadm arrays)
>>>>> - filesystem (issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>>
>>>>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>>>>> - mdraid (issue always - so far - occurred on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would rather imply an already longstanding bug somewhere)
>>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on linear lvm volume on top of mdraid
>>>>>>
>>>>> As far as the issue goes it's:
>>>>>>
>>>>> - always a 4kb chunk that is incorrect - in a ~1 tb file there can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like correct pieces of different or previous data
>>>>>>
>>>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions about what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.
>>>>
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
2020-05-08 3:44 ` Chris Murphy
@ 2020-05-10 19:05 ` Sarah Newman
2020-05-10 19:12 ` Sarah Newman
2020-05-20 21:40 ` Michal Soltys
1 sibling, 1 reply; 20+ messages in thread
From: Sarah Newman @ 2020-05-10 19:05 UTC (permalink / raw)
To: Chris Murphy, Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/7/20 8:44 PM, Chris Murphy wrote:
>
> I would change very little until you track this down, if the goal is
> to track it down and get it fixed.
>
> I'm not sure if LVM thinp is supported with LVM raid still, which if
> it's not supported yet then I can understand using mdadm raid5 instead
> of LVM raid5.
My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly
the size of a page sounds like a memory use-after-free bug or similar.
A debug kernel build with one or more of these options may find the problem:
CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
CONFIG_PAGE_POISONING + page_poison=1
CONFIG_KASAN
--Sarah
* Re: [general question] rare silent data corruption when writing data
2020-05-10 19:05 ` Sarah Newman
@ 2020-05-10 19:12 ` Sarah Newman
2020-05-11 9:41 ` Michal Soltys
0 siblings, 1 reply; 20+ messages in thread
From: Sarah Newman @ 2020-05-10 19:12 UTC (permalink / raw)
To: Chris Murphy, Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/10/20 12:05 PM, Sarah Newman wrote:
> On 5/7/20 8:44 PM, Chris Murphy wrote:
>>
>> I would change very little until you track this down, if the goal is
>> to track it down and get it fixed.
>>
>> I'm not sure if LVM thinp is supported with LVM raid still, which if
>> it's not supported yet then I can understand using mdadm raid5 instead
>> of LVM raid5.
>
>
> My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly
> the size of a page sounds like a memory use-after-free bug or similar.
>
> A debug kernel build with one or more of these options may find the problem:
>
> CONFIG_DEBUG_PAGEALLOC
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> CONFIG_PAGE_POISONING + page_poison=1
> CONFIG_KASAN
>
> --Sarah
And on further reflection you may as well add these:
CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
CONFIG_CRASH_DUMP (kdump)
+ anything else available. Basically turn debugging on all the way.
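For reference, the option names from both lists can be collected into a single kernel config fragment (a sketch only - how it gets merged into the tree's .config and rebuilt is left to whatever workflow you normally use):

```
# Fragment combining the suggested debug options. Note: CONFIG_PAGE_POISONING
# also needs page_poison=1 on the kernel command line, and CONFIG_CRASH_DUMP
# needs a crashkernel= reservation to actually capture a dump.
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_PAGE_POISONING=y
CONFIG_KASAN=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=y
CONFIG_CRASH_DUMP=y
```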
If you can reproduce reliably with these, then you can try the latest kernel with the same options and have some confidence the problem was
legitimately fixed.
--Sarah
* Re: [general question] rare silent data corruption when writing data
2020-05-10 19:12 ` Sarah Newman
@ 2020-05-11 9:41 ` Michal Soltys
2020-05-11 19:42 ` Sarah Newman
0 siblings, 1 reply; 20+ messages in thread
From: Michal Soltys @ 2020-05-11 9:41 UTC (permalink / raw)
To: Sarah Newman, Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/10/20 9:12 PM, Sarah Newman wrote:
> On 5/10/20 12:05 PM, Sarah Newman wrote:
>> On 5/7/20 8:44 PM, Chris Murphy wrote:
>>>
>>> I would change very little until you track this down, if the goal is
>>> to track it down and get it fixed.
>>>
>>> I'm not sure if LVM thinp is supported with LVM raid still, which if
>>> it's not supported yet then I can understand using mdadm raid5 instead
>>> of LVM raid5.
>>
>>
>> My apologies if this idea was considered and discarded already, but
>> the bug being hard to reproduce right after reboot, and the error being
>> exactly the size of a page, sounds like a memory use-after-free bug or
>> similar.
>>
>> A debug kernel build with one or more of these options may find the
>> problem:
>>
>> CONFIG_DEBUG_PAGEALLOC
>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
>> CONFIG_PAGE_POISONING + page_poison=1
>> CONFIG_KASAN
>>
>> --Sarah
>
> And on further reflection you may as well add these:
>
> CONFIG_DEBUG_OBJECTS
> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
> CONFIG_CRASH_DUMP (kdump)
>
> + anything else available. Basically turn debugging on all the way.
>
> If you can reproduce reliably with these, then you can try the latest
> kernel with the same options and have some confidence the problem was
> legitimately fixed.
>
After compiling the kernel with the above options enabled - and if this is
the underlying issue as you suspect - will it just pop up in dmesg if I hit
this bug, or do I need some extra tools/preparation/etc.?
* Re: [general question] rare silent data corruption when writing data
2020-05-11 9:41 ` Michal Soltys
@ 2020-05-11 19:42 ` Sarah Newman
0 siblings, 0 replies; 20+ messages in thread
From: Sarah Newman @ 2020-05-11 19:42 UTC (permalink / raw)
To: Michal Soltys, Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/11/20 2:41 AM, Michal Soltys wrote:
> On 5/10/20 9:12 PM, Sarah Newman wrote:
>> On 5/10/20 12:05 PM, Sarah Newman wrote:
>>> On 5/7/20 8:44 PM, Chris Murphy wrote:
>>>>
>>>> I would change very little until you track this down, if the goal is
>>>> to track it down and get it fixed.
>>>>
>>>> I'm not sure if LVM thinp is supported with LVM raid still, which if
>>>> it's not supported yet then I can understand using mdadm raid5 instead
>>>> of LVM raid5.
>>>
>>>
>>> My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot, and the error being exactly
>>> the size of a page, sounds like a memory use-after-free bug or similar.
>>>
>>> A debug kernel build with one or more of these options may find the problem:
>>>
>>> CONFIG_DEBUG_PAGEALLOC
>>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
>>> CONFIG_PAGE_POISONING + page_poison=1
>>> CONFIG_KASAN
>>>
>>> --Sarah
>>
>> And on further reflection you may as well add these:
>>
>> CONFIG_DEBUG_OBJECTS
>> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
>> CONFIG_CRASH_DUMP (kdump)
>>
>> + anything else available. Basically turn debugging on all the way.
>>
>> If you can reproduce reliably with these, then you can try the latest kernel with the same options and have some confidence the problem was
>> legitimately fixed.
>>
>
> After compiling the kernel with the above options enabled - and if this is the underlying issue as you suspect - will it just pop up in dmesg if I hit this
> bug, or do I need some extra tools/preparation/etc.?
>
I'm pretty sure that you can get everything you need from either dmesg or sysfs/debugfs. Be prepared for an oops or panic.
--Sarah
* Re: [general question] rare silent data corruption when writing data
2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
2020-05-07 18:24 ` Roger Heflin
@ 2020-05-13 6:31 ` Chris Dunlop
2020-05-13 17:49 ` John Stoffel
2020-05-20 20:29 ` Michal Soltys
1 sibling, 2 replies; 20+ messages in thread
From: Chris Dunlop @ 2020-05-13 6:31 UTC (permalink / raw)
To: Michal Soltys; +Cc: linux-raid
Hi,
On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote:
> Note: this is just general question - if anyone experienced something
> similar or could suggest how to pinpoint / verify the actual cause.
>
> Thanks to btrfs's checksumming we discovered somewhat (even if quite
> rare) nasty silent corruption going on on one of our hosts. Or perhaps
> "corruption" is not the correct word - the files simply have precise 4kb
> (1 page) of incorrect data. The incorrect pieces of data look on their
> own fine - as something that was previously in the place, or written
> from wrong source.
"Me too!"
We are seeing 256-byte corruptions which are always the last 256b of a 4K
block. The 256b is very often a copy of a "last 256b of 4k block" from
earlier on the file. We sometimes see multiple corruptions in the same
file, with each of the corruptions being a copy of a different 256b from
earlier on the file. The original 256b and the copied 256b aren't
identifiably at a regular offset from each other.
I'd be really interested to hear if your problem is just in the last 256b
of the 4k block also!
We haven't been able to track down the origin of any of the copies
where it's not a 256b block earlier in the file. I tried some extensive
analysis of some of these occurrences, including looking at files being
written around the same time, but wasn't able to identify where the data
came from. It could be the "last 256b of 4k block" from some other file
being written at the same time, or a non-256b aligned chunk, or indeed not
a copy of other file data at all.
See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@onthe.net.au/
We've been able to detect these corruptions via an md5sum calculated as
the files are generated, where a later md5sum doesn't match the original.
We regularly see the md5sum match soon after the file is written (seconds
to minutes), and then go "bad" after doing a "vmtouch -e" to evict the
file from memory. I.e. it looks like the problem is occurring somewhere on
the write path to disk. We can move the corrupt file out of the way and
regenerate the file, then use 'cmp -l' to see where the corruption[s] are,
and calculate md5 sums for each 256b block in the file to identify where
the 256b was copied from.
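That analysis (per-256b md5 sums over a regenerated good copy and the corrupt copy, then searching earlier in the file for the source of each bad block) can be sketched roughly as follows - the data here is synthetic, standing in for the real file pair:

```python
# Rough sketch of the analysis described above: given a known-good copy and a
# corrupt copy of the same file, report the offset of each differing 256-byte
# block and, where possible, the earlier offset whose good content the bad
# block duplicates. Inputs below are synthetic demo data, not real files.
import hashlib

BLK = 256

def find_corruptions(good: bytes, bad: bytes):
    """Return [(offset, source_offset_or_None)] for each differing block."""
    seen = {}       # md5 digest of each good block -> first offset it appears at
    results = []
    for off in range(0, len(good), BLK):
        g, b = good[off:off + BLK], bad[off:off + BLK]
        if g != b:
            # does the corrupt content match a block from earlier in the file?
            results.append((off, seen.get(hashlib.md5(b).digest())))
        seen.setdefault(hashlib.md5(g).digest(), off)
    return results

# demo: 16 distinct 256b blocks; block 12 clobbered by a copy of block 3
good = b"".join(bytes([i]) * BLK for i in range(16))
bad = bytearray(good)
bad[12 * BLK:13 * BLK] = good[3 * BLK:4 * BLK]
print(find_corruptions(good, bytes(bad)))   # -> [(3072, 768)]
```

In practice the same role is played by `cmp -l` to locate the damage plus one md5 per 256b block of the good copy to hunt for the source.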
The corruptions are far more likely to occur during a scrub, although we
have seen a few of them when not scrubbing. We're currently working around
the issue by scrubbing infrequently, and trying to schedule scrubs during
periods of low write load.
> The hardware is (can provide more detailed info of course):
>
> - Supermicro X9DR7-LN4F
> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to
> backplane)
> - 96 gb ram (ecc)
> - 24 disk backplane
>
> - 1 array connected directly to lsi controller (4 disks, mdraid5,
> internal bitmap, 512kb chunk)
> - 1 array on the backplane (4 disks, mdraid5, journaled)
> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro
> disks)
> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still
> fine intel ssds from DC 3500 series)
Ours is on similar hardware:
- Supermicro X8DTH-IF
- LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions)
- 192GB ECC RAM
- A mix of 12 and 24-bay expanders (some daisy chained: lsi-expander-expander)
We swapped the LSI HBA for another of the same model, the problem
persisted. We have a SAS9300 card on the way for testing.
> Raid 5 arrays are in lvm volume group, and the logical volumes are used
> by VMs. Some of the volumes are linear, some are using thin-pools (with
> metadata on the aforementioned intel ssds, in mirrored config). LVM uses
> large extent sizes (120m) and the chunk-size of thin-pools is set to
> 1.5m to match underlying raid stripe. Everything is cleanly aligned as
> well.
We're not using VMs nor lvm thin on this storage.
Our main filesystem is xfs + lvm + raid6 and this is where we've seen all
but one of these corruptions (70-100 since Mar 2018).
The problem has occurred on all md arrays under the lvm, on disks from
multiple vendors and models, and on disks attached to all expanders.
We've seen one of these corruptions with xfs directly on a hdd partition.
I.e. no mdraid or lvm involved. This fs is an order of magnitude or more less
utilised than the main fs in terms of data being written.
> We did not manage to rule out (though somewhat _highly_ unlikely):
>
> - lvm thin (issue always - so far - occurred on lvm thin pools)
> - mdraid (issue always - so far - on mdraid managed arrays)
> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels,
> happened with both - so it would imply a rather longstanding bug
> somewhere)
- we're not using lvm thin
- problem has occurred once on non-mdraid (xfs directly on a hdd partition)
- problem NOT seen on kernel 3.18.25
- problem seen on, so far, kernels 4.4.153 - 5.4.2
> And finally - so far - the issue never occurred:
>
> - directly on a disk
> - directly on mdraid
> - on linear lvm volume on top of mdraid
- seen once directly on disk (partition)
- we don't use mdraid directly
- our problem arises on linear lvm on top of mdraid (raid6)
> As far as the issue goes it's:
>
> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from
> a few to a few dozen such chunks
> - we also found (or rather btrfs scrub did) a few small damaged files as
> well
> - the chunks look like a correct piece of different or previous data
>
> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes
> anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin
> pools; mdraid - default 512kb chunks). It does nicely fit a page though
> ...
>
> Anyway, if anyone has any ideas or suggestions what could be happening
> (perhaps with this particular motherboard or vendor) or how to pinpoint
> the cause - I'll be grateful for any.
Likewise!
Cheers,
Chris
* Re: [general question] rare silent data corruption when writing data
2020-05-13 6:31 ` Chris Dunlop
@ 2020-05-13 17:49 ` John Stoffel
2020-05-14 0:39 ` Chris Dunlop
2020-05-20 20:29 ` Michal Soltys
1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-13 17:49 UTC (permalink / raw)
To: Chris Dunlop; +Cc: Michal Soltys, linux-raid
I wonder if this problem can be replicated on loop devices? Once
there's a way to cause it reliably, we can then start doing a
bisection of the kernel to try and find out where this is happening.
So far, it looks like it happens sometimes on bare RAID6 systems
without lv-thin in place, which is both good and bad. And without
using VMs on top of the storage either. So this helps narrow down the
cause.
Is there any info on the workload on these systems? Lots of small
files which are added/removed? Large files which are just written to
and not touched again?
I assume finding a bad file with corruption and then doing a cp of the
file keeps the same corruption?
>>>>> "Chris" == Chris Dunlop <chris@onthe.net.au> writes:
Chris> Hi,
Chris> On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote:
>> Note: this is just general question - if anyone experienced something
>> similar or could suggest how to pinpoint / verify the actual cause.
>>
>> Thanks to btrfs's checksumming we discovered somewhat (even if quite
>> rare) nasty silent corruption going on on one of our hosts. Or perhaps
>> "corruption" is not the correct word - the files simply have precise 4kb
>> (1 page) of incorrect data. The incorrect pieces of data look on their
>> own fine - as something that was previously in the place, or written
>> from wrong source.
Chris> "Me too!"
Chris> We are seeing 256-byte corruptions which are always the last 256b of a 4K
Chris> block. The 256b is very often a copy of a "last 256b of 4k block" from
Chris> earlier on the file. We sometimes see multiple corruptions in the same
Chris> file, with each of the corruptions being a copy of a different 256b from
Chris> earlier on the file. The original 256b and the copied 256b aren't
Chris> identifiably at a regular offset from each other.
Chris> I'd be really interested to hear if your problem is just in the last 256b
Chris> of the 4k block also!
Chris> We haven't been able to track down the origin of any of the copies
Chris> where it's not a 256b block earlier in the file. I tried some extensive
Chris> analysis of some of these occurrences, including looking at files being
Chris> written around the same time, but wasn't able to identify where the data
Chris> came from. It could be the "last 256b of 4k block" from some other file
Chris> being written at the same time, or a non-256b aligned chunk, or indeed not
Chris> a copy of other file data at all.
Chris> See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@onthe.net.au/
Chris> We've been able to detect these corruptions via an md5sum calculated as
Chris> the files are generated, where a later md5sum doesn't match the original.
Chris> We regularly see the md5sum match soon after the file is written (seconds
Chris> to minutes), and then go "bad" after doing a "vmtouch -e" to evict the
Chris> file from memory. I.e. it looks like the problem is occurring somewhere on
Chris> the write path to disk. We can move the corrupt file out of the way and
Chris> regenerate the file, then use 'cmp -l' to see where the corruption[s] are,
Chris> and calculate md5 sums for each 256b block in the file to identify where
Chris> the 256b was copied from.
Chris> The corruptions are far more likely to occur during a scrub, although we
Chris> have seen a few of them when not scrubbing. We're currently working around
Chris> the issue by scrubbing infrequently, and trying to schedule scrubs during
Chris> periods of low write load.
>> The hardware is (can provide more detailed info of course):
>>
>> - Supermicro X9DR7-LN4F
>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to
>> backplane)
>> - 96 gb ram (ecc)
>> - 24 disk backplane
>>
>> - 1 array connected directly to lsi controller (4 disks, mdraid5,
>> internal bitmap, 512kb chunk)
>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro
>> disks)
>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still
>> fine intel ssds from DC 3500 series)
Chris> Ours is on similar hardware:
Chris> - Supermicro X8DTH-IF
Chris> - LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions)
Chris> - 192GB ECC RAM
Chris> - A mix of 12 and 24-bay expanders (some daisy chained: lsi-expander-expander)
Chris> We swapped the LSI HBA for another of the same model, the problem
Chris> persisted. We have a SAS9300 card on the way for testing.
>> Raid 5 arrays are in lvm volume group, and the logical volumes are used
>> by VMs. Some of the volumes are linear, some are using thin-pools (with
>> metadata on the aforementioned intel ssds, in mirrored config). LVM uses
>> large extent sizes (120m) and the chunk-size of thin-pools is set to
>> 1.5m to match underlying raid stripe. Everything is cleanly aligned as
>> well.
Chris> We're not using VMs nor lvm thin on this storage.
Chris> Our main filesystem is xfs + lvm + raid6 and this is where we've seen all
Chris> but one of these corruptions (70-100 since Mar 2018).
Chris> The problem has occurred on all md arrays under the lvm, on disks from
Chris> multiple vendors and models, and on disks attached to all expanders.
Chris> We've seen one of these corruptions with xfs directly on a hdd partition.
Chris> I.e. no mdraid or lvm involved. This fs is an order of magnitude or more less
Chris> utilised than the main fs in terms of data being written.
>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>
>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>> - mdraid (issue always - so far - on mdraid managed arrays)
>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels,
>> happened with both - so it would imply a rather longstanding bug
>> somewhere)
Chris> - we're not using lvm thin
Chris> - problem has occurred once on non-mdraid (xfs directly on a hdd partition)
Chris> - problem NOT seen on kernel 3.18.25
Chris> - problem seen on, so far, kernels 4.4.153 - 5.4.2
>> And finally - so far - the issue never occurred:
>>
>> - directly on a disk
>> - directly on mdraid
>> - on linear lvm volume on top of mdraid
Chris> - seen once directly on disk (partition)
Chris> - we don't use mdraid directly
Chris> - our problem arises on linear lvm on top of mdraid (raid6)
>> As far as the issue goes it's:
>>
>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from
>> a few to a few dozen such chunks
>> - we also found (or rather btrfs scrub did) a few small damaged files as
>> well
>> - the chunks look like a correct piece of different or previous data
>>
>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes
>> anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin
>> pools; mdraid - default 512kb chunks). It does nicely fit a page though
>> ...
>>
>> Anyway, if anyone has any ideas or suggestions what could be happening
>> (perhaps with this particular motherboard or vendor) or how to pinpoint
>> the cause - I'll be grateful for any.
Chris> Likewise!
Chris> Cheers,
Chris> Chris
* Re: [general question] rare silent data corruption when writing data
2020-05-13 17:49 ` John Stoffel
@ 2020-05-14 0:39 ` Chris Dunlop
0 siblings, 0 replies; 20+ messages in thread
From: Chris Dunlop @ 2020-05-14 0:39 UTC (permalink / raw)
To: John Stoffel; +Cc: Michal Soltys, linux-raid
On Wed, May 13, 2020 at 01:49:10PM -0400, John Stoffel wrote:
> I wonder if this problem can be replicated on loop devices? Once
> there's a way to cause it reliably, we can then start doing a
> bisection of the kernel to try and find out where this is happening.
I ran a week or so of attempting to replicate the problem in a VM on loop
devices replicating the lvm/raid config, without success. Basically just
having a random bunch of 1-25 concurrent writers banging out middling to
largish files.
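That write/verify loop can be sketched as follows (worker count, file size, and paths are all illustrative; the real test also evicted the page cache with `vmtouch -e` before re-reading, which plain Python cannot portably do):

```python
# Minimal sketch of the reproduction attempt described above: several
# concurrent writers each write pseudo-random data, record an MD5 as the
# file is generated, then re-read and verify. Sizes/paths are illustrative.
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_and_verify(path: str, size: int, seed: int) -> bool:
    # deterministic pseudo-random content, built 32 bytes at a time
    data = hashlib.sha256(seed.to_bytes(8, "little")).digest() * (size // 32)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())        # push the data down the write path
    expected = hashlib.md5(data).hexdigest()
    with open(path, "rb") as f:     # real test: evict page cache first
        return hashlib.md5(f.read()).hexdigest() == expected

with tempfile.TemporaryDirectory() as d:
    with ThreadPoolExecutor(max_workers=8) as ex:
        futs = [ex.submit(write_and_verify, os.path.join(d, f"f{i}"), 1 << 20, i)
                for i in range(8)]
        ok = all(f.result() for f in futs)
print("all files verified" if ok else "MISMATCH DETECTED")
```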
The fact it wasn't replicable in that environment could be pointing
towards the LSI driver or hardware - or I simply wasn't able to match
the conditions well enough.
> So far, it looks like it happens sometimes on bare RAID6 systems
> without lv-thin in place, which is both good and bad. And without
> using VMs on top of the storage either. So this helps narrow down the
> cause.
Note: We don't have any bare RAID6 so I haven't seen it there: our main fs
is xfs on linear LVM on raid6 (6 x 11-disk sets), and we saw it once
on xfs directly on a HDD partition.
> Is there any info on the work load on these systems? Lots of small
> fils which are added/removed? Large files which are just written to
> and not touched again?
Large files written and not touched again. Most of the time 2-5 concurrent
writers but regularly (daily) up to 20-25 concurrent.
> I assume finding a bad file with corruption and then doing a cp of the
> file keeps the same corruption?
Yep.
* Re: [general question] rare silent data corruption when writing data
2020-05-13 6:31 ` Chris Dunlop
2020-05-13 17:49 ` John Stoffel
@ 2020-05-20 20:29 ` Michal Soltys
1 sibling, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-20 20:29 UTC (permalink / raw)
To: Chris Dunlop; +Cc: linux-raid
On 20/05/13 08:31, Chris Dunlop wrote:
> Hi,
>
>
> "Me too!"
>
> We are seeing 256-byte corruptions which are always the last 256b of a
> 4K block. The 256b is very often a copy of a "last 256b of 4k block"
> from earlier on the file. We sometimes see multiple corruptions in the
> same file, with each of the corruptions being a copy of a different 256b
> from earlier on the file. The original 256b and the copied 256b aren't
> identifiably at a regular offset from each other.
>
> I'd be really interested to hear if your problem is just in the last
> 256b of the 4k block also!
From what I have checked - in my case it has always been a full 4k page.
I'll follow the suggestion by Sarah in the other part of this thread and
enable pagealloc debug options and then put the machine/disks under load
- so I'll keep an eye if something like you described happens.
This will have to wait a bit though, as I have another bug to hunt as
well - the journaled raid refuses to assemble, so with Song's help I'm
chasing that issue first.
If not for btrfs, we probably would have been using the machine happily
until now (blaming occasional detected issues on userspace stuff,
usually some fat java mess).
Thanks for the detailed explanation of what happened in your case (and the
span of kernel versions in which it does happen is scary). The hardware
indeed looks strikingly similar.
* Re: [general question] rare silent data corruption when writing data
2020-05-08 3:44 ` Chris Murphy
2020-05-10 19:05 ` Sarah Newman
@ 2020-05-20 21:40 ` Michal Soltys
1 sibling, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-20 21:40 UTC (permalink / raw)
To: Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID
Sorry for the delayed reply, I have had some rather busy weeks.
On 20/05/08 05:44, Chris Murphy wrote:
>
> The 4KiB chunk. What are the contents? Is it definitely guest VM data?
> Or is it sometimes file system metadata? How many corruptions have
> happened? The file system metadata is quite small compared to data.
I haven't looked that precisely (and it would be hard to tell in quite a
few cases) - but I'll keep that in mind when I resume chasing this bug.
> But if there have been many errors, we'd expect if it's caused on the
> host, that eventually file system metadata is corrupted. If it's
> definitely only data, that's curious and maybe implicates something
> going on in the guest.
As far as metadata goes, so far I haven't seen those - as far as e2fsck
on ext4 and btrfs scrub on btrfs could tell. Though in the ext4 case I
haven't run it that many times - so good point, I'll include fsck after
each round.
>
> Btrfs, whether normal reads or scrubs, will report the path to the
> affected file, for data corruption. Metadata corruption errors
> sometimes have inode references, but not a path to a file.
>
Btrfs pointed to file contents only, so far.
>
>> >
>> > Are the LVs split across RAID5 PVs by any chance?
>>
>> raid5s are used as PVs, but a single logical volume always uses one only
>> one physical volume underneath (if that's what you meant by split across).
>
> It might be a bit suboptimal. A single 4KiB block write in the guest,
> turns into a 4KiB block write in the host's LV. That in turn trickles
> down to md, which has a 512KiB x 4 drive stripe. So a single 4KiB
> write translates into a 2M stripe write. There is an optimization for
> raid5 in the RMW case, where it should be true only 4KiB data plus
> 4KiB parity is written (partial strip/chunk write); I'm not sure about
> reads.
Well, I didn't play with the current defaults too much - aside from a large
stripe_cache_size plus the raid running under a 2x ssd write-back journal
(which unfortunately became another issue - there is another thread
where I'm chasing that bug).
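The geometry in the quoted paragraph can be sanity-checked with a little arithmetic (assumed: 4-disk mdraid5 with the default 512 KiB chunk, so each stripe holds 3 data chunks = 1.5 MiB of data plus one parity chunk - the size the thin-pool chunk was matched to):

```python
# Back-of-envelope check of the raid5 stripe geometry discussed above.
CHUNK = 512 * 1024
DATA_DISKS = 3                      # 4 disks minus one chunk of parity per stripe
STRIPE_DATA = CHUNK * DATA_DISKS    # 1.5 MiB of data per stripe

def locate(offset: int):
    """Map a logical byte offset to (stripe index, data chunk within stripe)."""
    return offset // STRIPE_DATA, (offset % STRIPE_DATA) // CHUNK

# A single 4 KiB write at logical offset 5 MiB touches one chunk of stripe 3;
# without the partial-write RMW optimisation that means reading/rewriting a
# whole 2 MiB (data + parity) stripe.
print(locate(5 * 1024 * 1024))   # -> (3, 1)
```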
>
>> > It's not clear if you can replicate the problem without using
>> > lvm-thin, but that's what I suspect you might be having problems with.
>> >
>>
>> I'll be trying to do that, though the heavier tests will have to wait
>> until I move all VMs to other hosts (as that is/was our production machine).
>
> By default Btrfs uses a 16KiB block size for leaves and nodes. It's
> still a tiny foot print compared to data writes, but if LVM thin is a
> suspect, it really should just be a matter of time before file system
> corruption happens. If it doesn't, that's useful information. It
> probably means it's not LVM thin. But then what?
>
>> As for how long, it's hit and miss. Sometimes writing and reading back
>> a ~16gb file fails (the checksum read back differs from what was written)
>> after 2-3 tries. That's on the host.
>>
>> On the guest, it's been (so far) a guaranteed thing when we were
>> creating a very large tar file (900gb+). For the past two weeks we have been
>> unable to create that file without errors even once.
>
> It's very useful to have a consistent reproducer. You can do metadata
> only writes on Btrfs by doing multiple back to back metadata only
> balance. If the problem really is in the write path somewhere, this
> would eventually corrupt the metadata - it would be detected during
> any subsequent balance or scrub. 'btrfs balance start -musage=100
> /mountpoint' will do it.
Will do that too.
>
> This reproducer. It only reproduces in the guest VM? If you do it in
> the host, otherwise exactly the same way with all the exact same
> versions of everything, and it does not reproduce?
>
I did reproduce the issue on the host (both on ext4 and btrfs). The host
has slightly different versions of the kernel and tools, but otherwise the
same stuff as one of the guests in which I was testing it.
>>
>> >
>> > Can you compile the newest kernel and newest thin tools and try them
>> > out?
>>
>> I can, but a bit later (once we move VMs out of the host).
>>
>> >
>> > How long does it take to replicate the corruption?
>> >
>>
>> When it happens, it's usually a few tries of writing a 16gb file
>> with random patterns and reading it back (directly on the host). The
>> irritating thing is that it can be somewhat hard to reproduce (e.g.
>> after a machine reboot).
>
> Reading it back on the host. So you've shut down the VM, and you're
> mounting what was the guests VM's backing disk, on the host to do the
> verification. There's never a case of concurrent usage between guest
> and host?
The host tests were on fresh filesystems on fresh lvm volumes (and
I hit the issue on two different thin pools). The issue was also reproduced
on the host when all guests were turned off.
>
>
>>
>> > Sorry for all the questions, but until there's a test case which is
>> > repeatable, it's going to be hard to chase this down.
>> >
>> > I wonder if running 'fio' tests would be something to try?
>> >
>> > And also changing your RAID5 setup to use the default stride and
>> > stripe widths, instead of the large values you're using.
>>
>> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
>> chunk). LVM on top is using much longer extents (as we don't really need
>> 4mb granularity) and the lvm-thin chunks were set to match (and align)
>> to raid's stripe.
>
> I would change very little until you track this down, if the goal is
> to track it down and get it fixed.
>
Yea, I'm keeping the stuff as is (and will try Sarah's suggestions with
debug options as well).
> I'm not sure if LVM thinp is supported with LVM raid still, which if
> it's not supported yet then I can understand using mdadm raid5 instead
> of LVM raid5.
>
It probably is, but while direct dmsetup exposes a few knobs (e.g. it
allows setting up a journal), lvm doesn't allow much besides the chunk size.
That was the primary reason I stuck with native mdadm.
end of thread, other threads:[~2020-05-20 21:40 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
2020-05-07 18:24 ` Roger Heflin
2020-05-07 21:01 ` John Stoffel
2020-05-07 22:33 ` Michal Soltys
2020-05-08 0:54 ` John Stoffel
2020-05-08 11:10 ` Michal Soltys
2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
2020-05-08 16:10 ` John Stoffel
2020-05-08 16:10 ` [linux-lvm] " John Stoffel
2020-05-08 3:44 ` Chris Murphy
2020-05-10 19:05 ` Sarah Newman
2020-05-10 19:12 ` Sarah Newman
2020-05-11 9:41 ` Michal Soltys
2020-05-11 19:42 ` Sarah Newman
2020-05-20 21:40 ` Michal Soltys
2020-05-07 22:13 ` Michal Soltys
2020-05-13 6:31 ` Chris Dunlop
2020-05-13 17:49 ` John Stoffel
2020-05-14 0:39 ` Chris Dunlop
2020-05-20 20:29 ` Michal Soltys