* [general question] rare silent data corruption when writing data
@ 2020-05-07 17:30 Michal Soltys
2020-05-07 18:24 ` Roger Heflin
2020-05-13 6:31 ` Chris Dunlop
0 siblings, 2 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 17:30 UTC (permalink / raw)
To: linux-raid
Note: this is just a general question - if anyone has experienced something similar or could suggest how to pinpoint / verify the actual cause.
Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply contain a precise 4 KiB (one page) of incorrect data. The incorrect pieces of data look fine on their own - like something that was previously in that place, or that was written from the wrong source.
The hardware is (can provide more detailed info of course):
- Supermicro X9DR7-LN4F
- onboard LSI SAS2308 controller (2 SFF-8087 connectors, 1 connected to backplane)
- 96 GB RAM (ECC)
- 24-disk backplane
- 1 array connected directly to the LSI controller (4 disks, mdraid5, internal bitmap, 512 KiB chunk)
- 1 array on the backplane (4 disks, mdraid5, journaled)
- journal for the above array: mdraid1 over 2 SSDs (Micron 5300 Pro)
- 1 btrfs raid1 boot array on the motherboard's SATA ports (older but still fine Intel SSDs from the DC 3500 series)
Raid 5 arrays are in an LVM volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some use thin pools (with metadata on the aforementioned Intel SSDs, in a mirrored config). LVM
uses large extent sizes (120 MiB) and the chunk size of the thin pools is set to 1.5 MiB to match the underlying raid stripe. Everything is cleanly aligned as well.
With a fair dose of testing we managed to roughly rule out the following elements as the cause:
- qemu/kvm (the issue occurred directly on the host)
- backplane (the issue occurred on disks connected directly via the LSI's 2nd connector)
- cable (as above, with two different cables)
- memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
- mdadm journaling (the issue occurred on a plain mdraid configuration as well)
- the disks themselves (the issue occurred on two separate mdadm arrays)
- filesystem (the issue occurred on both btrfs and ext4 (checksummed manually))
We did not manage to rule out (though all seem _highly_ unlikely):
- lvm thin (the issue has always - so far - occurred on lvm thin pools)
- mdraid (the issue has always - so far - occurred on mdraid-managed arrays)
- kernel (tested with - in this case - debian's 5.2 and 5.4 kernels; it happened with both, which would imply an already longstanding bug somewhere)
And finally - so far - the issue has never occurred:
- directly on a disk
- directly on mdraid
- on linear lvm volume on top of mdraid
As far as the issue goes:
- it's always a 4 KiB chunk that is incorrect - in a ~1 TB file there can be from a few to a few dozen such chunks
- we also found (or rather btrfs scrub did) a few small damaged files as well
- the chunks look like a correct piece of different or previous data
The 4 KiB size is, well, weird. It doesn't match any chunk/stripe size anywhere across the stack (lvm - 120 MiB extents, 1.5 MiB chunks on thin pools; mdraid - default 512 KiB chunks). It does nicely fit a page, though ...
Anyway, if anyone has any ideas or suggestions as to what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.
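For reference, a minimal sketch of the write-then-verify check we run; the target path and size are placeholders to point at the suspect volume:

```shell
# Write random data, checksum it, drop the page cache (if we can, needs
# root), re-read from disk and compare. Paths/sizes are placeholders.
TARGET="${1:-/tmp/corruption-test.bin}"
SIZE_MB="${2:-8}"

set -e
dd if=/dev/urandom of="$TARGET" bs=1M count="$SIZE_MB" conv=fsync 2>/dev/null
sum_written=$(sha256sum "$TARGET" | cut -d' ' -f1)

sync
# Only root can drop caches; skip silently otherwise.
[ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches

sum_read=$(sha256sum "$TARGET" | cut -d' ' -f1)
if [ "$sum_written" = "$sum_read" ]; then
    echo "OK"
else
    echo "MISMATCH: wrote $sum_written, read back $sum_read"
fi
```

Run in a loop against a file on the suspect LV; on healthy hardware every pass should print OK.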
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
@ 2020-05-07 18:24 ` Roger Heflin
2020-05-07 21:01 ` John Stoffel
2020-05-07 22:13 ` Michal Soltys
2020-05-13 6:31 ` Chris Dunlop
1 sibling, 2 replies; 20+ messages in thread
From: Roger Heflin @ 2020-05-07 18:24 UTC (permalink / raw)
To: Michal Soltys; +Cc: Linux RAID
Have you tried the same file 2x and verified the corruption is in the
same places and looks the same?
I have not as of yet seen write corruption, except when a vendor's disk
was resetting and lying about having written the data prior to the
crash (these were SSDs; if your disk write cache is on and you have a
disk reset, this can also happen). I have not seen "lost writes"
otherwise, but I would expect the two read corruptions I have seen to
also be able to cause write issues. So for that, look for SCSI
notifications about disk resets that should not happen.
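A quick sketch of what to grep the kernel log for; the pattern is an assumption to adjust for the driver's actual wording (mpt2sas/mpt3sas for an LSI SAS2308):

```shell
# Look for disk reset / abort / link error messages in the kernel log.
# The pattern is a guess - extend it for your driver's message wording.
pat='reset|task abort|attempting device reset|log_info|I/O error|link.*down'
dmesg | grep -iE "$pat" || echo "no matching messages"
```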
I have had a "bad" controller cause read corruptions; those
corruptions would move around, and replacing the controller resolved
it, so there may be a lack of error checking "inside" some paths in
the card. Luckily I had a number of these controllers and had cold
spares for them. The giveaway there was 2 separate buses with almost
identical load, 6 separate disks on each, and all 12 disks on the 2
buses had between 47-52 SCSI errors - which points to the only shared
component (the controller).
The backplane and cables are unlikely in general to cause this; there
is too much error checking between the controller and the disk, from
what I know.
I have had a pre-PCIe bus (PCI-X, 2 slots shared, both set to 133)
cause random read corruptions; lowering the speed to 100 fixed it.
This one was duplicated on multiple identical pieces of hw with all
different parts on the duplicating machine.
I have also seen lost writes (from software) because someone did a
seek without doing a flush, which in some versions of the libs loses
the unfilled block when the seek happens (this is noted in the man
page; I saw it 20 years ago and it is still noted there, so no idea
if it was ever fixed). So has more than one application been noted to
see the corruption?
So one question: have you seen the corruption on a path that relies
on only one controller, or do all the corruptions you have seen
involve more than one controller? Isolate and test each controller if
you can, or if you can afford to, replace it and see if the issue
continues.
* Re: [general question] rare silent data corruption when writing data
2020-05-07 18:24 ` Roger Heflin
@ 2020-05-07 21:01 ` John Stoffel
2020-05-07 22:33 ` Michal Soltys
2020-05-07 22:13 ` Michal Soltys
1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-07 21:01 UTC (permalink / raw)
To: Roger Heflin; +Cc: Michal Soltys, Linux RAID
>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
Are these 1 TB files VMDK or COW images of VMs? How are these files
made? And does it ever happen with *smaller* files? What about if
you just use a sparse 2 TB file and write blocks out past 1 TB to see
if there's a problem?
Are the LVs split across RAID5 PVs by any chance?
It's not clear if you can replicate the problem without using
lvm-thin, but that's what I suspect you might be having problems with.
Can you give us the versions of your tools, and exactly how you set
up your test cases? How long does it take to find the problem?
Can you compile the newest kernel and newest thin tools and try them
out?
How long does it take to replicate the corruption?
Sorry for all the questions, but until there's a test case which is
repeatable, it's going to be hard to chase this down.
I wonder if running 'fio' tests would be something to try?
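Something like this fio job would write data with embedded checksums and verify it on re-read; it's only a sketch, and the filename, size and ioengine are placeholders to adjust:

```ini
; sketch of a data-integrity fio job - point filename at the suspect LV
[verify-job]
filename=/mnt/test/fio-verify.bin
size=16g
bs=4k
rw=write
direct=1
ioengine=libaio
verify=crc32c
verify_fatal=1
do_verify=1
```

Save as e.g. verify.fio and run `fio verify.fio`; with verify_fatal=1 it stops at the first mismatching block and reports its offset.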
And also changing your RAID5 setup to use the default stride and
stripe widths, instead of the large values you're using.
Good luck!
* Re: [general question] rare silent data corruption when writing data
2020-05-07 18:24 ` Roger Heflin
2020-05-07 21:01 ` John Stoffel
@ 2020-05-07 22:13 ` Michal Soltys
1 sibling, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 22:13 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid
On 20/05/07 20:24, Roger Heflin wrote:
> Have you tried the same file 2x and verified the corruption is in the
> same places and looks the same?
Yes, both with direct tests on the host and with btrfs scrub failing on
the same files in exactly the same places. Always full 4 KiB chunks.
>
> I have not as of yet seen write corruption (except when a vendors disk
> was resetting and it was lying about having written the data prior to
> the crash, these were ssds, if your disk write cache is on and you
> have a disk reset this can also happen), but have not seen "lost
> writes" otherwise, but would expect the 2 read corruption I have seen
> to also be able to cause write issues. So for that look for scsi
> notifications for disk resets that should not happen.
>
When I was doing a simple test that basically was:

while .....; do
    rng=$(hexdump ..... /dev/urandom)
    dcfldd hash=md5 textpattern=$rng of=/dst/test bs=262144 count=$((16*4096))
    sync
    echo 1 > /proc/sys/vm/drop_caches
    dcfldd hash=md5 if=/dst/test of=/dev/null .....
    compare_hashes_and_stop_if_different
done
There were no worrisome reset or similar entries in dmesg.
> I have had a "bad" controller cause read corruptions; those
> corruptions would move around, and replacing the controller resolved
> it, so there may be a lack of error checking "inside" some paths in
> the card. Luckily I had a number of these controllers and had cold
> spares for them. The giveaway there was 2 separate buses with almost
> identical load, 6 separate disks on each, and all 12 disks on the 2
> buses had between 47-52 SCSI errors - which points to the only shared
> component (the controller).
That doesn't seem to be the case here - the bad reads are always the
same, both in content and position.
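Since the bad data is deterministic, the differing pages can be localized by comparing a known-good copy against a corrupted read; a small sketch (the file names in the usage are placeholders):

```shell
# List the distinct 4 KiB block indices that differ between a
# known-good copy and a corrupted read of the same file.
diff_pages() {
    # cmp -l prints one line per differing byte ("offset byte1 byte2",
    # offsets 1-based); dividing by 4096 gives the 4 KiB page index.
    cmp -l "$1" "$2" | awk '{ print int(($1 - 1) / 4096) }' | uniq
}
```

Usage: `diff_pages good.bin bad.bin` - handy for checking whether the corrupt pages land on any chunk or stripe boundary.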
>
> I have had a pre-PCIe bus (PCI-X, 2 slots shared, both set to 133)
> cause random read corruptions; lowering the speed to 100 fixed it.
> This one was duplicated on multiple identical pieces of hw with all
> different parts on the duplicating machine.
>
> I have also seen lost writes (from software) because someone did a
> seek without doing a flush, which in some versions of the libs loses
> the unfilled block when the seek happens (this is noted in the man
> page; I saw it 20 years ago and it is still noted there, so no idea
> if it was ever fixed). So has more than one application been noted to
> see the corruption?
>
> So one question: have you seen the corruption on a path that relies
> on only one controller, or do all the corruptions you have seen
> involve more than one controller? Isolate and test each controller if
> you can, or if you can afford to, replace it and see if the issue
> continues.
>
So far only on one (LSI 2308) controller - although the thin volumes'
metadata is on the SSDs connected to the chipset's SATA controller.
Still, if hypothetically that were the case (metadata disks), wouldn't
I rather see corruptions that are a multiple of the thin volume's chunk
size (so multiples of 1.5 MiB in this case)?
As for the controller, I have ordered another one that we plan to test
in the near future.
* Re: [general question] rare silent data corruption when writing data
2020-05-07 21:01 ` John Stoffel
@ 2020-05-07 22:33 ` Michal Soltys
2020-05-08 0:54 ` John Stoffel
2020-05-08 3:44 ` Chris Murphy
0 siblings, 2 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 22:33 UTC (permalink / raw)
To: John Stoffel, Roger Heflin; +Cc: Linux RAID
On 20/05/07 23:01, John Stoffel wrote:
>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>
> Roger> Have you tried the same file 2x and verified the corruption is in the
> Roger> same places and looks the same?
>
> Are these 1 TB files VMDK or COW images of VMs? How are these files
> made? And does it ever happen with *smaller* files? What about if
> you just use a sparse 2 TB file and write blocks out past 1 TB to
> see if there's a problem?
The VMs sit directly on lvm volumes (e.g. /dev/mapper/vg0-gitlab). The
guest (btrfs inside the guest) detected the errors after we ran scrub
on the filesystem.
Yes, the errors were also found in small files.
Since then we have recreated the issue directly on the host, just by
making an ext4 filesystem on some LV, then doing a write with a
checksum, sync, drop_caches, read, and checksum check. The errors are,
as I mentioned, always full 4 KiB chunks (always the same content,
always the same position).
>
> Are the LVs split across RAID5 PVs by any chance?
raid5s are used as PVs, but a single logical volume always uses only
one physical volume underneath (if that's what you meant by split across).
>
> It's not clear if you can replicate the problem without using
> lvm-thin, but that's what I suspect you might be having problems with.
>
I'll be trying to do that, though the heavier tests will have to wait
until I move all VMs to other hosts (as that is/was our production machine).
> Can you give us the versions of your tools, and exactly how you set
> up your test cases? How long does it take to find the problem?
Will get all the details tomorrow (the host is on up-to-date debian
buster; the VMs are a mix of archlinuxes and debians (and the issue
happened on both)).
As for how long, it's hit and miss. Sometimes writing and reading back
a ~16 GB file fails (the checksum read back differs from what was
written) after 2-3 tries. That's on the host.
On the guest, it has been (so far) a guaranteed thing when we were
creating a very large tar file (900 GB+). For the past two weeks we
were unable to create that file without errors even once.
>
> Can you compile the newest kernel and newest thin tools and try them
> out?
I can, but a bit later (once we move VMs out of the host).
>
> How long does it take to replicate the corruption?
>
When it happens, it's usually a few tries of writing a 16 GB file with
random patterns and reading it back (directly on the host). The
irritating thing is that it can be somewhat hard to reproduce (e.g.
after a machine reboot).
> Sorry for all the questions, but until there's a test case which is
> repeatable, it's going to be hard to chase this down.
>
> I wonder if running 'fio' tests would be something to try?
>
> And also changing your RAID5 setup to use the default stride and
> stripe widths, instead of the large values you're using.
The raid5 is using mdadm's defaults (which is 512 KiB these days for a
chunk). LVM on top uses much longer extents (as we don't really need
4 MiB granularity) and the lvm-thin chunks were set to match (and align
with) the raid's stripe.
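The arithmetic behind that alignment, for clarity:

```shell
# Stripe-width arithmetic for the setup above: a 4-disk raid5 has
# 3 data chunks per stripe, so with the mdadm default 512 KiB chunk
# the full data stripe is 1536 KiB (1.5 MiB) - hence the thin-pool
# chunk size of 1.5 MiB.
chunk_kib=512
disks=4
data_disks=$((disks - 1))
stripe_kib=$((chunk_kib * data_disks))
echo "full data stripe: ${stripe_kib} KiB"    # 1536 KiB = 1.5 MiB
```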
>
> Good luck!
>
> Roger> I have not as of yet seen write corruption (except when a vendors disk
> Roger> was resetting and it was lying about having written the data prior to
> Roger> the crash, these were ssds, if your disk write cache is on and you
> Roger> have a disk reset this can also happen), but have not seen "lost
> Roger> writes" otherwise, but would expect the 2 read corruption I have seen
> Roger> to also be able to cause write issues. So for that look for scsi
> Roger> notifications for disk resets that should not happen.
>
> Roger> I have had a "bad" controller cause read corruptions, those
> Roger> corruptions would move around, replacing the controller resolved it,
> Roger> so there may be lack of error checking "inside" some paths in the
> Roger> card. Lucky I had a number of these controllers and had cold spares
> Roger> for them. The give away here was 2 separate buses with almost
> Roger> identical load with 6 separate disks each and all12 disks on 2 buses
> Roger> had between 47-52 scsi errors, which points to the only component
> Roger> shared (the controller).
>
> Roger> The backplane and cables are unlikely in general cause this, there is
> Roger> too much error checking between the controller and the disk from what
> Roger> I know.
>
> Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133
> Roger> cause random read corruptions, lowering speed to 100 fixed it), this
> Roger> one was duplicated on multiple identical pieces of hw with all
> Roger> different parts on the duplication machine.
>
> Roger> I have also seen lost writes (from software) because someone did a
> Roger> seek without doing a flush which in some versions of the libs loses
> Roger> the unfilled block when the seek happens (this is noted in the man
> Roger> page, and I saw it 20years ago, it is still noted in the man page, so
> Roger> no idea if it was ever fixed). So has more than one application been
> Roger> noted to see the corruption?
>
> Roger> So one question, have you seen the corruption in a path that would
> Roger> rely on one controller, or all corruptions you have seen involving
> Roger> more than one controller? Isolate and test each controller if you
> Roger> can, or if you can afford to replace it and see if it continues.
>
>
> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>
>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>
>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.
>>>
>>> The hardware is (can provide more detailed info of course):
>>>
>>> - Supermicro X9DR7-LN4F
>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>> - 96 gb ram (ecc)
>>> - 24 disk backplane
>>>
>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>
>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>
>>> With a doze of testing we managed to roughly rule out the following elements as being the cause:
>>>
>>> - qemu/kvm (issue occured directly on host)
>>> - backplane (issue occured on disks directly connected via LSI's 2nd connector)
>>> - cable (as a above, two different cables)
>>> - memory (unlikely - ECC for once, thoroughly tested, no errors ever reported via edac-util or memtest)
>>> - mdadm journaling (issue occured on plain mdraid configuration as well)
>>> - disks themselves (issue occured on two separate mdadm arrays)
>>> - filesystem (issue occured on both btrfs and ext4 (checksumed manually) )
>>>
>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>
>>> - lvm thin (issue always - so far - occured on lvm thin pools)
>>> - mdraid (issue always - so far - on mdraid managed arrays)
>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply rather already longstanding bug somewhere)
>>>
>>> And finally - so far - the issue never occured:
>>>
>>> - directly on a disk
>>> - directly on mdraid
>>> - on linear lvm volume on top of mdraid
>>>
>>> As far as the issue goes it's:
>>>
>>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to few dozens of such chunks
>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>> - the chunks look like a correct piece of different or previous data
>>>
>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>
>>> Anyway, if anyone has any ideas or suggestions as to what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for anything.
* Re: [general question] rare silent data corruption when writing data
2020-05-07 22:33 ` Michal Soltys
@ 2020-05-08 0:54 ` John Stoffel
2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
2020-05-08 3:44 ` Chris Murphy
1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-08 0:54 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>
>> Are these 1tb files VMDK or COW images of VMs? How are these files
>> made. And does it ever happen with *smaller* files? What about if
>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>> there's a problem?
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
Michal> Yes, the errors were also found on small files.
Those errors are in small files inside the VM, which is running btrfs
on top of block storage provided by your thin-lv, right?
disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
filesystem -> corruption
Michal> Since then we recreated the issue directly on the host, just
Michal> by making ext4 filesystem on some LV, then doing write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always full 4KiB chunks (always
Michal> same content, always same position).
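That write/sync/drop-caches/read cycle can be sketched roughly as follows (a minimal sketch, not the exact test used here; the file path and the scaled-down size are placeholders, the real runs used ~16 GiB files):

```shell
# Round-trip integrity probe: write random data, checksum it, flush,
# then read it back and compare. Path and size are placeholders.
roundtrip() {
    f="$1"; size_mb="$2"
    dd if=/dev/urandom of="$f" bs=1M count="$size_mb" 2>/dev/null
    written=$(sha256sum "$f" | awk '{print $1}')
    sync
    # On the real host, also drop the page cache (needs root) so the
    # read-back goes through the full storage stack, not RAM:
    #   echo 3 > /proc/sys/vm/drop_caches
    readback=$(sha256sum "$f" | awk '{print $1}')
    if [ "$written" = "$readback" ]; then
        echo "OK $f"
    else
        echo "MISMATCH $f"
    fi
}

roundtrip /tmp/lv-probe.bin 4
```

Repeating that in a loop on a filesystem sitting on a thin LV is essentially the host-side reproducer described above.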
What position? Is it a 4k, 1.5m or some other consistent offset? And
how far into the file? And this LV is a plain LV or a thin-lv? I'm
running a debian box at home with RAID1 and I haven't seen this, but
I'm not nearly as careful as you. Can you provide the output of:
/sbin/lvs --version
too?
Can you post your:
/sbin/dmsetup status
output too? There's a better command to use here, but I'm not an
expert. You might really want to copy this over to the
linux-lvm@redhat.com mailing list as well.
>> Are the LVs split across RAID5 PVs by any chance?
Michal> raid5s are used as PVs, but a single logical volume always uses only
Michal> one physical volume underneath (if that's what you meant by split across).
Ok, that's what I was asking about. It shouldn't matter... but just
trying to chase down the details.
>> It's not clear if you can replicate the problem without using
>> lvm-thin, but that's what I suspect you might be having problems with.
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machine).
Sure, makes sense.
>> Can you give us the versions of the your tools, and exactly how you
>> setup your test cases? How long does it take to find the problem?
Michal> Will get all the details tomorrow (the host is on up to date debian
Michal> buster, the VMs are mix of archlinuxes and debians (and the issue
Michal> happened on both)).
Michal> As for how long, it's a hit and miss. Sometimes writing and reading back
Michal> ~16gb file fails (the checksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating a very large tar file (900gb+). For the past two weeks we were
Michal> unable to create that file without errors even once.
Ouch! That's not good. Just to confirm, these corruptions are all in
a thin-lv based filesystem, right? I'd be interested to know if you
can create another plain LV and cause the same error. Trying to
simplify the potential problems.
>> Can you compile the newest kernel and newest thin tools and try them
>> out?
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>> How long does it take to replicate the corruption?
>>
Michal> When it happens, it's usually a few tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after machine's reboot).
>> Sorry for all the questions, but until there's a test case which is
>> repeatable, it's going to be hard to chase this down.
>>
>> I wonder if running 'fio' tests would be something to try?
>>
>> And also changing your RAID5 setup to use the default stride and
>> stripe widths, instead of the large values you're using.
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
Michal> to raid's stripe.
>>
>> Good luck!
>>
Roger> I have not as of yet seen write corruption (except when a vendor's disk
Roger> was resetting and it was lying about having written the data prior to
Roger> the crash; these were ssds, and if your disk write cache is on and you
Roger> have a disk reset this can also happen), and have not seen "lost
Roger> writes" otherwise, but would expect the 2 read corruptions I have seen
Roger> to also be able to cause write issues. So for that, look for scsi
Roger> notifications for disk resets that should not happen.
>>
Roger> I have had a "bad" controller cause read corruptions, those
Roger> corruptions would move around, replacing the controller resolved it,
Roger> so there may be lack of error checking "inside" some paths in the
Roger> card. Luckily I had a number of these controllers and had cold spares
Roger> for them. The giveaway here was 2 separate buses with almost
Roger> identical load with 6 separate disks each and all 12 disks on 2 buses
Roger> had between 47-52 scsi errors, which points to the only component
Roger> shared (the controller).
>>
Roger> The backplane and cables are unlikely in general to cause this, there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>
Roger> I have had a pre-PCIe bus (PCI-X bus, 2 slots shared, both set to 133)
Roger> cause random read corruptions (lowering the speed to 100 fixed it); this
Roger> one was duplicated on multiple identical pieces of hw with all
Roger> different parts on the duplication machine.
>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush, which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page, and I saw it 20 years ago; it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>
Roger> So one question, have you seen the corruption in a path that would
Roger> rely on one controller, or do all the corruptions you have seen involve
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to replace it and see if it continues.
>>
>>
* Re: [general question] rare silent data corruption when writing data
2020-05-07 22:33 ` Michal Soltys
2020-05-08 0:54 ` John Stoffel
@ 2020-05-08 3:44 ` Chris Murphy
2020-05-10 19:05 ` Sarah Newman
2020-05-20 21:40 ` Michal Soltys
1 sibling, 2 replies; 20+ messages in thread
From: Chris Murphy @ 2020-05-08 3:44 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
On Thu, May 7, 2020 at 4:34 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
> Since then we recreated the issue directly on the host, just by making
> ext4 filesystem on some LV, then doing write with checksum, sync,
> drop_caches, read and check checksum. The errors are, as I mentioned -
> always full 4KiB chunks (always same content, always same position).
The 4KiB chunk. What are the contents? Is it definitely guest VM data?
Or is it sometimes file system metadata? How many corruptions have
happened? The file system metadata is quite small compared to data.
But if there have been many errors, we'd expect if it's caused on the
host, that eventually file system metadata is corrupted. If it's
definitely only data, that's curious and maybe implicates something
going on in the guest.
Btrfs, whether normal reads or scrubs, will report the path to the
affected file, for data corruption. Metadata corruption errors
sometimes have inode references, but not a path to a file.
> >
> > Are the LVs split across RAID5 PVs by any chance?
>
> raid5s are used as PVs, but a single logical volume always uses only
> one physical volume underneath (if that's what you meant by split across).
It might be a bit suboptimal. A single 4KiB block write in the guest
turns into a 4KiB block write in the host's LV. That in turn trickles
down to md, which has a 512KiB x 4 drive stripe. So a single 4KiB
write translates into a 2M stripe write. There is an optimization for
raid5 in the RMW case, where it should be that only 4KiB of data plus
4KiB of parity is written (a partial strip/chunk write); I'm not sure about
reads.
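For reference, the geometry being described works out like this (just arithmetic on the numbers from this thread, nothing more):

```shell
# mdadm default chunk and the 4-disk RAID5 layout from this thread:
chunk_kib=512
disks=4
data_disks=$((disks - 1))          # one disk's worth of each stripe is parity

data_stripe_kib=$((chunk_kib * data_disks))   # data portion of a full stripe
full_stripe_kib=$((chunk_kib * disks))        # data + parity touched by a full-stripe write

echo "data stripe: ${data_stripe_kib} KiB"    # matches the 1.5m thin-pool chunk
echo "full stripe: ${full_stripe_kib} KiB"    # the 2M write mentioned above
```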
> > It's not clear if you can replicate the problem without using
> > lvm-thin, but that's what I suspect you might be having problems with.
> >
>
> I'll be trying to do that, though the heavier tests will have to wait
> until I move all VMs to other hosts (as that is/was our production machine).
By default, Btrfs uses a 16KiB block size for leaves and nodes. It's
still a tiny footprint compared to data writes, but if LVM thin is a
suspect, it really should just be a matter of time before file system
corruption happens. If it doesn't, that's useful information. It
probably means it's not LVM thin. But then what?
> As for how long, it's a hit and miss. Sometimes writing and reading back
> ~16gb file fails (the checksum read back differs from what was written)
> after 2-3 tries. That's on the host.
>
> On the guest, it's been (so far) a guaranteed thing when we were
> creating a very large tar file (900gb+). For the past two weeks we were
> unable to create that file without errors even once.
It's very useful to have a consistent reproducer. You can do metadata
only writes on Btrfs by doing multiple back to back metadata only
balance. If the problem really is in the write path somewhere, this
would eventually corrupt the metadata - it would be detected during
any subsequent balance or scrub. 'btrfs balance start -musage=100
/mountpoint' will do it.
This reproducer. It only reproduces in the guest VM? If you do it in
the host, otherwise exactly the same way with all the exact same
versions of everything, and it does not reproduce?
>
> >
> > Can you compile the newst kernel and newest thin tools and try them
> > out?
>
> I can, but a bit later (once we move VMs out of the host).
>
> >
> > How long does it take to replicate the corruption?
> >
>
> When it happens, it's usually a few tries of writing a 16gb file
> with random patterns and reading it back (directly on host). The
> irritating thing is that it can be somewhat hard to reproduce (e.g.
> after machine's reboot).
Reading it back on the host. So you've shut down the VM, and you're
mounting what was the guest VM's backing disk on the host to do the
verification. There's never a case of concurrent usage between guest
and host?
>
> > Sorry for all the questions, but until there's a test case which is
> > repeatable, it's going to be hard to chase this down.
> >
> > I wonder if running 'fio' tests would be something to try?
> >
> > And also changing your RAID5 setup to use the default stride and
> > stripe widths, instead of the large values you're using.
>
> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> chunk). LVM on top is using much longer extents (as we don't really need
> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> to raid's stripe.
I would change very little until you track this down, if the goal is
to track it down and get it fixed.
I'm not sure if LVM thinp is supported on top of LVM raid yet; if
it's not, then I can understand using mdadm raid5 instead
of LVM raid5.
--
Chris Murphy
* Re: [general question] rare silent data corruption when writing data
2020-05-08 0:54 ` John Stoffel
@ 2020-05-08 11:10 ` Michal Soltys
0 siblings, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-08 11:10 UTC (permalink / raw)
To: John Stoffel; +Cc: Roger Heflin, Linux RAID, linux-lvm
note: as suggested, I'm also CCing this to linux-lvm; the full context with replies starts at:
https://www.spinics.net/lists/raid/msg64364.html
There is also the initial post at the bottom as well.
On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>
> Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>
> Roger> Have you tried the same file 2x and verified the corruption is in the
> Roger> same places and looks the same?
>>>
>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>> made. And does it ever happen with *smaller* files? What about if
>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>> there's a problem?
>
> Michal> The VMs are always directly on lvm volumes. (e.g.
> Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
> Michal> errors after we ran scrub on the filesystem.
>
> Michal> Yes, the errors were also found on small files.
>
> Those errors are in small files inside the VM, which is running btrfs
> on top of block storage provided by your thin-lv, right?
>
Yea, the small files were in this case on that thin-lv.
We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on (so far) 2 layers).
>
>
> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
> filesystem -> corruption
Those particular guests, yea. In the host case it's just without the "guest" step.
But (so far) all corruption ended up going via one of the lv-thin layers (and via one of the md raids).
>
>
> Michal> Since then we recreated the issue directly on the host, just
> Michal> by making ext4 filesystem on some LV, then doing write with
> Michal> checksum, sync, drop_caches, read and check checksum. The
> Michal> errors are, as I mentioned - always full 4KiB chunks (always
> Michal> same content, always same position).
>
> What position? Is it a 4k, 1.5m or some other consistent offset? And
> how far into the file? And this LV is a plain LV or a thin-lv? I'm
> running a debian box at home with RAID1 and I haven't seen this, but
> I'm not nearly as careful as you. Can you provide the output of:
>
What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) that look like correct pieces belonging somewhere else.
> /sbin/lvs --version
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>
> too?
>
> Can you post your:
>
> /sbin/dmsetup status
>
> output too? There's a better command to use here, but I'm not an
> expert. You might really want to copy this over to the
> linux-lvm@redhat.com mailing list as well.
x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_ssd_tdata: 0 2147696640 linear
x22v0-tp_ssd_tdata: 2147696640 429588480 linear
x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
x22v0-btrfsnopool: 0 134430720 linear
x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
x22v0-micron_vol: 0 268615680 linear
x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
x22v0-btrfs_ssd_linear: 0 268615680 linear
x22v0-btrfstest: 0 268615680 thin 40734720 268615679
x22v0-tp_ssd: 0 2577285120 linear
x22v0-tp_big: 0 22164602880 linear
x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
x22v0-tmob2: 0 1048657920 linear
x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_big_tdata: 0 4295147520 linear
x22v0-tp_big_tdata: 4295147520 17869455360 linear
x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
x22v0-tmob1: 0 1048657920 linear
x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
x22v0-touk--elk4: 0 839024640 linear
x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
x22v0-k8sdkr: 0 268615680 linear
x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
x22v0-rocket: 0 167854080 thin 163602432 167854079
x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
x22v0-roger2: 0 134430720 thin 33014784 134430719
x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647
Also lvs -a with segment ranges:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
[lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
[tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
[tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
[tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
[tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
[tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
[tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
[tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
[tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
[tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
[tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
>
>>> Are the LVs split across RAID5 PVs by any chance?
>
> Michal> raid5s are used as PVs, but a single logical volume always uses only
> Michal> one physical volume underneath (if that's what you meant by split across).
>
> Ok, that's what I was asking about. It shouldn't matter... but just
> trying to chase down the details.
>
>
>>> It's not clear if you can replicate the problem without using
>>> lvm-thin, but that's what I suspect you might be having problems with.
>
> Michal> I'll be trying to do that, though the heavier tests will have to wait
> Michal> until I move all VMs to other hosts (as that is/was our production machine).
>
> Sure, makes sense.
>
>>> Can you give us the versions of the your tools, and exactly how you
>>> setup your test cases? How long does it take to find the problem?
Regarding this, currently:
kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
mdadm - v4.1 - 2018-10-01
>
> Michal> Will get all the details tomorrow (the host is on up to date debian
> Michal> buster, the VMs are mix of archlinuxes and debians (and the issue
> Michal> happened on both)).
>
> Michal> As for how long, it's a hit and miss. Sometimes writing and reading back
> Michal> ~16gb file fails (the checksum read back differs from what was written)
> Michal> after 2-3 tries. That's on the host.
>
> Michal> On the guest, it's been (so far) a guaranteed thing when we were
> Michal> creating a very large tar file (900gb+). For the past two weeks we were
> Michal> unable to create that file without errors even once.
>
> Ouch! That's not good. Just to confirm, these corruptions are all in
> a thin-lv based filesystem, right? I'd be interested to know if you
> can create another plain LV and cause the same error. Trying to
> simplify the potential problems.
I have been trying to - but so far didn't manage to replicate this with:
- a physical partition
- filesystem directly on a physical partition
- filesystem directly on mdraid
- filesystem directly on a linear volume
Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I have also had lengthy periods of attempts to cause corruption on some thin volume without any success either. But the ones that failed had this in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect.
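To make that layer-by-layer comparison systematic, the same kind of write/verify probe can be pointed at one mount per configuration (the mount points below are hypothetical placeholders for the setups listed above: raw partition, plain mdraid, linear LV, thin LV):

```shell
# Run one write/verify probe per layer; any MISMATCH narrows the suspect
# list to the layers unique to that mount's stack.
probe() {
    f="$1/probe.bin"
    dd if=/dev/urandom of="$f" bs=1M count=4 2>/dev/null
    w=$(sha256sum "$f" | awk '{print $1}')
    sync
    r=$(sha256sum "$f" | awk '{print $1}')
    rm -f "$f"
    if [ "$w" = "$r" ]; then echo "$1: clean"; else echo "$1: MISMATCH"; fi
}

for mnt in /mnt/partition /mnt/md-direct /mnt/lv-linear /mnt/lv-thin; do
    if [ -d "$mnt" ]; then probe "$mnt"; fi
done
```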
>
>
>>> Can you compile the newst kernel and newest thin tools and try them
>>> out?
>
> Michal> I can, but a bit later (once we move VMs out of the host).
>
>>>
>>> How long does it take to replicate the corruption?
>>>
>
> Michal> When it happens, it's usually few tries tries of writing a 16gb file
> Michal> with random patterns and reading it back (directly on host). The
> Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
> Michal> after machine's reboot).
>
>>> Sorry for all the questions, but until there's a test case which is
>>> repeatable, it's going to be hard to chase this down.
>>>
>>> I wonder if running 'fio' tests would be something to try?
>>>
>>> And also changing your RAID5 setup to use the default stride and
>>> stripe widths, instead of the large values you're using.
>
> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> Michal> chunk). LVM on top is using much longer extents (as we don't really need
> Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> Michal> to raid's stripe.
>
>>>
>>> Good luck!
>>>
> Roger> I have not as of yet seen write corruption (except when a vendors disk
> Roger> was resetting and it was lying about having written the data prior to
> Roger> the crash, these were ssds, if your disk write cache is on and you
> Roger> have a disk reset this can also happen), but have not seen "lost
> Roger> writes" otherwise, but would expect the 2 read corruption I have seen
> Roger> to also be able to cause write issues. So for that look for scsi
> Roger> notifications for disk resets that should not happen.
>>>
> Roger> I have had a "bad" controller cause read corruptions, those
> Roger> corruptions would move around, replacing the controller resolved it,
> Roger> so there may be lack of error checking "inside" some paths in the
> Roger> card. Lucky I had a number of these controllers and had cold spares
> Roger> for them. The give away here was 2 separate buses with almost
> Roger> identical load with 6 separate disks each and all12 disks on 2 buses
> Roger> had between 47-52 scsi errors, which points to the only component
> Roger> shared (the controller).
>>>
> Roger> The backplane and cables are unlikely in general cause this, there is
> Roger> too much error checking between the controller and the disk from what
> Roger> I know.
>>>
> Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133
> Roger> cause random read corruptions, lowering speed to 100 fixed it), this
> Roger> one was duplicated on multiple identical pieces of hw with all
> Roger> different parts on the duplication machine.
>>>
> Roger> I have also seen lost writes (from software) because someone did a
> Roger> seek without doing a flush which in some versions of the libs loses
> Roger> the unfilled block when the seek happens (this is noted in the man
> Roger> page, and I saw it 20years ago, it is still noted in the man page, so
> Roger> no idea if it was ever fixed). So has more than one application been
> Roger> noted to see the corruption?
>>>
> Roger> So one question, have you seen the corruption in a path that would
> Roger> rely on one controller, or all corruptions you have seen involving
> Roger> more than one controller? Isolate and test each controller if you
> Roger> can, or if you can afford to replace it and see if it continues.
>>>
>>>
> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>
>>>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>
>>>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.
>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>>>
>>>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following elements as being the cause:
>>>>>
>>>>> - qemu/kvm (issue occurred directly on the host)
>>>>> - backplane (issue occurred on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occurred on a plain mdraid configuration as well)
>>>>> - disks themselves (issue occurred on two separate mdadm arrays)
>>>>> - filesystem (issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>
>>>>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>>>>> - mdraid (issue always - so far - occurred on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply a rather longstanding bug somewhere)
>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on linear lvm volume on top of mdraid
>>>>>
>>>>> As far as the issue goes it's:
>>>>>
>>>>> - it's always a 4kb chunk that is incorrect - in a ~1 tb file there can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>
>>>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any of them.
>>>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [linux-lvm] [general question] rare silent data corruption when writing data
@ 2020-05-08 11:10 ` Michal Soltys
0 siblings, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-08 11:10 UTC (permalink / raw)
To: John Stoffel; +Cc: Linux RAID, Roger Heflin, linux-lvm
note: as suggested, I'm also CCing this to linux-lvm; the full context with replies starts at:
https://www.spinics.net/lists/raid/msg64364.html
The initial post is at the bottom as well.
On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>
> Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>
> Roger> Have you tried the same file 2x and verified the corruption is in the
> Roger> same places and looks the same?
>>>
>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>> made? And does it ever happen with *smaller* files? What about if
>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>> there's a problem?
>
> Michal> The VMs are always directly on lvm volumes. (e.g.
> Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
> Michal> errors after we ran scrub on the filesystem.
>
> Michal> Yes, the errors were also found on small files.
>
> Those errors are in small files inside the VM, which is running btrfs
> ontop of block storage provided by your thin-lv, right?
>
Yea, the small files were in this case on that thin-lv.
We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on - so far - 2 layers).
>
>
> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
> filesystem -> corruption
Those particular guests, yea. In the host case it's just without the "guest" step.
But (so far) all corruption has gone via one of the lv-thin layers (and via one of the md raids).
>
>
> Michal> Since then we recreated the issue directly on the host, just
> Michal> by making ext4 filesystem on some LV, then doing write with
> Michal> checksum, sync, drop_caches, read and check checksum. The
> Michal> errors are, as I mentioned - always a full 4KiB chunks (always
> Michal> same content, always same position).
>
> What position? Is it a 4k, 1.5m or some other consistent offset? And
> how far into the file? And this LV is a plain LV or a thin-lv? I'm
> running a debian box at home with RAID1 and I haven't seen this, but
> I'm not nearly as careful as you. Can you provide the output of:
>
What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) that look like correct pieces belonging somewhere else.
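For reference, the write / sync / drop_caches / read-back / verify loop described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used here - the path, file size, and data generator are placeholder assumptions:

```python
import hashlib
import os

PAGE = 4096

def write_and_verify(path, size, seed=0):
    """Write deterministic pseudo-random data, fsync it, drop the page
    cache, read it back, and return byte offsets of differing 4 KiB pages."""
    # Generate reproducible test data by chaining sha256 digests.
    rnd = hashlib.sha256(str(seed).encode()).digest()
    chunks, total = [], 0
    while total < size:
        rnd = hashlib.sha256(rnd).digest()
        chunks.append(rnd)
        total += len(rnd)
    data = b"".join(chunks)[:size]

    # Write and force the data out to the device.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Drop the page cache so the read-back actually goes through the
    # storage stack instead of being served from memory (needs root;
    # without it the check still runs but is far less meaningful).
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    except OSError:
        pass

    # Read back and report any 4 KiB pages that differ.
    with open(path, "rb") as f:
        readback = f.read()
    return [off for off in range(0, size, PAGE)
            if readback[off:off + PAGE] != data[off:off + PAGE]]
```

On a healthy stack this returns an empty list; on the affected host the idea is that it would occasionally return a handful of page-aligned offsets for a large enough file.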
> /sbin/lvs --version
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>
> too?
>
> Can you post your:
>
> /sbin/dmsetup status
>
> output too? There's a better command to use here, but I'm not an
> expert. You might really want to copy this over to the
> linux-lvm@redhat.com mailing list as well.
x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_ssd_tdata: 0 2147696640 linear
x22v0-tp_ssd_tdata: 2147696640 429588480 linear
x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
x22v0-btrfsnopool: 0 134430720 linear
x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
x22v0-micron_vol: 0 268615680 linear
x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
x22v0-btrfs_ssd_linear: 0 268615680 linear
x22v0-btrfstest: 0 268615680 thin 40734720 268615679
x22v0-tp_ssd: 0 2577285120 linear
x22v0-tp_big: 0 22164602880 linear
x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
x22v0-tmob2: 0 1048657920 linear
x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_big_tdata: 0 4295147520 linear
x22v0-tp_big_tdata: 4295147520 17869455360 linear
x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
x22v0-tmob1: 0 1048657920 linear
x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
x22v0-touk--elk4: 0 839024640 linear
x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
x22v0-k8sdkr: 0 268615680 linear
x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
x22v0-rocket: 0 167854080 thin 163602432 167854079
x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
x22v0-roger2: 0 134430720 thin 33014784 134430719
x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647
Also lvs -a with segment ranges:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
[lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
[tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
[tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
[tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
[tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
[tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
[tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
[tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
[tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
[tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
[tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
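As a side note, the Data%/Meta% columns in the lvs output can be cross-checked against the block counts in the dmsetup thin-pool status lines - e.g. for tp_ssd, 629535/838960 data blocks is the 75.04% that lvs reports. A small sketch of that cross-check; the field positions are inferred from the status lines shown here:

```python
def thin_pool_usage(status_line):
    """Parse a `dmsetup status` thin-pool line such as:
    "x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw ..."
    and return (data_used_pct, meta_used_pct)."""
    fields = status_line.split()
    # fields: name, start sector, length, "thin-pool", transaction id,
    #         used/total metadata blocks, used/total data blocks, ...
    assert fields[3] == "thin-pool"
    meta_used, meta_total = (int(x) for x in fields[5].split("/"))
    data_used, data_total = (int(x) for x in fields[6].split("/"))
    return (100.0 * data_used / data_total, 100.0 * meta_used / meta_total)
```

Feeding it the tp_ssd line above yields roughly (75.04, 1.61), matching the lvs Data%/Meta% figures.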
>
>>> Are the LVs split across RAID5 PVs by any chance?
>
> Michal> raid5s are used as PVs, but a single logical volume always uses only
> Michal> one physical volume underneath (if that's what you meant by split across).
>
> Ok, that's what I was asking about. It shouldn't matter... but just
> trying to chase down the details.
>
>
>>> It's not clear if you can replicate the problem without using
>>> lvm-thin, but that's what I suspect you might be having problems with.
>
> Michal> I'll be trying to do that, though the heavier tests will have to wait
> Michal> until I move all VMs to other hosts (as that is/was our production machine).
>
> Sure, makes sense.
>
>>> Can you give us the versions of your tools, and exactly how you
>>> set up your test cases? How long does it take to find the problem?
Regarding this, currently:
kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
mdadm - v4.1 - 2018-10-01
>
> Michal> Will get all the details tomorrow (the host is on up to date debian
> Michal> buster, the VMs are a mix of archlinuxes and debians (and the issue
> Michal> happened on both)).
>
> Michal> As for how long, it's hit and miss. Sometimes writing and reading back
> Michal> a ~16gb file fails (the checksum read back differs from what was written)
> Michal> after 2-3 tries. That's on the host.
>
> Michal> On the guest, it's been (so far) a guaranteed thing when we were
> Michal> creating a very large tar file (900gb+). For the past two weeks we were
> Michal> unable to create that file without errors even once.
>
> Ouch! That's not good. Just to confirm, these corruptions are all in
> a thin-lv based filesystem, right? I'd be interested to know if you
> can create another plain LV and cause the same error. Trying to
> simplify the potential problems.
I have been trying to - but so far I haven't managed to replicate this with:
- a physical partition
- filesystem directly on a physical partition
- filesystem directly on mdraid
- filesystem directly on a linear volume
Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I have also had lengthy periods of attempts to cause the corruption on some thin volume without any success. But the runs that did fail had this in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect.
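Since the 4 KiB size doesn't line up with anything in the stack, one thing worth checking is whether the bad pages cluster at any particular position within a thin chunk or raid stripe. A sketch, assuming the geometry from this thread (1.5 MiB thin chunks, 512 KiB raid chunks, 4-disk raid5 i.e. 3 data chunks per stripe) and ignoring the LV's own start offset on the md device (which the seg ranges in lvs would give):

```python
PAGE = 4096

def locate_page(byte_offset, thin_chunk=1536 * 1024,
                raid_chunk=512 * 1024, data_disks=3):
    """Map a corrupted byte offset within an LV to its position relative
    to the thin-pool chunk and md raid5 stripe geometry."""
    stripe = raid_chunk * data_disks  # 1.5 MiB with these defaults
    in_stripe = byte_offset % stripe
    return {
        "page_index": byte_offset // PAGE,
        "thin_chunk_index": byte_offset // thin_chunk,
        "offset_in_thin_chunk": byte_offset % thin_chunk,
        "stripe_index": byte_offset // stripe,
        "offset_in_stripe": in_stripe,
        "chunk_in_stripe": in_stripe // raid_chunk,
    }
```

Running this over the offsets of all bad pages from several failed runs would show whether they fall, say, only at chunk boundaries or are spread uniformly - either answer narrows down the layer.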
>
>
>>> Can you compile the newst kernel and newest thin tools and try them
>>> out?
>
> Michal> I can, but a bit later (once we move VMs out of the host).
>
>>>
>>> How long does it take to replicate the corruption?
>>>
>
> Michal> When it happens, it's usually a few tries of writing a 16gb file
> Michal> with random patterns and reading it back (directly on the host). The
> Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
> Michal> after the machine's reboot).
>
>>> Sorry for all the questions, but until there's a test case which is
>>> repeatable, it's going to be hard to chase this down.
>>>
>>> I wonder if running 'fio' tests would be something to try?
>>>
>>> And also changing your RAID5 setup to use the default stride and
>>> stripe widths, instead of the large values you're using.
>
> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> Michal> chunk). LVM on top is using much longer extents (as we don't really need
> Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> Michal> to raid's stripe.
>
>>>
>>> Good luck!
>>>
> Roger> I have not as of yet seen write corruption (except when a vendor's disk
> Roger> was resetting and it was lying about having written the data prior to
> Roger> the crash - these were ssds; if your disk write cache is on and you
> Roger> have a disk reset this can also happen), and have not seen "lost
> Roger> writes" otherwise, but I would expect the 2 read corruptions I have seen
> Roger> to also be able to cause write issues. So for that, look for scsi
> Roger> notifications for disk resets that should not happen.
>>>
> Roger> I have had a "bad" controller cause read corruptions; those
> Roger> corruptions would move around, and replacing the controller resolved it,
> Roger> so there may be a lack of error checking "inside" some paths in the
> Roger> card. Luckily I had a number of these controllers and had cold spares
> Roger> for them. The giveaway here was 2 separate buses with almost
> Roger> identical load with 6 separate disks each, and all 12 disks on 2 buses
> Roger> had between 47-52 scsi errors, which points to the only component
> Roger> shared (the controller).
>>>
> [...]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
@ 2020-05-08 16:10 ` John Stoffel
-1 siblings, 0 replies; 20+ messages in thread
From: John Stoffel @ 2020-05-08 16:10 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID, linux-lvm, dm-devel
>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
And of course it should also go to dm-devel@redhat.com, my fault for
not including that as well. I strongly suspect it's a thin-lv
problem somewhere, but I don't know enough to help chase down the
problem in detail.
John
Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.
Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>>
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>>
>>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>>> made. And does it ever happen with *smaller* files? What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>>
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>>
Michal> Yes, the errors were also found on small files.
>>
>> Those errors are in small files inside the VM, which is running btrfs
>> ontop of block storage provided by your thin-lv, right?
>>
Michal> Yea, the small files were in this case on that thin-lv.
Michal> We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on - so far - 2 layers).
>>
>>
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption
Michal> Those particular guests, yea. In the host case it's just without the "guest" step.
Michal> But (so far) all corruption has gone via one of the lv-thin layers (and via one of the md raids).
>>
>>
Michal> Since then we recreated the issue directly on the host, just
Michal> by making ext4 filesystem on some LV, then doing write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always a full 4KiB chunks (always
Michal> same content, always same position).
>>
>> What position? Is it a 4k, 1.5m or some other consistent offset? And
>> how far into the file? And this LV is a plain LV or a thin-lv? I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you. Can you provide the output of:
>>
Michal> What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) that look like correct pieces belonging somewhere else.
>> /sbin/lvs --version
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>>
>> too?
>>
>> Can you post your:
>>
>> /sbin/dmsetup status
>>
>> output too? There's a better command to use here, but I'm not an
>> expert. You might really want to copy this over to the
>> linux-lvm@redhat.com mailing list as well.
Michal> [...]
>>
>>>> Are the LVs split across RAID5 PVs by any chance?
>>
Michal> raid5s are used as PVs, but a single logical volume always uses one only
Michal> one physical volume underneath (if that's what you meant by split across).
>>
>> Ok, that's what I was asking about. It shouldn't matter... but just
>> trying to chase down the details.
>>
>>
>>>> It's not clear if you can replicate the problem without using
>>>> lvm-thin, but that's what I suspect you might be having problems with.
>>
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machnie).
>>
>> Sure, makes sense.
>>
>>>> Can you give us the versions of the your tools, and exactly how you
>>>> setup your test cases? How long does it take to find the problem?
Michal> Regarding this, currently:
Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> mdadm - v4.1 - 2018-10-01
>>
Michal> Will get all the details tommorow (the host is on up to date debian
Michal> buster, the VMs are mix of archlinuxes and debians (and the issue
Michal> happened on both)).
>>
Michal> As for how long, it's a hit and miss. Sometimes writing and reading back
Michal> ~16gb file fails (the cheksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
>>
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating very large tar file (900gb+). As for past two weeks we were
Michal> unable to create that file without errors even once.
>>
>> Ouch! That's not good. Just to confirm, these corruptions are all in
>> a thin-lv based filesystem, right? I'd be interested to know if you
>> can create another plain LV and cause the same error. Trying to
>> simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with:
Michal> - a physical partition
Michal> - filesystem directly on a physical partition
Michal> - filesystem directly on mdraid
Michal> - filesystem directly on a linear volume
Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - as I also had lengthy period of attempts to cause corruption on some thin volume w/o any successes either. But the ones that failed had those in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect
>>
>>
>>>> Can you compile the newst kernel and newest thin tools and try them
>>>> out?
>>
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>>>>
>>>> How long does it take to replicate the corruption?
>>>>
>>
Michal> When it happens, it's usually few tries tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after machine's reboot).
>>
>>>> Sorry for all the questions, but until there's a test case which is
>>>> repeatable, it's going to be hard to chase this down.
>>>>
>>>> I wonder if running 'fio' tests would be something to try?
>>>>
>>>> And also changing your RAID5 setup to use the default stride and
>>>> stripe widths, instead of the large values you're using.
>>
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
Michal> to raid's stripe.
>>
>>>>
>>>> Good luck!
>>>>
Roger> I have not as of yet seen write corruption (except when a vendors disk
Roger> was resetting and it was lying about having written the data prior to
Roger> the crash, these were ssds, if your disk write cache is on and you
Roger> have a disk reset this can also happen), but have not seen "lost
Roger> writes" otherwise, but would expect the 2 read corruption I have seen
Roger> to also be able to cause write issues. So for that look for scsi
Roger> notifications for disk resets that should not happen.
>>>>
Roger> I have had a "bad" controller cause read corruptions, those
Roger> corruptions would move around, replacing the controller resolved it,
Roger> so there may be lack of error checking "inside" some paths in the
Roger> card. Lucky I had a number of these controllers and had cold spares
Roger> for them. The give away here was 2 separate buses with almost
Roger> identical load with 6 separate disks each and all12 disks on 2 buses
Roger> had between 47-52 scsi errors, which points to the only component
Roger> shared (the controller).
>>>>
Roger> The backplane and cables are unlikely in general cause this, there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>>>
Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133
Roger> cause random read corruptions, lowering speed to 100 fixed it), this
Roger> one was duplicated on multiple identical pieces of hw with all
Roger> different parts on the duplication machine.
>>>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page, and I saw it 20years ago, it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>>>
Roger> So one question, have you seen the corruption in a path that would
Roger> rely on one controller, or all corruptions you have seen involving
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to replace it and see if it continues.
>>>>
>>>>
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>>
>>>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>>
>>>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.
>>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>>>>
>>>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>>>>
>>>>> With a doze of testing we managed to roughly rule out the following elements as being the cause:
>>>>>>
>>>>> - qemu/kvm (issue occured directly on host)
>>>>> - backplane (issue occured on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as a above, two different cables)
>>>>> - memory (unlikely - ECC for once, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occured on plain mdraid configuration as well)
>>>>> - disks themselves (issue occured on two separate mdadm arrays)
>>>>> - filesystem (issue occured on both btrfs and ext4 (checksumed manually) )
>>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>>
>>>>> - lvm thin (issue always - so far - occured on lvm thin pools)
>>>>> - mdraid (issue always - so far - on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply rather already longstanding bug somewhere)
>>>>>>
>>>>> And finally - so far - the issue never occured:
>>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on linear lvm volume on top of mdraid
>>>>>>
>>>>> As far as the issue goes it's:
>>>>>>
>>>>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to few dozens of such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>>
>>>>> The 4kb is well, weird ? Doesn't really matter any chunk/stripes sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.
>>>>
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [linux-lvm] [general question] rare silent data corruption when writing data
@ 2020-05-08 16:10 ` John Stoffel
0 siblings, 0 replies; 20+ messages in thread
From: John Stoffel @ 2020-05-08 16:10 UTC (permalink / raw)
To: Michal Soltys; +Cc: Linux RAID, Roger Heflin, dm-devel, linux-lvm
>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
And of course it should also go to dm-devel@redhat.com, my fault for
not including that as well. I strongly suspect it's a thin-lv
problem somewhere, but I don't know enough to help chase down the
problem in detail.
John
Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.
Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>>
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>>
>>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>>> made. And does it ever happen with *smaller* files? What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>>
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>>
Michal> Yes, the errors were also found on small files.
>>
>> Those errors are in small files inside the VM, which is running btrfs
>> ontop of block storage provided by your thin-lv, right?
>>
Michal> Yea, the small files were in this case on that thin-lv.
Michal> We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on (so far) 2 layers).
>>
>>
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption
Michal> Those particular guests, yea. In the host case it's just without the "guest" step.
Michal> But (so far) all corruption ended up going through one of the lv-thin layers (and one of the md raids).
>>
>>
Michal> Since then we recreated the issue directly on the host, just
Michal> by making an ext4 filesystem on some LV, then doing a write with
Michal> checksum, sync, drop_caches, read and checksum check. The
Michal> errors are, as I mentioned - always full 4KiB chunks (always
Michal> same content, always same position).
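The write/sync/read-back procedure described above can be sketched roughly as follows. This is a hypothetical reconstruction of the test, not the exact script used in the thread; the drop_caches step needs root (`echo 3 > /proc/sys/vm/drop_caches`) and is deliberately omitted here, and the path is a placeholder.

```python
import hashlib
import os

def roundtrip_check(path, size_mb=16 * 1024, block=1 << 20):
    """Write random data with a running checksum, fsync, re-read,
    and compare checksums. Returns True when the data read back
    matches what was written (the thread used ~16 GB files)."""
    h_write = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb):          # size_mb blocks of 1 MiB
            buf = os.urandom(block)
            h_write.update(buf)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    # On the real host, caches would be dropped here before re-reading.
    h_read = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h_read.update(chunk)
    return h_write.hexdigest() == h_read.hexdigest()
```

Run against a file on the suspect thin LV (e.g. a mount of it), a False result reproduces the corruption.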
>>
>> What position? Is it a 4k, 1.5m or some other consistent offset? And
>> how far into the file? And this LV is a plain LV or a thin-lv? I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you. Can you provide the output of:
>>
Michal> What I meant is that it doesn't "move" when verifying the same file (aka different reads of the same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) - that look like correct pieces belonging somewhere else.
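A small helper along these lines (hypothetical, not from the thread) can confirm that the mismatching pieces really are page-sized and page-aligned, by comparing the file read back against a known-good copy in 4 KiB steps:

```python
def diff_pages(path_a, path_b, page=4096):
    """Compare two files page by page and return the byte offsets of
    4 KiB pages that differ - useful for checking whether the corrupt
    pieces are page-aligned, as reported above."""
    bad = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(page), fb.read(page)
            if not a and not b:
                break
            if a != b:
                bad.append(offset)
            offset += page
    return bad
```

If the returned offsets are all multiples of 4096 and the damaged runs are exactly one page long, that supports the "always a 4 KiB piece" observation.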
>> /sbin/lvs --version
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>>
>> too?
>>
>> Can you post your:
>>
>> /sbin/dmsetup status
>>
>> output too?  There's a better command to use here, but I'm not an
>> expert.  You might really want to copy this over to the
>> linux-lvm@redhat.com mailing list as well.
Michal> x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
Michal> x22v0-tp_ssd_tdata: 0 2147696640 linear
Michal> x22v0-tp_ssd_tdata: 2147696640 429588480 linear
Michal> x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
Michal> x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
Michal> x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
Michal> x22v0-btrfsnopool: 0 134430720 linear
Michal> x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
Michal> x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
Michal> x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
Michal> x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
Michal> x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
Michal> x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
Michal> x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
Michal> x22v0-micron_vol: 0 268615680 linear
Michal> x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
Michal> x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
Michal> x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
Michal> x22v0-btrfs_ssd_linear: 0 268615680 linear
Michal> x22v0-btrfstest: 0 268615680 thin 40734720 268615679
Michal> x22v0-tp_ssd: 0 2577285120 linear
Michal> x22v0-tp_big: 0 22164602880 linear
Michal> x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
Michal> x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
Michal> x22v0-tmob2: 0 1048657920 linear
Michal> x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
Michal> x22v0-tp_big_tdata: 0 4295147520 linear
Michal> x22v0-tp_big_tdata: 4295147520 17869455360 linear
Michal> x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
Michal> x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
Michal> x22v0-tmob1: 0 1048657920 linear
Michal> x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
Michal> x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
Michal> x22v0-touk--elk4: 0 839024640 linear
Michal> x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
Michal> x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
Michal> x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
Michal> x22v0-k8sdkr: 0 268615680 linear
Michal> x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
Michal> x22v0-rocket: 0 167854080 thin 163602432 167854079
Michal> x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
Michal> x22v0-roger2: 0 134430720 thin 33014784 134430719
Michal> x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647
Michal> Also lvs -a with segment ranges:
Michal> LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
Michal> btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
Michal> btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
Michal> btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
Michal> btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
Michal> gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
Michal> gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
Michal> gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
Michal> gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
Michal> gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
Michal> gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
Michal> gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
Michal> gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
Michal> k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
Michal> [lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
Michal> micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
Michal> nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
Michal> nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
Michal> nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
Michal> nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
Michal> nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
Michal> nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
Michal> nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
Michal> openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
Michal> rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
Michal> roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
Michal> tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
Michal> tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
Michal> touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
Michal> tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
Michal> [tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
Michal> [tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
Michal> [tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
Michal> [tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
Michal> [tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
Michal> tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
Michal> [tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
Michal> [tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
Michal> [tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
Michal> [tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
Michal> [tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
>>
>>>> Are the LVs split across RAID5 PVs by any chance?
>>
Michal> raid5s are used as PVs, but a single logical volume always uses
Michal> only one physical volume underneath (if that's what you meant by split across).
>>
>> Ok, that's what I was asking about. It shouldn't matter... but just
>> trying to chase down the details.
>>
>>
>>>> It's not clear if you can replicate the problem without using
>>>> lvm-thin, but that's what I suspect you might be having problems with.
>>
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machine).
>>
>> Sure, makes sense.
>>
>>>> Can you give us the versions of your tools, and exactly how you
>>>> setup your test cases? How long does it take to find the problem?
Michal> Regarding this, currently:
Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> mdadm - v4.1 - 2018-10-01
>>
Michal> Will get all the details tomorrow (the host is on up to date debian
Michal> buster, the VMs are a mix of archlinuxes and debians (and the issue
Michal> happened on both)).
>>
Michal> As for how long, it's hit and miss. Sometimes writing and reading back
Michal> a ~16gb file fails (the checksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
>>
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating a very large tar file (900gb+). For the past two weeks we were
Michal> unable to create that file without errors even once.
>>
>> Ouch! That's not good. Just to confirm, these corruptions are all in
>> a thin-lv based filesystem, right? I'd be interested to know if you
>> can create another plain LV and cause the same error. Trying to
>> simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with:
Michal> - a physical partition
Michal> - filesystem directly on a physical partition
Michal> - filesystem directly on mdraid
Michal> - filesystem directly on a linear volume
Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I also had lengthy periods of attempting to cause corruption on some thin volume without any success. But the cases that failed had these in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect
>>
>>
>>>> Can you compile the newest kernel and newest thin tools and try them
>>>> out?
>>
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>>>>
>>>> How long does it take to replicate the corruption?
>>>>
>>
Michal> When it happens, it's usually a few tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on the host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after the machine's reboot).
>>
>>>> Sorry for all the questions, but until there's a test case which is
>>>> repeatable, it's going to be hard to chase this down.
>>>>
>>>> I wonder if running 'fio' tests would be something to try?
>>>>
>>>> And also changing your RAID5 setup to use the default stride and
>>>> stripe widths, instead of the large values you're using.
>>
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align
Michal> with) the raid's stripe.
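For reference, the alignment arithmetic behind those numbers works out as follows (assuming the 4-disk RAID5 layout described earlier in the thread):

```python
# Alignment check for the stack described above (values from the thread)
chunk_kib = 512                  # mdadm raid5 chunk (current default)
data_disks = 4 - 1               # 4-disk raid5 -> 3 data disks per stripe
stripe_kib = chunk_kib * data_disks
assert stripe_kib == 1536        # 1536 KiB = 1.5 MiB, the thin-pool chunk size

extent_kib = 120 * 1024          # 120m LVM extent size
assert extent_kib % stripe_kib == 0   # each extent is exactly 80 full stripes
```

So both the thin-pool chunk and the LVM extent are whole multiples of the raid stripe, consistent with "everything is cleanly aligned".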
>>
>>>>
>>>> Good luck!
>>>>
Roger> I have not as of yet seen write corruption (except when a vendor's disk
Roger> was resetting and it was lying about having written the data prior to
Roger> the crash - these were ssds; if your disk write cache is on and you
Roger> have a disk reset this can also happen). I have not seen "lost
Roger> writes" otherwise, but I would expect the 2 read corruptions I have seen
Roger> to also be able to cause write issues. So for that, look for scsi
Roger> notifications for disk resets that should not happen.
>>>>
Roger> I have had a "bad" controller cause read corruptions; those
Roger> corruptions would move around, and replacing the controller resolved it,
Roger> so there may be a lack of error checking "inside" some paths in the
Roger> card. Luckily I had a number of these controllers and had cold spares
Roger> for them. The giveaway here was 2 separate buses with almost
Roger> identical load, with 6 separate disks each, and all 12 disks on the 2 buses
Roger> had between 47-52 scsi errors, which points to the only component
Roger> shared (the controller).
>>>>
Roger> The backplane and cables are unlikely in general to cause this; there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>>>
Roger> I have had a pre-pcie bus (PCI-X, 2 slots shared, both set to 133)
Roger> cause random read corruptions - lowering the speed to 100 fixed it. This
Roger> one was duplicated on multiple identical pieces of hw, with all
Roger> different parts on the duplication machine.
>>>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush, which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page; I saw it 20 years ago and it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>>>
Roger> So one question: have you seen the corruption in a path that would
Roger> rely on one controller, or do all the corruptions you have seen involve
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to, replace it and see if it continues.
>>>>
>>>>
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>>
>>>>> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>>
>>>>> Thanks to btrfs's checksumming we discovered a somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have a precise 4kb (1 page) of incorrect data. The incorrect pieces of data look fine on their own - like something that was previously in that place, or written from the wrong source.
>>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)
>>>>>>
>>>>> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.
>>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following elements as being the cause:
>>>>>>
>>>>> - qemu/kvm (issue occurred directly on host)
>>>>> - backplane (issue occurred on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occurred on plain mdraid configuration as well)
>>>>> - disks themselves (issue occurred on two separate mdadm arrays)
>>>>> - filesystem (issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>>
>>>>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>>>>> - mdraid (issue always - so far - occurred on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would rather imply an already longstanding bug somewhere)
>>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on linear lvm volume on top of mdraid
>>>>>>
>>>>> As far as the issue goes it's:
>>>>>>
>>>>> - always a 4kb chunk that is incorrect - in a ~1 tb file there can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like correct pieces of different or previous data
>>>>>>
>>>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...
>>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions about what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.
>>>>
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
2020-05-08 3:44 ` Chris Murphy
@ 2020-05-10 19:05 ` Sarah Newman
2020-05-10 19:12 ` Sarah Newman
2020-05-20 21:40 ` Michal Soltys
1 sibling, 1 reply; 20+ messages in thread
From: Sarah Newman @ 2020-05-10 19:05 UTC (permalink / raw)
To: Chris Murphy, Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/7/20 8:44 PM, Chris Murphy wrote:
>
> I would change very little until you track this down, if the goal is
> to track it down and get it fixed.
>
> I'm not sure if LVM thinp is supported with LVM raid still, which if
> it's not supported yet then I can understand using mdadm raid5 instead
> of LVM raid5.
My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly
the size of a page sounds like a memory use-after-free bug or similar.
A debug kernel build with one or more of these options may find the problem:
CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
CONFIG_PAGE_POISONING + page_poison=1
CONFIG_KASAN
--Sarah
* Re: [general question] rare silent data corruption when writing data
2020-05-10 19:05 ` Sarah Newman
@ 2020-05-10 19:12 ` Sarah Newman
2020-05-11 9:41 ` Michal Soltys
0 siblings, 1 reply; 20+ messages in thread
From: Sarah Newman @ 2020-05-10 19:12 UTC (permalink / raw)
To: Chris Murphy, Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/10/20 12:05 PM, Sarah Newman wrote:
> On 5/7/20 8:44 PM, Chris Murphy wrote:
>>
>> I would change very little until you track this down, if the goal is
>> to track it down and get it fixed.
>>
>> I'm not sure if LVM thinp is supported with LVM raid still, which if
>> it's not supported yet then I can understand using mdadm raid5 instead
>> of LVM raid5.
>
>
> My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly
> the size of a page sounds like a memory use-after-free bug or similar.
>
> A debug kernel build with one or more of these options may find the problem:
>
> CONFIG_DEBUG_PAGEALLOC
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> CONFIG_PAGE_POISONING + page_poison=1
> CONFIG_KASAN
>
> --Sarah
And on further reflection you may as well add these:
CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
CONFIG_CRASH_DUMP (kdump)
+ anything else available. Basically turn debugging on all the way.
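For reference, the option names from both lists can be collected into a single kernel config fragment (a sketch only - how it gets merged into the tree's .config and rebuilt is left to whatever workflow you normally use):

```
# Fragment combining the suggested debug options. Note: CONFIG_PAGE_POISONING
# also needs page_poison=1 on the kernel command line, and CONFIG_CRASH_DUMP
# needs a crashkernel= reservation to actually capture a dump.
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_PAGE_POISONING=y
CONFIG_KASAN=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=y
CONFIG_CRASH_DUMP=y
```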
If you can reproduce reliably with these, then you can try the latest kernel with the same options and have some confidence the problem was
legitimately fixed.
--Sarah
* Re: [general question] rare silent data corruption when writing data
2020-05-10 19:12 ` Sarah Newman
@ 2020-05-11 9:41 ` Michal Soltys
2020-05-11 19:42 ` Sarah Newman
0 siblings, 1 reply; 20+ messages in thread
From: Michal Soltys @ 2020-05-11 9:41 UTC (permalink / raw)
To: Sarah Newman, Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/10/20 9:12 PM, Sarah Newman wrote:
> On 5/10/20 12:05 PM, Sarah Newman wrote:
>> On 5/7/20 8:44 PM, Chris Murphy wrote:
>>>
>>> I would change very little until you track this down, if the goal is
>>> to track it down and get it fixed.
>>>
>>> I'm not sure if LVM thinp is supported with LVM raid still, which if
>>> it's not supported yet then I can understand using mdadm raid5 instead
>>> of LVM raid5.
>>
>>
>> My apologies if this idea was considered and discarded already, but
>> the bug being hard to reproduce right after reboot, and the error being
>> exactly the size of a page, sounds like a memory use-after-free bug or
>> similar.
>>
>> A debug kernel build with one or more of these options may find the
>> problem:
>>
>> CONFIG_DEBUG_PAGEALLOC
>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
>> CONFIG_PAGE_POISONING + page_poison=1
>> CONFIG_KASAN
>>
>> --Sarah
>
> And on further reflection you may as well add these:
>
> CONFIG_DEBUG_OBJECTS
> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
> CONFIG_CRASH_DUMP (kdump)
>
> + anything else available. Basically turn debugging on all the way.
>
> If you can reproduce reliably with these, then you can try the latest
> kernel with the same options and have some confidence the problem was
> legitimately fixed.
>
After compiling the kernel with the above options enabled - and if this is
the underlying issue as you suspect - will it just pop up in dmesg if I hit
this bug, or do I need some extra tools/preparation/etc.?
* Re: [general question] rare silent data corruption when writing data
2020-05-11 9:41 ` Michal Soltys
@ 2020-05-11 19:42 ` Sarah Newman
0 siblings, 0 replies; 20+ messages in thread
From: Sarah Newman @ 2020-05-11 19:42 UTC (permalink / raw)
To: Michal Soltys, Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID
On 5/11/20 2:41 AM, Michal Soltys wrote:
> On 5/10/20 9:12 PM, Sarah Newman wrote:
>> On 5/10/20 12:05 PM, Sarah Newman wrote:
>>> On 5/7/20 8:44 PM, Chris Murphy wrote:
>>>>
>>>> I would change very little until you track this down, if the goal is
>>>> to track it down and get it fixed.
>>>>
>>>> I'm not sure if LVM thinp is supported with LVM raid still, which if
>>>> it's not supported yet then I can understand using mdadm raid5 instead
>>>> of LVM raid5.
>>>
>>>
>>> My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot, and the error being exactly
>>> the size of a page, sounds like a memory use-after-free bug or similar.
>>>
>>> A debug kernel build with one or more of these options may find the problem:
>>>
>>> CONFIG_DEBUG_PAGEALLOC
>>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
>>> CONFIG_PAGE_POISONING + page_poison=1
>>> CONFIG_KASAN
>>>
>>> --Sarah
>>
>> And on further reflection you may as well add these:
>>
>> CONFIG_DEBUG_OBJECTS
>> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
>> CONFIG_CRASH_DUMP (kdump)
>>
>> + anything else available. Basically turn debugging on all the way.
>>
>> If you can reproduce reliably with these, then you can try the latest kernel with the same options and have some confidence the problem was
>> legitimately fixed.
>>
>
> After compiling the kernel with the above options enabled - and if this is the underlying issue as you suspect - will it just pop up in dmesg if I hit this
> bug, or do I need some extra tools/preparation/etc.?
>
I'm pretty sure that you can get everything you need from either dmesg or sysfs/debugfs. Be prepared for an oops or panic.
--Sarah
* Re: [general question] rare silent data corruption when writing data
2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
2020-05-07 18:24 ` Roger Heflin
@ 2020-05-13 6:31 ` Chris Dunlop
2020-05-13 17:49 ` John Stoffel
2020-05-20 20:29 ` Michal Soltys
1 sibling, 2 replies; 20+ messages in thread
From: Chris Dunlop @ 2020-05-13 6:31 UTC (permalink / raw)
To: Michal Soltys; +Cc: linux-raid
Hi,
On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote:
> Note: this is just general question - if anyone experienced something
> similar or could suggest how to pinpoint / verify the actual cause.
>
> Thanks to btrfs's checksumming we discovered somewhat (even if quite
> rare) nasty silent corruption going on on one of our hosts. Or perhaps
> "corruption" is not the correct word - the files simply have precise 4kb
> (1 page) of incorrect data. The incorrect pieces of data look on their
> own fine - as something that was previously in the place, or written
> from wrong source.
"Me too!"
We are seeing 256-byte corruptions which are always the last 256b of a 4K
block. The 256b is very often a copy of a "last 256b of 4k block" from
earlier on the file. We sometimes see multiple corruptions in the same
file, with each of the corruptions being a copy of a different 256b from
earlier on the file. The original 256b and the copied 256b aren't
identifiably at a regular offset from each other.
I'd be really interested to hear if your problem is just in the last 256b
of the 4k block also!
We haven't been able to track down the origin of any of the copies
where it's not a 256b block earlier in the file. I tried some extensive
analysis of some of these occurrences, including looking at files being
written around the same time, but wasn't able to identify where the data
came from. It could be the "last 256b of 4k block" from some other file
being written at the same time, or a non-256b aligned chunk, or indeed not
a copy of other file data at all.
See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@onthe.net.au/
We've been able to detect these corruptions via an md5sum calculated as
the files are generated, where a later md5sum doesn't match the original.
We regularly see the md5sum match soon after the file is written (seconds
to minutes), and then go "bad" after doing a "vmtouch -e" to evict the
file from memory. I.e. it looks like the problem is occurring somewhere on
the write path to disk. We can move the corrupt file out of the way and
regenerate the file, then use 'cmp -l' to see where the corruption[s] are,
and calculate md5 sums for each 256b block in the file to identify where
the 256b was copied from.
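That analysis (per-256b md5 sums over a regenerated good copy and the corrupt copy, then searching earlier in the file for the source of each bad block) can be sketched roughly as follows - the data here is synthetic, standing in for the real file pair:

```python
# Rough sketch of the analysis described above: given a known-good copy and a
# corrupt copy of the same file, report the offset of each differing 256-byte
# block and, where possible, the earlier offset whose good content the bad
# block duplicates. Inputs below are synthetic demo data, not real files.
import hashlib

BLK = 256

def find_corruptions(good: bytes, bad: bytes):
    """Return [(offset, source_offset_or_None)] for each differing block."""
    seen = {}       # md5 digest of each good block -> first offset it appears at
    results = []
    for off in range(0, len(good), BLK):
        g, b = good[off:off + BLK], bad[off:off + BLK]
        if g != b:
            # does the corrupt content match a block from earlier in the file?
            results.append((off, seen.get(hashlib.md5(b).digest())))
        seen.setdefault(hashlib.md5(g).digest(), off)
    return results

# demo: 16 distinct 256b blocks; block 12 clobbered by a copy of block 3
good = b"".join(bytes([i]) * BLK for i in range(16))
bad = bytearray(good)
bad[12 * BLK:13 * BLK] = good[3 * BLK:4 * BLK]
print(find_corruptions(good, bytes(bad)))   # -> [(3072, 768)]
```

In practice the same role is played by `cmp -l` to locate the damage plus one md5 per 256b block of the good copy to hunt for the source.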
The corruptions are far more likely to occur during a scrub, although we
have seen a few of them when not scrubbing. We're currently working around
the issue by scrubbing infrequently, and trying to schedule scrubs during
periods of low write load.
> The hardware is (can provide more detailed info of course):
>
> - Supermicro X9DR7-LN4F
> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to
> backplane)
> - 96 gb ram (ecc)
> - 24 disk backplane
>
> - 1 array connected directly to lsi controller (4 disks, mdraid5,
> internal bitmap, 512kb chunk)
> - 1 array on the backplane (4 disks, mdraid5, journaled)
> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro
> disks)
> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still
> fine intel ssds from DC 3500 series)
Ours is on similar hardware:
- Supermicro X8DTH-IF
- LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions)
- 192GB ECC RAM
- A mix of 12 and 24-bay expanders (some daisy chained: lsi-expander-expander)
We swapped the LSI HBA for another of the same model, the problem
persisted. We have a SAS9300 card on the way for testing.
> Raid 5 arrays are in lvm volume group, and the logical volumes are used
> by VMs. Some of the volumes are linear, some are using thin-pools (with
> metadata on the aforementioned intel ssds, in mirrored config). LVM uses
> large extent sizes (120m) and the chunk-size of thin-pools is set to
> 1.5m to match underlying raid stripe. Everything is cleanly aligned as
> well.
We're not using VMs nor lvm thin on this storage.
Our main filesystem is xfs + lvm + raid6 and this is where we've seen all
but one of these corruptions (70-100 since Mar 2018).
The problem has occurred on all md arrays under the lvm, on disks from
multiple vendors and models, and on disks attached to all expanders.
We've seen one of these corruptions with xfs directly on a hdd partition.
I.e. no mdraid or lvm involved. This fs is an order of magnitude or more less
utilised than the main fs in terms of data being written.
> We did not manage to rule out (though somewhat _highly_ unlikely):
>
> - lvm thin (issue always - so far - occurred on lvm thin pools)
> - mdraid (issue always - so far - on mdraid managed arrays)
> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels,
> happened with both - so it would imply a rather longstanding bug
> somewhere)
- we're not using lvm thin
- problem has occurred once on non-mdraid (xfs directly on a hdd partition)
- problem NOT seen on kernel 3.18.25
- problem seen on, so far, kernels 4.4.153 - 5.4.2
> And finally - so far - the issue never occurred:
>
> - directly on a disk
> - directly on mdraid
> - on linear lvm volume on top of mdraid
- seen once directly on disk (partition)
- we don't use mdraid directly
- our problem arises on linear lvm on top of mdraid (raid6)
> As far as the issue goes it's:
>
> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from
> a few to a few dozen such chunks
> - we also found (or rather btrfs scrub did) a few small damaged files as
> well
> - the chunks look like a correct piece of different or previous data
>
> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes
> anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin
> pools; mdraid - default 512kb chunks). It does nicely fit a page though
> ...
>
> Anyway, if anyone has any ideas or suggestions what could be happening
> (perhaps with this particular motherboard or vendor) or how to pinpoint
> the cause - I'll be grateful for any.
Likewise!
Cheers,
Chris
* Re: [general question] rare silent data corruption when writing data
2020-05-13 6:31 ` Chris Dunlop
@ 2020-05-13 17:49 ` John Stoffel
2020-05-14 0:39 ` Chris Dunlop
2020-05-20 20:29 ` Michal Soltys
1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-13 17:49 UTC (permalink / raw)
To: Chris Dunlop; +Cc: Michal Soltys, linux-raid
I wonder if this problem can be replicated on loop devices? Once
there's a way to cause it reliably, we can then start doing a
bisection of the kernel to try and find out where this is happening.
So far, it looks like it happens sometimes on bare RAID6 systems
without lv-thin in place, which is both good and bad. And without
using VMs on top of the storage either. So this helps narrow down the
cause.
Is there any info on the workload on these systems? Lots of small
files which are added/removed? Large files which are just written to
and not touched again?
I assume finding a bad file with corruption and then doing a cp of the
file keeps the same corruption?
>>>>> "Chris" == Chris Dunlop <chris@onthe.net.au> writes:
Chris> Hi,
Chris> On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote:
>> Note: this is just general question - if anyone experienced something
>> similar or could suggest how to pinpoint / verify the actual cause.
>>
>> Thanks to btrfs's checksumming we discovered somewhat (even if quite
>> rare) nasty silent corruption going on on one of our hosts. Or perhaps
>> "corruption" is not the correct word - the files simply have precise 4kb
>> (1 page) of incorrect data. The incorrect pieces of data look on their
>> own fine - as something that was previously in the place, or written
>> from wrong source.
Chris> "Me too!"
Chris> We are seeing 256-byte corruptions which are always the last 256b of a 4K
Chris> block. The 256b is very often a copy of a "last 256b of 4k block" from
Chris> earlier on the file. We sometimes see multiple corruptions in the same
Chris> file, with each of the corruptions being a copy of a different 256b from
Chris> earlier on the file. The original 256b and the copied 256b aren't
Chris> identifiably at a regular offset from each other.
Chris> I'd be really interested to hear if your problem is just in the last 256b
Chris> of the 4k block also!
Chris> We haven't been able to track down the origin of any of the copies
Chris> where it's not a 256b block earlier in the file. I tried some extensive
Chris> analysis of some of these occurrences, including looking at files being
Chris> written around the same time, but wasn't able to identify where the data
Chris> came from. It could be the "last 256b of 4k block" from some other file
Chris> being written at the same time, or a non-256b aligned chunk, or indeed not
Chris> a copy of other file data at all.
Chris> See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@onthe.net.au/
Chris> We've been able to detect these corruptions via an md5sum calculated as
Chris> the files are generated, where a later md5sum doesn't match the original.
Chris> We regularly see the md5sum match soon after the file is written (seconds
Chris> to minutes), and then go "bad" after doing a "vmtouch -e" to evict the
Chris> file from memory. I.e. it looks like the problem is occurring somewhere on
Chris> the write path to disk. We can move the corrupt file out of the way and
Chris> regenerate the file, then use 'cmp -l' to see where the corruption[s] are,
Chris> and calculate md5 sums for each 256b block in the file to identify where
Chris> the 256b was copied from.
Chris> The corruptions are far more likely to occur during a scrub, although we
Chris> have seen a few of them when not scrubbing. We're currently working around
Chris> the issue by scrubbing infrequently, and trying to schedule scrubs during
Chris> periods of low write load.
>> The hardware is (can provide more detailed info of course):
>>
>> - Supermicro X9DR7-LN4F
>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to
>> backplane)
>> - 96 gb ram (ecc)
>> - 24 disk backplane
>>
>> - 1 array connected directly to lsi controller (4 disks, mdraid5,
>> internal bitmap, 512kb chunk)
>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro
>> disks)
>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still
>> fine intel ssds from DC 3500 series)
Chris> Ours is on similar hardware:
Chris> - Supermicro X8DTH-IF
Chris> - LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions)
Chris> - 192GB ECC RAM
Chris> - A mix of 12 and 24-bay expanders (some daisy chained: lsi-expander-expander)
Chris> We swapped the LSI HBA for another of the same model, the problem
Chris> persisted. We have a SAS9300 card on the way for testing.
>> Raid 5 arrays are in lvm volume group, and the logical volumes are used
>> by VMs. Some of the volumes are linear, some are using thin-pools (with
>> metadata on the aforementioned intel ssds, in mirrored config). LVM uses
>> large extent sizes (120m) and the chunk-size of thin-pools is set to
>> 1.5m to match underlying raid stripe. Everything is cleanly aligned as
>> well.
Chris> We're not using VMs nor lvm thin on this storage.
Chris> Our main filesystem is xfs + lvm + raid6 and this is where we've seen all
Chris> but one of these corruptions (70-100 since Mar 2018).
Chris> The problem has occurred on all md arrays under the lvm, on disks from
Chris> multiple vendors and models, and on disks attached to all expanders.
Chris> We've seen one of these corruptions with xfs directly on a hdd partition.
Chris> I.e. no mdraid or lvm involved. This fs is an order of magnitude or more less
Chris> utilised than the main fs in terms of data being written.
>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>
>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>> - mdraid (issue always - so far - on mdraid managed arrays)
>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels,
>> happened with both - so it would imply a rather longstanding bug
>> somewhere)
Chris> - we're not using lvm thin
Chris> - problem has occurred once on non-mdraid (xfs directly on a hdd partition)
Chris> - problem NOT seen on kernel 3.18.25
Chris> - problem seen on, so far, kernels 4.4.153 - 5.4.2
>> And finally - so far - the issue never occurred:
>>
>> - directly on a disk
>> - directly on mdraid
>> - on linear lvm volume on top of mdraid
Chris> - seen once directly on disk (partition)
Chris> - we don't use mdraid directly
Chris> - our problem arises on linear lvm on top of mdraid (raid6)
>> As far as the issue goes it's:
>>
>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from
>> a few to a few dozen such chunks
>> - we also found (or rather btrfs scrub did) a few small damaged files as
>> well
>> - the chunks look like a correct piece of different or previous data
>>
>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes
>> anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin
>> pools; mdraid - default 512kb chunks). It does nicely fit a page though
>> ...
>>
>> Anyway, if anyone has any ideas or suggestions what could be happening
>> (perhaps with this particular motherboard or vendor) or how to pinpoint
>> the cause - I'll be grateful for any.
Chris> Likewise!
Chris> Cheers,
Chris> Chris
* Re: [general question] rare silent data corruption when writing data
2020-05-13 17:49 ` John Stoffel
@ 2020-05-14 0:39 ` Chris Dunlop
0 siblings, 0 replies; 20+ messages in thread
From: Chris Dunlop @ 2020-05-14 0:39 UTC (permalink / raw)
To: John Stoffel; +Cc: Michal Soltys, linux-raid
On Wed, May 13, 2020 at 01:49:10PM -0400, John Stoffel wrote:
> I wonder if this problem can be replicated on loop devices? Once
> there's a way to cause it reliably, we can then start doing a
> bisection of the kernel to try and find out where this is happening.
I ran a week or so of attempting to replicate the problem in a VM on loop
devices replicating the lvm/raid config, without success. Basically just
having a random bunch of 1-25 concurrent writers banging out middling to
largish files.
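That write/verify loop can be sketched as follows (worker count, file size, and paths are all illustrative; the real test also evicted the page cache with `vmtouch -e` before re-reading, which plain Python cannot portably do):

```python
# Minimal sketch of the reproduction attempt described above: several
# concurrent writers each write pseudo-random data, record an MD5 as the
# file is generated, then re-read and verify. Sizes/paths are illustrative.
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_and_verify(path: str, size: int, seed: int) -> bool:
    # deterministic pseudo-random content, built 32 bytes at a time
    data = hashlib.sha256(seed.to_bytes(8, "little")).digest() * (size // 32)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())        # push the data down the write path
    expected = hashlib.md5(data).hexdigest()
    with open(path, "rb") as f:     # real test: evict page cache first
        return hashlib.md5(f.read()).hexdigest() == expected

with tempfile.TemporaryDirectory() as d:
    with ThreadPoolExecutor(max_workers=8) as ex:
        futs = [ex.submit(write_and_verify, os.path.join(d, f"f{i}"), 1 << 20, i)
                for i in range(8)]
        ok = all(f.result() for f in futs)
print("all files verified" if ok else "MISMATCH DETECTED")
```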
The fact it wasn't replicable in that environment could be pointing
towards the LSI driver or hardware - or I simply wasn't able to match
the conditions well enough.
> So far, it looks like it happens sometimes on bare RAID6 systems
> without lv-thin in place, which is both good and bad. And without
> using VMs on top of the storage either. So this helps narrow down the
> cause.
Note: We don't have any bare RAID6 so I haven't seen it there: our main fs
is xfs on linear LVM on raid6 (6 x 11-disk sets), and we saw it once
on xfs directly on a HDD partition.
> Is there any info on the work load on these systems? Lots of small
> fils which are added/removed? Large files which are just written to
> and not touched again?
Large files written and not touched again. Most of the time 2-5 concurrent
writers but regularly (daily) up to 20-25 concurrent.
> I assume finding a bad file with corruption and then doing a cp of the
> file keeps the same corruption?
Yep.
* Re: [general question] rare silent data corruption when writing data
2020-05-13 6:31 ` Chris Dunlop
2020-05-13 17:49 ` John Stoffel
@ 2020-05-20 20:29 ` Michal Soltys
1 sibling, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-20 20:29 UTC (permalink / raw)
To: Chris Dunlop; +Cc: linux-raid
On 20/05/13 08:31, Chris Dunlop wrote:
> Hi,
>
>
> "Me too!"
>
> We are seeing 256-byte corruptions which are always the last 256b of a
> 4K block. The 256b is very often a copy of a "last 256b of 4k block"
> from earlier on the file. We sometimes see multiple corruptions in the
> same file, with each of the corruptions being a copy of a different 256b
> from earlier on the file. The original 256b and the copied 256b aren't
> identifiably at a regular offset from each other.
>
> I'd be really interested to hear if your problem is just in the last
> 256b of the 4k block also!
From what I have checked - in my case it has always been a full 4k page.
I'll follow the suggestion by Sarah in the other part of this thread and
enable pagealloc debug options and then put the machine/disks under load
- so I'll keep an eye if something like you described happens.
This will have to wait a bit though, as I have another bug to hunt as
well - the journaled raid refuses to assemble, so with Song's help I'm
chasing that issue first.
If not for btrfs, we probably would have been using the machine happily
until now (blaming occasional detected issues on userspace stuff,
usually some fat java mess).
Thanks for the detailed explanation of what happened in your case (and the
span of kernel versions in which it does happen is scary). The hardware
indeed looks strikingly similar.
* Re: [general question] rare silent data corruption when writing data
2020-05-08 3:44 ` Chris Murphy
2020-05-10 19:05 ` Sarah Newman
@ 2020-05-20 21:40 ` Michal Soltys
1 sibling, 0 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-20 21:40 UTC (permalink / raw)
To: Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID
Sorry for the delayed reply, I have had some rather busy weeks.
On 20/05/08 05:44, Chris Murphy wrote:
>
> The 4KiB chunk. What are the contents? Is it definitely guest VM data?
> Or is it sometimes file system metadata? How many corruptions have
> happened? The file system metadata is quite small compared to data.
I haven't looked that precisely (and it would be hard to tell in quite a
few cases) - but I'll keep that in mind when I resume chasing this bug.
> But if there have been many errors, we'd expect if it's caused on the
> host, that eventually file system metadata is corrupted. If it's
> definitely only data, that's curious and maybe implicates something
> going on in the guest.
As far as metadata goes, so far I haven't seen those - as far as e2fsck
on ext4 and btrfs scrub on btrfs could tell. Though in the ext4 case I
haven't run it that many times - so good point, I'll include fsck after
each round.
>
> Btrfs, whether normal reads or scrubs, will report the path to the
> affected file, for data corruption. Metadata corruption errors
> sometimes have inode references, but not a path to a file.
>
Btrfs pointed to file contents only, so far.
>
>> >
>> > Are the LVs split across RAID5 PVs by any chance?
>>
>> raid5s are used as PVs, but a single logical volume always uses one only
>> one physical volume underneath (if that's what you meant by split across).
>
> It might be a bit suboptimal. A single 4KiB block write in the guest,
> turns into a 4KiB block write in the host's LV. That in turn trickles
> down to md, which has a 512KiB x 4 drive stripe. So a single 4KiB
> write translates into a 2M stripe write. There is an optimization for
> raid5 in the RMW case, where it should be true only 4KiB data plus
> 4KiB parity is written (partial strip/chunk write); I'm not sure about
> reads.
Well, I didn't play with the current defaults too much - aside from a large
stripe_cache_size plus the raid running under a 2x ssd write-back journal
(which unfortunately became another issue - there is another thread
where I'm chasing that bug).
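The geometry in the quoted paragraph can be sanity-checked with a little arithmetic (assumed: 4-disk mdraid5 with the default 512 KiB chunk, so each stripe holds 3 data chunks = 1.5 MiB of data plus one parity chunk - the size the thin-pool chunk was matched to):

```python
# Back-of-envelope check of the raid5 stripe geometry discussed above.
CHUNK = 512 * 1024
DATA_DISKS = 3                      # 4 disks minus one chunk of parity per stripe
STRIPE_DATA = CHUNK * DATA_DISKS    # 1.5 MiB of data per stripe

def locate(offset: int):
    """Map a logical byte offset to (stripe index, data chunk within stripe)."""
    return offset // STRIPE_DATA, (offset % STRIPE_DATA) // CHUNK

# A single 4 KiB write at logical offset 5 MiB touches one chunk of stripe 3;
# without the partial-write RMW optimisation that means reading/rewriting a
# whole 2 MiB (data + parity) stripe.
print(locate(5 * 1024 * 1024))   # -> (3, 1)
```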
>
>> > It's not clear if you can replicate the problem without using
>> > lvm-thin, but that's what I suspect you might be having problems with.
>> >
>>
>> I'll be trying to do that, though the heavier tests will have to wait
>> until I move all VMs to other hosts (as that is/was our production machine).
>
> By default Btrfs uses a 16KiB block size for leaves and nodes. It's
> still a tiny foot print compared to data writes, but if LVM thin is a
> suspect, it really should just be a matter of time before file system
> corruption happens. If it doesn't, that's useful information. It
> probably means it's not LVM thin. But then what?
>
>> As for how long, it's hit and miss. Sometimes writing and reading back
>> a ~16gb file fails (the checksum read back differs from what was written)
>> after 2-3 tries. That's on the host.
>>
>> On the guest, it's been (so far) a guaranteed thing when we were
>> creating a very large tar file (900gb+). For the past two weeks we have been
>> unable to create that file without errors even once.
>
> It's very useful to have a consistent reproducer. You can do metadata
> only writes on Btrfs by doing multiple back to back metadata only
> balance. If the problem really is in the write path somewhere, this
> would eventually corrupt the metadata - it would be detected during
> any subsequent balance or scrub. 'btrfs balance start -musage=100
> /mountpoint' will do it.
Will do that too.
>
> This reproducer. It only reproduces in the guest VM? If you do it in
> the host, otherwise exactly the same way with all the exact same
> versions of everything, and it does not reproduce?
>
I did reproduce the issue on the host (both on ext4 and btrfs). The host
has slightly different versions of the kernel and tools, but otherwise the
same stuff as one of the guests in which I was testing it.
>>
>> >
>> > Can you compile the newest kernel and newest thin tools and try them
>> > out?
>>
>> I can, but a bit later (once we move VMs out of the host).
>>
>> >
>> > How long does it take to replicate the corruption?
>> >
>>
>> When it happens, it's usually a few tries of writing a 16gb file
>> with random patterns and reading it back (directly on the host). The
>> irritating thing is that it can be somewhat hard to reproduce (e.g.
>> after a machine reboot).
>
> Reading it back on the host. So you've shut down the VM, and you're
> mounting what was the guests VM's backing disk, on the host to do the
> verification. There's never a case of concurrent usage between guest
> and host?
The host tests were on fresh filesystems on fresh lvm volumes (and
I hit the issue on two different thin pools). The issue was also reproduced
on the host when all guests were turned off.
>
>
>>
>> > Sorry for all the questions, but until there's a test case which is
>> > repeatable, it's going to be hard to chase this down.
>> >
>> > I wonder if running 'fio' tests would be something to try?
>> >
>> > And also changing your RAID5 setup to use the default stride and
>> > stripe widths, instead of the large values you're using.
>>
>> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
>> chunk). LVM on top is using much longer extents (as we don't really need
>> 4mb granularity) and the lvm-thin chunks were set to match (and align)
>> to raid's stripe.
>
> I would change very little until you track this down, if the goal is
> to track it down and get it fixed.
>
Yea, I'm keeping the stuff as is (and will try Sarah's suggestions with
debug options as well).
> I'm not sure if LVM thinp is supported with LVM raid still, which if
> it's not supported yet then I can understand using mdadm raid5 instead
> of LVM raid5.
>
It probably is, but while direct dmsetup exposes a few knobs (e.g. it
allows setting up a journal), lvm doesn't allow much besides the chunk size.
That was the primary reason I stuck with native mdadm.
end of thread, other threads:[~2020-05-20 21:40 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
2020-05-07 18:24 ` Roger Heflin
2020-05-07 21:01 ` John Stoffel
2020-05-07 22:33 ` Michal Soltys
2020-05-08 0:54 ` John Stoffel
2020-05-08 11:10 ` Michal Soltys
2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
2020-05-08 16:10 ` John Stoffel
2020-05-08 16:10 ` [linux-lvm] " John Stoffel
2020-05-08 3:44 ` Chris Murphy
2020-05-10 19:05 ` Sarah Newman
2020-05-10 19:12 ` Sarah Newman
2020-05-11 9:41 ` Michal Soltys
2020-05-11 19:42 ` Sarah Newman
2020-05-20 21:40 ` Michal Soltys
2020-05-07 22:13 ` Michal Soltys
2020-05-13 6:31 ` Chris Dunlop
2020-05-13 17:49 ` John Stoffel
2020-05-14 0:39 ` Chris Dunlop
2020-05-20 20:29 ` Michal Soltys