* [general question] rare silent data corruption when writing data
@ 2020-05-07 17:30 Michal Soltys
  2020-05-07 18:24 ` Roger Heflin
  2020-05-13  6:31 ` Chris Dunlop
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 17:30 UTC (permalink / raw)
To: linux-raid

Note: this is just a general question - in case anyone has experienced something similar, or could suggest how to pinpoint / verify the actual cause.

Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precisely 4 KiB (1 page) of incorrect data. The incorrect pieces of data look fine on their own - like something that was previously in that place, or was written from the wrong source.

The hardware is (can provide more detailed info of course):

- Supermicro X9DR7-LN4F
- onboard LSI SAS2308 controller (2 SFF-8087 connectors, 1 connected to the backplane)
- 96 GB RAM (ECC)
- 24-disk backplane

- 1 array connected directly to the LSI controller (4 disks, mdraid5, internal bitmap, 512 KiB chunk)
- 1 array on the backplane (4 disks, mdraid5, journaled)
- the journal for the above array is: mdraid1 on 2 SSDs (Micron 5300 PRO)
- 1 btrfs raid1 boot array on the motherboard's SATA ports (older but still fine Intel SSDs from the DC 3500 series)

The RAID5 arrays are in an LVM volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some use thin pools (with metadata on the aforementioned Intel SSDs, in mirrored config). LVM uses large extent sizes (120m) and the chunk size of the thin pools is set to 1.5m to match the underlying raid stripe. Everything is cleanly aligned as well.
With a dose of testing we managed to roughly rule out the following elements as being the cause:

- qemu/kvm (the issue occurred directly on the host)
- backplane (the issue occurred on disks connected directly via the LSI's 2nd connector)
- cable (as above, with two different cables)
- memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
- mdadm journaling (the issue occurred on a plain mdraid configuration as well)
- the disks themselves (the issue occurred on two separate mdadm arrays)
- filesystem (the issue occurred on both btrfs and ext4 (checksummed manually))

We did not manage to rule out (though somewhat _highly_ unlikely):

- lvm thin (the issue has always - so far - occurred on lvm thin pools)
- mdraid (the issue has always - so far - been on mdraid-managed arrays)
- kernel (tested with - in this case - Debian's 5.2 and 5.4 kernels; it happened with both, which would imply a rather long-standing bug somewhere)

And finally - so far - the issue has never occurred:

- directly on a disk
- directly on mdraid
- on a linear lvm volume on top of mdraid

As far as the issue goes, it's:

- always a 4 KiB chunk that is incorrect - in a ~1 TB file there can be from a few to a few dozen such chunks
- we also found (or rather btrfs scrub did) a few small damaged files as well
- the chunks look like a correct piece of different or previous data

The 4 KiB is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512 KiB chunks). It does nicely fit a page though...

Anyway, if anyone has any ideas or suggestions as to what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data
  2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys
@ 2020-05-07 18:24 ` Roger Heflin
  2020-05-07 21:01   ` John Stoffel
  2020-05-07 22:13   ` Michal Soltys
  2020-05-13  6:31 ` Chris Dunlop
  1 sibling, 2 replies; 20+ messages in thread
From: Roger Heflin @ 2020-05-07 18:24 UTC (permalink / raw)
To: Michal Soltys; +Cc: Linux RAID

Have you tried the same file 2x and verified the corruption is in the same places and looks the same?

I have not as of yet seen write corruption, except when a vendor's disk was resetting and lying about having written the data prior to the crash (those were SSDs; if your disk write cache is on and you have a disk reset, this can also happen). I have not seen "lost writes" otherwise, but I would expect the 2 read corruptions I have seen to also be able to cause write issues. So for that, look for SCSI notifications for disk resets that should not be happening.

I have had a "bad" controller cause read corruptions; those corruptions would move around, and replacing the controller resolved it, so there may be a lack of error checking "inside" some paths in the card. Luckily I had a number of these controllers and had cold spares for them. The giveaway there was 2 separate buses with almost identical load, 6 separate disks each, and all 12 disks on the 2 buses had between 47-52 SCSI errors - which points to the only shared component (the controller).

The backplane and cables are unlikely in general to cause this; there is too much error checking between the controller and the disk, from what I know.

I have had a pre-PCIe bus cause random read corruptions (PCI-X bus, 2 slots shared, both set to 133; lowering the speed to 100 fixed it). This one was duplicated on multiple identical pieces of hw, with all different parts on the duplicating machine.
I have also seen lost writes (from software) because someone did a seek without doing a flush, which in some versions of the libs loses the unfilled block when the seek happens (this is noted in the man page; I saw it 20 years ago, and it is still noted in the man page, so no idea if it was ever fixed). So, has more than one application been noted to see the corruption?

So one question: have you seen the corruption on a path that would rely on a single controller, or do all the corruptions you have seen involve more than one controller? Isolate and test each controller if you can, or, if you can afford to, replace it and see if the problem continues.

On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
> [original message quoted in full - snipped]
* Re: [general question] rare silent data corruption when writing data
  2020-05-07 18:24 ` Roger Heflin
@ 2020-05-07 21:01   ` John Stoffel
  2020-05-07 22:33     ` Michal Soltys
  2020-05-07 22:13   ` Michal Soltys
  1 sibling, 1 reply; 20+ messages in thread
From: John Stoffel @ 2020-05-07 21:01 UTC (permalink / raw)
To: Roger Heflin; +Cc: Michal Soltys, Linux RAID

>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:

Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?

Are these 1 TB files VMDK or COW images of VMs? How are these files made? And does it ever happen with *smaller* files? What about if you just use a sparse 2 TB file and write blocks out past 1 TB to see if there's a problem?

Are the LVs split across RAID5 PVs by any chance?

It's not clear whether you can replicate the problem without using lvm-thin, but that's where I suspect you might be having problems.

Can you give us the versions of your tools, and exactly how you set up your test cases? How long does it take to find the problem?

Can you compile the newest kernel and newest thin tools and try them out? How long does it take to replicate the corruption?

Sorry for all the questions, but until there's a repeatable test case, this is going to be hard to chase down.

I wonder if running 'fio' tests would be something to try? And also changing your RAID5 setup to use the default stride and stripe widths, instead of the large values you're using.

Good luck!

Roger> [rest of Roger's message and the original report quoted in full - snipped]
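The fio suggestion can be made concrete with a verifying write job; everything below is a sketch (the device path, size and job name are placeholders), and `--verify=crc32c` makes fio checksum each block as it is written and compare on read-back, which matches the write/re-read pattern being debugged here:

```shell
# WARNING: destructive - point it at a scratch LV, never a production volume.
fio --name=corruption-hunt \
    --filename=/dev/vg0/scratchlv \
    --rw=write --bs=4k --size=16g \
    --direct=1 --ioengine=libaio --iodepth=16 \
    --verify=crc32c --do_verify=1 --verify_fatal=1
```

`--verify_fatal=1` stops at the first bad block, so the failing offset is reported immediately; running the same job on a plain LV and a thin LV would split the lvm-thin question cleanly.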
* Re: [general question] rare silent data corruption when writing data
  2020-05-07 21:01   ` John Stoffel
@ 2020-05-07 22:33     ` Michal Soltys
  2020-05-08  0:54       ` John Stoffel
  2020-05-08  3:44       ` Chris Murphy
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Soltys @ 2020-05-07 22:33 UTC (permalink / raw)
To: John Stoffel, Roger Heflin; +Cc: Linux RAID

On 20/05/07 23:01, John Stoffel wrote:
> Are these 1tb files VMDK or COW images of VMs? How are these files
> made. And does it ever happen with *smaller* files?

The VMs are always directly on lvm volumes (e.g. /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the errors after we ran scrub on the filesystem. Yes, the errors were also found on small files.

Since then we have recreated the issue directly on the host, just by making an ext4 filesystem on some LV, then doing a write with a checksum, sync, drop_caches, read, and checksum comparison. The errors are, as I mentioned, always full 4 KiB chunks (always the same content, always the same position).

> Are the LVs split across RAID5 PVs by any chance?

raid5s are used as PVs, but a single logical volume always uses only one physical volume underneath (if that's what you meant by split across).

> It's not clear if you can replicate the problem without using
> lvm-thin, but that's what I suspect you might be having problems with.

I'll be trying to do that, though the heavier tests will have to wait until I move all VMs to other hosts (as that is/was our production machine).

> Can you give us the versions of your tools, and exactly how you
> setup your test cases? How long does it take to find the problem?

Will get all the details tomorrow (the host is on up-to-date debian buster, the VMs are a mix of archlinuxes and debians (and the issue happened on both)).

As for how long - it's hit and miss. Sometimes writing and reading back a ~16 GB file fails (the checksum read back differs from what was written) after 2-3 tries. That's on the host. On the guest, it's been (so far) a guaranteed thing when creating a very large tar file (900 GB+). For the past two weeks we have been unable to create that file without errors even once.

> Can you compile the newest kernel and newest thin tools and try them
> out?

I can, but a bit later (once we move the VMs off the host).

> How long does it take to replicate the corruption?

When it happens, it's usually a few tries of writing a 16 GB file with random patterns and reading it back (directly on the host). The irritating thing is that it can be somewhat hard to reproduce (e.g. after the machine's reboot).

> And also changing your RAID5 setup to use the default stride and
> stripe widths, instead of the large values you're using.

The raid5 is using mdadm's defaults (which is 512 KiB these days for a chunk). LVM on top is using much longer extents (as we don't really need 4 MB granularity), and the lvm-thin chunks were set to match (and align to) the raid's stripe.

> [rest of the quoted thread snipped]
* Re: [general question] rare silent data corruption when writing data 2020-05-07 22:33 ` Michal Soltys @ 2020-05-08 0:54 ` John Stoffel 2020-05-08 11:10 ` [linux-lvm] " Michal Soltys 2020-05-08 3:44 ` Chris Murphy 1 sibling, 1 reply; 20+ messages in thread From: John Stoffel @ 2020-05-08 0:54 UTC (permalink / raw) To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID >>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes: Michal> On 20/05/07 23:01, John Stoffel wrote: >>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes: >> Roger> Have you tried the same file 2x and verified the corruption is in the Roger> same places and looks the same? >> >> Are these 1tb files VMDK or COW images of VMs? How are these files >> made. And does it ever happen with *smaller* files? What about if >> you just use a sparse 2tb file and write blocks out past 1tb to see if >> there's a problem? Michal> The VMs are always directly on lvm volumes. (e.g. Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the Michal> errors after we ran scrub on the filesystem. Michal> Yes, the errors were also found on small files. Those errors are in small files inside the VM, which is running btrfs ontop of block storage provided by your thin-lv, right? disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN -> filesystem -> corruption Michal> Since then we recreated the issue directly on the host, just Michal> by making ext4 filesystem on some LV, then doing write with Michal> checksum, sync, drop_caches, read and check checksum. The Michal> errors are, as I mentioned - always a full 4KiB chunks (always Michal> same content, always same position). What position? Is it a 4k, 1.5m or some other consistent offset? And how far into the file? And this LV is a plain LV or a thin-lv? I'm running a debian box at home with RAID1 and I haven't seen this, but I'm not nearly as careful as you. Can you provide the output of: /sbin/lvs --version too? 
Can you post your: /sbin/dmsetup status output too? There's a better command to use here, but I'm not an export. You might really want to copy this over to the linux-lvm@redhat.com mailing list as well. >> Are the LVs split across RAID5 PVs by any chance? Michal> raid5s are used as PVs, but a single logical volume always uses one only Michal> one physical volume underneath (if that's what you meant by split across). Ok, that's what I was asking about. It shouldn't matter... but just trying to chase down the details. >> It's not clear if you can replicate the problem without using >> lvm-thin, but that's what I suspect you might be having problems with. Michal> I'll be trying to do that, though the heavier tests will have to wait Michal> until I move all VMs to other hosts (as that is/was our production machnie). Sure, makes sense. >> Can you give us the versions of the your tools, and exactly how you >> setup your test cases? How long does it take to find the problem? Michal> Will get all the details tommorow (the host is on up to date debian Michal> buster, the VMs are mix of archlinuxes and debians (and the issue Michal> happened on both)). Michal> As for how long, it's a hit and miss. Sometimes writing and reading back Michal> ~16gb file fails (the cheksum read back differs from what was written) Michal> after 2-3 tries. That's on the host. Michal> On the guest, it's been (so far) a guaranteed thing when we were Michal> creating very large tar file (900gb+). As for past two weeks we were Michal> unable to create that file without errors even once. Ouch! That's not good. Just to confirm, these corruptions are all in a thin-lv based filesystem, right? I'd be interested to know if you can create another plain LV and cause the same error. Trying to simplify the potential problems. >> Can you compile the newst kernel and newest thin tools and try them >> out? Michal> I can, but a bit later (once we move VMs out of the host). 
>> >> How long does it take to replicate the corruption? >> Michal> When it happens, it's usually few tries tries of writing a 16gb file Michal> with random patterns and reading it back (directly on host). The Michal> irritating thing is that it can be somewhat hard to reproduce (e.g. Michal> after machine's reboot). >> Sorry for all the questions, but until there's a test case which is >> repeatable, it's going to be hard to chase this down. >> >> I wonder if running 'fio' tests would be something to try? >> >> And also changing your RAID5 setup to use the default stride and >> stripe widths, instead of the large values you're using. Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a Michal> chunk). LVM on top is using much longer extents (as we don't really need Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align) Michal> to raid's stripe. >> >> Good luck! >> Roger> I have not as of yet seen write corruption (except when a vendors disk Roger> was resetting and it was lying about having written the data prior to Roger> the crash, these were ssds, if your disk write cache is on and you Roger> have a disk reset this can also happen), but have not seen "lost Roger> writes" otherwise, but would expect the 2 read corruption I have seen Roger> to also be able to cause write issues. So for that look for scsi Roger> notifications for disk resets that should not happen. >> Roger> I have had a "bad" controller cause read corruptions, those Roger> corruptions would move around, replacing the controller resolved it, Roger> so there may be lack of error checking "inside" some paths in the Roger> card. Lucky I had a number of these controllers and had cold spares Roger> for them. The give away here was 2 separate buses with almost Roger> identical load with 6 separate disks each and all12 disks on 2 buses Roger> had between 47-52 scsi errors, which points to the only component Roger> shared (the controller). 
>>
Roger> The backplane and cables are unlikely in general to cause this; there is too much error checking between the controller and the disk, from what I know.
>>
Roger> I have had a pre-PCIe bus (PCI-X, 2 slots shared, both set to 133) cause random read corruptions (lowering the speed to 100 fixed it); this one was duplicated on multiple identical pieces of hw, with all different parts on the duplicate machine.
>>
Roger> I have also seen lost writes (from software) because someone did a seek without doing a flush, which in some versions of the libs loses the unfilled block when the seek happens (this is noted in the man page; I saw it 20 years ago and it is still noted in the man page, so no idea if it was ever fixed). So has more than one application been noted to see the corruption?
>>
Roger> So one question: have you seen the corruption only in a path that relies on one controller, or have the corruptions you have seen involved more than one controller? Isolate and test each controller if you can, or if you can afford to, replace it and see if it continues.
>>
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>> [full quote of the original post snipped]
>> ^ permalink raw reply [flat|nested] 20+ messages in thread
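[The write/checksum/read-back probe discussed in this message can be sketched roughly as below. This is an illustration, not Michal's actual test: the path and size are placeholders, and on a real run the page cache must also be dropped between write and read (echo 3 > /proc/sys/vm/drop_caches, as root) so that the read actually hits the storage stack.]

```python
import hashlib
import os

def probe(path: str, size_mb: int) -> bool:
    """Write random data, fsync it, read it back, compare checksums.

    On the real test one would also run `sync` and drop the page cache
    before the read; here the read may be served from cache.
    """
    data = os.urandom(size_mb * 1024 * 1024)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    written = hashlib.sha256(data).hexdigest()
    with open(path, "rb") as f:
        readback = hashlib.sha256(f.read()).hexdigest()
    os.unlink(path)
    return written == readback  # False => silent corruption on this pass
```

[Run in a loop against a file on the filesystem under test; any pass returning False reproduces the issue.]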
* Re: [general question] rare silent data corruption when writing data 2020-05-08 0:54 ` John Stoffel @ 2020-05-08 11:10 ` Michal Soltys 0 siblings, 0 replies; 20+ messages in thread From: Michal Soltys @ 2020-05-08 11:10 UTC (permalink / raw) To: John Stoffel; +Cc: Roger Heflin, Linux RAID, linux-lvm

Note: as suggested, I'm also CCing this to linux-lvm; the full context with replies starts at: https://www.spinics.net/lists/raid/msg64364.html The initial post is also included at the bottom.

On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>
> Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>
> Roger> Have you tried the same file 2x and verified the corruption is in the same places and looks the same?
>
>>> Are these 1 TB files VMDK or COW images of VMs? How are these files made? And does it ever happen with *smaller* files? What about if you just use a sparse 2 TB file and write blocks out past 1 TB to see if there's a problem?
>
> Michal> The VMs are always directly on lvm volumes (e.g. /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the errors after we ran scrub on the filesystem.
>
> Michal> Yes, the errors were also found on small files.
>
> Those errors are in small files inside the VM, which is running btrfs on top of block storage provided by your thin-lv, right?

Yea, the small files were in this case on that thin-lv. We also discovered (yesterday) file corruptions in the VM hosting the gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported an incorrect sha checksum on (so far) 2 layers).

> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN -> filesystem -> corruption

Those particular guests, yea. In the host case it's just without the "guest" step.
But (so far) all corruption ended up going via one of the lv-thin layers (and via one of the md raids).

> Michal> Since then we recreated the issue directly on the host, just by making an ext4 filesystem on some LV, then doing a write with checksum, sync, drop_caches, read, and checksum check. The errors are, as I mentioned, always full 4 KiB chunks (always same content, always same position).
>
> What position? Is it a 4k, 1.5m or some other consistent offset? And how far into the file? And is this LV a plain LV or a thin-lv? I'm running a debian box at home with RAID1 and I haven't seen this, but I'm not nearly as careful as you. Can you provide the output of:

What I meant is that it doesn't "move" when verifying the same file (aka different reads from the same test file). Between the tests, the errors are of course in different places - but it's always some 4 KiB piece(s) that look like correct pieces belonging somewhere else.

> /sbin/lvs --version

LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib
--enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync

> too?
>
> Can you post your:
>
> /sbin/dmsetup status
>
> output too? There's a better command to use here, but I'm not an expert. You might really want to copy this over to the linux-lvm@redhat.com mailing list as well.

x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_ssd_tdata: 0 2147696640 linear
x22v0-tp_ssd_tdata: 2147696640 429588480 linear
x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
x22v0-btrfsnopool: 0 134430720 linear
x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
x22v0-micron_vol: 0 268615680 linear
x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
x22v0-btrfs_ssd_linear: 0 268615680 linear
x22v0-btrfstest: 0 268615680 thin 40734720 268615679
x22v0-tp_ssd: 0 2577285120 linear
x22v0-tp_big: 0 22164602880 linear
x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
x22v0-tmob2: 0 1048657920 linear
x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
x22v0-tp_big_tdata: 0 4295147520 linear
x22v0-tp_big_tdata: 4295147520 17869455360 linear
x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
x22v0-tmob1: 0 1048657920 linear
x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
x22v0-touk--elk4: 0 839024640 linear
x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
x22v0-k8sdkr: 0 268615680 linear
x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
x22v0-rocket: 0 167854080 thin 163602432 167854079
x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
x22v0-roger2: 0 134430720 thin 33014784 134430719
x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647

Also lvs -a with segment ranges:

LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
[lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
[tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
[tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
[tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
[tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
[tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
[tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
[tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
[tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
[tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
[tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
[tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
[tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10

>>> Are the LVs split across RAID5 PVs by any chance?
>
> Michal> raid5s are used as PVs, but a single logical volume always uses only one physical volume underneath (if that's what you meant by split across).
>
> Ok, that's what I was asking about. It shouldn't matter... but just trying to chase down the details.
>
>>> It's not clear if you can replicate the problem without using lvm-thin, but that's what I suspect you might be having problems with.
> Michal> I'll be trying to do that, though the heavier tests will have to wait until I move all VMs to other hosts (as that is/was our production machine).
>
> Sure, makes sense.
>
>>> Can you give us the versions of your tools, and exactly how you set up your test cases? How long does it take to find the problem?

Regarding this, currently:

kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64)
LVM version: 2.03.02(2) (2018-12-18)
Library version: 1.02.155 (2018-12-18)
Driver version: 4.41.0
mdadm - v4.1 - 2018-10-01

> Michal> Will get all the details tomorrow (the host is on up-to-date debian buster, the VMs are a mix of archlinuxes and debians (and the issue happened on both)).
>
> Michal> As for how long, it's hit and miss. Sometimes writing and reading back a ~16 GB file fails (the checksum read back differs from what was written) after 2-3 tries. That's on the host.
>
> Michal> On the guest, it's been (so far) a guaranteed thing when we were creating a very large tar file (900 GB+). For the past two weeks we were unable to create that file without errors even once.
>
> Ouch! That's not good. Just to confirm, these corruptions are all in a thin-lv based filesystem, right? I'd be interested to know if you can create another plain LV and cause the same error. Trying to simplify the potential problems.

I have been trying to - but so far didn't manage to replicate this with:

- a physical partition
- a filesystem directly on a physical partition
- a filesystem directly on mdraid
- a filesystem directly on a linear volume

Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I also had lengthy periods of attempts to cause corruption on some thin volume w/o any success.
But the ones that failed had this in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect.

>>> Can you compile the newest kernel and newest thin tools and try them out?
>
> Michal> I can, but a bit later (once we move VMs out of the host).
>
>>> How long does it take to replicate the corruption?
>
> Michal> When it happens, it's usually a few tries of writing a 16 GB file with random patterns and reading it back (directly on the host). The irritating thing is that it can be somewhat hard to reproduce (e.g. after a machine reboot).
>
>>> Sorry for all the questions, but until there's a test case which is repeatable, it's going to be hard to chase this down.
>>>
>>> I wonder if running 'fio' tests would be something to try?
>>>
>>> And also changing your RAID5 setup to use the default stride and stripe widths, instead of the large values you're using.
>
> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a chunk). LVM on top is using much longer extents (as we don't really need 4 MB granularity) and the lvm-thin chunks were set to match (and align with) the raid's stripe.
>
>>> Good luck!
>
> Roger> I have not as of yet seen write corruption (except when a vendor's disk was resetting and was lying about having written the data prior to the crash; these were SSDs, and if your disk write cache is on and you have a disk reset this can also happen), and have not seen "lost writes" otherwise, but I would expect the 2 read corruptions I have seen to also be able to cause write issues. So for that, look for SCSI notifications for disk resets that should not happen.
>
> Roger> I have had a "bad" controller cause read corruptions; those corruptions would move around, and replacing the controller resolved it, so there may be a lack of error checking "inside" some paths in the card. Luckily I had a number of these controllers and had cold spares for them. The giveaway here was 2 separate buses with almost identical load, 6 separate disks each, and all 12 disks on the 2 buses had between 47-52 SCSI errors, which points to the only component shared (the controller).
>
> Roger> The backplane and cables are unlikely in general to cause this; there is too much error checking between the controller and the disk, from what I know.
>
> Roger> I have had a pre-PCIe bus (PCI-X, 2 slots shared, both set to 133) cause random read corruptions (lowering the speed to 100 fixed it); this one was duplicated on multiple identical pieces of hw, with all different parts on the duplicate machine.
>
> Roger> I have also seen lost writes (from software) because someone did a seek without doing a flush, which in some versions of the libs loses the unfilled block when the seek happens (this is noted in the man page; I saw it 20 years ago and it is still noted in the man page, so no idea if it was ever fixed). So has more than one application been noted to see the corruption?
>
> Roger> So one question: have you seen the corruption only in a path that relies on one controller, or have the corruptions you have seen involved more than one controller? Isolate and test each controller if you can, or if you can afford to, replace it and see if it continues.
>
> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>> Note: this is just a general question - if anyone has experienced something similar or could suggest how to pinpoint / verify the actual cause.
>>>>>
>>>>> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts.
Or perhaps "corruption" is not the correct word - the files simply have precisely 4 KB (1 page) of incorrect data. The incorrect pieces of data look fine on their own - like something that was previously in that place, or written from the wrong source.
>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to the backplane)
>>>>> - 96 GB RAM (ECC)
>>>>> - 24-disk backplane
>>>>>
>>>>> - 1 array connected directly to the LSI controller (4 disks, mdraid5, internal bitmap, 512 KB chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - the journal for the above array is: mdraid1, 2 SSDs (Micron 5300 Pro disks)
>>>>> - 1 btrfs raid1 boot array on the motherboard's sata ports (older but still fine Intel SSDs from the DC 3500 series)
>>>>>
>>>>> The raid5 arrays are in an LVM volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned Intel SSDs, in mirrored config). LVM uses large extent sizes (120m) and the chunk-size of the thin-pools is set to 1.5m to match the underlying raid stripe. Everything is cleanly aligned as well.
>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following elements as being the cause:
>>>>>
>>>>> - qemu/kvm (the issue occurred directly on the host)
>>>>> - backplane (the issue occurred on disks directly connected via the LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (the issue occurred on a plain mdraid configuration as well)
>>>>> - the disks themselves (the issue occurred on two separate mdadm arrays)
>>>>> - filesystem (the issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>
>>>>> - lvm thin (the issue - so far - always occurred on lvm thin pools)
>>>>> - mdraid (the issue - so far - always on mdraid-managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply a rather longstanding bug somewhere)
>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on a linear lvm volume on top of mdraid
>>>>>
>>>>> As far as the issue goes, it's:
>>>>>
>>>>> - always a 4 KB chunk that is incorrect - in a ~1 TB file there can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>
>>>>> The 4 KB is, well, weird? It doesn't match any chunk/stripe sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512 KB chunks). It does nicely fit a page though ...
>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions as to what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.

^ permalink raw reply [flat|nested] 20+ messages in thread
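[Since the damage is always whole 4 KiB pieces, one way to hunt for a pattern - a sketch, not something from the thread - is to diff the read-back file against what was written and record each bad block's offset modulo the thin-chunk size; bad offsets clustering at particular remainders would implicate a specific layer's boundary handling:]

```python
BLOCK = 4096                 # damage granularity observed in the thread
THIN_CHUNK = 1536 * 1024     # 1.5 MiB lvm-thin chunk = raid5 full stripe

def bad_blocks(original: bytes, readback: bytes):
    """Return (offset, offset % THIN_CHUNK) for each differing 4 KiB block."""
    hits = []
    for off in range(0, len(original), BLOCK):
        if original[off:off + BLOCK] != readback[off:off + BLOCK]:
            hits.append((off, off % THIN_CHUNK))
    return hits
```

[For real-sized test files one would read both files in chunks rather than as whole byte strings, but the comparison logic is the same.]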
* Re: [linux-lvm] [general question] rare silent data corruption when writing data @ 2020-05-08 11:10 ` Michal Soltys 0 siblings, 0 replies; 20+ messages in thread From: Michal Soltys @ 2020-05-08 11:10 UTC (permalink / raw) To: John Stoffel; +Cc: Linux RAID, Roger Heflin, linux-lvm note: as suggested, I'm also CCing this to linux-lvm; the full context with replies starts at: https://www.spinics.net/lists/raid/msg64364.html There is also the initial post at the bottom as well. On 5/8/20 2:54 AM, John Stoffel wrote: >>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes: > > Michal> On 20/05/07 23:01, John Stoffel wrote: >>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes: >>> > Roger> Have you tried the same file 2x and verified the corruption is in the > Roger> same places and looks the same? >>> >>> Are these 1tb files VMDK or COW images of VMs? How are these files >>> made. And does it ever happen with *smaller* files? What about if >>> you just use a sparse 2tb file and write blocks out past 1tb to see if >>> there's a problem? > > Michal> The VMs are always directly on lvm volumes. (e.g. > Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the > Michal> errors after we ran scrub on the filesystem. > > Michal> Yes, the errors were also found on small files. > > Those errors are in small files inside the VM, which is running btrfs > ontop of block storage provided by your thin-lv, right? > Yea, the small files were in this case on that thin-lv. We also discovered (yesterday) file corruptions with VM hosting gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported incorrect sha checksum on (so far) 2 layers. > > > disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN -> > filesystem -> corruption Those particular guests, yea. The host case it's just w/o "guest" step. 
But (so far) all corruption ended going via one of the lv-thin layers (and via one of md raids). > > > Michal> Since then we recreated the issue directly on the host, just > Michal> by making ext4 filesystem on some LV, then doing write with > Michal> checksum, sync, drop_caches, read and check checksum. The > Michal> errors are, as I mentioned - always a full 4KiB chunks (always > Michal> same content, always same position). > > What position? Is it a 4k, 1.5m or some other consistent offset? And > how far into the file? And this LV is a plain LV or a thin-lv? I'm > running a debian box at home with RAID1 and I haven't seen this, but > I'm not nearly as careful as you. Can you provide the output of: > What I meant that it doesn't "move" when verifying the same file (aka different reads from same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) - that look like correct pieces belonging somewhere else. > /sbin/lvs --version LVM version: 2.03.02(2) (2018-12-18) Library version: 1.02.155 (2018-12-18) Driver version: 4.41.0 Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib 
--enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync > > too? > > Can you post your: > > /sbin/dmsetup status > > output too? There's a better command to use here, but I'm not an > export. You might really want to copy this over to the > linux-lvm@redhat.com mailing list as well. x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024 x22v0-tp_ssd_tdata: 0 2147696640 linear x22v0-tp_ssd_tdata: 2147696640 429588480 linear x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 - x22v0-gerrit--new: 0 268615680 thin 255510528 268459007 x22v0-btrfsnopool: 0 134430720 linear x22v0-gitlab_root: 0 629145600 thin 628291584 629145599 x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159 x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639 x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679 x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear x22v0-micron_vol: 0 268615680 linear x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear x22v0-gerrit--root: 0 268615680 thin 103388160 268443647 x22v0-btrfs_ssd_linear: 0 268615680 linear x22v0-btrfstest: 0 268615680 thin 40734720 268615679 x22v0-tp_ssd: 0 2577285120 linear x22v0-tp_big: 0 22164602880 linear x22v0-nexus3_root: 0 167854080 thin 21860352 167854079 x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679 x22v0-tmob2: 0 1048657920 linear x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024 x22v0-tp_big_tdata: 0 4295147520 linear x22v0-tp_big_tdata: 4295147520 17869455360 linear x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 
201335807 x22v0-nussknacker2: 0 268615680 thin 58573824 268615679 x22v0-tmob1: 0 1048657920 linear x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 - x22v0-nussknacker1: 0 268615680 thin 74376192 268615679 x22v0-touk--elk4: 0 839024640 linear x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647 x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear x22v0-openvpn--new: 0 134430720 thin 24152064 66272255 x22v0-k8sdkr: 0 268615680 linear x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159 x22v0-rocket: 0 167854080 thin 163602432 167854079 x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear x22v0-roger2: 0 134430720 thin 33014784 134430719 x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647 Also lvs -a with segment ranges: LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113 btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21 btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581 btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16 gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25 gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12 gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44 gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49 gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87 gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86 gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983 [lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34 micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674 nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03 nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35 nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92 nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78 nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84 nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69 nussknacker2 x22v0 
Vwi-aot--- <128.09g tp_big 21.81 openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97 rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47 roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56 tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005 tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272 touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890 tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187 [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476 [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694 [tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34 [tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64 [tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64 [tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29 [tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29 tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486 [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738 [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020 [tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17 [tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28 [tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28 [tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10 [tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10 > >>> Are the LVs split across RAID5 PVs by any chance? > > Michal> raid5s are used as PVs, but a single logical volume always uses one only > Michal> one physical volume underneath (if that's what you meant by split across). > > Ok, that's what I was asking about. It shouldn't matter... but just > trying to chase down the details. > > >>> It's not clear if you can replicate the problem without using >>> lvm-thin, but that's what I suspect you might be having problems with. 
> > Michal> I'll be trying to do that, though the heavier tests will have to wait > Michal> until I move all VMs to other hosts (as that is/was our production machnie). > > Sure, makes sense. > >>> Can you give us the versions of the your tools, and exactly how you >>> setup your test cases? How long does it take to find the problem? Regarding this, currently: kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64) LVM version: 2.03.02(2) (2018-12-18) Library version: 1.02.155 (2018-12-18) Driver version: 4.41.0 mdadm - v4.1 - 2018-10-01 > > Michal> Will get all the details tommorow (the host is on up to date debian > Michal> buster, the VMs are mix of archlinuxes and debians (and the issue > Michal> happened on both)). > > Michal> As for how long, it's a hit and miss. Sometimes writing and reading back > Michal> ~16gb file fails (the cheksum read back differs from what was written) > Michal> after 2-3 tries. That's on the host. > > Michal> On the guest, it's been (so far) a guaranteed thing when we were > Michal> creating very large tar file (900gb+). As for past two weeks we were > Michal> unable to create that file without errors even once. > > Ouch! That's not good. Just to confirm, these corruptions are all in > a thin-lv based filesystem, right? I'd be interested to know if you > can create another plain LV and cause the same error. Trying to > simplify the potential problems. I have been trying to - but so far didn't manage to replicate this with: - a physical partition - filesystem directly on a physical partition - filesystem directly on mdraid - filesystem directly on a linear volume Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - as I also had lengthy period of attempts to cause corruption on some thin volume w/o any successes either. 
But the ones that failed had this in common (so far): md & lvm-thin -
with 4 KiB piece(s) being incorrect.

>
> >>> Can you compile the newest kernel and newest thin tools and try them
> >>> out?
>
> Michal> I can, but a bit later (once we move VMs out of the host).
>
> >>>
> >>> How long does it take to replicate the corruption?
> >>>
>
> Michal> When it happens, it's usually a few tries of writing a 16gb file
> Michal> with random patterns and reading it back (directly on host). The
> Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
> Michal> after machine's reboot).
>
> >>> Sorry for all the questions, but until there's a test case which is
> >>> repeatable, it's going to be hard to chase this down.
> >>>
> >>> I wonder if running 'fio' tests would be something to try?
> >>>
> >>> And also changing your RAID5 setup to use the default stride and
> >>> stripe widths, instead of the large values you're using.
>
> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
> Michal> chunk). LVM on top is using much longer extents (as we don't really need
> Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
> Michal> to raid's stripe.
>
> >>>
> >>> Good luck!
> >>>
> Roger> I have not as of yet seen write corruption (except when a vendor's
> Roger> disks were resetting and lying about having written the data prior to
> Roger> the crash - these were ssds; if your disk write cache is on and you
> Roger> have a disk reset this can also happen), and have not seen "lost
> Roger> writes" otherwise, but would expect the 2 read corruptions I have seen
> Roger> to also be able to cause write issues. So for that, look for scsi
> Roger> notifications for disk resets that should not happen.
> >>>
> Roger> I have had a "bad" controller cause read corruptions; those
> Roger> corruptions would move around, and replacing the controller resolved it,
> Roger> so there may be a lack of error checking "inside" some paths in the
> Roger> card.
Luckily I had a number of these controllers and had cold spares
> Roger> for them. The giveaway here was 2 separate buses with almost
> Roger> identical load, with 6 separate disks each, and all 12 disks on the
> Roger> 2 buses had between 47-52 scsi errors, which points to the only
> Roger> component shared (the controller).
> >>>
> Roger> The backplane and cables are unlikely in general to cause this; there is
> Roger> too much error checking between the controller and the disk from what
> Roger> I know.
> >>>
> Roger> I have had a pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133)
> Roger> cause random read corruptions (lowering the speed to 100 fixed it); this
> Roger> one was duplicated on multiple identical pieces of hw with all
> Roger> different parts on the duplicate machine.
> >>>
> Roger> I have also seen lost writes (from software) because someone did a
> Roger> seek without doing a flush, which in some versions of the libs loses
> Roger> the unfilled block when the seek happens (this is noted in the man
> Roger> page, and I saw it 20 years ago; it is still noted in the man page, so
> Roger> no idea if it was ever fixed). So has more than one application been
> Roger> noted to see the corruption?
> >>>
> Roger> So one question: have you seen the corruption in a path that would
> Roger> rely on one controller, or have all corruptions you have seen involved
> Roger> more than one controller? Isolate and test each controller if you
> Roger> can, or if you can afford to, replace it and see if it continues.
> >>>
> >>>
> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>
>>>>> Note: this is just a general question - if anyone experienced something
>>>>> similar or could suggest how to pinpoint / verify the actual cause.
>>>>>
>>>>> Thanks to btrfs's checksumming we discovered a somewhat (even if quite
>>>>> rare) nasty silent corruption going on on one of our hosts.
>>>>> Or perhaps "corruption" is not the correct word - the files simply have
>>>>> precise 4kb (1 page) of incorrect data. The incorrect pieces of data look
>>>>> fine on their own - as something that was previously in that place, or
>>>>> written from the wrong source.
>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from the DC 3500 series)
>>>>>
>>>>> Raid 5 arrays are in an lvm volume group, and the logical volumes are
>>>>> used by VMs. Some of the volumes are linear, some are using thin-pools
>>>>> (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to
>>>>> 1.5m to match the underlying raid stripe. Everything is cleanly aligned as well.
>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following
>>>>> elements as being the cause:
>>>>>
>>>>> - qemu/kvm (issue occurred directly on host)
>>>>> - backplane (issue occurred on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occurred on plain mdraid configuration as well)
>>>>> - disks themselves (issue occurred on two separate mdadm arrays)
>>>>> - filesystem (issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>
>>>>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>>>>> - mdraid (issue always - so far - on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels,
>>>>>   happened with both - so it would imply a rather longstanding bug somewhere)
>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on a linear lvm volume on top of mdraid
>>>>>
>>>>> As far as the issue goes, it's:
>>>>>
>>>>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>
>>>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere
>>>>> across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid -
>>>>> default 512kb chunks). It does nicely fit a page though ...
>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions what could be happening
>>>>> (perhaps with this particular motherboard or vendor) or how to pinpoint
>>>>> the cause - I'll be grateful for any.
>>>
>
^ permalink raw reply	[flat|nested] 20+ messages in thread
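The alignment described in the thread is easy to sanity-check: with a 4-disk raid5 and mdadm's default 512 KiB chunk, a full stripe holds 3 data chunks, which is where the 1.5 MiB thin-pool chunk size comes from. A quick sketch using only the values given above:

```python
# Sanity-check the alignment described in the thread:
# 4-disk mdraid5, 512 KiB chunk -> 3 data chunks per full stripe.
KIB = 1024
MIB = 1024 * KIB

disks = 4
chunk = 512 * KIB              # mdadm default chunk size
stripe = (disks - 1) * chunk   # data per full stripe (parity excluded)
assert stripe == 1536 * KIB    # = 1.5 MiB, the thin-pool chunk size used

extent = 120 * MIB             # LVM extent size used in the thread
assert extent % stripe == 0    # extents align to full stripes

page = 4 * KIB                 # the corruption granularity observed
assert chunk % page == 0 and stripe % page == 0

print(extent // stripe)        # stripes per extent -> 80
```

As the poster notes, the 4 KiB granularity matches none of these boundaries - only the page size.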
* Re: [general question] rare silent data corruption when writing data
  2020-05-08 11:10 ` [linux-lvm] " Michal Soltys
@ 2020-05-08 16:10   ` John Stoffel
  -1 siblings, 0 replies; 20+ messages in thread
From: John Stoffel @ 2020-05-08 16:10 UTC (permalink / raw)
To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID, linux-lvm, dm-devel

>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:

And of course it should also go to dm-devel@redhat.com, my fault for
not including that as well. I strongly suspect it's a thin-lv problem
somewhere, but I don't know enough to help chase down the problem in
detail.

John

Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.

Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys <msoltyspl@yandex.pl> writes:
>>
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin <rogerheflin@gmail.com> writes:
>>>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>>
>>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>>> made? And does it ever happen with *smaller* files? What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>>
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>>
Michal> Yes, the errors were also found on small files.
>>
>> Those errors are in small files inside the VM, which is running btrfs
>> on top of block storage provided by your thin-lv, right?
>>
Michal> Yea, the small files were in this case on that thin-lv.
Michal> We also discovered (yesterday) file corruptions with the VM hosting
Michal> the gitlab registry - this one was using the same thin-lv underneath,
Michal> but the guest itself was using ext4 (in this case, docker simply
Michal> reported an incorrect sha checksum on (so far) 2 layers).
>>
>>
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption

Michal> Those particular guests, yea. The host case is just w/o the "guest" step.
Michal> But (so far) all corruption ended up going via one of the lv-thin
Michal> layers (and via one of the md raids).
>>
>>
Michal> Since then we recreated the issue directly on the host, just
Michal> by making an ext4 filesystem on some LV, then doing a write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always full 4KiB chunks (always
Michal> same content, always same position).
>>
>> What position? Is it a 4k, 1.5m or some other consistent offset? And
>> how far into the file? And is this LV a plain LV or a thin-lv? I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you. Can you provide the output of:
>>
Michal> What I meant is that it doesn't "move" when verifying the same file
Michal> (aka different reads from the same test file). Between the tests, the
Michal> errors are of course in different places - but it's always some 4KiB
Michal> piece(s) - that look like correct pieces belonging somewhere else.
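The write/checksum/sync/drop_caches/read cycle described above can be scripted as a minimal reproduction loop. A sketch - the target path is a placeholder (point it at a file on the filesystem under test, e.g. one mounted from a thin LV), and the drop_caches step only works as root:

```shell
#!/bin/sh
# Write a random file, checksum it, drop caches, re-read and compare.
# TARGET is a stand-in; set it to a file on the filesystem under test.
TARGET="${TARGET:-./testfile}"
SIZE_MB="${SIZE_MB:-16}"   # the thread used ~16 GB files; small here

dd if=/dev/urandom of="$TARGET" bs=1M count="$SIZE_MB" 2>/dev/null
sum1=$(sha256sum "$TARGET" | cut -d' ' -f1)

sync
# Force re-reads from disk instead of page cache (requires root):
sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || true

sum2=$(sha256sum "$TARGET" | cut -d' ' -f1)
if [ "$sum1" = "$sum2" ]; then
    echo "MATCH"
else
    echo "MISMATCH: $sum1 vs $sum2"
fi
rm -f "$TARGET"
```

Run in a loop (the thread reports failures only every few tries); keeping the written copy around on a known-good device lets you diff the exact bad pages afterwards.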
>> /sbin/lvs --version

Michal> LVM version:     2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version:  4.41.0
Michal> Configuration:   ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
>>
>> too?
>>
>> Can you post your:
>>
>> /sbin/dmsetup status
>>
>> output too? There's a better command to use here, but I'm not an
>> expert. You might really want to copy this over to the
>> linux-lvm@redhat.com mailing list as well.
Michal> x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024
Michal> x22v0-tp_ssd_tdata: 0 2147696640 linear
Michal> x22v0-tp_ssd_tdata: 2147696640 429588480 linear
Michal> x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear
Michal> x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 -
Michal> x22v0-gerrit--new: 0 268615680 thin 255510528 268459007
Michal> x22v0-btrfsnopool: 0 134430720 linear
Michal> x22v0-gitlab_root: 0 629145600 thin 628291584 629145599
Michal> x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear
Michal> x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159
Michal> x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639
Michal> x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679
Michal> x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear
Michal> x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear
Michal> x22v0-micron_vol: 0 268615680 linear
Michal> x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear
Michal> x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear
Michal> x22v0-gerrit--root: 0 268615680 thin 103388160 268443647
Michal> x22v0-btrfs_ssd_linear: 0 268615680 linear
Michal> x22v0-btrfstest: 0 268615680 thin 40734720 268615679
Michal> x22v0-tp_ssd: 0 2577285120 linear
Michal> x22v0-tp_big: 0 22164602880 linear
Michal> x22v0-nexus3_root: 0 167854080 thin 21860352 167854079
Michal> x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679
Michal> x22v0-tmob2: 0 1048657920 linear
Michal> x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024
Michal> x22v0-tp_big_tdata: 0 4295147520 linear
Michal> x22v0-tp_big_tdata: 4295147520 17869455360 linear
Michal> x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807
Michal> x22v0-nussknacker2: 0 268615680 thin 58573824 268615679
Michal> x22v0-tmob1: 0 1048657920 linear
Michal> x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 -
Michal> x22v0-nussknacker1: 0 268615680 thin 74376192 268615679
Michal> x22v0-touk--elk4: 0 839024640 linear
Michal> x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647
Michal> x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear
Michal> x22v0-openvpn--new: 0 134430720 thin 24152064 66272255
Michal> x22v0-k8sdkr: 0 268615680 linear
Michal> x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159
Michal> x22v0-rocket: 0 167854080 thin 163602432 167854079
Michal> x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear
Michal> x22v0-roger2: 0 134430720 thin 33014784 134430719
Michal> x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647

Michal> Also lvs -a with segment ranges:

Michal> LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges
Michal> btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113
Michal> btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21
Michal> btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581
Michal> btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16
Michal> gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25
Michal> gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12
Michal> gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44
Michal> gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49
Michal> gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87
Michal> gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg
Michal> gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86
Michal> gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root
Michal> k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983
Michal> [lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34
Michal> micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674
Michal> nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03
Michal> nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35
Michal> nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92
Michal> nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78
Michal> nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84
Michal> nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69
Michal> nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81
Michal> openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97
Michal> rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47
Michal> roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56
Michal> tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005
Michal> tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272
Michal> touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890
Michal> tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187
Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476
Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694
Michal> [tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34
Michal> [tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64
Michal> [tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64
Michal> [tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29
Michal> [tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29
Michal> tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486
Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738
Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020
Michal> [tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17
Michal> [tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28
Michal> [tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28
Michal> [tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10
Michal> [tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
>>
>> >>>> Are the LVs split across RAID5 PVs by any chance?
>>
Michal> raid5s are used as PVs, but a single logical volume always uses
Michal> only one physical volume underneath (if that's what you meant by split across).
>>
>> Ok, that's what I was asking about. It shouldn't matter...
but just
>> trying to chase down the details.
>>
>>
>>>> It's not clear if you can replicate the problem without using
>>>> lvm-thin, but that's what I suspect you might be having problems with.
>>
Michal> I'll be trying to do that, though the heavier tests will have to wait
Michal> until I move all VMs to other hosts (as that is/was our production machine).
>>
>> Sure, makes sense.
>>
>>>> Can you give us the versions of your tools, and exactly how you
>>>> setup your test cases? How long does it take to find the problem?

Michal> Regarding this, currently:

Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux
Michal> (was also happening with 5.2.0-0.bpo.3-amd64)

Michal> LVM version:     2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version:  4.41.0

Michal> mdadm - v4.1 - 2018-10-01
>>
Michal> Will get all the details tomorrow (the host is on up to date debian
Michal> buster, the VMs are a mix of archlinuxes and debians (and the issue
Michal> happened on both)).
>>
Michal> As for how long, it's hit and miss. Sometimes writing and reading back
Michal> a ~16gb file fails (the checksum read back differs from what was written)
Michal> after 2-3 tries. That's on the host.
>>
Michal> On the guest, it's been (so far) a guaranteed thing when we were
Michal> creating a very large tar file (900gb+). For the past two weeks we were
Michal> unable to create that file without errors even once.
>>
>> Ouch! That's not good. Just to confirm, these corruptions are all in
>> a thin-lv based filesystem, right? I'd be interested to know if you
>> can create another plain LV and cause the same error. Trying to
>> simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with:

Michal> - a physical partition
Michal> - a filesystem directly on a physical partition
Michal> - a filesystem directly on mdraid
Michal> - a filesystem directly on a linear volume

Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin
Michal> is in use - I also had lengthy periods of attempts to cause corruption
Michal> on some thin volume w/o any success either.

But the ones that failed had this in common (so far): md & lvm-thin -
with 4 KiB piece(s) being incorrect.
>>
>>
>>>> Can you compile the newest kernel and newest thin tools and try them
>>>> out?
>>
Michal> I can, but a bit later (once we move VMs out of the host).
>>
>>>>
>>>> How long does it take to replicate the corruption?
>>>>
>>
Michal> When it happens, it's usually a few tries of writing a 16gb file
Michal> with random patterns and reading it back (directly on host). The
Michal> irritating thing is that it can be somewhat hard to reproduce (e.g.
Michal> after machine's reboot).
>>
>>>> Sorry for all the questions, but until there's a test case which is
>>>> repeatable, it's going to be hard to chase this down.
>>>>
>>>> I wonder if running 'fio' tests would be something to try?
>>>>
>>>> And also changing your RAID5 setup to use the default stride and
>>>> stripe widths, instead of the large values you're using.
>>
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a
Michal> chunk). LVM on top is using much longer extents (as we don't really need
Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align)
Michal> to raid's stripe.
>>
>>>>
>>>> Good luck!
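One way to look for a pattern in where the bad 4 KiB pieces land is to map each bad offset onto the boundaries of the layers involved. A sketch using the geometry from this thread (the example offsets are hypothetical; feed in real ones from a file comparison):

```python
# Map a byte offset onto the stack's boundaries, to see whether the
# corrupted 4 KiB pages cluster at chunk/stripe edges.
KIB = 1024

RAID_CHUNK = 512 * KIB         # mdadm default chunk
RAID_STRIPE = 3 * RAID_CHUNK   # 4-disk raid5 -> 3 data chunks
THIN_CHUNK = 1536 * KIB        # lvm-thin chunk (matches the stripe)
PAGE = 4 * KIB

def locate(offset):
    """Return the page index of `offset` within each layer's unit."""
    return {
        "page_in_raid_chunk": (offset % RAID_CHUNK) // PAGE,
        "page_in_stripe": (offset % RAID_STRIPE) // PAGE,
        "page_in_thin_chunk": (offset % THIN_CHUNK) // PAGE,
    }

# Hypothetical bad-page offsets for illustration:
for off in (7 * RAID_STRIPE, 7 * RAID_STRIPE + 500 * PAGE):
    print(off, locate(off))
```

If the bad pages always sit at, say, the first page of a thin chunk or a raid chunk boundary, that points a finger at one specific layer; a uniform spread says less.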
>>>>
Roger> I have not as of yet seen write corruption (except when a vendor's
Roger> disks were resetting and lying about having written the data prior to
Roger> the crash - these were ssds; if your disk write cache is on and you
Roger> have a disk reset this can also happen), and have not seen "lost
Roger> writes" otherwise, but would expect the 2 read corruptions I have seen
Roger> to also be able to cause write issues. So for that, look for scsi
Roger> notifications for disk resets that should not happen.
>>>>
Roger> I have had a "bad" controller cause read corruptions; those
Roger> corruptions would move around, and replacing the controller resolved it,
Roger> so there may be a lack of error checking "inside" some paths in the
Roger> card. Luckily I had a number of these controllers and had cold spares
Roger> for them. The giveaway here was 2 separate buses with almost
Roger> identical load, with 6 separate disks each, and all 12 disks on the
Roger> 2 buses had between 47-52 scsi errors, which points to the only
Roger> component shared (the controller).
>>>>
Roger> The backplane and cables are unlikely in general to cause this; there is
Roger> too much error checking between the controller and the disk from what
Roger> I know.
>>>>
Roger> I have had a pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133)
Roger> cause random read corruptions (lowering the speed to 100 fixed it); this
Roger> one was duplicated on multiple identical pieces of hw with all
Roger> different parts on the duplicate machine.
>>>>
Roger> I have also seen lost writes (from software) because someone did a
Roger> seek without doing a flush, which in some versions of the libs loses
Roger> the unfilled block when the seek happens (this is noted in the man
Roger> page, and I saw it 20 years ago; it is still noted in the man page, so
Roger> no idea if it was ever fixed). So has more than one application been
Roger> noted to see the corruption?
>>>>
Roger> So one question: have you seen the corruption in a path that would
Roger> rely on one controller, or have all corruptions you have seen involved
Roger> more than one controller? Isolate and test each controller if you
Roger> can, or if you can afford to, replace it and see if it continues.
>>>>
>>>>
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>>>
>>>>> Note: this is just a general question - if anyone experienced something
>>>>> similar or could suggest how to pinpoint / verify the actual cause.
>>>>>>
>>>>> Thanks to btrfs's checksumming we discovered a somewhat (even if quite
>>>>> rare) nasty silent corruption going on on one of our hosts. Or perhaps
>>>>> "corruption" is not the correct word - the files simply have precise 4kb
>>>>> (1 page) of incorrect data. The incorrect pieces of data look fine on
>>>>> their own - as something that was previously in that place, or written
>>>>> from the wrong source.
>>>>>>
>>>>> The hardware is (can provide more detailed info of course):
>>>>>>
>>>>> - Supermicro X9DR7-LN4F
>>>>> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane)
>>>>> - 96 gb ram (ecc)
>>>>> - 24 disk backplane
>>>>>>
>>>>> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk)
>>>>> - 1 array on the backplane (4 disks, mdraid5, journaled)
>>>>> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks)
>>>>> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from the DC 3500 series)
>>>>>>
>>>>> Raid 5 arrays are in an lvm volume group, and the logical volumes are
>>>>> used by VMs. Some of the volumes are linear, some are using thin-pools
>>>>> (with metadata on the aforementioned intel ssds, in mirrored config). LVM
>>>>> uses large extent sizes (120m) and the chunk-size of thin-pools is set to
>>>>> 1.5m to match the underlying raid stripe. Everything is cleanly aligned as well.
>>>>>>
>>>>> With a dose of testing we managed to roughly rule out the following
>>>>> elements as being the cause:
>>>>>>
>>>>> - qemu/kvm (issue occurred directly on host)
>>>>> - backplane (issue occurred on disks directly connected via LSI's 2nd connector)
>>>>> - cable (as above, two different cables)
>>>>> - memory (unlikely - ECC for one, thoroughly tested, no errors ever reported via edac-util or memtest)
>>>>> - mdadm journaling (issue occurred on plain mdraid configuration as well)
>>>>> - disks themselves (issue occurred on two separate mdadm arrays)
>>>>> - filesystem (issue occurred on both btrfs and ext4 (checksummed manually))
>>>>>>
>>>>> We did not manage to rule out (though somewhat _highly_ unlikely):
>>>>>>
>>>>> - lvm thin (issue always - so far - occurred on lvm thin pools)
>>>>> - mdraid (issue always - so far - on mdraid managed arrays)
>>>>> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels,
>>>>>   happened with both - so it would imply a rather longstanding bug somewhere)
>>>>>>
>>>>> And finally - so far - the issue never occurred:
>>>>>>
>>>>> - directly on a disk
>>>>> - directly on mdraid
>>>>> - on a linear lvm volume on top of mdraid
>>>>>>
>>>>> As far as the issue goes, it's:
>>>>>>
>>>>> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to a few dozen such chunks
>>>>> - we also found (or rather btrfs scrub did) a few small damaged files as well
>>>>> - the chunks look like a correct piece of different or previous data
>>>>>>
>>>>> The 4kb is, well, weird? It doesn't match any chunk/stripe sizes anywhere
>>>>> across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid -
>>>>> default 512kb chunks). It does nicely fit a page though ...
>>>>>>
>>>>> Anyway, if anyone has any ideas or suggestions what could be happening
>>>>> (perhaps with this particular motherboard or vendor) or how to pinpoint
>>>>> the cause - I'll be grateful for any.
>>>>
>>
^ permalink raw reply	[flat|nested] 20+ messages in thread
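Since the mismatches reported in the thread are always whole 4 KiB pages, a small helper that diffs the written and re-read copies page by page pinpoints exactly which offsets differ. A sketch (the demo filenames and data are placeholders; in real use, compare the file written to a known-good device against the copy read back from the suspect stack):

```python
# Compare two files in 4096-byte pages and report offsets that differ.
import os
import tempfile

PAGE = 4096

def diff_pages(path_a, path_b):
    """Yield byte offsets of 4 KiB pages that differ between two files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(PAGE)
            b = fb.read(PAGE)
            if not a and not b:
                break
            if a != b:
                yield offset
            offset += PAGE

# Demo on two small synthetic files (stand-ins for written vs re-read copies):
with tempfile.TemporaryDirectory() as d:
    pa, pb = os.path.join(d, "a"), os.path.join(d, "b")
    data = bytes(range(256)) * 64          # 16 KiB = 4 pages
    with open(pa, "wb") as f:
        f.write(data)
    bad = bytearray(data)
    bad[2 * PAGE] ^= 0xFF                  # corrupt one byte in page 2
    with open(pb, "wb") as f:
        f.write(bytes(bad))
    result = list(diff_pages(pa, pb))
    print(result)                          # -> [8192]
```

Hex-dumping the differing pages from both copies (e.g. with `cmp -l` or `xxd`) is what lets you recognize, as the poster did, that the bad page is valid-looking data from somewhere else rather than garbage.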
<128.09g tp_big 99.84 Michal> nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69 Michal> nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81 Michal> openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97 Michal> rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47 Michal> roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56 Michal> tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005 Michal> tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272 Michal> touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890 Michal> tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187 Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476 Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694 Michal> [tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34 Michal> [tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64 Michal> [tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64 Michal> [tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29 Michal> [tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29 Michal> tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486 Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738 Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020 Michal> [tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17 Michal> [tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28 Michal> [tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28 Michal> [tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10 Michal> [tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10 >> >>>> Are the LVs split across RAID5 PVs by any chance? >> Michal> raid5s are used as PVs, but a single logical volume always uses one only Michal> one physical volume underneath (if that's what you meant by split across). >> >> Ok, that's what I was asking about. It shouldn't matter... 
but just >> trying to chase down the details. >> >> >>>> It's not clear if you can replicate the problem without using >>>> lvm-thin, but that's what I suspect you might be having problems with. >> Michal> I'll be trying to do that, though the heavier tests will have to wait Michal> until I move all VMs to other hosts (as that is/was our production machine). >> >> Sure, makes sense. >> >>>> Can you give us the versions of your tools, and exactly how you >>>> set up your test cases? How long does it take to find the problem? Michal> Regarding this, currently: Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64) Michal> LVM version: 2.03.02(2) (2018-12-18) Michal> Library version: 1.02.155 (2018-12-18) Michal> Driver version: 4.41.0 Michal> mdadm - v4.1 - 2018-10-01 >> Michal> Will get all the details tomorrow (the host is on up-to-date debian Michal> buster, the VMs are a mix of archlinuxes and debians (and the issue Michal> happened on both)). >> Michal> As for how long, it's hit and miss. Sometimes writing and reading back Michal> a ~16gb file fails (the checksum read back differs from what was written) Michal> after 2-3 tries. That's on the host. >> Michal> On the guest, it's been (so far) a guaranteed thing when we were Michal> creating a very large tar file (900gb+). For the past two weeks we were Michal> unable to create that file without errors even once. >> >> Ouch! That's not good. Just to confirm, these corruptions are all in >> a thin-lv based filesystem, right? I'd be interested to know if you >> can create another plain LV and cause the same error. Trying to >> simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with: Michal> - a physical partition Michal> - a filesystem directly on a physical partition Michal> - a filesystem directly on mdraid Michal> - a filesystem directly on a linear volume Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - I also had a lengthy period of attempts to cause corruption on some thin volume w/o any successes. But the runs that failed had these in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect >> >> >>>> Can you compile the newest kernel and newest thin tools and try them >>>> out? >> Michal> I can, but a bit later (once we move VMs out of the host). >> >>>> >>>> How long does it take to replicate the corruption? >>>> >> Michal> When it happens, it's usually a few tries of writing a 16gb file Michal> with random patterns and reading it back (directly on host). The Michal> irritating thing is that it can be somewhat hard to reproduce (e.g. Michal> after the machine's reboot). >> >>>> Sorry for all the questions, but until there's a test case which is >>>> repeatable, it's going to be hard to chase this down. >>>> >>>> I wonder if running 'fio' tests would be something to try? >>>> >>>> And also changing your RAID5 setup to use the default stride and >>>> stripe widths, instead of the large values you're using. >> Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a Michal> chunk). LVM on top is using much longer extents (as we don't really need Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align) Michal> to the raid's stripe. >> >>>> >>>> Good luck!
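When a mismatch does show up, the byte offsets reported by cmp -l can be folded into 4 KiB block numbers to confirm the "whole page" pattern described above. A small sketch - the two demo files here are hypothetical stand-ins for the written and read-back copies:

```shell
# Build a pair of demo files that differ inside exactly one 4 KiB block,
# then map cmp's 1-based byte offsets onto 4096-byte block numbers.
dd if=/dev/zero of=good.img bs=4096 count=4 2>/dev/null
cp good.img bad.img
printf 'XX' | dd of=bad.img bs=1 seek=$((4096 * 2 + 10)) conv=notrunc 2>/dev/null

cmp -l good.img bad.img \
  | awk '{ blk = int(($1 - 1) / 4096); count[blk]++ }
         END { for (b in count) printf "block %d: %d differing bytes\n", b, count[b] }'
# -> block 2: 2 differing bytes
```

On a real corrupted file the same pipeline tells you immediately whether every differing byte falls inside a handful of complete 4 KiB blocks, and at which offsets, without pulling both files into memory.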
>>>> Roger> I have not as of yet seen write corruption (except when a vendor's disk Roger> was resetting and it was lying about having written the data prior to Roger> the crash - these were ssds; if your disk write cache is on and you Roger> have a disk reset this can also happen). I have not seen "lost Roger> writes" otherwise, but would expect the 2 read corruptions I have seen Roger> to also be able to cause write issues. So for that, look for scsi Roger> notifications for disk resets that should not happen. >>>> Roger> I have had a "bad" controller cause read corruptions; those Roger> corruptions would move around, and replacing the controller resolved it, Roger> so there may be a lack of error checking "inside" some paths in the Roger> card. Luckily I had a number of these controllers and had cold spares Roger> for them. The giveaway here was 2 separate buses with almost Roger> identical load, with 6 separate disks each, and all 12 disks on the 2 buses Roger> had between 47-52 scsi errors - which points to the only component Roger> shared (the controller). >>>> Roger> The backplane and cables are unlikely in general to cause this; there is Roger> too much error checking between the controller and the disk from what Roger> I know. >>>> Roger> I have had a pre-pcie bus (PCI-X, 2 slots shared, both set to 133) Roger> cause random read corruptions (lowering the speed to 100 fixed it); this Roger> one was duplicated on multiple identical pieces of hw with all Roger> different parts on the duplication machine. >>>> Roger> I have also seen lost writes (from software) because someone did a Roger> seek without doing a flush, which in some versions of the libs loses Roger> the unfilled block when the seek happens (this is noted in the man Roger> page; I saw it 20 years ago and it is still noted in the man page, so Roger> no idea if it was ever fixed). So has more than one application been Roger> noted to see the corruption?
>>>> Roger> So one question, have you seen the corruption in a path that would Roger> rely on one controller, or all corruptions you have seen involving Roger> more than one controller? Isolate and test each controller if you Roger> can, or if you can afford to replace it and see if it continues. >>>> >>>> Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote: >>>>> [original message quoted in full - snipped] >> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data 2020-05-07 22:33 ` Michal Soltys 2020-05-08 0:54 ` John Stoffel @ 2020-05-08 3:44 ` Chris Murphy 2020-05-10 19:05 ` Sarah Newman 2020-05-20 21:40 ` Michal Soltys 1 sibling, 2 replies; 20+ messages in thread From: Chris Murphy @ 2020-05-08 3:44 UTC (permalink / raw) To: Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID On Thu, May 7, 2020 at 4:34 PM Michal Soltys <msoltyspl@yandex.pl> wrote: > Since then we recreated the issue directly on the host, just by making > ext4 filesystem on some LV, then doing write with checksum, sync, > drop_caches, read and check checksum. The errors are, as I mentioned - > always a full 4KiB chunks (always same content, always same position). The 4KiB chunk. What are the contents? Is it definitely guest VM data? Or is it sometimes file system metadata? How many corruptions have happened? The file system metadata is quite small compared to data. But if there have been many errors, we'd expect if it's caused on the host, that eventually file system metadata is corrupted. If it's definitely only data, that's curious and maybe implicates something going on in the guest. Btrfs, whether normal reads or scrubs, will report the path to the affected file, for data corruption. Metadata corruption errors sometimes have inode references, but not a path to a file. > > > > Are the LVs split across RAID5 PVs by any chance? > > raid5s are used as PVs, but a single logical volume always uses one only > one physical volume underneath (if that's what you meant by split across). It might be a bit suboptimal. A single 4KiB block write in the guest, turns into a 4KiB block write in the host's LV. That in turn trickles down to md, which has a 512KiB x 4 drive stripe. So a single 4KiB write translates into a 2M stripe write. 
There is an optimization for raid5 in the RMW case, where it should be true that only 4KiB of data plus 4KiB of parity is written (a partial strip/chunk write); I'm not sure about reads. > > It's not clear if you can replicate the problem without using > > lvm-thin, but that's what I suspect you might be having problems with. > > > > I'll be trying to do that, though the heavier tests will have to wait > until I move all VMs to other hosts (as that is/was our production machine). Btrfs by default uses a 16KiB block size for leaves and nodes. It's still a tiny footprint compared to data writes, but if LVM thin is a suspect, it really should just be a matter of time before file system corruption happens. If it doesn't, that's useful information. It probably means it's not LVM thin. But then what? > As for how long, it's hit and miss. Sometimes writing and reading back > a ~16gb file fails (the checksum read back differs from what was written) > after 2-3 tries. That's on the host. > > On the guest, it's been (so far) a guaranteed thing when we were > creating a very large tar file (900gb+). For the past two weeks we were > unable to create that file without errors even once. It's very useful to have a consistent reproducer. You can do metadata only writes on Btrfs by doing multiple back-to-back metadata-only balances. If the problem really is in the write path somewhere, this would eventually corrupt the metadata - it would be detected during any subsequent balance or scrub. 'btrfs balance start -musage=100 /mountpoint' will do it. This reproducer. It only reproduces in the guest VM? If you do it in the host, otherwise exactly the same way with all the exact same versions of everything, and it does not reproduce? > > > > > Can you compile the newest kernel and newest thin tools and try them > out? > > I can, but a bit later (once we move VMs out of the host). > > > > > How long does it take to replicate the corruption? 
> > > > When it happens, it's usually a few tries of writing a 16gb file > with random patterns and reading it back (directly on host). The > irritating thing is that it can be somewhat hard to reproduce (e.g. > after the machine's reboot). Reading it back on the host. So you've shut down the VM, and you're mounting what was the guest VM's backing disk on the host to do the verification. There's never a case of concurrent usage between guest and host? > > > Sorry for all the questions, but until there's a test case which is > > repeatable, it's going to be hard to chase this down. > > > > I wonder if running 'fio' tests would be something to try? > > > > And also changing your RAID5 setup to use the default stride and > > stripe widths, instead of the large values you're using. > > The raid5 is using mdadm's defaults (which is 512 KiB these days for a > chunk). LVM on top is using much longer extents (as we don't really need > 4mb granularity) and the lvm-thin chunks were set to match (and align) > to the raid's stripe. I would change very little until you track this down, if the goal is to track it down and get it fixed. I'm not sure if LVM thinp is supported with LVM raid yet; if it's not, then I can understand using mdadm raid5 instead of LVM raid5. -- Chris Murphy ^ permalink raw reply [flat|nested] 20+ messages in thread
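Chris's suggestion of back-to-back metadata-only balances can be wrapped in a small loop - a sketch under the assumption that a btrfs filesystem is mounted at $MNT (a hypothetical path); it skips cleanly when none is:

```shell
#!/bin/sh
# Sketch: stress the metadata write path with repeated metadata-only
# balances, then scrub to look for checksum errors. MNT is an assumption.
MNT="${MNT:-/mnt/btrfs}"
if btrfs filesystem df "$MNT" >/dev/null 2>&1; then
    i=1
    while [ "$i" -le 10 ]; do
        btrfs balance start -musage=100 "$MNT"
        i=$((i + 1))
    done
    btrfs scrub start -Bd "$MNT"   # -B: wait for completion, -d: per-device stats
    result=ran
else
    result=skipped   # no btrfs at $MNT (or btrfs-progs not installed)
fi
echo "metadata stress: $result"
```

A scrub reporting checksum errors after this, on a pool that only saw metadata rewrites, would implicate the write path rather than guest data.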
* Re: [general question] rare silent data corruption when writing data 2020-05-08 3:44 ` Chris Murphy @ 2020-05-10 19:05 ` Sarah Newman 2020-05-10 19:12 ` Sarah Newman 2020-05-20 21:40 ` Michal Soltys 1 sibling, 1 reply; 20+ messages in thread From: Sarah Newman @ 2020-05-10 19:05 UTC (permalink / raw) To: Chris Murphy, Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID On 5/7/20 8:44 PM, Chris Murphy wrote: > > I would change very little until you track this down, if the goal is > to track it down and get it fixed. > > I'm not sure if LVM thinp is supported with LVM raid still, which if > it's not supported yet then I can understand using mdadm raid5 instead > of LVM raid5. My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly the size of a page sounds like a memory use-after-free bug or similar. A debug kernel build with one or more of these options may find the problem: CONFIG_DEBUG_PAGEALLOC CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT CONFIG_PAGE_POISONING + page_poison=1 CONFIG_KASAN --Sarah ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data 2020-05-10 19:05 ` Sarah Newman @ 2020-05-10 19:12 ` Sarah Newman 2020-05-11 9:41 ` Michal Soltys 0 siblings, 1 reply; 20+ messages in thread From: Sarah Newman @ 2020-05-10 19:12 UTC (permalink / raw) To: Chris Murphy, Michal Soltys; +Cc: John Stoffel, Roger Heflin, Linux RAID On 5/10/20 12:05 PM, Sarah Newman wrote: > On 5/7/20 8:44 PM, Chris Murphy wrote: >> >> I would change very little until you track this down, if the goal is >> to track it down and get it fixed. >> >> I'm not sure if LVM thinp is supported with LVM raid still, which if >> it's not supported yet then I can understand using mdadm raid5 instead >> of LVM raid5. > > > My apologies if this idea was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly > the size of a page sounds like a memory use-after-free bug or similar. > > A debug kernel build with one or more of these options may find the problem: > > CONFIG_DEBUG_PAGEALLOC > CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT > CONFIG_PAGE_POISONING + page_poison=1 > CONFIG_KASAN > > --Sarah And on further reflection you may as well add these: CONFIG_DEBUG_OBJECTS CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT CONFIG_CRASH_DUMP (kdump) + anything else available. Basically turn debugging on all the way. If you can reproduce reliably with these, then you can try the latest kernel with the same options and have some confidence the problem was legitimately fixed. --Sarah ^ permalink raw reply [flat|nested] 20+ messages in thread
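Before building a debug kernel it can be worth checking which of these options the running kernel already has. A small sketch that reads /proc/config.gz or the distro's /boot config (the option list is copied from Sarah's suggestions; the config paths are the usual locations, not guaranteed on every distro):

```shell
#!/bin/sh
# Report the state of the suggested debug options in the running kernel's
# config, if a config source is available at all.
cfg="/boot/config-$(uname -r)"
checked=0
for opt in DEBUG_PAGEALLOC PAGE_POISONING KASAN DEBUG_OBJECTS CRASH_DUMP; do
    if [ -r /proc/config.gz ]; then
        val=$(zcat /proc/config.gz | grep "^CONFIG_${opt}=" || true)
    elif [ -r "$cfg" ]; then
        val=$(grep "^CONFIG_${opt}=" "$cfg" || true)
    else
        val=""
    fi
    echo "CONFIG_${opt}: ${val:-not set, or no config available}"
    checked=$((checked + 1))
done
```

Note that CONFIG_PAGE_POISONING additionally needs page_poison=1 on the kernel command line to actually take effect, as Sarah's list already indicates.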
* Re: [general question] rare silent data corruption when writing data 2020-05-10 19:12 ` Sarah Newman @ 2020-05-11 9:41 ` Michal Soltys 2020-05-11 19:42 ` Sarah Newman 0 siblings, 1 reply; 20+ messages in thread From: Michal Soltys @ 2020-05-11 9:41 UTC (permalink / raw) To: Sarah Newman, Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID On 5/10/20 9:12 PM, Sarah Newman wrote: > On 5/10/20 12:05 PM, Sarah Newman wrote: >> On 5/7/20 8:44 PM, Chris Murphy wrote: >>> >>> I would change very little until you track this down, if the goal is >>> to track it down and get it fixed. >>> >>> I'm not sure if LVM thinp is supported with LVM raid still, which if >>> it's not supported yet then I can understand using mdadm raid5 instead >>> of LVM raid5. >> >> >> My apologies if this ideas was considered and discarded already, but >> the bug being hard to reproduce right after reboot and the error being >> exactly the size of a page sounds like a memory use after free bug or >> similar. >> >> A debug kernel build with one or more of these options may find the >> problem: >> >> CONFIG_DEBUG_PAGEALLOC >> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT >> CONFIG_PAGE_POISONING + page_poison=1 >> CONFIG_KASAN >> >> --Sarah > > And on further reflection you may as well add these: > > CONFIG_DEBUG_OBJECTS > CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT > CONFIG_CRASH_DUMP (kdump) > > + anything else available. Basically turn debugging on all the way. > > If you can reproduce reliably with these, then you can try the latest > kernel with the same options and have some confidence the problem was > legitimately fixed. > After compiling the kernel with above options enabled - and if this is the underlying issue as you suspect - will it just pop in dmesg if I hit this bug, or do I need some extra tools/preparation/etc. ? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data 2020-05-11 9:41 ` Michal Soltys @ 2020-05-11 19:42 ` Sarah Newman 0 siblings, 0 replies; 20+ messages in thread From: Sarah Newman @ 2020-05-11 19:42 UTC (permalink / raw) To: Michal Soltys, Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID On 5/11/20 2:41 AM, Michal Soltys wrote: > On 5/10/20 9:12 PM, Sarah Newman wrote: >> On 5/10/20 12:05 PM, Sarah Newman wrote: >>> On 5/7/20 8:44 PM, Chris Murphy wrote: >>>> >>>> I would change very little until you track this down, if the goal is >>>> to track it down and get it fixed. >>>> >>>> I'm not sure if LVM thinp is supported with LVM raid still, which if >>>> it's not supported yet then I can understand using mdadm raid5 instead >>>> of LVM raid5. >>> >>> >>> My apologies if this ideas was considered and discarded already, but the bug being hard to reproduce right after reboot and the error being exactly >>> the size of a page sounds like a memory use after free bug or similar. >>> >>> A debug kernel build with one or more of these options may find the problem: >>> >>> CONFIG_DEBUG_PAGEALLOC >>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT >>> CONFIG_PAGE_POISONING + page_poison=1 >>> CONFIG_KASAN >>> >>> --Sarah >> >> And on further reflection you may as well add these: >> >> CONFIG_DEBUG_OBJECTS >> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT >> CONFIG_CRASH_DUMP (kdump) >> >> + anything else available. Basically turn debugging on all the way. >> >> If you can reproduce reliably with these, then you can try the latest kernel with the same options and have some confidence the problem was >> legitimately fixed. >> > > After compiling the kernel with above options enabled - and if this is the underlying issue as you suspect - will it just pop in dmesg if I hit this > bug, or do I need some extra tools/preparation/etc. ? > I'm pretty sure that you can get everything you need from either dmesg or sysfs/debugfs. Be prepared for an oops or panic. 
--Sarah ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data 2020-05-08 3:44 ` Chris Murphy 2020-05-10 19:05 ` Sarah Newman @ 2020-05-20 21:40 ` Michal Soltys 1 sibling, 0 replies; 20+ messages in thread From: Michal Soltys @ 2020-05-20 21:40 UTC (permalink / raw) To: Chris Murphy; +Cc: John Stoffel, Roger Heflin, Linux RAID Sorry for the delayed reply, have had rather busy weeks. On 20/05/08 05:44, Chris Murphy wrote: > > The 4KiB chunk. What are the contents? Is it definitely guest VM data? > Or is it sometimes file system metadata? How many corruptions have > happened? The file system metadata is quite small compared to data. I haven't looked that precisely (and it would be hard to tell in quite a few cases) - but I'll keep that in mind when I resume chasing this bug. > But if there have been many errors, we'd expect if it's caused on the > host, that eventually file system metadata is corrupted. If it's > definitely only data, that's curious and maybe implicates something > going on in the guest. As far as metadata goes, so far I haven't seen those - as far as e2fsck on ext4 and btrfs scrub on btrfs could tell. Though in the ext4 case I haven't run it that many times - so good point, I'll include fsck after each round. > > Btrfs, whether normal reads or scrubs, will report the path to the > affected file, for data corruption. Metadata corruption errors > sometimes have inode references, but not a path to a file. > Btrfs pointed to file contents only, so far. > >> > >> > Are the LVs split across RAID5 PVs by any chance? >> >> raid5s are used as PVs, but a single logical volume always uses only >> one physical volume underneath (if that's what you meant by split across). > > It might be a bit suboptimal. A single 4KiB block write in the guest > turns into a 4KiB block write in the host's LV. That in turn trickles > down to md, which has a 512KiB x 4 drive stripe. So a single 4KiB > write translates into a 2M stripe write. 
There is an optimization for > raid5 in the RMW case, where it should be true that only 4KiB of data plus > 4KiB of parity is written (a partial strip/chunk write); I'm not sure about > reads. Well, I didn't play with the current defaults too much - aside from a large stripe_cache_size + the raid running under a 2x ssd write-back journal (which unfortunately became another issue - there is another thread where I'm chasing that bug). > >> > It's not clear if you can replicate the problem without using >> > lvm-thin, but that's what I suspect you might be having problems with. >> > >> >> I'll be trying to do that, though the heavier tests will have to wait >> until I move all VMs to other hosts (as that is/was our production machine). > > Btrfs by default uses a 16KiB block size for leaves and nodes. It's > still a tiny footprint compared to data writes, but if LVM thin is a > suspect, it really should just be a matter of time before file system > corruption happens. If it doesn't, that's useful information. It > probably means it's not LVM thin. But then what? > >> As for how long, it's hit and miss. Sometimes writing and reading back >> a ~16gb file fails (the checksum read back differs from what was written) >> after 2-3 tries. That's on the host. >> >> On the guest, it's been (so far) a guaranteed thing when we were >> creating a very large tar file (900gb+). For the past two weeks we were >> unable to create that file without errors even once. > > It's very useful to have a consistent reproducer. You can do metadata > only writes on Btrfs by doing multiple back-to-back metadata-only > balances. If the problem really is in the write path somewhere, this > would eventually corrupt the metadata - it would be detected during > any subsequent balance or scrub. 'btrfs balance start -musage=100 > /mountpoint' will do it. Will do that too. > > This reproducer. 
If you do it in > the host, otherwise exactly the same way with all the exact same > versions of everything, and it does not reproduce? > I did reproduce the issue on the host (both in ext4 and btrfs). The host has slightly different versions of kernel and tools, but otherwise the same stuff as one of the guests in which I was testing it. >> >> > >> > Can you compile the newest kernel and newest thin tools and try them >> > out? >> >> I can, but a bit later (once we move VMs out of the host). >> >> > >> > How long does it take to replicate the corruption? >> > >> >> When it happens, it's usually a few tries of writing a 16gb file >> with random patterns and reading it back (directly on host). The >> irritating thing is that it can be somewhat hard to reproduce (e.g. >> after the machine's reboot). > > Reading it back on the host. So you've shut down the VM, and you're > mounting what was the guest VM's backing disk on the host to do the > verification. There's never a case of concurrent usage between guest > and host? The host tests were on fresh filesystems on fresh lvm volumes (and I hit the issue on two different thin pools). The issue was also reproduced on the host when all guests were turned off. > > >> >> > Sorry for all the questions, but until there's a test case which is >> > repeatable, it's going to be hard to chase this down. >> > >> > I wonder if running 'fio' tests would be something to try? >> > >> > And also changing your RAID5 setup to use the default stride and >> > stripe widths, instead of the large values you're using. >> >> The raid5 is using mdadm's defaults (which is 512 KiB these days for a >> chunk). LVM on top is using much longer extents (as we don't really need >> 4mb granularity) and the lvm-thin chunks were set to match (and align) >> to the raid's stripe. > > I would change very little until you track this down, if the goal is > to track it down and get it fixed. 
> Yea, I'm keeping the stuff as is (and will try Sarah's suggestions with debug options as well). > I'm not sure if LVM thinp is supported with LVM raid still, which if > it's not supported yet then I can understand using mdadm raid5 instead > of LVM raid5. > It probably is, but while direct dmsetup exposes a few knobs (e.g. it allows setting up a journal), lvm doesn't allow much besides the chunk size. That was the primary reason I stuck with native mdadm. ^ permalink raw reply [flat|nested] 20+ messages in thread
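For reference, the alignment being discussed can be sanity-checked from the values quoted in the thread: a 4-disk raid5 has 3 data disks, so with mdadm's default 512 KiB chunk a full stripe is 3 × 512 KiB = 1.5 MiB, which is what the thin-pool chunk size was matched to. A short sketch of the arithmetic (using only numbers stated above):

```python
KIB = 1024

# mdadm raid5: one chunk per disk, one disk's worth of parity per stripe
disks = 4
chunk_kib = 512                      # mdadm default chunk size
data_disks = disks - 1               # raid5 loses one disk to parity
stripe_kib = data_disks * chunk_kib  # full-stripe write size

print(stripe_kib, "KiB =", stripe_kib / KIB, "MiB")  # 1536 KiB = 1.5 MiB

# lvm-thin chunk and extent sizes used in the report
thin_chunk_kib = 1536     # 1.5 MiB, matched to the raid stripe
extent_kib = 120 * KIB    # 120 MiB extents

assert thin_chunk_kib % stripe_kib == 0  # thin chunks align to full stripes
assert extent_kib % thin_chunk_kib == 0  # extents align to thin chunks
```

This also makes the "why 4 KiB?" puzzle concrete: none of the sizes in the stack (512 KiB, 1.5 MiB, 120 MiB) is 4 KiB, so the corruption granularity matches only the page size.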
* Re: [general question] rare silent data corruption when writing data 2020-05-07 18:24 ` Roger Heflin 2020-05-07 21:01 ` John Stoffel @ 2020-05-07 22:13 ` Michal Soltys 1 sibling, 0 replies; 20+ messages in thread From: Michal Soltys @ 2020-05-07 22:13 UTC (permalink / raw) To: Roger Heflin; +Cc: linux-raid On 20/05/07 20:24, Roger Heflin wrote: > Have you tried the same file 2x and verified the corruption is in the > same places and looks the same? Yes, both with direct tests on hosts and with btrfs scrub failing on the same files in exactly the same places. Always full 4KiB chunks. > > I have not as of yet seen write corruption (except when a vendor's disk > was resetting and it was lying about having written the data prior to > the crash - these were ssds; if your disk write cache is on and you > have a disk reset this can also happen), and have not seen "lost > writes" otherwise, but would expect the 2 read corruptions I have seen > to also be able to cause write issues. So for that, look for scsi > notifications for disk resets that should not happen. > When I was doing a simple test that basically was:

while ...; do
    rng=$(hexdump ... /dev/urandom)
    dcfldd hash=md5 textpattern="$rng" of=/dst/test bs=262144 count=$((16*4096))
    sync
    echo 1 > /proc/sys/vm/drop_caches
    dcfldd hash=md5 if=/dst/test of=/dev/null
    ... # compare the two hashes and stop if they differ
done

there were no worrisome resets or similar entries in dmesg. > I have had a "bad" controller cause read corruptions; those > corruptions would move around, and replacing the controller resolved it, > so there may be a lack of error checking "inside" some paths in the > card. Luckily I had a number of these controllers and had cold spares > for them. The giveaway here was 2 separate buses with almost > identical load with 6 separate disks each, and all 12 disks on the 2 buses > had between 47-52 scsi errors, which points to the only component > shared (the controller). 
Doesn't seem to be the case here - the reads are always the same, both in content and position. > > I have had a pre-PCIe bus (PCI-X bus, 2 slots shared, both set to 133) > cause random read corruptions (lowering the speed to 100 fixed it); this > one was duplicated on multiple identical pieces of hw with all > different parts on the duplicating machine. > > I have also seen lost writes (from software) because someone did a > seek without doing a flush, which in some versions of the libs loses > the unfilled block when the seek happens (this is noted in the man > page, and I saw it 20 years ago; it is still noted in the man page, so > no idea if it was ever fixed). So has more than one application been > noted to see the corruption? > > So one question: have you seen the corruption in a path that would > rely on one controller, or do all corruptions you have seen involve > more than one controller? Isolate and test each controller if you > can, or if you can afford to, replace it and see if it continues. > So far only on one (LSI 2308) controller - although the thin volumes' metadata is on the ssds connected to the chipset's sata controller. Still, if hypothetically that was the case (metadata disks), wouldn't I rather see corruptions that are a multiple of the thin volume's chunk size (so multiples of 1.5 MiB in this case)? As for the controller, I have ordered another one that we plan to test in the near future. > > On Thu, May 7, 2020 at 12:33 PM Michal Soltys <msoltyspl@yandex.pl> wrote: >> >> Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause. >> >> Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. 
The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source. >> >> The hardware is (can provide more detailed info of course): >> >> - Supermicro X9DR7-LN4F >> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane) >> - 96 gb ram (ecc) >> - 24 disk backplane >> >> - 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk) >> - 1 array on the backplane (4 disks, mdraid5, journaled) >> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks) >> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series) >> >> Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM >> uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well. 
>> >> With a doze of testing we managed to roughly rule out the following elements as being the cause: >> >> - qemu/kvm (issue occured directly on host) >> - backplane (issue occured on disks directly connected via LSI's 2nd connector) >> - cable (as a above, two different cables) >> - memory (unlikely - ECC for once, thoroughly tested, no errors ever reported via edac-util or memtest) >> - mdadm journaling (issue occured on plain mdraid configuration as well) >> - disks themselves (issue occured on two separate mdadm arrays) >> - filesystem (issue occured on both btrfs and ext4 (checksumed manually) ) >> >> We did not manage to rule out (though somewhat _highly_ unlikely): >> >> - lvm thin (issue always - so far - occured on lvm thin pools) >> - mdraid (issue always - so far - on mdraid managed arrays) >> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply rather already longstanding bug somewhere) >> >> And finally - so far - the issue never occured: >> >> - directly on a disk >> - directly on mdraid >> - on linear lvm volume on top of mdraid >> >> As far as the issue goes it's: >> >> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to few dozens of such chunks >> - we also found (or rather btrfs scrub did) a few small damaged files as well >> - the chunks look like a correct piece of different or previous data >> >> The 4kb is well, weird ? Doesn't really matter any chunk/stripes sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ... >> >> Anyway, if anyone has any ideas or suggestions what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any. > ^ permalink raw reply [flat|nested] 20+ messages in thread
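The write/read-back check described in this message can be sketched in Python as well (hedged: a minimal standalone version, not the actual test used; it evicts the file from the page cache with posix_fadvise(DONTNEED) instead of drop_caches, which does not require root, and the path/size arguments are placeholders):

```python
import hashlib
import os

def write_and_verify(path, size=16 * 1024 * 1024, block=256 * 1024):
    """Write `size` bytes of random data, evict it from the page cache,
    read it back and compare checksums. Returns True when they match."""
    h_w = hashlib.md5()
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            buf = os.urandom(min(block, remaining))
            h_w.update(buf)
            f.write(buf)
            remaining -= len(buf)
        f.flush()
        os.fsync(f.fileno())

    # Drop this file's cached pages so the read-back comes from the
    # device, similar in spirit to 'echo 1 > /proc/sys/vm/drop_caches'.
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # Linux-only; skipped elsewhere
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

    h_r = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h_r.update(chunk)
    return h_w.hexdigest() == h_r.hexdigest()
```

On a healthy stack this always returns True; in the scenario described in the thread, a loop around it would eventually return False after a few 16 GiB iterations.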
* Re: [general question] rare silent data corruption when writing data 2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys 2020-05-07 18:24 ` Roger Heflin @ 2020-05-13 6:31 ` Chris Dunlop 2020-05-13 17:49 ` John Stoffel 2020-05-20 20:29 ` Michal Soltys 1 sibling, 2 replies; 20+ messages in thread From: Chris Dunlop @ 2020-05-13 6:31 UTC (permalink / raw) To: Michal Soltys; +Cc: linux-raid Hi, On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote: > Note: this is just general question - if anyone experienced something > similar or could suggest how to pinpoint / verify the actual cause. > > Thanks to btrfs's checksumming we discovered somewhat (even if quite > rare) nasty silent corruption going on on one of our hosts. Or perhaps > "corruption" is not the correct word - the files simply have precise 4kb > (1 page) of incorrect data. The incorrect pieces of data look on their > own fine - as something that was previously in the place, or written > from wrong source. "Me too!" We are seeing 256-byte corruptions which are always the last 256b of a 4K block. The 256b is very often a copy of a "last 256b of a 4k block" from earlier in the file. We sometimes see multiple corruptions in the same file, with each of the corruptions being a copy of a different 256b from earlier in the file. The original 256b and the copied 256b aren't identifiably at a regular offset from each other. I'd be really interested to hear if your problem is just in the last 256b of the 4k block also! Where the 256b isn't a copy from earlier in the file, we haven't been able to track down the origin of the copied data. I tried some extensive analysis of some of these occurrences, including looking at files being written around the same time, but wasn't able to identify where the data came from. 
It could be the "last 256b of 4k block" from some other file being written at the same time, or a non-256b aligned chunk, or indeed not a copy of other file data at all. See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@onthe.net.au/ We've been able to detect these corruptions via an md5sum calculated as the files are generated, where a later md5sum doesn't match the original. We regularly see the md5sum match soon after the file is written (seconds to minutes), and then go "bad" after doing a "vmtouch -e" to evict the file from memory. I.e. it looks like the problem is occurring somewhere on the write path to disk. We can move the corrupt file out of the way and regenerate the file, then use 'cmp -l' to see where the corruption[s] are, and calculate md5 sums for each 256b block in the file to identify where the 256b was copied from. The corruptions are far more likely to occur during a scrub, although we have seen a few of them when not scrubbing. We're currently working around the issue by scrubbing infrequently, and trying to schedule scrubs during periods of low write load. > The hardware is (can provide more detailed info of course): > > - Supermicro X9DR7-LN4F > - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to > backplane) > - 96 gb ram (ecc) > - 24 disk backplane > > - 1 array connected directly to lsi controller (4 disks, mdraid5, > internal bitmap, 512kb chunk) > - 1 array on the backplane (4 disks, mdraid5, journaled) > - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro > disks) > - 1 btrfs raid1 boot array on motherboard's sata ports (older but still > fine intel ssds from DC 3500 series) Ours is on similar hardware: - Supermicro X8DTH-IF - LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions) - 192GB ECC RAM - A mix of 12 and 24-bay expanders (some daisy chained: lsi-expander-expander) We swapped the LSI HBA for another of the same model, the problem persisted. 
We have a SAS9300 card on the way for testing. > Raid 5 arrays are in lvm volume group, and the logical volumes are used > by VMs. Some of the volumes are linear, some are using thin-pools (with > metadata on the aforementioned intel ssds, in mirrored config). LVM uses > large extent sizes (120m) and the chunk-size of thin-pools is set to > 1.5m to match underlying raid stripe. Everything is cleanly aligned as > well. We're not using VMs nor lvm thin on this storage. Our main filesystem is xfs + lvm + raid6 and this is where we've seen all but one of these corruptions (70-100 since Mar 2018). The problem has occurred on all md arrays under the lvm, on disks from multiple vendors and models, and on disks attached to all expanders. We've seen one of these corruptions with xfs directly on a hdd partition. I.e. no mdraid or lvm involved. This fs is an order of magnitude or more less utilised than the main fs in terms of data being written. > We did not manage to rule out (though somewhat _highly_ unlikely): > > - lvm thin (issue always - so far - occured on lvm thin pools) > - mdraid (issue always - so far - on mdraid managed arrays) > - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, > happened with both - so it would imply rather already longstanding bug > somewhere) - we're not using lvm thin - problem has occurred once on non-mdraid (xfs directly on a hdd partition) - problem NOT seen on kernel 3.18.25 - problem seen on, so far, kernels 4.4.153 - 5.4.2 > And finally - so far - the issue never occured: > > - directly on a disk > - directly on mdraid > - on linear lvm volume on top of mdraid - seen once directly on disk (partition) - we don't use mdraid directly - our problem arises on linear lvm on top of mdraid (raid6) > As far as the issue goes it's: > > - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from > a few to few dozens of such chunks > - we also found (or rather btrfs scrub did) a few small damaged files as > well > - 
the chunks look like a correct piece of different or previous data > > The 4kb is well, weird ? Doesn't really matter any chunk/stripes sizes > anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin > pools; mdraid - default 512kb chunks). It does nicely fit a page though > ... > > Anyway, if anyone has any ideas or suggestions what could be happening > (perhaps with this particular motherboard or vendor) or how to pinpoint > the cause - I'll be grateful for any. Likewise! Cheers, Chris ^ permalink raw reply [flat|nested] 20+ messages in thread
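The analysis workflow Chris describes - 'cmp -l' between the good and corrupt copies, then md5 sums of each 256b block to find where the corrupt content was copied from - can be sketched as follows (hedged: a simplified standalone version, not the scripts actually used; block size and the "copied from earlier in the file" heuristic follow the description above):

```python
import hashlib

BLK = 256  # granularity of the observed corruptions

def blocks(data, size=BLK):
    return [data[i:i + size] for i in range(0, len(data), size)]

def find_corruptions(good_path, bad_path):
    """Compare two copies block-by-block; for each mismatching block,
    look for an earlier block in the good copy with the same content as
    the corrupt block (the 'copy of a different 256b from earlier in
    the file' pattern)."""
    good = open(good_path, "rb").read()
    bad = open(bad_path, "rb").read()
    index = {}    # md5 of each good block -> first offset it occurs at
    results = []  # (corrupt offset, apparent source offset or None)
    for n, (g, b) in enumerate(zip(blocks(good), blocks(bad))):
        index.setdefault(hashlib.md5(g).hexdigest(), n * BLK)
        if g != b:
            src = index.get(hashlib.md5(b).hexdigest())
            results.append((n * BLK, src))
    return results
```

A `src` of None corresponds to the cases Chris mentions where the origin could not be identified - the corrupt content is not any earlier 256b-aligned block of the same file.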
* Re: [general question] rare silent data corruption when writing data 2020-05-13 6:31 ` Chris Dunlop @ 2020-05-13 17:49 ` John Stoffel 2020-05-14 0:39 ` Chris Dunlop 2020-05-20 20:29 ` Michal Soltys 1 sibling, 1 reply; 20+ messages in thread From: John Stoffel @ 2020-05-13 17:49 UTC (permalink / raw) To: Chris Dunlop; +Cc: Michal Soltys, linux-raid I wonder if this problem can be replicated on loop devices? Once there's a way to cause it reliably, we can then start doing a bisection of the kernel to try and find out where this is happening. So far, it looks like it happens sometimes on bare RAID6 systems without lv-thin in place, which is both good and bad. And without using VMs on top of the storage either. So this helps narrow down the cause. Is there any info on the workload on these systems? Lots of small files which are added/removed? Large files which are just written to and not touched again? I assume finding a bad file with corruption and then doing a cp of the file keeps the same corruption? >>>>> "Chris" == Chris Dunlop <chris@onthe.net.au> writes: Chris> Hi, Chris> On Thu, May 07, 2020 at 07:30:19PM +0200, Michal Soltys wrote: >> Note: this is just general question - if anyone experienced something >> similar or could suggest how to pinpoint / verify the actual cause. >> >> Thanks to btrfs's checksumming we discovered somewhat (even if quite >> rare) nasty silent corruption going on on one of our hosts. Or perhaps >> "corruption" is not the correct word - the files simply have precise 4kb >> (1 page) of incorrect data. The incorrect pieces of data look on their >> own fine - as something that was previously in the place, or written >> from wrong source. Chris> "Me too!" Chris> We are seeing 256-byte corruptions which are always the last 256b of a 4K Chris> block. The 256b is very often a copy of a "last 256b of 4k block" from Chris> earlier on the file. 
We sometimes see multiple corruptions in the same Chris> file, with each of the corruptions being a copy of a different 256b from Chris> earlier on the file. The original 256b and the copied 256b aren't Chris> identifiably at a regular offset from each other. Where the 256b isn't a Chris> copy from earlier in the file Chris> I'd be really interested to hear if your problem is just in the last 256b Chris> of the 4k block also! Chris> We haven't been able to track down any the origin of any of the copies Chris> where it's not a 256b block earlier in the file. I tried some extensive Chris> analysis of some of these occurrences, including looking at files being Chris> written around the same time, but wasn't able to identify where the data Chris> came from. It could be the "last 256b of 4k block" from some other file Chris> being written at the same time, or a non-256b aligned chunk, or indeed not Chris> a copy of other file data at all. Chris> See Also: https://lore.kernel.org/linux-xfs/20180322150226.GA31029@onthe.net.au/ Chris> We've been able to detect these corruptions via an md5sum calculated as Chris> the files are generated, where a later md5sum doesn't match the original. Chris> We regularly see the md5sum match soon after the file is written (seconds Chris> to minutes), and then go "bad" after doing a "vmtouch -e" to evict the Chris> file from memory. I.e. it looks like the problem is occurring somewhere on Chris> the write path to disk. We can move the corrupt file out of the way and Chris> regenerate the file, then use 'cmp -l' to see where the corruption[s] are, Chris> and calculate md5 sums for each 256b block in the file to identify where Chris> the 256b was copied from. Chris> The corruptions are far more likely to occur during a scrub, although we Chris> have seen a few of them when not scrubbing. We're currently working around Chris> the issue by scrubbing infrequently, and trying to schedule scrubs during Chris> periods of low write load. 
>> The hardware is (can provide more detailed info of course): >> >> - Supermicro X9DR7-LN4F >> - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to >> backplane) >> - 96 gb ram (ecc) >> - 24 disk backplane >> >> - 1 array connected directly to lsi controller (4 disks, mdraid5, >> internal bitmap, 512kb chunk) >> - 1 array on the backplane (4 disks, mdraid5, journaled) >> - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro >> disks) >> - 1 btrfs raid1 boot array on motherboard's sata ports (older but still >> fine intel ssds from DC 3500 series) Chris> Ours is on similar hardware: Chris> - Supermicro X8DTH-IF Chris> - LSI SAS 9211-8i (LSI SAS2008, PCI-e 2.0, multiple firmware versions) Chris> - 192GB ECC RAM Chris> - A mix of 12 and 24-bay expanders (some daisy chained: lsi-expander-expander) Chris> We swapped the LSI HBA for another of the same model, the problem Chris> persisted. We have a SAS9300 card on the way for testing. >> Raid 5 arrays are in lvm volume group, and the logical volumes are used >> by VMs. Some of the volumes are linear, some are using thin-pools (with >> metadata on the aforementioned intel ssds, in mirrored config). LVM uses >> large extent sizes (120m) and the chunk-size of thin-pools is set to >> 1.5m to match underlying raid stripe. Everything is cleanly aligned as >> well. Chris> We're not using VMs nor lvm thin on this storage. Chris> Our main filesystem is xfs + lvm + raid6 and this is where we've seen all Chris> but one of these corruptions (70-100 since Mar 2018). Chris> The problem has occurred on all md arrays under the lvm, on disks from Chris> multiple vendors and models, and on disks attached to all expanders. Chris> We've seen one of these corruptions with xfs directly on a hdd partition. Chris> I.e. no mdraid or lvm involved. This fs an order of magnitude or more less Chris> utilised than the main fs in terms of data being written. 
>> We did not manage to rule out (though somewhat _highly_ unlikely): >> >> - lvm thin (issue always - so far - occured on lvm thin pools) >> - mdraid (issue always - so far - on mdraid managed arrays) >> - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, >> happened with both - so it would imply rather already longstanding bug >> somewhere) Chris> - we're not using lvm thin Chris> - problem has occurred once on non-mdraid (xfs directly on a hdd partition) Chris> - problem NOT seen on kernel 3.18.25 Chris> - problem seen on, so far, kernels 4.4.153 - 5.4.2 >> And finally - so far - the issue never occured: >> >> - directly on a disk >> - directly on mdraid >> - on linear lvm volume on top of mdraid Chris> - seen once directly on disk (partition) Chris> - we don't use mdraid directly Chris> - our problem arises on linear lvm on top of mdraid (raid6) >> As far as the issue goes it's: >> >> - always a 4kb chunk that is incorrect - in a ~1 tb file it can be from >> a few to few dozens of such chunks >> - we also found (or rather btrfs scrub did) a few small damaged files as >> well >> - the chunks look like a correct piece of different or previous data >> >> The 4kb is well, weird ? Doesn't really matter any chunk/stripes sizes >> anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin >> pools; mdraid - default 512kb chunks). It does nicely fit a page though >> ... >> >> Anyway, if anyone has any ideas or suggestions what could be happening >> (perhaps with this particular motherboard or vendor) or how to pinpoint >> the cause - I'll be grateful for any. Chris> Likewise! Chris> Cheers, Chris> Chris ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [general question] rare silent data corruption when writing data 2020-05-13 17:49 ` John Stoffel @ 2020-05-14 0:39 ` Chris Dunlop 0 siblings, 0 replies; 20+ messages in thread From: Chris Dunlop @ 2020-05-14 0:39 UTC (permalink / raw) To: John Stoffel; +Cc: Michal Soltys, linux-raid On Wed, May 13, 2020 at 01:49:10PM -0400, John Stoffel wrote: > I wonder if this problem can be replicated on loop devices? Once > there's a way to cause it reliably, we can then start doing a > bisection of the kernel to try and find out where this is happening. I ran a week or so of attempting to replicate the problem in a VM on loop devices replicating the lvm/raid config, without success. Basically just having a random bunch of 1-25 concurrent writers banging out middling to largish files. The fact it wasn't replicable in that environment could be pointing towards the LSI driver or hardware - or I simply wasn't able to match the conditions well enough. > So far, it looks like it happens sometimes on bare RAID6 systems > without lv-thin in place, which is both good and bad. And without > using VMs on top of the storage either. So this helps narrow down the > cause. Note: We don't have any bare RAID6 so I haven't seen it there: our main fs is xfs on sequential LVM on raid6 (6 x 11-disk sets), and we saw it once on xfs directly on HDD partition. > Is there any info on the work load on these systems? Lots of small > fils which are added/removed? Large files which are just written to > and not touched again? Large files written and not touched again. Most of the time 2-5 concurrent writers but regularly (daily) up to 20-25 concurrent. > I assume finding a bad file with corruption and then doing a cp of the > file keeps the same corruption? Yep. ^ permalink raw reply [flat|nested] 20+ messages in thread
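The workload Chris describes - a handful of concurrent writers producing large files, each checksummed as it is generated and re-verified later - can be sketched as a small stress harness (hedged: a minimal standalone version with made-up file sizes and names, not the actual reproduction setup, and without the cache eviction a real run would want before re-verifying):

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def writer(path, size, block=1 << 20):
    """Write `size` bytes of random data; return the md5 computed
    as the file is generated (the reference checksum)."""
    h = hashlib.md5()
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            buf = os.urandom(min(block, remaining))
            h.update(buf)
            f.write(buf)
            remaining -= len(buf)
        os.fsync(f.fileno())
    return h.hexdigest()

def verify(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def stress(workdir, writers=5, size=8 << 20):
    """Run `writers` concurrent writers, then re-verify every file;
    return the list of files whose checksum no longer matches."""
    paths = [os.path.join(workdir, f"f{i}") for i in range(writers)]
    with ThreadPoolExecutor(max_workers=writers) as ex:
        expected = list(ex.map(writer, paths, [size] * len(paths)))
    return [p for p, e in zip(paths, expected) if verify(p) != e]
```

On a healthy stack stress() returns an empty list; to mimic the thread's failure mode the verification pass would need to run after the files have been evicted from the page cache (e.g. via posix_fadvise or drop_caches), so the re-read actually exercises the on-disk data.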
* Re: [general question] rare silent data corruption when writing data 2020-05-13 6:31 ` Chris Dunlop 2020-05-13 17:49 ` John Stoffel @ 2020-05-20 20:29 ` Michal Soltys 1 sibling, 0 replies; 20+ messages in thread From: Michal Soltys @ 2020-05-20 20:29 UTC (permalink / raw) To: Chris Dunlop; +Cc: linux-raid On 20/05/13 08:31, Chris Dunlop wrote: > Hi, > > > "Me too!" > > We are seeing 256-byte corruptions which are always the last 256b of a > 4K block. The 256b is very often a copy of a "last 256b of 4k block" > from earlier on the file. We sometimes see multiple corruptions in the > same file, with each of the corruptions being a copy of a different 256b > from earlier on the file. The original 256b and the copied 256b aren't > identifiably at a regular offset from each other. Where the 256b isn't a > copy from earlier in the file > > I'd be really interested to hear if your problem is just in the last > 256b of the 4k block also! From what I have checked - in my case it has always been a full 4k page. I'll follow Sarah's suggestion from the other part of this thread and enable the pagealloc debug options, then put the machine/disks under load - and I'll keep an eye out for whether something like you described happens. This will have to wait a bit though, as I have another bug to hunt as well - the journaled raid refuses to assemble, so with Song's help I'm chasing that issue first. If not for btrfs, we probably would have been using the machine happily until now (blaming occasional detected issues on userspace stuff, usually some fat java mess). Thanks for the detailed explanation of what happened in your case (and the span of kernel versions in which it does happen is scary). The hardware indeed looks strikingly similar. ^ permalink raw reply [flat|nested] 20+ messages in thread
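For reference, the page-allocator debugging mentioned above typically means a kernel built and booted with something like the following - a hedged sketch, since exact option names and availability depend on the kernel version and distribution config (CONFIG_DEBUG_PAGEALLOC must be compiled in for debug_pagealloc=on to have any effect, and these options carry a noticeable performance cost):

```
# kernel build options
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_PAGE_POISONING=y

# kernel command line
debug_pagealloc=on page_poison=on slub_debug=FZP
```

With these enabled, use-after-free or stray writes to freed pages tend to fault immediately instead of silently landing in someone else's data, which is exactly the class of bug a stable 4 KiB corruption granularity points at.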
end of thread, other threads:[~2020-05-20 21:40 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-07 17:30 [general question] rare silent data corruption when writing data Michal Soltys 2020-05-07 18:24 ` Roger Heflin 2020-05-07 21:01 ` John Stoffel 2020-05-07 22:33 ` Michal Soltys 2020-05-08 0:54 ` John Stoffel 2020-05-08 11:10 ` Michal Soltys 2020-05-08 11:10 ` [linux-lvm] " Michal Soltys 2020-05-08 16:10 ` John Stoffel 2020-05-08 16:10 ` [linux-lvm] " John Stoffel 2020-05-08 3:44 ` Chris Murphy 2020-05-10 19:05 ` Sarah Newman 2020-05-10 19:12 ` Sarah Newman 2020-05-11 9:41 ` Michal Soltys 2020-05-11 19:42 ` Sarah Newman 2020-05-20 21:40 ` Michal Soltys 2020-05-07 22:13 ` Michal Soltys 2020-05-13 6:31 ` Chris Dunlop 2020-05-13 17:49 ` John Stoffel 2020-05-14 0:39 ` Chris Dunlop 2020-05-20 20:29 ` Michal Soltys