* [RFC] Checksum of the parity
@ 2017-08-13 14:16 Goffredo Baroncelli

From: Goffredo Baroncelli
To: linux-btrfs

Hi all,

in the BTRFS wiki, on the status page, the RAID5/6 line reports that the parity is not checksummed. This has been reported several times on the ML and also on other sites (e.g. Phoronix) as a BTRFS defect.

However I was unable to understand why, and I suspect that this is a false myth.

So my question is: can the fact that in BTRFS RAID5/6 the parity is not checksummed be considered a defect?

My goal is to verify whether there is a rationale for requiring the parity to be checksummed, and if not I would like to remove this claim from the wiki.

BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [RFC] Checksum of the parity
@ 2017-08-13 18:45 Chris Murphy

From: Chris Murphy
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sun, Aug 13, 2017 at 8:16 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> So my question is: can the fact that in BTRFS RAID5/6 the parity is not
> checksummed be considered a defect?
>
> My goal is to verify whether there is a rationale for requiring the parity
> to be checksummed, and if not I would like to remove this claim from the wiki.

It is not a per se defect. If parity is corrupt and parity is needed for reconstruction, the reconstruction will be corrupt, but that is then detected and we get EIO. [1]

Further, this detection of corrupt reconstruction is why I say Btrfs is not subject *in practice* to the write hole problem. [2]

[1] I haven't tested the raid6 normal-read case where a stripe contains a corrupt data strip and a corrupt P strip while the Q strip is good. I expect that instead of EIO we get a reconstruction from Q, after which both data and P get fixed up, but I can't find this in the comments or the code.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?h=v4.12.7
At line 1851 I'm not sure exactly where we are; it seems like it must be a scrub, because P and Q are not relevant if the data is good.

[2] Is Btrfs subject to the write hole problem manifesting on disk? I'm not sure; sadly I don't read the code well enough. But if all Btrfs raid56 writes are full-stripe CoW writes, and if the prescribed ordering guarantees still hold (data CoW to disk > metadata CoW to disk > superblock update), then I don't see how the write hole happens. The write hole requires an RMW of a stripe (a partial stripe overwrite) and a crash during the modification of that stripe, leaving the stripe inconsistent while still pointed to by metadata.

--
Chris Murphy
* Re: [RFC] Checksum of the parity
@ 2017-08-13 23:40 Janos Toth F.

From: Janos Toth F.
To: Chris Murphy; +Cc: Goffredo Baroncelli, linux-btrfs

On Sun, Aug 13, 2017 at 8:45 PM, Chris Murphy <lists@colorremedies.com> wrote:
> [1] I haven't tested the raid6 normal-read case where a stripe contains
> a corrupt data strip and a corrupt P strip while the Q strip is good. I
> expect that instead of EIO we get a reconstruction from Q, after which
> both data and P get fixed up, but I can't find this in the comments or the code.

Yes, that's what I would expect (which theoretically makes the odds of successful recovery better on RAID6, possibly "good enough"), but I have no clue how that actually gets handled right now (I guess the current code isn't that thorough).

> [2] Is Btrfs subject to the write hole problem manifesting on disk? [...]

I guess the problem is that the stripe element size is fixed (so the actual full-stripe size depends on the number of devices) and relatively big: much bigger than the usual 4k sector size, or even the leaf size, which now defaults to 16k if I recall (though I set it to 4k myself). So a partial stripe update (RMW) is certainly possible during generic use.

This is why I floated the idea a few months ago of resurrecting the old (but apparently stuck) project to make the stripe element size configurable by the user. That would allow making the stripe size equal to the filesystem sector size on a limited set of setups (for example, 5 or 6 HDDs with 512-byte physical sectors in RAID-5 or RAID-6 respectively), which would, as I understand it, practically eliminate the problem, at least on the filesystem side.

I am not sure whether the HDD's volatile write cache, or at least its internal re-ordering feature, would still have to be disabled to really avoid inconsistencies between stripe elements. I can't recall ever seeing partially written sectors (we would know, since sectors are checksummed in place and thus appear unreadable if partially written); I guess there is usually enough charge in some small capacitor to finish writing the current sector after the power gets cut.
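The geometry Janos describes can be sketched with simple arithmetic. This is an illustrative sketch only: it assumes a fixed 64 KiB per-device strip ("stripe element"), and the helper name is invented for the example.

```python
# Rough stripe-geometry arithmetic for an N-device RAID5/6 array.
# Assumes a fixed 64 KiB per-device strip; figures are illustrative.

STRIP = 64 * 1024  # per-device strip ("stripe element") size in bytes

def full_stripe_data(ndevs: int, nparity: int) -> int:
    """Bytes of data covered by one full stripe (excluding parity)."""
    return (ndevs - nparity) * STRIP

# 3-device RAID5: a full stripe holds 128 KiB of data, so any write
# smaller than that (e.g. a single 4 KiB sector) implies a partial
# stripe update unless the write lands in fresh CoW space.
print(full_stripe_data(3, 1) // 1024)  # -> 128
print(full_stripe_data(6, 2) // 1024)  # -> 256
```

This also shows why shrinking the strip to the sector size only works for specific device counts: the full-stripe data size must come out equal to one filesystem sector.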
* Re: [RFC] Checksum of the parity
@ 2017-08-14 14:12 Goffredo Baroncelli

From: Goffredo Baroncelli
To: Chris Murphy; +Cc: linux-btrfs

On 08/13/2017 08:45 PM, Chris Murphy wrote:
> [2] Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure; sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full-stripe CoW writes, and if the prescribed ordering
> guarantees still hold (data CoW to disk > metadata CoW to disk >
> superblock update), then I don't see how the write hole happens.

RAID5 is *single* failure proof, and in order to hit the write hole bug we need two failures:
1) a transaction is aborted (e.g. due to a power failure), with the result that data and parity are misaligned
2) a disk disappears

These two events may even happen at different moments.

The key point is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.

Let me show an example:

  Disk 1    Disk 2    Disk 3 (parity)
  AAAAAA    BBBBBB    CCCCCC

where CCCCCC = AAAAAA ^ BBBBBB.

Note 1: AAAAAA is valid data.

Suppose you update B and, due to a power failure, you can't update the parity. You now have:

  Disk 1    Disk 2    Disk 3 (parity)
  AAAAAA    DDDDDD    CCCCCC

Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned). Pay attention that AAAAAA is still valid data.

Now suppose you lose disk 1. If you want to read from it, you have to read disk 2 and disk 3 to compute disk 1's contents. However, disk 2 and disk 3 are misaligned, so computing DDDDDD ^ CCCCCC no longer gives you AAAAAA.

Note that it does not matter whether DDDDDD or BBBBBB is valid data.

Moreover, I have to point out that a simple scrub pass between 1) and 2) is able to rebuild a correct parity. This would reduce the likelihood of the "write hole" bug. The only case which would still exist is when 1) and 2) happen at the same time (which is not impossible: e.g. when a disk dies, it is not infrequent that the user shuts the machine down without waiting for a clean shutdown).

BR
G.Baroncelli
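The failure sequence above can be reproduced with plain XOR arithmetic. This is a minimal sketch of generic RAID5 semantics, not btrfs code; the byte values and the helper name are invented for the example.

```python
# Minimal RAID5 write-hole illustration using XOR parity.
# Models generic RAID5 semantics, not any actual btrfs code path.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

disk1 = b"AAAAAA"          # valid, committed data
disk2 = b"BBBBBB"
disk3 = xor(disk1, disk2)  # parity: CCCCCC = A ^ B

# Failure 1: disk2 is rewritten, but the power fails before the
# matching parity update reaches disk3. Parity is now stale.
disk2 = b"DDDDDD"

# Failure 2: disk1 dies; reconstruct it from the surviving strips.
rebuilt = xor(disk2, disk3)

# The rebuild computes D ^ (A ^ B), which is not A: the old, valid
# data on disk1 is lost even though disk1 itself never misbehaved.
print(rebuilt == b"AAAAAA")  # -> False
```

A checksum lets btrfs *detect* that `rebuilt` is wrong, but nothing in this sequence lets it recover the original contents of disk 1.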
* Re: [RFC] Checksum of the parity
@ 2017-08-14 19:28 Chris Murphy

From: Chris Murphy
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> However, disk 2 and disk 3 are misaligned, so computing DDDDDD ^ CCCCCC
> no longer gives you AAAAAA.
>
> Note that it does not matter whether DDDDDD or BBBBBB is valid data.

It doesn't matter on Btrfs. A bad reconstruction due to wrong parity results in a csum mismatch. This I've tested.

I vaguely remember a while ago doing a dd conv=notrunc modification of a file on raid5, and there was no RMW; what happened is that the whole stripe was CoW'd and contained the modification. Hardware behaving correctly, that would mean the raid5 data CoW succeeds, then there is a metadata CoW to point to it, then the super block is updated to point to the new tree. At any point, if there's an interruption, we have the old super pointing to the old tree, which points to the pre-modification data.

Anyway, I do wish I read the code better, so I knew exactly where, if at all, RMW happens on disk rather than just in memory. There is very clearly in-memory RMW code as a performance optimization: before a stripe gets written out, it can be RMW'd to fold in more changes or new files, so that raid56 isn't dog slow, CoW'ing literally a handful of 16KiB leaves each time, which would then translate into a minimum of 384K of writes.

But yeah, Qu just said in another thread that Liu is working on a journal for the raid56 write hole problem. Thing is, I don't see when it happens in the code or in practice (so far; it's really tedious to poke a file system with a stick).

--
Chris Murphy
* Re: [RFC] Checksum of the parity
@ 2017-08-14 20:18 Goffredo Baroncelli

From: Goffredo Baroncelli
To: Chris Murphy; +Cc: linux-btrfs

On 08/14/2017 09:28 PM, Chris Murphy wrote:
> It doesn't matter on Btrfs. A bad reconstruction due to wrong parity
> results in a csum mismatch. This I've tested.

I never argued about that. The write hole is about the *loss* of valid data due to a misalignment between data and parity. The fact that BTRFS is able to detect the problem and return -EIO doesn't mitigate the loss of valid data. Pay attention that in my example AAAAAA reached the disk before the "failure events".

> There is very clearly in-memory RMW code as a performance optimization
> [...] so that raid56 isn't dog slow, CoW'ing literally a handful of
> 16KiB leaves each time, which would then translate into a minimum of
> 384K of writes.

In the case of a full stripe write there is no RMW cycle, so no "write hole". Unfortunately not all writes are a full stripe in size. I never checked the code, but I hope that during a commit of the transaction the writes are grouped into "full stripe writes" as much as possible.

Just out of curiosity, what is this "minimum of 384k"? In the 3-disk raid5 case, the minimum is 64k * 2 of data (+ 64k of parity)...

BR
G.Baroncelli
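The disputed figure is easy to cross-check with the same 64 KiB strip assumption used above (illustrative only; the variable names are invented):

```python
# Cross-check of the write sizes discussed above, assuming a 64 KiB
# per-device strip. Illustrative arithmetic only.
STRIP_KIB = 64
data_strips = 2  # 3-device RAID5: 2 data strips + 1 parity strip

full_stripe_write = (data_strips + 1) * STRIP_KIB  # data + parity written
print(full_stripe_write)  # -> 192 KiB, not 384 KiB
```

This matches Goffredo's figure of 64k * 2 of data plus 64k of parity, and Chris concedes the arithmetic in the next message.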
* Re: [RFC] Checksum of the parity
@ 2017-08-14 21:10 Chris Murphy

From: Chris Murphy
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Mon, Aug 14, 2017 at 2:18 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> In the case of a full stripe write there is no RMW cycle, so no "write hole".

That is conflating disk writes and in-memory RMW. They have to be separated. For sure there's in-memory RMW plus CoW of the entire stripe to disk for a tiny (1 byte) change to a file, because I've seen it. What I don't know, and can't tell from the code, is whether there's ever such a thing as a partial-stripe RMW on disk (an overwrite of just a data strip and a corresponding overwrite of parity). Any overwrite is where the write hole comes into play. But inferring from the journal work Liu is apparently doing, it must be true that there is such a thing as overwrites with Btrfs raid56.

> Just out of curiosity, what is this "minimum of 384k"? In the 3-disk
> raid5 case, the minimum is 64k * 2 of data (+ 64k of parity)...

Bad addition on my part.

--
Chris Murphy
* Re: [RFC] Checksum of the parity
@ 2017-08-14 13:23 Austin S. Hemmelgarn

From: Austin S. Hemmelgarn
To: kreijack, linux-btrfs

On 2017-08-13 10:16, Goffredo Baroncelli wrote:
> So my question is: can the fact that in BTRFS RAID5/6 the parity is not
> checksummed be considered a defect?
>
> My goal is to verify whether there is a rationale for requiring the parity
> to be checksummed, and if not I would like to remove this claim from the wiki.

While there isn't a rationale for normal operation (as Chris did a good job of explaining), there is a benefit when scrubbing the filesystem. Without checksummed parity, you have to verify the checksums on all the data, and then either recompute the parity and compare it, or recompute the data from parity and compare that, in order to verify that everything is correct. With checksummed parity, you just verify the checksums on everything. So, overall, I wouldn't consider it a defect, but having it could improve performance for a very specific use case.
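The scrub difference Austin describes can be sketched as two verification paths. This is a hedged illustration, not btrfs's actual scrub code: the function names are invented, and CRC32 stands in for whatever checksum the filesystem uses.

```python
# Sketch of the two scrub strategies; names and checksum are illustrative.
from functools import reduce
from zlib import crc32

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def scrub_no_parity_csum(data_strips, parity, csums):
    # Verify each data strip against its checksum...
    ok = all(crc32(s) == c for s, c in zip(data_strips, csums))
    # ...then recompute the parity from all data strips and compare.
    return ok and reduce(xor, data_strips) == parity

def scrub_with_parity_csum(strips, csums):
    # Uniform path: verify one checksum per strip, parity included;
    # no XOR recomputation is needed just to validate the parity strip.
    return all(crc32(s) == c for s, c in zip(strips, csums))

data = [b"AAAAAA", b"BBBBBB"]
par = xor(*data)
print(scrub_no_parity_csum(data, par, [crc32(s) for s in data]))               # -> True
print(scrub_with_parity_csum(data + [par], [crc32(s) for s in data + [par]]))  # -> True
```

The second path trades extra on-disk checksum metadata for skipping the parity recomputation during scrub, which is exactly the narrow performance win Austin points to.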
Thread overview: 8 messages
2017-08-13 14:16 [RFC] Checksum of the parity -- Goffredo Baroncelli
2017-08-13 18:45 ` Chris Murphy
2017-08-13 23:40   ` Janos Toth F.
2017-08-14 14:12   ` Goffredo Baroncelli
2017-08-14 19:28     ` Chris Murphy
2017-08-14 20:18       ` Goffredo Baroncelli
2017-08-14 21:10         ` Chris Murphy
2017-08-14 13:23 ` Austin S. Hemmelgarn