From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:55025 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1751301AbbFPCj5 (ORCPT ); Mon, 15 Jun 2015 22:39:57 -0400 Subject: Re: [PATCH RFC] btrfs: csum: Introduce partial csum for tree block. To: , Chris Mason , References: <1434078015-8868-1-git-send-email-quwenruo@cn.fujitsu.com> <557B076B.7050500@fb.com> <557E86A9.8040207@cn.fujitsu.com> <20150615131507.GL6761@twin.jikos.cz> <557F7A5F.5010206@cn.fujitsu.com> From: Qu Wenruo Message-ID: <557F8C78.7080304@cn.fujitsu.com> Date: Tue, 16 Jun 2015 10:39:52 +0800 MIME-Version: 1.0 In-Reply-To: <557F7A5F.5010206@cn.fujitsu.com> Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: Qu Wenruo wrote on 2015/06/16 09:22 +0800: > > > David Sterba wrote on 2015/06/15 15:15 +0200: >> On Mon, Jun 15, 2015 at 04:02:49PM +0800, Qu Wenruo wrote: >>> In the following case of corruption, RAID1 or DUP will fail to recover >>> it(Use 16K as leafsize) >>> 0 4K 8K 12K 16K >>> Mirror 0: >>> |<-OK---------->|<----ERROR---->|<-----------------OK------------->| >>> >>> Mirror 1: >>> |<----------------------------OK--------------->|<------Error----->| >>> >>> Since the CRC32 stored in header is calculated for the whole leaf, >>> so both will fail the CRC32 check. >>> >>> But the corruption are in different position, in fact, if we know where >>> the corruption is (no need to be so accurate), we can recover the tree >>> block by using the current part. >>> >>> In above example, we can just use the correct 0~12K from mirror 1 >>> and then 12K~16K from mirror 0. >> >> If the mirror 0 copy is intact, you can use it entirely. Your >> improvement could help if each mirror is partially broken but we can >> find good copies of all 4k blocks among all mirrors. >> >> The natural question is how often this happens and if it's worth adding >> the code complexity and what's the estimated speed drop. >> >> I think the conditions are very rare and that we could add minimal code >> to attempt to build the metadata block from the available copies without >> the separate block checksums. This is an immediate idea so I could have >> missed something: >> >> * if a metadata-block checksum mismatches, do a direct comparison of the >> metadata-blocks in all available mirrors >> * if they match and checksums match, no help, it's damaged >> * if there's a good copy (ie the original checksum or data were >> corrupted), use it >> * otherwise attempt to rebuild the metadata block from what's >> available >> >> * by direct comparisons of the 4k blocks, find the first where the >> metadataA and mirror1 blocks mismatch, offset N >> * try to compute the checksum from metadataA[0..N-1] + mirror1 block N + >> rest of metadataA >> * if it's ok, use it >> * if not: the block N is corrupted in mirror1 (we've skipped it in >> metadataA) >> then repeat with metadataA[0..N] + mirror1[N+1..end] > The method sounds good, but the codes will be even more complex than > mine. > Also, due to the nature of CRC32 and RAID1/Dup case, things will be > quite complex like the following case using your method. > > 0 4K 8K 12K 16K > Mirror 0 > | | X | | | > Mirror 1 > | | | X | | > > If using your method: > [0~4K] > CRC32 of 0~4K are the same. No problem. > > [4~8K] > CRC32 of 0~8K differs. > Now we know there is something wrong with 4~8K, but we still don't know > which is the good copy. > > We can continue, keep 2 different CRC32 seed for later checksum. > One seed using the 4~8 from mirror 0 and one from mirror1. > > Note, the crc32 is for the whole tree block, so until we calculate all > the tree block, we can know which one is correct. Sorry here, "can" -> "can't" Thanks, Qu > > [8~12K] > Now crc32 mismatches again. > We still don't know which part is correct. > > We can still continue by recording different CRC32 seed for them. > But don't forget the previous 2 seeds we recorded. > So we records 4 crc32 seeds for 0~12K.(Mi0, Mi0) (Mi0, Mi1) (Mi1, Mi0) > (Mi1, Mi1). > > [12~16K] > Continue calculate the crc32 with above 4 seeds. > Finally, we found the crc32 matches with the tree block is using the > combination of (Mi1, Mi0). And we can restore the tree block. > > Yes, with this behavior, we can restore the tree block even 3 parts are > corrupted, but in that case, we need to try 8 times. > > So, I don't consider this is more easy to implement than my idea. > > [ROOT CAUSE] > The root cause of the complex is: > 1) Checksum algorithm depends on previous input(seed) > Almost all checksum algorithm depends on previous input. > And you won't know the data is correct or not until all data is calculated. > > 2) Only one extra duplication for RAID1/DUP > We don't have extra duplication to judge which block is correct as there > are only two mirror. > > So my partial checksum solves the 2 root cause at the same time. > With partial checksum, we don't depend on previous data to calculate > checksum. > And we have extra reference if mirror differs with each other, we use > checksum to judge which copy is correct. > >> >> That's a rough idea that I hope will cover most of the cases when it >> happens. With some more exhaustive attempts to rebuild the metadata >> block we can try to repair 2 damaged blocks. >> >> As this is completely independent, we can test it separately, and also >> add it as a rescue feature to the userspace tools. >> >>> Yes, this corruption case may be minor enough, since even corruption in >>> one mirror is rare enough. >>> So I didn't introduce a new CRC32 checksum, but use the extra 32-4 bytes >>> to store the partial CRC32 to keep the backward compatibility. >> >> The above would work with any checksums, without the need to store the >> per-block checksums which become impossible with strongher algorithms. >> > [FURTHER CSUM DESIGN] > As you mentioned in later part, yes, stronger checksum like SHA256 won't > be able to do such partial checksum as it uses all the space. > > But that's already a trade-off, right? > > If btrfs was using partial csum from the beginning, pros and cons of > strong checksum should be quite obvious: > > Stronger checksum: > (Pros) > 1. Less confliction > Which means more security. > > (Cons) > 1. More space usage for data. > 4Bytes/4K vs 32Bytes/4K. Obvious. > > 2. No/less partial checksum for metadata. > Harder to detect/repair partial error. > > 3. (generally) More CPU time > Normally, stronger checksums are cryptographic checksum, which uses > more CPU time. > > So with this patch, I also want to inspire other developers about the > trade off between different checksums. > Especially the fact, we can make better use of the spare metadata csum > space if the csum length is less than 32bytes. > > Thanks, > Qu > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html