From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH RFC] btrfs: csum: Introduce partial csum for tree block.
To: Chris Mason
References: <1434078015-8868-1-git-send-email-quwenruo@cn.fujitsu.com>
 <557B076B.7050500@fb.com> <557E86A9.8040207@cn.fujitsu.com>
 <20150615131507.GL6761@twin.jikos.cz>
From: Qu Wenruo
Message-ID: <557F7A5F.5010206@cn.fujitsu.com>
Date: Tue, 16 Jun 2015 09:22:39 +0800
MIME-Version: 1.0
In-Reply-To: <20150615131507.GL6761@twin.jikos.cz>
Content-Type: text/plain; charset="utf-8"; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org

David Sterba wrote on 2015/06/15 15:15 +0200:
> On Mon, Jun 15, 2015 at 04:02:49PM +0800, Qu Wenruo wrote:
>> In the following case of corruption, RAID1 or DUP will fail to recover
>> it (using 16K as the leafsize):
>>
>> 0      4K      8K       12K                                        16K
>> Mirror 0:
>> |<-OK---------->|<----ERROR---->|<-----------------OK------------->|
>>
>> Mirror 1:
>> |<----------------------------OK--------------->|<------Error----->|
>>
>> Since the CRC32 stored in the header is calculated over the whole leaf,
>> both copies will fail the CRC32 check.
>>
>> But the corruption is at different positions, so in fact, if we know
>> roughly where the corruption is, we can recover the tree block from
>> the intact parts.
>>
>> In the above example, we can use the correct 0~12K from mirror 1 and
>> then 12K~16K from mirror 0.
>
> If the mirror 0 copy is intact, you can use it entirely. Your
> improvement could help if each mirror is partially broken but we can
> find good copies of all 4k blocks among all mirrors.
>
> The natural question is how often this happens and if it's worth adding
> the code complexity and what's the estimated speed drop.
>
> I think the conditions are very rare and that we could add minimal code
> to attempt to build the metadata block from the available copies without
> the separate block checksums. This is an immediate idea so I could have
> missed something:
>
> * if a metadata-block checksum mismatches, do a direct comparison of the
>   metadata-blocks in all available mirrors
>   * if they match and checksums match, no help, it's damaged
>   * if there's a good copy (ie the original checksum or data were
>     corrupted), use it
>   * otherwise attempt to rebuild the metadata block from what's available
> * by direct comparisons of the 4k blocks, find the first where the
>   metadataA and mirror1 blocks mismatch, offset N
> * try to compute the checksum from metadataA[0..N-1] + mirror1 block N +
>   rest of metadataA
>   * if it's ok, use it
>   * if not: the block N is corrupted in mirror1 (we've skipped it in
>     metadataA), then repeat with metadataA[0..N] + mirror1[N+1..end]

The method sounds good, but the code will be even more complex than mine.

Also, due to the nature of CRC32 and the RAID1/DUP case, your method
gets quite complex in a case like the following:

          0       4K      8K      12K     16K
Mirror 0  |       |   X   |       |       |
Mirror 1  |       |       |   X   |       |

Using your method:

[0~4K]
The CRC32 of 0~4K is the same on both mirrors. No problem.

[4~8K]
The CRC32 of 0~8K differs. Now we know something is wrong in 4~8K, but
we still don't know which copy is good. We can continue by keeping two
different CRC32 seeds for the rest of the checksumming: one seed using
4~8K from mirror 0 and one using 4~8K from mirror 1.
Note that the CRC32 covers the whole tree block, so we cannot tell
which choice is correct until the whole block has been checksummed.

[8~12K]
Now the CRC32 mismatches again. We still don't know which part is
correct. We can keep going by recording different CRC32 seeds for this
range too, but we must not forget the two seeds already recorded, so we
now track four CRC32 seeds for 0~12K: (Mi0, Mi0), (Mi0, Mi1),
(Mi1, Mi0), (Mi1, Mi1).

[12~16K]
Continue calculating the CRC32 with the above four seeds. Finally we
find that the checksum stored in the tree block matches the (Mi1, Mi0)
combination, and we can restore the tree block.

Yes, with this behavior we can restore the tree block even when three
parts are corrupted, but in that case we need to try 8 combinations.

So I don't consider this any easier to implement than my idea.

[ROOT CAUSE]
The complexity comes from two things:

1) The checksum algorithm depends on previous input (the seed).
Almost all checksum algorithms depend on previous input, so you don't
know whether the data is correct until all of it has been checksummed.

2) There is only one extra duplication for RAID1/DUP.
With only two mirrors there is no third copy to act as a tie-breaker
when the mirrors disagree.

My partial checksum addresses both root causes at the same time.
With partial checksums we don't depend on previous data to calculate a
checksum, and we gain an extra reference: when the mirrors differ, the
per-block checksum tells us which copy is correct.
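To make the difference concrete, here is a rough user-space sketch of
the combination search discussed above. It is only an illustration: the
function name and layout are made up, zlib's crc32() stands in for the
crc32c the kernel uses, and for simplicity the checksum covers the whole
buffer instead of skipping the csum area in the header.

/*
 * Sketch only: rebuild a tree block from two mirrors whose whole-block
 * csum both failed, by trying every combination of their 4K sub-blocks.
 */
#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define BLOCKSIZE	(16 * 1024)		/* tree block (leaf) size */
#define SUBSIZE		(4 * 1024)		/* rebuild granularity */
#define NR_SUB		(BLOCKSIZE / SUBSIZE)

/* Returns 0 and fills 'out' on success, -1 if no combination matches. */
static int rebuild_from_mirrors(const uint8_t *m0, const uint8_t *m1,
				uint32_t expected, uint8_t *out)
{
	int diff[NR_SUB], nr_diff = 0, i;

	/* Find the 4K sub-blocks where the two mirrors disagree. */
	for (i = 0; i < NR_SUB; i++)
		if (memcmp(m0 + i * SUBSIZE, m1 + i * SUBSIZE, SUBSIZE))
			diff[nr_diff++] = i;

	/* Try all 2^nr_diff combinations of the disagreeing sub-blocks. */
	for (unsigned long mask = 0; mask < (1UL << nr_diff); mask++) {
		memcpy(out, m0, BLOCKSIZE);
		for (i = 0; i < nr_diff; i++)
			if (mask & (1UL << i))
				memcpy(out + diff[i] * SUBSIZE,
				       m1 + diff[i] * SUBSIZE, SUBSIZE);

		if (crc32(0, out, BLOCKSIZE) == expected)
			return 0;	/* this combination checks out */
	}
	return -1;	/* some 4K range is bad in both copies */
}

With k differing 4K sub-blocks this needs up to 2^k checksum passes over
the whole block, which is the 8 tries in the three-corruption case above.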
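For comparison, an equally rough sketch of the repair once every 4K
sub-block carries its own CRC32. The layout is invented just for the
example (an array of per-4K checksums, as if stored in the spare bytes
of the csum area); see the RFC patch itself for the real on-disk format.

/*
 * Sketch only: with one CRC32 per 4K sub-block, each range can be judged
 * on its own, so no seeds are carried and no combinations are tried.
 */
#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define BLOCKSIZE	(16 * 1024)
#define SUBSIZE		(4 * 1024)
#define NR_SUB		(BLOCKSIZE / SUBSIZE)

/* csum0/csum1 hold the per-4K checksums read from each mirror's header. */
static int rebuild_with_partial_csums(const uint8_t *m0, const uint32_t *csum0,
				      const uint8_t *m1, const uint32_t *csum1,
				      uint8_t *out)
{
	for (int i = 0; i < NR_SUB; i++) {
		const uint8_t *sub0 = m0 + i * SUBSIZE;
		const uint8_t *sub1 = m1 + i * SUBSIZE;

		/* Pick whichever mirror's 4K range matches its own csum. */
		if (crc32(0, sub0, SUBSIZE) == csum0[i])
			memcpy(out + i * SUBSIZE, sub0, SUBSIZE);
		else if (crc32(0, sub1, SUBSIZE) == csum1[i])
			memcpy(out + i * SUBSIZE, sub1, SUBSIZE);
		else
			return -1;	/* this 4K range is bad in both copies */
	}
	return 0;	/* single linear pass, no 2^k search */
}

Each 4K range is verified independently, so even the three-corruption
case above is handled in one linear pass.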
>
> That's a rough idea that I hope will cover most of the cases when it
> happens. With some more exhaustive attempts to rebuild the metadata
> block we can try to repair 2 damaged blocks.
>
> As this is completely independent, we can test it separately, and also
> add it as a rescue feature to the userspace tools.
>
>> Yes, this corruption case may be minor enough, since even corruption
>> in one mirror is rare enough.
>> So I didn't introduce a new CRC32 checksum, but used the extra 32-4
>> bytes to store the partial CRC32s to keep backward compatibility.
>
> The above would work with any checksums, without the need to store the
> per-block checksums which become impossible with stronger algorithms.

[FURTHER CSUM DESIGN]
As you mentioned, a stronger checksum like SHA256 cannot do such a
partial checksum, since it uses all of the csum space.

But that is already a trade-off, right?
If btrfs had used partial csums from the beginning, the pros and cons of
a stronger checksum would be quite obvious:

Stronger checksum:
(Pros)
1. Fewer collisions, which means more security.
(Cons)
1. More space usage for data: 4 bytes vs 32 bytes per 4K.
2. No (or less) partial checksum for metadata, so partial corruption is
   harder to detect and repair.
3. (Generally) more CPU time, since stronger checksums are usually
   cryptographic hashes.

So with this patch I also want to get other developers thinking about
the trade-offs between different checksums, especially the fact that we
can make better use of the spare metadata csum space whenever the csum
length is less than 32 bytes.

Thanks,
Qu