From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp-32-i6.italiaonline.it ([212.48.14.166]:44154 "EHLO libero.it" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752360AbdHNOMq (ORCPT ); Mon, 14 Aug 2017 10:12:46 -0400
Reply-To: kreijack@inwind.it
Subject: Re: [RFC] Checksum of the parity
To: Chris Murphy
Cc: linux-btrfs
References: <5d703b4c-e8ac-21dc-e327-ff1d8e232ee9@inwind.it>
From: Goffredo Baroncelli
Message-ID:
Date: Mon, 14 Aug 2017 16:12:42 +0200
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 08/13/2017 08:45 PM, Chris Murphy wrote:
> [2]
> Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure, sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full stripe CoW writes, and if the prescribed order
> guarantees still happen: data CoW to disk > metadata CoW to disk >
> superblock update, then I don't see how the write hole happens. Write
> hole requires: RMW of a stripe, which is a partial stripe overwrite,
> and a crash during the modification of the stripe making that stripe
> inconsistent as well as still pointed to by metadata.

RAID5 is *single* failure proof. In order to hit the write hole bug we need two failures:
1) a transaction is aborted (e.g. due to a power failure), and the result is that data and parity are misaligned
2) a disk disappears

These two events may even happen at different moments. The key is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.

Let me show an example:

Disk 1          Disk 2          Disk 3 (parity)
AAAAAA          BBBBBB          CCCCCC

where CCCCCC = AAAAAA ^ BBBBBB

Note1: AAAAAA is valid data

Suppose you update B and, due to a power failure, you can't update the parity. You have:

Disk 1          Disk 2          Disk 3 (parity)
AAAAAA          DDDDDD          CCCCCC

Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned).
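To make the example above concrete, here is a minimal sketch in Python. It is only a toy model of a three-disk stripe, not Btrfs code: the helper xor_blocks and all variable names are made up for illustration, and the letters are plain bytes (so the real XOR of "AAAAAA" and "BBBBBB" is not literally "CCCCCC").

```python
# Toy model of the 3-disk stripe from the example (hypothetical names,
# not Btrfs code). Parity is the byte-wise XOR of the two data blocks.

def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Return the byte-wise XOR of two equal-length blocks."""
    assert len(x) == len(y)
    return bytes(a ^ b for a, b in zip(x, y))

# Consistent stripe: parity = disk1 ^ disk2 (the "CCCCCC" of the example).
disk1 = b"AAAAAA"
disk2 = b"BBBBBB"
parity = xor_blocks(disk1, disk2)

# While data and parity are aligned, a lost disk1 can be rebuilt.
rebuilt_ok = xor_blocks(disk2, parity)

# Event 1: disk2 is overwritten, but the crash prevents the parity update.
disk2 = b"DDDDDD"
stripe_consistent = parity == xor_blocks(disk1, disk2)

# Event 2: disk1 disappears and is rebuilt from disk2 and the stale parity.
rebuilt_bad = xor_blocks(disk2, parity)

print(rebuilt_ok == disk1)    # True
print(stripe_consistent)      # False
print(rebuilt_bad == disk1)   # False
```

The last line is the point: disk1 was never touched by the interrupted write, yet it can no longer be recovered from the other two disks.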
Pay attention that AAAAAA is still valid data.

Now suppose you lose disk1. If you want to read from it, you have to read disk2 and disk3 to compute disk1. However, disk2 and disk3 are misaligned, so computing DDDDDD ^ CCCCCC you don't get AAAAAA anymore.

Note that it is not important whether DDDDDD or BBBBBB are valid or invalid data.

Moreover, I have to point out that a simple scrub run between 1) and 2) is able to rebuild a correct parity. This would reduce the likelihood of the "write hole" bug. The only case which would still exist is when 1) and 2) happen at the same time (which is not impossible: e.g. if a disk dies, it is not infrequent that the user shuts down the machine without waiting for a clean shutdown).

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5