From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-vk0-f48.google.com ([209.85.213.48]:33553 "EHLO mail-vk0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750904AbdHNT2l (ORCPT ); Mon, 14 Aug 2017 15:28:41 -0400
Received: by mail-vk0-f48.google.com with SMTP id j189so34737764vka.0 for ; Mon, 14 Aug 2017 12:28:41 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To:
References: <5d703b4c-e8ac-21dc-e327-ff1d8e232ee9@inwind.it>
From: Chris Murphy
Date: Mon, 14 Aug 2017 13:28:40 -0600
Message-ID:
Subject: Re: [RFC] Checksum of the parity
To: Goffredo Baroncelli
Cc: Chris Murphy, linux-btrfs
Content-Type: text/plain; charset="UTF-8"
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli wrote:
> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>> [2]
>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>> not sure, sadly I don't read the code well enough. But if all Btrfs
>> raid56 writes are full stripe CoW writes, and if the prescribed order
>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>> superblock update, then I don't see how the write hole happens. Write
>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>> and a crash during the modification of the stripe making that stripe
>> inconsistent as well as still pointed to by metadata.
>
>
> RAID5 is *single* failure proof. And in order to have the write hole bug we need two failures:
> 1) a transaction is aborted (e.g. due to a power failure) and the result is that data and parity are misaligned
> 2) a disk disappears
>
> These two events may even happen at different moments.
>
> The key is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.
>
> Let me show an example
>
> Disk 1    Disk 2    Disk 3 (parity)
> AAAAAA    BBBBBB    CCCCCC
>
> where CCCCCC = AAAAAA ^ BBBBBB
>
> Note1: AAAAAA is valid data
>
> Suppose you update B and, due to a power failure, you can't update the parity. You have:
>
>
> Disk 1    Disk 2    Disk 3 (parity)
> AAAAAA    DDDDDD    CCCCCC
>
> Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned).
>
>
> Note that AAAAAA is still valid data.
>
> Now suppose you lose disk1. If you want to read from it, you have to read disk2 and disk3 to compute disk1.
>
> However, disk2 and disk3 are misaligned, so computing DDDDDD ^ CCCCCC doesn't give you AAAAAA anymore.
>
>
> Note that it is not important whether DDDDDD or BBBBBB is valid or invalid data.

Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
results in a csum mismatch. This I've tested.

I vaguely remember a while ago doing a dd conv=notrunc modification of
a file that's raid5, and there was no RMW; what happened is that the
whole stripe was CoW'd with the modification. So that would, hardware
behaving correctly, mean that the raid5 data CoW succeeds, then there
is a metadata CoW to point to it, then the super block is updated to
point to the new tree.

At any point, if there's an interruption, we have the old super
pointing to the old tree, which points to the premodified data.

Anyway, I do wish I read the code better, so I knew exactly where, if
at all, the RMW code was happening on disk rather than just in memory.
There very clearly is in-memory RMW code as a performance
optimization: before a stripe gets written out, it's possible to RMW
it to add in more changes or new files. That way raid56 isn't dog slow
CoW'ing literally a handful of 16KiB leaves each time, which then
translate into a minimum of 384K of writes.

But yeah, Qu just said in another thread that Liu is working on a
journal for the raid56 write hole problem.
Thing is, I don't see when it happens in the code or in practice (so
far; it's really tedious to poke a file system with a stick).

--
Chris Murphy
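
For anyone who wants to poke at the quoted example without a real array, here is a minimal sketch in Python of the stale-parity rebuild and of the csum check that flags it. Everything in it is an assumption made for illustration: the six-byte "disks" are the letters from the example, `xor_blocks` is a made-up helper, and `zlib.crc32` only stands in for the Btrfs data csum (it is not the kernel's crc32c code path).

```python
import zlib


def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (hypothetical helper)."""
    return bytes(p ^ q for p, q in zip(x, y))


# Consistent stripe from the example: parity C = A ^ B.
a = b"AAAAAA"
b = b"BBBBBB"
parity = xor_blocks(a, b)

# Checksum of A recorded while the stripe was still consistent.
# zlib.crc32 is only a stand-in for the filesystem's data csum.
csum_a = zlib.crc32(a)

# Interrupted update: disk 2 now holds D, but the parity was never rewritten.
disk2 = b"DDDDDD"

# Disk 1 disappears; rebuild it from disk 2 and the stale parity.
rebuilt_a = xor_blocks(disk2, parity)

write_hole = rebuilt_a != a                  # the rebuild is wrong
detected = zlib.crc32(rebuilt_a) != csum_a   # and the csum check notices

# Control case: had the parity been updated together with disk 2,
# the rebuild would have returned the original A.
good_parity = xor_blocks(a, disk2)
recovered = xor_blocks(disk2, good_parity)

print(rebuilt_a, write_hole, detected, recovered == a)
```

The sketch only shows that a rebuild from stale parity returns wrong bytes and that a stored checksum of the original block catches it. It says nothing about whether or where Btrfs does RMW on disk.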