From: Goffredo Baroncelli <kreijack@inwind.it>
To: Chris Murphy <lists@colorremedies.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [RFC] Checksum of the parity
Date: Mon, 14 Aug 2017 22:18:48 +0200
Message-ID: <23b09099-7cb5-82fe-c941-4701136f952e@inwind.it>
In-Reply-To: <CAJCQCtTTgdyfMzZKGqMo+LyA2dW+Csq+xp6TqoGyqEmsQ-j8eQ@mail.gmail.com>

On 08/14/2017 09:28 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>>> [2]
>>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>>> not sure, sadly I don't read the code well enough. But if all Btrfs
>>> raid56 writes are full stripe CoW writes, and if the prescribed order
>>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>>> superblock update, then I don't see how the write hole happens. Write
>>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>>> and a crash during the modification of the stripe making that stripe
>>> inconsistent as well as still pointed to by metadata.
>>
>>
>> RAID5 is *single*-failure proof, and to hit the write hole bug we need two failures:
>> 1) a transaction is aborted (e.g. due to a power failure), with the result that data and parity are misaligned
>> 2) a disk disappears
>>
>> These two events may even happen at different times.
>>
>> The key is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.
>>
>> Let me show an example
>>
>> Disk 1            Disk 2         Disk 3  (parity)
>> AAAAAA            BBBBBB         CCCCCC
>>
>> where CCCCCC = AAAAAA ^ BBBBBB
>>
>> Note1: AAAAAA is valid data
>>
>> Suppose you update B and, because of a power failure, the parity is not updated. You then have:
>>
>>
>> Disk 1            Disk 2         Disk 3  (parity)
>> AAAAAA            DDDDDD         CCCCCC
>>
>> Of course CCCCCC != AAAAAA ^ DDDDDD  (data and parity are misaligned).
>>
>>
>> Note that AAAAAA is still valid data.
>>
>> Now suppose you lose disk 1. To read from it, you have to read disk 2 and disk 3 and compute disk 1 from them.
>>
>> However, disk 2 and disk 3 are misaligned, so DDDDDD ^ CCCCCC no longer gives you AAAAAA.
>>
>>
>> Note that it does not matter whether DDDDDD or BBBBBB is valid or invalid data.
> 
> 
> Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
> results in csum mismatch. This I've tested.

I never argued about that. The write hole is about the *loss* of "valid data" due to a misalignment between data and parity.
The fact that BTRFS is able to detect the problem and return -EIO doesn't mitigate the loss of valid data. Note that in my example AAAAAA reached the disk before the "failure events".
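
To make the point concrete, here is a minimal Python sketch of the three-disk example quoted above. It is only an illustration of XOR parity in general, not Btrfs code: the helper name xor_bytes is made up for this message, and zlib.crc32 is used as a stand-in for the data checksum (Btrfs uses crc32c, which the Python standard library does not provide).

```python
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Consistent stripe: parity = data1 ^ data2
disk1 = b"AAAAAA"
disk2 = b"BBBBBB"
disk3 = xor_bytes(disk1, disk2)          # parity "CCCCCC" in the example

# Checksum of the valid data on disk 1, recorded before any failure.
# crc32 is only a stand-in for the real checksum algorithm.
csum_disk1 = zlib.crc32(disk1)

# With a consistent stripe, a lost disk 1 can be rebuilt exactly.
rebuilt_ok = xor_bytes(disk2, disk3)

# Failure 1: disk 2 is updated (B -> D) but the parity update is lost.
disk2 = b"DDDDDD"

# Failure 2: disk 1 disappears and is rebuilt from disk 2 and the stale parity.
rebuilt_bad = xor_bytes(disk2, disk3)

# The checksum detects the bad rebuild, but it cannot bring AAAAAA back.
detected = zlib.crc32(rebuilt_bad) != csum_disk1
```

Here rebuilt_ok equals the original AAAAAA, while rebuilt_bad does not, even though disk 1 was never written during the interrupted update. The checksum mismatch turns silent corruption into an error, which is good, but the valid data is still gone.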

> 
> I vaguely remember a while ago doing a dd conv=notrunc modification of
> a file that's raid5, and there was no RMW, what happened is the whole
> stripe was CoW'd and had the modification. So that would, hardware
> behaving correctly, mean that the raid5 data CoW succeeds, then there
> is a metadata CoW to point to it, then the super block is updated to
> point to the new tree.
> 
> At any point, if there's an interruption, we have the old super
> pointing to the old tree which points to premodified data.
> 
> Anyway, I do wish I read the code better, so I knew exactly where, if
> at all, the RMW code was happening on disk rather than just in memory.
> There very clearly is RMW in memory code as a performance optimizer,
> before a stripe gets written out it's possible to RMW it to add in
> more changes or new files, that way raid56 isn't dog slow CoW'ing
> literally a handful of 16KiB leaves each time, that then translate
> into a minimum of 384K of writes.

In the case of a full stripe write, there is no RMW cycle, so no "write hole". Unfortunately not all writes are full stripe size. I never checked the code, but I hope that during a transaction commit the writes are grouped into full stripe writes as far as possible.

Just out of curiosity, what is the "minimum of 384K"? In a 3-disk raid5 case, the minimum data is 64K * 2 (+ 64K of parity)...
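
To spell out the arithmetic behind my question, here is a small Python sketch. It assumes a 64K per-disk stripe element (the figure I used above) and a write address space that starts on a stripe boundary; the function names are invented for this message and are not taken from the Btrfs code.

```python
STRIPE_LEN = 64 * 1024  # assumed per-disk stripe element (64K)


def full_stripe_data_bytes(ndisks: int, nparity: int = 1,
                           stripe_len: int = STRIPE_LEN) -> int:
    """Data bytes carried by one full stripe (parity excluded)."""
    if ndisks <= nparity:
        raise ValueError("need more disks than parity stripes")
    return (ndisks - nparity) * stripe_len


def full_stripe_total_bytes(ndisks: int, stripe_len: int = STRIPE_LEN) -> int:
    """Bytes hitting the disks for one full stripe (data + parity)."""
    return ndisks * stripe_len


def is_full_stripe_write(offset: int, length: int, ndisks: int,
                         nparity: int = 1,
                         stripe_len: int = STRIPE_LEN) -> bool:
    """True if the write covers only whole stripes, so no RMW is needed."""
    data = full_stripe_data_bytes(ndisks, nparity, stripe_len)
    return length > 0 and offset % data == 0 and length % data == 0


# 3-disk raid5: 2 * 64K of data + 64K of parity per full stripe
data_3 = full_stripe_data_bytes(3)       # 131072 bytes (128K)
total_3 = full_stripe_total_bytes(3)     # 196608 bytes (192K)
```

With these assumptions, is_full_stripe_write(0, 128 * 1024, 3) is true, while a single 16K write such as is_full_stripe_write(0, 16 * 1024, 3) is false and would need an RMW cycle unless it is first grouped with other writes.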

> But yeah, Qu just said in another thread that Liu is working on a
> journal for the raid56 write hole problem. Thing is I don't see when
> it happens in the code or in practice (so far, it's really tedious to
> poke a file system with a stick).

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

