On 2018/12/3 4:30 AM, Andrei Borzenkov wrote:
> 02.12.2018 23:14, Patrick Dijkgraaf writes:
>> I have some additional info.
>>
>> I found the reason the FS got corrupted. It was a single failing drive,
>> which caused the entire cabinet (containing 7 drives) to reset. So the
>> FS suddenly lost 7 drives.
>>
>
> This remains a mystery for me. btrfs is marketed as always consistent
> on disk - you either have the previous full transaction or the current
> full transaction. If the current transaction was interrupted, the
> promise is that you are left with the previous valid, consistent
> transaction.
>
> Obviously this is not what happens in practice. Which nullifies the main
> selling point of btrfs.
>
> Unless this is expected behavior, it sounds like some barriers are
> missing and summary data is updated before (and without waiting for)
> subordinate data. And if it is expected behavior ...

There is one (unfortunately) known problem for RAID5/6 and one special
problem for RAID6.

The common problem is the write hole.
For a RAID5 stripe like:

 Disk 1  |  Disk 2  |  Disk 3
------------------------------
 DATA1   |  DATA2   |  PARITY

If we have written something into DATA1, but a power loss happens before
we update PARITY on Disk 3, we can no longer tolerate the loss of Disk 2,
since DATA1 doesn't match PARITY anymore.

Without a record of exactly which blocks have been written, the write
hole problem exists for any parity-based solution, including btrfs
RAID5/6.

From what others on the mailing list have said, other RAID5/6
implementations keep their own on-disk record of which blocks are being
updated, so after a power loss they only need to rebuild the involved
stripes.

Since btrfs doesn't have such a record, we need to scrub the whole fs to
regain the disk-loss tolerance (and hope there will not be another power
loss during the scrub).

The RAID6-specific problem is the missing rebuild retry logic.
(Not any more since the 4.16 kernel, but btrfs-progs support is still
missing.)

For a RAID6 stripe like:

 Disk 1  |  Disk 2  |  Disk 3  |  Disk 4
-----------------------------------------
 DATA1   |  DATA2   |    P     |    Q

If reading DATA1 fails, we have 3 ways to rebuild the data:
1) Using DATA2 and P (just as RAID5)
2) Using P and Q
3) Using DATA2 and Q

However, before 4.16 we wouldn't retry all the possible ways to rebuild
it. (Thanks to Liu for solving this problem.)

Thanks,
Qu

>
>> I have removed the failed drive, so the RAID is now degraded. I hope
>> the data is still recoverable... ☹
>>
>
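
To make the write hole above concrete, here is a minimal sketch in plain C
(not btrfs code), assuming one-byte "blocks" and made-up values. It shows
that rebuilding DATA2 from the newly written DATA1 and the stale PARITY
returns the wrong data:

/*
 * Minimal write-hole illustration: DATA1 is rewritten, power is lost
 * before PARITY is updated, and DATA2 can no longer be reconstructed.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t data1 = 0xAA, data2 = 0x55;
    uint8_t parity = data1 ^ data2;          /* consistent stripe: P = D1 ^ D2 */

    data1 = 0x12;                            /* new write lands on Disk 1 ...  */
    /* ... power loss here: PARITY on Disk 3 is never updated */

    uint8_t rebuilt_data2 = data1 ^ parity;  /* try to rebuild a lost Disk 2   */
    printf("expected DATA2 = 0x%02x, rebuilt DATA2 = 0x%02x\n",
           data2, rebuilt_data2);            /* values differ: the write hole  */
    return 0;
}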
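
For the RAID6 part, a similar sketch (again not the kernel implementation)
of two of the independent rebuild paths for a bad DATA1 block, using the
common RAID6 convention of GF(2^8) with generator 2 and polynomial 0x11d;
the gmul() helper is made up here for illustration:

/* Multiply in GF(2^8) with the RAID6 polynomial x^8 + x^4 + x^3 + x^2 + 1. */
#include <stdio.h>
#include <stdint.h>

static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (a & 0x80) ? (uint8_t)((a << 1) ^ 0x1d) : (uint8_t)(a << 1);
        b >>= 1;
    }
    return p;
}

int main(void)
{
    uint8_t d1 = 0x3c, d2 = 0x9f;
    uint8_t p = d1 ^ d2;                     /* P = D1 ^ D2                    */
    uint8_t q = gmul(1, d1) ^ gmul(2, d2);   /* Q = g^0*D1 ^ g^1*D2            */

    /* Pretend the read of DATA1 failed; try independent rebuild paths.        */
    uint8_t via_p = d2 ^ p;                  /* 1) DATA2 + P (just as RAID5)   */
    uint8_t via_q = q ^ gmul(2, d2);         /* 3) DATA2 + Q                   */
    /* 2) P + Q alone also recovers D1 and D2 (two equations, two unknowns
     *    in GF(2^8)); omitted here to keep the sketch short.                  */

    printf("D1 = 0x%02x, via P = 0x%02x, via Q = 0x%02x\n", d1, via_p, via_q);
    return 0;
}

The point of the retry logic is simply that if one of these combinations
still produces data failing the checksum, the next one should be tried
before giving up.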