* Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
@ 2018-06-28  1:42 Remi Gauvin
  2018-06-28  1:58 ` Qu Wenruo
  2018-06-28 13:24 ` Anand Jain
  0 siblings, 2 replies; 28+ messages in thread
From: Remi Gauvin @ 2018-06-28  1:42 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 708 bytes --]

There seems to be a major design flaw with BTRFS that needs to be better
documented, to avoid massive data loss.

Tested with Raid 1 on Ubuntu Kernel 4.15

The use case being tested was a Virtualbox VDI file created with
NODATACOW attribute, (as is often suggested, due to the painful
performance penalty of COW on these files.)
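
For reference, a NOCOW VM image is typically set up along these lines
(the path is only illustrative, and the attribute has to be applied
before the file contains any data):

    # set the NOCOW attribute on a directory so new images inherit it
    mkdir -p /srv/vm
    chattr +C /srv/vm
    lsattr -d /srv/vm    # the 'C' flag confirms NOCOW is set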

However, if a device is temporarily dropped (in this case, tested by
disconnecting drives) and re-connects automatically on the next boot,
BTRFS does not in any way synchronize the VDI file, or have any means
of knowing that one of the copies is out of date and bad.

The result of trying to use said VDI file is.... interestingly insane.
Scrub did not do anything to rectify the situation.



[-- Attachment #2: remi.vcf --]
[-- Type: text/x-vcard, Size: 203 bytes --]

begin:vcard
fn:Remi Gauvin
n:Gauvin;Remi
org:Georgian Infotech
adr:;;3-51 Sykes St. N.;Meaford;ON;N4L 1X3;Canada
email;internet:remi@georgianit.com
tel;work:226-256-1545
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  1:42 Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files Remi Gauvin
@ 2018-06-28  1:58 ` Qu Wenruo
  2018-06-28  2:10   ` Remi Gauvin
  2018-06-28 13:24 ` Anand Jain
  1 sibling, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2018-06-28  1:58 UTC (permalink / raw)
  To: Remi Gauvin, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1181 bytes --]



On 2018年06月28日 09:42, Remi Gauvin wrote:
> There seems to be a major design flaw with BTRFS that needs to be better
> documented, to avoid massive data loss.
> 
> Tested with Raid 1 on Ubuntu Kernel 4.15
> 
> The use case being tested was a Virtualbox VDI file created with
> NODATACOW attribute, (as is often suggested, due to the painful
> performance penalty of COW on these files.)

NODATACOW implies NODATASUM.

From btrfs(5):
---
Enable data copy-on-write for newly created files.  Nodatacow
implies nodatasum, and disables compression. All files created
under nodatacow are also set the NOCOW file attribute (see
chattr(1)).
---

Although it's talking about the mount option, it also applies to
per-inode options.
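
You can verify this on the file itself with something like the
following (the path is illustrative):

    lsattr /srv/vm/disk.vdi
    # the 'C' flag in the output means NOCOW, and therefore the file's
    # data blocks carry no checksums btrfs could use to pick a good copy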

Thanks,
Qu

> 
> However, if a device is temporarily dropped (this in case, tested by
> disconnecting drives.) and re-connects automatically next boot, BTRFS
> does not in any way synchronize the VDI file, or have any means to know
> that one of copy is out of date and bad.
> 
> The result of trying to use said VDI file is.... interestingly insane.
> Scrub did not do anything to rectify the situation.
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  1:58 ` Qu Wenruo
@ 2018-06-28  2:10   ` Remi Gauvin
  2018-06-28  2:55     ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Remi Gauvin @ 2018-06-28  2:10 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 929 bytes --]

On 2018-06-27 09:58 PM, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 09:42, Remi Gauvin wrote:
>> There seems to be a major design flaw with BTRFS that needs to be better
>> documented, to avoid massive data loss.
>>
>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>
>> The use case being tested was a Virtualbox VDI file created with
>> NODATACOW attribute, (as is often suggested, due to the painful
>> performance penalty of COW on these files.)
> 
> NODATACOW implies NODATASUM.
> 

Yes, yes, none of which changes the simple fact that if you use this
option, which is often touted as outright necessary for some types of
files, BTRFS raid is worse than useless: not only will it not protect
your data at all from bitrot (as expected), it will actively go out of
its way to corrupt it!

This is not expected behaviour from 'Raid', and I despair that this
seems to be something that I have to explain!



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  2:10   ` Remi Gauvin
@ 2018-06-28  2:55     ` Qu Wenruo
  2018-06-28  3:14       ` remi
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2018-06-28  2:55 UTC (permalink / raw)
  To: Remi Gauvin, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1248 bytes --]



On 2018年06月28日 10:10, Remi Gauvin wrote:
> On 2018-06-27 09:58 PM, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 09:42, Remi Gauvin wrote:
>>> There seems to be a major design flaw with BTRFS that needs to be better
>>> documented, to avoid massive data loss.
>>>
>>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>>
>>> The use case being tested was a Virtualbox VDI file created with
>>> NODATACOW attribute, (as is often suggested, due to the painful
>>> performance penalty of COW on these files.)
>>
>> NODATACOW implies NODATASUM.
>>
> 
> yes yes,, none of which changes the simple fact that if you use this
> option, which is often touted as outright necessary for some types of
> files, BTRFS raid is worse than useless,, not only will it not protect
> your data at all from bitrot, (as expected), it will actively go out of
> it's way to corrupt it!
> 
> This is not expected behaviour from 'Raid', and I despair that seems to
> be something that I have to explain!

Nope, all normal RAID1 behaves the same way: if you corrupt one copy,
you won't know which one is correct.
Btrfs csum is already doing a much better job than plain RAID1.

Please get yourself clear on what other RAID1 implementations are doing.

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  2:55     ` Qu Wenruo
@ 2018-06-28  3:14       ` remi
  2018-06-28  5:39         ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: remi @ 2018-06-28  3:14 UTC (permalink / raw)
  To: linux-btrfs



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:

> 
> Please get yourself clear of what other raid1 is doing.

A drive failure, where the drive is still there when the computer reboots, is a situation that *any* RAID 1 (or for that matter, RAID 5, RAID 6, anything but RAID 0) will recover from perfectly without breaking a sweat. Some will rebuild the array automatically, others will automatically kick out the misbehaving drive.  *None* of them will take back the drive with old data and start commingling that data with the good copy. This behaviour from BTRFS is completely abnormal, and defeats even the most basic expectations of RAID.

I'm not the one who has to clear his expectations here.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  3:14       ` remi
@ 2018-06-28  5:39         ` Qu Wenruo
  2018-06-28  8:16           ` Andrei Borzenkov
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2018-06-28  5:39 UTC (permalink / raw)
  To: remi, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1834 bytes --]



On 2018年06月28日 11:14, remi@georgianit.com wrote:
> 
> 
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
> 
>>
>> Please get yourself clear of what other raid1 is doing.
> 
> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,

WOW, that's black magic, at least for RAID1.
Plain RAID1 has no idea which copy is correct, unlike btrfs, which has
datasum.

Never mind other things, just tell me: how do you determine which one
is correct?

The only possibility is that the misbehaving device missed several
superblock updates, so we have a chance to detect that it's out of
date. But that doesn't always work.

If you're talking about the missing generation check in btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 (mdraid or LVM mirrored) can also be affected
(the split-brain case).

> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.

RAID1 can only tolerate one missing device; it has nothing to do with
error detection.
And it's impossible to detect such a case without extra help.

Your expectation is completely wrong.

> 
> I'm not the one who has to clear his expectations here.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  5:39         ` Qu Wenruo
@ 2018-06-28  8:16           ` Andrei Borzenkov
  2018-06-28  8:20             ` Andrei Borzenkov
  2018-06-28  9:15             ` Qu Wenruo
  0 siblings, 2 replies; 28+ messages in thread
From: Andrei Borzenkov @ 2018-06-28  8:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: remi, Btrfs BTRFS

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>
>>
>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>
>>>
>>> Please get yourself clear of what other raid1 is doing.
>>
>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>
> WOW, that's black magic, at least for RAID1.
> The whole RAID1 has no idea of which copy is correct unlike btrfs who
> has datasum.
>
> Don't bother other things, just tell me how to determine which one is
> correct?
>

When one drive fails, it is recorded in metadata on the remaining
drives; typically a configuration generation number is increased. The
next time, the drive with the older generation is not incorporated.
Hardware controllers also keep this information in NVRAM and so do not
even depend on scanning the other disks.

> The only possibility is that, the misbehaved device missed several super
> block update so we have a chance to detect it's out-of-date.
> But that's not always working.
>

Why should it not work, as long as any write to the array is suspended
until the superblock on the remaining devices is updated?

> If you're talking about missing generation check for btrfs, that's
> valid, but it's far from a "major design flaw", as there are a lot of
> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
> (the brain-split case).
>

That's different. Yes, with software-based raid there is usually no
way to detect an outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.

>> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.
>
> RAID1 can only tolerate 1 missing device, it has nothing to do with
> error detection.
> And it's impossible to detect such case without extra help.
>
> Your expectation is completely wrong.
>

Well ... somehow it is my experience as well ... :)

>>
>> I'm not the one who has to clear his expectations here.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  8:16           ` Andrei Borzenkov
@ 2018-06-28  8:20             ` Andrei Borzenkov
  2018-06-28  9:15             ` Qu Wenruo
  1 sibling, 0 replies; 28+ messages in thread
From: Andrei Borzenkov @ 2018-06-28  8:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: remi, Btrfs BTRFS

On Thu, Jun 28, 2018 at 11:16 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>
>>>>
>>>> Please get yourself clear of what other raid1 is doing.
>>>
>>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID1.
>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>> has datasum.
>>
>> Don't bother other things, just tell me how to determine which one is
>> correct?
>>
>
> When one drive fails, it is recorded in meta-data on remaining drives;
> probably configuration generation number is increased. Next time drive
> with older generation is not incorporated. Hardware controllers also
> keep this information in NVRAM and so do not even depend on scanning
> of other disks.
>
>> The only possibility is that, the misbehaved device missed several super
>> block update so we have a chance to detect it's out-of-date.
>> But that's not always working.
>>
>
> Why it should not work as long as any write to array is suspended
> until superblock on remaining devices is updated?
>
>> If you're talking about missing generation check for btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>> (the brain-split case).
>>
>
> That's different. Yes, with software-based raid there is usually no
> way to detect outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.
>
>>> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.
>>
>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>> error detection.
>> And it's impossible to detect such case without extra help.
>>
>> Your expectation is completely wrong.
>>
>
> Well ... somehow it is my experience as well ... :)

s/experience/expectation/

sorry.

>
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  8:16           ` Andrei Borzenkov
  2018-06-28  8:20             ` Andrei Borzenkov
@ 2018-06-28  9:15             ` Qu Wenruo
  2018-06-28 11:12               ` Austin S. Hemmelgarn
                                 ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Qu Wenruo @ 2018-06-28  9:15 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: remi, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 4260 bytes --]



On 2018年06月28日 16:16, Andrei Borzenkov wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>
>>>>
>>>> Please get yourself clear of what other raid1 is doing.
>>>
>>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID1.
>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>> has datasum.
>>
>> Don't bother other things, just tell me how to determine which one is
>> correct?
>>
> 
> When one drive fails, it is recorded in meta-data on remaining drives;
> probably configuration generation number is increased. Next time drive
> with older generation is not incorporated. Hardware controllers also
> keep this information in NVRAM and so do not even depend on scanning
> of other disks.

Yep, the only possible way to determine such a case is from external
info.

For device generation, it's possible to enhance btrfs, but at least we
could start by detecting it and refusing to mount read-write, to avoid
possible further corruption.
But anyway, if one really cares about such a case, a hardware RAID
controller seems to be the only solution, as other software may have
the same problem.
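
For reference, the per-device superblock generation this could be based
on is already visible from userspace; device names here are only
illustrative:

    btrfs inspect-internal dump-super /dev/sda2 | grep -w generation
    btrfs inspect-internal dump-super /dev/sdb2 | grep -w generation

A member that missed commits while it was dropped will normally report
a lower generation than the others.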

And the hardware solution looks pretty interesting: is the write to
NVRAM 100% atomic, even at power loss?

> 
>> The only possibility is that, the misbehaved device missed several super
>> block update so we have a chance to detect it's out-of-date.
>> But that's not always working.
>>
> 
> Why it should not work as long as any write to array is suspended
> until superblock on remaining devices is updated?

What happens if there is no generation gap in the device superblocks?

If one device got some of its (nodatacow) data written to disk, while
the other device didn't get the data written, and neither of them
reached the superblock update, there is no difference in the device
superblocks, and thus no way to detect which one is correct.

> 
>> If you're talking about missing generation check for btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>> (the brain-split case).
>>
> 
> That's different. Yes, with software-based raid there is usually no
> way to detect outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.

While for the VDI case (or any VM image file format other than raw),
older valid data normally means corruption, unless the format has its
own write-ahead log.

Some file formats may detect such problems by themselves if they have
an internal checksum, but anyway, older data normally means corruption,
especially when it is partially new and partially old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for a single device).
So the title, especially the "major design flaw" part, can't be any
more wrong.

> 
>>> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.
>>
>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>> error detection.
>> And it's impossible to detect such case without extra help.
>>
>> Your expectation is completely wrong.
>>
> 
> Well ... somehow it is my experience as well ... :)

Acceptable, but it doesn't really apply to software-based RAID1.

Thanks,
Qu

> 
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  9:15             ` Qu Wenruo
@ 2018-06-28 11:12               ` Austin S. Hemmelgarn
  2018-06-28 11:46                 ` Qu Wenruo
  2018-06-28 17:10               ` Andrei Borzenkov
  2018-06-28 22:00               ` Remi Gauvin
  2 siblings, 1 reply; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2018-06-28 11:12 UTC (permalink / raw)
  To: Qu Wenruo, Andrei Borzenkov; +Cc: remi, Btrfs BTRFS

On 2018-06-28 05:15, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>>
>>>>
>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>
>>>>>
>>>>> Please get yourself clear of what other raid1 is doing.
>>>>
>>>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>>>
>>> WOW, that's black magic, at least for RAID1.
>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>> has datasum.
>>>
>>> Don't bother other things, just tell me how to determine which one is
>>> correct?
>>>
>>
>> When one drive fails, it is recorded in meta-data on remaining drives;
>> probably configuration generation number is increased. Next time drive
>> with older generation is not incorporated. Hardware controllers also
>> keep this information in NVRAM and so do not even depend on scanning
>> of other disks.
> 
> Yep, the only possible way to determine such case is from external info.
> 
> For device generation, it's possible to enhance btrfs, but at least we
> could start from detect and refuse to RW mount to avoid possible further
> corruption.
> But anyway, if one really cares about such case, hardware RAID
> controller seems to be the only solution as other software may have the
> same problem.
LVM doesn't.  It detects that one of the devices was gone for some 
period of time and marks the volume as degraded (and _might_, depending 
on how you have things configured, automatically re-sync).  Not sure 
about MD, but I am willing to bet it properly detects this type of 
situation too.
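
With LVM raid1, for example, the out-of-sync state is visible and
repairable from userspace with something like the following (the VG/LV
names are illustrative):

    lvs -a -o name,segtype,copy_percent,lv_health_status vg0
    lvchange --syncaction repair vg0/lv_data   # scrub/repair the raid LV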
> 
> And the hardware solution looks pretty interesting, is the write to
> NVRAM 100% atomic? Even at power loss?
On a proper RAID controller, it's battery backed, and that battery 
backing provides enough power to also make sure that the state change is 
properly recorded in the event of power loss.
> 
>>
>>> The only possibility is that, the misbehaved device missed several super
>>> block update so we have a chance to detect it's out-of-date.
>>> But that's not always working.
>>>
>>
>> Why it should not work as long as any write to array is suspended
>> until superblock on remaining devices is updated?
> 
> What happens if there is no generation gap in device superblock?
> 
> If one device got some of its (nodatacow) data written to disk, while
> the other device doesn't get data written, and neither of them reached
> super block update, there is no difference in device superblock, thus no
> way to detect which is correct.
Yes, but that should be a very small window (at least, once we finally 
quit serializing writes across devices), and it's a problem on existing 
RAID1 implementations too (and therefore isn't something we should be 
using as an excuse for not doing this).
> 
>>
>>> If you're talking about missing generation check for btrfs, that's
>>> valid, but it's far from a "major design flaw", as there are a lot of
>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>> (the brain-split case).
>>>
>>
>> That's different. Yes, with software-based raid there is usually no
>> way to detect outdated copy if no other copies are present. Having
>> older valid data is still very different from corrupting newer data.
> 
> While for VDI case (or any VM image file format other than raw), older
> valid data normally means corruption.
> Unless they have their own write-ahead log.
> 
> Some file format may detect such problem by themselves if they have
> internal checksum, but anyway, older data normally means corruption,
> especially when partial new and partial old.
> 
> On the other hand, with data COW and csum, btrfs can ensure the whole
> filesystem update is atomic (at least for single device).
> So the title, especially the "major design flaw" can't be wrong any more.
The title is excessive, but I'd agree it's a design flaw that BTRFS 
doesn't at least notice that the generation IDs are different and 
preferentially trust the device with the newer generation ID. The only 
special handling I can see that would be needed is around volumes 
mounted with the `nodatacow` option, which may not see generation 
changes for a very long time otherwise.
> 
>>
>>>> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.
>>>
>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>> error detection.
>>> And it's impossible to detect such case without extra help.
>>>
>>> Your expectation is completely wrong.
>>>
>>
>> Well ... somehow it is my experience as well ... :)
> 
> Acceptable, but not really apply to software based RAID1.
> 
> Thanks,
> Qu
> 
>>
>>>>
>>>> I'm not the one who has to clear his expectations here.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 11:12               ` Austin S. Hemmelgarn
@ 2018-06-28 11:46                 ` Qu Wenruo
  2018-06-28 12:20                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2018-06-28 11:46 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Andrei Borzenkov; +Cc: remi, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 6413 bytes --]



On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:
> On 2018-06-28 05:15, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com>
>>> wrote:
>>>>
>>>>
>>>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>>>
>>>>>
>>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>>
>>>>>>
>>>>>> Please get yourself clear of what other raid1 is doing.
>>>>>
>>>>> A drive failure, where the drive is still there when the computer
>>>>> reboots, is a situation that *any* raid 1, (or for that matter,
>>>>> raid 5, raid 6, anything but raid 0) will recover from perfectly
>>>>> without raising a sweat. Some will rebuild the array automatically,
>>>>
>>>> WOW, that's black magic, at least for RAID1.
>>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>>> has datasum.
>>>>
>>>> Don't bother other things, just tell me how to determine which one is
>>>> correct?
>>>>
>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time drive
>>> with older generation is not incorporated. Hardware controllers also
>>> keep this information in NVRAM and so do not even depend on scanning
>>> of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start from detect and refuse to RW mount to avoid possible further
>> corruption.
>> But anyway, if one really cares about such case, hardware RAID
>> controller seems to be the only solution as other software may have the
>> same problem.
> LVM doesn't.  It detects that one of the devices was gone for some
> period of time and marks the volume as degraded (and _might_, depending
> on how you have things configured, automatically re-sync).  Not sure
> about MD, but I am willing to bet it properly detects this type of
> situation too.
>>
>> And the hardware solution looks pretty interesting, is the write to
>> NVRAM 100% atomic? Even at power loss?
> On a proper RAID controller, it's battery backed, and that battery
> backing provides enough power to also make sure that the state change is
> properly recorded in the event of power loss.

Well, that explains a lot of things.

So a hardware RAID controller can be considered as having a special
battery-backed (always atomic) journal device.
If we can't provide a UPS for the whole system, a battery-powered
journal device indeed makes sense.

>>
>>>
>>>> The only possibility is that, the misbehaved device missed several
>>>> super
>>>> block update so we have a chance to detect it's out-of-date.
>>>> But that's not always working.
>>>>
>>>
>>> Why it should not work as long as any write to array is suspended
>>> until superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in device superblock?
>>
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device doesn't get data written, and neither of them reached
>> super block update, there is no difference in device superblock, thus no
>> way to detect which is correct.
> Yes, but that should be a very small window (at least, once we finally
> quit serializing writes across devices), and it's a problem on existing
> RAID1 implementations too (and therefore isn't something we should be
> using as an excuse for not doing this).
>>
>>>
>>>> If you're talking about missing generation check for btrfs, that's
>>>> valid, but it's far from a "major design flaw", as there are a lot of
>>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>>> (the brain-split case).
>>>>
>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for VDI case (or any VM image file format other than raw), older
>> valid data normally means corruption.
>> Unless they have their own write-ahead log.
>>
>> Some file format may detect such problem by themselves if they have
>> internal checksum, but anyway, older data normally means corruption,
>> especially when partial new and partial old.
>>
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for single device).
>> So the title, especially the "major design flaw" can't be wrong any more.
> The title is excessive, but I'd agree it's a design flaw that BTRFS
> doesn't at least notice that the generation ID's are different and
> preferentially trust the device with the newer generation ID.

Well, a design flaw should be something that can't be easily fixed
without a *huge* on-disk format or behavior change.
A flaw in btrfs' one-subvolume-per-tree metadata design or the current
extent booking behavior could be called a design flaw.

While for things like this, as the submitted RFC patch shows, fewer
than 100 lines can change the behavior.

> The only
> special handling I can see that would be needed is around volumes
> mounted with the `nodatacow` option, which may not see generation
> changes for a very long time otherwise.

Nodatacow shouldn't cause much difference.
We still have the commit interval, and metadata CoW.
Any btrfs metadata change or filesystem metadata change (an inode
change, etc.) will still lead to a new generation.

Thanks,
Qu

>>
>>>
>>>>> others will automatically kick out the misbehaving drive.  *none*
>>>>> of them will take back the the drive with old data and start
>>>>> commingling that data with good copy.)\ This behaviour from BTRFS
>>>>> is completely abnormal.. and defeats even the most basic
>>>>> expectations of RAID.
>>>>
>>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>>> error detection.
>>>> And it's impossible to detect such case without extra help.
>>>>
>>>> Your expectation is completely wrong.
>>>>
>>>
>>> Well ... somehow it is my experience as well ... :)
>>
>> Acceptable, but not really apply to software based RAID1.
>>
>> Thanks,
>> Qu
>>
>>>
>>>>>
>>>>> I'm not the one who has to clear his expectations here.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 11:46                 ` Qu Wenruo
@ 2018-06-28 12:20                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2018-06-28 12:20 UTC (permalink / raw)
  To: Qu Wenruo, Andrei Borzenkov; +Cc: remi, Btrfs BTRFS

On 2018-06-28 07:46, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:
>> On 2018-06-28 05:15, Qu Wenruo wrote:
>>>
>>>
>>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>>>
>>>>>>>
>>>>>>> Please get yourself clear of what other raid1 is doing.
>>>>>>
>>>>>> A drive failure, where the drive is still there when the computer
>>>>>> reboots, is a situation that *any* raid 1, (or for that matter,
>>>>>> raid 5, raid 6, anything but raid 0) will recover from perfectly
>>>>>> without raising a sweat. Some will rebuild the array automatically,
>>>>>
>>>>> WOW, that's black magic, at least for RAID1.
>>>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>>>> has datasum.
>>>>>
>>>>> Don't bother other things, just tell me how to determine which one is
>>>>> correct?
>>>>>
>>>>
>>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>>> probably configuration generation number is increased. Next time drive
>>>> with older generation is not incorporated. Hardware controllers also
>>>> keep this information in NVRAM and so do not even depend on scanning
>>>> of other disks.
>>>
>>> Yep, the only possible way to determine such case is from external info.
>>>
>>> For device generation, it's possible to enhance btrfs, but at least we
>>> could start from detect and refuse to RW mount to avoid possible further
>>> corruption.
>>> But anyway, if one really cares about such case, hardware RAID
>>> controller seems to be the only solution as other software may have the
>>> same problem.
>> LVM doesn't.  It detects that one of the devices was gone for some
>> period of time and marks the volume as degraded (and _might_, depending
>> on how you have things configured, automatically re-sync).  Not sure
>> about MD, but I am willing to bet it properly detects this type of
>> situation too.
>>>
>>> And the hardware solution looks pretty interesting, is the write to
>>> NVRAM 100% atomic? Even at power loss?
>> On a proper RAID controller, it's battery backed, and that battery
>> backing provides enough power to also make sure that the state change is
>> properly recorded in the event of power loss.
> 
> Well, that explains a lot of thing.
> 
> So hardware RAID controller can be considered having a special battery
> (always atomic) journal device.
> If we can't provide UPS for the whole system, a battery powered journal
> device indeed makes sense.
> 
>>>
>>>>
>>>>> The only possibility is that, the misbehaved device missed several
>>>>> super
>>>>> block update so we have a chance to detect it's out-of-date.
>>>>> But that's not always working.
>>>>>
>>>>
>>>> Why it should not work as long as any write to array is suspended
>>>> until superblock on remaining devices is updated?
>>>
>>> What happens if there is no generation gap in device superblock?
>>>
>>> If one device got some of its (nodatacow) data written to disk, while
>>> the other device doesn't get data written, and neither of them reached
>>> super block update, there is no difference in device superblock, thus no
>>> way to detect which is correct.
>> Yes, but that should be a very small window (at least, once we finally
>> quit serializing writes across devices), and it's a problem on existing
>> RAID1 implementations too (and therefore isn't something we should be
>> using as an excuse for not doing this).
>>>
>>>>
>>>>> If you're talking about missing generation check for btrfs, that's
>>>>> valid, but it's far from a "major design flaw", as there are a lot of
>>>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>>>> (the brain-split case).
>>>>>
>>>>
>>>> That's different. Yes, with software-based raid there is usually no
>>>> way to detect outdated copy if no other copies are present. Having
>>>> older valid data is still very different from corrupting newer data.
>>>
>>> While for VDI case (or any VM image file format other than raw), older
>>> valid data normally means corruption.
>>> Unless they have their own write-ahead log.
>>>
>>> Some file format may detect such problem by themselves if they have
>>> internal checksum, but anyway, older data normally means corruption,
>>> especially when partial new and partial old.
>>>
>>> On the other hand, with data COW and csum, btrfs can ensure the whole
>>> filesystem update is atomic (at least for single device).
>>> So the title, especially the "major design flaw" can't be wrong any more.
>> The title is excessive, but I'd agree it's a design flaw that BTRFS
>> doesn't at least notice that the generation ID's are different and
>> preferentially trust the device with the newer generation ID.
> 
> Well, a design flaw should be something that can't be easily fixed
> without *huge* on-disk format or behavior change.
> Flaw in btrfs' one-subvolume-per-tree metadata design or current extent
> booking behavior could be called design flaw.
That would be a structural design flaw.  It's a result of how the 
software is structured.  There are other types of design flaws though.
> 
> While for things like this, just as the submitted RFC patch, less than
> 100 lines could change the behavior.
I would still consider this case a design flaw (a purely behavioral one 
not tied to how things are structured) because it defies user 
expectations in a pretty significant (and potentially dangerous) way.

Arguably though, the actual flaw here is that the naming for
multi-device profiles uses the term 'raid', which has very well-defined
semantics, while BTRFS does not behave like pretty much any other
'raid' implementation; the flaw is not that BTRFS behaves in this
manner (it's bad that it's doing this, but without that naming issue it
would largely just be a bug, in that it's not safe and not documented).
> 
>> The only
>> special handling I can see that would be needed is around volumes
>> mounted with the `nodatacow` option, which may not see generation
>> changes for a very long time otherwise.
> 
> Nodatacow shouldn't cause much difference.
> We still have commit interval, and metadata CoW.
> Any btrfs metadata change or filesystem metadata change (inode change
> etc) will still leads to a new generation.
Ah, I forgot that an inode change will trigger it.  So pretty much 
provided that writes are still happening and thus mtime is being 
updated, the generation ID will update.
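
A quick way to confirm this is to watch the superblock generation
around a write to a NOCOW file; the path and device name below are
illustrative:

    btrfs inspect-internal dump-super /dev/sda2 | grep -w generation
    touch /srv/vm/disk.vdi && sync     # inode mtime update is metadata
    btrfs inspect-internal dump-super /dev/sda2 | grep -w generation
    # the second value is higher once the transaction has committed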

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  1:42 Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files Remi Gauvin
  2018-06-28  1:58 ` Qu Wenruo
@ 2018-06-28 13:24 ` Anand Jain
  2018-06-28 14:17   ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Anand Jain @ 2018-06-28 13:24 UTC (permalink / raw)
  To: Remi Gauvin, linux-btrfs



On 06/28/2018 09:42 AM, Remi Gauvin wrote:
> There seems to be a major design flaw with BTRFS that needs to be better
> documented, to avoid massive data loss.
> 
> Tested with Raid 1 on Ubuntu Kernel 4.15
> 
> The use case being tested was a Virtualbox VDI file created with
> NODATACOW attribute, (as is often suggested, due to the painful
> performance penalty of COW on these files.)
> 
> However, if a device is temporarily dropped (this in case, tested by
> disconnecting drives.) and re-connects automatically next boot, BTRFS
> does not in any way synchronize the VDI file, or have any means to know
> that one of copy is out of date and bad.
> 
> The result of trying to use said VDI file is.... interestingly insane.


> Scrub did not do anything to rectify the situation.

  Please use balance to rectify this, as it's RAID1: while one of the
  devices was missing, we wrote single chunks.
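
  For example, something along these lines (mount point illustrative):

      # look for 'single' data/metadata chunks created while degraded
      btrfs filesystem df /mnt
      # convert anything that isn't raid1 back to raid1
      btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt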

Thanks, Anand

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 13:24 ` Anand Jain
@ 2018-06-28 14:17   ` Chris Murphy
  2018-06-28 15:37     ` Remi Gauvin
  2018-06-28 17:37     ` Goffredo Baroncelli
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2018-06-28 14:17 UTC (permalink / raw)
  To: Anand Jain; +Cc: Remi Gauvin, Btrfs BTRFS

The problems are known with Btrfs raid1, but I think they bear
repeating because they are really not OK.

In the exact same described scenario: a simple, clear-cut drop-off of a
member device, which then later clearly reappears (no transient
failure).

Both mdadm and LVM-based raid1 would have re-added the missing device
and resynced it, because an internal bitmap is the default (on > 100G
arrays for mdadm, and always with lvm). Only the new data would be
propagated to user space. Both mdadm and lvm have the metadata to know
which drive has stale data in this common scenario.
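
For mdadm, the machinery that makes this work looks roughly like the
following; device names are illustrative:

    # a write-intent bitmap is what makes the cheap catch-up possible
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/sda1 /dev/sdb1
    mdadm --examine-bitmap /dev/sda1     # events counter and dirty bits
    mdadm /dev/md0 --re-add /dev/sdb1    # catch up from the bitmap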

Btrfs does two, maybe three, bad things:
1. No automatic resync. This is a net worse behavior than mdadm and
lvm, putting data at risk.
2. The new data goes in a single chunk; even if the user does a manual
balance (resync) their data isn't replicated. They must know to do a
-dconvert balance to replicate the new data. Again this is a net worse
behavior than mdadm out of the box, putting user data at risk.
3. Apparently if nodatacow, given a file with two copies of different
transid, Btrfs won't always pick the higher transid copy? If true
that's terrible, and again not at all what mdadm/lvm are doing.


Btrfs can do better because it has more information available to make
unambiguous decisions about data. But it needs to always do at least
as good a job as mdadm/lvm, and as reported, that didn't happen. So
some testing is needed, in particular for case #3 above with nodatacow.
That's a huge bug, if it's true.


Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 14:17   ` Chris Murphy
@ 2018-06-28 15:37     ` Remi Gauvin
  2018-06-28 22:04       ` Chris Murphy
  2018-06-28 17:37     ` Goffredo Baroncelli
  1 sibling, 1 reply; 28+ messages in thread
From: Remi Gauvin @ 2018-06-28 15:37 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 670 bytes --]

On 2018-06-28 10:17 AM, Chris Murphy wrote:

> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.

I'm not sure this is the case.  Even though writes failed to the
disconnected device, btrfs seemed to keep on going as though it *were*.

When the array was re-mounted with both devices (never mounted as
degraded) and scrub was run, scrub took a *long* time fixing errors, at
a whopping 3MB/s, and reported having fixed millions of them.
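
The counts can be watched with something like the following (mount
point illustrative):

    btrfs scrub status /mnt     # progress and corrected error totals
    btrfs device stats /mnt     # per-device error counters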



[-- Attachment #2: remi.vcf --]
[-- Type: text/x-vcard, Size: 193 bytes --]

begin:vcard
fn:Remi Gauvin
n:Gauvin;Remi
org:Georgian Infotech
adr:;;3-51 Sykes St. N.;Meaford;ON;N4L 1X3;Canada
email;internet:remi@georgianit.com
tel;work:226-256-1545
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  9:15             ` Qu Wenruo
  2018-06-28 11:12               ` Austin S. Hemmelgarn
@ 2018-06-28 17:10               ` Andrei Borzenkov
  2018-06-29  0:07                 ` Qu Wenruo
  2018-06-28 22:00               ` Remi Gauvin
  2 siblings, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2018-06-28 17:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: remi, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 4958 bytes --]

28.06.2018 12:15, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>>
>>>>
>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>
>>>>>
>>>>> Please get yourself clear of what other raid1 is doing.
>>>>
>>>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>>>
>>> WOW, that's black magic, at least for RAID1.
>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>> has datasum.
>>>
>>> Don't bother other things, just tell me how to determine which one is
>>> correct?
>>>
>>
>> When one drive fails, it is recorded in meta-data on remaining drives;
>> probably configuration generation number is increased. Next time drive
>> with older generation is not incorporated. Hardware controllers also
>> keep this information in NVRAM and so do not even depend on scanning
>> of other disks.
> 
> Yep, the only possible way to determine such case is from external info.
> 
> For device generation, it's possible to enhance btrfs, but at least we
> could start from detect and refuse to RW mount to avoid possible further
> corruption.
> But anyway, if one really cares about such case, hardware RAID
> controller seems to be the only solution as other software may have the
> same problem.
> 
> And the hardware solution looks pretty interesting, is the write to
> NVRAM 100% atomic? Even at power loss?
> 
>>
>>> The only possibility is that, the misbehaved device missed several super
>>> block update so we have a chance to detect it's out-of-date.
>>> But that's not always working.
>>>
>>
>> Why it should not work as long as any write to array is suspended
>> until superblock on remaining devices is updated?
> 
> What happens if there is no generation gap in device superblock?
> 

Well, you use "generation" in the strict btrfs sense, while I use
"generation" generically. That is exactly what btrfs apparently lacks
currently - some monotonic counter that is used to record such an event.

> If one device got some of its (nodatacow) data written to disk, while
> the other device doesn't get data written, and neither of them reached
> super block update, there is no difference in device superblock, thus no
> way to detect which is correct.
> 

Again, the very fact that a device failed should have triggered an
update of the superblock to record this information, which presumably
should increase some counter.

>>
>>> If you're talking about missing generation check for btrfs, that's
>>> valid, but it's far from a "major design flaw", as there are a lot of
>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>> (the brain-split case).
>>>
>>
>> That's different. Yes, with software-based raid there is usually no
>> way to detect outdated copy if no other copies are present. Having
>> older valid data is still very different from corrupting newer data.
> 
> While for VDI case (or any VM image file format other than raw), older
> valid data normally means corruption.
> Unless they have their own write-ahead log.
>
> Some file format may detect such problem by themselves if they have
> internal checksum, but anyway, older data normally means corruption,
> especially when partial new and partial old.
>

Yes, that's true. But there is really nothing that can be done here,
even theoretically; it's hardly a reason not to do what looks possible.

> On the other hand, with data COW and csum, btrfs can ensure the whole
> filesystem update is atomic (at least for single device).
> So the title, especially the "major design flaw" can't be wrong any more.
> 
>>
>>>> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.
>>>
>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>> error detection.
>>> And it's impossible to detect such case without extra help.
>>>
>>> Your expectation is completely wrong.
>>>
>>
>> Well ... somehow it is my experience as well ... :)
> 
> Acceptable, but not really apply to software based RAID1.
> 
> Thanks,
> Qu
> 
>>
>>>>
>>>> I'm not the one who has to clear his expectations here.
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 14:17   ` Chris Murphy
  2018-06-28 15:37     ` Remi Gauvin
@ 2018-06-28 17:37     ` Goffredo Baroncelli
  2018-06-28 22:27       ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2018-06-28 17:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Anand Jain, Remi Gauvin, Btrfs BTRFS

On 06/28/2018 04:17 PM, Chris Murphy wrote:
> Btrfs does two, maybe three, bad things:
> 1. No automatic resync. This is a net worse behavior than mdadm and
> lvm, putting data at risk.
> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.
> 3. Apparently if nodatacow, given a file with two copies of different
> transid, Btrfs won't always pick the higher transid copy? If true
> that's terrible, and again not at all what mdadm/lvm are doing.

All of these could be avoided simply by not allowing a multi-device filesystem to mount without ensuring that all the devices have the same generation.

In the past I proposed a mount.btrfs helper; I am still thinking that it would be the right place to do the following (a rough sketch of (a) is shown after the list):
a) put all the checks before mounting the filesystem
b) print the correct information in order to help the user understand what he has to do to solve the issues
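
A minimal sketch of the kind of check such a helper could do before handing off to mount(8); this is purely illustrative, not the actual proposal, and the device list and mount point are placeholders:

    #!/bin/sh
    # Refuse a read-write mount if the members' superblock generations
    # disagree.
    gens=$(for dev in /dev/sda2 /dev/sdb2; do
               btrfs inspect-internal dump-super "$dev" |
                   awk '$1 == "generation" {print $2}'
           done | sort -u | wc -l)
    if [ "$gens" -ne 1 ]; then
        echo "generation mismatch between members, refusing rw mount" >&2
        exit 1
    fi
    exec mount -t btrfs /dev/sda2 /mnt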

Regarding your point 3), it must be pointed out that in the case of NOCOW files, even having the same transid is not enough. It is still possible that one copy is updated before a power failure that prevents the super-block update.
I think that the only way to prevent this from happening is:
  1) using a data journal (which means that each piece of data is copied twice)
OR
  2) using a cow filesystem (with cow enabled, of course!)

I think that this is a good example of why a battery-backed HW RAID controller could be better than SW raid. Of course, the likelihood of a lot of these problems could be reduced by using an uninterruptible power supply.


BR
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28  9:15             ` Qu Wenruo
  2018-06-28 11:12               ` Austin S. Hemmelgarn
  2018-06-28 17:10               ` Andrei Borzenkov
@ 2018-06-28 22:00               ` Remi Gauvin
  2 siblings, 0 replies; 28+ messages in thread
From: Remi Gauvin @ 2018-06-28 22:00 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1.1: Type: text/plain, Size: 1602 bytes --]


> Acceptable, but not really apply to software based RAID1.
> 

Which completely disregards the minor detail that all the software
RAIDs I know of can handle exactly this kind of situation without
losing or corrupting a single byte of data (errors on the remaining
hard drive notwithstanding.)

Exactly what methods they employ to do so, I'm no expert on, but it
*does* work, contrary to your repeated assertions otherwise.

In any case, thank you for the patch you wrote.  I will, however,
propose a different solution.

Given the reliance of BTRFS on csums, and the lack of any
resynchronization (no matter how the drives got out of sync), I think
nodatacow should just be ignored in the case of RAID, just like the
data blocks get copied if there is a snapshot.

In the current implementation of RAID on btrfs, RAID and nodatacow are
effectively mutually exclusive.  Consider the kinds of use cases
nodatacow is usually recommended for: VM images and databases.  Even
though those files should have their own mechanisms for dealing with
incomplete writes and data verification, BTRFS RAID creates a unique
situation where parts of the file can be inconsistent, with different
data being read depending on which device is doing the reading.

Regardless of which methods, short term and long term, developers
choose to address this, this next part I have to stress I consider very
important.

The status page really needs to be updated to reflect this gotcha.  It
*will* bite people in ways they do not expect, and disastrously.


[-- Attachment #1.1.2: remi.vcf --]
[-- Type: text/x-vcard, Size: 203 bytes --]

begin:vcard
fn:Remi Gauvin
n:Gauvin;Remi
org:Georgian Infotech
adr:;;3-51 Sykes St. N.;Meaford;ON;N4L 1X3;Canada
email;internet:remi@georgianit.com
tel;work:226-256-1545
version:2.1
end:vcard


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 15:37     ` Remi Gauvin
@ 2018-06-28 22:04       ` Chris Murphy
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2018-06-28 22:04 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: Btrfs BTRFS

On Thu, Jun 28, 2018 at 9:37 AM, Remi Gauvin <remi@georgianit.com> wrote:
> On 2018-06-28 10:17 AM, Chris Murphy wrote:
>
>> 2. The new data goes in a single chunk; even if the user does a manual
>> balance (resync) their data isn't replicated. They must know to do a
>> -dconvert balance to replicate the new data. Again this is a net worse
>> behavior than mdadm out of the box, putting user data at risk.
>
> I'm not sure this is the case.  Even though writes failed to the
> disconnected device, btrfs seemed to keep on going as though it *were*.

Yeah, in your case the failure happens during normal operation, and in
that case there's no degraded state on Btrfs.  So it keeps writing to
the raid1 chunk on the working drive, with writes to the failed device
going nowhere (with lots of write errors).  When you stop using the
volume, fix the problem with the missing drive, then remount the
volume, it really should still use only the new copy on the
never-missing drive, even though it won't necessarily notice the file
is missing on the formerly missing drive.  You have to balance manually
to fix it.
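
For reference, the manual fix-up looks roughly like this (a sketch
only, assuming the volume is mounted at /mnt and the dropped device is
attached again):

  # Re-replicate everything that has checksums; this cannot repair
  # nodatacow file contents, which have no csums to compare against.
  btrfs scrub start -Bd /mnt

  # If the volume was ever mounted degraded and new writes landed in
  # 'single' chunks, convert those back to raid1 ('soft' skips chunks
  # that are already raid1).
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

  # Otherwise a full, unfiltered balance rewrites all chunks and brings
  # the formerly missing device back in sync (slow on large volumes).
  btrfs balance start /mnt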


> When the array was re-mounted with both devices, (never mounted as
> degraded), and scrub was run, scrub took a *long* time fixing errors, at
> a whopping 3MB/s, and reported having fixed millions of them.

That's slow but it's expected to fix a lot of problems. Even in a very
short amount of time there are thousands of missing data and metadata
extents that need to be replicated.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 17:37     ` Goffredo Baroncelli
@ 2018-06-28 22:27       ` Chris Murphy
  2018-06-29 15:15         ` james harvey
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2018-06-28 22:27 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Chris Murphy, Anand Jain, Remi Gauvin, Btrfs BTRFS

On Thu, Jun 28, 2018 at 11:37 AM, Goffredo Baroncelli
<kreijack@libero.it> wrote:

> Regarding your point 3), it must be pointed out that in the case of NOCOW files, even having the same transid is not enough. It is still possible that one copy is updated before a power failure that prevents the super-block update.
> I think that the only way to prevent this from happening is:
>   1) using a data journal (which means that each piece of data is written twice)
> OR
>   2) using a cow filesystem (with cow enabled, of course!)


There is no power failure in this example, so it's really off the
table when considering whether Btrfs or mdadm/lvm raid does better in
the same situation with a nodatacow file.

I think here is the problem in the Btrfs nodatacow case: Btrfs doesn't
have a way of untrusting nodatacow files on a previously missing drive
that hasn't been balanced.  There is no such thing as nometadatacow, so
no matter what, it figures out there's a problem with metadata and uses
the good copy, but it never "marks" the previously missing device
as suspicious.  When it comes time to read a nodatacow file, Btrfs just
blindly reads off one of the drives; without a csum it has no mechanism
for questioning the formerly missing drive.

That is actually a really weird and unique kind of write hole for
Btrfs raid1 when the data is nodatacow.

I have to agree with Remi.  This is a flaw in the design or a bad bug,
however you want to consider it, because mdadm/lvm do not behave this
way in the exact same situation.

And an open question I have about scrub is whether it only ever checks
csums, meaning nodatacow files are never scrubbed, or whether the
copies are at least compared to each other.

As for fixes:

- At mount time, when Btrfs sees from the supers that there is a
transid mismatch, it should not read nodatacow files from the
lower-transid device until an automatic balance has completed.  Right
now Btrfs doesn't have an abbreviated balance that "replays" the events
between two transids; basically it would work like send/receive, but
for balance, to catch up a previously missing device.  Right now we
have to do a full balance, which is a brutal penalty for a briefly
missing drive.  Again, mdadm and lvm do better here by default.  (See
the sketch after this list for how to compare the per-device
generations by hand today.)

- Fix the performance issues of COW with disk images.  ZFS doesn't even
have a nodatacow option, people run VM images on ZFS, and it doesn't
sound like they're running into ridiculous performance penalties that
make it impractical to use.
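
For what it's worth, the raw material for that kind of transid check is
already visible from userspace.  A minimal sketch of comparing the
committed generation of each member by hand (device names here are just
placeholders):

  # A member that missed commits will normally show a lower generation.
  # Older btrfs-progs expose the same data via 'btrfs-show-super'.
  for dev in /dev/sdb1 /dev/sdc1; do
      printf '%s: ' "$dev"
      btrfs inspect-internal dump-super "$dev" | grep -w '^generation'
  done

As Qu pointed out elsewhere in the thread, this is not guaranteed to
catch every case, but it does show the kind of information a smarter
mount-time check could act on.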



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 17:10               ` Andrei Borzenkov
@ 2018-06-29  0:07                 ` Qu Wenruo
  0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2018-06-29  0:07 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: remi, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 5885 bytes --]



On 2018年06月29日 01:10, Andrei Borzenkov wrote:
> 28.06.2018 12:15, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>> On 2018年06月28日 11:14, remi@georgianit.com wrote:
>>>>>
>>>>>
>>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>>
>>>>>>
>>>>>> Please get yourself clear of what other raid1 is doing.
>>>>>
>>>>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>>>>
>>>> WOW, that's black magic, at least for RAID1.
>>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>>> has datasum.
>>>>
>>>> Don't bother other things, just tell me how to determine which one is
>>>> correct?
>>>>
>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time drive
>>> with older generation is not incorporated. Hardware controllers also
>>> keep this information in NVRAM and so do not even depend on scanning
>>> of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start from detect and refuse to RW mount to avoid possible further
>> corruption.
>> But anyway, if one really cares about such case, hardware RAID
>> controller seems to be the only solution as other software may have the
>> same problem.
>>
>> And the hardware solution looks pretty interesting, is the write to
>> NVRAM 100% atomic? Even at power loss?
>>
>>>
>>>> The only possibility is that, the misbehaved device missed several super
>>>> block update so we have a chance to detect it's out-of-date.
>>>> But that's not always working.
>>>>
>>>
>>> Why it should not work as long as any write to array is suspended
>>> until superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in device superblock?
>>
> 
> Well, you use "generation" in strict btrfs sense, I use "generation"
> generically. That is exactly what btrfs apparently lacks currently -
> some monotonic counter that is used to record such event.

Indeed, btrfs doesn't have any way to record which device got degraded
at all.
The use of the btrfs device generation is already a kind of workaround.

So to keep the same behavior as mdraid/lvm, each time btrfs detects a
missing device or a fatal command (flush/FUA) that was not executed
correctly, it needs to record that, maybe in its device item, and
commit it to disk.

In short, the btrfs csum makes us a little complacent about such
device-missing cases; normally the csum will tell us which data is
wrong, so we can avoid complex device status tracking.
But apparently, when nodatasum is involved, everything falls outside
our expectations.
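
Worth noting: btrfs does already persist per-device error counters
(write, read, flush, corruption and generation errors), which is at
least part of the raw material for such tracking.  A quick way to see
whether one member has been silently eating write/flush errors (the
mount point is just an example):

  # Counters are stored on the filesystem and survive reboots until
  # reset with 'btrfs device stats -z'.
  btrfs device stats /mnt

  # Typical output lines:
  #   [/dev/sdb1].write_io_errs    0
  #   [/dev/sdb1].flush_io_errs    0
  #   [/dev/sdb1].generation_errs  0
  # A large write/flush count on one member after an incident is a
  # strong hint that its nodatacow data should not be trusted until the
  # volume has been resynced.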

> 
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device doesn't get data written, and neither of them reached
>> super block update, there is no difference in device superblock, thus no
>> way to detect which is correct.
>>
> 
> Again, the very fact that device failed should have triggered update of
> superblock to record this information which presumably should increase
> some counter.

Indeed.

> 
>>>
>>>> If you're talking about missing generation check for btrfs, that's
>>>> valid, but it's far from a "major design flaw", as there are a lot of
>>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>>> (the brain-split case).
>>>>
>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for VDI case (or any VM image file format other than raw), older
>> valid data normally means corruption.
>> Unless they have their own write-ahead log.
>> Some file format may detect such problem by themselves if they have
>> internal checksum, but anyway, older data normally means corruption,
>> especially when partial new and partial old.
>>
> 
> Yes, that's true. But there is really nothing that can be done here,
> even theoretically; it hardly a reason to not do what looks possible.

Well, theoretically, you can just use datasum and datacow :)

Thanks,
Qu

> 
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for single device).
>> So the title, especially the "major design flaw" can't be wrong any more.
>>
>>>
>>>>> others will automatically kick out the misbehaving drive.  *none* of them will take back the the drive with old data and start commingling that data with good copy.)\ This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID.
>>>>
>>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>>> error detection.
>>>> And it's impossible to detect such case without extra help.
>>>>
>>>> Your expectation is completely wrong.
>>>>
>>>
>>> Well ... somehow it is my experience as well ... :)
>>
>> Acceptable, but not really apply to software based RAID1.
>>
>> Thanks,
>> Qu
>>
>>>
>>>>>
>>>>> I'm not the one who has to clear his expectations here.
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-28 22:27       ` Chris Murphy
@ 2018-06-29 15:15         ` james harvey
  2018-06-29 17:09           ` Austin S. Hemmelgarn
  2018-06-29 18:40           ` Chris Murphy
  0 siblings, 2 replies; 28+ messages in thread
From: james harvey @ 2018-06-29 15:15 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Goffredo Baroncelli, Anand Jain, Remi Gauvin, Btrfs BTRFS

On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy <lists@colorremedies.com> wrote:
> And an open question I have about scrub is weather it only ever is
> checking csums, meaning nodatacow files are never scrubbed, or if the
> copies are at least compared to each other?

Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, the offline check should look at nodatacow copies like this, but I
still think this also needs to be added to scrub.  In the patch thread,
I discuss my reasons why.  In brief: it allows online scanning; it
matches the user's expectation that scrub ensures mirrored data
integrity; and the recommendation to set up scrub on a periodic basis
means, to me, that scrub is the place to put it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-29 15:15         ` james harvey
@ 2018-06-29 17:09           ` Austin S. Hemmelgarn
  2018-06-29 17:58             ` james harvey
  2018-06-29 18:40           ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2018-06-29 17:09 UTC (permalink / raw)
  To: james harvey, Chris Murphy
  Cc: Goffredo Baroncelli, Anand Jain, Remi Gauvin, Btrfs BTRFS

On 2018-06-29 11:15, james harvey wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy <lists@colorremedies.com> wrote:
>> And an open question I have about scrub is weather it only ever is
>> checking csums, meaning nodatacow files are never scrubbed, or if the
>> copies are at least compared to each other?
> 
> Scrub never looks at nodatacow files.  It does not compare the copies
> to each other.
> 
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/
> 
> This hasn't been added to btrfs-progs git yet.
> 
> IMO, I think the offline check should look at nodatacow copies like
> this, but I still think this also needs to be added to scrub.  In the
> patch thread, I discuss my reasons why.  In brief: online scanning;
> this goes along with user's expectation of scrub ensuring mirrored
> data integrity; and recommendations to setup scrub on periodic basis
> to me means it's the place to put it.
That said, it can't sanely fix things if there is a mismatch.  At least, 
not unless BTRFS gets proper generational tracking to handle temporarily 
missing devices.  As of right now, sanely fixing things requires 
significant manual intervention, as you have to bypass the device read 
selection algorithm to be able to look at the state of the individual 
copies so that you can pick one to use and forcibly rewrite the whole 
file by hand.

A while back, Anand Jain posted some patches that would let you select a 
particular device to direct all reads to via a mount option, but I don't 
think they ever got merged.  That would have made manual recovery in 
cases like this exponentially easier (mount read-only with one device 
selected, copy the file out somewhere, remount read-only with the other 
device, drop caches, copy the file out again, compare and reconcile the 
two copies, then remount the volume writable and write out the repaired 
file).
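
To make the amount of work involved concrete, a rough sketch of that
procedure is below.  The read-selection mount option is purely
hypothetical here (standing in for those unmerged patches -- it does
not exist in mainline), and the device names and file paths are only
examples:

  # Hypothetical option from the unmerged read-selection patches.
  mount -o ro,read_mirror=devid:1 /dev/sdb1 /mnt
  cp /mnt/vm.img /tmp/vm.img.dev1
  umount /mnt

  # Make sure the second pass really hits the other disk, not the page
  # cache.
  echo 3 > /proc/sys/vm/drop_caches

  mount -o ro,read_mirror=devid:2 /dev/sdc1 /mnt
  cp /mnt/vm.img /tmp/vm.img.dev2
  umount /mnt

  # See whether (and where) the two copies diverge.
  cmp -l /tmp/vm.img.dev1 /tmp/vm.img.dev2 | head

  # After reconciling them into a repaired image, mount read-write and
  # write it back so both mirrors end up with identical data again.
  mount /dev/sdb1 /mnt
  cp /tmp/vm.img.repaired /mnt/vm.img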

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-29 17:09           ` Austin S. Hemmelgarn
@ 2018-06-29 17:58             ` james harvey
  2018-06-29 18:31               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 28+ messages in thread
From: james harvey @ 2018-06-29 17:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Chris Murphy, Goffredo Baroncelli, Anand Jain, Remi Gauvin, Btrfs BTRFS

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2018-06-29 11:15, james harvey wrote:
>>
>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy <lists@colorremedies.com>
>> wrote:
>>>
>>> And an open question I have about scrub is weather it only ever is
>>> checking csums, meaning nodatacow files are never scrubbed, or if the
>>> copies are at least compared to each other?
>>
>>
>> Scrub never looks at nodatacow files.  It does not compare the copies
>> to each other.
>>
>> Qu submitted a patch to make check compare the copies:
>> https://patchwork.kernel.org/patch/10434509/
>>
>> This hasn't been added to btrfs-progs git yet.
>>
>> IMO, I think the offline check should look at nodatacow copies like
>> this, but I still think this also needs to be added to scrub.  In the
>> patch thread, I discuss my reasons why.  In brief: online scanning;
>> this goes along with user's expectation of scrub ensuring mirrored
>> data integrity; and recommendations to setup scrub on periodic basis
>> to me means it's the place to put it.
>
> That said, it can't sanely fix things if there is a mismatch. At least, not
> unless BTRFS gets proper generational tracking to handle temporarily missing
> devices.  As of right now, sanely fixing things requires significant manual
> intervention, as you have to bypass the device read selection algorithm to
> be able to look at the state of the individual copies so that you can pick
> one to use and forcibly rewrite the whole file by hand.

Absolutely.  The user would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copy is read.

> A while back, Anand Jain posted some patches that would let you select a
> particular device to direct all reads to via a mount option, but I don't
> think they ever got merged.  That would have made manual recovery in cases
> like this exponentially easier (mount read-only with one device selected,
> copy the file out somewhere, remount read-only with the other device, drop
> caches, copy the file out again, compare and reconcile the two copies, then
> remount the volume writable and write out the repaired file).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-29 17:58             ` james harvey
@ 2018-06-29 18:31               ` Austin S. Hemmelgarn
  2018-06-30  6:33                 ` Duncan
  0 siblings, 1 reply; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2018-06-29 18:31 UTC (permalink / raw)
  To: james harvey
  Cc: Chris Murphy, Goffredo Baroncelli, Anand Jain, Remi Gauvin, Btrfs BTRFS

On 2018-06-29 13:58, james harvey wrote:
> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2018-06-29 11:15, james harvey wrote:
>>>
>>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy <lists@colorremedies.com>
>>> wrote:
>>>>
>>>> And an open question I have about scrub is weather it only ever is
>>>> checking csums, meaning nodatacow files are never scrubbed, or if the
>>>> copies are at least compared to each other?
>>>
>>>
>>> Scrub never looks at nodatacow files.  It does not compare the copies
>>> to each other.
>>>
>>> Qu submitted a patch to make check compare the copies:
>>> https://patchwork.kernel.org/patch/10434509/
>>>
>>> This hasn't been added to btrfs-progs git yet.
>>>
>>> IMO, I think the offline check should look at nodatacow copies like
>>> this, but I still think this also needs to be added to scrub.  In the
>>> patch thread, I discuss my reasons why.  In brief: online scanning;
>>> this goes along with user's expectation of scrub ensuring mirrored
>>> data integrity; and recommendations to setup scrub on periodic basis
>>> to me means it's the place to put it.
>>
>> That said, it can't sanely fix things if there is a mismatch. At least, not
>> unless BTRFS gets proper generational tracking to handle temporarily missing
>> devices.  As of right now, sanely fixing things requires significant manual
>> intervention, as you have to bypass the device read selection algorithm to
>> be able to look at the state of the individual copies so that you can pick
>> one to use and forcibly rewrite the whole file by hand.
> 
> Absolutely.  User would need to use manual intervention as you
> describe, or restore the single file(s) from backup.  But, it's a good
> opportunity to tell the user they had partial data corruption, even if
> it can't be auto-fixed.  Otherwise they get intermittent data
> corruption, depending on which copies are read.
The thing is though, as things stand right now, you need to manually 
edit the data on-disk directly or restore the file from a backup to fix 
the file.  While it's technically true that you can manually repair this 
type of thing, in both cases doing it without those patches I mentioned 
is functionally impossible for a regular user without potentially 
losing some data.

Unless that changes, scrub telling you it's corrupt is not going to help 
much aside from making sure you don't make things worse by trying to use 
it.  Given this, it would make sense to have a (disabled by default) 
option to have scrub repair it by just using the newer or older copy of 
the data.  That would require classic RAID generational tracking though, 
which BTRFS doesn't have yet.

>> A while back, Anand Jain posted some patches that would let you select a
>> particular device to direct all reads to via a mount option, but I don't
>> think they ever got merged.  That would have made manual recovery in cases
>> like this exponentially easier (mount read-only with one device selected,
>> copy the file out somewhere, remount read-only with the other device, drop
>> caches, copy the file out again, compare and reconcile the two copies, then
>> remount the volume writable and write out the repaired file).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-29 15:15         ` james harvey
  2018-06-29 17:09           ` Austin S. Hemmelgarn
@ 2018-06-29 18:40           ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2018-06-29 18:40 UTC (permalink / raw)
  To: james harvey
  Cc: Chris Murphy, Goffredo Baroncelli, Anand Jain, Remi Gauvin, Btrfs BTRFS

On Fri, Jun 29, 2018 at 9:15 AM, james harvey <jamespharvey20@gmail.com> wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy <lists@colorremedies.com> wrote:
>> And an open question I have about scrub is weather it only ever is
>> checking csums, meaning nodatacow files are never scrubbed, or if the
>> copies are at least compared to each other?
>
> Scrub never looks at nodatacow files.  It does not compare the copies
> to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/

Yeah online scrub needs to report any mismatches, even if it can't do
anything about it because it's ambiguous which copy is wrong.


> IMO, I think the offline check should look at nodatacow copies like
> this, but I still think this also needs to be added to scrub.  In the
> patch thread, I discuss my reasons why.  In brief: online scanning;
> this goes along with user's expectation of scrub ensuring mirrored
> data integrity; and recommendations to setup scrub on periodic basis
> to me means it's the place to put it.

I don't mind this being implemented in offline scrub first for testing
purposes. But the online scrub certainly should have this ability
eventually.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-29 18:31               ` Austin S. Hemmelgarn
@ 2018-06-30  6:33                 ` Duncan
  2018-07-02 12:03                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2018-06-30  6:33 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as
excerpted:

> On 2018-06-29 13:58, james harvey wrote:
>> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>> On 2018-06-29 11:15, james harvey wrote:
>>>>
>>>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy
>>>> <lists@colorremedies.com>
>>>> wrote:
>>>>>
>>>>> And an open question I have about scrub is weather it only ever is
>>>>> checking csums, meaning nodatacow files are never scrubbed, or if
>>>>> the copies are at least compared to each other?
>>>>
>>>>
>>>> Scrub never looks at nodatacow files.  It does not compare the copies
>>>> to each other.
>>>>
>>>> Qu submitted a patch to make check compare the copies:
>>>> https://patchwork.kernel.org/patch/10434509/
>>>>
>>>> This hasn't been added to btrfs-progs git yet.
>>>>
>>>> IMO, I think the offline check should look at nodatacow copies like
>>>> this, but I still think this also needs to be added to scrub.  In the
>>>> patch thread, I discuss my reasons why.  In brief: online scanning;
>>>> this goes along with user's expectation of scrub ensuring mirrored
>>>> data integrity; and recommendations to setup scrub on periodic basis
>>>> to me means it's the place to put it.
>>>
>>> That said, it can't sanely fix things if there is a mismatch. At
>>> least,
>>> not unless BTRFS gets proper generational tracking to handle
>>> temporarily missing devices.  As of right now, sanely fixing things
>>> requires significant manual intervention, as you have to bypass the
>>> device read selection algorithm to be able to look at the state of the
>>> individual copies so that you can pick one to use and forcibly rewrite
>>> the whole file by hand.
>> 
>> Absolutely.  User would need to use manual intervention as you
>> describe, or restore the single file(s) from backup.  But, it's a good
>> opportunity to tell the user they had partial data corruption, even if
>> it can't be auto-fixed.  Otherwise they get intermittent data
>> corruption, depending on which copies are read.

> The thing is though, as things stand right now, you need to manually
> edit the data on-disk directly or restore the file from a backup to fix
> the file.  While it's technically true that you can manually repair this
> type of thing, both of the cases for doing it without those patches I
> mentioned, it's functionally impossible for a regular user to do it
> without potentially losing some data.

[Usual backups rant, user vs. admin variant, nocow/tmpfs edition.  
Regulars can skip as the rest is already predicted from past posts, for 
them. =;^]

"Regular user"?  

"Regular users" don't need to bother with this level of detail.  They 
simply get their "admin" to do it, even if that "admin" is their kid, or 
the kid from next door that's good with computers, or the geek squad (aka 
nsa-agent-squad) guy/gal, doing it... or telling them to install "a real 
OS", meaning whatever MS/Apple/Google something that they know how to 
deal with.

If the "user" is dealing with setting nocow, choosing btrfs in the first 
place, etc, then they're _not_ a "regular user" by definition, they're 
already an admin.

And as any admin learns rather quickly, the value of data is defined by 
the number of backups it's worth having of that data.

Which means it's not a problem.  Either the data had a backup and it's 
(reasonably) trivial to restore the data from that backup, or the data 
was defined by lack of having that backup as of only trivial value, so 
low as to not be worth the time/trouble/resources necessary to make that 
backup in the first place.

Which of course means what was defined as of most value, either the data 
if there was a backup, or the time/trouble/resources that would have gone 
into creating it if not, is *always* saved.

(And of course the same goes for "I had a backup, but it's old", except 
in this case it's the value of the data delta between the backup and 
current.  As soon as it's worth more than the time/trouble/hassle of 
updating the backup, it will by definition be updated.  Not having a 
newer backup available thus simply means the value of the data that 
changed between the last backup and current was simply not enough to 
justify updating the backup, and again, what was of most value is 
*always* saved, either the data, or the time that would have otherwise 
gone into making the newer backup.)

Because while a "regular user" may not know it because it's not his /job/ 
to know it, if there's anything an admin knows *well* it's that the 
working copy of data **WILL** be damaged.  It's not a matter of if, but 
of when, and of whether it'll be a fat-finger mistake, or a hardware or 
software failure, or wetware (theft, ransomware, etc), or weather (flood, 
fire, and the water damage from putting it out, etc), tho none of that 
actually matters after all, because in the end, the only thing that 
matters was how the value of that data was defined by the number of 
backups made of it, and how quickly and conveniently at least one of 
those backups can be retrieved and restored.


Meanwhile, an admin worth the label will also know the relative risk 
associated with various options they might use, including nocow, and 
knowing that downgrades the stability rating of the storage approximately 
to the same degree that raid0 does, they'll already be aware that in such 
a case the working copy can only be defined as "throw-away" level in case 
of problems in the first place, and will thus not even consider their 
working copy to be a permanent copy at all, just a temporary garbage 
copy, only slightly more reliable than one stored on tmpfs, and will thus 
consider the first backup thereof the true working copy, with an 
additional level of backup beyond what they'd normally have thrown in to 
account for that fact.

So in case of problems people can simply restore nocow files from a near-
line stable working copy, much as they'd do after reboot or a umount/
remount cycle for a file stored in tmpfs.  And if they didn't have even a 
stable working copy let alone a backup... well, much like that file in 
tmpfs, what did they expect?  They *really* defined that data as of no 
more than trivial value, didn't they?


All that said, making the NOCOW warning labels a bit more bold print 
couldn't hurt; and making scrub in the nocow case at least compare copies 
and report differences, simply makes it easier for people to know they 
need to reach for that near-line stable working copy, or mkfs and start 
from scratch if they defined the data value as not worth the trouble of 
(in this case) even a stable working copy, let alone a backup, so that'd 
be a good thing too. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
  2018-06-30  6:33                 ` Duncan
@ 2018-07-02 12:03                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 12:03 UTC (permalink / raw)
  To: linux-btrfs

On 2018-06-30 02:33, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as
> excerpted:
> 
>> On 2018-06-29 13:58, james harvey wrote:
>>> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>> On 2018-06-29 11:15, james harvey wrote:
>>>>>
>>>>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy
>>>>> <lists@colorremedies.com>
>>>>> wrote:
>>>>>>
>>>>>> And an open question I have about scrub is weather it only ever is
>>>>>> checking csums, meaning nodatacow files are never scrubbed, or if
>>>>>> the copies are at least compared to each other?
>>>>>
>>>>>
>>>>> Scrub never looks at nodatacow files.  It does not compare the copies
>>>>> to each other.
>>>>>
>>>>> Qu submitted a patch to make check compare the copies:
>>>>> https://patchwork.kernel.org/patch/10434509/
>>>>>
>>>>> This hasn't been added to btrfs-progs git yet.
>>>>>
>>>>> IMO, I think the offline check should look at nodatacow copies like
>>>>> this, but I still think this also needs to be added to scrub.  In the
>>>>> patch thread, I discuss my reasons why.  In brief: online scanning;
>>>>> this goes along with user's expectation of scrub ensuring mirrored
>>>>> data integrity; and recommendations to setup scrub on periodic basis
>>>>> to me means it's the place to put it.
>>>>
>>>> That said, it can't sanely fix things if there is a mismatch. At
>>>> least,
>>>> not unless BTRFS gets proper generational tracking to handle
>>>> temporarily missing devices.  As of right now, sanely fixing things
>>>> requires significant manual intervention, as you have to bypass the
>>>> device read selection algorithm to be able to look at the state of the
>>>> individual copies so that you can pick one to use and forcibly rewrite
>>>> the whole file by hand.
>>>
>>> Absolutely.  User would need to use manual intervention as you
>>> describe, or restore the single file(s) from backup.  But, it's a good
>>> opportunity to tell the user they had partial data corruption, even if
>>> it can't be auto-fixed.  Otherwise they get intermittent data
>>> corruption, depending on which copies are read.
> 
>> The thing is though, as things stand right now, you need to manually
>> edit the data on-disk directly or restore the file from a backup to fix
>> the file.  While it's technically true that you can manually repair this
>> type of thing, both of the cases for doing it without those patches I
>> mentioned, it's functionally impossible for a regular user to do it
>> without potentially losing some data.
> 
> [Usual backups rant, user vs. admin variant, nowcow/tmpfs edition.
> Regulars can skip as the rest is already predicted from past posts, for
> them. =;^]
> 
> "Regular user"?
> 
> "Regular users" don't need to bother with this level of detail.  They
> simply get their "admin" to do it, even if that "admin" is their kid, or
> the kid from next door that's good with computers, or the geek squad (aka
> nsa-agent-squad) guy/gal, doing it... or telling them to install "a real
> OS", meaning whatever MS/Apple/Google something that they know how to
> deal with.
> 
> If the "user" is dealing with setting nocow, choosing btrfs in the first
> place, etc, then they're _not_ a "regular user" by definition, they're
> already an admin.

I'd argue that that's not always true.  'Regular users' also blindly 
follow advice they find online about how to make their system run 
better, and quite often don't keep backups.
> 
> And as any admin learns rather quickly, the value of data is defined by
> the number of backups it's worth having of that data.
> 
> Which means it's not a problem.  Either the data had a backup and it's
> (reasonably) trivial to restore the data from that backup, or the data
> was defined by lack of having that backup as of only trivial value, so
> low as to not be worth the time/trouble/resources necessary to make that
> backup in the first place.
> 
> Which of course means what was defined as of most value, either the data
> of there was a backup, or the time/trouble/resources that would have gone
> into creating it if not, is *always* saved.
> 
> (And of course the same goes for "I had a backup, but it's old", except
> in this case it's the value of the data delta between the backup and
> current.  As soon as it's worth more than the time/trouble/hassle of
> updating the backup, it will by definition be updated.  Not having a
> newer backup available thus simply means the value of the data that
> changed between the last backup and current was simply not enough to
> justify updating the backup, and again, what was of most value is
> *always* saved, either the data, or the time that would have otherwise
> gone into making the newer backup.)
> 
> Because while a "regular user" may not know it because it's not his /job/
> to know it, if there's anything an admin knows *well* it's that the
> working copy of data **WILL** be damaged.  It's not a matter of if, but
> of when, and of whether it'll be a fat-finger mistake, or a hardware or
> software failure, or wetware (theft, ransomware, etc), or wetware (flood,
> fire and the water that put it out damage, etc), tho none of that
> actually matters after all, because in the end, the only thing that
> matters was how the value of that data was defined by the number of
> backups made of it, and how quickly and conveniently at least one of
> those backups can be retrieved and restored.
> 
> 
> Meanwhile, an admin worth the label will also know the relative risk
> associated with various options they might use, including nocow, and
> knowing that downgrades the stability rating of the storage approximately
> to the same degree that raid0 does, they'll already be aware that in such
> a case the working copy can only be defined as "throw-away" level in case
> of problems in the first place, and will thus not even consider their
> working copy to be a permanent copy at all, just a temporary garbage
> copy, only slightly more reliable than one stored on tmpfs, and will thus
> consider the first backup thereof the true working copy, with an
> additional level of backup beyond what they'd normally have thrown in to
> account for that fact.
> 
> So in case of problems people can simply restore nocow files from a near-
> line stable working copy, much as they'd do after reboot or a umount/
> remount cycle for a file stored in tmpfs.  And if they didn't have even a
> stable working copy let alone a backup... well, much like that file in
> tmpfs, what did they expect?  They *really* defined that data as of no
> more than trivial value, didn't they?
> 
> 
> All that said, making the NOCOW warning labels a bit more bold print
> couldn't hurt; and making scrub in the nocow case at least compare copies
> and report differences, simply makes it easier for people to know they
> need to reach for that near-line stable working copy, or mkfs and start
> from scratch if they defined the data value as not worth the trouble of
> (in this case) even a stable working copy, let alone a backup, so that'd
> be a good thing too. =:^)
> 
There are two things this rant ignores though:

1. Restoring from a backup is usually slow, even if you have a good 
backup system.  As a really specific example, where I work, it takes me 
about 5 minutes to find a single file in our backups.  Beyond that, the 
backup software has to pull together the whole archive from the 
individual pieces, decompress it, and then extract the file.  On 
average, for a file the size of a VM image, this all takes at least half 
an hour.

2. Backups are usually daily.  In most cases, it's much preferable not 
to lose a whole day's work on a given file.

Given both points, I'd much rather be able to take 90 seconds to fix a 
file and have it probably work, with the ability to restore from a 
backup if it doesn't.  Currently, despite the fact that I actually know 
(just barely) enough to fix this particular type of issue by hand, I end 
up just restoring files from backup all the time, because that 30-minute 
wait is still better than the hour-plus it takes for me to repair it by 
hand.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-07-02 12:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-28  1:42 Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files Remi Gauvin
2018-06-28  1:58 ` Qu Wenruo
2018-06-28  2:10   ` Remi Gauvin
2018-06-28  2:55     ` Qu Wenruo
2018-06-28  3:14       ` remi
2018-06-28  5:39         ` Qu Wenruo
2018-06-28  8:16           ` Andrei Borzenkov
2018-06-28  8:20             ` Andrei Borzenkov
2018-06-28  9:15             ` Qu Wenruo
2018-06-28 11:12               ` Austin S. Hemmelgarn
2018-06-28 11:46                 ` Qu Wenruo
2018-06-28 12:20                   ` Austin S. Hemmelgarn
2018-06-28 17:10               ` Andrei Borzenkov
2018-06-29  0:07                 ` Qu Wenruo
2018-06-28 22:00               ` Remi Gauvin
2018-06-28 13:24 ` Anand Jain
2018-06-28 14:17   ` Chris Murphy
2018-06-28 15:37     ` Remi Gauvin
2018-06-28 22:04       ` Chris Murphy
2018-06-28 17:37     ` Goffredo Baroncelli
2018-06-28 22:27       ` Chris Murphy
2018-06-29 15:15         ` james harvey
2018-06-29 17:09           ` Austin S. Hemmelgarn
2018-06-29 17:58             ` james harvey
2018-06-29 18:31               ` Austin S. Hemmelgarn
2018-06-30  6:33                 ` Duncan
2018-07-02 12:03                   ` Austin S. Hemmelgarn
2018-06-29 18:40           ` Chris Murphy
