Hi All,

For clarity for the masses, what are the "multiple serious data-loss
bugs" mentioned in the btrfs wiki?

The bullet points on this page:
https://btrfs.wiki.kernel.org/index.php/RAID56
don't enumerate the bugs, not even at a high level. If anything, the
closest thing to a bug, an issue, or a missing resilience use-case is
the first point on that page:

"Parity may be inconsistent after a crash (the "write hole"). The
problem born when after "an unclean shutdown" a disk failure happens.
But these are *two* distinct failures. These together break the BTRFS
raid5 redundancy. If you run a scrub process after "an unclean shutdown"
(with no disk failure in between) those data which match their checksum
can still be read out while the mismatched data are lost forever."

So in a nutshell: "What are the multiple serious data-loss bugs?" If
there aren't any, perhaps updating the wiki to something less "dramatic"
should be considered.
DanglingPointer wrote:
>
> Hi All,
>
> For clarity for the masses, what are the "multiple serious data-loss
> bugs" as mentioned in the btrfs wiki?
> The bullet points on this page:
> https://btrfs.wiki.kernel.org/index.php/RAID56
> don't enumerate the bugs. Not even in a high level. If anything what
> can be closest to a bug or issue or "resilience use-case missing" would
> be the first point on that page.
>
> "Parity may be inconsistent after a crash (the "write hole"). The
> problem born when after "an unclean shutdown" a disk failure happens.
> But these are *two* distinct failures. These together break the BTRFS
> raid5 redundancy. If you run a scrub process after "an unclean shutdown"
> (with no disk failure in between) those data which match their checksum
> can still be read out while the mismatched data are lost forever."
>
> So in a nutshell; "What are the multiple serious data-loss bugs?" If
> there aren't any, perhaps updating the wiki should be considered for
> something less the "dramatic" .
>

I would just like to add that according to the status page, the only
missing piece from an implementation point of view is the write hole.
https://btrfs.wiki.kernel.org/index.php/Status#RAID56

What effect exactly the write hole might have on *data* is not pointed
out in detail, but I would imagine that for some it might be desirable
to run a btrfs filesystem with metadata in "RAID" 1/10 mode and data in
"RAID" 5/6. As far as I can understand, this would leave you in a
position where your filesystem structures are relatively safe, as "RAID"
1/10 mode is considered stable; e.g. you should not lose or corrupt your
filesystem in the event of a crash / brownout. On the other hand you
might lose or corrupt a file being written, which may or may not be
acceptable for some. In any case a scrub should fix any inconsistencies.

My point being that such a configuration might be (just?) as safe as,
for example, mdraid 5/6, and in some cases perhaps even safer thanks to
the checksumming and self-heal features of btrfs. Unless I am totally
off, I think it would be wise to add the metadata "RAID" 1/10 and data
"RAID" 5/6 method to the wiki as a possible "no worse than any other XYZ
solution" if you need storage and don't have that much metadata in your
filesystem.
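For concreteness, the mixed-profile layout described above (metadata
mirrored, data on parity RAID) can be set up with the stock btrfs tools.
This is only a sketch: the device names and mount point are
placeholders, and the commands need real block devices.

```shell
# New filesystem: mirror metadata (raid1), stripe data with parity (raid5).
mkfs.btrfs -m raid1 -d raid5 /dev/sdb /dev/sdc /dev/sdd

# Existing filesystem: convert chunk profiles in place with a balance.
btrfs balance start -mconvert=raid1 -dconvert=raid5 /mnt/pool

# Check which profile each chunk type is actually using.
btrfs filesystem df /mnt/pool
```

The balance rewrites every existing chunk into the new profile, so it
works on an already-populated filesystem, but treat this as a recipe
rather than something to paste blindly.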
On 2019-01-26 7:07 a.m., waxhead wrote:
>
> What effect exactly the write hole might have on *data* is not pointed
> out in detail, but I would imagine that for some it might be desirable
> to run a btrfs filesystem with metadata in "RAID" 1/10 mode and data in
> "RAID" 5/6.

One big problem here is that neither Raid 1 nor Raid 10 currently has a
configurable number of mirrors (N devices), so it's impossible to get
two-device-failure redundancy. (No point in using Raid 6 then; that's
just wasting space.)
On 2019/1/26 7:45 PM, DanglingPointer wrote:
>
> Hi All,
>
> For clarity for the masses, what are the "multiple serious data-loss
> bugs" as mentioned in the btrfs wiki?
> The bullet points on this page:
> https://btrfs.wiki.kernel.org/index.php/RAID56
> don't enumerate the bugs. Not even in a high level. If anything what
> can be closest to a bug or issue or "resilience use-case missing" would
> be the first point on that page.
>
> "Parity may be inconsistent after a crash (the "write hole"). The
> problem born when after "an unclean shutdown" a disk failure happens.
> But these are *two* distinct failures. These together break the BTRFS
> raid5 redundancy. If you run a scrub process after "an unclean shutdown"
> (with no disk failure in between) those data which match their checksum
> can still be read out while the mismatched data are lost forever."
>
> So in a nutshell; "What are the multiple serious data-loss bugs?".

There used to be two: scrub racing (minor), and screwing up the good
copy when doing recovery (major).

Although these two should already be fixed.

So for the current upstream kernel, there should be no major problem
besides the write hole.

Thanks,
Qu

> If there aren't any, perhaps updating the wiki should be considered for
> something less "dramatic".
On Mon, 28 Jan 2019 at 01:18, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> So for current upstream kernel, there should be no major problem despite
> write hole.
Can you please elaborate on the implications of the write-hole? Does
it mean that the transaction currently in-flight might be lost but the
filesystem is otherwise intact? How does it interact with data and
metadata being stored with a different profile (one with write hole
and one without)?
Thanks
On Mon, Jan 28, 2019 at 03:23:28PM +0000, Supercilious Dude wrote:
> On Mon, 28 Jan 2019 at 01:18, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> > So for current upstream kernel, there should be no major problem despite
> > write hole.
>
> Can you please elaborate on the implications of the write-hole? Does
> it mean that the transaction currently in-flight might be lost but the
> filesystem is otherwise intact?

No, losing the in-flight transaction is normal operation of every modern
filesystem -- in fact, you _want_ the transaction to be lost instead of
partially torn.

The write hole means corruption of a random _old_ piece of data. It can
be fatal (ie, lead to data loss) if two errors happen together:
* the stripe is degraded
* there's an unexpected crash/power loss

Every RAID implementation (not just btrfs) suffers from the write hole
unless some special, costly precaution is taken. Those include
journaling, plug extents, and varying-width stripes (ZFS: RAIDZ). The
former two require effectively writing small writes twice; the latter
degrades small writes to RAID1 as far as disk capacity goes.

The write hole affects only writes that neighbour some old (ie, not from
the current transaction) data in the same stripe -- as long as
everything in a single stripe belongs to no more than one transaction,
all is fine.

> How does it interact with data and metadata being stored with a
> different profile (one with write hole and one without)?

If there's an unrecoverable error due to the write hole, you lose a
single stripe's worth. For data, this means a single piece of a file is
beyond repair. For metadata, you lose a potentially large swath of the
filesystem -- and as tree nodes close to the root get rewritten the
most, a total filesystem loss is pretty likely. To make things worse,
while data writes are mostly linear (for small files, btrfs batches
writes from the same transaction), metadata is strewn all around, mixing
pieces of different importance and different age.

RAID5 (all implementations) is also very slow for random writes (such as
btrfs metadata), thus you really want RAID1 metadata both for safety and
performance. Metadata being only around 1-2% of disk space, the only
upside of RAID5 (better use of capacity) doesn't really matter.

Ie: RAID1 is a clear winner for btrfs metadata; mixing profiles for data
vs metadata is safe.

Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄⠀⠀⠀⠀
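The key point above, that the write hole corrupts an *old* block rather
than the one being written, can be shown with a one-byte toy stripe in
plain shell arithmetic. Nothing here is btrfs-specific; it is just XOR
parity on a hypothetical three-device RAID5 stripe.

```shell
#!/bin/sh
# Toy 3-device RAID5 stripe: two one-byte data blocks plus XOR parity.
d1=170                   # 0xAA on disk 1
d2=204                   # 0xCC on disk 2
parity=$(( d1 ^ d2 ))    # parity on disk 3, consistent with d1 and d2

# Torn write: the new d1 reaches disk 1, but power is lost before the
# matching parity update. The stripe is now internally inconsistent.
d1=85                    # 0x55 is on disk; $parity is stale

# Later, disk 2 fails. Rebuilding d2 from d1 and the stale parity
# silently produces garbage -- old data the torn write never touched.
d2_rebuilt=$(( d1 ^ parity ))
echo "expected d2=$d2, rebuilt d2=$d2_rebuilt"
```

This prints `expected d2=204, rebuilt d2=51`: the block that was being
updated is fine, but its innocent stripe neighbour is reconstructed
wrongly once both failures (stale parity and a lost device) coincide.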
Thanks Qu!
I thought as much from following the mailing list and your great work
over the years!
Would it be possible to get the wiki updated to reflect the current
"real" status?
From Qu's statement and perspective, there's no difference to other
non-BTRFS software RAID56's out there that are marked as stable (except
ZFS).
Also there are no "multiple serious data-loss bugs".
Please do consider my proposal as it will decrease the amount of
incorrect paranoia that exists in the community.
As long as the Wiki properly mentions the current state with the options
for mitigation; like backup power and perhaps RAID1 for metadata or
anything else you believe as appropriate.
Thanks,
DP
On 28/1/19 11:52 am, Qu Wenruo wrote:
>
> On 2019/1/26 7:45 PM, DanglingPointer wrote:
>>
>> Hi All,
>>
>> For clarity for the masses, what are the "multiple serious data-loss
>> bugs" as mentioned in the btrfs wiki?
>> The bullet points on this page:
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>> don't enumerate the bugs. Not even in a high level. If anything what
>> can be closest to a bug or issue or "resilience use-case missing" would
>> be the first point on that page.
>>
>> "Parity may be inconsistent after a crash (the "write hole"). The
>> problem born when after "an unclean shutdown" a disk failure happens.
>> But these are *two* distinct failures. These together break the BTRFS
>> raid5 redundancy. If you run a scrub process after "an unclean shutdown"
>> (with no disk failure in between) those data which match their checksum
>> can still be read out while the mismatched data are lost forever."
>>
>> So in a nutshell; "What are the multiple serious data-loss bugs?".
> There used to be two, like scrub racing (minor), and screwing up good
> copy when doing recovery (major).
>
> Although these two should already be fixed.
>
> So for current upstream kernel, there should be no major problem despite
> write hole.
>
> Thanks,
> Qu
>
>> If
>> there aren't any, perhaps updating the wiki should be considered for
>> something less the "dramatic" .
>>
>>
>>
On 2019-01-28 5:07 p.m., DanglingPointer wrote:
> From Qu's statement and perspective, there's no difference to other
> non-BTRFS software RAID56's out there that are marked as stable (except
> ZFS).
> Also there are no "multiple serious data-loss bugs".
> Please do consider my proposal as it will decrease the amount of
> incorrect paranoia that exists in the community.
> As long as the Wiki properly mentions the current state with the options
> for mitigation; like backup power and perhaps RAID1 for metadata or
> anything else you believe as appropriate.

Should implement some way to automatically scrub on unclean shutdown.
BTRFS is the only (to my knowledge) Raid implementation that will not
automatically detect an unclean shutdown and fix the affected parity
blocks (either by some form of write journal/write intent map, or full
resync).
On 2019/1/29 6:07 AM, DanglingPointer wrote:
> Thanks Qu!
> I thought as much from following the mailing list and your great work
> over the years!
>
> Would it be possible to get the wiki updated to reflect the current
> "real" status?
>
> From Qu's statement and perspective, there's no difference to other
> non-BTRFS software RAID56's out there that are marked as stable (except
> ZFS).

I'm afraid that my old statement is wrong.

Quite a lot of software RAID56 implementations have a way to record
which blocks got modified, just like some hardware RAID56 controllers
do, and thus get rid of the write hole problem.

Thanks,
Qu

> Also there are no "multiple serious data-loss bugs".
> Please do consider my proposal as it will decrease the amount of
> incorrect paranoia that exists in the community.
> As long as the Wiki properly mentions the current state with the options
> for mitigation; like backup power and perhaps RAID1 for metadata or
> anything else you believe as appropriate.
>
> Thanks,
>
> DP
>
> On 28/1/19 11:52 am, Qu Wenruo wrote:
>>
>> On 2019/1/26 7:45 PM, DanglingPointer wrote:
>>>
>>> Hi All,
>>>
>>> For clarity for the masses, what are the "multiple serious data-loss
>>> bugs" as mentioned in the btrfs wiki?
>>> The bullet points on this page:
>>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>> don't enumerate the bugs. Not even in a high level. If anything what
>>> can be closest to a bug or issue or "resilience use-case missing" would
>>> be the first point on that page.
>>>
>>> "Parity may be inconsistent after a crash (the "write hole"). The
>>> problem born when after "an unclean shutdown" a disk failure happens.
>>> But these are *two* distinct failures. These together break the BTRFS
>>> raid5 redundancy. If you run a scrub process after "an unclean shutdown"
>>> (with no disk failure in between) those data which match their checksum
>>> can still be read out while the mismatched data are lost forever."
>>>
>>> So in a nutshell; "What are the multiple serious data-loss bugs?".
>> There used to be two, like scrub racing (minor), and screwing up good
>> copy when doing recovery (major).
>>
>> Although these two should already be fixed.
>>
>> So for current upstream kernel, there should be no major problem despite
>> write hole.
>>
>> Thanks,
>> Qu
>>
>>> If
>>> there aren't any, perhaps updating the wiki should be considered for
>>> something less the "dramatic" .
On Mon, Jan 28, 2019 at 3:52 PM Remi Gauvin <remi@georgianit.com> wrote:
>
> On 2019-01-28 5:07 p.m., DanglingPointer wrote:
>
> > From Qu's statement and perspective, there's no difference to other
> > non-BTRFS software RAID56's out there that are marked as stable (except
> > ZFS).
> > Also there are no "multiple serious data-loss bugs".
> > Please do consider my proposal as it will decrease the amount of
> > incorrect paranoia that exists in the community.
> > As long as the Wiki properly mentions the current state with the options
> > for mitigation; like backup power and perhaps RAID1 for metadata or
> > anything else you believe as appropriate.
>
> Should implement some way to automatically scrub on unclean shutdown.
> BTRFS is the only (to my knowlege) Raid implementation that will not
> automatically detect an unclean shutdown and fix the affected parity
> blocks, (either by some form of write journal/write intent map, or full
> resync.)
There's no dirty bit set on mount, and thus no dirty bit to unset on
clean mount, from which to infer a dirty unmount if it's present at
the next mount.
If there were a way to implement an abridged scrub, it could be done
on every mount if metadata uses raid56 profile. But I think Qu is
working on something like a raid56 that would obviate the problem,
which is probably the best and most scalable solution.
An abridged scrub could be metadata only, and only if it's raid56 profile.
But still in 2019 we have this super crap default SCSI block layer
command timeout of 30 seconds. This encourages corruption on common
consumer drives by resetting them prematurely when they're merely in
deep recoveries that take longer than 30s. And this prevents automatic
repair from happening, since it keeps the drive from reporting a
discrete read error with the affected sector, so the problem gets
masked behind link resets.
--
Chris Murphy
On 29/01/2019 20.02, Chris Murphy wrote:
> On Mon, Jan 28, 2019 at 3:52 PM Remi Gauvin <remi@georgianit.com> wrote:
>>
>> On 2019-01-28 5:07 p.m., DanglingPointer wrote:
>>
>>> From Qu's statement and perspective, there's no difference to other
>>> non-BTRFS software RAID56's out there that are marked as stable (except
>>> ZFS).
>>> Also there are no "multiple serious data-loss bugs".
>>> Please do consider my proposal as it will decrease the amount of
>>> incorrect paranoia that exists in the community.
>>> As long as the Wiki properly mentions the current state with the options
>>> for mitigation; like backup power and perhaps RAID1 for metadata or
>>> anything else you believe as appropriate.
>>
>> Should implement some way to automatically scrub on unclean shutdown.
>> BTRFS is the only (to my knowlege) Raid implementation that will not
>> automatically detect an unclean shutdown and fix the affected parity
>> blocks, (either by some form of write journal/write intent map, or full
>> resync.)
>
> There's no dirty bit set on mount, and thus no dirty bit to unset on
> clean mount, from which to infer a dirty unmount if it's present at
> the next mount.
It would be sufficient to use the log, which BTRFS already has. During
each transaction, when an area is touched by an RMW (read-modify-write)
cycle, it has to be tracked in the log.
In case of unclean shutdown, a way to replay the log is already
implemented. So it would be sufficient to treat a scrub of these areas
as "log replay".
Of course I am speaking not as a BTRFS developer, so the reality could
be more complex: e.g. I don't know how easy it would be to start a scrub
process on a per-area basis.
BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
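Speaking equally non-authoritatively, the per-area tracking described
above could be sketched as follows. This is a toy model, not btrfs
code; every name in it is made up for illustration, and "scrub" is
reduced to printing the stripe numbers that would need checking.

```shell
#!/bin/sh
# Toy dirty-stripe log: record a stripe number before any
# read-modify-write, clear the log on transaction commit, and after a
# crash check only the stripes still listed.
LOG="${STRIPE_LOG:-/var/lib/stripe_dirty.log}"

mark_stripe_dirty() {    # call before the RMW of stripe $1 hits disk
    echo "$1" >> "$LOG"
}

commit_transaction() {   # parity now matches data everywhere
    : > "$LOG"
}

replay_after_crash() {   # print the stripes that still need a scrub
    if [ -s "$LOG" ]; then
        sort -un "$LOG"
    fi
}
```

After a clean commit the log is empty and replay does nothing; after a
crash, only the stripes named in the log (a tiny fraction of the disk)
would need their parity rebuilt, instead of a full scrub.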
Going back to my original email, would the BTRFS wiki admins consider a
better, more reflective update of the RAID56 status page?
It still states "multiple serious data-loss bugs", which, as Qu Wenruo
has already clarified, is not the case. The only "bug" left is the
write-hole edge case.
On 30/1/19 6:47 am, Goffredo Baroncelli wrote:
> On 29/01/2019 20.02, Chris Murphy wrote:
>> On Mon, Jan 28, 2019 at 3:52 PM Remi Gauvin <remi@georgianit.com> wrote:
>>> On 2019-01-28 5:07 p.m., DanglingPointer wrote:
>>>
>>>> From Qu's statement and perspective, there's no difference to other
>>>> non-BTRFS software RAID56's out there that are marked as stable (except
>>>> ZFS).
>>>> Also there are no "multiple serious data-loss bugs".
>>>> Please do consider my proposal as it will decrease the amount of
>>>> incorrect paranoia that exists in the community.
>>>> As long as the Wiki properly mentions the current state with the options
>>>> for mitigation; like backup power and perhaps RAID1 for metadata or
>>>> anything else you believe as appropriate.
>>> Should implement some way to automatically scrub on unclean shutdown.
>>> BTRFS is the only (to my knowlege) Raid implementation that will not
>>> automatically detect an unclean shutdown and fix the affected parity
>>> blocks, (either by some form of write journal/write intent map, or full
>>> resync.)
>> There's no dirty bit set on mount, and thus no dirty bit to unset on
>> clean mount, from which to infer a dirty unmount if it's present at
>> the next mount.
> It would be sufficient to use the log, which BTRFS already has. During each transaction, when an area is touched by a rwm cycle, it has to tracked in the log.
> In case of unclean shutdown, it is already implemented a way to replay the log. So it would be sufficient to track a scrub of these area as "log replay".
>
> Of course I am talking as not a BTRFS developers, so the reality could be more complex: e.g. I don't know how it would be easy to raise a scrub process on per area basis.
>
> BR
> G.Baroncelli
>
>
>
On 2019-01-29 2:02 p.m., Chris Murphy wrote:
>
> There's no dirty bit set on mount, and thus no dirty bit to unset on
> clean mount, from which to infer a dirty unmount if it's present at
> the next mount.

Some time back, I was toying with the idea of a startup script that
creates a /need_scrub file, paired with a nightly script that checks for
that file and runs scrub if found. That way, if the system ever ended up
restarted without admin intervention, a scrub would be run that very
night.

But I've decided against using BTRFS parity raid until I can get working
two-device-failure redundancy with no write hole: either with metadata
Raid 1 N3 (what's the correct way to say that?) or a Raid 6 that's
write-hole proof.
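The flag-file scheme sketched above could look roughly like this as a
pair of shell hooks. Everything here is hypothetical: the paths, the
hook names, and the overridable `SCRUB_CMD` (so the scrub command can be
stubbed for testing) are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical hooks for "scrub after unclean shutdown": boot leaves a
# flag file behind, a clean shutdown removes it, and the nightly job
# scrubs only if the previous boot's flag survived.
FLAG="${NEED_SCRUB_FLAG:-/var/lib/need_scrub}"
SCRUB="${SCRUB_CMD:-btrfs scrub start -B /mnt}"

on_boot() {
    if [ -e "$FLAG" ]; then
        # Flag left over from the last boot: that shutdown was unclean.
        touch "$FLAG.scrub_due"
    fi
    touch "$FLAG"
}

on_clean_shutdown() {
    rm -f "$FLAG"
}

nightly_job() {
    if [ -e "$FLAG.scrub_due" ]; then
        # Clear the marker only once the scrub actually succeeds.
        $SCRUB && rm -f "$FLAG.scrub_due"
    fi
}
```

Wiring `on_boot` into an early boot unit and `on_clean_shutdown` into
the shutdown path gives the "dirty bit" that Chris notes btrfs itself
does not keep, at the cost of a full (not abridged) scrub.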