* Encountered kernel bug#72811. Advice on recovery?
@ 2017-04-13 18:49 Ank Ular
  2017-04-14  3:47 ` Duncan
  2017-04-14 16:46 ` Chris Murphy
  0 siblings, 2 replies; 10+ messages in thread
From: Ank Ular @ 2017-04-13 18:49 UTC (permalink / raw)
  To: linux-btrfs

I've encountered kernel bug#72811 "If a raid5 array gets into degraded
mode, gets modified, and the missing drive re-added, the filesystem
loses state".

In my case, I had rebooted my system and one of the drives on my main
array did not come up. I was able to mount in degraded mode. I needed
to re-boot the following day. This time, all the drives in the array
came up. Several hours later, the array went into read only mode.
That's when I discovered the odd device out had been re-added without
any kind of error message or notice.

SMART does not report any errors on the device itself. I did have a
failed fan inside the server case, and I suspect a thermally sensitive
issue with the responsible drive controller. Since replacing the
failed fan plus another fan, all drives report a running temperature
in the range of 34~35 Celsius, which is normal. None of the drives
reports having recorded any errors.

The array normally consists of 22 devices with data and metadata in
raid6. Physically, 16 of the devices sit in a NORCO DS-24 cage and the
remaining six are in the server itself. All the devices are SATA III.

I've added "noauto" to the options in my fstab file for this array.
I've also disabled the odd drive out so it's no longer seen as part of
the array.

Current fstab line:
LABEL="PublicB"                                 /PublicB        btrfs
 autodefrag,compress=lzo,space_cache,noatime,noauto      0 0

I manually mount the array:
mount -o recovery,ro,degraded
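
Written out in full against the fstab entry above, that is
effectively:

mount -o recovery,ro,degraded LABEL=PublicB /PublicB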

Current device list for the array, per 'btrfs filesystem show':
Label: 'PublicB'  uuid: 76d87b95-5651-4707-b5bf-168210af7c3f
       Total devices 22 FS bytes used 83.63TiB
       devid    1 size 5.46TiB used 5.12TiB path /dev/sdt
       devid    2 size 5.46TiB used 5.12TiB path /dev/sdv
       devid    3 size 5.46TiB used 5.12TiB path /dev/sdaa
       devid    4 size 5.46TiB used 5.12TiB path /dev/sdx
       devid    5 size 5.46TiB used 5.12TiB path /dev/sdo
       devid    6 size 5.46TiB used 5.12TiB path /dev/sdq
       devid    7 size 5.46TiB used 5.12TiB path /dev/sds
       devid    8 size 5.46TiB used 5.12TiB path /dev/sdu
       devid    9 size 5.46TiB used 4.25TiB path /dev/sdr
       devid   10 size 5.46TiB used 4.25TiB path /dev/sdy
       devid   11 size 5.46TiB used 4.25TiB path /dev/sdab
       devid   12 size 3.64TiB used 3.64TiB path /dev/sdb
       devid   13 size 3.64TiB used 3.64TiB path /dev/sdc
       devid   14 size 4.55TiB used 4.25TiB path /dev/sdd
       devid   17 size 4.55TiB used 4.25TiB path /dev/sdg
       devid   18 size 4.55TiB used 4.25TiB path /dev/sdh
       devid   19 size 5.46TiB used 4.25TiB path /dev/sdm
       devid   20 size 5.46TiB used 2.33TiB path /dev/sdp
       devid   21 size 5.46TiB used 2.33TiB path /dev/sdn
       devid   22 size 5.46TiB used 2.33TiB path /dev/sdw
       devid   23 size 5.46TiB used 2.33TiB path /dev/sdz
       *** Some devices missing

The missing device is a {nominal} 5.0TB drive and would usually show
up in this list as:
       devid   15 size 4.55TiB used 4.25TiB path /dev/sde

Other than "mount -o recovery,ro" when all 22 were present {and before
I understood I had encountered #72811}, I have NOT run any of the more
advanced recovery/repair commands/techniques.

As best as I can tell using independent {non btrfs related} tools, all
data {approximately 80TB} written prior to the initial event is
intact. Directories and files written/updated after the automatic
{and silent} device re-add are suspect and occasionally exhibit either
missing files or missing chunks of files.

Even though the data is intact, I get runs of csum and other errors.
Sample:
[114427.223006] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223011] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223012] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223015] BTRFS info (device sdw): no csum found for inode
913818 start 1219862528
[114427.223019] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223021] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223022] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223024] BTRFS info (device sdw): no csum found for inode
913818 start 1219866624
[114427.223027] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223029] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223030] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223032] BTRFS info (device sdw): no csum found for inode
913818 start 1219870720
[114427.223035] BTRFS error (device sdw): parent transid verify failed
on 59281854676992 wanted 328408 found 328388
[114427.223037] BTRFS info (device sdw): no csum found for inode
913818 start 1219874816
[114427.223042] BTRFS info (device sdw): no csum found for inode
913818 start 1219878912
[114427.223047] BTRFS info (device sdw): no csum found for inode
913818 start 1219883008
[114427.223051] BTRFS info (device sdw): no csum found for inode
913818 start 1219887104
[114427.223071] BTRFS info (device sdw): no csum found for inode
913818 start 1219891200
[114427.223076] BTRFS info (device sdw): no csum found for inode
913818 start 1219895296
[114427.223080] BTRFS info (device sdw): no csum found for inode
913818 start 1219899392
[114427.230847] BTRFS warning (device sdw): csum failed ino 913818 off
1220612096 csum 3114921698 expected csum 0
[114427.230856] BTRFS warning (device sdw): csum failed ino 913818 off
1220616192 csum 1310722868 expected csum 0
[114427.230861] BTRFS warning (device sdw): csum failed ino 913818 off
1220620288 csum 2799646595 expected csum 0
[114427.230866] BTRFS warning (device sdw): csum failed ino 913818 off
1220624384 csum 4020833134 expected csum 0
[114427.230870] BTRFS warning (device sdw): csum failed ino 913818 off
1220628480 csum 2942842633 expected csum 0
[114427.230875] BTRFS warning (device sdw): csum failed ino 913818 off
1220632576 csum 2112871613 expected csum 0
[114427.230879] BTRFS warning (device sdw): csum failed ino 913818 off
1220636672 csum 3037436145 expected csum 0
[114427.230884] BTRFS warning (device sdw): csum failed ino 913818 off
1220640768 csum 2799458999 expected csum 0
[114427.230888] BTRFS warning (device sdw): csum failed ino 913818 off
1220644864 csum 1132935941 expected csum 0
[114427.230893] BTRFS warning (device sdw): csum failed ino 913818 off
1220648960 csum 2622911668 expected csum 0

At the time of the event, I was running gentoo-sources-4.9.11. I've
since moved to gentoo-sources-4.10.8 and updated the associated btrfs
tools to match.

The practical problem with bug#72811 is that all the csum and transid
information on the automatically re-added drive is treated as being
just as valid as the same information on all the other drives. The
second practical problem appears to be that, since this is a raid56
configuration, none of the usual techniques, such as fixing csums or
going to backup roots, appears to work properly; i.e., this appears to
be one of the areas in btrfs where raid56 isn't ready.

In hindsight, I've run into bug#72811 before, but didn't recognize it
at the time, due both to inexperience and to having other problems as
well {a combination of physical mis-configuration and failing hard
drives}.

I don't have issues with the above tools not being ready for raid56.
Despite the mass quantities, none of the data involved is
irretrievable, irreplaceable or of earth-shattering importance on any
level. This is a purely personal setup.

I'd also like to point out that I have tested the process of
physically pulling a drive and then going through both reducing the
number of drives {given sufficient remaining space} and adding a new
drive to replace the allegedly failed one. These functions seem to
work fine so long as none of the devices are too full.

As such, I'm not bothered by the 'not ready for prime time' status of
raid56. This bug, however, is really, really nasty. Once a drive is
out of sync, it should never be automatically re-added. Such drives
should always be re-initialized as new drives. At some future date, if
someone codes a solution for properly re-syncing such a raid56
configured device, then perhaps auto re-adding might make sense. For
now, it doesn't.

I mention all this because I KNOW someone is going to go off on how I
should have backups of everything, how I should not run raid56, how I
should run mirrored instead, etc. Been there. Done that. I have
the same canned lecture for people running data centers for
businesses.

I am not a business. This is my personal hobby. The risk does not
bother me. I don't mind running this setup because I think real life
runtimes can contribute to the general betterment of btrfs for
everyone. I'm not in any particular hurry. My income is completely
independent from this.

I've run this array for 8 months without a single problem. Drive
controller problems are always nasty and generally much more difficult
to protect against, which is why I've been migrating to an external
chassis.

Now that I've gotten that out of my system, what I would really like
is some input/help into putting together a recovery strategy. As it
happens, I had already scheduled and budgeted for the purchase of 8
additional 6TB hard drives. This was in line with approaching 80%
storage utilization. I've accelerated the purchase of these drives and
now have them in hand. I do not currently have the resources to
purchase a second drive chassis nor any more drives beyond these. This
means I cannot simply copy the entire array, either directly or via
'btrfs restore'.

On a superficial level, what I'd like to do is set up the new drives
as a second array. Copy/move approximately 20TBs of pre-event data
from the degraded array. Delete/remove/free up those 20TBs from the
degraded array. Reduce the number of devices in the degraded array.
Initialized and add those devices to the new array. Wash. Rinse.
Repeat. Eventually, I'd like all the drives in the external drive
chassis to be the new, recovered array. I'd re-purpose the internal
drives in the server for other uses.
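
Concretely, one pass of that cycle would look something like this
{untested sketch; the device names are placeholders, and raid6 on the
new array is just one possible profile}:

mkfs.btrfs -L PublicC -d raid6 -m raid6 /dev/sdac /dev/sdad ...
mount LABEL=PublicC /PublicC
cp -a /PublicB/batch1 /PublicC/           {copy ~20TB of pre-event data}
rm -rf /PublicB/batch1                    {free that space on PublicB}
btrfs device delete /dev/sdb /PublicB     {shrink PublicB by one device}
wipefs -a /dev/sdb
btrfs device add /dev/sdb /PublicC        {grow PublicC with the freed device}

Whether the rm and 'device delete' steps survive on the degraded array
is exactly what I'm unsure about below.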

The potential problem is controlling what happens once I mount the
degraded array in read/write mode to delete copied data and perform
device reduction. I have no clue how to or even if this can be done
safely.

The alternative is to continue to run this array in read only degraded
mode until I can accumulate sufficient funds for a second chassis and
approximately 20 more drives. This probably won't be until Jan 2018.

Is such a recovery strategy even possible? While I would expect a
strategy involving 'btrfs restore' to be possible for raid0, raid1,
and raid10 configured arrays, I don't know that such a strategy will
work for raid56.

As I see it, the key here is to be able to safely delete copied files
and to safely reduce the number of devices in the array.


* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-13 18:49 Encountered kernel bug#72811. Advice on recovery? Ank Ular
@ 2017-04-14  3:47 ` Duncan
  2017-04-14 16:56   ` ronnie sahlberg
  2017-04-14 16:46 ` Chris Murphy
  1 sibling, 1 reply; 10+ messages in thread
From: Duncan @ 2017-04-14  3:47 UTC (permalink / raw)
  To: linux-btrfs

Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:

> I've encountered kernel bug#72811 "If a raid5 array gets into degraded
> mode, gets modified, and the missing drive re-added, the filesystem
> loses state".

> The array normally consists of 22 devices with data and meta in raid6.
> Physically, the devices are split 16 devices in a NORCO DS-24 cage and
> the remaining devices are in the server itself. All the devices are SATA
> III.

> I don't have issues with the above tools not being ready for raid56.
> Despite the mass quantities, none of the data involved is irretrievable,
> irreplaceable or of earth shattering importance on any level. This is a
> purely personal setup.

> As such, I'm not bothered by the 'not ready for prime time status' of
> raid56. This bug, however, is really, really nasty. Once a drive is
> out of sync, it should never be automatically re-added.

> I mention all this because I KNOW someone is going to go off on how I
> should have back ups of everything and how I should not run raid56 and
> how I should run mirrored instead etc. Been there. Done that. I have the
> same canned lecture for people running data centers for businesses.
> 
> I am not a business. This is my personal hobby. The risk does not bother
> me. I don't mind running this setup because I think real life runtimes
> can contribute to the general betterment of btrfs for everyone. I'm not
> in any particular hurry. My income is completely independent from this.

> The potential problem is controlling what happens once I mount the
> degraded array in read/write mode to delete copied data and perform
> device reduction. I have no clue how to or even if this can be done
> safely.
> 
> The alternative is to continue to run this array in read only degraded
> mode until I can accumulate sufficient funds for a second chassis and
> approximately 20 more drives. This probably won't be until Jan 2018.
> 
> Is such a recovery strategy even possible? While I would expect a
> strategy involving 'btrfs restore' to be possible for raid0, raid1,
> raid10 configure arrays, I don't know that such a strategy will work for
> raid56.
> 
> As I see it, the key here is to be able to safely delete copied files
> and to safely reduce the number of devices in the array.

OK, I'm one of the ones that's going to "go off" on you, but FWIW, I 
expect pretty much everyone else would pretty much agree.  At least you 
do have backups. =:^)

I don't think you appreciate just how bad raid56 is ATM.  There are just 
too many REALLY serious bugs like the one you mention with it, and it's 
actively NEGATIVELY recommended here as a result.  It's bad enough with 
even current kernels, and the problems are well known enough to the devs, 
that there's really not a whole lot to test ATM...

Well, unless you're REALLY into building kernels with a whole slew of pre-
merge patches and reporting back the results to the dev working on it, as 
there /are/ a significant number of raid56 patches floating around in a 
pre-merge state here on the list.  Some of them may be in btrfs-next 
already, but I don't believe all of them are.

The problem with that is, despite how willing you may be, you obviously 
aren't running them now.  So you obviously didn't know about the current
really /really/ bad state.  If you're /willing/ to run them and have the
skills to do that sort of patching, etc, including possibly ones that 
won't fix problems, only help further trace them down, then either 
followup with the dev working on it (which I've not tracked specifically 
so I can't tell you who) if he posts a reply, or go looking on the list 
for raid56 patches and get ahold of the dev posting them.

You'll need to get the opinion of the dev as to whether with the patches 
it's worth running yet or not.  I'm not sure if he's thru patching the 
worst of the known issues, or if there's more to go.

One of the big problems is that in the current state, the repair tools, 
scrub, etc, can actively make the problem MUCH worse.  They're simply 
broken.  Normal raid56 runtime has been working for quite a while, so it's
no surprise that has worked for you.  And under specific circumstances, 
pulling a drive and replacing it can work too.  But the problem is, those 
circumstances are precisely the type that people test, but not the type 
that tends to actually happen in the real world.

So effectively, raid56 mode is little more dependable than raid0 mode.  
While you /may/ be able to recover, it's uncertain enough that it's 
better to just treat the array as a raid0, and consider that you may well 
lose everything on it with pretty much any problem at all.  As such, it's 
simply irresponsible to recommend that anyone use it /as/ raid56, which 
is why it's actively NEGATIVELY recommended ATM.  Meanwhile, people that 
want raid0s... tend to configure raid0s, not raid5s or raid6s.

FWIW, I /think/ at least /some/ of the patches have been reviewed and 
cleared for, hopefully, 4.12.  For sure they're not going to make 4.11.  
And I'm not sure all of them will make 4.12, and even if they do, while 
they're certainly a beginning, I'm honestly not sure if they fix the 
known problems well enough yet to slack off on the negativity a bit or 
not.


Meanwhile...  what to do with the current array?

If you are up to patching, get those patches applied and work with the 
dev to see what you can do.

If you want to risk it and aren't up for the level of pre-release 
patching I've described above, yeah, running it read-only until early 
next year should basically maintain state (assuming no hardware actually 
fails, that's the risk part), and with a bit of luck getting those 
patches merged, raid56 may actually be stabilizing a bit by then.  One 
can hope. =:^)

Meanwhile, the good thing about btrfs restore is that it's read-only for 
the filesystem you're trying to recover files from.  If there's enough 
damage that the read-only mount won't let you access all the files and 
you want to try restore, while I don't know its raid56 status, unlike 
scrub and some of the other tools, it's not going to hurt things further.

Of course that's going to require quite some space to do all files, space 
you apparently don't have ATM.  What you /may/ find useful, however, is 
using the array read-only for the files it still gives you, if that's 
most of them, and restoring the few damaged ones using btrfs restore.  
The regex filters
should allow you to do that, tho it may take a bit of fiddling to get the 
format correct.  (AFAIK they're a bit more complex than ordinary regex.)
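
Something like this untested sketch (directory name made up; note that 
each path level apparently has to be matched as an alternation):

# btrfs restore --path-regex '^/(|somedir(|/.*))$' /dev/sdt /mnt/recovery

That should pull just /somedir and everything under it, without 
writing to the source filesystem.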

Of course you also have the just blow it all away and restore from 
backups option, because you /do/ have backups. =:^)  But based on your 
post I'm guessing that's too easy for you, something I can understand as 
I've been there myself, wanting to do it the hard way to learn.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-13 18:49 Encountered kernel bug#72811. Advice on recovery? Ank Ular
  2017-04-14  3:47 ` Duncan
@ 2017-04-14 16:46 ` Chris Murphy
  2017-04-14 16:58   ` Chris Murphy
  1 sibling, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2017-04-14 16:46 UTC (permalink / raw)
  To: Ank Ular; +Cc: Btrfs BTRFS

Summary: 22-device raid6 (data and metadata). One device vanishes, and
the volume is mounted rw,degraded with writes happening; the next time
it's mounted, the formerly missing device is back, so it's a normal
mount, with more writes happening. Then later, the filesystem goes
read-only. Now there are problems; what are the escape routes?



OK the Autopsy Report:

> In my case, I had rebooted my system and one of the drives on my main
> array did not come up. I was able to mount in degraded mode. I needed
> to re-boot the following day. This time, all the drives in the array
> came up. Several hours later, the array went into read only mode.
> That's when I discovered the odd device out had been re-added without
> any kind of error message or notice.

The instant Btrfs complains about something, you cannot make
assumptions; you have to fix it. You can't turn your back on it.
It's an angry goose with an egg nearby, and if you turn your back on
it, it'll beat your ass down. But because this is raid6, you thought
it was OK, a reliable, predictable mule. And you made a lot of
assumptions that are totally reasonable because it's called raid6,
except that those assumptions are all wrong, because Btrfs is not
like anything else, and its raid doesn't work like anything else.



1. The first mount attempt fails. OK why? On Btrfs you must find out
why normal mount failed, because you don't want to use degraded mode
unless absolutely necessary. But you didn't troubleshoot it.

2. The second mount attempt, with degraded, works. This mode exists for
one reason: you are ready right now to add a new device and delete the
missing one. With other raid56s you can wait and just hope another
drive doesn't die. Not Btrfs. You might get one chance with rw,degraded
to do a device replacement, and you have to make 'dev add' and 'dev del
missing' the top priority before writing anything else to the volume.
So if you're not ready to do this, the default first action is
ro,degraded. You can get data off the volume without changing it, and
without burning your chance to use degraded,rw, which has a decent
chance of being a one-time offer. But you didn't do this; you assumed
Btrfs raid56 is OK to use rw,degraded like any other raid.

3. The third mount, you must have mounted with -o degraded right off
the bat, assuming the formerly missing device was still missing and
you'd still need -o degraded. If you'd tried a normal mount, it would
have succeeded, which would have informed you the formerly missing
device had been found and was being used. Now you have normal chunks,
degraded chunks, and more normal chunks. This array is very confused.

4. Btrfs does not do active heals (auto generation limited scrub) when
a previously missing device becomes available again. It only does
passive healing as it encounters wrong or missing data.

5. Btrfs raid6 is obviously broken somehow, because you're not the
only person who has had a file system with all available information
and two copies, and it still breaks. Most of your data is raid6,
that's three copies (data plus two parity). Some of it is degraded
raid6 which is effectively raid5, so that's data plus one copy. And
yet at some point Btrfs gets confused in normal, non-degraded mount,
and splats to read-only. This is definitely a bug. It requires
complete call traces, prior to and including the read-only splat, in a
bug report. Or it simply won't get better. It's unclear where the devs
are at priority wise with raid56, it's also unclear if they're going
to fix it, or rewrite it.


The point is, you made a lot of mistakes by making too many
assumptions, and not realizing that degraded state in Btrfs is
basically an emergency. Finally, at the very end, it still could have
saved you from your own mistakes, but there's a missing feature
(active auto heal to catch up the missing device), and there's a bug
making the fs read-only. And now it's in a sufficiently
non-deterministic state that the repair tools probably can't repair
it.


>
> The practical problem with bug#72811 is that all the csum and transid
> information on the automatically re-added drive is treated as being
> just as valid as the same information on all the other drives.

My guess is that on the first normal mount after degraded writes, the
re-added drive has a new super block with current, valid information
pointing to missing data, and only as it goes looking for the data or
metadata does it start fixing things up. Passive. So its own passive
healing eventually hits a brick wall, the farther backward in time it
has to go to do these fix-ups.

The passive repair works when it's a few bad sectors on the drive. But
when it's piles of missing data, this is the wrong mode. It needs a
limited scrub or balance to fix things. Right now you have to manually
do a full scrub or balance after you've mounted for even one second
using degraded,rw. That's why you want to avoid it at all costs.
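
Concretely, after the add/delete, that means something like one of:

# btrfs scrub start -Bd /PublicB
# btrfs balance start --full-balance /PublicB

(whether either of those completes sanely on raid56 right now, given
the bugs discussed here, is a separate question).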


>
> I don't have issues with the above tools not being ready for
> raid56. Despite the mass quantities, none of the data involved is
> irretrievable, irreplaceable or of earth shattering importance on any
> level. This is a purely personal setup.

I think there's no justification for a 22-drive raid6 on Btrfs. It's
such an extreme use case that I expect something will go wrong, it
will totally betray the user, and there's so much other work that
needs to be done on Btrfs raid56 that it's not even interesting to do
this extreme case as an experiment to try and make Btrfs raid56
better.

Even aside from raid56, even if it were raid1 or 10 or single, it's a
problem. If you're doing snapshots, as Btrfs intends and makes easy
and cost-free, they still come with a cost on such a huge file system.
Balance will take a long time. If it gets into one of these slow
balance states, a scrub or balance can take weeks.

Btrfs has scalability problems other than raid56. Once those are
mostly all fixed maybe the devs announce a plan for raid56 getting
fixed or replaced. Until then I think Btrfs raid56 is not interesting.


> I mention all this because I KNOW someone is going to go off on how I
> should have backups of everything and how I should not run raid56 and
> how I should run mirrored instead etc. Been there. Done that. I have
> the same canned lecture for people running data centers for
> businesses.

As long as you've learned something, it's fine.



>
> Now that I've gotten that out of my system, what I would really like
> is some input/help into putting together a recovery strategy. As it
> happens, I had already scheduled and budgeted for the purchase of 8
> additional 6TB hard drives. This was in line with approaching 80%
> storage utilization. I've accelerated the purchase of these drives and
> now have them in hand. I do not currently have the resources to
> purchase a second drive chassis nor anymore additional drives. This
> means I cannot simply copy the entire array either directly nor via
> 'btrfs restore'.

You've got too much data for the available resources, is what that
says. And that's a case for triage.



> On a superficial level, what I'd like to do is set up the new drives
> as a second array. Copy/move approximately 20TBs of pre-event data
> from the degraded array. Delete/remove/free up those 20TBs from the
> degraded array. Reduce the number of devices in the degraded array.
> Initialize and add those devices to the new array. Wash. Rinse.
> Repeat. Eventually, I'd like all the drives in the external drive
> chassis to be the new, recovered array. I'd re-purpose the internal
> drives in the server for other uses.

OK, but you can't mount normally anymore. It pretty much immediately
goes read-only, either at mount time or shortly thereafter (?).

So it's stuck. You can't modify this volume without risking all the
data on it, in my opinion.




>
> The potential problem is controlling what happens once I mount the
> degraded array in read/write mode to delete copied data and perform
> device reduction. I have no clue how to or even if this can be done
> safely.

Non-deterministic. First of all, it's unclear whether it will delete
files without splatting to read-only. And even if that works, it's
almost certain to splat when you're doing a device delete (and the
ensuing shrink).

If this were a single-chunk setup, it might be possible. But device
delete on raid56 is not easy; it has to do a reshape. All chunks have
to be read in and then written back out.

So maybe what you do is copy off the most important 20TB you can,
because chances are that's all you're going to get off this array
given the limitations you have set. Once that 20TB is copied off, I
think it's not worth it to delete it, because deleting on Btrfs is
COW, and thus you're actually writing. And writing all those deletions
is more change to the file system, when what you want is less change.

The next step, I'd say is convert it to single/raid1.

# btrfs balance start -dconvert=single -mconvert=raid1 /mnt

And then hope to f'n god nothing dies. This is COW, so in theory it
should not get worse. But... there is a better chance that it gets
worse than that it chops off all the crusty stale bad parts of the
raid56 and leaves you with clean single chunks. But once it's single,
it's much, much easier to delete that 20TB and then start deleting
individual devices. Moving single chunks around is very efficient on
Btrfs compared to distributed chunks, where literally every 1GiB chunk
is on 22 drives. Afterwards, a 1GiB chunk is on exactly one drive. So
it will be easy to do exactly what you want. If the convert doesn't
totally eat shit and die, which it probably will.

So back up your 20TB, expecting that it will be the only 20TB you get
off this volume. Choose wisely.

And then convert to single chunks.
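
You can watch the conversion from another shell with:

# btrfs balance status -v /mnt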



> The alternative is to continue to run this array in read only degraded
> mode until I can accumulate sufficient funds for a second chassis and
> approximately 20 more drives.This probably won't be until Jan 2018.


Yeah, that can work. Read-only degraded might even survive another
drive failure, so why not? It's only a year; that'll go by fast.


>
> As I see it, the key here is to be able to safely delete copied files
> and to safely reduce the number of devices in the array.


The only safe option you have is read-only degraded until you have the
resources to make an independent copy. The more you change this
volume, the more likely it becomes irrecoverable, with data loss.



-- 
Chris Murphy


* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-14  3:47 ` Duncan
@ 2017-04-14 16:56   ` ronnie sahlberg
  2017-04-15  1:41     ` Duncan
  0 siblings, 1 reply; 10+ messages in thread
From: ronnie sahlberg @ 2017-04-14 16:56 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Thu, Apr 13, 2017 at 8:47 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:
...
> OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
> expect pretty much everyone else would pretty much agree.  At least you
> do have backups. =:^)
>
> I don't think you appreciate just how bad raid56 is ATM.  There are just
> too many REALLY serious bugs like the one you mention with it, and it's
> actively NEGATIVELY recommended here as a result.  It's bad enough with
> even current kernels, and the problems are well known enough to the devs,
> that there's really not a whole lot to test ATM...

Can we please hide the ability to even create any new raid56
filesystems behind a new flag:

--i-accept-total-data-loss

to make sure that folks are prepared for how risky it currently is?
That should be an easy patch to the userland utilities.
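
i.e. something like this hypothetical invocation (the flag does not
exist today):

mkfs.btrfs -d raid6 -m raid6 --i-accept-total-data-loss /dev/sd[a-f]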


* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-14 16:46 ` Chris Murphy
@ 2017-04-14 16:58   ` Chris Murphy
  0 siblings, 0 replies; 10+ messages in thread
From: Chris Murphy @ 2017-04-14 16:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Ank Ular, Btrfs BTRFS

On Fri, Apr 14, 2017 at 10:46 AM, Chris Murphy <lists@colorremedies.com> wrote:

>
> The passive repair works when it's a few bad sectors on the drive. But
> when it's piles of missing data, this is the wrong mode. It needs a
> limited scrub or balance to fix things. Right now you have to manually
> do a full scrub or balance after you've mounted for even one second
> using degraded,rw. That's why you want to avoid it at all costs.


Small clarification on "right now you have to manually do"

I don't mean YOU personally, with your array. I mean, anyone who
happens to have done even the tiniest amount of writes to a Btrfs
volume while mounted in rw,degraded. Once a new device is added and
the bad/missing device deleted, you still have to manually do a scrub
or balance of the entire array. That's the only way to bring the
array back to normal. It's not automatic.

The way to avoid this is to *immediately*, before any new writes, do a
device add and a device delete missing*. That prevents any degraded
chunks from being written.



* ON non-raid56 volumes, you can use 'btrfs replace'.
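
For example, on such a volume, assuming devid 15 is the dead device
and /dev/sdX is the blank replacement:

# btrfs replace start 15 /dev/sdX /mnt
# btrfs replace status /mnt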



-- 
Chris Murphy


* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-14 16:56   ` ronnie sahlberg
@ 2017-04-15  1:41     ` Duncan
  2017-04-15 23:28       ` Duncan
  2017-04-16  8:01       ` Marat Khalili
  0 siblings, 2 replies; 10+ messages in thread
From: Duncan @ 2017-04-15  1:41 UTC (permalink / raw)
  To: linux-btrfs

ronnie sahlberg posted on Fri, 14 Apr 2017 09:56:30 -0700 as excerpted:

> On Thu, Apr 13, 2017 at 8:47 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Ank Ular posted on Thu, 13 Apr 2017 14:49:41 -0400 as excerpted:
> ...
>> OK, I'm one of the ones that's going to "go off" on you, but FWIW, I
>> expect pretty much everyone else would pretty much agree.  At least you
>> do have backups. =:^)
>>
>> I don't think you appreciate just how bad raid56 is ATM.  There are
>> just too many REALLY serious bugs like the one you mention with it, and
>> it's actively NEGATIVELY recommended here as a result.  It's bad enough
>> with even current kernels, and the problems are well known enough to
>> the devs,
>> that there's really not a whole lot to test ATM...
> 
> Can we please hide the ability to even create any new raid56 filesystems
> behind a new flag:
> 
> --i-accept-total-data-loss
> 
> to make sure that folks are prepared for how risky it currently is? That
> should be an easy patch to the userland utilities.

The biggest problem with such a flag in general is that people often use 
a kernel and userland that are /vastly/ out of sync, version-wise.  Were 
such a flag to be introduced, people would still be seeing it five years 
or more after it no longer applied to the kernel they're using (because 
the kernel's what actually does the work in many cases, including scrub).

Even making such a warning conditional on kernel version is problematic, 
because many distros backport major blocks of code, including perhaps 
btrfs fixes, and the nominally 3.14 or whatever kernel may actually be 
running btrfs and other fixes from 4.14 or later, by the time they 
actually drop support for whatever LTS distro version and quit backporting 
fixes.

Besides which, if the patch was submitted now, the earliest it could 
really hit btrfs-progs would be 4.12, and by the time people actually get 
that in their distro they may well be on 4.13 or 4.15 or whatever, and 
the patches fixing raid56 mode to actually work may already be in place.

The only place such a warning really works is on the wiki at
https://btrfs.wiki.kernel.org , because that's really the only place that 
can be updated to current status in a realistic timeframe.  And there's 
already a feature maturity matrix there, with raid56 mode marked 
appropriately, last I checked.

Meanwhile, it can be argued that admins (and anyone making the choice of 
filesystem and device layout they're going to run is an admin of those 
systems, even if they're just running them at home for their own use) who 
don't care enough about the safety of their data to actually research the 
stability of the filesystem and filesystem features they plan to use... 
really don't value that data very highly in the first place.  And the 
status is out there both on this list and on the wiki, so even a trivial 
google should find it without issue.

Indeed:  https://www.google.com/search?q=btrfs+raid56+stability



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-15  1:41     ` Duncan
@ 2017-04-15 23:28       ` Duncan
  2017-04-15 23:32         ` Hugo Mills
  2017-04-16  8:01       ` Marat Khalili
  1 sibling, 1 reply; 10+ messages in thread
From: Duncan @ 2017-04-15 23:28 UTC (permalink / raw)
  To: linux-btrfs

Duncan posted on Sat, 15 Apr 2017 01:41:28 +0000 as excerpted:

> Besides which, if the patch was submitted now, the earliest it could
> really hit btrfs-progs would be 4.12,

Well, maybe 3.11.x...



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-15 23:28       ` Duncan
@ 2017-04-15 23:32         ` Hugo Mills
  0 siblings, 0 replies; 10+ messages in thread
From: Hugo Mills @ 2017-04-15 23:32 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs


On Sat, Apr 15, 2017 at 11:28:41PM +0000, Duncan wrote:
> Duncan posted on Sat, 15 Apr 2017 01:41:28 +0000 as excerpted:
> 
> > Besides which, if the patch was submitted now, the earliest it could
> > really hit btrfs-progs would be 4.12,
> 
> Well, maybe 3.11.x...

   Can I borrow your time machine? Would last Wednesday be OK?

   Hugo.

-- 
Hugo Mills             | We teach people management skills by examining
hugo@... carfax.org.uk | characters in Shakespeare. You could look at
http://carfax.org.uk/  | Claudius's crisis management techniques, for
PGP: E2AB1DE4          | example.       Richard Smith-Jones, Slings and Arrows



* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-15  1:41     ` Duncan
  2017-04-15 23:28       ` Duncan
@ 2017-04-16  8:01       ` Marat Khalili
  2017-04-16 15:09         ` Duncan
  1 sibling, 1 reply; 10+ messages in thread
From: Marat Khalili @ 2017-04-16  8:01 UTC (permalink / raw)
  To: linux-btrfs

> Even making such a warning conditional on kernel version is 
> problematic, because many distros backport major blocks of code, 
> including perhaps btrfs fixes, and the nominally 3.14 or whatever 
> kernel may actually be running btrfs and other fixes from 4.14 or 
> later, by the time they actually drop support for whatever LTS distro 
> version and quit backporting fixes.

This information could be stored in the kernel and made available to 
usermode tools via some proc file. This would be very useful 
_especially_ considering backporting. Raid56 could be fixed already (or 
not) by the time it is implemented, but no doubt there will still be 
other highly experimental capabilities, judging by how things go. And 
this feature itself could easily be backported.

Some machine-readable readiness level (ok / warning / override flag 
needed / known but disabled in kernel) plus a one-line text message 
displayed to users in cases 2-4 is all we need. If the proc file is 
missing or doesn't contain information about a specific capability, 
tools could default to current behavior (AFAIR there are already 
warnings in some cases). The message should tersely cover any known 
issues, including stability, performance, compatibility and general 
readiness, and may contain links (to the btrfs wiki?) for more 
information. I expect the whole file to easily fit in 512 bytes.
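
A purely illustrative mock-up (no such file exists today; the path and 
format are invented):

$ cat /proc/fs/btrfs/capabilities
raid1    ok
raid56   warning   scrub/repair unreliable, see https://btrfs.wiki.kernel.org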

--

With Best Regards,
Marat Khalili


* Re: Encountered kernel bug#72811. Advice on recovery?
  2017-04-16  8:01       ` Marat Khalili
@ 2017-04-16 15:09         ` Duncan
  0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2017-04-16 15:09 UTC (permalink / raw)
  To: linux-btrfs

Marat Khalili posted on Sun, 16 Apr 2017 11:01:00 +0300 as excerpted:

>> Even making such a warning conditional on kernel version is
>> problematic, because many distros backport major blocks of code,
>> including perhaps btrfs fixes, and the nominally 3.14 or whatever
>> kernel may actually be running btrfs and other fixes from 4.14 or
>> later, by the time they actually drop support for whatever LTS distro
>> version and quit backporting fixes.
> 
> This information could be stored in the kernel and made available to
> usermode tools via some proc file. This would be very useful
> _especially_ considering backporting. Raid56 could be fixed already (or
> not) by the time it is implemented, but no doubt there will still be
> other highly experimental capabilities judging by how things go. And
> this feature itself could easily be backported.

What they /could/ do would be something very similar to what they already 
did for the free-space-tree (as opposed to the free-space-cache, the 
original and still default implementation).

There was a critical bug in the early implementations of free-space-
tree.  But btrfs has incompatibility/feature flags for a reason, and they 
set it up in such a way that the flaw could be detected and fixed.

In theory they could grab another bit from it and make that raid56v2, or 
something similar, and if the raid56 flag is there but not raid56v2, 
warn, etc.
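
(The incompat bits a filesystem carries are already visible from 
userspace, e.g. with

# btrfs inspect-internal dump-super /dev/sdt | grep -i incompat

so a progs-side warning keyed on raid56-without-raid56v2 should be 
cheap enough to do.)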

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


