* Strange behavior when replacing device on BTRFS RAID 5 array.
From: Nick Austin @ 2016-06-27 3:57 UTC (permalink / raw)
To: linux-btrfs
I have a 4 device BTRFS RAID 5 filesystem.
One of the device members of this file system (sdr) had badblocks, so I
decided to replace it.
(I don't have a copy of fi show from before the replace. :-/ )
I ran this command:
sudo btrfs replace start 4 /dev/sdw /mnt/newdata
I had to shrink /dev/sdr by ~250 megs since the replacement drive was a tiny bit
smaller.
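A hedged reconstruction of that shrink step, for reference: the byte size is the "new size" the kernel log reports below, and devid 4 and /mnt/newdata are the ones used throughout this thread. The command is printed rather than executed, since it acts on a live filesystem.

```shell
# Sketch only: run the printed command with sudo once the devid is verified.
devid=4
new_size=6001175121920     # from the kernel log: "new size for /dev/sdr is ..."
mountpoint=/mnt/newdata
echo "btrfs filesystem resize ${devid}:${new_size} ${mountpoint}"
```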
Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): resizing devid 4
Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): new size for /dev/sdr is
6001175121920
Jun 25 17:27:45 frank kernel: BTRFS info (device sdr): dev_replace from /dev/sdr
(devid 4) to /dev/sdw started
The replace started, all seemed well.
3 hours into the replace, sdr dropped off the SATA bus and was redetected
as sdx. Bummer, but shouldn't be fatal.
This event really seemed to throw BTRFS for a loop.
Jun 25 20:32:35 frank kernel: sd 10:0:19:0: device_block, handle(0x0019)
Jun 25 20:33:05 frank kernel: sd 10:0:19:0: device_unblock and setting
to running, handle(0x0019)
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] FAILED Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 25 20:33:05 frank kernel: blk_update_request: I/O error, dev sdr,
sector 1785876480
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing handle(0x0019),
sas_addr(0x500194000687e20e)
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing : enclosure
logical id(0x500194000687e23f), slot(14)
Jun 25 20:33:16 frank kernel: scsi 10:0:21:0: Direct-Access ATA
WL6000GSA12872E 1C01 PQ: 0 ANSI: 5
Here you can see btrfs seems to figure out sdr has become sdx (based on the
"dev /dev/sdx" entry showing up on the BTRFS warning lines).
Unfortunately, all remaining IO for the device formerly known as sdr results in
btrfs errors like the ones listed below. iostat confirms no IO on sdx.
Jun 25 20:33:17 frank kernel: sd 10:0:21:0: [sdx] Attached SCSI disk
...
Jun 25 20:33:20 frank kernel: scrub_handle_errored_block: 31983
callbacks suppressed
Jun 25 20:33:20 frank kernel: BTRFS warning (device sdr): i/o error at
logical 2742536544256 on dev /dev/sdx, sector 1786897488, root 5,
inode 222965, offset 296329216, length 4096, links 1 (path:
/path/to/file)
Jun 25 20:33:20 frank kernel: btrfs_dev_stat_print_on_error: 32107
callbacks suppressed
These messages continue for many hours as the replace continues running.
sudo btrfs replace status /mnt/newdata
Started on 25.Jun 17:27:45, finished on 26.Jun 12:48:22, 0 write errs,
0 uncorr. read errs
...
Jun 26 12:48:22 frank kernel: BTRFS warning (device sdr): lost page
write due to IO error on /dev/sdx (Many, many of these)
Jun 26 12:48:22 frank kernel: BTRFS info (device sdr): dev_replace
from /dev/sdx (devid 4) to /dev/sdw finished
Great! /dev/sdx replaced by /dev/sdw!
Now let's confirm:
sudo btrfs fi show /mnt/newdata
Label: '/var/data' uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
Total devices 4 FS bytes used 8.07TiB
devid 1 size 5.46TiB used 2.70TiB path /dev/sdg
devid 2 size 5.46TiB used 2.70TiB path /dev/sdl
devid 3 size 5.46TiB used 2.70TiB path /dev/sdm
devid 4 size 5.46TiB used 2.70TiB path /dev/sdx
Bummer, this doesn't look right.
sdx is still in the array (failing drive).
Additionally, /dev/sdw isn't listed at all! Worse still, it looks like the
array has lost redundancy (it has 8TiB of data, and that's the amount shown as
used divided by number of devices). It looks like it tried to add the new
device, but did a balance across all of them instead?
% sudo btrfs fi df /mnt/newdata
Data, RAID5: total=8.07TiB, used=8.06TiB
System, RAID10: total=80.00MiB, used=576.00KiB
Metadata, RAID10: total=12.00GiB, used=10.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Any advice would be appreciated.
% uname -a
Linux frank 4.5.5-201.fc23.x86_64 #1 SMP Sat May 21 15:29:49 UTC 2016 x86_64
x86_64 x86_64 GNU/Linux
% lsb_release
Description: Fedora release 24 (Twenty Three)
% btrfs --version
btrfs-progs v4.4.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
From: Nick Austin @ 2016-06-27 4:02 UTC (permalink / raw)
To: linux-btrfs
On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin <nick@smartaustin.com> wrote:
> sudo btrfs fi show /mnt/newdata
> Label: '/var/data' uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
> Total devices 4 FS bytes used 8.07TiB
> devid 1 size 5.46TiB used 2.70TiB path /dev/sdg
> devid 2 size 5.46TiB used 2.70TiB path /dev/sdl
> devid 3 size 5.46TiB used 2.70TiB path /dev/sdm
> devid 4 size 5.46TiB used 2.70TiB path /dev/sdx
It looks like fi show has bad data:
When I start heavy IO on the filesystem (running rsync -c to verify the data),
I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
expected replacement.
I guess some metadata is messed up somewhere?
avg-cpu: %user %nice %system %iowait %steal %idle
25.19 0.00 7.81 28.46 0.00 38.54
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdg 437.00 75168.00 1792.00 75168 1792
sdl 443.00 76064.00 1792.00 76064 1792
sdm 438.00 75232.00 1472.00 75232 1472
sdw 443.00 75680.00 1856.00 75680 1856
sdx 0.00 0.00 0.00 0 0
* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
From: Chris Murphy @ 2016-06-27 17:29 UTC (permalink / raw)
To: Nick Austin; +Cc: Btrfs BTRFS
On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin <nick@smartaustin.com> wrote:
> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin <nick@smartaustin.com> wrote:
>> sudo btrfs fi show /mnt/newdata
>> Label: '/var/data' uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>> Total devices 4 FS bytes used 8.07TiB
>> devid 1 size 5.46TiB used 2.70TiB path /dev/sdg
>> devid 2 size 5.46TiB used 2.70TiB path /dev/sdl
>> devid 3 size 5.46TiB used 2.70TiB path /dev/sdm
>> devid 4 size 5.46TiB used 2.70TiB path /dev/sdx
>
> It looks like fi show has bad data:
>
> When I start heavy IO on the filesystem (running rsync -c to verify the data),
> I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
> expected replacement.
>
> I guess some metadata is messed up somewhere?
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 25.19 0.00 7.81 28.46 0.00 38.54
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdg 437.00 75168.00 1792.00 75168 1792
> sdl 443.00 76064.00 1792.00 76064 1792
> sdm 438.00 75232.00 1472.00 75232 1472
> sdw 443.00 75680.00 1856.00 75680 1856
> sdx 0.00 0.00 0.00 0 0
Some bugs have been reported with 'btrfs replace' and raid56, but I
don't know their exact nature, or when and how they manifest. The
recommended fallback is to use 'btrfs add' and then 'btrfs delete',
but you have other issues going on as well.
Devices dropping off and being renamed is something btrfs, in my
experience, does not handle well at all. The very fact the hardware is
dropping off and coming back is bad, so you really need to get that
sorted out as a prerequisite no matter what RAID technology you're
using.
First advice, make a backup. Don't change the volume further until
you've done this. Each attempt to make the volume healthy again
carries risks of totally breaking it and losing the ability to mount
it. So as long as it's mounted, take advantage of that. Pretend the
very next repair attempt will break the volume, and make your backup
accordingly.
Next is to decide to what degree you want to salvage this volume and
keep using Btrfs raid56 despite the risks (it's still rather
experimental, and in particular some things realized on the list in
the last week make it not recommended, except by people willing to
poke it with a stick and learn how many more bodies can be found in
the current implementation), or whether you just want to migrate it
over to something like XFS on mdadm or LVM raid5 as soon as possible.
There's also the obligatory notice that applies to all Linux software
raid implementations: check whether you have a very common
misconfiguration that increases the chance of data loss if the volume
ever goes degraded and you need to rebuild onto a new drive:
smartctl -l scterc <dev>
cat /sys/block/<dev>/device/timeout
The first value must be less than the second. Note the first value is
in deciseconds while the second is in seconds, and either 'unsupported'
or 'unset' means an indeterminate in-drive recovery time that could be
as high as 180 seconds.
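The unit mismatch above is easy to get backwards, so here is a small sketch that encodes the comparison (the helper name is mine, not a real tool): smartctl reports SCT ERC in deciseconds, while the kernel's block-layer timeout is in whole seconds.

```shell
# erc_ok DRIVE_ERC_DECISECONDS KERNEL_TIMEOUT_SECONDS
# Succeeds when the drive gives up on a bad sector before the kernel's
# command timer expires (the safe configuration).
erc_ok() {
    [ "$1" -lt "$(( $2 * 10 ))" ]
}

# Typical safe setup: a 7.0s drive ERC limit vs. the common 30s kernel timeout.
erc_ok 70 30 && echo "ok: drive gives up before the kernel does"

# ERC 'unsupported'/'unset' can mean up to ~180s of in-drive retries,
# far past a 30s kernel timeout: the link gets reset mid-recovery.
erc_ok 1800 30 || echo "bad: kernel will reset the link first"
```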
--
Chris Murphy
* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
From: Austin S. Hemmelgarn @ 2016-06-27 17:37 UTC (permalink / raw)
To: Chris Murphy, Nick Austin; +Cc: Btrfs BTRFS
On 2016-06-27 13:29, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 10:02 PM, Nick Austin <nick@smartaustin.com> wrote:
>> On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin <nick@smartaustin.com> wrote:
>>> sudo btrfs fi show /mnt/newdata
>>> Label: '/var/data' uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
>>> Total devices 4 FS bytes used 8.07TiB
>>> devid 1 size 5.46TiB used 2.70TiB path /dev/sdg
>>> devid 2 size 5.46TiB used 2.70TiB path /dev/sdl
>>> devid 3 size 5.46TiB used 2.70TiB path /dev/sdm
>>> devid 4 size 5.46TiB used 2.70TiB path /dev/sdx
>>
>> It looks like fi show has bad data:
>>
>> When I start heavy IO on the filesystem (running rsync -c to verify the data),
>> I notice zero IO on the bad drive I told btrfs to replace, and lots of IO to the
>> expected replacement.
>>
>> I guess some metadata is messed up somewhere?
>>
>> avg-cpu: %user %nice %system %iowait %steal %idle
>> 25.19 0.00 7.81 28.46 0.00 38.54
>>
>> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
>> sdg 437.00 75168.00 1792.00 75168 1792
>> sdl 443.00 76064.00 1792.00 76064 1792
>> sdm 438.00 75232.00 1472.00 75232 1472
>> sdw 443.00 75680.00 1856.00 75680 1856
>> sdx 0.00 0.00 0.00 0 0
>
> Some bugs have been reported with 'btrfs replace' and raid56, but I
> don't know their exact nature, or when and how they manifest. The
> recommended fallback is to use 'btrfs add' and then 'btrfs delete',
> but you have other issues going on as well.
One other thing to mention: if the device is failing, _always_ add '-r'
to the replace command line. This tells it to avoid reading from the
device being replaced (in raid1 or raid10 mode it will pull from the
other mirror; in raid5/6 mode it will recompute the block from parity
and compare it to the stored checksums, which in turn means this _will_
be slower on raid5/6 than a regular replace). Link resets and other
issues that cause devices to disappear become more common the more
damaged a disk is, so avoiding reads from it becomes more important
too, because just reading from a disk puts stress on it.
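For the replace in this thread, the '-r' form would look like the sketch below (devid 4, /dev/sdw, and /mnt/newdata are taken from the original post; the command is printed rather than executed since it targets live devices).

```shell
devid=4
new_dev=/dev/sdw
mountpoint=/mnt/newdata
# -r: avoid reading from the failing source device; reconstruct from
# the other raid members instead.  Run the printed command with sudo.
echo "btrfs replace start -r ${devid} ${new_dev} ${mountpoint}"
```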
* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
From: Chris Murphy @ 2016-06-27 17:46 UTC (permalink / raw)
To: Chris Murphy; +Cc: Nick Austin, Btrfs BTRFS
On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
> Next is to decide to what degree you want to salvage this volume and
> keep using Btrfs raid56 despite the risks
Forgot to complete this thought. So if you get a backup, and decide
you want to fix it, I would see if you can cancel the replace using
"btrfs replace cancel <mp>" and confirm that it stops. And now is the
risky part, which is whether to try "btrfs add" and then "btrfs
remove" or remove the bad drive, reboot, and see if it'll mount with
-o degraded, and then use add and remove (in which case you'll use
'remove missing').
With the first, you risk Btrfs still using the flaky bad drive.
With the second, you risk whether a degraded mount will work at all,
and whether any other drive in the array hits a problem while degraded
(like an unrecoverable read error from a single sector).
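The two recovery paths can be summarized in a sketch. Device names (/dev/sdw as the new drive, /dev/sdx as the flaky one, /dev/sdg as a surviving member) and the mountpoint are taken from this thread; nothing is executed here, the plan is only printed for review.

```shell
mountpoint=/mnt/newdata
plan=$(cat <<EOF
btrfs replace cancel ${mountpoint}
# Path 1: add/remove with the flaky drive still attached
btrfs device add /dev/sdw ${mountpoint}
btrfs device remove /dev/sdx ${mountpoint}
# Path 2: pull the bad drive, reboot, mount degraded, then add/remove
mount -o degraded /dev/sdg ${mountpoint}
btrfs device add /dev/sdw ${mountpoint}
btrfs device remove missing ${mountpoint}
EOF
)
echo "$plan"
```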
--
Chris Murphy
* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
From: Duncan @ 2016-06-27 21:12 UTC (permalink / raw)
To: linux-btrfs
Nick Austin posted on Sun, 26 Jun 2016 20:57:32 -0700 as excerpted:
> I have a 4 device BTRFS RAID 5 filesystem.
>
> One of the device members of this file system (sdr) had badblocks, so I
> decided to replace it.
While the others answered the direct question, there's something
potentially more urgent...
Btrfs raid56 mode has some fundamentally serious bugs as currently
implemented, that we are just now finding out how serious they
potentially are. For the details you can read the other active threads
from the last week or so, but the important thing is that...
For the time being, raid56 mode cannot be trusted to repair itself, and
as a result is now highly negative-recommended. Unless you are using
pure testing data that you don't care about whether it lives or dies
(either because it literally /is/ that trivial, or because you have
tested backups, /making/ it that trivial), I'd urgently recommend
either getting your data off it ASAP, or rebalancing to redundant-raid
(raid1 or raid10) instead of parity-raid (5/6), before something worse
happens and you no longer can.
Raid1 mode is a reasonable alternative, as long as your data fits in
the available space. Keep in mind that btrfs raid1 is always exactly
two copies, with more than two devices upping the capacity, not the
redundancy, so three 5.46 TiB devices = 8.19 TiB usable space. Given
your 8+ TiB of data usage, plus metadata and system, that's unlikely
to fit unless you delete some stuff (older snapshots, probably, if you
have them). So you'll need to keep it at four devices of that size.
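The rebalance to raid1 can be sketched as a single conversion balance. This is a sketch, not verbatim from the thread; it assumes you also convert metadata to raid1, though metadata here is already RAID10, which is likewise considered stable. Printed, not executed, since it rewrites the whole array.

```shell
mountpoint=/mnt/newdata
# Convert both data and metadata profiles away from parity raid.
echo "btrfs balance start -dconvert=raid1 -mconvert=raid1 ${mountpoint}"
```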
Btrfs raid10 is also considered as stable as btrfs in general, and would
be doable with 4+ devices, but for various reasons I'll skip for brevity
here (ask if you want them detailed), I'd recommend staying with btrfs
raid1.
Or switch to md- or dm-raid1. Or one other interesting alternative, a
pair of md- or dm-raid0s, on top of which you run btrfs raid1. That
gives you the data integrity of btrfs raid1, with somewhat better speed
than the reasonably stable but as yet unoptimized btrfs raid10.
And of course there's one other alternative, zfs, if you are good with
its hardware requirements and licensing situation.
But I'd recommend btrfs raid1 as the simple choice. It's what I'm using
here (tho on a pair of ssds, so far smaller but faster media, so
different use-case).
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Strange behavior when replacing device on BTRFS RAID 5 array.
From: Steven Haigh @ 2016-06-27 22:29 UTC (permalink / raw)
To: linux-btrfs
On 28/06/16 03:46, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 11:29 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
>>
>> Next is to decide to what degree you want to salvage this volume and
>> keep using Btrfs raid56 despite the risks
>
> Forgot to complete this thought. So if you get a backup, and decide
> you want to fix it, I would see if you can cancel the replace using
> "btrfs replace cancel <mp>" and confirm that it stops. And now is the
> risky part, which is whether to try "btrfs add" and then "btrfs
> remove" or remove the bad drive, reboot, and see if it'll mount with
> -o degraded, and then use add and remove (in which case you'll use
> 'remove missing').
>
> With the first, you risk Btrfs still using the flaky bad drive.
>
> With the second, you risk whether a degraded mount will work at all,
> and whether any other drive in the array hits a problem while degraded
> (like an unrecoverable read error from a single sector).
This is the exact set of circumstances that corrupted my array. I was
using RAID6, yet it still corrupted large portions of things. In
theory, with double parity, it should have survived even if a disk did
go bad - but there we are.
I first started a replace, noted how slow it was going, cancelled the
replace, then did an add / delete - the system crashed and it was all over.
Just as another data point, I've since been flogging the guts out of
the same drives with mdadm RAID6, doing a reshape no less, with no read
errors, system crashes or other problems in over 48 hours.
--
Steven Haigh
Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897