Corrupted data, failed drive(s)

* Corrupted data, failed drive(s)
@ 2021-06-03 16:50 Gaardiolor
  2021-06-03 22:37 ` Chris Murphy
  0 siblings, 1 reply; 6+ messages in thread
From: Gaardiolor @ 2021-06-03 16:50 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I could use some help with some issues I'm having with my drives.

I've got 4 disks in raid1.
--
[17:59:07]root@kiwi:/storage/samba/storage# btrfs filesystem df /storage/
Data, RAID1: total=4.39TiB, used=4.38TiB
System, RAID1: total=32.00MiB, used=720.00KiB
Metadata, RAID1: total=6.00GiB, used=4.66GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

[17:59:10]root@kiwi:/storage/samba/storage# btrfs filesystem show
Label: none  uuid: 8ce9e167-57ea-4cf8-8678-3049ba028c12
         Total devices 4 FS bytes used 4.38TiB
         devid    1 size 3.64TiB used 3.10TiB path /dev/sdc
         devid    2 size 3.64TiB used 3.14TiB path /dev/sdb
         devid    3 size 1.82TiB used 1.32TiB path /dev/sda
         devid    4 size 1.82TiB used 1.21TiB path /dev/sdd
--

I'm having some issues with faulty disk(s). /dev/sdd is bad for sure, 
SMART is complaining.
--
# smartctl -aq errorsonly /dev/sdd
ATA Error Count: 108 (device log contains only the most recent five errors)
Error 108 occurred at disk power-on lifetime: 47563 hours (1981 days + 
19 hours)
Error 107 occurred at disk power-on lifetime: 47563 hours (1981 days + 
19 hours)
Error 106 occurred at disk power-on lifetime: 47563 hours (1981 days + 
19 hours)
Error 105 occurred at disk power-on lifetime: 47563 hours (1981 days + 
19 hours)
Error 104 occurred at disk power-on lifetime: 47563 hours (1981 days + 
19 hours)
--

Also in /var/log/messages:
--
Jun  3 17:47:21 kiwi smartd[1112]: Device: /dev/sdd [SAT], 3088 
Currently unreadable (pending) sectors
Jun  3 17:47:21 kiwi smartd[1112]: Device: /dev/sdd [SAT], 3088 Offline 
uncorrectable sectors
--

However, the other disks also generate errors.
--
[18:00:35]root@kiwi:/storage/samba/storage# btrfs device stats /dev/sda
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  408
[/dev/sda].generation_errs  0
[18:00:39]root@kiwi:/storage/samba/storage# btrfs device stats /dev/sdb
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  322
[/dev/sdb].generation_errs  0
[18:00:42]root@kiwi:/storage/samba/storage# btrfs device stats /dev/sdc
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  1283
[/dev/sdc].generation_errs  0
[18:00:43]root@kiwi:/storage/samba/storage# btrfs device stats /dev/sdd
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     1582
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  1310
[/dev/sdd].generation_errs  0
-

/dev/sdd is the only one with read_io_errs.

I've tried unpacking a .tar.gz from /storage to another filesystem, but 
the tar.gz was obviously corrupt. Very strange filenames which were, 
because of the name, pretty difficult to remove. I will not post the 
filenames here, it'd probably crash the internet. I'm also getting:
--
gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
--

I can't btrfs remove /dev/sdd . The command below ran for a while (I 
could see the allocated space of /dev/sdd decrease with btrfs fi us 
/storage/), but then errored:
--
root@kiwi:~# btrfs device remove /dev/sdd /storage/
  ERROR: error removing device '/dev/sdd': Input/output error
--

I have a couple of questions:

1) Unpacking some .tar.gz files from /storage resulted in files with 
weird names, data was unusable. But, it's raid1. Why is my data corrupt, 
I've read that BTRFS checks the checksum on read ?
2) Are all my 4 drives faulty because of the corruption_errs ? If so, 4 
faulty drives is somewhat unusual. Any other possibilities ?
3) Given that
- I can't 'btrfs device remove' the device
- I do not have a free SATA port
- I'd prefer a method that doesn't unnecessarily take a very long time

What's the best way to migrate to a different device ? I'm guessing, 
after doing some reading:
- shutdown
- physically remove faulty disk
- boot
- verify /dev/sdd is missing, and that I've removed the correct disk
- shutdown
- connect new disk, it will also be /dev/sdd, because I have no other 
free SATA port
- boot
- check that the new disk is /dev/sdd
- mount -o degraded /dev/sda /storage
- btrfs replace start 4 /dev/sdd /storage
- btrfs balance /storage

Is this correct, should this also check / fix errors, if not, what's the 
best approach.. Thanks!

Gaardiolor

^ permalink raw reply	[flat|nested] 6+ messages in thread