All of lore.kernel.org
 help / color / mirror / Atom feed
* raid1 has failing disks, but smart is clear
@ 2016-07-06 22:14 Corey Coughlin
  2016-07-06 22:59 ` Tomasz Kusmierz
  0 siblings, 1 reply; 13+ messages in thread
From: Corey Coughlin @ 2016-07-06 22:14 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi all,
     Hoping you all can help; I have a strange problem, and I think I 
know what's going on, but I could use some verification.  I set up a 
raid1-type btrfs filesystem on an Ubuntu 16.04 system; here's what it 
looks like:

btrfs fi show
Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
     Total devices 10 FS bytes used 3.42TiB
     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
     devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh

I added a couple of disks and then ran a balance operation, which took 
about 3 days to finish.  When it did finish, I tried a scrub and got 
this message:

scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
     scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 
01:16:35
     total bytes scrubbed: 926.45GiB with 18849935 errors
     error details: read=18849935
     corrected errors: 5860, uncorrectable errors: 18844075, unverified 
errors: 0

So that seems bad.  I took a look at the devices, and a few of them have 
errors:
...
[/dev/sdi].generation_errs 0
[/dev/sdj].write_io_errs   289436740
[/dev/sdj].read_io_errs    289492820
[/dev/sdj].flush_io_errs   12411
[/dev/sdj].corruption_errs 0
[/dev/sdj].generation_errs 0
[/dev/sdg].write_io_errs   0
...
[/dev/sda].generation_errs 0
[/dev/sdb].write_io_errs   3490143
[/dev/sdb].read_io_errs    111
[/dev/sdb].flush_io_errs   268
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdh].write_io_errs   5839
[/dev/sdh].read_io_errs    2188
[/dev/sdh].flush_io_errs   11
[/dev/sdh].corruption_errs 1
[/dev/sdh].generation_errs 16373
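
For reference, the per-device counters above are what 'btrfs device 
stats' prints for the mounted filesystem; /mnt/pool below is just a 
placeholder for wherever it is actually mounted:

btrfs device stats /mnt/pool      # show the per-device error counters
btrfs device stats -z /mnt/pool   # reset them once the hardware is sorted out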

So I checked the SMART data for those disks, and it looks perfect: no 
reallocated sectors, no problems.  But one thing I did notice is that 
they are all WD Green drives.  So I'm guessing that if they power down 
and get reassigned to a new /dev/sd* letter, that could lead to data 
corruption.  I used idle3ctl to disable the idle (head-parking) timer on 
all the green drives in the system, but I'm having trouble getting the 
filesystem working without errors.  I tried a 'check --repair' command 
on it, and it seems to find a lot of verification errors, but it doesn't 
look like things are getting fixed.  I have all the data on it backed up 
on another system, though, so I can recreate this if I need to.  Here's 
what I want to know:

1.  Am I correct about the issues with the WD Green drives, if they 
change mounts during disk operations, will that corrupt data?
2.  If that is the case:
     a.) Is there any way I can stop the /dev/sd* mount points from 
changing?  Or can I set up the filesystem using UUIDs or something more 
solid?  I googled about it, but found conflicting info
     b.) Or, is there something else changing my drive devices?  I have 
most of drives on an LSI SAS 9201-16i card, is there something I need to 
do to make them fixed?
     c.) Or, is there a script or something I can use to figure out if 
the disks will change mounts?
     d.) Or, if I wipe everything and rebuild, will the disks with the 
idle3ctl fix work now?

Regardless of whether or not it's a WD Green drive issue, should I just 
wipefs all the disks and rebuild it?  Is there any way to recover this?  
Thanks for any help!


     ------- Corey

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-06 22:14 raid1 has failing disks, but smart is clear Corey Coughlin
@ 2016-07-06 22:59 ` Tomasz Kusmierz
  2016-07-07  6:40   ` Corey Coughlin
  2016-07-07 11:58   ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 13+ messages in thread
From: Tomasz Kusmierz @ 2016-07-06 22:59 UTC (permalink / raw)
  To: Corey Coughlin; +Cc: Btrfs BTRFS


> On 6 Jul 2016, at 23:14, Corey Coughlin <corey.coughlin.cc3@gmail.com> wrote:
> 
> Hi all,
>    Hoping you all can help, have a strange problem, think I know what's going on, but could use some verification.  I set up a raid1 type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:
> 
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>    Total devices 10 FS bytes used 3.42TiB
>    devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>    devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>    devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>    devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>    devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>    devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>    devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>    devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>    devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>    devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> 
> I added a couple disks, and then ran a balance operation, and that took about 3 days to finish.  When it did finish, tried a scrub and got this message:
> 
> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>    total bytes scrubbed: 926.45GiB with 18849935 errors
>    error details: read=18849935
>    corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
> 
> So that seems bad.  Took a look at the devices and a few of them have errors:
> ...
> [/dev/sdi].generation_errs 0
> [/dev/sdj].write_io_errs   289436740
> [/dev/sdj].read_io_errs    289492820
> [/dev/sdj].flush_io_errs   12411
> [/dev/sdj].corruption_errs 0
> [/dev/sdj].generation_errs 0
> [/dev/sdg].write_io_errs   0
> ...
> [/dev/sda].generation_errs 0
> [/dev/sdb].write_io_errs   3490143
> [/dev/sdb].read_io_errs    111
> [/dev/sdb].flush_io_errs   268
> [/dev/sdb].corruption_errs 0
> [/dev/sdb].generation_errs 0
> [/dev/sdh].write_io_errs   5839
> [/dev/sdh].read_io_errs    2188
> [/dev/sdh].flush_io_errs   11
> [/dev/sdh].corruption_errs 1
> [/dev/sdh].generation_errs 16373
> 
> So I checked the smart data for those disks, they seem perfect, no reallocated sectors, no problems.  But one thing I did notice is that they are all WD Green drives.  So I'm guessing that if they power down and get reassigned to a new /dev/sd* letter, that could lead to data corruption.  I used idle3ctl to turn off the shut down mode on all the green drives in the system, but I'm having trouble getting the filesystem working without the errors.  I tried a 'check --repair' command on it, and it seems to find a lot of verification errors, but it doesn't look like things are getting fixed.
>  But I have all the data on it backed up on another system, so I can recreate this if I need to.  But here's what I want to know:
> 
> 1.  Am I correct about the issues with the WD Green drives, if they change mounts during disk operations, will that corrupt data?
I just wanted to chip in about WD Green drives. I have a RAID10 running on 6x2TB of those, and have had for ~3 years. If a disk spins down and you try to access something, the kernel & FS & whole system will wait for the drive to spin back up, and everything works OK. I’ve never had a drive get reassigned to a different /dev/sdX due to spin down / up. 
2 years ago I was getting corruption due to not using ECC RAM on my system: one of the RAM modules started producing errors that were never caught by the CPU / MoBo. Long story short, a guy here managed to point me in the right direction and I started shifting my data to a hopefully new and uncorrupted FS … but I was sceptical because of an issue similar to the one you describe, AND I was running raid1, so while it was mounted I moved a disk from one SATA port to another, and the FS picked the disk up in its new location and didn't even blink (as far as I remember there was a syslog entry saying that the disk vanished and then that it was added).

Last word: you've got plenty of errors in your SMART for transfer-related stuff; please be advised that this may mean:
- a faulty cable
- a faulty mobo controller
- a faulty drive controller
- bad RAM - yes, the motherboard CAN use your RAM for storing data and transfer-related stuff … especially the cheaper ones. 

> 2.  If that is the case:
>    a.) Is there any way I can stop the /dev/sd* mount points from changing?  Or can I set up the filesystem using UUIDs or something more solid?  I googled about it, but found conflicting info
Don’t take it the wrong way, but I’m personally surprised that anybody still uses device names rather than UUIDs. Devices change from boot to boot for a lot of people, and most distros moved to UUIDs a couple of years ago; even swap is mounted via UUID now.
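
For example, a typical fstab entry nowadays already looks like this (the /data mount point is just an illustration; the UUID is the one your filesystem reports):

UUID=597ee185-36ac-4b68-8961-d4adc13f95d4  /data  btrfs  defaults  0  0

and, with the usual udev device scan, btrfs will assemble all member devices of that UUID no matter which /dev/sd* letters they ended up with.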

>    b.) Or, is there something else changing my drive devices?  I have most of drives on an LSI SAS 9201-16i card, is there something I need to do to make them fixed?
I’ll let more senior data storage experts speak up, but most of the time people have frowned on me for mentioning anything other than the north bridge / Intel RAID cards / Super Micro / 3ware. 

(And yes, I found out the hard way that they were right:
- a Marvell controller on my mobo randomly wrote garbage to my drives 
- an Adaptec PCI Express card was switching off all the drives mid-flight while pretending “it’s OK”, resulting in very peculiar data losses in the middle of big files)

>    c.) Or, is there a script or something I can use to figure out if the disks will change mounts?
>    d.) Or, if I wipe everything and rebuild, will the disks with the idle3ctl fix work now?
> 
> Regardless of whether or not it's a WD Green drive issue, should I just wipefs all the disks and rebuild it?  Is there any way to recover this?  Thanks for any help!

IF you remotely care about the data that you have (I think you should, if you came here), I would suggest a good exercise: 
- unplug all the drives you use for this filesystem and stop toying with it, because you may lose more data (I did, because I thought I knew better)
- get yourself 2 new drives
- find my thread from ~2 years ago on this mailing list (it might be under a different email address)
- try to locate Chris Mason's reply with a script, “my old friend”
- run this script on your system for a couple of DAYS and you will see whether you have any corruption creeping in (a rough sketch of the idea follows below)
- if corruption is creeping in, change a component in your system (controller / RAM / mobo / CPU / PSU) and repeat the exercise (best to organise yourself access to some spare parts / an extra machine)
- when all is good, make an FS out of those 2 new drives, and try to rescue data from the OLD FS!
- unplug the new FS and put it on the shelf
- try to fix the old FS … this will be a FUN and very educational exercise … 
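
A minimal sketch of the idea (this is NOT the original script from that thread, just the same write / re-read / compare loop; /mnt/scratch is a placeholder for a throwaway btrfs mount, and dropping the page cache needs root):

#!/bin/bash
# Keep writing random data, force it out to disk, re-read it and compare checksums.
set -e
DIR=/mnt/scratch/stress
mkdir -p "$DIR"
pass=0
while true; do
    pass=$((pass + 1))
    for i in $(seq 1 32); do
        dd if=/dev/urandom of="$DIR/f$i" bs=1M count=64 status=none
    done
    sync
    sha1sum "$DIR"/f* > /tmp/sums.before
    echo 3 > /proc/sys/vm/drop_caches      # make the second read come from disk, not page cache
    sha1sum "$DIR"/f* > /tmp/sums.after
    if ! diff -q /tmp/sums.before /tmp/sums.after > /dev/null; then
        echo "pass $pass: CHECKSUM MISMATCH at $(date)"
        diff /tmp/sums.before /tmp/sums.after
        exit 1
    fi
    echo "pass $pass clean"
done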

> 
>    ------- Corey
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-06 22:59 ` Tomasz Kusmierz
@ 2016-07-07  6:40   ` Corey Coughlin
  2016-07-08  1:24     ` Duncan
  2016-07-09  5:40     ` Andrei Borzenkov
  2016-07-07 11:58   ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-07  6:40 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi Tomasz,
     Thanks for the response!  I should clear some things up, though.

On 07/06/2016 03:59 PM, Tomasz Kusmierz wrote:
>> On 6 Jul 2016, at 23:14, Corey Coughlin <corey.coughlin.cc3@gmail.com> wrote:
>>
>> Hi all,
>>     Hoping you all can help, have a strange problem, think I know what's going on, but could use some verification.  I set up a raid1 type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:
>>
>> btrfs fi show
>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>     Total devices 10 FS bytes used 3.42TiB
>>     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>     devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
Now when I say that the drives' mount points change, I'm not saying they 
change when I reboot.  They change while the system is running.  For 
instance, here's the fi show output after I ran a "check --repair" this 
afternoon:

btrfs fi show
Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
     Total devices 10 FS bytes used 3.42TiB
     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
     devid    6 size 1.82TiB used 823.03GiB path /dev/sds
     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh

Notice that /dev/sdj in the previous run changed to /dev/sds.  There was 
no reboot, the mount just changed.  I don't know why that is happening, 
but it seems like the majority of the errors are on that drive.  But 
given that I've fixed the start/stop issue on that disk, it probably 
isn't a WD Green issue.

>>
>> I added a couple disks, and then ran a balance operation, and that took about 3 days to finish.  When it did finish, tried a scrub and got this message:
>>
>> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>>     scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>>     total bytes scrubbed: 926.45GiB with 18849935 errors
>>     error details: read=18849935
>>     corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
>>
>> So that seems bad.  Took a look at the devices and a few of them have errors:
>> ...
>> [/dev/sdi].generation_errs 0
>> [/dev/sdj].write_io_errs   289436740
>> [/dev/sdj].read_io_errs    289492820
>> [/dev/sdj].flush_io_errs   12411
>> [/dev/sdj].corruption_errs 0
>> [/dev/sdj].generation_errs 0
>> [/dev/sdg].write_io_errs   0
>> ...
>> [/dev/sda].generation_errs 0
>> [/dev/sdb].write_io_errs   3490143
>> [/dev/sdb].read_io_errs    111
>> [/dev/sdb].flush_io_errs   268
>> [/dev/sdb].corruption_errs 0
>> [/dev/sdb].generation_errs 0
>> [/dev/sdh].write_io_errs   5839
>> [/dev/sdh].read_io_errs    2188
>> [/dev/sdh].flush_io_errs   11
>> [/dev/sdh].corruption_errs 1
>> [/dev/sdh].generation_errs 16373
>>
>> So I checked the smart data for those disks, they seem perfect, no reallocated sectors, no problems.  But one thing I did notice is that they are all WD Green drives.  So I'm guessing that if they power down and get reassigned to a new /dev/sd* letter, that could lead to data corruption.  I used idle3ctl to turn off the shut down mode on all the green drives in the system, but I'm having trouble getting the filesystem working without the errors.  I tried a 'check --repair' command on it, and it seems to find a lot of verification errors, but it doesn't look like things are getting fixed.
>>   But I have all the data on it backed up on another system, so I can recreate this if I need to.  But here's what I want to know:
>>
>> 1.  Am I correct about the issues with the WD Green drives, if they change mounts during disk operations, will that corrupt data?
> I just wanted to chip in with WD Green drives. I have a RAID10 running on 6x2TB of those, actually had for ~3 years. If disk goes down for spin down, and you try to access something - kernel & FS & whole system will wait for drive to re-spin and everything works OK. I’ve never had a drive being reassigned to different /dev/sdX due to spin down / up.
> 2 years ago I was having a corruption due to not using ECC ram on my system and one of RAM modules started producing errors that were never caught up by CPU / MoBo. Long story short, guy here managed to point me to the right direction and I started shifting my data to hopefully new and not corrupted FS … but I was sceptical of similar issue that you have described AND I did raid1 and while mounted I did shift disk from one SATA port to another and FS managed to pick up the disk in new location and did not even blinked (as far as I remember there was syslog entry to say that disk vanished and then that disk was added)
>
> Last word, you got plenty of errors in your SMART for transfer related stuff, please be advised that this may mean:
> - faulty cable
> - faulty mono controller
> - faulty drive controller
> - bad RAM - yes, mother board CAN use your ram for storing data and transfer related stuff … specially chapter ones.
OK, I'll see if I can narrow things down to a faulty component or 
memory.  It could definitely be the drive controller or sas/sata cable.
>> 2.  If that is the case:
>>     a.) Is there any way I can stop the /dev/sd* mount points from changing?  Or can I set up the filesystem using UUIDs or something more solid?  I googled about it, but found conflicting info
> Don’t get it the wrong way but I’m personally surprised that anybody still uses mount points rather than UUID. Devices change from boot to boot for a lot of people and most of distros moved to uuid (2 years ago ? even the swap is mounted via UUID now)
Well yeah, if I was mounting all the disks to different mount points, I 
would definitely use UUIDs to get them mounted.  But I haven't seen any 
way to set up a "mkfs.btrfs" command to use UUID or anything else for 
individual drives.  Am I missing something?  I've been doing a lot of 
googling.

>
>>     b.) Or, is there something else changing my drive devices?  I have most of drives on an LSI SAS 9201-16i card, is there something I need to do to make them fixed?
> I’ll let more senior data storage experts to speak up but most of the time people frowned on me for mentioning anything different than north bridge / Intel raid card / super micro / 3 ware .
>
> (And yes I did found the hard way they were right:
> - marvel controller on my mobo randomly writes garbage to your drives
> - adapted PCI express card was switching of all the drives mid flight while pretending “it’s OK” resulting in very peculiar data losses in the middle of big files)
Hmm.... good to know.  Might see if I can find a more reliable sata 
controller card.
>
>>     c.) Or, is there a script or something I can use to figure out if the disks will change mounts?
>>     d.) Or, if I wipe everything and rebuild, will the disks with the idle3ctl fix work now?
>>
>> Regardless of whether or not it's a WD Green drive issue, should I just wipefs all the disks and rebuild it?  Is there any way to recover this?  Thanks for any help!
> IF you remotelly care about the data that you have (I think you should if you came here), I would suggest a good exercise:
> - unplug all the drives you use for this file system and stop toying with it because you may loose more data (I did because I thought I knew better)
> - get your self 2 new drives
> - find my thread from ~2 years ago on this mailing list (might be different email address)
> - try to locate Chris Mason reply with a script “my old friend”
> - run this script on you system for couple of DAYS and you will see whenever you have any corruption creeping in
> - if corruptions are creeping in, change a component in your system (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to organise your self access to some spare parts / extra machine.
> - when all is good, make and FS out of those 2 new drives, and try to rescue data from OLD FS !
> - unplug new FS and put it one the shelf
> - try to fix old FS … this will be a FUN and very educating exercise …
I'm not worried about the data, it's all backed up.  But ok, I can get a 
couple of drives and start testing stuff.  I think I found the script 
you meant (stress.sh I'm hoping) so I can get that running. But as far 
as the old fs goes, I can just wipe that and start fresh.  Thanks for 
all the tips!

     ------- Corey

>
>>     ------- Corey
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-06 22:59 ` Tomasz Kusmierz
  2016-07-07  6:40   ` Corey Coughlin
@ 2016-07-07 11:58   ` Austin S. Hemmelgarn
  2016-07-08  4:50     ` Corey Coughlin
  1 sibling, 1 reply; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-07 11:58 UTC (permalink / raw)
  To: Tomasz Kusmierz, Corey Coughlin; +Cc: Btrfs BTRFS

On 2016-07-06 18:59, Tomasz Kusmierz wrote:
>
>> On 6 Jul 2016, at 23:14, Corey Coughlin <corey.coughlin.cc3@gmail.com> wrote:
>>
>> Hi all,
>>    Hoping you all can help, have a strange problem, think I know what's going on, but could use some verification.  I set up a raid1 type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks like:
>>
>> btrfs fi show
>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>    Total devices 10 FS bytes used 3.42TiB
>>    devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>    devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>    devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>    devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>    devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>    devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>    devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>    devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>    devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>    devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>>
>> I added a couple disks, and then ran a balance operation, and that took about 3 days to finish.  When it did finish, tried a scrub and got this message:
>>
>> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>>    total bytes scrubbed: 926.45GiB with 18849935 errors
>>    error details: read=18849935
>>    corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
>>
>> So that seems bad.  Took a look at the devices and a few of them have errors:
>> ...
>> [/dev/sdi].generation_errs 0
>> [/dev/sdj].write_io_errs   289436740
>> [/dev/sdj].read_io_errs    289492820
>> [/dev/sdj].flush_io_errs   12411
>> [/dev/sdj].corruption_errs 0
>> [/dev/sdj].generation_errs 0
>> [/dev/sdg].write_io_errs   0
>> ...
>> [/dev/sda].generation_errs 0
>> [/dev/sdb].write_io_errs   3490143
>> [/dev/sdb].read_io_errs    111
>> [/dev/sdb].flush_io_errs   268
>> [/dev/sdb].corruption_errs 0
>> [/dev/sdb].generation_errs 0
>> [/dev/sdh].write_io_errs   5839
>> [/dev/sdh].read_io_errs    2188
>> [/dev/sdh].flush_io_errs   11
>> [/dev/sdh].corruption_errs 1
>> [/dev/sdh].generation_errs 16373
>>
>> So I checked the smart data for those disks, they seem perfect, no reallocated sectors, no problems.  But one thing I did notice is that they are all WD Green drives.  So I'm guessing that if they power down and get reassigned to a new /dev/sd* letter, that could lead to data corruption.  I used idle3ctl to turn off the shut down mode on all the green drives in the system, but I'm having trouble getting the filesystem working without the errors.  I tried a 'check --repair' command on it, and it seems to find a lot of verification errors, but it doesn't look like things are getting fixed.
>>  But I have all the data on it backed up on another system, so I can recreate this if I need to.  But here's what I want to know:
>>
>> 1.  Am I correct about the issues with the WD Green drives, if they change mounts during disk operations, will that corrupt data?
> I just wanted to chip in with WD Green drives. I have a RAID10 running on 6x2TB of those, actually had for ~3 years. If disk goes down for spin down, and you try to access something - kernel & FS & whole system will wait for drive to re-spin and everything works OK. I’ve never had a drive being reassigned to different /dev/sdX due to spin down / up.
> 2 years ago I was having a corruption due to not using ECC ram on my system and one of RAM modules started producing errors that were never caught up by CPU / MoBo. Long story short, guy here managed to point me to the right direction and I started shifting my data to hopefully new and not corrupted FS … but I was sceptical of similar issue that you have described AND I did raid1 and while mounted I did shift disk from one SATA port to another and FS managed to pick up the disk in new location and did not even blinked (as far as I remember there was syslog entry to say that disk vanished and then that disk was added)
>
> Last word, you got plenty of errors in your SMART for transfer related stuff, please be advised that this may mean:
> - faulty cable
> - faulty mono controller
> - faulty drive controller
> - bad RAM - yes, mother board CAN use your ram for storing data and transfer related stuff … specially chapter ones.
It's worth pointing out that the most likely point for data corruption 
here, assuming the cable and controllers are OK, is during the DMA 
transfer from system RAM to the drive controller.  Even when dealing 
with really good HBA's that have an on-board NVRAM cache, you still have 
to copy the data out of system RAM at some point, and that's usually 
when the corruption occurs if the problem is with the RAM, CPU, or MB.
>
>> 2.  If that is the case:
>>    a.) Is there any way I can stop the /dev/sd* mount points from changing?  Or can I set up the filesystem using UUIDs or something more solid?  I googled about it, but found conflicting info
> Don’t get it the wrong way but I’m personally surprised that anybody still uses mount points rather than UUID. Devices change from boot to boot for a lot of people and most of distros moved to uuid (2 years ago ? even the swap is mounted via UUID now)
>
Provided there are no changes in hardware configuration, device nodes 
(which is what /dev/sd* are, not mount points) remain relatively stable 
across boots.  My personal recommendation would be to mount by label, not 
UUID, and give your filesystems relatively unique names (in my case, I 
have a short identifier to ID the system the FS is for, followed by a 
tag identifying where in the hierarchy it gets mounted).  Mounting by 
UUID works, until you have to recreate the FS from scratch and restore a 
backup, because then you have a different UUID.
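For instance (the label and mount point here are just examples):

  btrfs filesystem label /dev/sdd myhost-data   # tag an existing FS (unmounted device or mount point)

and then in /etc/fstab:

  LABEL=myhost-data  /srv/data  btrfs  defaults  0  0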
>>    b.) Or, is there something else changing my drive devices?  I have most of drives on an LSI SAS 9201-16i card, is there something I need to do to make them fixed?
> I’ll let more senior data storage experts to speak up but most of the time people frowned on me for mentioning anything different than north bridge / Intel raid card / super micro / 3 ware .
>
> (And yes I did found the hard way they were right:
> - marvel controller on my mobo randomly writes garbage to your drives
> - adapted PCI express card was switching of all the drives mid flight while pretending “it’s OK” resulting in very peculiar data losses in the middle of big file.
Based on personal experience:
1. LSI Logic, Super Micro, Intel, and 3 Ware all generally have very 
high quality HBA's (both RAID and non-RAID)
2. HighPoint and ASMedia (ASUS's internal semiconductor subsidiary) are 
decent if you get recent cards.
3. Marvell, Adaptec, and Areca are hit or miss, some work great, some 
are horrible.
4. I have never had a JMicron based card that worked reliably.
5. If it's in a server and branded by the manufacturer of that server, 
it's probably a good adapter.
>
>>    c.) Or, is there a script or something I can use to figure out if the disks will change mounts?
>>    d.) Or, if I wipe everything and rebuild, will the disks with the idle3ctl fix work now?
>>
>> Regardless of whether or not it's a WD Green drive issue, should I just wipefs all the disks and rebuild it?  Is there any way to recover this?  Thanks for any help!
>
> IF you remotelly care about the data that you have (I think you should if you came here), I would suggest a good exercise:
> - unplug all the drives you use for this file system and stop toying with it because you may loose more data (I did because I thought I knew better)
> - get your self 2 new drives
> - find my thread from ~2 years ago on this mailing list (might be different email address)
> - try to locate Chris Mason reply with a script “my old friend”
> - run this script on you system for couple of DAYS and you will see whenever you have any corruption creeping in
> - if corruptions are creeping in, change a component in your system (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to organise your self access to some spare parts / extra machine.
> - when all is good, make and FS out of those 2 new drives, and try to rescue data from OLD FS !
> - unplug new FS and put it one the shelf
> - try to fix old FS … this will be a FUN and very educating exercise …
FWIW, based on what's been said, I'm almost certain it's a hardware 
issue, and would give better than 50/50 odds that it's bad RAM or a bad 
storage controller.  In general, here's the order I'd swap things in if 
you have spares:
1. RAM (If you find that RAM is the issue, try each individual module 
separately to figure out which one is bad, if they all appear bad, it's 
either a power issue or something is wrong with your MB).
2. Storage Controller
3. PSU (if you have a multi-meter and a bit of wire, or a PSU tester, 
you can check this without needing a spare).
4. MB
5. CPU

Other things to consider regarding power:
1. If you're using a UPS, make sure it lists either 'True sine wave 
output' or 'APFC' support; many cheap ones produce a quantized sine wave 
output, which causes issues with many modern computer PSU's.
2. If you're not using a UPS, you may have issues with what's 
colloquially known as 'dirty' power.  In signal theory terms, this means 
that you've got a noise source mixed in with the 50/60Hz sine wave 
that's supposed to be present on line power.  This is actually a 
surprisingly common issue in many parts of the world because of the 
condition of the transmission lines and the power plant itself. 
Somewhat ironically, one of the most reliable ways to deal with this is 
to get a UPS (and if you do, make sure to look for one that meets what 
I said above).
Both issues can manifest almost identically to having bad RAM or a bad 
PSU, but they're often expensive to fix, and in the second case, not 
easy to test for.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-07  6:40   ` Corey Coughlin
@ 2016-07-08  1:24     ` Duncan
  2016-07-08  4:51       ` Corey Coughlin
  2016-07-09  5:51       ` Andrei Borzenkov
  2016-07-09  5:40     ` Andrei Borzenkov
  1 sibling, 2 replies; 13+ messages in thread
From: Duncan @ 2016-07-08  1:24 UTC (permalink / raw)
  To: linux-btrfs

Corey Coughlin posted on Wed, 06 Jul 2016 23:40:30 -0700 as excerpted:

> Well yeah, if I was mounting all the disks to different mount points, I
> would definitely use UUIDs to get them mounted.  But I haven't seen any
> way to set up a "mkfs.btrfs" command to use UUID or anything else for
> individual drives.  Am I missing something?  I've been doing a lot of
> googling.

FWIW, you can use the /dev/disk/by-*/* symlinks (as normally setup by 
udev) to reference various devices.

Of course, because the identifiers behind by-uuid and by-label are per-
filesystem, those will normally only identify one device of a multi-
device filesystem, but the by-id links are based on the device serial 
number and partition number, and if you are using GPT partitioning, you 
have by-partuuid and (if you set them when setting up the partitions) 
by-partlabel as well.
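
For instance, something like this (the model/serial strings below are 
invented):

  ls -l /dev/disk/by-id/                # see the stable names udev has created
  mkfs.btrfs -m raid1 -d raid1 \
    /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZA1111111 \
    /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZA2222222

The same by-id paths can be given to mount and to btrfs device add/delete, 
so nothing depends on which /dev/sd* letter a drive happens to get.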

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-07 11:58   ` Austin S. Hemmelgarn
@ 2016-07-08  4:50     ` Corey Coughlin
  2016-07-08 11:14       ` Tomasz Kusmierz
  0 siblings, 1 reply; 13+ messages in thread
From: Corey Coughlin @ 2016-07-08  4:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi Austin,
     Thanks for the reply!  I'll go inline for more:

On 07/07/2016 04:58 AM, Austin S. Hemmelgarn wrote:
> On 2016-07-06 18:59, Tomasz Kusmierz wrote:
>>
>>> On 6 Jul 2016, at 23:14, Corey Coughlin 
>>> <corey.coughlin.cc3@gmail.com> wrote:
>>>
>>> Hi all,
>>>    Hoping you all can help, have a strange problem, think I know 
>>> what's going on, but could use some verification.  I set up a raid1 
>>> type btrfs filesystem on an Ubuntu 16.04 system, here's what it 
>>> looks like:
>>>
>>> btrfs fi show
>>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>    Total devices 10 FS bytes used 3.42TiB
>>>    devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>>    devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>>    devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>>    devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>>    devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>>    devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>>    devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>>    devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>>    devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>>    devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>>>
>>> I added a couple disks, and then ran a balance operation, and that 
>>> took about 3 days to finish.  When it did finish, tried a scrub and 
>>> got this message:
>>>
>>> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 
>>> 01:16:35
>>>    total bytes scrubbed: 926.45GiB with 18849935 errors
>>>    error details: read=18849935
>>>    corrected errors: 5860, uncorrectable errors: 18844075, 
>>> unverified errors: 0
>>>
>>> So that seems bad.  Took a look at the devices and a few of them 
>>> have errors:
>>> ...
>>> [/dev/sdi].generation_errs 0
>>> [/dev/sdj].write_io_errs   289436740
>>> [/dev/sdj].read_io_errs    289492820
>>> [/dev/sdj].flush_io_errs   12411
>>> [/dev/sdj].corruption_errs 0
>>> [/dev/sdj].generation_errs 0
>>> [/dev/sdg].write_io_errs   0
>>> ...
>>> [/dev/sda].generation_errs 0
>>> [/dev/sdb].write_io_errs   3490143
>>> [/dev/sdb].read_io_errs    111
>>> [/dev/sdb].flush_io_errs   268
>>> [/dev/sdb].corruption_errs 0
>>> [/dev/sdb].generation_errs 0
>>> [/dev/sdh].write_io_errs   5839
>>> [/dev/sdh].read_io_errs    2188
>>> [/dev/sdh].flush_io_errs   11
>>> [/dev/sdh].corruption_errs 1
>>> [/dev/sdh].generation_errs 16373
>>>
>>> So I checked the smart data for those disks, they seem perfect, no 
>>> reallocated sectors, no problems.  But one thing I did notice is 
>>> that they are all WD Green drives.  So I'm guessing that if they 
>>> power down and get reassigned to a new /dev/sd* letter, that could 
>>> lead to data corruption.  I used idle3ctl to turn off the shut down 
>>> mode on all the green drives in the system, but I'm having trouble 
>>> getting the filesystem working without the errors.  I tried a 'check 
>>> --repair' command on it, and it seems to find a lot of verification 
>>> errors, but it doesn't look like things are getting fixed.
>>>  But I have all the data on it backed up on another system, so I can 
>>> recreate this if I need to.  But here's what I want to know:
>>>
>>> 1.  Am I correct about the issues with the WD Green drives, if they 
>>> change mounts during disk operations, will that corrupt data?
>> I just wanted to chip in with WD Green drives. I have a RAID10 
>> running on 6x2TB of those, actually had for ~3 years. If disk goes 
>> down for spin down, and you try to access something - kernel & FS & 
>> whole system will wait for drive to re-spin and everything works OK. 
>> I’ve never had a drive being reassigned to different /dev/sdX due to 
>> spin down / up.
>> 2 years ago I was having a corruption due to not using ECC ram on my 
>> system and one of RAM modules started producing errors that were 
>> never caught up by CPU / MoBo. Long story short, guy here managed to 
>> point me to the right direction and I started shifting my data to 
>> hopefully new and not corrupted FS … but I was sceptical of similar 
>> issue that you have described AND I did raid1 and while mounted I did 
>> shift disk from one SATA port to another and FS managed to pick up 
>> the disk in new location and did not even blinked (as far as I 
>> remember there was syslog entry to say that disk vanished and then 
>> that disk was added)
>>
>> Last word, you got plenty of errors in your SMART for transfer 
>> related stuff, please be advised that this may mean:
>> - faulty cable
>> - faulty mono controller
>> - faulty drive controller
>> - bad RAM - yes, mother board CAN use your ram for storing data and 
>> transfer related stuff … specially chapter ones.
> It's worth pointing out that the most likely point here for data 
> corruption assuming the cable and controllers are OK is during the DMA 
> transfer from system RAM to the drive controller.  Even when dealing 
> with really good HBA's that have an on-board NVRAM cache, you still 
> have to copy the data out of system RAM at some point, and that's 
> usually when the corruption occurs if the problem is with the RAM, CPU 
> or MB.

Well, I was able to run memtest on the system last night, and it passed 
with flying colors, so I'm now leaning toward the problem being in the 
SAS card.  But I'll have to run some more tests.

>>
>>> 2.  If that is the case:
>>>    a.) Is there any way I can stop the /dev/sd* mount points from 
>>> changing?  Or can I set up the filesystem using UUIDs or something 
>>> more solid?  I googled about it, but found conflicting info
>> Don’t get it the wrong way but I’m personally surprised that anybody 
>> still uses mount points rather than UUID. Devices change from boot to 
>> boot for a lot of people and most of distros moved to uuid (2 years 
>> ago ? even the swap is mounted via UUID now)
>>
> Providing there are no changes in hardware configuration, device nodes 
> (which is what /dev/sd* are, not mount points) remain relatively 
> stable across boot.  My personal recommendation would be to mount by 
> label, not UUID, and give your filesystems relatively unique names (in 
> my case, I have a short identifier to ID the system the FS is for, 
> followed by a tag identifying where in the hierarchy it gets 
> mounted).  Mounting by UUID works, until you have to recreate the FS 
> from scratch and restore a backup, because then you have a different 
> UUID.

I got another email describing that in more detail, using 
/dev/disk/by-*, so I'll give that a try later.  But my problem does seem 
to be that the mounts are changing while the system is up.  For 
instance, I ran a "check --repair" on the filesystem shown above, and 
/dev/sdj changed to /dev/sds afterwards.  Which leads me to think it's a 
controller card problem, but I'll have to check it to be sure.


>>>    b.) Or, is there something else changing my drive devices?  I 
>>> have most of drives on an LSI SAS 9201-16i card, is there something 
>>> I need to do to make them fixed?
>> I’ll let more senior data storage experts to speak up but most of the 
>> time people frowned on me for mentioning anything different than 
>> north bridge / Intel raid card / super micro / 3 ware .
>>
>> (And yes I did found the hard way they were right:
>> - marvel controller on my mobo randomly writes garbage to your drives
>> - adapted PCI express card was switching of all the drives mid flight 
>> while pretending “it’s OK” resulting in very peculiar data losses in 
>> the middle of big file.
> Based on personal experience:
> 1. LSI Logic, Super Micro, Intel, and 3 Ware all generally have very 
> high quality HBA's (both RAID and non-RAID)
> 2. HighPoint and ASMedia (ASUS's internal semi-conductor branch) are 
> decent if you get recent cards.
> 3. Marvell, Adaptec, and Areca are hit or miss, some work great, some 
> are horrible.
> 4. I have never had a JMicron based card that worked reliably.
> 5. If it's in a server and branded by the manufacturer of that server, 
> it's probably a good adapter.

Thanks for the list, definitely worth looking into if my current LSI 
card is the problem.


>>
>>>    c.) Or, is there a script or something I can use to figure out if 
>>> the disks will change mounts?
>>>    d.) Or, if I wipe everything and rebuild, will the disks with the 
>>> idle3ctl fix work now?
>>>
>>> Regardless of whether or not it's a WD Green drive issue, should I 
>>> just wipefs all the disks and rebuild it?  Is there any way to 
>>> recover this?  Thanks for any help!
>>
>> IF you remotelly care about the data that you have (I think you 
>> should if you came here), I would suggest a good exercise:
>> - unplug all the drives you use for this file system and stop toying 
>> with it because you may loose more data (I did because I thought I 
>> knew better)
>> - get your self 2 new drives
>> - find my thread from ~2 years ago on this mailing list (might be 
>> different email address)
>> - try to locate Chris Mason reply with a script “my old friend”
>> - run this script on you system for couple of DAYS and you will see 
>> whenever you have any corruption creeping in
>> - if corruptions are creeping in, change a component in your system 
>> (controller / RAM / mobo / CPU / PSU) and repeat exercise (best to 
>> organise your self access to some spare parts / extra machine.
>> - when all is good, make and FS out of those 2 new drives, and try to 
>> rescue data from OLD FS !
>> - unplug new FS and put it one the shelf
>> - try to fix old FS … this will be a FUN and very educating exercise …
> FWIW, based on what's been said, I'm almost certain it's a hardware 
> issue, and would give better than 50/50 odds that it's bad RAM or a 
> bad storage controller.  In general, here's the order I'd swap things 
> in if you have spares:
> 1. RAM (If you find that RAM is the issue, try each individual module 
> separately to figure out which one is bad, if they all appear bad, 
> it's either a power issue or something is wrong with your MB).
> 2. Storage Controller
> 3. PSU (if you have a multi-meter and a bit of wire, or a PSU tester, 
> you can check this without needing a spare).
> 4. MB
> 5. CPU

As I said, the RAM looks reasonably good at this point.  I'm guessing 
it's a storage controller issue; I'll set up some test arrays on the 
board controller vs. the added controller and see how that goes.  The PSU 
is a good idea to check, but it's a fairly high-power system, so I keep 
an eye on that pretty closely.  If it's the MB, that's probably the best-
case scenario; I have a few old boards sitting around, so finding a 
replacement will be cheap.  It might be harder to find a CPU problem: it's 
a dual-Xeon board, so if one is bad and the other is good, it could be 
tricky to figure out.


>
> Other things to consider regarding power:
> 1. If you're using a UPS, make sure it lists either 'True sine wave 
> output' or 'APFC' support, many cheap ones produce a quantized sine 
> wave output which causes issues with many modern computer PSU's.
> 2. If you're not using a UPS, you may have issues with what's 
> colloquially known as 'dirty' power.  In signal theory terms, this 
> means that you've got a noise source mixed in with the 50/60Hz sine 
> wave that's supposed to be present on line power.  This is actually a 
> surprisingly common issue in many parts of the world because of the 
> conditions of the transmission lines and the power plant itself. 
> Somewhat ironically, one of the most reliable ways to deal with this 
> is to get a UPS (and if you do, make sure and look for one that meets 
> what I said above)
> Both issues can manifest almost identically to having bad RAM or a bad 
> PSU, but they're often expensive to fix, and in the second case, not 
> easy to test for.

Interesting.  I have 4 machines that I use for folding or DVR duty; 
they're on 24/7 and pretty stable.  I suppose btrfs might be super 
sensitive, but that seems unlikely.  Something to think about, though.  
And thanks again for all the help!

     ------ Corey

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08  1:24     ` Duncan
@ 2016-07-08  4:51       ` Corey Coughlin
  2016-07-09  5:51       ` Andrei Borzenkov
  1 sibling, 0 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-08  4:51 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Hi Duncan,
     Thanks for the info!  I've seen that done in the fstab, but it 
didn't work for me the last time I tried it on the command line. Worth a 
shot!

     ------ Corey

On 07/07/2016 06:24 PM, Duncan wrote:
> Corey Coughlin posted on Wed, 06 Jul 2016 23:40:30 -0700 as excerpted:
>
>> Well yeah, if I was mounting all the disks to different mount points, I
>> would definitely use UUIDs to get them mounted.  But I haven't seen any
>> way to set up a "mkfs.btrfs" command to use UUID or anything else for
>> individual drives.  Am I missing something?  I've been doing a lot of
>> googling.
> FWIW, you can use the /dev/disk/by-*/* symlinks (as normally setup by
> udev) to reference various devices.
>
> Of course because the identifiers behind by-uuid and by-label are per-
> filesystem, those will normally only identify the one device of a multi-
> device filesystem, but the by-id links ID on device serials and partition
> number, and if you are using GPT partitioning, you have by-partuuid and
> (if you set them when setting up the partitions) by-partlabel as well.
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08  4:50     ` Corey Coughlin
@ 2016-07-08 11:14       ` Tomasz Kusmierz
  2016-07-08 12:14         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Tomasz Kusmierz @ 2016-07-08 11:14 UTC (permalink / raw)
  To: Corey Coughlin; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

>
> Well, I was able to run memtest on the system last night, that passed with
> flying colors, so I'm now leaning toward the problem being in the sas card.
> But I'll have to run some more tests.
>

Seriously, use "stress.sh" for a couple of days.  When I was running
memtest it ran continuously for 3 days without an error; a day of
stress.sh and errors started showing up.
Be VERY careful about trusting any tool of that sort, modern CPUs lie
to you continuously !!!
1. You may think that you've written the best code on the planet for
bypassing the CPU cache, but in reality, since CPUs are multicore, you
can end up with an overzealous MPMD setup trapping you inside the cache:
all your testing will do is write a page (trapped in cache) and read it
back from cache (via the coherency mechanism, not the hit/miss one),
which keeps you inside L3, so you have no clue that you never touch the
RAM; the CPU will just dump your page to RAM later and call it "job done".
2. Because of coherency problems and real problems with non-blocking
MPMD, you can have the DMA controller sucking pages out of your own
cache: with the RAM marked as dirty, the CPU will try to save time and
accelerate the operation by pushing the DMA straight out of L3 to
somewhere else (I mention this because some testers use a crazy trick of
forcing RAM access via DMA to somewhere and back to force dropping out
of L3).
3. This one is actually funny: some testers didn't claim the pages for
their process, so for some reason the pages they were using were not
showing up as used / dirty etc., and all the testing was done in 32 kB
of L1 ... the tests were fast though :)

stress.sh will test the operation of the whole system !!!  It shifts a
lot of data, so the disks are engaged; the CPU keeps pumping out CRC32s
all the time, so it's busy; and the RAM gets hit nicely as well due to
the heavy DMA.

Come to think of it, if your device nodes change during operation of
the system, it might be the LSI card dying -> reinitialising ->
rediscovering drives -> drives showing up at a different point.  On my
system I can hot-swap SATA and a drive will come up with a different dev
node even though it was connected to the same place on the controller.
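
If that is what is happening, the kernel log should show it.  Assuming
the 9201-16i is driven by mpt2sas, something along these lines:

  dmesg | grep -i -E 'mpt2sas|fault|reset|attached scsi'

should reveal whether the HBA is resetting and re-enumerating the drives
around the time the device node changes.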

I think most important: I presume you run non-ECC RAM?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08 11:14       ` Tomasz Kusmierz
@ 2016-07-08 12:14         ` Austin S. Hemmelgarn
  2016-07-09  5:13           ` Corey Coughlin
  0 siblings, 1 reply; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-08 12:14 UTC (permalink / raw)
  To: Tomasz Kusmierz, Corey Coughlin; +Cc: Btrfs BTRFS

On 2016-07-08 07:14, Tomasz Kusmierz wrote:
>>
>> Well, I was able to run memtest on the system last night, that passed with
>> flying colors, so I'm now leaning toward the problem being in the sas card.
>> But I'll have to run some more tests.
>>
>
> Seriously use the "stres.sh" for couple of days, When I was running
> memtest it was running continuously for 3 days without the error, day
> of stres.sh and errors started showing up.
> Be VERY careful with trusting any sort of that tool, modern CPU's lye
> to you continuously !!!
> 1. You may think that you've wrote best on the planet code that
> bypasses a CPU cache, but in reality since CPU's are multicore you can
> end up with overzealous MPMD traping you inside of you cache memory
> and all you resting will do is write a page (trapped in cache) read it
> from cache (coherency mechanism, not the mis/hit one) will trap you
> inside of L3 so you have no clue you don't touch the ram, then CPU
> will just dump your page to RAM and "job done"
> 2. Since coherency problems and real problems with non blocking on
> mpmd you can have a DMA controller sucking pages out your own cache,
> due to ram being marked as dirty and CPU will try to spare the time
> and accelerate the operation to push DMA straigh out of L3 to
> somewhere else (mentioning that sine some testers use crazy way of
> forcing your ram access via DMA to somewhere and back to force droping
> out of L3)
> 3. This one is actually funny: some testers didn't claim the pages to
> the process so for some reason pages that the were using were not
> showing up as used / dirty etc so all the testing was done 32kB of L1
> ... tests were fast thou :)
>
> srters.sh will test operation of the whole system !!! it shifts a lot
> of data so disks are engaged, CPU keeps pumping out CRC32 all the time
> so it's busy, RAM gets hit nicely as well due to high DMA.
Agreed, never just trust memtest86 or memtest86+.

FWIW, here's the routine I go through to test new RAM:
1. Run regular memtest86 for at least 3 full cycles in full SMP mode (F2 
while starting up to force SMP).  On some systems this may hang, but 
that's an issue in the BIOS's setup of the CPU and MC, not the RAM, and 
is generally not indicative of a system which will have issues.
2. Run regular memtest86 for at least 3 full cycles in regular UP mode 
(the default on most non-NUMA hardware).
3. Repeat 1 and 2 with memtest86+.  It's diverged enough from regular 
memtest86 that it's functionally a separate tool, and I've seen RAM that 
passes one but not the other on multiple occasions before.
4. Boot SystemRescueCD, download a copy of the Linux sources, and run as 
many allmodconfig builds in parallel as I have CPU's, each with a number 
of make jobs equal to twice the number of CPU's (so each CPU ends up 
running at least two threads).  This forces enough context switching to 
completely trash even the L3 cache on almost any modern processor, which 
means it forces things out to RAM.  It won't hit all your RAM, but I've 
found it to be a relatively reliable way to verify that the memory bus and 
the memory controller work properly.
5. Still from SystemRescueCD, use a tool called memtester (essentially 
memtest86, but run from userspace) to check the RAM.
6. Still from SystemRescueCD, use sha1sum to compute SHA-1 hashes of all 
the disks in the system, using at least 8 instances of sha1sum per CPU 
core, and make sure that all the sums for a given disk match.
7. Do 6 again, but using cat to compute the sum of a concatenation of 
all the disks in the system (so the individual commands end up being 
`cat /dev/sd? | sha1sum`).  This will rapidly use all available memory 
on the system and keep it in use for quite a while.  (Concrete examples 
of 6 and 7 are sketched after this list.)
8. If I'm using my home server system, I also have a special virtual 
runlevel set up where I spin up 4 times as many VM's as I have CPU cores 
(so on my current 8 core system, I spin up 32), all assigned a part of 
the RAM not used by the host (which I shrink to the minimum useable size 
of about 500MB), all running steps 1-3 in parallel.
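
As a concrete illustration of steps 6 and 7 (the device names and /tmp 
paths are only examples):

# step 6: many parallel reads of the same disk must all produce the same hash
for i in $(seq 1 $((8 * $(nproc)))); do sha1sum /dev/sda > /tmp/sda.$i & done; wait
sort -u /tmp/sda.* | wc -l     # should print 1; anything else means a mismatch

# step 7: concatenate every disk and hash the stream (run a few of these at once)
cat /dev/sd? | sha1sum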

It may also be worth mentioning that I've seen very poorly behaved HBA's 
that produce symptoms that look like bad RAM, including issues not 
related to the disks themselves, yet show no issues when regular memory 
testing is run.
>
> When come to think about it, if your device points change during
> operation of the system it might be an LSI card dying -> reinitialize
> -> rediscovering drives -> drives show up in different point. On my
> system I can hot swap sata and it will come up with different dev even
> thou it was connected to same place on the controller.
Barring a few odd controllers I've seen which support hot-plug but not 
hot-remove, that shouldn't happen unless the device is in use, and in 
that case it only happens because of the existing open references to the 
device being held by whatever is using it.
>
> I think, most important - I presume you run nonECC ?
And if not, how well shielded is your system?  You can often get by with 
non-ECC RAM if you have good EMI shielding and reboot regularly.  Most 
servers actually do have good EMI shielding, and many pre-built desktops 
do, but a lot of DIY systems don't (especially if it's a gaming case; the 
polycarbonate windows many of them have in the side panel are a 
_huge_ hole in the EMI shielding).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08 12:14         ` Austin S. Hemmelgarn
@ 2016-07-09  5:13           ` Corey Coughlin
  0 siblings, 0 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-09  5:13 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi all,
     One thing I may not have made clear: this wasn't a system that had 
been running for a month and just became corrupt out of nowhere.  The 
corruption showed up the first time I tried to run a filesystem balance, 
basically the day after I set up the filesystem and copied files over.  
I was hoping to get it stable and then add some more disks, but since it 
wasn't stable right off the top, I'm assuming the problem is bigger than 
some bad memory.  I ran stress.sh on two disks connected to ports on 
the motherboard, and that seemed to work fine.  I'm using a pair of WD 
Green drives for that, in case there's an issue with those.  I did order 
some WD Red NAS drives, hoping they arrive soon.  I'm running the stress 
test now with the drives connected to the SAS card, trying to let them 
run for a day to see if something bad happens; it's a 4-port card, so if 
there's a problem with a specific port or cable, it could take me a while 
to find it.  I'm hoping it shows up in a somewhat obvious way.  But 
thanks for all the help; the stress.sh runs give me a clear way to try to 
debug this, thanks again for that tip.

     ------- Corey

On 07/08/2016 05:14 AM, Austin S. Hemmelgarn wrote:
> On 2016-07-08 07:14, Tomasz Kusmierz wrote:
>>>
>>> Well, I was able to run memtest on the system last night, that 
>>> passed with
>>> flying colors, so I'm now leaning toward the problem being in the 
>>> sas card.
>>> But I'll have to run some more tests.
>>>
>>
>> Seriously, use "stress.sh" for a couple of days.  When I was running
>> memtest it ran continuously for 3 days without an error; after a day
>> of stress.sh, errors started showing up.
>> Be VERY careful about trusting any tool of that sort, modern CPU's lie
>> to you continuously !!!
>> 1. You may think you've written the best code on the planet for
>> bypassing the CPU cache, but in reality, since CPU's are multicore,
>> an overzealous MPMD setup can trap you inside your cache memory: all
>> your testing does is write a page (trapped in cache) and read it back
>> from cache (the coherency mechanism, not the hit/miss one), which
>> keeps you inside L3, so you have no clue you never touch the RAM;
>> then the CPU just dumps your page to RAM and "job done".
>> 2. Because of coherency problems and real problems with non-blocking
>> MPMD, you can have a DMA controller sucking pages out of your own
>> cache: with the RAM marked as dirty, the CPU will try to save time
>> and accelerate the operation by pushing DMA straight out of L3 to
>> somewhere else (worth mentioning since some testers use the crazy
>> approach of forcing RAM access via DMA to somewhere and back, to
>> force dropping out of L3).
>> 3. This one is actually funny: some testers didn't claim the pages
>> for the process, so the pages they were using weren't showing up as
>> used / dirty etc., and all the testing was done in 32kB of L1 ...
>> the tests were fast though :)
>>
>> stress.sh will test operation of the whole system !!!  It shifts a
>> lot of data, so the disks are engaged, the CPU stays busy pumping out
>> CRC32 all the time, and the RAM gets hit nicely as well due to heavy
>> DMA.
> Agreed, never just trust memtest86 or memtest86+.
>
> FWIW, here's the routine I go through to test new RAM:
> 1. Run regular memtest86 for at least 3 full cycles in full SMP mode 
> (F2 while starting up to force SMP).  On some systems this may hang, 
> but that's an issue in the BIOS's setup of the CPU and MC, not the 
> RAM, and is generally not indicative of a system which will have issues.
> 2. Run regular memtest86 for at least 3 full cycles in regular UP mode 
> (the default on most non-NUMA hardware).
> 3. Repeat 1 and 2 with memtest86+.  It's diverged enough from regular 
> memtest86 that it's functionally a separate tool, and I've seen RAM 
> that passes one but not the other on multiple occasions before.
> 4. Boot SystemRescueCD, download a copy of the Linux sources, and run 
> as many allmodconfig builds in parallel as I have CPU's, each with a 
> number of make jobs equal to twice the number of CPU's (so each CPU 
> ends up running at least two threads).  This forces enough context 
> switching to completely trash even the L3 cache on almost any modern 
> processor, which means it forces things out to RAM.  It won't hit all 
> your RAM, but I've found it to be a relatively reliable way to verify 
> the memory bus and the memory controller work properly.
> 5. Still from SystemRescueCD, use a tool called memtester (essentially 
> memtest86, but run from userspace) to check the RAM.
> 6. Still from SystemRescueCD, use sha1sum to compute SHA-1 hashes of 
> all the disks in the system, using at least 8 instances of sha1sum per 
> CPU core, and make sure that all the sums for a disk match.
> 7. Do 6 again, but using cat to compute the sum of a concatenation of 
> all the disks in the system (so the individual commands end up being 
> `cat /dev/sd? | sha1sum`).  This will rapidly use all available memory 
> on the system and keep it in use for quite a while.
> 8. If I'm using my home server system, I also have a special virtual 
> runlevel set up where I spin up 4 times as many VM's as I have CPU 
> cores (so on my current 8 core system, I spin up 32), all assigned a 
> part of the RAM not used by the host (which I shrink to the minimum 
> usable size of about 500MB), all running steps 1-3 in parallel.
>
> It may also be worth mentioning that I've seen very poorly behaved 
> HBA's that produce symptoms that look like bad RAM, including issues 
> not related to the disks themselves, yet show no issues when regular 
> memory testing is run.
>>
>> When I come to think about it, if your device nodes change during
>> operation of the system it might be the LSI card dying -> reinitializing
>> -> rediscovering drives -> drives showing up at different points.  On my
>> system I can hot-swap SATA and a drive will come up with a different dev
>> name even though it was connected to the same place on the controller.
> Barring a few odd controllers I've seen which support hot-plug but not 
> hot-remove, that shouldn't happen unless the device is in use, and in 
> that case it only happens because of the existing open references to 
> the device being held by whatever is using it.
>>
>> I think, most important - I presume you run non-ECC?
> And if not, how well shielded is your system?  You can often get by 
> with non-ECC RAM if you have good EMI shielding and reboot regularly.  
> Most servers actually do have good EMI shielding, and many pre-built 
> desktops do, but a lot of DIY systems don't (especially if it's a 
> gaming case, the poly-carbonate windows many of them have in the side 
> panel are a _huge_ hole in the EMI shielding).


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-07  6:40   ` Corey Coughlin
  2016-07-08  1:24     ` Duncan
@ 2016-07-09  5:40     ` Andrei Borzenkov
  2016-07-12  4:50       ` Corey Coughlin
  1 sibling, 1 reply; 13+ messages in thread
From: Andrei Borzenkov @ 2016-07-09  5:40 UTC (permalink / raw)
  To: Corey Coughlin, Tomasz Kusmierz; +Cc: Btrfs BTRFS

07.07.2016 09:40, Corey Coughlin writes:
> Hi Tomasz,
>     Thanks for the response!  I should clear some things up, though.
> 
> On 07/06/2016 03:59 PM, Tomasz Kusmierz wrote:
>>> On 6 Jul 2016, at 23:14, Corey Coughlin
>>> <corey.coughlin.cc3@gmail.com> wrote:
>>>
>>> Hi all,
>>>     Hoping you all can help, have a strange problem, think I know
>>> what's going on, but could use some verification.  I set up a raid1
>>> type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks
>>> like:
>>>
>>> btrfs fi show
>>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>     Total devices 10 FS bytes used 3.42TiB
>>>     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>>     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>>     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>>     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>>     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>>     devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>>     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>>     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>>     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>>     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> Now when I say that the drives' mount points change, I'm not saying they
> change when I reboot.  They change while the system is running.  For
> instance, here's the fi show after I ran a "check --repair" run this
> afternoon:
> 
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>     Total devices 10 FS bytes used 3.42TiB
>     devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>     devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>     devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>     devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>     devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>     devid    6 size 1.82TiB used 823.03GiB path /dev/sds
>     devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>     devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>     devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>     devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
> 
> Notice that /dev/sdj in the previous run changed to /dev/sds.  There was
> no reboot, the mount just changed.  I don't know why that is happening,
> but it seems like the majority of the errors are on that drive.  But
> given that I've fixed the start/stop issue on that disk, it probably
> isn't a WD Green issue.

It's not "mount point", it is just device names. Do not make it sound
more confusing than it already is :)

This implies that disks drop off and reappear.  Do you have "dmesg"
output or logs (/var/log/syslog, /var/log/messages, or journalctl)
covering the same period of time?
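
For instance, something along these lines (the date range is only a
placeholder, adjust it to when the /dev/sdj -> /dev/sds rename happened):

# kernel messages for the window in question
journalctl -k --since "2016-07-07" --until "2016-07-08" | \
    grep -iE 'sd[a-z]|ata[0-9]|mpt|reset'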


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-08  1:24     ` Duncan
  2016-07-08  4:51       ` Corey Coughlin
@ 2016-07-09  5:51       ` Andrei Borzenkov
  1 sibling, 0 replies; 13+ messages in thread
From: Andrei Borzenkov @ 2016-07-09  5:51 UTC (permalink / raw)
  To: Duncan, linux-btrfs

08.07.2016 04:24, Duncan writes:
> Corey Coughlin posted on Wed, 06 Jul 2016 23:40:30 -0700 as excerpted:
> 
>> Well yeah, if I was mounting all the disks to different mount points, I
>> would definitely use UUIDs to get them mounted.  But I haven't seen any
>> way to set up a "mkfs.btrfs" command to use UUID or anything else for
>> individual drives.  Am I missing something?  I've been doing a lot of
>> googling.
> 
> FWIW, you can use the /dev/disk/by-*/* symlinks (as normally setup by 
> udev) to reference various devices.
> 

Current udev ships a rule that calls the equivalent of "btrfs device
ready $dev", where $dev is the canonical kernel device name.  The btrfs
kernel driver updates its internal list of device names when it gets
this ioctl, which means that unless you explicitly pass the full list of
/dev/disk/by-*/* paths during mount, you will see those kernel names.

And even then, as soon as a device dis- and re-appears for some reason
(as is apparently the case here), it will be renamed back by udev when
the "add" event is seen.
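
Purely as an illustration of the "pass the full list" case - the by-id
names below are made-up placeholders, and only two of the ten devices
are shown:

mount -o device=/dev/disk/by-id/ata-DISKA,device=/dev/disk/by-id/ata-DISKB \
    UUID=597ee185-36ac-4b68-8961-d4adc13f95d4 /mnt/pool

and something like "btrfs device ready /dev/sdX; echo $?" should come
back 0 once the kernel has seen every member device of that filesystem.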

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: raid1 has failing disks, but smart is clear
  2016-07-09  5:40     ` Andrei Borzenkov
@ 2016-07-12  4:50       ` Corey Coughlin
  0 siblings, 0 replies; 13+ messages in thread
From: Corey Coughlin @ 2016-07-12  4:50 UTC (permalink / raw)
  To: Andrei Borzenkov, Tomasz Kusmierz; +Cc: Btrfs BTRFS

Hi Andrei,
     Thanks for the info, and sorry about the improper terminology.  In 
better news, I discovered that one disk wasn't getting recognized by the 
OS on certain outputs of one of my mini-SAS to SATA cables, so I got a 
new cable, and the disk works fine with that.  So that's at least one 
bad cable; I still have to check the other 3, though.
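
Side note, in case it helps anyone else chasing a flaky port or cable: 
listings along these lines show which controller lane each kernel name 
currently sits on, and give per-disk names that survive the sdX 
reshuffling (both are just symlink listings, so they're safe to run 
while the filesystem is mounted):

ls -l /dev/disk/by-path/
ls -l /dev/disk/by-id/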

     ------ Corey

On 07/08/2016 10:40 PM, Andrei Borzenkov wrote:
> 07.07.2016 09:40, Corey Coughlin пишет:
>> Hi Tomasz,
>>      Thanks for the response!  I should clear some things up, though.
>>
>> On 07/06/2016 03:59 PM, Tomasz Kusmierz wrote:
>>>> On 6 Jul 2016, at 23:14, Corey Coughlin
>>>> <corey.coughlin.cc3@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>      Hoping you all can help, have a strange problem, think I know
>>>> what's going on, but could use some verification.  I set up a raid1
>>>> type btrfs filesystem on an Ubuntu 16.04 system, here's what it looks
>>>> like:
>>>>
>>>> btrfs fi show
>>>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>>>      Total devices 10 FS bytes used 3.42TiB
>>>>      devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>>>      devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>>>      devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>>>      devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>>>      devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>>>      devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>>>>      devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>>>      devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>>>      devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>>>      devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>> Now when I say that the drives' mount points change, I'm not saying they
>> change when I reboot.  They change while the system is running.  For
>> instance, here's the fi show after I ran a "check --repair" run this
>> afternoon:
>>
>> btrfs fi show
>> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>>      Total devices 10 FS bytes used 3.42TiB
>>      devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>>      devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>>      devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>>      devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>>      devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>>      devid    6 size 1.82TiB used 823.03GiB path /dev/sds
>>      devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>>      devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>>      devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>>      devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>>
>> Notice that /dev/sdj in the previous run changed to /dev/sds.  There was
>> no reboot, the mount just changed.  I don't know why that is happening,
>> but it seems like the majority of the errors are on that drive.  But
>> given that I've fixed the start/stop issue on that disk, it probably
>> isn't a WD Green issue.
> It's not "mount point", it is just device names. Do not make it sound
> more confusing than it already is :)
>
> This implies that disks drop off and reappear.  Do you have "dmesg"
> output or logs (/var/log/syslog, /var/log/messages, or journalctl)
> covering the same period of time?
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-07-12  4:50 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-06 22:14 raid1 has failing disks, but smart is clear Corey Coughlin
2016-07-06 22:59 ` Tomasz Kusmierz
2016-07-07  6:40   ` Corey Coughlin
2016-07-08  1:24     ` Duncan
2016-07-08  4:51       ` Corey Coughlin
2016-07-09  5:51       ` Andrei Borzenkov
2016-07-09  5:40     ` Andrei Borzenkov
2016-07-12  4:50       ` Corey Coughlin
2016-07-07 11:58   ` Austin S. Hemmelgarn
2016-07-08  4:50     ` Corey Coughlin
2016-07-08 11:14       ` Tomasz Kusmierz
2016-07-08 12:14         ` Austin S. Hemmelgarn
2016-07-09  5:13           ` Corey Coughlin
