* Offline array, events count mismatch
@ 2015-11-09  2:49 Guillaume Paumier
  2015-11-09  3:35 ` Phil Turmel
  0 siblings, 1 reply; 4+ messages in thread
From: Guillaume Paumier @ 2015-11-09  2:49 UTC
  To: linux-raid

Hello folks,

I reached out to you a few months ago when a --grow went awry. In the end I 
managed to restore my array thanks to this mailing list and the invaluable 
help of IRC user frostschutz.

I'm now facing another issue and I'm hoping you can help me again.

Today I found out that my 9-disk RAID6 array was offline. When looking at the 
machine, two disks seemed to have disappeared; they didn't show up in fdisk or 
anywhere else. And a third one was marked as "faulty" in mdadm.

At first, I was puzzled because it seemed improbable that three disks had 
failed at the same time. I removed the array from fstab and rebooted. The two 
vanished disks re-appeared (in fdisk too), and when examining the partitions, 
I noticed the following event counts:

/dev/sdb1:
         Events : 198477
/dev/sdc1:
         Events : 198477
/dev/sdd1:
         Events : 198477
/dev/sde1:
         Events : 54264
/dev/sdf1:
         Events : 54264
/dev/sdg1:
         Events : 198477
/dev/sdh1:
         Events : 198477
/dev/sdi1:
         Events : 198477
/dev/sdj1:
         Events : 198473

Looking at those event counts, my understanding is this:
* Two of the disks (sde, sdf) were dropped from the array for some reason.
* I didn't notice this immediately (an issue I'm addressing separately).
* A third disk (sdj) encountered a small issue today.
* The array went offline because it didn't have enough disks to function 
cleanly any more.
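
The comparison above can be sketched in a few lines of Python. This is only an illustration: the text is the event summary from this post, and the helper names (parse_events, lag_behind_newest) are made up for the example, not part of mdadm.

```python
# Event counts copied from the "mdadm --examine" summary in this post.
EXAMINE_SUMMARY = """\
/dev/sdb1:
         Events : 198477
/dev/sdc1:
         Events : 198477
/dev/sdd1:
         Events : 198477
/dev/sde1:
         Events : 54264
/dev/sdf1:
         Events : 54264
/dev/sdg1:
         Events : 198477
/dev/sdh1:
         Events : 198477
/dev/sdi1:
         Events : 198477
/dev/sdj1:
         Events : 198473
"""


def parse_events(text):
    """Map each device path to the event count listed under it."""
    events = {}
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/dev/") and line.endswith(":"):
            device = line[:-1]
        elif line.startswith("Events") and device is not None:
            events[device] = int(line.split(":", 1)[1])
    return events


def lag_behind_newest(events):
    """How many events each device is behind the most recent superblock."""
    newest = max(events.values())
    return {dev: newest - count for dev, count in events.items()}


events = parse_events(EXAMINE_SUMMARY)
lag = lag_behind_newest(events)
for dev in sorted(lag):
    print(f"{dev}: events={events[dev]} lag={lag[dev]}")
```

Six devices share the newest count, sdj1 trails by a handful of events, and sde1 and sdf1 trail by a very large margin, which is what the list above describes.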

If I understand the documentation [1] correctly, since the event count for sdj 
is very close to the event count of sd[b,c,d,g,h,i], I should be able to re-
assemble the array with these 7 disks using --force, leaving sde and sdf 
aside. Once the array is assembled, I should be able to re-add sde and sdf, 
and they will be re-sync'd.

[1] 
https://raid.wiki.kernel.org/index.php/RAID_Recovery#Trying_to_assemble_using_--force
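
As a rough sketch of the plan described above, the following Python only builds and prints the command lines; it does not run anything. The max_lag threshold of 10 events is an arbitrary value picked for this illustration, not something mdadm or the wiki defines, and /dev/md0 is taken from the mdadm -D output further down.

```python
# Event counts from this post; nothing here touches real devices.
EVENTS = {
    "/dev/sdb1": 198477, "/dev/sdc1": 198477, "/dev/sdd1": 198477,
    "/dev/sde1": 54264, "/dev/sdf1": 54264, "/dev/sdg1": 198477,
    "/dev/sdh1": 198477, "/dev/sdi1": 198477, "/dev/sdj1": 198473,
}


def plan_commands(events, array="/dev/md0", max_lag=10):
    """Build (but do not execute) the commands for the proposed plan.

    max_lag is a hypothetical cut-off chosen for this example: devices
    within that many events of the newest go into the forced assembly,
    the rest are re-added afterwards.
    """
    newest = max(events.values())
    close = sorted(d for d, e in events.items() if newest - e <= max_lag)
    stale = sorted(d for d, e in events.items() if newest - e > max_lag)
    commands = [["mdadm", "--assemble", "--force", array, *close]]
    commands += [["mdadm", array, "--re-add", dev] for dev in stale]
    return commands


for cmd in plan_commands(EVENTS):
    print(" ".join(cmd))
```

Printing the commands instead of running them leaves room to review the device list before anything is written to a superblock.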

I prefer to be cautious and ask here before doing anything that could make 
things worse. It would be great if you could confirm that my understanding is 
correct, and tell me if this plan is sound.

I'm including some more detailed information below. Let me know if there's any 
other information that would be useful.

Many thanks,


===========================================================
Before the reboot: mdadm -D
-----------------------------------------------------------

# mdadm -D /dev/md0
/dev/md0:
        Version : 1.0
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 3907016448 (3726.02 GiB 4000.78 GB)
   Raid Devices : 9
  Total Devices : 8
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Nov  8 06:36:50 2015
          State : clean, FAILED
 Active Devices : 6
Working Devices : 6
 Failed Devices : 2
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : eea59047:120a0365:353da182:6787e030
         Events : 198477

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       97        2      active sync   /dev/sdg1
       3       8      113        3      active sync   /dev/sdh1
       4       8      129        4      active sync   /dev/sdi1
      10       0        0       10      removed
      12       0        0       12      removed
      14       0        0       14      removed
       8       8       17        8      active sync   /dev/sdb1

       5       8      145        -      faulty   /dev/sdj1
       6       8       65        -      faulty   /dev/sde1
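
The numbers in this mdadm -D output can be cross-checked with a little arithmetic. The sketch below assumes only the general RAID6 property that two devices' worth of space per stripe holds parity; the function names are made up for the example.

```python
# Values copied from the "mdadm -D /dev/md0" output above (sizes in KiB).
RAID_DEVICES = 9
ACTIVE_DEVICES = 6
USED_DEV_SIZE_KIB = 3907016448
ARRAY_SIZE_KIB = 27349115136

RAID6_PARITY_DEVICES = 2  # RAID6 keeps two parity blocks per stripe


def expected_array_size(raid_devices, used_dev_size):
    """Usable capacity: every device except the two parity equivalents."""
    return (raid_devices - RAID6_PARITY_DEVICES) * used_dev_size


def minimum_devices(raid_devices):
    """Fewest members with which RAID6 can still serve all data."""
    return raid_devices - RAID6_PARITY_DEVICES


size_ok = expected_array_size(RAID_DEVICES, USED_DEV_SIZE_KIB) == ARRAY_SIZE_KIB
can_run = ACTIVE_DEVICES >= minimum_devices(RAID_DEVICES)
print("array size matches 7 x used dev size:", size_ok)
print("enough active devices to run:", can_run)
```

With 9 raid devices the array needs at least 7 members, so 6 active devices is one short, which matches the "clean, FAILED" state reported above.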


===========================================================
Before the reboot: mdadm --examine
-----------------------------------------------------------

# mdadm --examine /dev/sd[b-j]1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033128 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 91b187fd:f416880a:f5e81e49:92615e07

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : 30050dee - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 8
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : e1b689b5:b4a2c5a7:56057b69:a9101af0

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 8e546a7e - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 0
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 1d8e74d3:9abd37f8:f2cf0ab8:02fdcfd6

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 31f71397 - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 1
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
mdadm: No md superblock detected on /dev/sde1.
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : b24758e6:042412c5:9b5a3c06:f167aedf

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 68c5292e - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 2
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 00e47d82:b49c3905:3ed961fe:40a5f259

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : b77bfa1e - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 3
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : a7e34040:fa12382f:c2ef3d85:9c95b1d0

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 9cd876ec - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 4
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 9d89c55d:9f4a2181:6b87922f:0681d580

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:38 2015
       Checksum : 66c5dfd2 - correct
         Events : 198473

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 5
   Array State : AAAAAA..A ('A' == active, '.' == missing, 'R' == replacing)
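
The Array State strings above differ between sdj1 and the other members. A small sketch for comparing them, using only the legend printed by mdadm ('A' == active, '.' == missing); active_slots is a name invented for this example.

```python
def active_slots(array_state):
    """Return the set of slot numbers marked 'A' in an Array State string."""
    return {slot for slot, flag in enumerate(array_state) if flag == "A"}


# As recorded in the superblocks with 198477 events (sdb1, sdc1, ...).
newest_view = active_slots("AAAAA...A")
# As recorded in sdj1's superblock, which stopped at 198473 events.
sdj_view = active_slots("AAAAAA..A")

print("active in newest superblocks:", sorted(newest_view))
print("active in sdj1's superblock: ", sorted(sdj_view))
print("slots that changed:          ", sorted(sdj_view - newest_view))
```

The only slot that differs is 5, which is the role sdj1 reports for itself ("Active device 5"): its superblock was last written while it was still a member, and the other members recorded its removal a few events later.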


===========================================================
After the reboot: mdadm --examine
-----------------------------------------------------------

# mdadm --examine /dev/sd[b-j]1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033128 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 91b187fd:f416880a:f5e81e49:92615e07

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : 30050dee - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 8
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : e1b689b5:b4a2c5a7:56057b69:a9101af0

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 8e546a7e - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 0
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 1d8e74d3:9abd37f8:f2cf0ab8:02fdcfd6

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 31f71397 - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 1
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : ddf17d3d:ea944bfb:6886cc91:3366f55f

Internal Bitmap : -16 sectors from superblock
    Update Time : Wed Oct  7 10:17:35 2015
       Checksum : 1dd30b1 - correct
         Events : 54264

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 7
   Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 38675f59:ea412b1f:67d6ed9a:a33fc5dd

Internal Bitmap : -16 sectors from superblock
    Update Time : Wed Oct  7 10:17:35 2015
       Checksum : c88f7c7b - correct
         Events : 54264

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 6
   Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : b24758e6:042412c5:9b5a3c06:f167aedf

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 68c5292e - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 2
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 00e47d82:b49c3905:3ed961fe:40a5f259

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : b77bfa1e - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 3
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : a7e34040:fa12382f:c2ef3d85:9c95b1d0

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:50 2015
       Checksum : 9cd876ec - correct
         Events : 198477

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 4
   Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : eea59047:120a0365:353da182:6787e030
  Creation Time : Thu Aug  1 12:23:07 2013
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
     Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
  Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
   Super Offset : 7814033392 sectors
   Unused Space : before=0 sectors, after=480 sectors
          State : clean
    Device UUID : 9d89c55d:9f4a2181:6b87922f:0681d580

Internal Bitmap : -16 sectors from superblock
    Update Time : Sun Nov  8 06:36:38 2015
       Checksum : 66c5dfd2 - correct
         Events : 198473

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 5
   Array State : AAAAAA..A ('A' == active, '.' == missing, 'R' == replacing)

===========================================================


-- 
Guillaume Paumier


* Re: Offline array, events count mismatch
  2015-11-09  2:49 Offline array, events count mismatch Guillaume Paumier
@ 2015-11-09  3:35 ` Phil Turmel
  2015-11-10  3:05   ` Guillaume Paumier
  0 siblings, 1 reply; 4+ messages in thread
From: Phil Turmel @ 2015-11-09  3:35 UTC
  To: Guillaume Paumier, linux-raid

Hi Guillaume,

On 11/08/2015 09:49 PM, Guillaume Paumier wrote:

[trim /]

> Looking at those event counts, my understanding is this:
> * Two of the disks (sde, sdf) were dropped from the array for some reason.
> * I didn't notice this immediately (an issue I'm addressing separately).
> * A third disk (sdj) encountered a small issue today.
> * The array went offline because it didn't have enough disks to function 
> cleanly any more.
> 
> If I understand the documentation [1] correctly, since the event count for sdj 
> is very close to the event count of sd[b,c,d,g,h,i], I should be able to re-
> assemble the array with these 7 disks using --force, leaving sde and sdf 
> aside. Once the array is assembled, I should be able to re-add sde and sdf, 
> and they will be re-sync'd.

Yes, that is the correct response.

Your situation is common.  Please see the thread this weekend started by
Francisco Parada.

https://marc.info/?t=144691643300001&r=1&w=2&n=12

You should provide "smartctl -i -A -l scterc /dev/sdX" reports for your
drives.  If you can find an old syslog for when your two worst drives
fell out, it might help.
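
One way to collect those reports for every drive is sketched below. It only builds and prints the command lines; the device range sdb through sdj is assumed from the original post, and the commands address whole disks rather than partitions.

```python
import string


def smartctl_commands(first="b", last="j"):
    """Build one smartctl invocation per whole disk from sd<first> to sd<last>."""
    letters = string.ascii_lowercase
    wanted = letters[letters.index(first):letters.index(last) + 1]
    return [
        ["smartctl", "-i", "-A", "-l", "scterc", f"/dev/sd{letter}"]
        for letter in wanted
    ]


for cmd in smartctl_commands():
    print(" ".join(cmd))
```

Redirecting each command's output to its own file would make the reports easier to paste into a reply.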

Phil


* Re: Offline array, events count mismatch
  2015-11-09  3:35 ` Phil Turmel
@ 2015-11-10  3:05   ` Guillaume Paumier
  2015-11-10 15:50     ` Phil Turmel
  0 siblings, 1 reply; 4+ messages in thread
From: Guillaume Paumier @ 2015-11-10  3:05 UTC
  To: Phil Turmel; +Cc: linux-raid

Hello Phil and the list,

On Sunday 8 November 2015, 22:35:13 Phil Turmel wrote:
> 
> On 11/08/2015 09:49 PM, Guillaume Paumier wrote:
> > 
> > If I understand the documentation [1] correctly, since the event count for
> > sdj is very close to the event count of sd[b,c,d,g,h,i], I should be able
> > to re-assemble the array with these 7 disks using --force, leaving sde
> > and sdf aside. Once the array is assembled, I should be able to re-add
> > sde and sdf, and they will be re-sync'd.
> 
> Yes, that is the correct response.
> 
> Your situation is common.  Please see the thread this weekend started by
> Francisco Parada.

Thank you for confirming, Phil, and for the additional pointer.

I've re-assembled the array with --force, which cleaned sdj, and then I was 
able to re-add the two other disks. The array started rebuilding and recovery 
was past 10% when the array failed again.

It seems there was an "unrecoverable read error" on sdj, and now I'm back with 
an array where 2 of the disks are marked as spare (sde and sdf, because their 
rebuild didn't complete), and sdj is faulty with an event count mismatch of 4, 
like before:

/dev/sdb1:
         Events : 198704
/dev/sdc1:
         Events : 198704
/dev/sdd1:
         Events : 198704
/dev/sde1:
         Events : 198704
/dev/sdf1:
         Events : 198704
/dev/sdg1:
         Events : 198704
/dev/sdh1:
         Events : 198704
/dev/sdi1:
         Events : 198704
/dev/sdj1:
         Events : 198700

Below is the output of dmesg with more details on the read error.

Is there any way I can move past this? This error is preventing me from 
rebuilding the array, and I'm assuming it would also prevent me from copying 
the data off the array without rebuilding, so I'm not sure how to proceed. Any 
guidance would be much appreciated.


[88233.712961] md: recovery of RAID array md0
[88233.712965] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[88233.712967] md: using maximum available idle IO bandwidth (but not more 
than 200000 KB/sec) for recovery.
[88233.712978] md: using 128k window, over a total of 3907016448k.

[88953.752335] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[88953.752345] ata9.01: BMDMA stat 0x64
[88953.752353] ata9.01: failed command: READ DMA EXT
[88953.752368] ata9.01: cmd 25/00:00:00:fc:e8/00:02:27:00:00/f0 tag 0 dma 
262144 in
         res 51/40:00:f8:fd:e8/40:00:27:00:00/10 Emask 0x9 (media error)                                                                                     
[88953.752375] ata9.01: status: { DRDY ERR }
[88953.752380] ata9.01: error: { UNC }
[88953.793877] ata9.00: configured for UDMA/33
[88953.799795] ata9.01: configured for UDMA/33
[88953.799855] sd 8:0:1:0: [sdj] Unhandled sense code
[88953.799858] sd 8:0:1:0: [sdj]  
[88953.799860] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[88953.799862] sd 8:0:1:0: [sdj]  
[88953.799864] Sense Key : Medium Error [current] [descriptor]
[88953.799867] Descriptor sense data with sense descriptors (in hex):
[88953.799868]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[88953.799875]         27 e8 fd f8 
[88953.799879] sd 8:0:1:0: [sdj]  
[88953.799882] Add. Sense: Unrecovered read error - auto reallocate failed
[88953.799884] sd 8:0:1:0: [sdj] CDB: 
[88953.799885] Read(16): 88 00 00 00 00 00 27 e8 fc 00 00 00 02 00 00 00
[88953.799894] end_request: I/O error, dev sdj, sector 669580792
[88953.799898] md/raid:md0: read error not correctable (sector 669578744 on 
sdj1).
[88953.799924] ata9: EH complete
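
The sector numbers in this first error block can be tied together with a short sketch. It assumes the usual libata field order for the "res" line (status/error:nsect:lbal:lbam:lbah/hob bytes/device); the function names are invented for the example.

```python
# Values copied from the dmesg block above.
ATA_RES = "51/40:00:f8:fd:e8/40:00:27:00:00/10"
SENSE_LBA_BYTES = "27 e8 fd f8"
BLOCK_LAYER_SECTOR = 669580792   # "end_request: I/O error, dev sdj, sector ..."
MD_SECTOR_ON_SDJ1 = 669578744    # "read error not correctable (sector ... on sdj1)"


def lba_from_sense_bytes(hex_bytes):
    """Interpret the space-separated sense bytes as one big-endian number."""
    return int(hex_bytes.replace(" ", ""), 16)


def lba_from_ata_res(res):
    """Reassemble the 48-bit LBA from a libata 'res' taskfile dump."""
    _status, low, high, _device = res.split("/")
    _error, _nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    return (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
            | lbah << 16 | lbam << 8 | lbal)


print("LBA from res line:   ", lba_from_ata_res(ATA_RES))
print("LBA from sense bytes:", lba_from_sense_bytes(SENSE_LBA_BYTES))
print("disk minus md sector:", BLOCK_LAYER_SECTOR - MD_SECTOR_ON_SDJ1)
```

All three sources point at the same sector on the whole disk. The constant difference between the sdj and sdj1 numbers is consistent with the partition starting 2048 sectors into the disk, although that is an inference from the log rather than something shown in the partition table here.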

[89333.138473] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89333.138478] ata9.01: BMDMA stat 0x64
[89333.138482] ata9.01: failed command: READ DMA EXT
[89333.138488] ata9.01: cmd 25/00:00:58:6e:3b/00:02:35:00:00/f0 tag 0 dma 
262144 in
         res 51/40:00:c8:6f:3b/40:00:35:00:00/10 Emask 0x9 (media error)                                                                                     
[89333.138491] ata9.01: status: { DRDY ERR }
[89333.138493] ata9.01: error: { UNC }
[89333.147985] ata9.00: configured for UDMA/33
[89333.153966] ata9.01: configured for UDMA/33
[89333.154022] sd 8:0:1:0: [sdj] Unhandled sense code
[89333.154025] sd 8:0:1:0: [sdj]  
[89333.154027] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89333.154029] sd 8:0:1:0: [sdj]  
[89333.154031] Sense Key : Medium Error [current] [descriptor]
[89333.154034] Descriptor sense data with sense descriptors (in hex):
[89333.154035]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[89333.154042]         35 3b 6f c8 
[89333.154046] sd 8:0:1:0: [sdj]  
[89333.154048] Add. Sense: Unrecovered read error - auto reallocate failed
[89333.154050] sd 8:0:1:0: [sdj] CDB: 
[89333.154052] Read(16): 88 00 00 00 00 00 35 3b 6e 58 00 00 02 00 00 00
[89333.154061] end_request: I/O error, dev sdj, sector 893087688
[89333.154064] md/raid:md0: read error not correctable (sector 893085640 on 
sdj1).
[89333.154067] md/raid:md0: read error not correctable (sector 893085648 on 
sdj1).
[89333.154069] md/raid:md0: read error not correctable (sector 893085656 on 
sdj1).
[89333.154071] md/raid:md0: read error not correctable (sector 893085664 on 
sdj1).
[89333.154073] md/raid:md0: read error not correctable (sector 893085672 on 
sdj1).
[89333.154075] md/raid:md0: read error not correctable (sector 893085680 on 
sdj1).
[89333.154077] md/raid:md0: read error not correctable (sector 893085688 on 
sdj1).
[89333.154079] md/raid:md0: read error not correctable (sector 893085696 on 
sdj1).
[89333.154081] md/raid:md0: read error not correctable (sector 893085704 on 
sdj1).
[89333.154083] md/raid:md0: read error not correctable (sector 893085712 on 
sdj1).
[89333.154111] ata9: EH complete
[89338.097012] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89338.097016] ata9.01: BMDMA stat 0x64
[89338.097019] ata9.01: failed command: READ DMA EXT
[89338.097023] ata9.01: cmd 25/00:00:58:70:3b/00:02:35:00:00/f0 tag 0 dma 
262144 in
         res 51/40:00:60:70:3b/40:00:35:00:00/10 Emask 0x9 (media error)                                                                                     
[89338.097025] ata9.01: status: { DRDY ERR }
[89338.097026] ata9.01: error: { UNC }
[89338.125468] ata9.00: configured for UDMA/33
[89338.131458] ata9.01: configured for UDMA/33
[89338.131489] sd 8:0:1:0: [sdj] Unhandled sense code
[89338.131491] sd 8:0:1:0: [sdj]  
[89338.131492] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89338.131493] sd 8:0:1:0: [sdj]  
[89338.131494] Sense Key : Medium Error [current] [descriptor]
[89338.131496] Descriptor sense data with sense descriptors (in hex):
[89338.131497]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[89338.131502]         35 3b 70 60 
[89338.131504] sd 8:0:1:0: [sdj]  
[89338.131506] Add. Sense: Unrecovered read error - auto reallocate failed
[89338.131507] sd 8:0:1:0: [sdj] CDB: 
[89338.131508] Read(16): 88 00 00 00 00 00 35 3b 70 58 00 00 02 00 00 00
[89338.131513] end_request: I/O error, dev sdj, sector 893087840
[89338.131556] ata9: EH complete
[89342.103300] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89342.103310] ata9.01: BMDMA stat 0x64
[89342.103319] ata9.01: failed command: READ DMA EXT
[89342.103333] ata9.01: cmd 25/00:00:58:72:3b/00:02:35:00:00/f0 tag 0 dma 
262144 in
         res 51/40:00:58:72:3b/40:00:35:00:00/10 Emask 0x9 (media error)                                                                                     
[89342.103340] ata9.01: status: { DRDY ERR }
[89342.103344] ata9.01: error: { UNC }
[89342.224995] ata9.00: configured for UDMA/33
[89342.230983] ata9.01: configured for UDMA/33
[89342.231022] sd 8:0:1:0: [sdj] Unhandled sense code
[89342.231025] sd 8:0:1:0: [sdj]  
[89342.231027] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89342.231029] sd 8:0:1:0: [sdj]  
[89342.231031] Sense Key : Medium Error [current] [descriptor]
[89342.231034] Descriptor sense data with sense descriptors (in hex):
[89342.231035]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[89342.231042]         35 3b 72 58 
[89342.231046] sd 8:0:1:0: [sdj]  
[89342.231049] Add. Sense: Unrecovered read error - auto reallocate failed
[89342.231051] sd 8:0:1:0: [sdj] CDB: 
[89342.231052] Read(16): 88 00 00 00 00 00 35 3b 72 58 00 00 02 00 00 00
[89342.231061] end_request: I/O error, dev sdj, sector 893088344
[89342.231065] raid5_end_read_request: 71 callbacks suppressed
[89342.231067] md/raid:md0: read error not correctable (sector 893086296 on 
sdj1).
[89342.231070] md/raid:md0: read error not correctable (sector 893086304 on 
sdj1).
[89342.231072] md/raid:md0: read error not correctable (sector 893086312 on 
sdj1).
[89342.231074] md/raid:md0: read error not correctable (sector 893086320 on 
sdj1).
[89342.231076] md/raid:md0: read error not correctable (sector 893086328 on 
sdj1).
[89342.231078] md/raid:md0: read error not correctable (sector 893086336 on 
sdj1).
[89342.231080] md/raid:md0: read error not correctable (sector 893086344 on 
sdj1).
[89342.231081] md/raid:md0: read error not correctable (sector 893086352 on 
sdj1).
[89342.231083] md/raid:md0: read error not correctable (sector 893086360 on 
sdj1).
[89342.231085] md/raid:md0: read error not correctable (sector 893086368 on 
sdj1).
[89342.231149] ata9: EH complete
[89346.169717] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89346.169727] ata9.01: BMDMA stat 0x64
[89346.169736] ata9.01: failed command: READ DMA EXT
[89346.169750] ata9.01: cmd 25/00:00:58:74:3b/00:02:35:00:00/f0 tag 0 dma 
262144 in
         res 51/40:00:58:74:3b/40:00:35:00:00/10 Emask 0x9 (media error)                                                                                     
[89346.169758] ata9.01: status: { DRDY ERR }
[89346.169763] ata9.01: error: { UNC }
[89346.198239] ata9.00: configured for UDMA/33
[89346.204166] ata9.01: configured for UDMA/33
[89346.204232] sd 8:0:1:0: [sdj] Unhandled sense code
[89346.204239] sd 8:0:1:0: [sdj]  
[89346.204243] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89346.204248] sd 8:0:1:0: [sdj]  
[89346.204251] Sense Key : Medium Error [current] [descriptor]
[89346.204258] Descriptor sense data with sense descriptors (in hex):
[89346.204261]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[89346.204278]         35 3b 74 58 
[89346.204286] sd 8:0:1:0: [sdj]  
[89346.204292] Add. Sense: Unrecovered read error - auto reallocate failed
[89346.204296] sd 8:0:1:0: [sdj] CDB: 
[89346.204299] Read(16): 88 00 00 00 00 00 35 3b 74 58 00 00 02 00 00 00
[89346.204319] end_request: I/O error, dev sdj, sector 893088856
[89346.204419] ata9: EH complete
[89353.949976] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89353.949986] ata9.01: BMDMA stat 0x64
[89353.949994] ata9.01: failed command: READ DMA EXT
[89353.950008] ata9.01: cmd 25/00:90:c8:6f:3b/00:00:35:00:00/f0 tag 0 dma 
73728 in
         res 51/40:00:e0:6f:3b/40:00:35:00:00/10 Emask 0x9 (media error)                                                                                     
[89353.950016] ata9.01: status: { DRDY ERR }
[89353.950021] ata9.01: error: { UNC }
[89353.994545] ata9.00: configured for UDMA/33
[89354.000539] ata9.01: configured for UDMA/33
[89354.000597] sd 8:0:1:0: [sdj] Unhandled sense code
[89354.000603] sd 8:0:1:0: [sdj]  
[89354.000608] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89354.000612] sd 8:0:1:0: [sdj]  
[89354.000616] Sense Key : Medium Error [current] [descriptor]
[89354.000623] Descriptor sense data with sense descriptors (in hex):
[89354.000626]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[89354.000643]         35 3b 6f e0 
[89354.000651] sd 8:0:1:0: [sdj]  
[89354.000657] Add. Sense: Unrecovered read error - auto reallocate failed
[89354.000661] sd 8:0:1:0: [sdj] CDB: 
[89354.000664] Read(16): 88 00 00 00 00 00 35 3b 6f c8 00 00 00 90 00 00
[89354.000684] end_request: I/O error, dev sdj, sector 893087712
[89354.000692] raid5_end_read_request: 118 callbacks suppressed
[89354.000697] md/raid:md0: read error not correctable (sector 893085664 on sdj1).
[89354.000706] md/raid:md0: Disk failure on sdj1, disabling device.
md/raid:md0: Operation continuing on 6 devices.
[89354.000732] md/raid:md0: read error not correctable (sector 893085672 on sdj1).
[89354.000737] md/raid:md0: read error not correctable (sector 893085680 on sdj1).
[89354.000742] md/raid:md0: read error not correctable (sector 893085688 on sdj1).
[89354.000747] md/raid:md0: read error not correctable (sector 893085696 on sdj1).
[89354.000751] md/raid:md0: read error not correctable (sector 893085704 on sdj1).
[89354.000756] md/raid:md0: read error not correctable (sector 893085712 on sdj1).
[89354.000760] md/raid:md0: read error not correctable (sector 893085720 on sdj1).
[89354.000765] md/raid:md0: read error not correctable (sector 893085728 on sdj1).
[89354.000769] md/raid:md0: read error not correctable (sector 893085736 on sdj1).
[89354.000903] ata9: EH complete
[89354.109105] md: md0: recovery interrupted.
[89354.175670] RAID conf printout:
[89354.175675]  --- level:6 rd:9 wd:6
[89354.175677]  disk 0, o:1, dev:sdc1
[89354.175679]  disk 1, o:1, dev:sdd1
[89354.175680]  disk 2, o:1, dev:sdg1
[89354.175681]  disk 3, o:1, dev:sdh1
[89354.175682]  disk 4, o:1, dev:sdi1
[89354.175683]  disk 5, o:0, dev:sdj1
[89354.175684]  disk 6, o:1, dev:sdf1
[89354.175685]  disk 7, o:1, dev:sde1
[89354.175686]  disk 8, o:1, dev:sdb1
[89354.177220] RAID conf printout:
[89354.177221]  --- level:6 rd:9 wd:6
[89354.177222]  disk 0, o:1, dev:sdc1
[89354.177223]  disk 1, o:1, dev:sdd1
[89354.177224]  disk 2, o:1, dev:sdg1
[89354.177225]  disk 3, o:1, dev:sdh1
[89354.177226]  disk 4, o:1, dev:sdi1
[89354.177227]  disk 5, o:0, dev:sdj1
[89354.177227]  disk 7, o:1, dev:sde1
[89354.177228]  disk 8, o:1, dev:sdb1
[89354.177233] RAID conf printout:
[89354.177234]  --- level:6 rd:9 wd:6
[89354.177234]  disk 0, o:1, dev:sdc1
[89354.177235]  disk 1, o:1, dev:sdd1
[89354.177236]  disk 2, o:1, dev:sdg1
[89354.177237]  disk 3, o:1, dev:sdh1
[89354.177238]  disk 4, o:1, dev:sdi1
[89354.177239]  disk 5, o:0, dev:sdj1
[89354.177240]  disk 7, o:1, dev:sde1
[89354.177241]  disk 8, o:1, dev:sdb1
[89354.179575] RAID conf printout:
[89354.179576]  --- level:6 rd:9 wd:6
[89354.179577]  disk 0, o:1, dev:sdc1
[89354.179578]  disk 1, o:1, dev:sdd1
[89354.179579]  disk 2, o:1, dev:sdg1
[89354.179580]  disk 3, o:1, dev:sdh1
[89354.179581]  disk 4, o:1, dev:sdi1
[89354.179582]  disk 5, o:0, dev:sdj1
[89354.179583]  disk 8, o:1, dev:sdb1
[89354.179585] RAID conf printout:
[89354.179586]  --- level:6 rd:9 wd:6
[89354.179587]  disk 0, o:1, dev:sdc1
[89354.179588]  disk 1, o:1, dev:sdd1
[89354.179589]  disk 2, o:1, dev:sdg1
[89354.179589]  disk 3, o:1, dev:sdh1
[89354.179590]  disk 4, o:1, dev:sdi1
[89354.179591]  disk 5, o:0, dev:sdj1
[89354.179592]  disk 8, o:1, dev:sdb1
[89354.181443] RAID conf printout:
[89354.181444]  --- level:6 rd:9 wd:6
[89354.181445]  disk 0, o:1, dev:sdc1
[89354.181446]  disk 1, o:1, dev:sdd1
[89354.181447]  disk 2, o:1, dev:sdg1
[89354.181448]  disk 3, o:1, dev:sdh1
[89354.181449]  disk 4, o:1, dev:sdi1
[89354.181450]  disk 8, o:1, dev:sdb1

[90001.391680] md0: detected capacity change from 28005493899264 to 0
[90001.391697] md: md0 stopped.
[90001.391717] md: unbind<sdf1>
[90001.396688] md: export_rdev(sdf1)
[90001.396808] md: unbind<sde1>
[90001.403661] md: export_rdev(sde1)
[90001.403726] md: unbind<sdc1>
[90001.412707] md: export_rdev(sdc1)
[90001.412867] md: unbind<sdb1>
[90001.415711] md: export_rdev(sdb1)
[90001.415782] md: unbind<sdj1>
[90001.421708] md: export_rdev(sdj1)
[90001.421783] md: unbind<sdi1>
[90001.424752] md: export_rdev(sdi1)
[90001.424909] md: unbind<sdh1>
[90001.427741] md: export_rdev(sdh1)
[90001.427807] md: unbind<sdg1>
[90001.433745] md: export_rdev(sdg1)
[90001.433812] md: unbind<sdd1>
[90001.436732] md: export_rdev(sdd1)
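
For anyone cross-checking the log above by hand: the four bytes on the second line of each sense dump appear to be the failing LBA, and they agree with the sector printed by end_request. The sketch below decodes them and also computes the gap between the whole-disk sector and the sector md reports for sdj1. The 2048-sector partition start is only inferred from these two log lines; no partition table is shown in this thread, and the helper name is made up for illustration.

```python
def sense_lba(hex_bytes):
    """Decode space-separated hex bytes (big-endian) into an integer LBA."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "big")

# Bytes copied from the two sense dumps in the log.
first = sense_lba("35 3b 74 58")
second = sense_lba("35 3b 6f e0")
print(first)   # 893088856, same as "end_request: ... sector 893088856"
print(second)  # 893087712, same as "end_request: ... sector 893087712"

# md reports the first uncorrectable read at sector 893085664 on sdj1.
# The difference is consistent with sdj1 starting at sector 2048
# (an inference, not something the thread confirms).
md_sector_on_sdj1 = 893085664
partition_offset = second - md_sector_on_sdj1
print(partition_offset)  # 2048
```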

> You should provide "smartctl -i -A -l scterc /dev/sdX" reports for your
> drives.  If you can find an old syslog for when your two worst drives
> fell out, it might help.

Here's the output for the disk with the read error for now, in case it's 
useful.

# smartctl -i -A -l scterc /dev/sdj
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-29-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN000-1H4168
Serial Number:    Z300NEB5
LU WWN Device Id: 5 000c50 063ed9f94
Firmware Version: SC43
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov  9 18:58:47 2015 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   094   094   006    Pre-fail  Always       -       28320486
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       160
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       17212021570
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       19201
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       73
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   055   055   000    Old_age   Always       -       45
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       169
190 Airflow_Temperature_Cel 0x0022   065   057   045    Old_age   Always       -       35 (Min/Max 30/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       73
194 Temperature_Celsius     0x0022   035   043   000    Old_age   Always       -       35 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       48
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       48
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)


-- 
Guillaume Paumier


* Re: Offline array, events count mismatch
  2015-11-10  3:05   ` Guillaume Paumier
@ 2015-11-10 15:50     ` Phil Turmel
  0 siblings, 0 replies; 4+ messages in thread
From: Phil Turmel @ 2015-11-10 15:50 UTC (permalink / raw)
  To: Guillaume Paumier; +Cc: linux-raid

On 11/09/2015 10:05 PM, Guillaume Paumier wrote:
> Hello Phil and the list,

> Thank you for confirming, Phil, and for the additional pointer.
> 
> I've re-assembled the array with --force, which cleaned sdj, and then I was 
> able to re-add the two other disks. The array started rebuilding and recovery 
> was past 10% when the array failed again.
> 
> It seems there was an "unrecoverable read error" on sdj, and now I'm back with 
> an array where 2 of the disks are marked as spare (sde and sdf, because their 
> rebuild didn't complete), and sdj is faulty with an event count mismatch of 4, 
> like before:

Yes, you're going to lose some data.

Your only path forward at this point is to --assemble --force without
the spares, and leave them out.  The array will be running degraded.

Apply the timeout mismatch work-arounds suited to your drives.
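
As a rough illustration of what "suited to your drives" can mean: the usual work-around is to enable a short SCT ERC limit where the drive supports it, and otherwise to raise the kernel's per-device command timeout. The sketch below only builds the command text from a `smartctl -l scterc` report and runs nothing. The function name, the 7.0-second and 180-second values, and the exact report strings it matches are assumptions for illustration, not taken from this thread.

```python
import re

def timeout_workaround(dev, scterc_report):
    """Return a shell command (as text) for one drive; nothing is executed."""
    name = dev.rsplit("/", 1)[-1]
    # ERC already active, e.g. "Read:     70 (7,0 seconds)".
    if re.search(r"Read:\s+\d+\s+\(", scterc_report):
        return f"# {dev}: SCT ERC already enabled, nothing to do"
    # ERC supported but switched off: ask for 7.0 s read/write limits.
    if re.search(r"Read:\s+Disabled", scterc_report):
        return f"smartctl -l scterc,70,70 {dev}"
    # No usable ERC: give the kernel driver more patience instead.
    return f"echo 180 > /sys/block/{name}/device/timeout"

# Report text for sdj as pasted earlier in this thread.
sdj_report = """SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)
"""
print(timeout_workaround("/dev/sdj", sdj_report))
```

Neither setting is expected to survive a power cycle, so whatever commands come out of this would have to be re-applied at every boot.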

Start copying out your files to a new backup destination.  Keep track
of which ones succeed.

/dev/sdj has 48 pending bad sectors.  You are likely to have files that
cannot be read thanks to those sectors.  Just skip them and keep going
(for now).  Note the sector addresses that fail.
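
One possible way to note the failing addresses is to pull them out of the kernel log rather than copy them by hand. The sketch below matches the "end_request: I/O error" wording seen in this thread's 3.16 kernel log; other kernel versions word the message differently, so the pattern is an assumption tied to this log, and the function name is hypothetical.

```python
import re

# Pattern for the message format seen in this thread's kernel log.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

def failed_sectors(log_text):
    """Map each device name to a sorted list of distinct failing sectors."""
    found = {}
    for dev, sector in IO_ERROR.findall(log_text):
        found.setdefault(dev, set()).add(int(sector))
    return {dev: sorted(sectors) for dev, sectors in found.items()}

# Two lines copied from the log quoted earlier in the thread.
sample = """\
[89346.204319] end_request: I/O error, dev sdj, sector 893088856
[89354.000684] end_request: I/O error, dev sdj, sector 893087712
"""
print(failed_sectors(sample))  # {'sdj': [893087712, 893088856]}
```

These are whole-disk sector numbers (dev sdj), not offsets within the sdj1 partition.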

You may have to do forced assembly multiple times to get through the
entire backup.

Write zeroes over the bad sectors to clear the UREs.  If the files are
worthless with those zeroes in them, just delete them.  Do this for all
drives that have UREs.  Then you can add the spares back in to rebuild.
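
A minimal sketch of how the zeroing commands could be prepared, assuming GNU dd and whole-disk sector numbers like the ones end_request prints. It rounds each address down to the start of its 4096-byte physical sector, since sdj reports 512-byte logical and 4096-byte physical sectors, and it only returns the command text. The function name and the choice of dd are illustrative assumptions. The commands are destructive, so each address should be double-checked and the backup finished before anything is run.

```python
def zero_command(dev, lba, logical=512, physical=4096):
    """Build (not run) a dd command that zeroes the physical sector holding lba."""
    per_block = physical // logical   # 8 logical sectors per physical sector
    start = lba - (lba % per_block)   # align down to the physical sector start
    return (f"dd if=/dev/zero of={dev} bs={logical} "
            f"seek={start} count={per_block} oflag=direct")

# Whole-disk sectors reported for sdj in the kernel log quoted in this thread.
for lba in (893087712, 893088856):
    print(zero_command("/dev/sdj", lba))
```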

Going forward, you need to apply work-arounds for non-raid drives at
every power cycle, buy raid-rated drives for future replacements, and
use cron to run regular scrubs to keep the UREs under control.
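
For the scrub part, a sketch of what a cron job could call, assuming the array is md0 and the script runs as root. Writing "check" to the array's sync_action file in sysfs asks md to read every stripe, which is what lets latent read errors be found and rewritten while redundancy is still available. The function name and the sysfs_root parameter (there so the function can be tried against a scratch directory) are illustrative.

```python
from pathlib import Path

def start_scrub(md_name="md0", sysfs_root="/sys/block"):
    """Request a scrub of one md array; return False if it is already busy."""
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    # Only start a check when no resync, recovery or check is in progress.
    if action.read_text().strip() != "idle":
        return False
    action.write_text("check\n")
    return True
```

A weekly or monthly cron entry calling `start_scrub()` would be one way to schedule this; some distributions already ship a similar job with their mdadm package.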

Show the smartctl reports for all of your drives if you'd like more
specific advice.  And turn off word wrapping when you paste, please.

Phil



Thread overview: 4+ messages
2015-11-09  2:49 Offline array, events count mismatch Guillaume Paumier
2015-11-09  3:35 ` Phil Turmel
2015-11-10  3:05   ` Guillaume Paumier
2015-11-10 15:50     ` Phil Turmel
