* Offline array, events count mismatch
From: Guillaume Paumier @ 2015-11-09 2:49 UTC (permalink / raw)
To: linux-raid
Hello folks,
I reached out to you a few months ago when a --grow went awry. In the end I
managed to restore my array thanks to this mailing list and the invaluable
help of IRC user frostschutz.
I'm now facing another issue and I'm hoping you can help me again.
Today I found that my 9-disk RAID6 array was offline. When I looked at the
machine, two disks seemed to have disappeared; they didn't show up in fdisk or
anywhere else. A third one was marked as "faulty" by mdadm.
At first, I was puzzled because it seemed improbable that three disks had
failed at the same time. I removed the array from fstab and rebooted. The two
vanished disks reappeared (in fdisk too), and when examining the partitions, I
noticed the following event counts:
/dev/sdb1:
Events : 198477
/dev/sdc1:
Events : 198477
/dev/sdd1:
Events : 198477
/dev/sde1:
Events : 54264
/dev/sdf1:
Events : 54264
/dev/sdg1:
Events : 198477
/dev/sdh1:
Events : 198477
/dev/sdi1:
Events : 198477
/dev/sdj1:
Events : 198473
Looking at those event counts, my understanding is this:
* Two of the disks (sde, sdf) were dropped from the array for some reason.
* I didn't notice this immediately (an issue I'm addressing separately).
* A third disk (sdj) encountered a small issue today.
* The array went offline because it didn't have enough disks to function
cleanly any more.
If I understand the documentation [1] correctly, since the event count for sdj
is very close to the event count of sd[b,c,d,g,h,i], I should be able to
reassemble the array with these 7 disks using --force, leaving sde and sdf
aside. Once the array is assembled, I should be able to re-add sde and sdf,
and they will be re-synced.
[1]
https://raid.wiki.kernel.org/index.php/RAID_Recovery#Trying_to_assemble_using_--force
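For my own sanity, I mechanized the check on the event counts listed above.
This is just a sketch; the 50-event cutoff is an arbitrary illustration for
the script, not an mdadm rule:

```shell
# Flag disks close enough to the newest event count to force-assemble,
# versus stale disks to leave out and re-add later. Counts are the ones
# reported by `mdadm --examine` above.
examine=$(cat <<'EOF'
/dev/sdb1 198477
/dev/sdc1 198477
/dev/sdd1 198477
/dev/sde1 54264
/dev/sdf1 54264
/dev/sdg1 198477
/dev/sdh1 198477
/dev/sdi1 198477
/dev/sdj1 198473
EOF
)
# Highest event count across all members.
max=$(printf '%s\n' "$examine" | awk '{print $2}' | sort -n | tail -1)
# Anything within 50 events of the max is a force-assembly candidate
# (50 is an arbitrary illustrative threshold, not an mdadm constant).
verdict=$(printf '%s\n' "$examine" | awk -v max="$max" \
    '{ v = (max - $2 <= 50) ? "force-assemble" : "leave out, re-add later";
       print $1, v }')
printf '%s\n' "$verdict"
```

With sdj confirmed close, the plan would be something like
`mdadm --assemble --force /dev/md0 /dev/sd[bcdghi]1 /dev/sdj1`, then
`mdadm /dev/md0 --re-add` for sde1 and sdf1 once the array is up; again,
assuming my reading of [1] is right.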
I prefer to be cautious and ask here before doing anything that could make
things worse. It would be great if you could confirm that my understanding is
correct, and tell me if this plan is sound.
I'm including some more detailed information below. Let me know if there's any
other information that would be useful.
Many thanks,
===========================================================
Before the reboot: mdadm -D
-----------------------------------------------------------
# mdadm -D /dev/md0
/dev/md0:
Version : 1.0
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 3907016448 (3726.02 GiB 4000.78 GB)
Raid Devices : 9
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Nov 8 06:36:50 2015
State : clean, FAILED
Active Devices : 6
Working Devices : 6
Failed Devices : 2
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
UUID : eea59047:120a0365:353da182:6787e030
Events : 198477
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 49 1 active sync /dev/sdd1
2 8 97 2 active sync /dev/sdg1
3 8 113 3 active sync /dev/sdh1
4 8 129 4 active sync /dev/sdi1
10 0 0 10 removed
12 0 0 12 removed
14 0 0 14 removed
8 8 17 8 active sync /dev/sdb1
5 8 145 - faulty /dev/sdj1
6 8 65 - faulty /dev/sde1
===========================================================
Before the reboot: mdadm --examine
-----------------------------------------------------------
# mdadm --examine /dev/sd[b-j]1
/dev/sdb1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033128 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 91b187fd:f416880a:f5e81e49:92615e07
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : 30050dee - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 8
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : e1b689b5:b4a2c5a7:56057b69:a9101af0
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 8e546a7e - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 0
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 1d8e74d3:9abd37f8:f2cf0ab8:02fdcfd6
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 31f71397 - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 1
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
mdadm: No md superblock detected on /dev/sde1.
/dev/sdg1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : b24758e6:042412c5:9b5a3c06:f167aedf
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 68c5292e - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 2
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdh1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 00e47d82:b49c3905:3ed961fe:40a5f259
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : b77bfa1e - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 3
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdi1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : a7e34040:fa12382f:c2ef3d85:9c95b1d0
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 9cd876ec - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 4
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdj1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 9d89c55d:9f4a2181:6b87922f:0681d580
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:38 2015
Checksum : 66c5dfd2 - correct
Events : 198473
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 5
Array State : AAAAAA..A ('A' == active, '.' == missing, 'R' == replacing)
===========================================================
After the reboot: mdadm --examine
-----------------------------------------------------------
# mdadm --examine /dev/sd[b-j]1
/dev/sdb1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033128 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 91b187fd:f416880a:f5e81e49:92615e07
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : 30050dee - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 8
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : e1b689b5:b4a2c5a7:56057b69:a9101af0
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 8e546a7e - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 0
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 1d8e74d3:9abd37f8:f2cf0ab8:02fdcfd6
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 31f71397 - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 1
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : ddf17d3d:ea944bfb:6886cc91:3366f55f
Internal Bitmap : -16 sectors from superblock
Update Time : Wed Oct 7 10:17:35 2015
Checksum : 1dd30b1 - correct
Events : 54264
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 7
Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 38675f59:ea412b1f:67d6ed9a:a33fc5dd
Internal Bitmap : -16 sectors from superblock
Update Time : Wed Oct 7 10:17:35 2015
Checksum : c88f7c7b - correct
Events : 54264
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 6
Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : b24758e6:042412c5:9b5a3c06:f167aedf
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 68c5292e - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 2
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdh1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 00e47d82:b49c3905:3ed961fe:40a5f259
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : b77bfa1e - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 3
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdi1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : a7e34040:fa12382f:c2ef3d85:9c95b1d0
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:50 2015
Checksum : 9cd876ec - correct
Events : 198477
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 4
Array State : AAAAA...A ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdj1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : eea59047:120a0365:353da182:6787e030
Creation Time : Thu Aug 1 12:23:07 2013
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 7814033136 (3726.02 GiB 4000.78 GB)
Array Size : 27349115136 (26082.15 GiB 28005.49 GB)
Used Dev Size : 7814032896 (3726.02 GiB 4000.78 GB)
Super Offset : 7814033392 sectors
Unused Space : before=0 sectors, after=480 sectors
State : clean
Device UUID : 9d89c55d:9f4a2181:6b87922f:0681d580
Internal Bitmap : -16 sectors from superblock
Update Time : Sun Nov 8 06:36:38 2015
Checksum : 66c5dfd2 - correct
Events : 198473
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 5
Array State : AAAAAA..A ('A' == active, '.' == missing, 'R' == replacing)
===========================================================
--
Guillaume Paumier
* Re: Offline array, events count mismatch
From: Phil Turmel @ 2015-11-09 3:35 UTC (permalink / raw)
To: Guillaume Paumier, linux-raid
Hi Guillaume,
On 11/08/2015 09:49 PM, Guillaume Paumier wrote:
[trim /]
> Looking at those event counts, my understanding is this:
> * Two of the disks (sde, sdf) were dropped from the array for some reason.
> * I didn't notice this immediately (an issue I'm addressing separately).
> * A third disk (sdj) encountered a small issue today.
> * The array went offline because it didn't have enough disks to function
> cleanly any more.
>
> If I understand the documentation [1] correctly, since the event count for sdj
> is very close to the event count of sd[b,c,d,g,h,i], I should be able to re-
> assemble the array with these 7 disks using --force, leaving sde and sdf
> aside. Once the array is assembled, I should be able to re-add sde and sdf,
> and they will be re-sync'd.
Yes, that is the correct response.
Your situation is common. Please see the thread started this weekend by
Francisco Parada.
https://marc.info/?t=144691643300001&r=1&w=2&n=12
You should provide "smartctl -i -A -l scterc /dev/sdX" reports for your
drives. If you can find an old syslog for when your two worst drives
fell out, it might help.
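Something along these lines would gather them all in one pass (a sketch;
assumes smartmontools is installed and it runs as root):

```shell
# Collect identity, SMART attributes, and SCT ERC settings for every
# array member, one report file per drive.
for d in b c d e f g h i j; do
    smartctl -i -A -l scterc "/dev/sd$d" > "smart_sd$d.txt" 2>&1 || true
done
ls smart_sd?.txt
```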
Phil
* Re: Offline array, events count mismatch
From: Guillaume Paumier @ 2015-11-10 3:05 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid
Hello Phil and the list,
On Sunday, November 8, 2015, at 22:35:13, Phil Turmel wrote:
>
> On 11/08/2015 09:49 PM, Guillaume Paumier wrote:
> >
> > If I understand the documentation [1] correctly, since the event count for
> > sdj is very close to the event count of sd[b,c,d,g,h,i], I should be able
> > to reassemble the array with these 7 disks using --force, leaving sde
> > and sdf aside. Once the array is assembled, I should be able to re-add
> > sde and sdf, and they will be re-synced.
>
> Yes, that is the correct response.
>
> Your situation is common. Please see the thread this weekend started by
> Franscisco Parada.
Thank you for confirming, Phil, and for the additional pointer.
I've re-assembled the array with --force, which cleaned sdj, and then I was
able to re-add the two other disks. The array started rebuilding and recovery
was past 10% when the array failed again.
It seems there was an "unrecoverable read error" on sdj, and now I'm back with
an array where 2 of the disks are marked as spare (sde and sdf, because their
rebuild didn't complete), and sdj is faulty with an event count mismatch of 4,
like before:
/dev/sdb1:
Events : 198704
/dev/sdc1:
Events : 198704
/dev/sdd1:
Events : 198704
/dev/sde1:
Events : 198704
/dev/sdf1:
Events : 198704
/dev/sdg1:
Events : 198704
/dev/sdh1:
Events : 198704
/dev/sdi1:
Events : 198704
/dev/sdj1:
Events : 198700
Below is the output of dmesg with more details on the read error.
Is there any way I can move past this? This error is preventing me from
rebuilding the array, and I'm assuming it would also prevent me from copying
the data off the array without rebuilding, so I'm not sure how to proceed. Any
guidance would be much appreciated.
[88233.712961] md: recovery of RAID array md0
[88233.712965] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[88233.712967] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for recovery.
[88233.712978] md: using 128k window, over a total of 3907016448k.
[88953.752335] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[88953.752345] ata9.01: BMDMA stat 0x64
[88953.752353] ata9.01: failed command: READ DMA EXT
[88953.752368] ata9.01: cmd 25/00:00:00:fc:e8/00:02:27:00:00/f0 tag 0 dma
262144 in
res 51/40:00:f8:fd:e8/40:00:27:00:00/10 Emask 0x9 (media error)
[88953.752375] ata9.01: status: { DRDY ERR }
[88953.752380] ata9.01: error: { UNC }
[88953.793877] ata9.00: configured for UDMA/33
[88953.799795] ata9.01: configured for UDMA/33
[88953.799855] sd 8:0:1:0: [sdj] Unhandled sense code
[88953.799858] sd 8:0:1:0: [sdj]
[88953.799860] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[88953.799862] sd 8:0:1:0: [sdj]
[88953.799864] Sense Key : Medium Error [current] [descriptor]
[88953.799867] Descriptor sense data with sense descriptors (in hex):
[88953.799868] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[88953.799875] 27 e8 fd f8
[88953.799879] sd 8:0:1:0: [sdj]
[88953.799882] Add. Sense: Unrecovered read error - auto reallocate failed
[88953.799884] sd 8:0:1:0: [sdj] CDB:
[88953.799885] Read(16): 88 00 00 00 00 00 27 e8 fc 00 00 00 02 00 00 00
[88953.799894] end_request: I/O error, dev sdj, sector 669580792
[88953.799898] md/raid:md0: read error not correctable (sector 669578744 on
sdj1).
[88953.799924] ata9: EH complete
[89333.138473] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89333.138478] ata9.01: BMDMA stat 0x64
[89333.138482] ata9.01: failed command: READ DMA EXT
[89333.138488] ata9.01: cmd 25/00:00:58:6e:3b/00:02:35:00:00/f0 tag 0 dma
262144 in
res 51/40:00:c8:6f:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89333.138491] ata9.01: status: { DRDY ERR }
[89333.138493] ata9.01: error: { UNC }
[89333.147985] ata9.00: configured for UDMA/33
[89333.153966] ata9.01: configured for UDMA/33
[89333.154022] sd 8:0:1:0: [sdj] Unhandled sense code
[89333.154025] sd 8:0:1:0: [sdj]
[89333.154027] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89333.154029] sd 8:0:1:0: [sdj]
[89333.154031] Sense Key : Medium Error [current] [descriptor]
[89333.154034] Descriptor sense data with sense descriptors (in hex):
[89333.154035] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89333.154042] 35 3b 6f c8
[89333.154046] sd 8:0:1:0: [sdj]
[89333.154048] Add. Sense: Unrecovered read error - auto reallocate failed
[89333.154050] sd 8:0:1:0: [sdj] CDB:
[89333.154052] Read(16): 88 00 00 00 00 00 35 3b 6e 58 00 00 02 00 00 00
[89333.154061] end_request: I/O error, dev sdj, sector 893087688
[89333.154064] md/raid:md0: read error not correctable (sector 893085640 on
sdj1).
[89333.154067] md/raid:md0: read error not correctable (sector 893085648 on
sdj1).
[89333.154069] md/raid:md0: read error not correctable (sector 893085656 on
sdj1).
[89333.154071] md/raid:md0: read error not correctable (sector 893085664 on
sdj1).
[89333.154073] md/raid:md0: read error not correctable (sector 893085672 on
sdj1).
[89333.154075] md/raid:md0: read error not correctable (sector 893085680 on
sdj1).
[89333.154077] md/raid:md0: read error not correctable (sector 893085688 on
sdj1).
[89333.154079] md/raid:md0: read error not correctable (sector 893085696 on
sdj1).
[89333.154081] md/raid:md0: read error not correctable (sector 893085704 on
sdj1).
[89333.154083] md/raid:md0: read error not correctable (sector 893085712 on
sdj1).
[89333.154111] ata9: EH complete
[89338.097012] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89338.097016] ata9.01: BMDMA stat 0x64
[89338.097019] ata9.01: failed command: READ DMA EXT
[89338.097023] ata9.01: cmd 25/00:00:58:70:3b/00:02:35:00:00/f0 tag 0 dma
262144 in
res 51/40:00:60:70:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89338.097025] ata9.01: status: { DRDY ERR }
[89338.097026] ata9.01: error: { UNC }
[89338.125468] ata9.00: configured for UDMA/33
[89338.131458] ata9.01: configured for UDMA/33
[89338.131489] sd 8:0:1:0: [sdj] Unhandled sense code
[89338.131491] sd 8:0:1:0: [sdj]
[89338.131492] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89338.131493] sd 8:0:1:0: [sdj]
[89338.131494] Sense Key : Medium Error [current] [descriptor]
[89338.131496] Descriptor sense data with sense descriptors (in hex):
[89338.131497] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89338.131502] 35 3b 70 60
[89338.131504] sd 8:0:1:0: [sdj]
[89338.131506] Add. Sense: Unrecovered read error - auto reallocate failed
[89338.131507] sd 8:0:1:0: [sdj] CDB:
[89338.131508] Read(16): 88 00 00 00 00 00 35 3b 70 58 00 00 02 00 00 00
[89338.131513] end_request: I/O error, dev sdj, sector 893087840
[89338.131556] ata9: EH complete
[89342.103300] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89342.103310] ata9.01: BMDMA stat 0x64
[89342.103319] ata9.01: failed command: READ DMA EXT
[89342.103333] ata9.01: cmd 25/00:00:58:72:3b/00:02:35:00:00/f0 tag 0 dma
262144 in
res 51/40:00:58:72:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89342.103340] ata9.01: status: { DRDY ERR }
[89342.103344] ata9.01: error: { UNC }
[89342.224995] ata9.00: configured for UDMA/33
[89342.230983] ata9.01: configured for UDMA/33
[89342.231022] sd 8:0:1:0: [sdj] Unhandled sense code
[89342.231025] sd 8:0:1:0: [sdj]
[89342.231027] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89342.231029] sd 8:0:1:0: [sdj]
[89342.231031] Sense Key : Medium Error [current] [descriptor]
[89342.231034] Descriptor sense data with sense descriptors (in hex):
[89342.231035] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89342.231042] 35 3b 72 58
[89342.231046] sd 8:0:1:0: [sdj]
[89342.231049] Add. Sense: Unrecovered read error - auto reallocate failed
[89342.231051] sd 8:0:1:0: [sdj] CDB:
[89342.231052] Read(16): 88 00 00 00 00 00 35 3b 72 58 00 00 02 00 00 00
[89342.231061] end_request: I/O error, dev sdj, sector 893088344
[89342.231065] raid5_end_read_request: 71 callbacks suppressed
[89342.231067] md/raid:md0: read error not correctable (sector 893086296 on
sdj1).
[89342.231070] md/raid:md0: read error not correctable (sector 893086304 on
sdj1).
[89342.231072] md/raid:md0: read error not correctable (sector 893086312 on
sdj1).
[89342.231074] md/raid:md0: read error not correctable (sector 893086320 on
sdj1).
[89342.231076] md/raid:md0: read error not correctable (sector 893086328 on
sdj1).
[89342.231078] md/raid:md0: read error not correctable (sector 893086336 on
sdj1).
[89342.231080] md/raid:md0: read error not correctable (sector 893086344 on
sdj1).
[89342.231081] md/raid:md0: read error not correctable (sector 893086352 on
sdj1).
[89342.231083] md/raid:md0: read error not correctable (sector 893086360 on
sdj1).
[89342.231085] md/raid:md0: read error not correctable (sector 893086368 on
sdj1).
[89342.231149] ata9: EH complete
[89346.169717] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89346.169727] ata9.01: BMDMA stat 0x64
[89346.169736] ata9.01: failed command: READ DMA EXT
[89346.169750] ata9.01: cmd 25/00:00:58:74:3b/00:02:35:00:00/f0 tag 0 dma
262144 in
res 51/40:00:58:74:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89346.169758] ata9.01: status: { DRDY ERR }
[89346.169763] ata9.01: error: { UNC }
[89346.198239] ata9.00: configured for UDMA/33
[89346.204166] ata9.01: configured for UDMA/33
[89346.204232] sd 8:0:1:0: [sdj] Unhandled sense code
[89346.204239] sd 8:0:1:0: [sdj]
[89346.204243] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89346.204248] sd 8:0:1:0: [sdj]
[89346.204251] Sense Key : Medium Error [current] [descriptor]
[89346.204258] Descriptor sense data with sense descriptors (in hex):
[89346.204261] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89346.204278] 35 3b 74 58
[89346.204286] sd 8:0:1:0: [sdj]
[89346.204292] Add. Sense: Unrecovered read error - auto reallocate failed
[89346.204296] sd 8:0:1:0: [sdj] CDB:
[89346.204299] Read(16): 88 00 00 00 00 00 35 3b 74 58 00 00 02 00 00 00
[89346.204319] end_request: I/O error, dev sdj, sector 893088856
[89346.204419] ata9: EH complete
[89353.949976] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89353.949986] ata9.01: BMDMA stat 0x64
[89353.949994] ata9.01: failed command: READ DMA EXT
[89353.950008] ata9.01: cmd 25/00:90:c8:6f:3b/00:00:35:00:00/f0 tag 0 dma
73728 in
res 51/40:00:e0:6f:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89353.950016] ata9.01: status: { DRDY ERR }
[89353.950021] ata9.01: error: { UNC }
[89353.994545] ata9.00: configured for UDMA/33
[89354.000539] ata9.01: configured for UDMA/33
[89354.000597] sd 8:0:1:0: [sdj] Unhandled sense code
[89354.000603] sd 8:0:1:0: [sdj]
[89354.000608] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89354.000612] sd 8:0:1:0: [sdj]
[89354.000616] Sense Key : Medium Error [current] [descriptor]
[89354.000623] Descriptor sense data with sense descriptors (in hex):
[89354.000626] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89354.000643] 35 3b 6f e0
[89354.000651] sd 8:0:1:0: [sdj]
[89354.000657] Add. Sense: Unrecovered read error - auto reallocate failed
[89354.000661] sd 8:0:1:0: [sdj] CDB:
[89354.000664] Read(16): 88 00 00 00 00 00 35 3b 6f c8 00 00 00 90 00 00
[89354.000684] end_request: I/O error, dev sdj, sector 893087712
[89354.000692] raid5_end_read_request: 118 callbacks suppressed
[89354.000697] md/raid:md0: read error not correctable (sector 893085664 on
sdj1).
[89354.000706] md/raid:md0: Disk failure on sdj1, disabling device.
md/raid:md0: Operation continuing on 6 devices.
[89354.000732] md/raid:md0: read error not correctable (sector 893085672 on
sdj1).
[89354.000737] md/raid:md0: read error not correctable (sector 893085680 on
sdj1).
[89354.000742] md/raid:md0: read error not correctable (sector 893085688 on
sdj1).
[89354.000747] md/raid:md0: read error not correctable (sector 893085696 on
sdj1).
[89354.000751] md/raid:md0: read error not correctable (sector 893085704 on
sdj1).
[89354.000756] md/raid:md0: read error not correctable (sector 893085712 on
sdj1).
[89354.000760] md/raid:md0: read error not correctable (sector 893085720 on
sdj1).
[89354.000765] md/raid:md0: read error not correctable (sector 893085728 on
sdj1).
[89354.000769] md/raid:md0: read error not correctable (sector 893085736 on
sdj1).
[89354.000903] ata9: EH complete
[89354.109105] md: md0: recovery interrupted.
[89354.175670] RAID conf printout:
[89354.175675] --- level:6 rd:9 wd:6
[89354.175677] disk 0, o:1, dev:sdc1
[89354.175679] disk 1, o:1, dev:sdd1
[89354.175680] disk 2, o:1, dev:sdg1
[89354.175681] disk 3, o:1, dev:sdh1
[89354.175682] disk 4, o:1, dev:sdi1
[89354.175683] disk 5, o:0, dev:sdj1
[89354.175684] disk 6, o:1, dev:sdf1
[89354.175685] disk 7, o:1, dev:sde1
[89354.175686] disk 8, o:1, dev:sdb1
[89354.177220] RAID conf printout:
[89354.177221] --- level:6 rd:9 wd:6
[89354.177222] disk 0, o:1, dev:sdc1
[89354.177223] disk 1, o:1, dev:sdd1
[89354.177224] disk 2, o:1, dev:sdg1
[89354.177225] disk 3, o:1, dev:sdh1
[89354.177226] disk 4, o:1, dev:sdi1
[89354.177227] disk 5, o:0, dev:sdj1
[89354.177227] disk 7, o:1, dev:sde1
[89354.177228] disk 8, o:1, dev:sdb1
[89354.177233] RAID conf printout:
[89354.177234] --- level:6 rd:9 wd:6
[89354.177234] disk 0, o:1, dev:sdc1
[89354.177235] disk 1, o:1, dev:sdd1
[89354.177236] disk 2, o:1, dev:sdg1
[89354.177237] disk 3, o:1, dev:sdh1
[89354.177238] disk 4, o:1, dev:sdi1
[89354.177239] disk 5, o:0, dev:sdj1
[89354.177240] disk 7, o:1, dev:sde1
[89354.177241] disk 8, o:1, dev:sdb1
[89354.179575] RAID conf printout:
[89354.179576] --- level:6 rd:9 wd:6
[89354.179577] disk 0, o:1, dev:sdc1
[89354.179578] disk 1, o:1, dev:sdd1
[89354.179579] disk 2, o:1, dev:sdg1
[89354.179580] disk 3, o:1, dev:sdh1
[89354.179581] disk 4, o:1, dev:sdi1
[89354.179582] disk 5, o:0, dev:sdj1
[89354.179583] disk 8, o:1, dev:sdb1
[89354.179585] RAID conf printout:
[89354.179586] --- level:6 rd:9 wd:6
[89354.179587] disk 0, o:1, dev:sdc1
[89354.179588] disk 1, o:1, dev:sdd1
[89354.179589] disk 2, o:1, dev:sdg1
[89354.179589] disk 3, o:1, dev:sdh1
[89354.179590] disk 4, o:1, dev:sdi1
[89354.179591] disk 5, o:0, dev:sdj1
[89354.179592] disk 8, o:1, dev:sdb1
[89354.181443] RAID conf printout:
[89354.181444] --- level:6 rd:9 wd:6
[89354.181445] disk 0, o:1, dev:sdc1
[89354.181446] disk 1, o:1, dev:sdd1
[89354.181447] disk 2, o:1, dev:sdg1
[89354.181448] disk 3, o:1, dev:sdh1
[89354.181449] disk 4, o:1, dev:sdi1
[89354.181450] disk 8, o:1, dev:sdb1
[90001.391680] md0: detected capacity change from 28005493899264 to 0
[90001.391697] md: md0 stopped.
[90001.391717] md: unbind<sdf1>
[90001.396688] md: export_rdev(sdf1)
[90001.396808] md: unbind<sde1>
[90001.403661] md: export_rdev(sde1)
[90001.403726] md: unbind<sdc1>
[90001.412707] md: export_rdev(sdc1)
[90001.412867] md: unbind<sdb1>
[90001.415711] md: export_rdev(sdb1)
[90001.415782] md: unbind<sdj1>
[90001.421708] md: export_rdev(sdj1)
[90001.421783] md: unbind<sdi1>
[90001.424752] md: export_rdev(sdi1)
[90001.424909] md: unbind<sdh1>
[90001.427741] md: export_rdev(sdh1)
[90001.427807] md: unbind<sdg1>
[90001.433745] md: export_rdev(sdg1)
[90001.433812] md: unbind<sdd1>
[90001.436732] md: export_rdev(sdd1)
> You should provide "smartctl -i -A -l scterc /dev/sdX" reports for your
> drives. If you can find an old syslog for when your two worst drives
> fell out, it might help.
Here's the output for the disk with the read error for now, in case it's
useful.
# smartctl -i -A -l scterc /dev/sdj
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-29-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST4000VN000-1H4168
Serial Number: Z300NEB5
LU WWN Device Id: 5 000c50 063ed9f94
Firmware Version: SC43
User Capacity: 4 000 787 030 016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5900 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Nov 9 18:58:47 2015 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 094   094   006    Pre-fail Always  -           28320486
  3 Spin_Up_Time            0x0003 092   092   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           73
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           160
  7 Seek_Error_Rate         0x000f 069   060   030    Pre-fail Always  -           17212021570
  9 Power_On_Hours          0x0032 079   079   000    Old_age  Always  -           19201
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           73
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 055   055   000    Old_age  Always  -           45
188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x003a 001   001   000    Old_age  Always  -           169
190 Airflow_Temperature_Cel 0x0022 065   057   045    Old_age  Always  -           35 (Min/Max 30/37)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           28
193 Load_Cycle_Count        0x0032 100   100   000    Old_age  Always  -           73
194 Temperature_Celsius     0x0022 035   043   000    Old_age  Always  -           35 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           48
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           48
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
SCT Error Recovery Control:
Read: 70 (7,0 seconds)
Write: 70 (7,0 seconds)
--
Guillaume Paumier
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Offline array, events count mismatch
2015-11-10 3:05 ` Guillaume Paumier
@ 2015-11-10 15:50 ` Phil Turmel
0 siblings, 0 replies; 4+ messages in thread
From: Phil Turmel @ 2015-11-10 15:50 UTC (permalink / raw)
To: Guillaume Paumier; +Cc: linux-raid
On 11/09/2015 10:05 PM, Guillaume Paumier wrote:
> Hello Phil and the list,
> Thank you for confirming, Phil, and for the additional pointer.
>
> I've re-assembled the array with --force, which cleaned sdj, and then I was
> able to re-add the two other disks. The array started rebuilding and recovery
> was past 10% when the array failed again.
>
> It seems there was an "unrecoverable read error" on sdj, and now I'm back with
> an array where 2 of the disks are marked as spare (sde and sdf, because their
> rebuild didn't complete), and sdj is faulty with an event count mismatch of 4,
> like before:
Yes, you're going to lose some data.
Your only path forward at this point is to --assemble --force without
the spares, and leave them out. The array will be running degraded.
Apply the timeout mismatch work-arounds suited to your drives.
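A hedged sketch of these two steps, using the device names from earlier in this thread (verify them against your own `mdadm --examine` output first, since names can shift across reboots):

```shell
# Force-assemble the seven good members; sde1/sdf1 are left out on purpose,
# so the array comes up degraded (7 of 9, still readable under RAID6).
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 \
    /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1

# Timeout mismatch workaround: cap each drive's internal error recovery
# (these ST4000VN000s report SCT ERC support; value is in tenths of a second).
for d in /dev/sd[bcdghij]; do
    smartctl -l scterc,70,70 "$d"
done
# For any drive that rejects SCT ERC, raise the kernel-side timeout above
# the drive's worst-case retry time instead:
#   echo 180 > /sys/block/sdX/device/timeout
```

The scterc settings and the sysfs timeout both reset on power cycle, which is why they must be reapplied at every boot.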
Start copying your files out to a new backup destination, and keep track
of which ones succeed.
/dev/sdj has 48 pending bad sectors. You are likely to have files that
cannot be read thanks to those sectors. Just skip them and keep going
(for now). Note the sector addresses that fail.
You may have to do forced assembly multiple times to get through the
entire backup.
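One way to sketch the copy-and-track step (the mount points /mnt/md0 and /mnt/backup are placeholders for illustration):

```shell
# rsync keeps going past unreadable files and reports each failure on
# stderr, which gives a ready-made list of files hit by pending sectors.
rsync -av /mnt/md0/ /mnt/backup/ 2> rsync-errors.log

# After a forced re-assembly, re-running the same command resumes the
# backup; files already transferred are skipped.
```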
Write zeroes over the bad sectors to clear the UREs. If the files are
worthless with those zeroes in them, just delete them. Do this for all
drives that have UREs. Then you can add the spares back in to rebuild.
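A cautious sketch of clearing one URE with hdparm, using a sector address from the kernel log quoted above (confirm the address on your own system first; the write destroys the data at that LBA):

```shell
# Confirm the sector really is unreadable before touching it:
hdparm --read-sector 893085728 /dev/sdj    # expect an I/O error here

# Overwrite it with zeroes; the drive remaps the sector if the medium
# itself is bad. The safety flag is mandatory for write-sector.
hdparm --yes-i-know-what-i-am-doing --write-sector 893085728 /dev/sdj
```

Once the writes stick, attribute 197 Current_Pending_Sector in smartctl should drop accordingly.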
Going forward, you need to apply work-arounds for non-raid drives at
every power cycle, buy raid-rated drives for future replacements, and
use cron to run regular scrubs to keep the UREs under control.
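A scrub can be scheduled with a one-line cron entry (the path and schedule here are only an example; some distros already ship an equivalent job):

```shell
# /etc/cron.d/md-scrub -- monthly "check" scrub of md0.
# A check pass reads every sector, so UREs surface while RAID6
# redundancy can still repair them.
0 3 1 * * root echo check > /sys/block/md0/md/sync_action
```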
Show the smartctl reports for all of your drives if you'd like more
specific advice. And turn off word wrapping when you paste, please.
Phil
2015-11-09 2:49 Offline array, events count mismatch Guillaume Paumier
2015-11-09 3:35 ` Phil Turmel
2015-11-10 3:05 ` Guillaume Paumier
2015-11-10 15:50 ` Phil Turmel