* Does a "check" of a RAID6 actually read all disks in a stripe?
@ 2020-04-28 6:47 Brad Campbell
2020-04-28 11:02 ` Brad Campbell
2020-04-28 15:40 ` Piergiorgio Sartor
0 siblings, 2 replies; 8+ messages in thread
From: Brad Campbell @ 2020-04-28 6:47 UTC (permalink / raw)
To: linux-raid
G'day all,
I have a test server with some old disks I use for beating up on. Bear in mind the disks are old and dicey which is *why* they live in a test server. I'm not after reliability, I'm more interested in finding corner cases.
One disk has a persistent read error (pending sector). This can be identified easily with dd on a specific or whole disk basis.
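An illustrative sketch of that dd probe follows. The device name is a placeholder and the LBA is the one from the logs below; since the real probe needs the failing drive, the runnable part only demonstrates the same skip/count mechanics on a scratch file:

```shell
# Real-disk form (placeholder device; needs the actual failing drive):
#   dd if=/dev/sdX of=/dev/null bs=512 skip=54126096 count=8 iflag=direct
# Same skip/count mechanics exercised on a scratch file:
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=512 count=16 status=none
if dd if="$tmp" of=/dev/null bs=512 skip=8 count=8 status=none; then
  probe="read OK"
else
  probe="read error"
fi
echo "$probe"
rm -f "$tmp"
```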
The array has 9 2TB drives in a RAID6 :
md3 : active raid6 sdh[12] sdm[8] sdc[10] sde[6] sdj[9] sdk[4] sdl[11] sdg[13]
13673684416 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/8] [UU_UUUUUU]
bitmap: 0/15 pages [0KB], 65536KB chunk
Ignore the missing disk; it's out being secure-erased right now, but it was in for the tests.
The read error is on sdj, about 23G into the disk :
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 Sense Key : 0x3 [current]
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 ASC=0x11 ASCQ=0x0
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 CDB: opcode=0x28 28 00 03 39 d8 08 00 20 00 00
[Sun Apr 26 15:05:30 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 0
Trigger a "check" :
[Mon Apr 27 18:51:15 2020] md: data-check of RAID array md3
[Tue Apr 28 03:42:21 2020] md: md3: data-check done.
Just to be sure it's still there :
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 Sense Key : 0x3 [current]
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 ASC=0x11 ASCQ=0x0
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 CDB: opcode=0x28 28 00 03 39 e6 10 00 00 08 00
[Tue Apr 28 14:13:33 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
So I can read from the disk with dd and trigger a read error each and every time, but a RAID6 "check" appears to skip over it without triggering the read error.
For completeness, the complete log :
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 Sense Key : 0x3 [current]
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 ASC=0x11 ASCQ=0x0
[Sun Apr 26 15:05:30 2020] sd 4:0:4:0: [sdj] tag#229 CDB: opcode=0x28 28 00 03 39 d8 08 00 20 00 00
[Sun Apr 26 15:05:30 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 0
[Sun Apr 26 21:15:47 2020] sdd: sdd1 sdd2
[Mon Apr 27 18:51:15 2020] md: data-check of RAID array md3
[Tue Apr 28 03:42:21 2020] md: md3: data-check done.
[Tue Apr 28 09:39:18 2020] md/raid:md3: Disk failure on sdi, disabling device.
md/raid:md3: Operation continuing on 8 devices.
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 Sense Key : 0x3 [current]
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 ASC=0x11 ASCQ=0x0
[Tue Apr 28 14:13:33 2020] sd 4:0:4:0: [sdj] tag#100 CDB: opcode=0x28 28 00 03 39 e6 10 00 00 08 00
[Tue Apr 28 14:13:33 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Tue Apr 28 14:13:35 2020] sd 4:0:4:0: [sdj] tag#112 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Tue Apr 28 14:13:35 2020] sd 4:0:4:0: [sdj] tag#112 Sense Key : 0x3 [current]
[Tue Apr 28 14:13:35 2020] sd 4:0:4:0: [sdj] tag#112 ASC=0x11 ASCQ=0x0
[Tue Apr 28 14:13:35 2020] sd 4:0:4:0: [sdj] tag#112 CDB: opcode=0x28 28 00 03 39 e6 10 00 00 08 00
[Tue Apr 28 14:13:35 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Tue Apr 28 14:13:35 2020] Buffer I/O error on dev sdj, logical block 6765762, async page read
Examine on the suspect disk :
test:/home/brad# mdadm --examine /dev/sdj
/dev/sdj:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : dbbca7b5:327751b1:895f8f11:443f6ecb
Name : test:3 (local to host test)
Creation Time : Wed Nov 29 10:46:21 2017
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 3906767024 (1862.89 GiB 2000.26 GB)
Array Size : 13673684416 (13040.24 GiB 14001.85 GB)
Used Dev Size : 3906766976 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=48 sectors
State : clean
Device UUID : f1a39d9b:fe217c62:26b065e3:0f859afd
Internal Bitmap : 8 sectors from superblock
Update Time : Tue Apr 28 09:39:23 2020
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : cb44256b - correct
Events : 177156
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 5
Array State : AA.AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
test:/home/brad# mdadm --detail /dev/md3
/dev/md3:
Version : 1.2
Creation Time : Wed Nov 29 10:46:21 2017
Raid Level : raid6
Array Size : 13673684416 (13040.24 GiB 14001.85 GB)
Used Dev Size : 1953383488 (1862.89 GiB 2000.26 GB)
Raid Devices : 9
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Apr 28 09:39:23 2020
State : clean, degraded
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : test:3 (local to host test)
UUID : dbbca7b5:327751b1:895f8f11:443f6ecb
Events : 177156
Number Major Minor RaidDevice State
12 8 112 0 active sync /dev/sdh
13 8 96 1 active sync /dev/sdg
4 0 0 4 removed
11 8 176 3 active sync /dev/sdl
4 8 160 4 active sync /dev/sdk
9 8 144 5 active sync /dev/sdj
6 8 64 6 active sync /dev/sde
10 8 32 7 active sync /dev/sdc
8 8 192 8 active sync /dev/sdm
test:/home/brad# uname -a
Linux test 5.4.11 #49 SMP Wed Jan 15 11:23:38 AWST 2020 x86_64 GNU/Linux
So the read error is well into the array member, yet a "check" doesn't hit it. Does that sound right?
These disks grow bad sectors not infrequently, and so a check quite often forces a repair on a block of 8 sectors, but it has persistently missed this one.
Regards,
Brad
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 6:47 Does a "check" of a RAID6 actually read all disks in a stripe? Brad Campbell
@ 2020-04-28 11:02 ` Brad Campbell
2020-04-28 13:47 ` Phil Turmel
2020-04-28 15:40 ` Piergiorgio Sartor
1 sibling, 1 reply; 8+ messages in thread
From: Brad Campbell @ 2020-04-28 11:02 UTC (permalink / raw)
To: linux-raid
On 28/4/20 2:47 pm, Brad Campbell wrote:
> G'day all,
>
> I have a test server with some old disks I use for beating up on. Bear in mind the disks are old and dicey which is *why* they live in a test server. I'm not after reliability, I'm more interested in finding corner cases.
>
> [trim /]
>
> So the read error is well into the array member, yet a "check" doesn't hit it. Does that sound right?
> These disks grow bad sectors not infrequently, and so a check quite often forces a repair on a block of 8 sectors, but it has persistently missed this one.
>
> Regards,
> Brad
So after reinserting /dev/sdi, the bad sectors were picked up during the rebuild :
[Tue Apr 28 17:23:16 2020] md: recovery of RAID array md3
[Tue Apr 28 17:29:09 2020] sd 4:0:4:0: [sdj] tag#131 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Tue Apr 28 17:29:09 2020] sd 4:0:4:0: [sdj] tag#131 Sense Key : 0x3 [current]
[Tue Apr 28 17:29:09 2020] sd 4:0:4:0: [sdj] tag#131 ASC=0x11 ASCQ=0x0
[Tue Apr 28 17:29:09 2020] sd 4:0:4:0: [sdj] tag#131 CDB: opcode=0x28 28 00 03 39 e5 a0 00 04 00 00
[Tue Apr 28 17:29:09 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x4000 phys_seg 114 prio class 0
[Tue Apr 28 17:29:20 2020] sd 4:0:4:0: [sdj] tag#175 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[Tue Apr 28 17:29:20 2020] sd 4:0:4:0: [sdj] tag#175 Sense Key : 0x3 [current]
[Tue Apr 28 17:29:20 2020] sd 4:0:4:0: [sdj] tag#175 ASC=0x11 ASCQ=0x0
[Tue Apr 28 17:29:20 2020] sd 4:0:4:0: [sdj] tag#175 CDB: opcode=0x28 28 00 03 39 e6 10 00 00 08 00
[Tue Apr 28 17:29:20 2020] blk_update_request: critical medium error, dev sdj, sector 54126096 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126592 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126600 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126608 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126616 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126624 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126632 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126640 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126648 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126656 on sdj)
[Tue Apr 28 17:29:20 2020] md/raid:md3: read error corrected (8 sectors at 54126664 on sdj)
This would imply that a RAID "check" scrub does not actually read every block on every stripe of a RAID6, and thus has the potential to miss a dodgy sector under the wrong circumstances. When I get a minute, I'll try to put some test scenarios together with hdparm to create bad blocks and characterize the issue further.
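A sketch of that hdparm plan: --make-bad-sector (via WRITE_UNCORRECTABLE_EXT) and --repair-sector are real hdparm options, but DEV is deliberately left unset here because this is destructive and should only ever be pointed at a scratch drive:

```shell
# DESTRUCTIVE when DEV is set: flags a sector as unreadable, then repairs it.
DEV=${DEV:-}
SECTOR=54126096
if [ -n "$DEV" ]; then
  hdparm --yes-i-know-what-i-am-doing --make-bad-sector "$SECTOR" "$DEV"
  # this read should now fail with a medium error, as in the logs above
  dd if="$DEV" of=/dev/null bs=512 skip="$SECTOR" count=1 iflag=direct || true
  hdparm --yes-i-know-what-i-am-doing --repair-sector "$SECTOR" "$DEV"
  note="bad sector created and repaired on $DEV"
else
  note="set DEV to a scratch disk to run"
fi
echo "$note"
```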
Regards,
Brad
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 11:02 ` Brad Campbell
@ 2020-04-28 13:47 ` Phil Turmel
2020-04-28 14:00 ` Brad Campbell
0 siblings, 1 reply; 8+ messages in thread
From: Phil Turmel @ 2020-04-28 13:47 UTC (permalink / raw)
To: Brad Campbell, linux-raid
On 4/28/20 7:02 AM, Brad Campbell wrote:
>
>
> On 28/4/20 2:47 pm, Brad Campbell wrote:
>> G'day all,
>>
>> I have a test server with some old disks I use for beating up on. Bear
>> in mind the disks are old and dicey which is *why* they live in a test
>> server. I'm not after reliability, I'm more interested in finding
>> corner cases.
>>
>> One disk has a persistent read error (pending sector). This can be
>> identified easily with dd on a specific or whole disk basis.
[trim /]
>> Examine on the suspect disk :
>>
>> test:/home/brad# mdadm --examine /dev/sdj
>> /dev/sdj:
>> Magic : a92b4efc
>> Version : 1.2
>> Feature Map : 0x1
>> Array UUID : dbbca7b5:327751b1:895f8f11:443f6ecb
>> Name : test:3 (local to host test)
>> Creation Time : Wed Nov 29 10:46:21 2017
>> Raid Level : raid6
>> Raid Devices : 9
>>
>> Avail Dev Size : 3906767024 (1862.89 GiB 2000.26 GB)
>> Array Size : 13673684416 (13040.24 GiB 14001.85 GB)
>> Used Dev Size : 3906766976 (1862.89 GiB 2000.26 GB)
>> Data Offset : 262144 sectors
>> Super Offset : 8 sectors
>> Unused Space : before=262056 sectors, after=48 sectors
>> State : clean
>> Device UUID : f1a39d9b:fe217c62:26b065e3:0f859afd
>>
>> Internal Bitmap : 8 sectors from superblock
>> Update Time : Tue Apr 28 09:39:23 2020
>> Bad Block Log : 512 entries available at offset 72 sectors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Checksum : cb44256b - correct
>> Events : 177156
>>
>> Layout : left-symmetric
>> Chunk Size : 64K
>>
>> Device Role : Active device 5
>> Array State : AA.AAAAAA ('A' == active, '.' == missing, 'R' ==
>> replacing)
The bad block log misfeature is turned on. Any blocks recorded in it
will never be read again by MD, last I looked. This might explain what
you are seeing.
> This would imply that a RAID "check" scrub does not actually read every
> block on every stripe of a RAID6, and thus has the potential to miss a
> dodgy sector under the wrong circumstances. When I get a minute, I'll
> try and put some test scenarios together with hdparm to create bad
> blocks and try to characterize the issue further.
>
> Regards,
> Brad
Regards,
Phil
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 13:47 ` Phil Turmel
@ 2020-04-28 14:00 ` Brad Campbell
2020-04-28 14:41 ` Phil Turmel
0 siblings, 1 reply; 8+ messages in thread
From: Brad Campbell @ 2020-04-28 14:00 UTC (permalink / raw)
To: Phil Turmel, linux-raid
On 28/4/20 9:47 pm, Phil Turmel wrote:
> On 4/28/20 7:02 AM, Brad Campbell wrote:
>>
>>
>> On 28/4/20 2:47 pm, Brad Campbell wrote:
>>> G'day all,
>>>
>>> [trim /]
>>>
>>> Bad Block Log : 512 entries available at offset 72 sectors
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>>> Checksum : cb44256b - correct
>>> Events : 177156
>>>
>>> Layout : left-symmetric
>>> Chunk Size : 64K
>>>
>>> Device Role : Active device 5
>>> Array State : AA.AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
>
> The bad block log misfeature is turned on. Any blocks recorded in it will never be read again by MD, last I looked. This might explain what you are seeing.
>
>
While I see where you are going, the fact that it corrected the bad sector by re-writing it during the rebuild would suggest that isn't (or at least wasn't) the case.
Regards,
Brad
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 14:00 ` Brad Campbell
@ 2020-04-28 14:41 ` Phil Turmel
2020-04-29 12:56 ` Brad Campbell
0 siblings, 1 reply; 8+ messages in thread
From: Phil Turmel @ 2020-04-28 14:41 UTC (permalink / raw)
To: Brad Campbell, linux-raid
On 4/28/20 10:00 AM, Brad Campbell wrote:
>
> On 28/4/20 9:47 pm, Phil Turmel wrote:
>> The bad block log misfeature is turned on. Any blocks recorded in it
>> will never be read again by MD, last I looked. This might explain
>> what you are seeing.
>>
>>
> While I see where you are going, the fact it corrected the bad sector by
> re-writing it during the re-build would intimate that isn't/wasn't the
> case.
Ah, yes. Any chance you have set the sysfs sector limits on a check and
haven't reset them?
Phil
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 6:47 Does a "check" of a RAID6 actually read all disks in a stripe? Brad Campbell
2020-04-28 11:02 ` Brad Campbell
@ 2020-04-28 15:40 ` Piergiorgio Sartor
2020-05-03 3:28 ` Brad Campbell
1 sibling, 1 reply; 8+ messages in thread
From: Piergiorgio Sartor @ 2020-04-28 15:40 UTC (permalink / raw)
To: Brad Campbell; +Cc: linux-raid
On Tue, Apr 28, 2020 at 02:47:04PM +0800, Brad Campbell wrote:
> G'day all,
>
> I have a test server with some old disks I use for beating up on. Bear in mind the disks are old and dicey which is *why* they live in a test server. I'm not after reliability, I'm more interested in finding corner cases.
>
> [trim /]
>
> So the read error is well into the array member, yet a "check" doesn't hit it. Does that sound right?
> These disks grow bad sectors not infrequently, and so a check quite often forces a repair on a block of 8 sectors, but it has persistently missed this one.
I suspect, but Neil or some expert should confirm
or deny, that a check on a RAID-6 uses only the
P parity to verify stripe consistency.
If there are errors in the Q parity chunk, these
will not be found.
On the other hand, "raid6check" :-) uses both
parities and, maybe, can trigger the rebuild.
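To make the distinction concrete, here is a toy model (not md's actual code) of a RAID-6 stripe: P is a plain XOR of the data chunks, Q is a Reed-Solomon syndrome over GF(2^8) with generator 2. A scrub that recomputes only P from the data cannot see damage confined to the Q chunk:

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8) with the usual RAID6 reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def p_q(chunks):
    """Compute P (XOR) and Q (RS syndrome) for a list of data chunks."""
    p = [0] * len(chunks[0])
    q = [0] * len(chunks[0])
    coeff = 1                      # g^i for device i, starting at g^0
    for chunk in chunks:
        for j, byte in enumerate(chunk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
        coeff = gf_mul(coeff, 2)
    return bytes(p), bytes(q)

data = [bytes([d ^ j for j in range(4)]) for d in (0x10, 0x20, 0x30)]
p, q = p_q(data)
bad_q = bytes([q[0] ^ 0xFF]) + q[1:]   # damage confined to the Q chunk
p2, q2 = p_q(data)                     # what a scrub recomputes from data
print(p2 == p)       # a P-only check reports the stripe clean
print(q2 == bad_q)   # comparing Q as well catches the damage
```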
Hope this helps,
bye,
pg
>
> Regards,
> Brad
--
piergiorgio
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 14:41 ` Phil Turmel
@ 2020-04-29 12:56 ` Brad Campbell
0 siblings, 0 replies; 8+ messages in thread
From: Brad Campbell @ 2020-04-29 12:56 UTC (permalink / raw)
To: Phil Turmel, linux-raid
On 28/4/20 10:41 pm, Phil Turmel wrote:
> On 4/28/20 10:00 AM, Brad Campbell wrote:
>>
>> On 28/4/20 9:47 pm, Phil Turmel wrote:
>
>>> The bad block log misfeature is turned on. Any blocks recorded in it will never be read again by MD, last I looked. This might explain what you are seeing.
>>>
test:/sys/block/md3/md# mdadm --examine-badblocks /dev/sd[ceghijklm]
Bad-blocks list is empty in /dev/sdc
Bad-blocks list is empty in /dev/sde
Bad-blocks list is empty in /dev/sdg
Bad-blocks list is empty in /dev/sdh
Bad-blocks list is empty in /dev/sdi
Bad-blocks list is empty in /dev/sdj
Bad-blocks list is empty in /dev/sdk
Bad-blocks list is empty in /dev/sdl
Bad-blocks list is empty in /dev/sdm
>> While I see where you are going, the fact it corrected the bad sector by re-writing it during the re-build would intimate that isn't/wasn't the case.
>
> Ah, yes. Any chance you have set the sysfs sector limits on a check and haven't reset them?
>
> Phil
>
No. Each run was triggered by an :
echo check > /sys/block/md3/md/sync_action
No other interaction with sysfs.
I have a few drives in another machine I can abuse, so I have some time put aside on the weekend to set up a couple of small test arrays. When I last tried (admittedly years ago now) it was easy enough to "create" bad blocks with hdparm.
I think Piergiorgio was on the money with something not reading one of the parity blocks, but I've read the code several times and can't obviously see how that could happen. Examining a check with blktrace would highlight it if that were the case, so I might try that first.
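A sketch of that blktrace idea: trace reads on one member while a check runs, then look for the suspect LBA in the parsed output. DEV is a placeholder for an array member, md3 and the LBA match this thread, and it needs root, so with DEV unset the block only reports the placeholder:

```shell
DEV=${DEV:-}
if [ -n "$DEV" ]; then
  # record read traffic on one member while the scrub runs
  blktrace -d "$DEV" -a read -o - | blkparse -i - > /tmp/check-reads.txt &
  echo check > /sys/block/md3/md/sync_action
  sleep 60   # let the check pass the region of interest
  kill %1
  grep 54126096 /tmp/check-reads.txt || echo "target sector was never read"
  note="trace complete"
else
  note="set DEV to an array member to trace"
fi
echo "$note"
```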
On the topic of the bbl. Is there any way to remove it or mark it for removal while an array is active? For example, on an array used as the root filesystem, where stopping and restarting is a "take the machine down and boot from a live-cd" sort of affair. All of my arrays have the bbl enabled. It has never caused me an issue, but looking at it I can see how that might be possible.
Regards,
Brad
* Re: Does a "check" of a RAID6 actually read all disks in a stripe?
2020-04-28 15:40 ` Piergiorgio Sartor
@ 2020-05-03 3:28 ` Brad Campbell
0 siblings, 0 replies; 8+ messages in thread
From: Brad Campbell @ 2020-05-03 3:28 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid
On 28/4/20 11:40 pm, Piergiorgio Sartor wrote:
> I suspect, but Neil or some expert should confim
> or deny, that a check on a RAID-6 uses only the
> P parity to verify stripe consistency.
> If there are errors in the Q parity chunk, these
> will not be found.
>
I have been able to experimentally verify that a check does indeed read
both P & Q and validate them for mismatches.
I did find a couple of cases where a bad sector was not picked up during
a check, however these appeared to be inconsistent and not reproducible.
I suspect it is related to using a small part of the array and caching
at the block layer. Once I removed the sync_max restriction and let the
array run the check until I knew the data read was greater than the
machine RAM (ie, it was all out of cache) I could not reproduce the
issue. Incidentally, no use of drop_caches helped here, I had to let it
just run.
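For reference, the drop_caches attempt mentioned above is normally along these lines (a sketch; the exact invocation used isn't shown here, and it requires root):

```shell
# Flush dirty pages first, then ask the kernel to drop clean
# pagecache plus dentries and inodes (3 = pagecache + slab objects).
sync
echo 3 > /proc/sys/vm/drop_caches
```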
I performed these tests using hdparm to create a bad sector on one of
the array members and then forcing a check. In all sane cases, the check
hit the read error, performed a reconstruction and re-wrote.
Oddly enough, 9 times out of 10, on the P block it simply corrected the
read error (i.e. 1 bad "sector" on the disk causes 8 logical sectors to
return errors due to the 4K formatting), while the Q block re-wrote the
entire chunk every time (128 sectors).
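The 8 and 128 in the previous paragraph follow directly from the sector and chunk sizes. A quick sanity check, assuming 512-byte logical sectors on a 4K-physical ("512e") drive and the 64k chunk from this array:

```python
LOGICAL = 512        # logical sector size presented to the host (512e)
PHYSICAL = 4096      # native physical sector size ("4K formatting")
CHUNK = 64 * 1024    # md chunk size for this array (64k)

# One unreadable physical sector surfaces as this many consecutive
# logical-sector read errors:
sectors_per_physical = PHYSICAL // LOGICAL   # 8

# A full chunk rewrite touches this many logical sectors:
sectors_per_chunk = CHUNK // LOGICAL         # 128

print(sectors_per_physical, sectors_per_chunk)
```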
[248861.618254] ata3: EH complete
[248861.627940] raid5_end_read_request: 7 callbacks suppressed
[248861.627964] md/raid:md3: read error corrected (8 sectors at 262144 on sdc)
[248877.770056] md: md3: data-check interrupted.
There were instances, however, where performing exactly the same test on
the P block would force an entire chunk write: apparently inconsistent
behaviour, but always resulting in the correct on-disk state.
[249390.287825] ata3: EH complete
[249390.570722] raid5_end_read_request: 6 callbacks suppressed
[249390.570752] md/raid:md3: read error corrected (8 sectors at 262272 on sdc)
[249390.570776] md/raid:md3: read error corrected (8 sectors at 262280 on sdc)
[249390.570800] md/raid:md3: read error corrected (8 sectors at 262288 on sdc)
[249390.570822] md/raid:md3: read error corrected (8 sectors at 262296 on sdc)
[249390.570845] md/raid:md3: read error corrected (8 sectors at 262304 on sdc)
[249390.570871] md/raid:md3: read error corrected (8 sectors at 262312 on sdc)
[249390.570893] md/raid:md3: read error corrected (8 sectors at 262320 on sdc)
[249390.570916] md/raid:md3: read error corrected (8 sectors at 262328 on sdc)
[249390.570940] md/raid:md3: read error corrected (8 sectors at 262336 on sdc)
[249390.570962] md/raid:md3: read error corrected (8 sectors at 262344 on sdc)
[249397.549844] md: md3: data-check interrupted.
This test did validate my theory that using dd-rescue to clone a dying
drive, then using hdparm to mark the unrecoverable sectors as bad on the
clone, would prevent md from reading corrupt data from the clone and
allow a rebuild of that stripe.
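Assuming the GNU ddrescue mapfile format (data lines are `pos size status` in hex, with status `-` marking bad-sector regions), a minimal sketch of turning a mapfile into the matching hdparm commands might look like this. `bad_sector_commands` is a hypothetical helper, not part of ddrescue or mdadm:

```python
def bad_sector_commands(mapfile_text, device, sector_size=512):
    """Emit one hdparm --make-bad-sector command per unreadable
    logical sector listed in a ddrescue mapfile (hypothetical helper)."""
    cmds = []
    data_lines = [line.split() for line in mapfile_text.splitlines()
                  if line.strip() and not line.startswith('#')]
    # The first data line is the current-position/status pair; skip it.
    for fields in data_lines[1:]:
        pos, size, status = int(fields[0], 16), int(fields[1], 16), fields[2]
        if status == '-':  # bad-sector region
            first = pos // sector_size
            count = size // sector_size
            for s in range(first, first + count):
                cmds.append("hdparm --yes-i-know-what-i-am-doing "
                            f"--make-bad-sector {s} {device}")
    return cmds
```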
For completeness, the test was performed with many variants along these lines:
test:~# mdadm --assemble /dev/md3
mdadm: /dev/md3 has been started with 9 drives.
test:~# hdparm --yes-i-know-what-i-am-doing --make-bad-sector f262144 /dev/sdm
/dev/sdm:
Corrupting sector 262144 (WRITE_UNC_EXT as flagged): succeeded
test:~# export MD=/sys/block/md3/md/ ; echo 256 > $MD/sync_max ; echo check > $MD/sync_action
test:~# export MD=/sys/block/md3/md/ ; echo idle > $MD/sync_action ; echo 0 > $MD/sync_min
test:~# mdadm --stop /dev/md3
[249036.634236] md: md3 stopped.
[249036.700205] md/raid:md3: device sde operational as raid disk 0
[249036.700227] md/raid:md3: device sdm operational as raid disk 8
[249036.700246] md/raid:md3: device sdc operational as raid disk 7
[249036.700264] md/raid:md3: device sdf operational as raid disk 6
[249036.700283] md/raid:md3: device sdk operational as raid disk 5
[249036.700301] md/raid:md3: device sdj operational as raid disk 4
[249036.700319] md/raid:md3: device sdl operational as raid disk 3
[249036.700338] md/raid:md3: device sdi operational as raid disk 2
[249036.700356] md/raid:md3: device sdg operational as raid disk 1
[249036.700832] md/raid:md3: raid level 6 active with 9 out of 9 devices, algorithm 2
[249036.720763] md3: detected capacity change from 0 to 14001852841984
[249056.033360] md: data-check of RAID array md3
[249056.154104] sd 2:0:7:0: [sdm] tag#246 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[249056.154136] sd 2:0:7:0: [sdm] tag#246 Sense Key : 0x3 [current]
[249056.154157] sd 2:0:7:0: [sdm] tag#246 ASC=0x11 ASCQ=0x0
[249056.154178] sd 2:0:7:0: [sdm] tag#246 CDB: opcode=0x28 28 00 00 04 00 00 00 00 80 00
[249056.154206] blk_update_request: critical medium error, dev sdm, sector 262144 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
[249056.155056] md/raid:md3: read error corrected (8 sectors at 262144 on sdm)
[249056.155081] md/raid:md3: read error corrected (8 sectors at 262152 on sdm)
[249056.155107] md/raid:md3: read error corrected (8 sectors at 262160 on sdm)
[249056.155131] md/raid:md3: read error corrected (8 sectors at 262168 on sdm)
[249056.155157] md/raid:md3: read error corrected (8 sectors at 262176 on sdm)
[249056.155181] md/raid:md3: read error corrected (8 sectors at 262184 on sdm)
[249056.155207] md/raid:md3: read error corrected (8 sectors at 262192 on sdm)
[249056.155232] md/raid:md3: read error corrected (8 sectors at 262200 on sdm)
[249056.155258] md/raid:md3: read error corrected (8 sectors at 262208 on sdm)
[249056.155282] md/raid:md3: read error corrected (8 sectors at 262216 on sdm)
[249064.025753] md: md3: data-check interrupted.
[249064.190722] md3: detected capacity change from 14001852841984 to 0
[249064.190756] md: md3 stopped.
With these test results, I'm at a loss to explain how repeated full-disk
checks previously missed the bad sectors on the drive, but it certainly
happened.
Regards,
Brad
Thread overview: 8+ messages
2020-04-28 6:47 Does a "check" of a RAID6 actually read all disks in a stripe? Brad Campbell
2020-04-28 11:02 ` Brad Campbell
2020-04-28 13:47 ` Phil Turmel
2020-04-28 14:00 ` Brad Campbell
2020-04-28 14:41 ` Phil Turmel
2020-04-29 12:56 ` Brad Campbell
2020-04-28 15:40 ` Piergiorgio Sartor
2020-05-03 3:28 ` Brad Campbell