* MD Raid10 recovery results in "attempt to access beyond end of device"
@ 2012-06-22  7:06 Christian Balzer
  2012-06-22  8:07 ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Balzer @ 2012-06-22  7:06 UTC (permalink / raw)
  To: linux-raid


Hello,

the basics first:
Debian Squeeze, custom 3.2.18 kernel.

The Raid(s) in question are:
---
Personalities : [raid1] [raid10] 
md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2] sdi1[1]
      3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]
      
md3 : active raid10 sdh1[7] sdc1[0] sda4[5](S) sdg1[3] sdf1[2] sde1[6]
      3662836224 blocks super 1.2 512K chunks 2 near-copies [5/4] [UUUU_]
      [=====>...............]  recovery = 28.3% (415962368/1465134592) finish=326.2min speed=53590K/sec
---

Drives sda to sdd are on an nVidia MCP55 and sde to sdl on a SAS1068E. Drives
sdc to sdl are identical 1.5TB Seagates (about 2 years old, recycled from the
previous incarnation of these machines), each with a single partition spanning
the whole drive like this:
---
Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      182401  1465136001   fd  Linux raid autodetect
---

sda and sdb are new 2TB Hitachi drives, partitioned like this:
---
Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d53b0

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       31124   249999360   fd  Linux raid autodetect
/dev/sda2           31124       46686   124999680   fd  Linux raid autodetect
/dev/sda3           46686       50576    31246425   fd  Linux raid autodetect
/dev/sda4           50576      243201  1547265543+  fd  Linux raid autodetect
---

So the idea is to have 5 drives in each of the two RAID10s, plus one spare
on the (intentionally over-sized) fourth partition of the bigger OS disks.

Some weeks ago a drive failed on the twin of the machine in question
(identical everything, DRBD replication of those 2 RAIDs) and everything
went by the book: the spare took over, things got rebuilt, and I replaced
the failed drive (sdi) later:
---
md4 : active raid10 sdi1[6](S) sdd1[0] sdb4[5] sdl1[4] sdk1[3] sdj1[2]
      3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]
---

Two days ago drive sdh on the machine that's having issues failed:
---
Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
Jun 20 18:22:39 borg03b kernel: [1383395.527178] RAID10 conf printout:
Jun 20 18:22:39 borg03b kernel: [1383395.527181]  --- wd:4 rd:5
Jun 20 18:22:39 borg03b kernel: [1383395.527184]  disk 0, wo:0, o:1, dev:sdc1
Jun 20 18:22:39 borg03b kernel: [1383395.527186]  disk 1, wo:0, o:1, dev:sde1
Jun 20 18:22:39 borg03b kernel: [1383395.527189]  disk 2, wo:0, o:1, dev:sdf1
Jun 20 18:22:39 borg03b kernel: [1383395.527191]  disk 3, wo:0, o:1, dev:sdg1
Jun 20 18:22:39 borg03b kernel: [1383395.527193]  disk 4, wo:1, o:0, dev:sdh1
Jun 20 18:22:39 borg03b kernel: [1383395.568037] RAID10 conf printout:
Jun 20 18:22:39 borg03b kernel: [1383395.568040]  --- wd:4 rd:5
Jun 20 18:22:39 borg03b kernel: [1383395.568042]  disk 0, wo:0, o:1, dev:sdc1
Jun 20 18:22:39 borg03b kernel: [1383395.568045]  disk 1, wo:0, o:1, dev:sde1
Jun 20 18:22:39 borg03b kernel: [1383395.568047]  disk 2, wo:0, o:1, dev:sdf1
Jun 20 18:22:39 borg03b kernel: [1383395.568049]  disk 3, wo:0, o:1, dev:sdg1
Jun 20 18:22:39 borg03b kernel: [1383395.568060] RAID10 conf printout:
Jun 20 18:22:39 borg03b kernel: [1383395.568061]  --- wd:4 rd:5
Jun 20 18:22:39 borg03b kernel: [1383395.568063]  disk 0, wo:0, o:1, dev:sdc1
Jun 20 18:22:39 borg03b kernel: [1383395.568065]  disk 1, wo:0, o:1, dev:sde1
Jun 20 18:22:39 borg03b kernel: [1383395.568068]  disk 2, wo:0, o:1, dev:sdf1
Jun 20 18:22:39 borg03b kernel: [1383395.568070]  disk 3, wo:0, o:1, dev:sdg1
Jun 20 18:22:39 borg03b kernel: [1383395.568072]  disk 4, wo:1, o:1, dev:sda4
Jun 20 18:22:39 borg03b kernel: [1383395.568135] md: recovery of RAID array md3
Jun 20 18:22:39 borg03b kernel: [1383395.568139] md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
Jun 20 18:22:39 borg03b kernel: [1383395.568142] md: using maximum available idle IO bandwidth (but not more than 500000 KB/sec) for recovery.
Jun 20 18:22:39 borg03b kernel: [1383395.568155] md: using 128k window, over a total of 1465134592k.
---

OK, the spare kicked in and recovery got underway (reading from the neighbors sdg and sdc), but then:
---
Jun 21 02:29:29 borg03b kernel: [1412604.989978] attempt to access beyond end of device
Jun 21 02:29:29 borg03b kernel: [1412604.989983] sdc1: rw=0, want=2930272128, limit=2930272002
Jun 21 02:29:29 borg03b kernel: [1412604.990003] attempt to access beyond end of device
Jun 21 02:29:29 borg03b kernel: [1412604.990009] sdc1: rw=16, want=2930272008, limit=2930272002
Jun 21 02:29:29 borg03b kernel: [1412604.990013] md/raid10:md3: recovery aborted due to read error
Jun 21 02:29:29 borg03b kernel: [1412604.990025] attempt to access beyond end of device
Jun 21 02:29:29 borg03b kernel: [1412604.990028] sdc1: rw=0, want=2930272256, limit=2930272002
Jun 21 02:29:29 borg03b kernel: [1412604.990032] md: md3: recovery done.
Jun 21 02:29:29 borg03b kernel: [1412604.990035] attempt to access beyond end of device
Jun 21 02:29:29 borg03b kernel: [1412604.990038] sdc1: rw=16, want=2930272136, limit=2930272002
Jun 21 02:29:29 borg03b kernel: [1412604.990040] md/raid10:md3: recovery aborted due to read error
---

Why it would want to read data beyond the end of that device (and
partition) is a complete mystery to me. If anything were odd with this
RAID or its superblocks, surely the initial sync would have stumbled
across this as well?
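
For scale, a quick sketch of the arithmetic with the want/limit values from
the log above (plain Python) shows the requests only overrun the partition by
a few hundred sectors at most, i.e. right at its very end:
---
# Overshoot of each failing request past the end of sdc1
# (want/limit are 512-byte sector counts from the kernel log).
limit = 2930272002
for want in (2930272008, 2930272128, 2930272136, 2930272256):
    print(f"want={want}: {want - limit} sectors past the end")
# -> 6, 126, 134 and 254 sectors respectively
---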

After this failure the kernel goes into a log frenzy:
---
Jun 21 02:29:29 borg03b kernel: [1412605.744052] RAID10 conf printout:
Jun 21 02:29:29 borg03b kernel: [1412605.744055]  --- wd:4 rd:5
Jun 21 02:29:29 borg03b kernel: [1412605.744057]  disk 0, wo:0, o:1, dev:sdc1
Jun 21 02:29:29 borg03b kernel: [1412605.744060]  disk 1, wo:0, o:1, dev:sde1
Jun 21 02:29:29 borg03b kernel: [1412605.744062]  disk 2, wo:0, o:1, dev:sdf1
Jun 21 02:29:29 borg03b kernel: [1412605.744064]  disk 3, wo:0, o:1, dev:sdg1
---
repeating every second or so, until I "mdadm -r"ed the sda4 partition
(former spare).

The next day I replaced the failed sdh drive with another 2TB Hitachi
(having only 1.5TB Seagates of dubious quality lying around), gave it a
single partition of the same size as on the other drives, and added it to md3.

The resync failed in the same manner:
---
Jun 21 20:59:06 borg03b kernel: [1479182.509914] attempt to access beyond end of device
Jun 21 20:59:06 borg03b kernel: [1479182.509920] sdc1: rw=0, want=2930272128, limit=2930272002
Jun 21 20:59:06 borg03b kernel: [1479182.509931] attempt to access beyond end of device
Jun 21 20:59:06 borg03b kernel: [1479182.509933] attempt to access beyond end of device
Jun 21 20:59:06 borg03b kernel: [1479182.509937] sdc1: rw=0, want=2930272256, limit=2930272002
Jun 21 20:59:06 borg03b kernel: [1479182.509942] md: md3: recovery done.
Jun 21 20:59:06 borg03b kernel: [1479182.509948] sdc1: rw=16, want=2930272008, limit=2930272002
Jun 21 20:59:06 borg03b kernel: [1479182.509952] md/raid10:md3: recovery aborted due to read error
Jun 21 20:59:06 borg03b kernel: [1479182.509963] attempt to access beyond end of device
Jun 21 20:59:06 borg03b kernel: [1479182.509965] sdc1: rw=16, want=2930272136, limit=2930272002
Jun 21 20:59:06 borg03b kernel: [1479182.509968] md/raid10:md3: recovery aborted due to read error
---

I've now scrounged up an identical 1.5TB drive and added it to the RAID
(that is the recovery visible in the topmost mdstat output).
If that fails as well, I'm completely lost as to what's going on; if it
succeeds, though, I guess we're looking at a subtle bug.

I didn't find anything like this mentioned in the archives before; any and
all feedback would be most welcome.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-22  7:06 MD Raid10 recovery results in "attempt to access beyond end of device" Christian Balzer
@ 2012-06-22  8:07 ` NeilBrown
  2012-06-22  8:42   ` Christian Balzer
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-06-22  8:07 UTC (permalink / raw)
  To: Christian Balzer; +Cc: linux-raid

On Fri, 22 Jun 2012 16:06:32 +0900 Christian Balzer <chibi@gol.com> wrote:

> 
> Hello,
> 
> the basics first:
> Debian Squeeze, custom 3.2.18 kernel.
> 
> The Raid(s) in question are:
> ---
> Personalities : [raid1] [raid10] 
> md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2] sdi1[1]
>       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]

I'm stumped by this.  It shouldn't be possible.

The size of the array is impossible.

If there are N chunks per device, then there are 5*N chunks on the whole
array, and there are two copies of each data chunk, giving 5*N/2 distinct
data chunks; that should be the size of the array.

So if we take the size of the array, divide by chunk size, multiply by 2,
divide by 5, we get N = the number of chunks per device.
i.e.
  N = (array_size / chunk_size)*2 / 5

If we plug in 3662836224 for the array size and 512 for the chunk size,
we get 2861590.8, which is not an integer.
i.e. impossible.
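
A minimal check of that arithmetic (plain Python, using the figures from the
mdstat output quoted above):
---
# N = chunks per device for a 5-disk, 2-near-copy RAID10
array_size_kib = 3662836224   # from /proc/mdstat (1 KiB blocks)
chunk_kib = 512
print(array_size_kib / chunk_kib * 2 / 5)   # 2861590.8 -- not an integer
---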

What does "mdadm --examine" of the various devices show?

NeilBrown


>       
> md3 : active raid10 sdh1[7] sdc1[0] sda4[5](S) sdg1[3] sdf1[2] sde1[6]
>       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/4] [UUUU_]
>       [=====>...............]  recovery = 28.3% (415962368/1465134592) finish=326.2min speed=53590K/sec
> ---
> 
> Drives sda to sdd are on nVidia MCP55 and sde to sdl on SAS1068E, sdc to
> sdl are identical 1.5TB Seagates (about 2 years old, recycled from the
> previous incarnation of these machines) with a single partition spanning
> the whole drive like this:
> ---
> Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes
> 255 heads, 63 sectors/track, 182401 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x00000000
> 
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sdc1               1      182401  1465136001   fd  Linux raid autodetect
> ---
> 
> sda and sdb are new 2TB Hitachi drives, partitioned like this:
> ---
> Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
> 255 heads, 63 sectors/track, 243201 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x000d53b0
> 
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1   *           1       31124   249999360   fd  Linux raid autodetect
> /dev/sda2           31124       46686   124999680   fd  Linux raid autodetect
> /dev/sda3           46686       50576    31246425   fd  Linux raid autodetect
> /dev/sda4           50576      243201  1547265543+  fd  Linux raid autodetect
> ---
> 
> So the idea is to have 5 drives in each of the two Raid10s and one spare
> on that (intentionally over-sized) fourth partition of the bigger OS
> disks.
> 
> Some weeks ago a drive failed on the twin (identical everything, DRBD
> replication of those 2 RAIDs) of the machine in question and everything
> went according to the book, spare took over and things got rebuilt, I
> replaced the failed drive (sdi) later:
> ---
> md4 : active raid10 sdi1[6](S) sdd1[0] sdb4[5] sdl1[4] sdk1[3] sdj1[2]
>       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]
> ---
> 
> Two days ago drive sdh on the machine that's having issues failed:
> ---
> Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device offlined - not ready after error recovery
> Jun 20 18:22:39 borg03b kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O error, dev sdh, sector 71
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md: super_written gets error=-5, uptodate=0
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure on sdh1, disabling device.
> Jun 20 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> Jun 20 18:22:39 borg03b kernel: [1383395.527178] RAID10 conf printout:
> Jun 20 18:22:39 borg03b kernel: [1383395.527181]  --- wd:4 rd:5
> Jun 20 18:22:39 borg03b kernel: [1383395.527184]  disk 0, wo:0, o:1, dev:sdc1
> Jun 20 18:22:39 borg03b kernel: [1383395.527186]  disk 1, wo:0, o:1, dev:sde1
> Jun 20 18:22:39 borg03b kernel: [1383395.527189]  disk 2, wo:0, o:1, dev:sdf1
> Jun 20 18:22:39 borg03b kernel: [1383395.527191]  disk 3, wo:0, o:1, dev:sdg1
> Jun 20 18:22:39 borg03b kernel: [1383395.527193]  disk 4, wo:1, o:0, dev:sdh1
> Jun 20 18:22:39 borg03b kernel: [1383395.568037] RAID10 conf printout:
> Jun 20 18:22:39 borg03b kernel: [1383395.568040]  --- wd:4 rd:5
> Jun 20 18:22:39 borg03b kernel: [1383395.568042]  disk 0, wo:0, o:1, dev:sdc1
> Jun 20 18:22:39 borg03b kernel: [1383395.568045]  disk 1, wo:0, o:1, dev:sde1
> Jun 20 18:22:39 borg03b kernel: [1383395.568047]  disk 2, wo:0, o:1, dev:sdf1
> Jun 20 18:22:39 borg03b kernel: [1383395.568049]  disk 3, wo:0, o:1, dev:sdg1
> Jun 20 18:22:39 borg03b kernel: [1383395.568060] RAID10 conf printout:
> Jun 20 18:22:39 borg03b kernel: [1383395.568061]  --- wd:4 rd:5
> Jun 20 18:22:39 borg03b kernel: [1383395.568063]  disk 0, wo:0, o:1, dev:sdc1
> Jun 20 18:22:39 borg03b kernel: [1383395.568065]  disk 1, wo:0, o:1, dev:sde1
> Jun 20 18:22:39 borg03b kernel: [1383395.568068]  disk 2, wo:0, o:1, dev:sdf1
> Jun 20 18:22:39 borg03b kernel: [1383395.568070]  disk 3, wo:0, o:1, dev:sdg1
> Jun 20 18:22:39 borg03b kernel: [1383395.568072]  disk 4, wo:1, o:1, dev:sda4
> Jun 20 18:22:39 borg03b kernel: [1383395.568135] md: recovery of RAID array md3
> Jun 20 18:22:39 borg03b kernel: [1383395.568139] md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
> Jun 20 18:22:39 borg03b kernel: [1383395.568142] md: using maximum available idle IO bandwidth (but not more than 500000 KB/sec) for recovery.
> Jun 20 18:22:39 borg03b kernel: [1383395.568155] md: using 128k window, over a total of 1465134592k.
> ---
> 
> OK, spare kicked, recovery underway (from the neighbors sdg and sdc), but then:
> ---
> Jun 21 02:29:29 borg03b kernel: [1412604.989978] attempt to access beyond end of device
> Jun 21 02:29:29 borg03b kernel: [1412604.989983] sdc1: rw=0, want=2930272128, limit=2930272002
> Jun 21 02:29:29 borg03b kernel: [1412604.990003] attempt to access beyond end of device
> Jun 21 02:29:29 borg03b kernel: [1412604.990009] sdc1: rw=16, want=2930272008, limit=2930272002
> Jun 21 02:29:29 borg03b kernel: [1412604.990013] md/raid10:md3: recovery aborted due to read error
> Jun 21 02:29:29 borg03b kernel: [1412604.990025] attempt to access beyond end of device
> Jun 21 02:29:29 borg03b kernel: [1412604.990028] sdc1: rw=0, want=2930272256, limit=2930272002
> Jun 21 02:29:29 borg03b kernel: [1412604.990032] md: md3: recovery done.
> Jun 21 02:29:29 borg03b kernel: [1412604.990035] attempt to access beyond end of device
> Jun 21 02:29:29 borg03b kernel: [1412604.990038] sdc1: rw=16, want=2930272136, limit=2930272002
> Jun 21 02:29:29 borg03b kernel: [1412604.990040] md/raid10:md3: recovery aborted due to read error
> ---
> 
> Why it would want to read data beyond the end of that device (and
> partition) is a complete mystery to me, if anything was odd with this Raid
> or its superblocks, surely the initial sync should have stumbled across
> this as well?
> 
> After this failure the kernel goes into a log frenzy:
> ---
> Jun 21 02:29:29 borg03b kernel: [1412605.744052] RAID10 conf printout:
> Jun 21 02:29:29 borg03b kernel: [1412605.744055]  --- wd:4 rd:5
> Jun 21 02:29:29 borg03b kernel: [1412605.744057]  disk 0, wo:0, o:1, dev:sdc1
> Jun 21 02:29:29 borg03b kernel: [1412605.744060]  disk 1, wo:0, o:1, dev:sde1
> Jun 21 02:29:29 borg03b kernel: [1412605.744062]  disk 2, wo:0, o:1, dev:sdf1
> Jun 21 02:29:29 borg03b kernel: [1412605.744064]  disk 3, wo:0, o:1, dev:sdg1
> ---
> repeating every second or so, until I "mdadm -r"ed the sda4 partition
> (former spare).
> 
> On the next day I replaced the failed sdh drive with another 2TB Hitachi
> (having only 1.5TB Seagates of dubious quality lying around), gave it the
> same single partition size as the other drives and added it to md3.
> 
> The resync failed in the same manner:
> ---
> Jun 21 20:59:06 borg03b kernel: [1479182.509914] attempt to access beyond end of device
> Jun 21 20:59:06 borg03b kernel: [1479182.509920] sdc1: rw=0, want=2930272128, limit=2930272002
> Jun 21 20:59:06 borg03b kernel: [1479182.509931] attempt to access beyond end of device
> Jun 21 20:59:06 borg03b kernel: [1479182.509933] attempt to access beyond end of device
> Jun 21 20:59:06 borg03b kernel: [1479182.509937] sdc1: rw=0, want=2930272256, limit=2930272002
> Jun 21 20:59:06 borg03b kernel: [1479182.509942] md: md3: recovery done.
> Jun 21 20:59:06 borg03b kernel: [1479182.509948] sdc1: rw=16, want=2930272008, limit=2930272002
> Jun 21 20:59:06 borg03b kernel: [1479182.509952] md/raid10:md3: recovery aborted due to read error
> Jun 21 20:59:06 borg03b kernel: [1479182.509963] attempt to access beyond end of device
> Jun 21 20:59:06 borg03b kernel: [1479182.509965] sdc1: rw=16, want=2930272136, limit=2930272002
> Jun 21 20:59:06 borg03b kernel: [1479182.509968] md/raid10:md3: recovery aborted due to read error
> ---
> 
> I've now scrounged up an identical 1.5TB drive and added it to the Raid
> (the recovery visible in the topmost mdstat). 
> If that fails as well, I'm completely lost as to what's going on, if it
> succeeds though I guess we're looking at a subtle bug. 
> 
> I didn't find anything like this mentioned in the archives before, any and
> all feedback would be most welcome.
> 
> Regards,
> 
> Christian


* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-22  8:07 ` NeilBrown
@ 2012-06-22  8:42   ` Christian Balzer
  2012-06-23  4:13     ` Christian Balzer
  2012-06-25  4:07     ` NeilBrown
  0 siblings, 2 replies; 8+ messages in thread
From: Christian Balzer @ 2012-06-22  8:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


Hello,

On Fri, 22 Jun 2012 18:07:48 +1000 NeilBrown wrote:

> On Fri, 22 Jun 2012 16:06:32 +0900 Christian Balzer <chibi@gol.com>
> wrote:
> 
> > 
> > Hello,
> > 
> > the basics first:
> > Debian Squeeze, custom 3.2.18 kernel.
> > 
> > The Raid(s) in question are:
> > ---
> > Personalities : [raid1] [raid10] 
> > md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2] sdi1[1]
> >       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5]
> > [UUUUU]
> 
> I'm stumped by this.  It shouldn't be possible.
> 
> The size of the array is impossible.
> 
> If there are N chunks per device, then there are 5*N chunks on the whole
> array, and there are two copies of each data chunk, so
> 5*N/2 distinct data chunks, so that should be the size of the array.
> 
> So if we take the size of the array, divide by chunk size, multiply by 2,
> divide by 5, we get N = the number of chunks per device.
> i.e.
>   N = (array_size / chunk_size)*2 / 5
> 
> If we plug in 3662836224 for the array size and 512 for the chunk size,
> we get 2861590.8, which is not an integer.
> i.e. impossible.
> 
Quite right, though of course I never bothered to check that number,
pretty much assuming, after using Linux MD since the last millennium, that
it would get things right. ^o^

> What does "mdadm --examine" of the various devices show?
> 
They all look identical and sane to me:
---
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
           Name : borg03b:3  (local to host borg03b)
  Creation Time : Sat May 19 01:07:34 2012
     Raid Level : raid10
   Raid Devices : 5

 Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
     Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
  Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : fe922c1c:35319892:cc1e32e9:948d932c

    Update Time : Fri Jun 22 17:12:05 2012
       Checksum : 27a61d9a - correct
         Events : 90893

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing)

/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
           Name : borg03b:3  (local to host borg03b)
  Creation Time : Sat May 19 01:07:34 2012
     Raid Level : raid10
   Raid Devices : 5

 Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
     Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
  Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e7f5da61:cba8e3f7:d5efbd3d:2f4d3013

    Update Time : Fri Jun 22 17:12:55 2012
       Checksum : dc88710 - correct
         Events : 90923

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAAA ('A' == active, '.' == missing)

/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
           Name : borg03b:3  (local to host borg03b)
  Creation Time : Sat May 19 01:07:34 2012
     Raid Level : raid10
   Raid Devices : 5

 Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
     Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
  Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : eea0d414:382d5ac4:851772a2:af72eceb

    Update Time : Fri Jun 22 17:13:10 2012
       Checksum : caa903cc - correct
         Events : 90933

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAAA ('A' == active, '.' == missing)

/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
           Name : borg03b:3  (local to host borg03b)
  Creation Time : Sat May 19 01:07:34 2012
     Raid Level : raid10
   Raid Devices : 5

 Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
     Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
  Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : ffcfc875:77d830a0:14575bdc:c339a428

    Update Time : Fri Jun 22 17:13:34 2012
       Checksum : 7e14e4e9 - correct
         Events : 90947

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAAA ('A' == active, '.' == missing)

/dev/sdh1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x2
     Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
           Name : borg03b:3  (local to host borg03b)
  Creation Time : Sat May 19 01:07:34 2012
     Raid Level : raid10
   Raid Devices : 5

 Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
     Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
  Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
Recovery Offset : 1465135104 sectors
          State : clean
    Device UUID : e86f53a3:940ce746:25423ae0:da3b179f

    Update Time : Fri Jun 22 17:13:49 2012
       Checksum : 23fbd830 - correct
         Events : 90953

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : AAAAA ('A' == active, '.' == missing)
---
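
As a cross-check, the superblock fields are consistent with the fdisk output
from my first mail; a small sketch of the arithmetic (plain Python, assuming
512-byte sectors and the 512K chunk size shown above):
---
part_sectors = 1465136001 * 2    # sdc1 size: fdisk blocks (1 KiB) -> sectors
print(part_sectors)              # 2930272002 -- the "limit" in the errors
avail = part_sectors - 2048      # minus the 2048-sector Data Offset
print(avail)                     # 2930269954, matches "Avail Dev Size"
chunk = 1024                     # 512K chunk = 1024 sectors
print(avail // chunk * chunk)    # 2930269184, matches "Used Dev Size"
---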

I verified that these are identical to the ones on the other machine, which
survived a resync event flawlessly.

The version of mdadm in Squeeze is: mdadm - v3.1.4 - 31st August 2010

I created a pretty similar setup last year with 5 2TB drives in each array,
using a 3.0.7 kernel. That array size is nicely divisible...

I have a sinking feeling that the "fix" for this will be a rebuild of the
RAIDs on a production cluster. >.<

Christian

> NeilBrown
> 
> 
> >       
> > md3 : active raid10 sdh1[7] sdc1[0] sda4[5](S) sdg1[3] sdf1[2] sde1[6]
> >       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/4]
> > [UUUU_] [=====>...............]  recovery = 28.3%
> > (415962368/1465134592) finish=326.2min speed=53590K/sec ---
> > 
> > Drives sda to sdd are on nVidia MCP55 and sde to sdl on SAS1068E, sdc
> > to sdl are identical 1.5TB Seagates (about 2 years old, recycled from
> > the previous incarnation of these machines) with a single partition
> > spanning the whole drive like this:
> > ---
> > Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes
> > 255 heads, 63 sectors/track, 182401 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > Sector size (logical/physical): 512 bytes / 512 bytes
> > I/O size (minimum/optimal): 512 bytes / 512 bytes
> > Disk identifier: 0x00000000
> > 
> >    Device Boot      Start         End      Blocks   Id  System
> > /dev/sdc1               1      182401  1465136001   fd  Linux raid
> > autodetect ---
> > 
> > sda and sdb are new 2TB Hitachi drives, partitioned like this:
> > ---
> > Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
> > 255 heads, 63 sectors/track, 243201 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > Sector size (logical/physical): 512 bytes / 512 bytes
> > I/O size (minimum/optimal): 512 bytes / 512 bytes
> > Disk identifier: 0x000d53b0
> > 
> >    Device Boot      Start         End      Blocks   Id  System
> > /dev/sda1   *           1       31124   249999360   fd  Linux raid
> > autodetect /dev/sda2           31124       46686   124999680   fd
> > Linux raid autodetect /dev/sda3           46686       50576
> > 31246425   fd  Linux raid autodetect /dev/sda4           50576
> > 243201  1547265543+  fd  Linux raid autodetect ---
> > 
> > So the idea is to have 5 drives in each of the two Raid10s and one
> > spare on that (intentionally over-sized) fourth partition of the
> > bigger OS disks.
> > 
> > Some weeks ago a drive failed on the twin (identical everything, DRBD
> > replication of those 2 RAIDs) of the machine in question and everything
> > went according to the book, spare took over and things got rebuilt, I
> > replaced the failed drive (sdi) later:
> > ---
> > md4 : active raid10 sdi1[6](S) sdd1[0] sdb4[5] sdl1[4] sdk1[3] sdj1[2]
> >       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5]
> > [UUUUU] ---
> > 
> > Two days ago drive sdh on the machine that's having issues failed:
> > ---
> > Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device
> > offlined - not ready after error recovery Jun 20 18:22:39 borg03b
> > kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> > Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O
> > error, dev sdh, sector 71 Jun 20 18:22:39 borg03b kernel:
> > [1383395.452063] md: super_written gets error=-5, uptodate=0 Jun 20
> > 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure
> > on sdh1, disabling device. Jun 20 18:22:39 borg03b kernel:
> > [1383395.452063] md/raid10:md3: Operation continuing on 4 devices. Jun
> > 20 18:22:39 borg03b kernel: [1383395.527178] RAID10 conf printout: Jun
> > 20 18:22:39 borg03b kernel: [1383395.527181]  --- wd:4 rd:5 Jun 20
> > 18:22:39 borg03b kernel: [1383395.527184]  disk 0, wo:0, o:1, dev:sdc1
> > Jun 20 18:22:39 borg03b kernel: [1383395.527186]  disk 1, wo:0, o:1,
> > dev:sde1 Jun 20 18:22:39 borg03b kernel: [1383395.527189]  disk 2,
> > wo:0, o:1, dev:sdf1 Jun 20 18:22:39 borg03b kernel: [1383395.527191]
> > disk 3, wo:0, o:1, dev:sdg1 Jun 20 18:22:39 borg03b kernel:
> > [1383395.527193]  disk 4, wo:1, o:0, dev:sdh1 Jun 20 18:22:39 borg03b
> > kernel: [1383395.568037] RAID10 conf printout: Jun 20 18:22:39 borg03b
> > kernel: [1383395.568040]  --- wd:4 rd:5 Jun 20 18:22:39 borg03b
> > kernel: [1383395.568042]  disk 0, wo:0, o:1, dev:sdc1 Jun 20 18:22:39
> > borg03b kernel: [1383395.568045]  disk 1, wo:0, o:1, dev:sde1 Jun 20
> > 18:22:39 borg03b kernel: [1383395.568047]  disk 2, wo:0, o:1, dev:sdf1
> > Jun 20 18:22:39 borg03b kernel: [1383395.568049]  disk 3, wo:0, o:1,
> > dev:sdg1 Jun 20 18:22:39 borg03b kernel: [1383395.568060] RAID10 conf
> > printout: Jun 20 18:22:39 borg03b kernel: [1383395.568061]  --- wd:4
> > rd:5 Jun 20 18:22:39 borg03b kernel: [1383395.568063]  disk 0, wo:0,
> > o:1, dev:sdc1 Jun 20 18:22:39 borg03b kernel: [1383395.568065]  disk
> > 1, wo:0, o:1, dev:sde1 Jun 20 18:22:39 borg03b kernel:
> > [1383395.568068]  disk 2, wo:0, o:1, dev:sdf1 Jun 20 18:22:39 borg03b
> > kernel: [1383395.568070]  disk 3, wo:0, o:1, dev:sdg1 Jun 20 18:22:39
> > borg03b kernel: [1383395.568072]  disk 4, wo:1, o:1, dev:sda4 Jun 20
> > 18:22:39 borg03b kernel: [1383395.568135] md: recovery of RAID array
> > md3 Jun 20 18:22:39 borg03b kernel: [1383395.568139] md: minimum
> > _guaranteed_  speed: 20000 KB/sec/disk. Jun 20 18:22:39 borg03b
> > kernel: [1383395.568142] md: using maximum available idle IO bandwidth
> > (but not more than 500000 KB/sec) for recovery. Jun 20 18:22:39
> > borg03b kernel: [1383395.568155] md: using 128k window, over a total
> > of 1465134592k. ---
> > 
> > OK, spare kicked, recovery underway (from the neighbors sdg and sdc),
> > but then: ---
> > Jun 21 02:29:29 borg03b kernel: [1412604.989978] attempt to access
> > beyond end of device Jun 21 02:29:29 borg03b kernel: [1412604.989983]
> > sdc1: rw=0, want=2930272128, limit=2930272002 Jun 21 02:29:29 borg03b
> > kernel: [1412604.990003] attempt to access beyond end of device Jun 21
> > 02:29:29 borg03b kernel: [1412604.990009] sdc1: rw=16,
> > want=2930272008, limit=2930272002 Jun 21 02:29:29 borg03b kernel:
> > [1412604.990013] md/raid10:md3: recovery aborted due to read error Jun
> > 21 02:29:29 borg03b kernel: [1412604.990025] attempt to access beyond
> > end of device Jun 21 02:29:29 borg03b kernel: [1412604.990028] sdc1:
> > rw=0, want=2930272256, limit=2930272002 Jun 21 02:29:29 borg03b
> > kernel: [1412604.990032] md: md3: recovery done. Jun 21 02:29:29
> > borg03b kernel: [1412604.990035] attempt to access beyond end of
> > device Jun 21 02:29:29 borg03b kernel: [1412604.990038] sdc1: rw=16,
> > want=2930272136, limit=2930272002 Jun 21 02:29:29 borg03b kernel:
> > [1412604.990040] md/raid10:md3: recovery aborted due to read error ---
> > 
> > Why it would want to read data beyond the end of that device (and
> > partition) is a complete mystery to me, if anything was odd with this
> > Raid or its superblocks, surely the initial sync should have stumbled
> > across this as well?
> > 
> > After this failure the kernel goes into a log frenzy:
> > ---
> > Jun 21 02:29:29 borg03b kernel: [1412605.744052] RAID10 conf printout:
> > Jun 21 02:29:29 borg03b kernel: [1412605.744055]  --- wd:4 rd:5
> > Jun 21 02:29:29 borg03b kernel: [1412605.744057]  disk 0, wo:0, o:1,
> > dev:sdc1 Jun 21 02:29:29 borg03b kernel: [1412605.744060]  disk 1,
> > wo:0, o:1, dev:sde1 Jun 21 02:29:29 borg03b kernel: [1412605.744062]
> > disk 2, wo:0, o:1, dev:sdf1 Jun 21 02:29:29 borg03b kernel:
> > [1412605.744064]  disk 3, wo:0, o:1, dev:sdg1 ---
> > repeating every second or so, until I "mdadm -r"ed the sda4 partition
> > (former spare).
> > 
> > On the next day I replaced the failed sdh drive with another 2TB
> > Hitachi (having only 1.5TB Seagates of dubious quality lying around),
> > gave it the same single partition size as the other drives and added
> > it to md3.
> > 
> > The resync failed in the same manner:
> > ---
> > Jun 21 20:59:06 borg03b kernel: [1479182.509914] attempt to access
> > beyond end of device Jun 21 20:59:06 borg03b kernel: [1479182.509920]
> > sdc1: rw=0, want=2930272128, limit=2930272002 Jun 21 20:59:06 borg03b
> > kernel: [1479182.509931] attempt to access beyond end of device Jun 21
> > 20:59:06 borg03b kernel: [1479182.509933] attempt to access beyond end
> > of device Jun 21 20:59:06 borg03b kernel: [1479182.509937] sdc1: rw=0,
> > want=2930272256, limit=2930272002 Jun 21 20:59:06 borg03b kernel:
> > [1479182.509942] md: md3: recovery done. Jun 21 20:59:06 borg03b
> > kernel: [1479182.509948] sdc1: rw=16, want=2930272008,
> > limit=2930272002 Jun 21 20:59:06 borg03b kernel: [1479182.509952]
> > md/raid10:md3: recovery aborted due to read error Jun 21 20:59:06
> > borg03b kernel: [1479182.509963] attempt to access beyond end of
> > device Jun 21 20:59:06 borg03b kernel: [1479182.509965] sdc1: rw=16,
> > want=2930272136, limit=2930272002 Jun 21 20:59:06 borg03b kernel:
> > [1479182.509968] md/raid10:md3: recovery aborted due to read error ---
> > 
> > I've now scrounged up an identical 1.5TB drive and added it to the Raid
> > (the recovery visible in the topmost mdstat). 
> > If that fails as well, I'm completely lost as to what's going on, if it
> > succeeds though I guess we're looking at a subtle bug. 
> > 
> > I didn't find anything like this mentioned in the archives before, any
> > and all feedback would be most welcome.
> > 
> > Regards,
> > 
> > Christian
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-22  8:42   ` Christian Balzer
@ 2012-06-23  4:13     ` Christian Balzer
  2012-06-25  4:07     ` NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: Christian Balzer @ 2012-06-23  4:13 UTC (permalink / raw)
  To: linux-raid

On Fri, 22 Jun 2012 17:42:57 +0900 Christian Balzer wrote:

> 
> Hello,
> 
> On Fri, 22 Jun 2012 18:07:48 +1000 NeilBrown wrote:
> 
> > On Fri, 22 Jun 2012 16:06:32 +0900 Christian Balzer <chibi@gol.com>
> > wrote:
> > 
> > > 
> > > Hello,
> > > 
> > > the basics first:
> > > Debian Squeeze, custom 3.2.18 kernel.
> > > 
> > > The Raid(s) in question are:
> > > ---
> > > Personalities : [raid1] [raid10] 
> > > md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2]
> > > sdi1[1] 3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5]
> > > [UUUUU]
> > 
> > I'm stumped by this.  It shouldn't be possible.
> > 
> > The size of the array is impossible.
> > 
> > If there are N chunks per device, then there are 5*N chunks on the
> > whole array, and there are two copies of each data chunk, so
> > 5*N/2 distinct data chunks, so that should be the size of the array.
> > 
> > So if we take the size of the array, divide by chunk size, multiply by
> > 2, divide by 5, we get N = the number of chunks per device.
> > i.e.
> >   N = (array_size / chunk_size)*2 / 5
> > 
> > If we plug in 3662836224 for the array size and 512 for the chunk size,
> > we get 2861590.8, which is not an integer.
> > i.e. impossible.
> > 
> Quite right, though I never bothered to check that number of course,
> pretty much assuming after using Linux MD since the last millennium that
> it would get things right. ^o^
> 

Well, the last rebuild attempt onto a disk identical to the rest also
failed, with the exact same error as below. <sigh>

Does anybody have anything wise and pertinent to say about this?
Especially given that an identically "broken" raid managed to recover and
finish a resync.

Anything that could be done in situ?

I guess not, though.

So I would probably be looking at:
1. Remove the broken raid as backing device for the DRBD on top of it;
thankfully this is the secondary node. Pray that the primary node doesn't
fail while doing the next steps.

2. Stop the raid and re-create it. Now what if that gives me the exact same
broken result? It certainly did it happily four times before...

3. If I manage to somehow get a sane-looking result, make sure that it can
recover from disk failures. So I'm looking at at least 2 days of testing and
more praying that the primary doesn't fail.

4. Assuming that the resulting raid is smaller than the current one, shrink
the file system (impossible to do while mounted, or so the resize2fs
documentation suggests). Downtime, ouch. Potential to eat all babies in
the vicinity, double ouch.

5. Shrink the DRBD device accordingly; more gnashing of teeth and
sacrificing of goats to keep the data-destroying demons at bay.
 
6. Re-attach the new and sane raid as backing device to DRBD. Wait another
day for things to sync up.

7. Do a failover (downtime...) and heal the now secondary backing raid on
the other node.

8. Repeat 4-7 for the other DRBD resource and raid pair.


Regards,

Christian
> > What does "mdadm --examine" of the various devices show?
> > 
> They all look identical and sane to me:
> ---
> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
>            Name : borg03b:3  (local to host borg03b)
>   Creation Time : Sat May 19 01:07:34 2012
>      Raid Level : raid10
>    Raid Devices : 5
> 
>  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
>   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : fe922c1c:35319892:cc1e32e9:948d932c
> 
>     Update Time : Fri Jun 22 17:12:05 2012
>        Checksum : 27a61d9a - correct
>          Events : 90893
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>    Device Role : Active device 0
>    Array State : AAAAA ('A' == active, '.' == missing)
> 
> /dev/sdg1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
>            Name : borg03b:3  (local to host borg03b)
>   Creation Time : Sat May 19 01:07:34 2012
>      Raid Level : raid10
>    Raid Devices : 5
> 
>  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
>   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : e7f5da61:cba8e3f7:d5efbd3d:2f4d3013
> 
>     Update Time : Fri Jun 22 17:12:55 2012
>        Checksum : dc88710 - correct
>          Events : 90923
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>    Device Role : Active device 3
>    Array State : AAAAA ('A' == active, '.' == missing)
> 
> /dev/sdf1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
>            Name : borg03b:3  (local to host borg03b)
>   Creation Time : Sat May 19 01:07:34 2012
>      Raid Level : raid10
>    Raid Devices : 5
> 
>  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
>   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : eea0d414:382d5ac4:851772a2:af72eceb
> 
>     Update Time : Fri Jun 22 17:13:10 2012
>        Checksum : caa903cc - correct
>          Events : 90933
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>    Device Role : Active device 2
>    Array State : AAAAA ('A' == active, '.' == missing)
> 
> /dev/sde1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
>            Name : borg03b:3  (local to host borg03b)
>   Creation Time : Sat May 19 01:07:34 2012
>      Raid Level : raid10
>    Raid Devices : 5
> 
>  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
>   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : ffcfc875:77d830a0:14575bdc:c339a428
> 
>     Update Time : Fri Jun 22 17:13:34 2012
>        Checksum : 7e14e4e9 - correct
>          Events : 90947
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>    Device Role : Active device 1
>    Array State : AAAAA ('A' == active, '.' == missing)
> 
> /dev/sdh1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x2
>      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
>            Name : borg03b:3  (local to host borg03b)
>   Creation Time : Sat May 19 01:07:34 2012
>      Raid Level : raid10
>    Raid Devices : 5
> 
>  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
>   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
> Recovery Offset : 1465135104 sectors
>           State : clean
>     Device UUID : e86f53a3:940ce746:25423ae0:da3b179f
> 
>     Update Time : Fri Jun 22 17:13:49 2012
>        Checksum : 23fbd830 - correct
>          Events : 90953
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>    Device Role : Active device 4
>    Array State : AAAAA ('A' == active, '.' == missing)
> ---
> 
> I verified that these are identical to the ones on the other machine
> which survived a resync event flawlessly. 
> 
> The version of mdadm in Squeeze is: mdadm - v3.1.4 - 31st August 2010
> 
> I created a pretty similar setup last year with 5 2TB drives each and
> using a 3.0.7 kernel. That array size is nicely divisible...
> 
> I have a sinking feeling that the "fix" for this will be a rebuild of the
> RAIDs on a production cluster. >.<
> 
> Christian
> 
> > NeilBrown
> > 
> > 
> > >       
> > > md3 : active raid10 sdh1[7] sdc1[0] sda4[5](S) sdg1[3] sdf1[2]
> > > sde1[6] 3662836224 blocks super 1.2 512K chunks 2 near-copies [5/4]
> > > [UUUU_] [=====>...............]  recovery = 28.3%
> > > (415962368/1465134592) finish=326.2min speed=53590K/sec ---
> > > 
> > > Drives sda to sdd are on nVidia MCP55 and sde to sdl on SAS1068E, sdc
> > > to sdl are identical 1.5TB Seagates (about 2 years old, recycled from
> > > the previous incarnation of these machines) with a single partition
> > > spanning the whole drive like this:
> > > ---
> > > Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes
> > > 255 heads, 63 sectors/track, 182401 cylinders
> > > Units = cylinders of 16065 * 512 = 8225280 bytes
> > > Sector size (logical/physical): 512 bytes / 512 bytes
> > > I/O size (minimum/optimal): 512 bytes / 512 bytes
> > > Disk identifier: 0x00000000
> > > 
> > >    Device Boot      Start         End      Blocks   Id  System
> > > /dev/sdc1               1      182401  1465136001   fd  Linux raid
> > > autodetect ---
> > > 
> > > sda and sdb are new 2TB Hitachi drives, partitioned like this:
> > > ---
> > > Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
> > > 255 heads, 63 sectors/track, 243201 cylinders
> > > Units = cylinders of 16065 * 512 = 8225280 bytes
> > > Sector size (logical/physical): 512 bytes / 512 bytes
> > > I/O size (minimum/optimal): 512 bytes / 512 bytes
> > > Disk identifier: 0x000d53b0
> > > 
> > >    Device Boot      Start         End      Blocks   Id  System
> > > /dev/sda1   *           1       31124   249999360   fd  Linux raid
> > > autodetect /dev/sda2           31124       46686   124999680   fd
> > > Linux raid autodetect /dev/sda3           46686       50576
> > > 31246425   fd  Linux raid autodetect /dev/sda4           50576
> > > 243201  1547265543+  fd  Linux raid autodetect ---
> > > 
> > > So the idea is to have 5 drives in each of the two Raid10s and one
> > > spare on that (intentionally over-sized) fourth partition of the
> > > bigger OS disks.
> > > 
> > > Some weeks ago a drive failed on the twin (identical everything, DRBD
> > > replication of those 2 RAIDs) of the machine in question and
> > > everything went according to the book, spare took over and things
> > > got rebuilt, I replaced the failed drive (sdi) later:
> > > ---
> > > md4 : active raid10 sdi1[6](S) sdd1[0] sdb4[5] sdl1[4] sdk1[3]
> > > sdj1[2] 3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5]
> > > [UUUUU] ---
> > > 
> > > Two days ago drive sdh on the machine that's having issues failed:
> > > ---
> > > Jun 20 18:22:39 borg03b kernel: [1383395.448043] sd 8:0:3:0: Device
> > > offlined - not ready after error recovery Jun 20 18:22:39 borg03b
> > > kernel: [1383395.448135] sd 8:0:3:0: rejecting I/O to offline device
> > > Jun 20 18:22:39 borg03b kernel: [1383395.452063] end_request: I/O
> > > error, dev sdh, sector 71 Jun 20 18:22:39 borg03b kernel:
> > > [1383395.452063] md: super_written gets error=-5, uptodate=0 Jun 20
> > > 18:22:39 borg03b kernel: [1383395.452063] md/raid10:md3: Disk failure
> > > on sdh1, disabling device. Jun 20 18:22:39 borg03b kernel:
> > > [1383395.452063] md/raid10:md3: Operation continuing on 4 devices.
> > > Jun 20 18:22:39 borg03b kernel: [1383395.527178] RAID10 conf
> > > printout: Jun 20 18:22:39 borg03b kernel: [1383395.527181]  --- wd:4
> > > rd:5 Jun 20 18:22:39 borg03b kernel: [1383395.527184]  disk 0, wo:0,
> > > o:1, dev:sdc1 Jun 20 18:22:39 borg03b kernel: [1383395.527186]  disk
> > > 1, wo:0, o:1, dev:sde1 Jun 20 18:22:39 borg03b kernel:
> > > [1383395.527189]  disk 2, wo:0, o:1, dev:sdf1 Jun 20 18:22:39
> > > borg03b kernel: [1383395.527191] disk 3, wo:0, o:1, dev:sdg1 Jun 20
> > > 18:22:39 borg03b kernel: [1383395.527193]  disk 4, wo:1, o:0,
> > > dev:sdh1 Jun 20 18:22:39 borg03b kernel: [1383395.568037] RAID10
> > > conf printout: Jun 20 18:22:39 borg03b kernel: [1383395.568040]  ---
> > > wd:4 rd:5 Jun 20 18:22:39 borg03b kernel: [1383395.568042]  disk 0,
> > > wo:0, o:1, dev:sdc1 Jun 20 18:22:39 borg03b kernel:
> > > [1383395.568045]  disk 1, wo:0, o:1, dev:sde1 Jun 20 18:22:39
> > > borg03b kernel: [1383395.568047]  disk 2, wo:0, o:1, dev:sdf1 Jun 20
> > > 18:22:39 borg03b kernel: [1383395.568049]  disk 3, wo:0, o:1,
> > > dev:sdg1 Jun 20 18:22:39 borg03b kernel: [1383395.568060] RAID10
> > > conf printout: Jun 20 18:22:39 borg03b kernel: [1383395.568061]  ---
> > > wd:4 rd:5 Jun 20 18:22:39 borg03b kernel: [1383395.568063]  disk 0,
> > > wo:0, o:1, dev:sdc1 Jun 20 18:22:39 borg03b kernel:
> > > [1383395.568065]  disk 1, wo:0, o:1, dev:sde1 Jun 20 18:22:39
> > > borg03b kernel: [1383395.568068]  disk 2, wo:0, o:1, dev:sdf1 Jun 20
> > > 18:22:39 borg03b kernel: [1383395.568070]  disk 3, wo:0, o:1,
> > > dev:sdg1 Jun 20 18:22:39 borg03b kernel: [1383395.568072]  disk 4,
> > > wo:1, o:1, dev:sda4 Jun 20 18:22:39 borg03b kernel: [1383395.568135]
> > > md: recovery of RAID array md3 Jun 20 18:22:39 borg03b kernel:
> > > [1383395.568139] md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
> > > Jun 20 18:22:39 borg03b kernel: [1383395.568142] md: using maximum
> > > available idle IO bandwidth (but not more than 500000 KB/sec) for
> > > recovery. Jun 20 18:22:39 borg03b kernel: [1383395.568155] md: using
> > > 128k window, over a total of 1465134592k. ---
> > > 
> > > OK, spare kicked, recovery underway (from the neighbors sdg and sdc),
> > > but then: ---
> > > Jun 21 02:29:29 borg03b kernel: [1412604.989978] attempt to access
> > > beyond end of device Jun 21 02:29:29 borg03b kernel: [1412604.989983]
> > > sdc1: rw=0, want=2930272128, limit=2930272002 Jun 21 02:29:29 borg03b
> > > kernel: [1412604.990003] attempt to access beyond end of device Jun
> > > 21 02:29:29 borg03b kernel: [1412604.990009] sdc1: rw=16,
> > > want=2930272008, limit=2930272002 Jun 21 02:29:29 borg03b kernel:
> > > [1412604.990013] md/raid10:md3: recovery aborted due to read error
> > > Jun 21 02:29:29 borg03b kernel: [1412604.990025] attempt to access
> > > beyond end of device Jun 21 02:29:29 borg03b kernel:
> > > [1412604.990028] sdc1: rw=0, want=2930272256, limit=2930272002 Jun
> > > 21 02:29:29 borg03b kernel: [1412604.990032] md: md3: recovery done.
> > > Jun 21 02:29:29 borg03b kernel: [1412604.990035] attempt to access
> > > beyond end of device Jun 21 02:29:29 borg03b kernel:
> > > [1412604.990038] sdc1: rw=16, want=2930272136, limit=2930272002 Jun
> > > 21 02:29:29 borg03b kernel: [1412604.990040] md/raid10:md3: recovery
> > > aborted due to read error ---
> > > 
> > > Why it would want to read data beyond the end of that device (and
> > > partition) is a complete mystery to me, if anything was odd with this
> > > Raid or its superblocks, surely the initial sync should have stumbled
> > > across this as well?
> > > 
> > > After this failure the kernel goes into a log frenzy:
> > > ---
> > > Jun 21 02:29:29 borg03b kernel: [1412605.744052] RAID10 conf
> > > printout: Jun 21 02:29:29 borg03b kernel: [1412605.744055]  --- wd:4
> > > rd:5 Jun 21 02:29:29 borg03b kernel: [1412605.744057]  disk 0, wo:0,
> > > o:1, dev:sdc1 Jun 21 02:29:29 borg03b kernel: [1412605.744060]  disk
> > > 1, wo:0, o:1, dev:sde1 Jun 21 02:29:29 borg03b kernel:
> > > [1412605.744062] disk 2, wo:0, o:1, dev:sdf1 Jun 21 02:29:29 borg03b
> > > kernel: [1412605.744064]  disk 3, wo:0, o:1, dev:sdg1 ---
> > > repeating every second or so, until I "mdadm -r"ed the sda4 partition
> > > (former spare).
> > > 
> > > On the next day I replaced the failed sdh drive with another 2TB
> > > Hitachi (having only 1.5TB Seagates of dubious quality lying around),
> > > gave it the same single partition size as the other drives and added
> > > it to md3.
> > > 
> > > The resync failed in the same manner:
> > > ---
> > > Jun 21 20:59:06 borg03b kernel: [1479182.509914] attempt to access
> > > beyond end of device Jun 21 20:59:06 borg03b kernel: [1479182.509920]
> > > sdc1: rw=0, want=2930272128, limit=2930272002 Jun 21 20:59:06 borg03b
> > > kernel: [1479182.509931] attempt to access beyond end of device Jun
> > > 21 20:59:06 borg03b kernel: [1479182.509933] attempt to access
> > > beyond end of device Jun 21 20:59:06 borg03b kernel:
> > > [1479182.509937] sdc1: rw=0, want=2930272256, limit=2930272002 Jun
> > > 21 20:59:06 borg03b kernel: [1479182.509942] md: md3: recovery done.
> > > Jun 21 20:59:06 borg03b kernel: [1479182.509948] sdc1: rw=16,
> > > want=2930272008, limit=2930272002 Jun 21 20:59:06 borg03b kernel:
> > > [1479182.509952] md/raid10:md3: recovery aborted due to read error
> > > Jun 21 20:59:06 borg03b kernel: [1479182.509963] attempt to access
> > > beyond end of device Jun 21 20:59:06 borg03b kernel:
> > > [1479182.509965] sdc1: rw=16, want=2930272136, limit=2930272002 Jun
> > > 21 20:59:06 borg03b kernel: [1479182.509968] md/raid10:md3: recovery
> > > aborted due to read error ---
> > > 
> > > I've now scrounged up an identical 1.5TB drive and added it to the
> > > Raid (the recovery visible in the topmost mdstat). 
> > > If that fails as well, I'm completely lost as to what's going on, if
> > > it succeeds though I guess we're looking at a subtle bug. 
> > > 
> > > I didn't find anything like this mentioned in the archives before,
> > > any and all feedback would be most welcome.
> > > 
> > > Regards,
> > > 
> > > Christian
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-22  8:42   ` Christian Balzer
  2012-06-23  4:13     ` Christian Balzer
@ 2012-06-25  4:07     ` NeilBrown
  2012-06-25  6:06       ` Christian Balzer
  1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-06-25  4:07 UTC (permalink / raw)
  To: Christian Balzer; +Cc: linux-raid

On Fri, 22 Jun 2012 17:42:57 +0900 Christian Balzer <chibi@gol.com> wrote:

> 
> Hello,
> 
> On Fri, 22 Jun 2012 18:07:48 +1000 NeilBrown wrote:
> 
> > On Fri, 22 Jun 2012 16:06:32 +0900 Christian Balzer <chibi@gol.com>
> > wrote:
> > 
> > > 
> > > Hello,
> > > 
> > > the basics first:
> > > Debian Squeeze, custom 3.2.18 kernel.
> > > 
> > > The Raid(s) in question are:
> > > ---
> > > Personalities : [raid1] [raid10] 
> > > md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2] sdi1[1]
> > >       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5]
> > > [UUUUU]
> > 
> > I'm stumped by this.  It shouldn't be possible.
> > 
> > The size of the array is impossible.
> > 
> > If there are N chunks per device, then there are 5*N chunks on the whole
> > > array, and there are two copies of each data chunk, so
> > 5*N/2 distinct data chunks, so that should be the size of the array.
> > 
> > So if we take the size of the array, divide by chunk size, multiply by 2,
> > divide by 5, we get N = the number of chunks per device.
> > i.e.
> >   N = (array_size / chunk_size)*2 / 5
> > 
> > If we plug in 3662836224 for the array size and 512 for the chunk size,
> > we get 2861590.8, which is not an integer.
> > i.e. impossible.
> > 
> Quite right, though I never bothered to check that number of course,
> pretty much assuming after using Linux MD since the last millennium that
> it would get things right. ^o^
> 
> > What does "mdadm --examine" of the various devices show?
> > 
> They all look identical and sane to me:
> ---
> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
>            Name : borg03b:3  (local to host borg03b)
>   Creation Time : Sat May 19 01:07:34 2012
>      Raid Level : raid10
>    Raid Devices : 5
> 
>  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
>      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
>   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : fe922c1c:35319892:cc1e32e9:948d932c
> 
>     Update Time : Fri Jun 22 17:12:05 2012
>        Checksum : 27a61d9a - correct
>          Events : 90893
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>    Device Role : Active device 0
>    Array State : AAAAA ('A' == active, '.' == missing)

Thanks.
With this extra info - and the clearer perspective that morning provides - I
see what is happening.

The following kernel patch should make it work for you.  It was made and
tested against 3.4, but should apply to your 3.2 kernel.

The problem only occurs when recovering the last device in certain RAID10
arrays.  If you had > 2 copies (e.g. --layout=n3) it could be more than just
the last device.

RAID10 with an odd number of devices (5 in this case) lays out chunks like
this:

 A A B B C
 C D D E E
 F F G G H
 H I I J J

If you have an even number of stripes, everything is happy.
If you have an odd number of stripes - as is the case with your problem array
- then the last stripe might look like:

 F F G G H

The 'H' chunk only exists once.  There is no mirror for it.
md does not store any data in this chunk - the size of the array is calculated
to finish after 'G'.
However, the recovery code isn't quite so careful.  It tries to recover this
chunk and loads it from beyond the end of the first device - which is where
it would be if the devices were all a bit bigger.

So there is no risk of data corruption here - just that md tries to recover a
block that isn't in the array, fails, and aborts the recovery.
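
To make the geometry easy to play with, here is a stand-alone toy model
(an illustration only - the real placement logic is raid10_find_virt()
and friends in drivers/md/raid10.c, and the device/stripe counts below
are made up) that prints the near=2 layout and flags the unpaired chunk:

---
#include <stdio.h>

int main(void)
{
	const int ndevs   = 5;   /* odd device count, as in your arrays */
	const int copies  = 2;   /* near=2                              */
	const int stripes = 3;   /* odd stripe count triggers the issue */
	const int slots   = ndevs * stripes;
	const int chunks  = slots / copies;  /* array size, in chunks   */
	int slot;

	for (slot = 0; slot < slots; slot++) {
		int chunk = slot / copies;  /* data chunk in this slot  */
		int dev   = slot % ndevs;   /* device the slot lives on */
		int row   = slot / ndevs;   /* stripe (row) on that dev */

		printf("dev %d row %d: chunk %c%s\n", dev, row, 'A' + chunk,
		       chunk >= chunks ? "  <- single copy, past the end" : "");
	}
	printf("%d slots, %d mirrored data chunks\n", slots, chunks);
	return 0;
}
---

With 3 stripes it prints the F F G G H row last and flags 'H' as the one
chunk that has no mirror and lies beyond the array's official end.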

This patch gets it to complete the recovery earlier so that it doesn't try
(and fail) to do the impossible.

If you could test and confirm, I'd appreciate it.

Thanks,
NeilBrown

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 99ae606..bcf6ea8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2890,6 +2890,12 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 			/* want to reconstruct this device */
 			rb2 = r10_bio;
 			sect = raid10_find_virt(conf, sector_nr, i);
+			if (sect >= mddev->resync_max_sectors) {
+				/* last stripe is not complete - don't
+				 * try to recover this sector.
+				 */
+				continue;
+			}
 			/* Unless we are doing a full sync, or a replacement
 			 * we only need to recover the block if it is set in
 			 * the bitmap



* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-25  4:07     ` NeilBrown
@ 2012-06-25  6:06       ` Christian Balzer
  2012-06-26 14:48         ` Christian Balzer
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Balzer @ 2012-06-25  6:06 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


Hello Neil,

On Mon, 25 Jun 2012 14:07:54 +1000 NeilBrown wrote:

> On Fri, 22 Jun 2012 17:42:57 +0900 Christian Balzer <chibi@gol.com>
> wrote:
> 
> > 
> > Hello,
> > 
> > On Fri, 22 Jun 2012 18:07:48 +1000 NeilBrown wrote:
> > 
> > > On Fri, 22 Jun 2012 16:06:32 +0900 Christian Balzer <chibi@gol.com>
> > > wrote:
> > > 
> > > > 
> > > > Hello,
> > > > 
> > > > the basics first:
> > > > Debian Squeeze, custom 3.2.18 kernel.
> > > > 
> > > > The Raid(s) in question are:
> > > > ---
> > > > Personalities : [raid1] [raid10] 
> > > > md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2] sdi1[1]
> > > >       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]
> > > 
> > > I'm stumped by this.  It shouldn't be possible.
> > > 
> > > The size of the array is impossible.
> > > 
> > > If there are N chunks per device, then there are 5*N chunks on the
> > > whole array, and there are two copies of each data chunk, so
> > > 5*N/2 distinct data chunks, so that should be the size of the array.
> > > 
> > > So if we take the size of the array, divide by chunk size, multiply
> > > by 2, divide by 5, we get N = the number of chunks per device.
> > > i.e.
> > >   N = (array_size / chunk_size)*2 / 5
> > > 
> > > If we plug in 3662836224 for the array size and 512 for the chunk
> > > size, we get 2861590.8, which is not an integer.
> > > i.e. impossible.
> > > 
> > Quite right, though of course I never bothered to check that number,
> > having pretty much assumed, after using Linux MD since the last
> > millennium, that it would get things right. ^o^
> > 
> > > What does "mdadm --examine" of the various devices show?
> > > 
> > They all look identical and sane to me:
> > ---
> > /dev/sdc1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x0
> >      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
> >            Name : borg03b:3  (local to host borg03b)
> >   Creation Time : Sat May 19 01:07:34 2012
> >      Raid Level : raid10
> >    Raid Devices : 5
> > 
> >  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
> >      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
> >   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
> >     Data Offset : 2048 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : fe922c1c:35319892:cc1e32e9:948d932c
> > 
> >     Update Time : Fri Jun 22 17:12:05 2012
> >        Checksum : 27a61d9a - correct
> >          Events : 90893
> > 
> >          Layout : near=2
> >      Chunk Size : 512K
> > 
> >    Device Role : Active device 0
> >    Array State : AAAAA ('A' == active, '.' == missing)
> 
> Thanks.
> With this extra info - and the clearer perspective that morning provides
> - I see what is happening.
>
Ah, thank goodness for that. ^.^
 
> The following kernel patch should make it work for you.  It was made and
> tested against 3.4, but it should apply to your 3.2 kernel.
> 
> The problem only occurs when recovering the last device in certain RAID10
> arrays.  If you had > 2 copies (e.g. --layout=n3) it could be more than
> just the last device.
> 
> RAID10 with an odd number of devices (5 in this case) lays out chunks
> like this:
> 
>  A A B B C
>  C D D E E
>  F F G G H
>  H I I J J
> 
> If you have an even number of stripes, everything is happy.
> If you have an odd number of stripes - as is the case with your problem
> array - then the last stripe might look like:
> 
>  F F G G H
> 
> The 'H' chunk only exists once.  There is no mirror for it.
> md does not store any data in this chunk - the size of the array is
> calculated to finish after 'G'.
> However the recovery code isn't quite so careful.  It tries to recover
> this chunk and loads it from beyond the end of the first device - which
> is where it would be if the devices were all a bit bigger.
> 
That makes perfect sense; I'm just amazed to be the first one to
encounter this. Granted, most people will have an even number of drives,
given typical controller and server backplanes (1U -> 4x 3.5" drives),
but the ability to use an odd number (and gain the additional speed
another spindle adds) was always one of the nice points of the MD Raid10
implementation.
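
For anyone finding this in the archives: such an array is created like
any other md RAID10 - a sketch with made-up device names:

---
# 5-way near=2 RAID10 with 512K chunks (device names are examples only)
mdadm --create /dev/md3 --level=10 --layout=n2 --chunk=512 \
      --raid-devices=5 /dev/sd[c-g]1
---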

> So there is no risk of data corruption here - just that md tries to
> recover a block that isn't in the array, fails, and aborts the recovery.
>
That's a relief!
 
> This patch gets it to complete the recovery earlier so that it doesn't
> try (and fail) to do the impossible.
> 
> If you could test and confirm, I'd appreciate it.
> 
I've built a new kernel package (taking the opportunity to go to 3.2.20)
and the matching drbd module, and scheduled downtime for tomorrow.

Should know if this fixes it by Wednesday.

Many thanks,

Christian

> Thanks,
> NeilBrown
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 99ae606..bcf6ea8 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2890,6 +2890,12 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
>  			/* want to reconstruct this device */
>  			rb2 = r10_bio;
>  			sect = raid10_find_virt(conf, sector_nr, i);
> +			if (sect >= mddev->resync_max_sectors) {
> +				/* last stripe is not complete - don't
> +				 * try to recover this sector.
> +				 */
> +				continue;
> +			}
>  			/* Unless we are doing a full sync, or a replacement
>  			 * we only need to recover the block if it is set in
>  			 * the bitmap


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/


* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-25  6:06       ` Christian Balzer
@ 2012-06-26 14:48         ` Christian Balzer
  2012-07-03  1:46           ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Balzer @ 2012-06-26 14:48 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown


Hello,

On Mon, 25 Jun 2012 15:06:51 +0900 Christian Balzer wrote:

> 
> Hello Neil,
> 
> On Mon, 25 Jun 2012 14:07:54 +1000 NeilBrown wrote:
> 
> > On Fri, 22 Jun 2012 17:42:57 +0900 Christian Balzer <chibi@gol.com>
> > wrote:
> > 
> > > 
> > > Hello,
> > > 
> > > On Fri, 22 Jun 2012 18:07:48 +1000 NeilBrown wrote:
> > > 
> > > > On Fri, 22 Jun 2012 16:06:32 +0900 Christian Balzer <chibi@gol.com>
> > > > wrote:
> > > > 
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > the basics first:
> > > > > Debian Squeeze, custom 3.2.18 kernel.
> > > > > 
> > > > > The Raid(s) in question are:
> > > > > ---
> > > > > Personalities : [raid1] [raid10] 
> > > > > md4 : active raid10 sdd1[0] sdb4[5](S) sdl1[4] sdk1[3] sdj1[2] sdi1[1]
> > > > >       3662836224 blocks super 1.2 512K chunks 2 near-copies [5/5] [UUUUU]
> > > > 
> > > > I'm stumped by this.  It shouldn't be possible.
> > > > 
> > > > The size of the array is impossible.
> > > > 
> > > > If there are N chunks per device, then there are 5*N chunks on the
> > > > whole array, and there are two copies of each data chunk, so
> > > > 5*N/2 distinct data chunks, so that should be the size of the
> > > > array.
> > > > 
> > > > So if we take the size of the array, divide by chunk size, multiply
> > > > by 2, divide by 5, we get N = the number of chunks per device.
> > > > i.e.
> > > >   N = (array_size / chunk_size)*2 / 5
> > > > 
> > > > If we plug in 3662836224 for the array size and 512 for the chunk
> > > > size, we get 2861590.8, which is not an integer.
> > > > i.e. impossible.
> > > > 
> > > Quite right, though of course I never bothered to check that number,
> > > having pretty much assumed, after using Linux MD since the last
> > > millennium, that it would get things right. ^o^
> > > 
> > > > What does "mdadm --examine" of the various devices show?
> > > > 
> > > They all look identical and sane to me:
> > > ---
> > > /dev/sdc1:
> > >           Magic : a92b4efc
> > >         Version : 1.2
> > >     Feature Map : 0x0
> > >      Array UUID : 2b46b20b:80c18c76:bcd534b5:4d1372e4
> > >            Name : borg03b:3  (local to host borg03b)
> > >   Creation Time : Sat May 19 01:07:34 2012
> > >      Raid Level : raid10
> > >    Raid Devices : 5
> > > 
> > >  Avail Dev Size : 2930269954 (1397.26 GiB 1500.30 GB)
> > >      Array Size : 5860538368 (2794.52 GiB 3000.60 GB)
> > >   Used Dev Size : 2930269184 (1397.26 GiB 1500.30 GB)
> > >     Data Offset : 2048 sectors
> > >    Super Offset : 8 sectors
> > >           State : clean
> > >     Device UUID : fe922c1c:35319892:cc1e32e9:948d932c
> > > 
> > >     Update Time : Fri Jun 22 17:12:05 2012
> > >        Checksum : 27a61d9a - correct
> > >          Events : 90893
> > > 
> > >          Layout : near=2
> > >      Chunk Size : 512K
> > > 
> > >    Device Role : Active device 0
> > >    Array State : AAAAA ('A' == active, '.' == missing)
> > 
> > Thanks.
> > With this extra info - and the clearer perspective that morning
> > provides
> > - I see what is happening.
> >
> Ah, thank goodness for that. ^.^
>

The patch worked fine:
---
[  105.872117] md: recovery of RAID array md3
[28981.157157] md: md3: recovery done.
---
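
(Going by the timestamps, the rebuild took 28981 - 106 = 28875 seconds,
i.e. just over eight hours.)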

Thanks a bunch, and I'd suggest including this patch in any and all
feasible stable backports and future kernels, of course.

Regards,

Christian

> > The following kernel patch should make it work for you.  It was made
> > and tested against 3.4, but it should apply to your 3.2 kernel.
> > 
> > The problem only occurs when recovering the last device in certain
> > RAID10 arrays.  If you had > 2 copies (e.g. --layout=n3) it could be
> > more than just the last device.
> > 
> > RAID10 with an odd number of devices (5 in this case) lays out chunks
> > like this:
> > 
> >  A A B B C
> >  C D D E E
> >  F F G G H
> >  H I I J J
> > 
> > If you have an even number of stripes, everything is happy.
> > If you have an odd number of stripes - as is the case with your problem
> > array - then the last stripe might look like:
> > 
> >  F F G G H
> > 
> > The 'H' chunk only exists once.  There is no mirror for it.
> > md does not store any data in this chunk - the size of the array is
> > calculated to finish after 'G'.
> > However the recovery code isn't quite so careful.  It tries to recover
> > this chunk and loads it from beyond the end of the first device - which
> > is where it would be if the devices were all a bit bigger.
> > 
> That makes perfect sense; I'm just amazed to be the first one to
> encounter this. Granted, most people will have an even number of drives,
> given typical controller and server backplanes (1U -> 4x 3.5" drives),
> but the ability to use an odd number (and gain the additional speed
> another spindle adds) was always one of the nice points of the MD Raid10
> implementation.
> 
> > So there is no risk of data corruption here - just that md tries to
> > recover a block that isn't in the array, fails, and aborts the
> > recovery.
> >
> That's a relief!
>  
> > This patch gets it to complete the recovery earlier so that it doesn't
> > try (and fail) to do the impossible.
> > 
> > If you could test and confirm, I'd appreciate it.
> > 
> I've built a new kernel package (taking the opportunity to go to 3.2.20)
> and the matching drbd module, and scheduled downtime for tomorrow.
> 
> Should know if this fixes it by Wednesday.
> 
> Many thanks,
> 
> Christian
> 
> > Thanks,
> > NeilBrown
> > 
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index 99ae606..bcf6ea8 100644
> > --- a/drivers/md/raid10.c
> > +++ b/drivers/md/raid10.c
> > @@ -2890,6 +2890,12 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
> >  			/* want to reconstruct this device */
> >  			rb2 = r10_bio;
> >  			sect = raid10_find_virt(conf, sector_nr, i);
> > +			if (sect >= mddev->resync_max_sectors) {
> > +				/* last stripe is not complete - don't
> > +				 * try to recover this sector.
> > +				 */
> > +				continue;
> > +			}
> >  			/* Unless we are doing a full sync, or a replacement
> >  			 * we only need to recover the block if it is set in
> >  			 * the bitmap
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/


* Re: MD Raid10 recovery results in "attempt to access beyond end of device"
  2012-06-26 14:48         ` Christian Balzer
@ 2012-07-03  1:46           ` NeilBrown
  0 siblings, 0 replies; 8+ messages in thread
From: NeilBrown @ 2012-07-03  1:46 UTC (permalink / raw)
  To: Christian Balzer; +Cc: linux-raid


On Tue, 26 Jun 2012 23:48:45 +0900 Christian Balzer <chibi@gol.com> wrote:

> 

> The patch worked fine:
> ---
> [  105.872117] md: recovery of RAID array md3
> [28981.157157] md: md3: recovery done.
> ---
> 
> Thanks a bunch and I'd suggest to include this patch in any and all
> feasible backports and future kernels of course.
> 
> Regards,
> 
> Christian
> 

Thanks for the confirmation.  I've added your "tested-by" and will forward to
Linus and -stable later today.
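
For the record, the tag takes the usual kernel patch-trailer form:

---
Tested-by: Christian Balzer <chibi@gol.com>
---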

Thanks,
NeilBrown



end of thread, other threads:[~2012-07-03  1:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-22  7:06 MD Raid10 recovery results in "attempt to access beyond end of device" Christian Balzer
2012-06-22  8:07 ` NeilBrown
2012-06-22  8:42   ` Christian Balzer
2012-06-23  4:13     ` Christian Balzer
2012-06-25  4:07     ` NeilBrown
2012-06-25  6:06       ` Christian Balzer
2012-06-26 14:48         ` Christian Balzer
2012-07-03  1:46           ` NeilBrown
