* moving spares into group and checking spares
@ 2016-09-14  5:18 scar
  2016-09-14  9:29 ` Andreas Klauer
  2016-09-29  1:21 ` NeilBrown
  0 siblings, 2 replies; 14+ messages in thread
From: scar @ 2016-09-14  5:18 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA

I currently have four RAID-5 md arrays which I concatenated into one
logical volume (lvm2), essentially creating a RAID-50.  Each md array
was created with one spare disk.

Instead, I would like to move the four spare disks into one group that
each of the four arrays can have access to when needed.  I was wondering
how to safely accomplish this, preferably without unmounting/disrupting
the filesystem.

Secondly, I have the checkarray script scheduled via cron to
periodically check each of the four arrays.  I noticed in the output of
checkarray that it doesn't list the spare disk(s), so I'm guessing they
are not being checked?  I was wondering, then, how I could also check
the spare disks to make sure they are healthy and ready to be used if
needed?

Below is the output of /proc/mdstat and mdadm.conf.

Thanks
--

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md3 : active (auto-read-only) raid5 sdal1[0] sdaw1[11](S) sdav1[10] sdau1[9] sdat1[8] sdas1[7] sdar1[6] sdaq1[5] sdap1[4] sdao1[3] sdan1[2] sdam1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

md2 : active raid5 sdaa1[0] sdak1[11](S) sdz1[10] sdaj1[9] sdai1[8] sdah1[7] sdag1[6] sdaf1[5] sdae1[4] sdad1[3] sdac1[2] sdab1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md1 : active raid5 sdn1[0] sdy1[11](S) sdx1[10] sdw1[9] sdv1[8] sdu1[7] sdt1[6] sds1[5] sdr1[4] sdq1[3] sdp1[2] sdo1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md0 : active raid5 sdb1[0] sdm1[11](S) sdl1[10] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>

# cat /etc/mdadm/mdadm.conf
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md/0  metadata=1.2 UUID=6dd6eba5:50fd8c6d:33ad61ee:e84763a8 name=hind:0
    spares=1
ARRAY /dev/md/1  metadata=1.2 UUID=9336c73a:8b8993bf:ea6cfc3d:bf9f7441 name=hind:1
    spares=1
ARRAY /dev/md/2  metadata=1.2 UUID=817bf91c:4f14fcb0:9ba8b112:768321ee name=hind:2
    spares=1
ARRAY /dev/md/3  metadata=1.2 UUID=1251c6b7:36aca0eb:b66b4c8c:830793ad name=hind:3
    spares=1

#



* Re: moving spares into group and checking spares
  2016-09-14  5:18 moving spares into group and checking spares scar
@ 2016-09-14  9:29 ` Andreas Klauer
       [not found]   ` <20160914092959.GA3584-oubN3LzF/wf25t9ic+4fgA@public.gmane.org>
  2016-09-29  1:21 ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: Andreas Klauer @ 2016-09-14  9:29 UTC (permalink / raw)
  To: scar; +Cc: linux-raid

On Tue, Sep 13, 2016 at 10:18:41PM -0700, scar wrote:
> I currently have four RAID-5 md arrays which I concatenated into one
> logical volume (lvm2), essentially creating a RAID-50.  Each md array
> was created with one spare disk.

That's perfect for switching to RAID-6. More redundancy should be 
more useful than spares that only sync in when you already completely 
lost redundancy ...

And the disks would also be covered by your checks that way ;)

> I was wondering, then, how I could also check the spare disks to make
> sure they are healthy and ready to be used if needed?

Smartmontools and periodic selftests, for all disks, not just the spares.
You can use select,cont tests to check a small region each day,
covering the entire disk over $X days.
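
Roughly like this (untested; the device name and span size are just examples):

  smartctl -t select,0-99999999 /dev/sdX   # day 1: selective self-test over the first span
  smartctl -t select,cont /dev/sdX         # later days: redo an aborted span, else test the next one
  smartctl -l selective /dev/sdX           # check the selective self-test log / progress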

Regards
Andreas Klauer


* Re: moving spares into group and checking spares
       [not found]   ` <20160914092959.GA3584-oubN3LzF/wf25t9ic+4fgA@public.gmane.org>
@ 2016-09-14 17:52     ` scar
  2016-09-14 18:22       ` Roman Mamedov
  2016-09-14 18:35       ` Wols Lists
  0 siblings, 2 replies; 14+ messages in thread
From: scar @ 2016-09-14 17:52 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA

Andreas Klauer wrote on 09/14/2016 02:29 AM:
> On Tue, Sep 13, 2016 at 10:18:41PM -0700, scar wrote:
>> I currently have four RAID-5 md arrays which I concatenated into one
>> logical volume (lvm2), essentially creating a RAID-50.  Each md array
>> was created with one spare disk.
> That's perfect for switching to RAID-6. More redundancy should be
> more useful than spares that only sync in when you already completely
> lost redundancy ...
>
> And the disks would also be covered by your checks that way ;)

I'm not sure what you're suggesting: that the 4x 11+1 RAID5 arrays should be
changed to a 1x 46+2 RAID6 array?  That doesn't seem as safe to me, and
checkarray still isn't going to check the spare disks, just as it isn't
now.  Also, that would require me to backup/restore the data so I can
create a new array.




* Re: moving spares into group and checking spares
  2016-09-14 17:52     ` scar
@ 2016-09-14 18:22       ` Roman Mamedov
  2016-09-14 21:05         ` scar
  2016-09-14 18:35       ` Wols Lists
  1 sibling, 1 reply; 14+ messages in thread
From: Roman Mamedov @ 2016-09-14 18:22 UTC (permalink / raw)
  To: scar; +Cc: linux-raid


On Wed, 14 Sep 2016 10:52:55 -0700
scar <scar@drigon.com> wrote:

> I'm not sure what you're suggesting: that the 4x 11+1 RAID5 arrays should be
> changed to a 1x 46+2 RAID6 array?  That doesn't seem as safe to me

But you think an 11-member RAID5, let alone four of them joined by LVM is
safe? From a resiliency standpoint that setup is like insanity squared.

Considering that your expenses for redundancy are 8 disks at the moment, you
could go with 3x15-disk RAID6 with 2 shared hotspares, making overall
redundancy expense the same 8 disks -- but for a massively safer setup.

Also, don't plan on having anything survive the failure of any of the joined
arrays (expecting to recover data from an FS which suddenly lost a quarter of
itself should never be part of any plan), and for that reason there is no
point in using LVM concatenation; you might just as well join them using
mdadm RAID0 and at least gain the improved linear performance.
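
Roughly, if you were building it fresh (md4 is just an example name):

  mdadm --create /dev/md4 --level=0 --raid-devices=4 \
        /dev/md0 /dev/md1 /dev/md2 /dev/md3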

-- 
With respect,
Roman



* Re: moving spares into group and checking spares
  2016-09-14 17:52     ` scar
  2016-09-14 18:22       ` Roman Mamedov
@ 2016-09-14 18:35       ` Wols Lists
  1 sibling, 0 replies; 14+ messages in thread
From: Wols Lists @ 2016-09-14 18:35 UTC (permalink / raw)
  To: scar, linux-raid

On 14/09/16 18:52, scar wrote:
> I'm not sure what you're suggesting: that the 4x 11+1 RAID5 arrays should be
> changed to a 1x 46+2 RAID6 array?  That doesn't seem as safe to me, and
> checkarray still isn't going to check the spare disks, just as it isn't
> now.  Also, that would require me to backup/restore the data so I can
> create a new array.

No. The suggestion is to convert your 4x 11+1 raid5's to 4x 12 raid6's.
(or do you mean 11 drives plus 1 parity? If that's the case I mean 11
plus 2 parity)
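
Roughly, one array at a time, something like this (untested here; the existing
spare gets pulled in as the twelfth device, the backup file must live outside
the array, and the reshape will take a long while):

  mdadm --grow /dev/md0 --level=6 --raid-devices=12 \
        --backup-file=/root/md0-raid6.backup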

That way you're using all the drives, they all get tested, and if a
drive fails, you're left with a degraded raid6 aka raid5 aka redundant
array. With your current setup, if a drive fails you're left with a
degraded raid5 aka raid0 aka a "disaster in waiting".

And then you can add just the one spare disk to a spares group, so if
any drive does fail, it will get rebuilt straight away.

The only problem I can see (and I should warn you) is that there seems
to be a little "upgrading in place" problem at the moment. My gut
feeling is it's down to some interaction with systemd, so if you're not
running systemd I hope it won't bite ...

Cheers,
Wol


* Re: moving spares into group and checking spares
  2016-09-14 18:22       ` Roman Mamedov
@ 2016-09-14 21:05         ` scar
  2016-09-14 22:33           ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: scar @ 2016-09-14 21:05 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA

Roman Mamedov wrote on 09/14/2016 11:22 AM:
> But you think an 11-member RAID5, let alone four of them joined by LVM is
> safe? From a resiliency standpoint that setup is like insanity squared.

Yeah, it seems fine?  The disks are healthy and regularly checked; I'm just
wondering how to check the spares.  Use cron to schedule a weekly smartctl
long test?
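
Something like this in root's crontab, I mean (one line per disk; sdX is a
placeholder):

  0 3 * * 0  /usr/sbin/smartctl -t long /dev/sdX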


> Considering that your expenses for redundancy are 8 disks at the moment, you
> could go with 3x15-disk RAID6 with 2 shared hotspares, making overall
> redundancy expense the same 8 disks -- but for a massively safer setup.

Actually it would be 9 disks (3x15 + 2 = 47, not 48), but I'm OK with that.
However, rebuilding the arrays right now is not an option.

> might just as well join them using mdadm RAID0 and
> at least gain the improved linear performance.

I did want to do that, but debian-installer didn't seem to support it...




* Re: moving spares into group and checking spares
  2016-09-14 21:05         ` scar
@ 2016-09-14 22:33           ` Chris Murphy
       [not found]             ` <CAJCQCtQJwOTYsWubd0rV-6PRL4kmVRKLfLr3=7ZPr1Zb3SrwtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-09-14 22:33 UTC (permalink / raw)
  To: scar; +Cc: Linux-RAID

On Wed, Sep 14, 2016 at 3:05 PM, scar <scar@drigon.com> wrote:
> Roman Mamedov wrote on 09/14/2016 11:22 AM:
>>
>> But you think an 11-member RAID5, let alone four of them joined by LVM is
>> safe? From a resiliency standpoint that setup is like insanity squared.
>
>
> Yeah, it seems fine?  The disks are healthy and regularly checked; I'm just
> wondering how to check the spares.  Use cron to schedule a weekly smartctl
> long test?

That you're asking that question now makes me wonder whether you've made
certain the SCT ERC value is less than the SCSI command timer value. If
that's not true, I give you 1 in 3 odds of complete array collapse
following a single drive failure if they are big drives, and 1 in 4
odds if they're just 2T or less. So you need to make certain, because
it's not the default configuration unless you have NAS or enterprise
drives across the board with properly preconfigured SCT ERC out of the
box.
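
Quick way to check (sdX is a placeholder, one per member disk; the 180 second
timer is the usual workaround for drives with no SCT ERC support):

  smartctl -l scterc /dev/sdX                # show the drive's current SCT ERC setting
  smartctl -l scterc,70,70 /dev/sdX          # set read/write ERC to 7.0 seconds, if supported
  cat /sys/block/sdX/device/timeout          # kernel SCSI command timer, in seconds
  echo 180 > /sys/block/sdX/device/timeout   # raise the timer instead, for drives without SCT ERC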



-- 
Chris Murphy


* Re: moving spares into group and checking spares
       [not found]             ` <CAJCQCtQJwOTYsWubd0rV-6PRL4kmVRKLfLr3=7ZPr1Zb3SrwtQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-14 22:59               ` scar
  2016-09-14 23:15                 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: scar @ 2016-09-14 22:59 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA

Chris Murphy wrote on 09/14/2016 03:33 PM:
> SCT ERC value is less than SCSI command timer value?


They are 1TB Hitachi HUA7210SASUN drives in a Sun Fire X4540, with an SCT ERC
value of 255 (25.5 seconds), and /sys/block/sdX/device/timeout is set to 30.




* Re: moving spares into group and checking spares
  2016-09-14 22:59               ` scar
@ 2016-09-14 23:15                 ` Chris Murphy
       [not found]                   ` <CAJCQCtQ1GHdQtShbW3U1o-fmXhT3fHrC7S9BxsxrrOG=0H_p3A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-09-14 23:15 UTC (permalink / raw)
  To: scar; +Cc: Linux-RAID

On Wed, Sep 14, 2016 at 4:59 PM, scar <scar@drigon.com> wrote:
> Chris Murphy wrote on 09/14/2016 03:33 PM:
>>
>> SCT ERC value is less than SCSI command timer value?
>
>
>
> They are 1TB Hitachi HUA7210SASUN drives in a Sun Fire X4540, with an SCT ERC
> value of 255 (25.5 seconds), and /sys/block/sdX/device/timeout is set to 30.

Interesting choice; I haven't seen it cut that closely before, but it
ought to work. The value I most often see is 70 deciseconds, which for
sure will fail (if it's going to fail at all) well before the command
timer gives up.


-- 
Chris Murphy


* Re: moving spares into group and checking spares
       [not found]                   ` <CAJCQCtQ1GHdQtShbW3U1o-fmXhT3fHrC7S9BxsxrrOG=0H_p3A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-15  3:42                     ` scar
  2016-09-15  3:51                       ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: scar @ 2016-09-15  3:42 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA

Chris Murphy wrote on 09/14/2016 04:15 PM:
> it ought to work.

I had this happen the other day:

Sep 12 05:57:01 hind kernel: [1342342.663516] RAID conf printout:
Sep 12 05:57:01 hind kernel: [1342342.663528]  --- level:5 rd:11 wd:11
Sep 12 05:57:01 hind kernel: [1342342.663535]  disk 0, o:1, dev:sdb1
Sep 12 05:57:01 hind kernel: [1342342.663540]  disk 1, o:1, dev:sdc1
Sep 12 05:57:01 hind kernel: [1342342.663544]  disk 2, o:1, dev:sdd1
Sep 12 05:57:01 hind kernel: [1342342.663548]  disk 3, o:1, dev:sde1
Sep 12 05:57:01 hind kernel: [1342342.663552]  disk 4, o:1, dev:sdf1
Sep 12 05:57:01 hind kernel: [1342342.663556]  disk 5, o:1, dev:sdg1
Sep 12 05:57:01 hind kernel: [1342342.663560]  disk 6, o:1, dev:sdh1
Sep 12 05:57:01 hind kernel: [1342342.663564]  disk 7, o:1, dev:sdi1
Sep 12 05:57:01 hind kernel: [1342342.663568]  disk 8, o:1, dev:sdj1
Sep 12 05:57:01 hind kernel: [1342342.663572]  disk 9, o:1, dev:sdk1
Sep 12 05:57:01 hind kernel: [1342342.663576]  disk 10, o:1, dev:sdl1
Sep 12 05:57:01 hind kernel: [1342342.663735] md: data-check of RAID 
array md0
Sep 12 05:57:01 hind kernel: [1342342.663751] md: minimum _guaranteed_ 
speed: 1000 KB/sec/disk.
Sep 12 05:57:01 hind kernel: [1342342.663757] md: using maximum 
available idle IO bandwidth (but not more than 200000 KB/sec) for 
data-check.
Sep 12 05:57:01 hind kernel: [1342342.663793] md: using 128k window, 
over a total of 976629760k.
Sep 12 08:05:29 hind kernel: [1350045.526546] mptbase: ioc1: 
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands 
After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:29 hind kernel: [1350045.527460] sd 3:0:1:0: [sdk] 
Unhandled sense code
Sep 12 08:05:29 hind kernel: [1350045.527478] sd 3:0:1:0: [sdk]
Sep 12 08:05:29 hind kernel: [1350045.527485] Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
Sep 12 08:05:29 hind kernel: [1350045.527492] sd 3:0:1:0: [sdk]
Sep 12 08:05:29 hind kernel: [1350045.527499] Sense Key : Medium Error 
[current]
Sep 12 08:05:29 hind kernel: [1350045.527512] Info fld=0x23abbe32
Sep 12 08:05:29 hind kernel: [1350045.527518] sd 3:0:1:0: [sdk]
Sep 12 08:05:29 hind kernel: [1350045.527525] Add. Sense: Unrecovered 
read error
Sep 12 08:05:29 hind kernel: [1350045.527532] sd 3:0:1:0: [sdk] CDB:
Sep 12 08:05:29 hind kernel: [1350045.527537] Read(10): 28 00 23 ab bc 
c8 00 04 00 00
Sep 12 08:05:29 hind kernel: [1350045.527554] end_request: critical 
medium error, dev sdk, sector 598457896
Sep 12 08:05:34 hind kernel: [1350051.022308] mptbase: ioc1: 
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands 
After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.022524] mptbase: ioc1: 
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands 
After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.022745] mptbase: ioc1: 
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands 
After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.022969] mptbase: ioc1: 
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands 
After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.023190] mptbase: ioc1: 
LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands 
After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.023743] sd 3:0:1:0: [sdk] 
Unhandled sense code
Sep 12 08:05:34 hind kernel: [1350051.023772] sd 3:0:1:0: [sdk]
Sep 12 08:05:34 hind kernel: [1350051.023779] Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
Sep 12 08:05:34 hind kernel: [1350051.023786] sd 3:0:1:0: [sdk]
Sep 12 08:05:34 hind kernel: [1350051.023792] Sense Key : Medium Error 
[current]
Sep 12 08:05:34 hind kernel: [1350051.023806] Info fld=0x23abbe32
Sep 12 08:05:34 hind kernel: [1350051.023813] sd 3:0:1:0: [sdk]
Sep 12 08:05:34 hind kernel: [1350051.023819] Add. Sense: Unrecovered 
read error
Sep 12 08:05:34 hind kernel: [1350051.023826] sd 3:0:1:0: [sdk] CDB:
Sep 12 08:05:34 hind kernel: [1350051.023830] Read(10): 28 00 23 ab be 
28 00 00 80 00
Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical 
medium error, dev sdk, sector 598457896
Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error 
corrected (8 sectors at 598455848 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error 
corrected (8 sectors at 598455856 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error 
corrected (8 sectors at 598455864 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error 
corrected (8 sectors at 598455872 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error 
corrected (8 sectors at 598455880 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error 
corrected (8 sectors at 598455888 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error 
corrected (8 sectors at 598455896 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error 
corrected (8 sectors at 598455904 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error 
corrected (8 sectors at 598455912 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error 
corrected (8 sectors at 598455920 on sdk1)
Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.


So it seems to be working?  Although the sector reported by libata is
different from what md corrected.



* Re: moving spares into group and checking spares
  2016-09-15  3:42                     ` scar
@ 2016-09-15  3:51                       ` Chris Murphy
       [not found]                         ` <CAJCQCtRW9REafzmyk+W6aVxqxMUWCdXNkrST1e7udrH-zp26Uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2016-09-15  3:51 UTC (permalink / raw)
  To: scar; +Cc: Linux-RAID

On Wed, Sep 14, 2016 at 9:42 PM, scar <scar@drigon.com> wrote:

> 00 80 00
> Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium
> error, dev sdk, sector 598457896
> Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error
> corrected (8 sectors at 598455848 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error
> corrected (8 sectors at 598455856 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error
> corrected (8 sectors at 598455864 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error
> corrected (8 sectors at 598455872 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error
> corrected (8 sectors at 598455880 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error
> corrected (8 sectors at 598455888 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error
> corrected (8 sectors at 598455896 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error
> corrected (8 sectors at 598455904 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error
> corrected (8 sectors at 598455912 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error
> corrected (8 sectors at 598455920 on sdk1)
> Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.
>
>
> So it seems to be working?  Although the sector reported by libata is
> different from what md corrected.

Looks like it replaced the entire chunk that includes the bad sector.
Is the chunk size 32KiB?


-- 
Chris Murphy


* Re: moving spares into group and checking spares
       [not found]                         ` <CAJCQCtRW9REafzmyk+W6aVxqxMUWCdXNkrST1e7udrH-zp26Uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-28  0:19                           ` scar
  2016-09-29  1:17                             ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: scar @ 2016-09-28  0:19 UTC (permalink / raw)
  To: linux-raid-u79uwXL29TY76Z2rM5mHXA

Chris Murphy wrote on 09/14/2016 08:51 PM:
> On Wed, Sep 14, 2016 at 9:42 PM, scar <scar-47zfDnpWZoPQT0dZR+AlfA@public.gmane.org> wrote:
>
>> 00 80 00
>> Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium
>> error, dev sdk, sector 598457896
>> Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error
>> corrected (8 sectors at 598455848 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error
>> corrected (8 sectors at 598455856 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error
>> corrected (8 sectors at 598455864 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error
>> corrected (8 sectors at 598455872 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error
>> corrected (8 sectors at 598455880 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error
>> corrected (8 sectors at 598455888 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error
>> corrected (8 sectors at 598455896 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error
>> corrected (8 sectors at 598455904 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error
>> corrected (8 sectors at 598455912 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error
>> corrected (8 sectors at 598455920 on sdk1)
>> Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.
>>
>>
>> So it seems to be working?  Although the sector reported by libata is
>> different from what md corrected.
>
> Looks like it replaced the entire chunk that includes the bad sector.
> Is the chunk size 32KiB?
>

No, it is 512K, and the sector reported by libata is almost 2000 sectors
different from what md reported.




* Re: moving spares into group and checking spares
  2016-09-28  0:19                           ` scar
@ 2016-09-29  1:17                             ` NeilBrown
  0 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2016-09-29  1:17 UTC (permalink / raw)
  To: scar, linux-raid


On Wed, Sep 28 2016, scar wrote:

> Chris Murphy wrote on 09/14/2016 08:51 PM:
>> On Wed, Sep 14, 2016 at 9:42 PM, scar <scar@drigon.com> wrote:
>>
>>> 00 80 00
>>> Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium
>>> error, dev sdk, sector 598457896
>>> Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error
>>> corrected (8 sectors at 598455848 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error
>>> corrected (8 sectors at 598455856 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error
>>> corrected (8 sectors at 598455864 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error
>>> corrected (8 sectors at 598455872 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error
>>> corrected (8 sectors at 598455880 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error
>>> corrected (8 sectors at 598455888 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error
>>> corrected (8 sectors at 598455896 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error
>>> corrected (8 sectors at 598455904 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error
>>> corrected (8 sectors at 598455912 on sdk1)
>>> Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error
>>> corrected (8 sectors at 598455920 on sdk1)
>>> Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.
>>>
>>>
>>> So it seems to be working?  Although the sector reported by libata is
>>> different from what md corrected.
>>
>> Looks like it replaced the entire chunk that includes the bad sector.
>> Is the chunk size 32KiB?
>>
>
> No, it is 512K, and the sector reported by libata is almost 2000 sectors
> different from what md reported.

The difference is probably the Data Offset.
libata reports a sector in the device.
md reports a sector in the data section of the device.

NeilBrown



* Re: moving spares into group and checking spares
  2016-09-14  5:18 moving spares into group and checking spares scar
  2016-09-14  9:29 ` Andreas Klauer
@ 2016-09-29  1:21 ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2016-09-29  1:21 UTC (permalink / raw)
  To: scar, linux-raid


On Wed, Sep 14 2016, scar wrote:

> I currently have four RAID-5 md arrays which I concatenated into one
> logical volume (lvm2), essentially creating a RAID-50.  Each md array
> was created with one spare disk.
>
> Instead, I would like to move the four spare disks into one group that
> each of the four arrays can have access to when needed.  I was wondering
> how to safely accomplish this, preferably without unmounting/disrupting
> the filesystem.

You don't need to move the devices.  Just use the spare-group= setting
in mdadm.conf to tell mdadm that all arrays should be considered part of
the same spare-group.  Then "mdadm --monitor" will freely move spares
between arrays as needed.
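
For example, something like this (reusing the UUIDs from your mdadm.conf;
the group name is arbitrary):

  ARRAY /dev/md/0  metadata=1.2 UUID=6dd6eba5:50fd8c6d:33ad61ee:e84763a8 name=hind:0 spare-group=hind
  ARRAY /dev/md/1  metadata=1.2 UUID=9336c73a:8b8993bf:ea6cfc3d:bf9f7441 name=hind:1 spare-group=hind
  ARRAY /dev/md/2  metadata=1.2 UUID=817bf91c:4f14fcb0:9ba8b112:768321ee name=hind:2 spare-group=hind
  ARRAY /dev/md/3  metadata=1.2 UUID=1251c6b7:36aca0eb:b66b4c8c:830793ad name=hind:3 spare-group=hind

and make sure "mdadm --monitor --scan" (the mdadm/mdmonitor daemon) is
actually running; it is the piece that moves a spare across arrays when
one of them degrades.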

>
> Secondly, I have the checkarray script scheduled via cron to
> periodically check each of the four arrays.  I noticed in the output of
> checkarray that it doesn't list the spare disk(s), so I'm guessing they
> are not being checked?  I was wondering, then, how I could also check
> the spare disks to make sure they are healthy and ready to be used if
> needed?

mdadm doesn't provide any automatic support for this.
You would need to e.g.
 - remove a spare from an array
 - test it in some way: e.g. use dd to read every block.
 - if satisfied, add it back to the array.
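
For example (md0 and its spare sdm1 from your mdstat; the dd read is just
the simplest whole-device test):

  mdadm /dev/md0 --remove /dev/sdm1    # detach the spare from the array
  dd if=/dev/sdm1 of=/dev/null bs=1M   # read every block; any I/O error will show up here
  mdadm /dev/md0 --add /dev/sdm1       # if it read cleanly, return it as a spare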

NeilBrown


