* moving spares into group and checking spares
@ 2016-09-14  5:18 scar
  2016-09-14  9:29 ` Andreas Klauer
  2016-09-29  1:21 ` NeilBrown
  0 siblings, 2 replies; 14+ messages in thread

From: scar @ 2016-09-14  5:18 UTC (permalink / raw)
  To: linux-raid

i currently have four RAID-5 md arrays which i concatenated into one
logical volume (lvm2), essentially creating a RAID-50.  each md array
was created with one spare disk.

instead, i would like to move the four spare disks into one group that
each of the four arrays can have access to when needed.  i was wondering
how to safely accomplish this, preferably without unmounting/disrupting
the filesystem.

secondly, i have the checkarray script scheduled via cron to
periodically check each of the four arrays.  i noticed in the output of
checkarray that it doesn't list the spare disk(s).  so i'm guessing they
are not being checked?  i was wondering, then, how i could also check
the spare disks to make sure they are healthy and ready to be used if
needed?
below is output of /proc/mdstat and mdadm.conf.  thanks

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md3 : active (auto-read-only) raid5 sdal1[0] sdaw1[11](S) sdav1[10] sdau1[9] sdat1[8] sdas1[7] sdar1[6] sdaq1[5] sdap1[4] sdao1[3] sdan1[2] sdam1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

md2 : active raid5 sdaa1[0] sdak1[11](S) sdz1[10] sdaj1[9] sdai1[8] sdah1[7] sdag1[6] sdaf1[5] sdae1[4] sdad1[3] sdac1[2] sdab1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md1 : active raid5 sdn1[0] sdy1[11](S) sdx1[10] sdw1[9] sdv1[8] sdu1[7] sdt1[6] sds1[5] sdr1[4] sdq1[3] sdp1[2] sdo1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

md0 : active raid5 sdb1[0] sdm1[11](S) sdl1[10] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
      9766297600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>

# cat /etc/mdadm/mdadm.conf
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md/0 metadata=1.2 UUID=6dd6eba5:50fd8c6d:33ad61ee:e84763a8 name=hind:0 spares=1
ARRAY /dev/md/1 metadata=1.2 UUID=9336c73a:8b8993bf:ea6cfc3d:bf9f7441 name=hind:1 spares=1
ARRAY /dev/md/2 metadata=1.2 UUID=817bf91c:4f14fcb0:9ba8b112:768321ee name=hind:2 spares=1
ARRAY /dev/md/3 metadata=1.2 UUID=1251c6b7:36aca0eb:b66b4c8c:830793ad name=hind:3 spares=1
#

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: moving spares into group and checking spares
  2016-09-14  5:18 moving spares into group and checking spares scar
@ 2016-09-14  9:29 ` Andreas Klauer
  2016-09-29  1:21 ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread

From: Andreas Klauer @ 2016-09-14  9:29 UTC (permalink / raw)
  To: scar; +Cc: linux-raid

On Tue, Sep 13, 2016 at 10:18:41PM -0700, scar wrote:
> i currently have four RAID-5 md arrays which i concatenated into one
> logical volume (lvm2), essentially creating a RAID-50.  each md array
> was created with one spare disk.

That's perfect for switching to RAID-6.  More redundancy should be
more useful than spares that only sync in when you already completely
lost redundancy ...

And the disks would also be covered by your checks that way ;)

> i was wondering, then, how i could also check the spare disks to make
> sure they are healthy and ready to be used if needed?

smartmontools, periodic selftests, for all disks, not just the spares.
you can use select,cont tests to check a small region each day,
covering the entire disk over $X days.

Regards
Andreas Klauer
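The select,cont idea above can be scripted; here is a rough sketch that splits a disk into daily selective-self-test spans (the device name, sector count, and 30-day schedule are illustrative, not taken from this thread — on a real disk you would get the size from `blockdev --getsz` and feed the slice number from cron).  It prints the smartctl command rather than running it, so it is safe to try anywhere:

```shell
#!/bin/sh
# Split a disk into DAYS selective-self-test spans so the whole surface
# gets read over DAYS days.  Prints the command for one slice.
DEV=/dev/sdb            # example device (assumption, not from this thread)
SECTORS=1953525168      # example: ~1 TB of 512-byte sectors
DAYS=30
SPAN=$(( (SECTORS + DAYS - 1) / DAYS ))
day=${1:-0}             # which slice to test (0..DAYS-1), e.g. day-of-month
START=$(( day * SPAN ))
END=$(( START + SPAN - 1 ))
[ "$END" -ge "$SECTORS" ] && END=$(( SECTORS - 1 ))
echo "smartctl -t select,$START-$END $DEV"
```

Run daily, the spans tile the disk; `smartctl -l selective` shows progress of the current span.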
* Re: moving spares into group and checking spares
@ 2016-09-14 17:52 ` scar
  2016-09-14 18:22   ` Roman Mamedov
  2016-09-14 18:35   ` Wols Lists
  0 siblings, 2 replies; 14+ messages in thread

From: scar @ 2016-09-14 17:52 UTC (permalink / raw)
  To: linux-raid

Andreas Klauer wrote on 09/14/2016 02:29 AM:
> On Tue, Sep 13, 2016 at 10:18:41PM -0700, scar wrote:
>> i currently have four RAID-5 md arrays which i concatenated into one
>> logical volume (lvm2), essentially creating a RAID-50.  each md array
>> was created with one spare disk.
>
> That's perfect for switching to RAID-6.  More redundancy should be
> more useful than spares that only sync in when you already completely
> lost redundancy ...
>
> And the disks would also be covered by your checks that way ;)

i'm not sure what you're suggesting, that 4x 11+1 RAID5 arrays should
be changed to 1x 46+2 RAID6 array?  that doesn't seem as safe to me.
and checkarray isn't going to check the spare disks just as it's not
doing now....  also that would require me to backup/restore the data
so i can create a new array
* Re: moving spares into group and checking spares
  2016-09-14 17:52 ` scar
@ 2016-09-14 18:22 ` Roman Mamedov
  2016-09-14 21:05   ` scar
  1 sibling, 1 reply; 14+ messages in thread

From: Roman Mamedov @ 2016-09-14 18:22 UTC (permalink / raw)
  To: scar; +Cc: linux-raid

On Wed, 14 Sep 2016 10:52:55 -0700
scar <scar@drigon.com> wrote:

> i'm not sure what you're suggesting, that 4x 11+1 RAID5 arrays should be
> changed to 1x 46+2 RAID6 array?  that doesn't seem as safe to me

But you think an 11-member RAID5, let alone four of them joined by LVM,
is safe?  From a resiliency standpoint that setup is like insanity
squared.

Considering that your expense for redundancy is 8 disks at the moment,
you could go with 3x 15-disk RAID6 with 2 shared hot spares, making the
overall redundancy expense the same 8 disks -- but for a massively
safer setup.

Also, don't plan on having anything survive the failure of any one of
the joined arrays (expecting to recover data from an FS which suddenly
lost a third of itself should never be part of any plan), and for that
reason there is no point in using LVM concatenation; you might just as
well join them using mdadm RAID0 and at least gain the improved linear
performance.

--
With respect,
Roman
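For reference, the mdadm-RAID0-over-md layering suggested above would be a single create over the four member arrays.  A sketch that only prints the command (device names are the ones from this thread, but this is illustrative -- it would have to be done at setup time, before any filesystem exists on top):

```shell
#!/bin/sh
# Stripe the four RAID5 arrays with md instead of concatenating via LVM.
# Printed, not executed, since it is destructive to existing data.
CMD="mdadm --create /dev/md4 --level=0 --raid-devices=4 /dev/md0 /dev/md1 /dev/md2 /dev/md3"
echo "$CMD"
```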
* Re: moving spares into group and checking spares
  2016-09-14 18:22 ` Roman Mamedov
@ 2016-09-14 21:05 ` scar
  2016-09-14 22:33   ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread

From: scar @ 2016-09-14 21:05 UTC (permalink / raw)
  To: linux-raid

Roman Mamedov wrote on 09/14/2016 11:22 AM:
> But you think an 11-member RAID5, let alone four of them joined by LVM is
> safe?  From a resiliency standpoint that setup is like insanity squared.

yeah it seems fine?  disks are healthy and regularly checked, just
wondering how to check the spares.  use cron to schedule weekly
smartctl long test?

> Considering that your expenses for redundancy are 8 disks at the moment, you
> could go with 3x15-disk RAID6 with 2 shared hotspares, making overall
> redundancy expense the same 8 disks -- but for a massively safer setup.

actually it would be 9 disks (3x15 + 2 = 47, not 48) but i'm ok with
that.  but rebuilding the array right now is not an option

> might just as well join them using mdadm RAID0 and
> at least gain the improved linear performance.

i did want to do that but debian-installer didn't seem to support it...
* Re: moving spares into group and checking spares
  2016-09-14 21:05 ` scar
@ 2016-09-14 22:33 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread

From: Chris Murphy @ 2016-09-14 22:33 UTC (permalink / raw)
  To: scar; +Cc: Linux-RAID

On Wed, Sep 14, 2016 at 3:05 PM, scar <scar@drigon.com> wrote:
> Roman Mamedov wrote on 09/14/2016 11:22 AM:
>>
>> But you think an 11-member RAID5, let alone four of them joined by LVM is
>> safe?  From a resiliency standpoint that setup is like insanity squared.
>
> yeah it seems fine?  disks are healthy and regularly checked, just
> wondering how to check the spares.  use cron to schedule weekly
> smartctl long test?

That you're asking that question now makes me wonder if you've made
certain the SCT ERC value is less than the SCSI command timer value?
If that's not true, I give you 1 in 3 chances of complete array
collapse following a single drive failure if they are big drives, and
1 in 4 odds if they're just 2T or less.  So you need to make certain,
because it's not the default configuration unless you have NAS or
enterprise drives across the board with properly preconfigured SCT ERC
out of the box.

--
Chris Murphy
* Re: moving spares into group and checking spares
  2016-09-14 22:33 ` Chris Murphy
@ 2016-09-14 22:59 ` scar
  2016-09-14 23:15   ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread

From: scar @ 2016-09-14 22:59 UTC (permalink / raw)
  To: linux-raid

Chris Murphy wrote on 09/14/2016 03:33 PM:
> SCT ERC value is less than SCSI command timer value?

they are 1TB Hitachi HUA7210SASUN drives in Sun Fire X4540 with SCT ERC
value of 255 (25.5 seconds) and /sys/block/sdX/device/timeout is set
to 30
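The relationship being checked here can be stated mechanically.  A sketch with the two values from this message hard-coded (on a live system they would come from `smartctl -l scterc /dev/sdX` and `cat /sys/block/sdX/device/timeout`; note the unit mismatch — ERC is reported in deciseconds, the command timer in seconds):

```shell
#!/bin/sh
# The drive must give up on error recovery (SCT ERC) before the kernel's
# SCSI command timer expires and resets the drive.
ERC_DS=255      # deciseconds = 25.5 s, as reported for these drives
TIMEOUT_S=30    # /sys/block/sdX/device/timeout, in seconds
if [ "$ERC_DS" -lt $(( TIMEOUT_S * 10 )) ]; then
    VERDICT="ok: drive reports the error before the kernel gives up"
else
    VERDICT="bad: kernel may reset the drive mid-recovery"
fi
echo "$VERDICT"
```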
* Re: moving spares into group and checking spares
  2016-09-14 22:59 ` scar
@ 2016-09-14 23:15 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread

From: Chris Murphy @ 2016-09-14 23:15 UTC (permalink / raw)
  To: scar; +Cc: Linux-RAID

On Wed, Sep 14, 2016 at 4:59 PM, scar <scar@drigon.com> wrote:
> Chris Murphy wrote on 09/14/2016 03:33 PM:
>>
>> SCT ERC value is less than SCSI command timer value?
>
> they are 1TB Hitachi HUA7210SASUN drives in Sun Fire X4540 with SCT ERC
> value of 255 (25.5 seconds) and /sys/block/sdX/device/timeout is set to 30

Interesting choice, I haven't seen it cut that closely before, but it
ought to work.  The value I most often see is 70 deciseconds, which for
sure will fail, if it's going to, well before the command timer gives
up.

--
Chris Murphy
* Re: moving spares into group and checking spares
@ 2016-09-15  3:42 ` scar
  2016-09-15  3:51 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread

From: scar @ 2016-09-15  3:42 UTC (permalink / raw)
  To: linux-raid

Chris Murphy wrote on 09/14/2016 04:15 PM:
> it ought to work.

i had this happen the other day:

Sep 12 05:57:01 hind kernel: [1342342.663516] RAID conf printout:
Sep 12 05:57:01 hind kernel: [1342342.663528]  --- level:5 rd:11 wd:11
Sep 12 05:57:01 hind kernel: [1342342.663535]  disk 0, o:1, dev:sdb1
Sep 12 05:57:01 hind kernel: [1342342.663540]  disk 1, o:1, dev:sdc1
Sep 12 05:57:01 hind kernel: [1342342.663544]  disk 2, o:1, dev:sdd1
Sep 12 05:57:01 hind kernel: [1342342.663548]  disk 3, o:1, dev:sde1
Sep 12 05:57:01 hind kernel: [1342342.663552]  disk 4, o:1, dev:sdf1
Sep 12 05:57:01 hind kernel: [1342342.663556]  disk 5, o:1, dev:sdg1
Sep 12 05:57:01 hind kernel: [1342342.663560]  disk 6, o:1, dev:sdh1
Sep 12 05:57:01 hind kernel: [1342342.663564]  disk 7, o:1, dev:sdi1
Sep 12 05:57:01 hind kernel: [1342342.663568]  disk 8, o:1, dev:sdj1
Sep 12 05:57:01 hind kernel: [1342342.663572]  disk 9, o:1, dev:sdk1
Sep 12 05:57:01 hind kernel: [1342342.663576]  disk 10, o:1, dev:sdl1
Sep 12 05:57:01 hind kernel: [1342342.663735] md: data-check of RAID array md0
Sep 12 05:57:01 hind kernel: [1342342.663751] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Sep 12 05:57:01 hind kernel: [1342342.663757] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Sep 12 05:57:01 hind kernel: [1342342.663793] md: using 128k window, over a total of 976629760k.
Sep 12 08:05:29 hind kernel: [1350045.526546] mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:29 hind kernel: [1350045.527460] sd 3:0:1:0: [sdk] Unhandled sense code
Sep 12 08:05:29 hind kernel: [1350045.527478] sd 3:0:1:0: [sdk]
Sep 12 08:05:29 hind kernel: [1350045.527485] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 12 08:05:29 hind kernel: [1350045.527492] sd 3:0:1:0: [sdk]
Sep 12 08:05:29 hind kernel: [1350045.527499] Sense Key : Medium Error [current]
Sep 12 08:05:29 hind kernel: [1350045.527512] Info fld=0x23abbe32
Sep 12 08:05:29 hind kernel: [1350045.527518] sd 3:0:1:0: [sdk]
Sep 12 08:05:29 hind kernel: [1350045.527525] Add. Sense: Unrecovered read error
Sep 12 08:05:29 hind kernel: [1350045.527532] sd 3:0:1:0: [sdk] CDB:
Sep 12 08:05:29 hind kernel: [1350045.527537] Read(10): 28 00 23 ab bc c8 00 04 00 00
Sep 12 08:05:29 hind kernel: [1350045.527554] end_request: critical medium error, dev sdk, sector 598457896
Sep 12 08:05:34 hind kernel: [1350051.022308] mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.022524] mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.022745] mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.022969] mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.023190] mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Sep 12 08:05:34 hind kernel: [1350051.023743] sd 3:0:1:0: [sdk] Unhandled sense code
Sep 12 08:05:34 hind kernel: [1350051.023772] sd 3:0:1:0: [sdk]
Sep 12 08:05:34 hind kernel: [1350051.023779] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 12 08:05:34 hind kernel: [1350051.023786] sd 3:0:1:0: [sdk]
Sep 12 08:05:34 hind kernel: [1350051.023792] Sense Key : Medium Error [current]
Sep 12 08:05:34 hind kernel: [1350051.023806] Info fld=0x23abbe32
Sep 12 08:05:34 hind kernel: [1350051.023813] sd 3:0:1:0: [sdk]
Sep 12 08:05:34 hind kernel: [1350051.023819] Add. Sense: Unrecovered read error
Sep 12 08:05:34 hind kernel: [1350051.023826] sd 3:0:1:0: [sdk] CDB:
Sep 12 08:05:34 hind kernel: [1350051.023830] Read(10): 28 00 23 ab be 28 00 00 80 00
Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium error, dev sdk, sector 598457896
Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error corrected (8 sectors at 598455848 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error corrected (8 sectors at 598455856 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error corrected (8 sectors at 598455864 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error corrected (8 sectors at 598455872 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error corrected (8 sectors at 598455880 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error corrected (8 sectors at 598455888 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error corrected (8 sectors at 598455896 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error corrected (8 sectors at 598455904 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error corrected (8 sectors at 598455912 on sdk1)
Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error corrected (8 sectors at 598455920 on sdk1)
Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.

so it seems to be working?  although the sector reported by libata is
different than what md corrected
* Re: moving spares into group and checking spares
  2016-09-15  3:42 ` scar
@ 2016-09-15  3:51 ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread

From: Chris Murphy @ 2016-09-15  3:51 UTC (permalink / raw)
  To: scar; +Cc: Linux-RAID

On Wed, Sep 14, 2016 at 9:42 PM, scar <scar@drigon.com> wrote:
> Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium
> error, dev sdk, sector 598457896
> Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error
> corrected (8 sectors at 598455848 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error
> corrected (8 sectors at 598455856 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error
> corrected (8 sectors at 598455864 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error
> corrected (8 sectors at 598455872 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error
> corrected (8 sectors at 598455880 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error
> corrected (8 sectors at 598455888 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error
> corrected (8 sectors at 598455896 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error
> corrected (8 sectors at 598455904 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error
> corrected (8 sectors at 598455912 on sdk1)
> Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error
> corrected (8 sectors at 598455920 on sdk1)
> Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.
>
> so it seems to be working?  although the sector reported by libata is
> different than what md corrected

Looks like it replaced the entire chunk that includes the bad sector.
Is the chunk size 32KiB?

--
Chris Murphy
* Re: moving spares into group and checking spares
@ 2016-09-28  0:19 ` scar
  2016-09-29  1:17 ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread

From: scar @ 2016-09-28  0:19 UTC (permalink / raw)
  To: linux-raid

Chris Murphy wrote on 09/14/2016 08:51 PM:
> On Wed, Sep 14, 2016 at 9:42 PM, scar <scar@drigon.com> wrote:
>
>> Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium
>> error, dev sdk, sector 598457896
>> Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error
>> corrected (8 sectors at 598455848 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385826] md/raid:md0: read error
>> corrected (8 sectors at 598455856 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385834] md/raid:md0: read error
>> corrected (8 sectors at 598455864 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385840] md/raid:md0: read error
>> corrected (8 sectors at 598455872 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385847] md/raid:md0: read error
>> corrected (8 sectors at 598455880 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385853] md/raid:md0: read error
>> corrected (8 sectors at 598455888 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385859] md/raid:md0: read error
>> corrected (8 sectors at 598455896 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385865] md/raid:md0: read error
>> corrected (8 sectors at 598455904 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385873] md/raid:md0: read error
>> corrected (8 sectors at 598455912 on sdk1)
>> Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error
>> corrected (8 sectors at 598455920 on sdk1)
>> Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.
>>
>> so it seems to be working?  although the sector reported by libata is
>> different than what md corrected
>
> Looks like it replaced the entire chunk that includes the bad sector.
> Is the chunk size 32KiB?

No it is 512K, and the sector reported by libata is almost 2000 sectors
different than what mdadm reported
* Re: moving spares into group and checking spares
  2016-09-28  0:19 ` scar
@ 2016-09-29  1:17 ` NeilBrown
  0 siblings, 0 replies; 14+ messages in thread

From: NeilBrown @ 2016-09-29  1:17 UTC (permalink / raw)
  To: scar, linux-raid

On Wed, Sep 28 2016, scar wrote:
> Chris Murphy wrote on 09/14/2016 08:51 PM:
>> On Wed, Sep 14, 2016 at 9:42 PM, scar <scar@drigon.com> wrote:
>>
>>> Sep 12 08:05:34 hind kernel: [1350051.023847] end_request: critical medium
>>> error, dev sdk, sector 598457896
>>> Sep 12 08:05:35 hind kernel: [1350051.385810] md/raid:md0: read error
>>> corrected (8 sectors at 598455848 on sdk1)
>>> [...]
>>> Sep 12 08:05:35 hind kernel: [1350051.385880] md/raid:md0: read error
>>> corrected (8 sectors at 598455920 on sdk1)
>>> Sep 12 13:39:43 hind kernel: [1370087.022160] md: md0: data-check done.
>>>
>>> so it seems to be working?  although the sector reported by libata is
>>> different than what md corrected
>>
>> Looks like it replaced the entire chunk that includes the bad sector.
>> Is the chunk size 32KiB?
>
> No it is 512K, and the sector reported by libata is almost 2000 sectors
> different than what mdadm reported

The difference is probably the Data Offset.
libata reports a sector in the device.
md reports a sector in the data section of the device.

NeilBrown
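Neil's explanation can be checked against the two sector numbers in the log.  A sketch of the arithmetic (attributing the whole gap to "partition start plus md Data Offset" is an assumption — `mdadm --examine /dev/sdk1` would show the actual Data Offset for this array):

```shell
#!/bin/sh
# libata reports the failing sector relative to the whole disk (sdk);
# md reports it relative to the start of the data area inside sdk1.
LIBATA_SECTOR=598457896   # from "critical medium error, dev sdk, sector ..."
MD_SECTOR=598455848       # first "read error corrected ... on sdk1"
GAP=$(( LIBATA_SECTOR - MD_SECTOR ))
echo "gap: $GAP sectors (partition start + data offset)"
```

The gap works out to 2048 sectors (1 MiB), which is why the libata and md numbers look "almost 2000 sectors" apart while describing the same physical blocks.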
* Re: moving spares into group and checking spares
  2016-09-14 17:52 ` scar
  2016-09-14 18:22 ` Roman Mamedov
@ 2016-09-14 18:35 ` Wols Lists
  1 sibling, 0 replies; 14+ messages in thread

From: Wols Lists @ 2016-09-14 18:35 UTC (permalink / raw)
  To: scar, linux-raid

On 14/09/16 18:52, scar wrote:
> i'm not sure what you're suggesting, that 4x 11+1 RAID5 arrays should be
> changed to 1x 46+2 RAID6 array?  that doesn't seem as safe to me.  and
> checkarray isn't going to check the spare disks just as it's not doing
> now....  also that would require me to backup/restore the data so i can
> create a new array

No.  The suggestion is to convert your 4x 11+1 raid5's to 4x 12
raid6's.  (or do you mean 11 drives plus 1 parity?  If that's the case
I mean 11 plus 2 parity)

That way you're using all the drives, they all get tested, and if a
drive fails, you're left with a degraded raid6 aka raid5 aka redundant
array.  With your current setup, if a drive fails you're left with a
degraded raid5 aka raid0 aka a "disaster in waiting".

And then you can add just the one spare disk to a spares group, so if
any drive does fail, it will get rebuilt straight away.

The only problem I can see (and I should warn you) is that there seems
to be a little "upgrading in place" problem at the moment.  My gut
feeling is it's down to some interaction with systemd, so if you're not
running systemd I hope it won't bite ...

Cheers,
Wol
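The in-place conversion Wol describes would be an `mdadm --grow` per array.  A sketch that only prints the command for md0 (the backup-file path is an example; a reshape of this kind should only be attempted after reading the mdadm(8) notes on level changes and with current backups):

```shell
#!/bin/sh
# Convert one 11-drive+spare RAID5 to a 12-device RAID6 in place; the
# (S) spare is consumed as the second parity device during the reshape.
MD=/dev/md0
CMD="mdadm --grow $MD --level=6 --raid-devices=12 --backup-file=/root/md0-grow.backup"
echo "$CMD"    # printed, not executed
```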
* Re: moving spares into group and checking spares
  2016-09-14  5:18 moving spares into group and checking spares scar
  2016-09-14  9:29 ` Andreas Klauer
@ 2016-09-29  1:21 ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread

From: NeilBrown @ 2016-09-29  1:21 UTC (permalink / raw)
  To: scar, linux-raid

On Wed, Sep 14 2016, scar wrote:
> i currently have four RAID-5 md arrays which i concatenated into one
> logical volume (lvm2), essentially creating a RAID-50.  each md array
> was created with one spare disk.
>
> instead, i would like to move the four spare disks into one group that
> each of the four arrays can have access to when needed.  i was wondering
> how to safely accomplish this, preferably without unmounting/disrupting
> the filesystem.

You don't need to move the devices.  Just use the spare-group= setting
in mdadm.conf to tell mdadm that all arrays should be considered part
of the same spare-group.  Then "mdadm --monitor" will freely move
spares between arrays as needed.

> secondly, i have the checkarray script scheduled via cron to
> periodically check each of the four arrays.  i noticed in the output of
> checkarray that it doesn't list the spare disk(s).  so i'm guessing they
> are not being checked?  i was wondering, then, how i could also check
> the spare disks to make sure they are healthy and ready to be used if
> needed?

mdadm doesn't provide any automatic support for this.
You would need to e.g.
 - remove a spare from an array
 - test it in some way: e.g. use dd to read every block.
 - if satisfied, add it back to the array.

NeilBrown
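A sketch of what the spare-group= suggestion would look like applied to the mdadm.conf from the first message (the group name "hind" is arbitrary — any label shared by all four ARRAY lines works — and "mdadm --monitor" must be running for spares to migrate between arrays):

```
ARRAY /dev/md/0 metadata=1.2 UUID=6dd6eba5:50fd8c6d:33ad61ee:e84763a8 name=hind:0 spare-group=hind
ARRAY /dev/md/1 metadata=1.2 UUID=9336c73a:8b8993bf:ea6cfc3d:bf9f7441 name=hind:1 spare-group=hind
ARRAY /dev/md/2 metadata=1.2 UUID=817bf91c:4f14fcb0:9ba8b112:768321ee name=hind:2 spare-group=hind
ARRAY /dev/md/3 metadata=1.2 UUID=1251c6b7:36aca0eb:b66b4c8c:830793ad name=hind:3 spare-group=hind
```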