* RAID 6 crashes system when being accessed
From: Justin Stephenson @ 2014-07-04 14:11 UTC
To: linux-raid

Hello,

I am experiencing some issues with my md raid. It is crashing my system
when accessed with any "verve". The reboot initiates a resync of the
raid. I have gone through the crash/reboot/resynced a number of times
now and the crash happens within minutes of mounting the raid.

Here are some details:

- It is a raid 6 with 7 3TB devices.
- Formatted as EXT4
- mdadm v3.2.6 - 25th October 2012
- centos 6.5 kernel 2.6.32-431.3.1.el6.x86_64
- It has been running flawlessly for the previous 6 months.
- I have a cron script running that resyncs monthly.
- When the raid is unmounted, the system runs fine. (I have an
  additional "dumb" hardware raid 1 for dailies attached to an ESATA port.
  This runs perfectly).
- I am in the process of re-syncing the raid 6 again right now.
- I have run an fsck on the raid volume after it was fully synced and
  everything came up clean.

- there have been lots of power outages the last while with the hot
  summer in Toronto. My UPS shuts the system down for me, though I think I
  can correlate the issues with the power outages.

Below are the printouts from /proc/mdstat and mdadm --examine

I am not very experienced with mdadm and am a newbie with linux. Any
assistance you might be able to offer would be appreciated.

Best,

- Justin

-----------------------------
/proc/mdstat

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdg1[4] sdh1[5] sdf1[3] sdc1[0] sde1[2] sdd1[1] sdb1[6]
      14650666880 blocks super 1.2 level 6, 128k chunk, algorithm 2 [7/7] [UUUUUUU]
      [>....................]
      resync = 4.7% (138332896/2930133376) finish=353.0min speed=131774K/sec

------------------------------
mdadm --examine

[root@BigBlue Desktop]# mdadm --examine /dev/sd[b-h]1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7f7ace48:01a4b6f6:bf182ae3:6acddef9
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : a1aab985 - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 6
    Array State : AAAAAAA ('A' == active, '.' == missing)

/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 8a4d11bd:fe387740:1edc1151:b0ebba43
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : 1772032b - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 0
    Array State : AAAAAAA ('A' == active, '.' == missing)

/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : aa158e32:f4449a9b:73e95000:22852278
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : cbb87e08 - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 1
    Array State : AAAAAAA ('A' == active, '.' == missing)

/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7bee5d42:cb66c602:162e88f1:e140447c
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : 380d7914 - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 2
    Array State : AAAAAAA ('A' == active, '.' == missing)

/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : b4a7b35c:6e377711:4b291874:edb2c4d0
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : 38247032 - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 3
    Array State : AAAAAAA ('A' == active, '.' == missing)

/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7c0bba5a:6b205302:d026a78d:91f5ebdd
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : 4dbcfd29 - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 4
    Array State : AAAAAAA ('A' == active, '.' == missing)

/dev/sdh1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 0849c677:64e4772e:8892d80b:47e0097a
           Name : BigBlue:0 (local to host BigBlue)
  Creation Time : Wed Jan 15 22:46:32 2014
     Raid Level : raid6
   Raid Devices : 7
 Avail Dev Size : 5860267024 (2794.39 GiB 3000.46 GB)
     Array Size : 14650666880 (13971.97 GiB 15002.28 GB)
  Used Dev Size : 5860266752 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a750a146:7ee7383e:2929d803:2bee9dc1
    Update Time : Fri Jul 4 09:26:41 2014
       Checksum : cf6d0452 - correct
         Events : 10631
         Layout : left-symmetric
     Chunk Size : 128K
    Device Role : Active device 5
    Array State : AAAAAAA ('A' == active, '.' == missing)
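One quick consistency check that the `mdadm --examine` output above allows is comparing the "Events" counter across members: md uses it to decide which superblocks are current, so a member with a lower count than the rest would be a stale one. The following is a minimal sketch, not something posted in the thread; the function name `md_event_summary` is made up for illustration, and it assumes the usual one-field-per-line output of `mdadm --examine` on stdin.

```shell
# Summarise the "Events" counters found in `mdadm --examine` output (stdin).
# Prints one line per distinct counter value: "<events> <number of members>".
# md_event_summary is a hypothetical helper name, not an mdadm feature.
md_event_summary() {
    awk '$1 == "Events" && $2 == ":" { count[$3]++ }
         END { for (e in count) print e, count[e] }' | sort -n
}

# Assumed usage on the poster's system (needs root):
#   mdadm --examine /dev/sd[b-h]1 | md_event_summary
```

A single output line means every member agrees. For the seven superblocks listed above, all of which report Events 10631, the summary would be `10631 7`.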
* Re: RAID 6 crashes system when being accessed
From: Stan Hoeppner @ 2014-07-04 19:34 UTC
To: Justin Stephenson, linux-raid

On 7/4/2014 9:11 AM, Justin Stephenson wrote:
> Hello,
>
> I am experiencing some issues with my md raid. It is crashing my system
> when accessed with any "verve". The reboot initiates a resync of the
> raid. I have gone through the crash/reboot/resynced a number of times
> now and the crash happens within minutes of mounting the raid.
>
> Here are some details:
>
> - It is a raid 6 with 7 3TB devices.
> - Formatted as EXT4
> - mdadm v3.2.6 - 25th October 2012
> - centos 6.5 kernel 2.6.32-431.3.1.el6.x86_64
> - It has been running flawlessly for the previous 6 months.
> - I have a cron script running that resyncs monthly.
> - When the raid is unmounted, the system runs fine. (I have an
> additional "dumb" hardware raid 1 for dailies attached to an ESATA port.
> This runs perfectly).
> - I am in the process of re-syncing the raid 6 again right now.
> - I have run an fsck on the raid volume after it was fully synced and
> everything came up clean.
>
> - there have been lots of power outages the last while with the hot
> summer in Toronto. My UPS shuts the system down for me, though I think I
> can correlate the issues with the power outages.

This sounds like the UPS is cutting power to the system before the
shutdown sequence completes, before the array is stopped. This assumes
you are already using apcupsd or similar. If you are, check the
configuration to make sure the system has plenty of time to shut down
after the UPS sends notification to the system. If you are not, then
this will always happen as the UPS is simply cutting power when the
battery gets low.

Note that if the UPS is undersized for this system and only yields a few
minutes of on-battery time, it may simply not have enough juice to keep
the machine up throughout the shutdown process.

In summary, either your shutdown software isn't configured properly, you
are not using it, or the UPS is too small. This isn't an md problem.

Cheers,

Stan
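To make the "plenty of time to shut down" check concrete: if apcupsd is the software in use (an assumption, since the thread never says which UPS software is installed), `apcaccess status` reports the estimated on-battery runtime as a `TIMELEFT` line, and the `BATTERYLEVEL` and `MINUTES` directives in apcupsd.conf control when the shutdown is triggered. A minimal sketch that reads that status output and checks the runtime against a required number of minutes; the helper name `ups_runtime_ok` and the 10-minute figure are illustrative only.

```shell
# Read `apcaccess status` output on stdin and succeed only if the reported
# TIMELEFT (estimated minutes on battery) is at least the number of minutes
# given as $1. ups_runtime_ok is a hypothetical helper name.
ups_runtime_ok() {
    awk -v need="$1" '
        $1 == "TIMELEFT" { left = $3; found = 1 }
        END { exit !(found && left + 0 >= need + 0) }'
}

# Assumed usage: require at least 10 minutes of runtime (illustrative figure).
#   apcaccess status | ups_runtime_ok 10 || echo "UPS runtime looks too short"
```

The check fails both when the runtime is too short and when no `TIMELEFT` line is present, so a UPS that is not reporting at all is flagged too.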
* Re[2]: RAID 6 crashes system when being accessed
From: Justin Stephenson @ 2014-07-05 1:08 UTC
To: stan, linux-raid

Hi,

Thanks for your reply.

I should clarify that the crashes continue to be an issue in the absence
of any power outage so this issue is now independent of power. I
mentioned the UPS only with the thought that my problems may have been
caused by a sudden power-down.

Please let me know if there are any logs or status print outs I could
pull to help troubleshoot this.

Thanks Again,

- J
* Re: Re[2]: RAID 6 crashes system when being accessed
From: Roger Heflin @ 2014-07-05 4:17 UTC
To: Justin Stephenson
Cc: stan, Linux RAID

Some questions.

Do you get any messages on the screen when it crashes, and/or is there
anything in /var/log/messages from the crashes?

Is a sync running when it crashes? If so, what kind of SATA
controllers/setup are you using? I have had 2 previous setups that
would run fairly stably so long as a sync was not running, but if a
sync was running then the machine became unstable.

Did you umount it and run a "fsck -f -y" that took a while (at least
30 seconds), or did you just umount it and run fsck, which finished
quickly and indicated clean? Generally, if you nicely umount it, the fs
thinks it is clean even when it is not because of some previous event.
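The forced check Roger asks about can be sketched as the sequence below. It is a cautious illustration rather than a recipe from the thread: `/dev/md0` comes from the /proc/mdstat output earlier, but the mount point `/mnt/raid` is hypothetical, and the wrapper `forced_fsck` is an invented name. By default it only prints the commands; with `RUN=1` in the environment it executes them, and stops if the unmount fails so that fsck never runs on a mounted filesystem.

```shell
# Print (or, with RUN=1, execute) the commands for a forced check of an
# unmounted ext4 filesystem. -f forces the check even if the filesystem is
# marked clean; -y answers "yes" to every repair prompt.
# Simplification: device and mount point must not contain spaces.
forced_fsck() {
    fsck_dev=$1
    fsck_mnt=$2
    for fsck_cmd in "umount $fsck_mnt" "fsck.ext4 -f -y $fsck_dev"; do
        if [ "${RUN:-0}" = 1 ]; then
            $fsck_cmd || return   # stop at the first failing step
        else
            echo "$fsck_cmd"
        fi
    done
}

# Dry run with the assumed names:
#   forced_fsck /dev/md0 /mnt/raid
```

Without `-f`, fsck on a cleanly unmounted ext4 filesystem usually returns almost immediately, which is why the duration of the run is a useful hint about whether a real check took place.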
* Re[4]: RAID 6 crashes system when being accessed
From: Justin Stephenson @ 2014-07-05 19:22 UTC
To: Roger Heflin
Cc: stan, Linux RAID

Hello Roger,

Thank you for your email and for laying out some troubleshooting steps
for me. I will take these to heart and keep them on file for the future.

I can report that there was a screen of rapid scrolling text during the
crashes and some kind of memory contents dump that had a progress
indicator. From what I could see, there was some kind of kernel panic
and a message about ATA-9. Nothing in the /var/log/messages file as far
as I could see.

I had tried unmounting and running fsck before but not with your
specified -f -y flags.

Here are the steps I took based on your input.

- ran system overnight with md raid unmounted.
- fully completed resync
- performed fsck -f -y. It took approx 6 minutes (on a 12TB volume). No
  errors reported in the printout.
- reboot
- locally initiated and completed a 22 GB copy from and to the md raid
  and a local esata external drive.

---

- from a workstation, opened SMB share to the MD raid
- workstation initiated copy to and from the CentOS box (and MD drive)
  of the same 22 GB folder over SMB.
- opened vnc client to the CentOS box from a workstation.

Up until the fsck -f -y, any of these three operations would cause a
crash.

In summary, it would seem that the issue has been resolved by the
fsck -f -y. Up until running fsck -f -y, the system was completely
unpredictable when the MD drive was mounted, either during a sync or
after it was completed. I find this surprising, but perhaps I should
not?

Based on Stan's email, I checked my UPS power settings, and I am certain
I was ending up with a hard powerdown when the battery ran out. I have
remedied this.

Could this have caused the MD volume to become unstable?

In any event, everything is up and running. I will report back with a
log entry if anything else appears.

Thanks again,

- Justin
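Since the panic text scrolled past too quickly to read, one way to look for leftovers afterwards is to filter the kernel log for the kinds of lines that tend to accompany storage-related crashes. This is a rough sketch under stated assumptions: the pattern list is a guess at useful keywords rather than an exhaustive or official set, and `crash_hints` is an invented helper name. A panic often never reaches /var/log/messages at all, so an empty result does not prove the log was clean.

```shell
# Filter kernel-log text on stdin down to lines that commonly accompany
# storage-related crashes: panics, oopses, libata port messages (ataN or
# ataN.NN prefixes), ext4 error reports and block-layer I/O errors.
# The pattern list is illustrative, not exhaustive.
crash_hints() {
    grep -iE 'kernel panic|oops:|ata[0-9]+(\.[0-9]+)?:|EXT4-fs error|I/O error'
}

# Assumed usage on the CentOS box:
#   crash_hints < /var/log/messages
#   dmesg | crash_hints
```

The `ata[0-9]+` pattern also matches the routine libata lines printed when drives are detected, which typically mention the ATA revision of the drive, so a hit there is a starting point for reading the surrounding lines rather than evidence of a fault by itself.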
* Re: Re[4]: RAID 6 crashes system when being accessed
From: Roger Heflin @ 2014-07-05 20:42 UTC
To: Justin Stephenson
Cc: stan, Linux RAID

The MD volume itself would not be unstable. The filesystem's directory
and file structures could have been corrupted; the fsck likely did fix
something that was not important enough to report. The crash would
happen when you hit the specific directory entry and/or file data that
was damaged. I have no idea how many times I have fixed this sort of
issue. It is pretty common after an unexpected crash: maybe 1 in 10-50
crashes will produce this sort of error, and the risk rises if files
were being created when it happens.

If you want to do a full test, this will walk all dirs and list every
file fairly quickly (it touches the directory entries and inodes, not
the file data):

"find /<dirname> -type f -ls"

If you want to check that all files and extents make sense, you can run
the next command, which actually reads every file, but it will take a
long time depending on how much data you have:

"find /<dirname> -type f -ls -exec cksum {} \;"
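The two `find` commands above differ in what they exercise: `-ls` only touches directory entries and inodes, while `cksum` has to read every data block of every file. On a volume with hundreds of thousands of files the interesting part of the second run is the errors, not the checksums, so a variant that keeps only the error output can be easier to read. A minimal sketch; `read_all_files` is an invented name, and using `{} +` instead of `{} \;` is a substitution made here to batch files per `cksum` call, not what was posted.

```shell
# Read every regular file under a directory and report only the files that
# could not be read. cksum writes checksums to stdout (discarded here) and
# read errors to stderr (kept and shown).
# The function returns non-zero if find or any cksum invocation failed.
read_all_files() {
    dir=$1
    find "$dir" -type f -exec cksum {} + 2>&1 >/dev/null
}

# Assumed usage, with a hypothetical mount point:
#   read_all_files /mnt/raid && echo "all files read without errors"
```

No output and a zero exit status mean every file was read from start to end. If the filesystem still had damaged structures, the expectation from the discussion above is that the run would either print read errors or reproduce the crash.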
* Re[6]: RAID 6 crashes system when being accessed
From: Justin Stephenson @ 2014-07-07 0:54 UTC
To: Roger Heflin
Cc: stan, Linux RAID

Thanks again, Roger. Your input was super helpful and also helped me
understand a little more about the relationship between md and my file
system.

In the full tests you mentioned, "find /<dir> -type f -ls" and
"...exec cksum {} \;", what would I be looking for? I executed the first
one and I got a colossal list of files. The server stores a lot of media
resources for my design practice and there are probably hundreds of
thousands of files on there.

Please let me know,

J

--------
Justin Stephenson
Creative Director/Motion Designer
416-900-6069
http://justinstephenson.com
If you are check the >>>>> configuration to make sure the system has plenty of time to >>>>>shutdown >>>>> after the UPS sends notification to the system. If you are not, >>>>>then >>>>> this will always happen as the UPS is simply cutting power when >>>>>the >>>>> battery gets low. >>>>> >>>>> Note that if the UPS is undersized for this system and only >>>>>yields a >>>>> few >>>>> minutes of on-battery time, it may simply not have enough juice >>>>>to keep >>>>> the machine up throughout the shutdown process. >>>>> >>>>> In summary, either your shutdown software isn't configured >>>>>properly, >>>>> you >>>>> are not using it, or the UPS is too small. This isn't an md >>>>>problem. >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> Stan >>>> >>>> >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe >>>>linux-raid" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> ^ permalink raw reply [flat|nested] 8+ messages in thread
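The recovery sequence reported earlier in the thread (let the resync finish, unmount, then run fsck -f -y) can be sketched as below. This is only an illustration under stated assumptions: md_busy is a hypothetical helper name, /dev/md0 is the array name shown in /proc/mdstat, and /mnt/raid is an assumed mount point that the thread never states.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 while any md array listed in the given
# mdstat file is still resyncing, recovering or reshaping.
# Reads /proc/mdstat when no file argument is given.
md_busy() {
    grep -Eq 'resync|recovery|reshape' "${1:-/proc/mdstat}"
}

# Sequence from the thread, left commented out because it needs root and
# an unmounted filesystem. /mnt/raid is an assumed mount point.
#   while md_busy; do sleep 60; done   # wait for the resync to finish
#   umount /mnt/raid
#   fsck -f -y /dev/md0                # -f forces a check even if marked clean
```

The -f flag matters here because, as Roger notes, a cleanly unmounted filesystem is marked clean and a plain fsck returns almost immediately without examining it.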
* Re: Re[6]: RAID 6 crashes system when being accessed
  @ 2014-07-07 1:56 Roger Heflin
From: Roger Heflin @ 2014-07-07 1:56 UTC
To: Justin Stephenson; +Cc: stan, Linux RAID

You are watching for the machine to crash and/or produce messages in /var/log/messages.

On Sun, Jul 6, 2014 at 7:54 PM, Justin Stephenson <justin@evensteveninc.com> wrote:
> Thanks again, Roger. Your input was super helpful and also helped me
> understand a little more about the relationship between md and my file
> system.
>
> In the full tests you mentioned, "find /<dir> -type f -ls" and
> "...exec cksum {} \;", what would I be looking for? I executed the
> first one and got a colossal list of files. The server stores a lot of
> media resources for my design practice, and there are probably
> hundreds of thousands of files on there.
>
> Please let me know,
>
> J
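Since the file listing itself is not what matters, the read test can be run so that it prints only problems. The sketch below is one possible way to do that and is not Roger's exact command: read_test is a hypothetical helper name, and it drops -ls and discards the checksums so that only read errors are shown. Note that the -ls form on its own walks directories and inode metadata, while cksum is the part that reads each file's data.

```shell
#!/bin/sh
# Hypothetical helper: read every regular file under a directory and
# print only the error messages. The checksums are thrown away; the
# point is to force the filesystem to read every file's contents.
read_test() {
    dir=$1
    # cksum reads each file in full. "2>&1 >/dev/null" sends error
    # messages to the terminal (or a pipe) and discards the checksums.
    # "{} +" passes many files to each cksum run instead of one per run.
    find "$dir" -type f -exec cksum {} + 2>&1 >/dev/null
}
```

Running read_test /<dirname> on the mounted array prints nothing if every file could be read. Afterwards, something like tail -n 50 /var/log/messages shows whether the kernel logged anything while the files were being read.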
Thread overview: 8+ messages
2014-07-04 14:11 RAID 6 crashes system when being accessed Justin Stephenson
2014-07-04 19:34 ` Stan Hoeppner
2014-07-05  1:08   ` Re[2]: " Justin Stephenson
2014-07-05  4:17     ` Roger Heflin
2014-07-05 19:22       ` Re[4]: " Justin Stephenson
2014-07-05 20:42         ` Roger Heflin
2014-07-07  0:54           ` Re[6]: " Justin Stephenson
2014-07-07  1:56             ` Roger Heflin