From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Justin Stephenson"
Subject: Re[6]: RAID 6 crashes system when being accessed
Date: Mon, 07 Jul 2014 00:54:46 +0000
Message-ID:
References:
Reply-To: "Justin Stephenson"
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset=utf-8
Content-Transfer-Encoding: 8BIT
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: Roger Heflin
Cc: stan, Linux RAID
List-Id: linux-raid.ids

Thanks again, Roger. Your input was super helpful and also helped me understand a little more about the relationship between md and my file system.

In the full tests you mentioned, "find / -type f -ls" and "...exec cksum {} \;", what would I be looking for? I executed the first one and got a colossal list of files. The server stores a lot of media resources for my design practice and there are probably hundreds of thousands of files on there.

Please let me know,

J

--------
Justin Stephenson
Creative Director/Motion Designer
416-900-6069
http://justinstephenson.com
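For reference, a rough sketch of how that kind of sweep can be run and read, assuming bash and a hypothetical mount point of /mnt/raid for the array; the log file paths and the -xdev flag are illustrative additions, not Roger's exact commands:

    # Checksum every regular file, keeping errors separate from normal output.
    # cksum has to read each file end to end, so an unreadable block or bad
    # extent surfaces as a message on stderr (and usually as a matching
    # ext4/md/ata complaint in dmesg at the same moment).
    find /mnt/raid -xdev -type f -exec cksum {} \; \
        > /tmp/cksum.out 2> /tmp/cksum.err

    wc -l /tmp/cksum.err     # 0 lines = every file was readable
    dmesg | tail -n 50       # look for fresh ext4/md/ata errors after the run

An empty error file and a quiet dmesg after the sweep are essentially the "all clear"; anything that lands in /tmp/cksum.err points at the files worth restoring or investigating further.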
------ Original Message ------
From: "Roger Heflin"
To: "Justin Stephenson"
Cc: "stan"; "Linux RAID"
Sent: 05/07/2014 4:42:04 PM
Subject: Re: Re[4]: RAID 6 crashes system when being accessed

>The MD volume itself would not be unstable. The filesystem, directory and file structures could have been corrupted; likely it did fix something that was not important enough to report. When you hit the specific directory entry and/or file data, that would be when it would crash. I have no idea how many times I have fixed this sort of issue; it is pretty common on an unexpected crash. Maybe 1 in 10-50 crashes will produce this sort of error, and the risk rises if files were being created when it happens.
>
>If you want to do a full test, this will list out all the directory entries and runs fairly quickly: "find / -type f -ls". If you want to check that all files and extents make sense, you can run the next command, but it will take a long time depending on how much data you have: "find / -type f -ls -exec cksum {} \;"
>
>On Sat, Jul 5, 2014 at 2:22 PM, Justin Stephenson wrote:
>> Hello Roger,
>>
>> Thank-you for your email and for laying out some troubleshooting steps for me. I will take these to heart and keep them on file for the future.
>>
>> I can report that there was a screen of rapid scrolling text during the crashes and some kind of memory contents dump that had a progress indicator. From what I could see, there was some kind of kernel panic and a message about ATA-9. Nothing in the /var/log/messages file as far as I could see.
>>
>> I had tried unmounting and running fsck before, but not with your specified -f -y flags.
>>
>> Here are the steps I took based on your input:
>>
>> - ran the system overnight with the md raid unmounted
>> - fully completed the resync
>> - performed fsck -f -y; it took approx. 6 minutes (on a 12TB volume) with no errors reported in the printout
>> - rebooted
>> - locally initiated and completed a 22 GB copy from and to the md raid and a local eSATA external drive
>>
>> ---
>>
>> - from a workstation, opened an SMB share to the MD raid
>> - workstation-initiated copy to and from the CentOS box (and MD drive) of the same 22 GB folder over SMB
>> - opened a VNC client to the CentOS box from a workstation
>>
>> Up until the fsck -f -y, any of these three operations would cause a crash.
>>
>> In summary, it would seem that the issue has been resolved by the fsck -f -y. Up until running fsck -f -y, the system was completely unpredictable when the MD drive was mounted - either during a sync or after it was completed. I find this surprising, but perhaps I should not?
>>
>> Based on Stan's email, I checked my UPS power settings, and I am certain I was ending up with a hard powerdown when the battery ran out. I have remedied this.
>>
>> Could this have caused the MD volume to become unstable?
>>
>> In any event, everything is up and running. I will report back with a log entry if anything else appears.
>>
>> Thanks again,
>>
>> - Justin
>>
>>
>> ------ Original Message ------
>> From: "Roger Heflin"
>> To: "Justin Stephenson"
>> Cc: stan@hardwarefreak.com; "Linux RAID"
>> Sent: 05/07/2014 12:17:45 AM
>> Subject: Re: Re[2]: RAID 6 crashes system when being accessed
>>
>>> Some questions.
>>>
>>> Do you get any messages on the screen when it crashes and/or is there anything in /var/log/messages from the crashes?
>>>
>>> Is a sync running when it crashes? If so, what kind of SATA controllers/setup are you using? I have had 2 previous setups that would run fairly stably so long as a sync was not running, but if a sync was running then the machine became unstable.
>>>
>>> Did you umount it and run a "fsck -f -y" that took a while (at least 30 seconds), or did you just umount it, run fsck, and have it finish quickly and indicate clean? Generally, if you nicely umount it, the fs thinks it is clean even when it is not because of some previous event.
>>>
>>> On Fri, Jul 4, 2014 at 8:08 PM, Justin Stephenson wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I should clarify that the crashes continue to be an issue in the absence of any power outage, so this issue is now independent of power. I mentioned the UPS only with the thought that my problems may have been caused by a sudden power-down.
>>>>
>>>> Please let me know if there are any logs or status printouts I could pull to help troubleshoot this.
>>>>
>>>> Thanks Again,
>>>>
>>>> - J
>>>>
>>>>
>>>> ------ Original Message ------
>>>> From: "Stan Hoeppner"
>>>> To: "Justin Stephenson"; linux-raid@vger.kernel.org
>>>> Sent: 04/07/2014 3:34:17 PM
>>>> Subject: Re: RAID 6 crashes system when being accessed
>>>>
>>>>> On 7/4/2014 9:11 AM, Justin Stephenson wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am experiencing some issues with my md raid. It is crashing my system when accessed with any "verve". The reboot initiates a resync of the raid. I have gone through the crash/reboot/resync cycle a number of times now, and the crash happens within minutes of mounting the raid.
>>>>>>
>>>>>> Here are some details:
>>>>>>
>>>>>> - It is a RAID 6 with 7 x 3TB devices.
>>>>>> - Formatted as EXT4.
>>>>>> - mdadm v3.2.6 - 25th October 2012
>>>>>> - CentOS 6.5, kernel 2.6.32-431.3.1.el6.x86_64
>>>>>> - It has been running flawlessly for the previous 6 months.
>>>>>> - I have a cron script running that resyncs monthly.
>>>>>> - When the raid is unmounted, the system runs fine. (I have an additional "dumb" hardware RAID 1 for dailies attached to an eSATA port. This runs perfectly.)
>>>>>> - I am in the process of re-syncing the RAID 6 again right now.
>>>>>> - I have run an fsck on the raid volume after it was fully synced and everything came up clean.
>>>>>>
>>>>>> - There have been lots of power outages the last while with the hot summer in Toronto. My UPS shuts the system down for me, though I think I can correlate the issues with the power outages.
>>>>>
>>>>> This sounds like the UPS is cutting power to the system before the shutdown sequence completes, before the array is stopped. This assumes you are already using apcupsd or similar. If you are, check the configuration to make sure the system has plenty of time to shut down after the UPS sends notification to the system. If you are not, then this will always happen, as the UPS is simply cutting power when the battery gets low.
>>>>>
>>>>> Note that if the UPS is undersized for this system and only yields a few minutes of on-battery time, it may simply not have enough juice to keep the machine up throughout the shutdown process.
>>>>>
>>>>> In summary, either your shutdown software isn't configured properly, you are not using it, or the UPS is too small. This isn't an md problem.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Stan
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
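The "plenty of time to shut down" point above usually comes down to a few lines in the UPS daemon's configuration. A minimal sketch, assuming apcupsd is the daemon in use; the path is the usual default and the values are purely illustrative, not taken from this thread:

    # /etc/apcupsd/apcupsd.conf (excerpt) -- illustrative values only.
    # Start the shutdown while a comfortable margin of battery remains,
    # so the filesystem can be unmounted and the array stopped cleanly
    # before the UPS cuts output power.
    BATTERYLEVEL 30     # shut down when charge drops to 30%
    MINUTES 10          # ...or when estimated runtime falls to 10 minutes
    TIMEOUT 0           # 0 = no fixed on-battery timer; rely on the two above

After changing this, it is worth pulling the wall plug once as a deliberate test and confirming the machine reaches a clean halt before the UPS switches off.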