* re-add POLICY
From: Chris @ 2015-02-14 21:59 UTC (permalink / raw)
To: linux-raid

Hi all,

I'd like mdadm to automatically attempt to re-sync raid members after they
were temporarily removed from the system.

I would have thought "POLICY domain=default action=re-add" should allow this,
and found a prior post that also seemed to want/test that behaviour.
But as I understand the answer given there
http://permalink.gmane.org/gmane.linux.raid/47516
mdadm is expected to exit with an error (not re-add) upon plugging the
device back in?

with:
mdadm: can only add /dev/loop2 to /dev/md0 as a spare, and force-spare is not set.
mdadm: failed to add /dev/loop2 to existing array /dev/md0: Invalid argument.

For one, I don't understand what the error message is trying to tell me about
an invalid argument that was never supplied to --incremental.

But more importantly, how can previously disconnected devices (marked failed,
with a non-future event count) get re-synced automatically when they are
plugged in again? (avoiding the manual "mdadm /dev/mdX --add /dev/sdYZ" hassle)

Cheers,
Chris

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: re-add POLICY: conflict detection?
From: Chris @ 2015-02-15 19:03 UTC (permalink / raw)
To: linux-raid

Thinking about the "invalid argument" message... with "action=re-add":

# mdadm --incremental /dev/loop2
mdadm: can only add /dev/loop2 to /dev/md0 as a spare, and force-spare is not set.
mdadm: failed to add /dev/loop2 to existing array /dev/md0: Invalid argument.

My guess is that mdadm may not be adding back the failed disk because it is
unsure whether it may have run separately, and may have newer data on it.

I thought it might be possible to clearly distinguish between clean re-adds
and conflicts by doing something like this:

* If a member fails (or is missing when starting degraded), write this info
into some failed_at_event_count field belonging to the failed member, in the
superblock of every remaining raid member device in the array.

Now, if an array part that got unplugged reappears and still has the event
count that matches the failed_at_event_count recorded in the superblocks of
the still-running disks, and the reappearing part's superblock has no
failed_at_event_count values for any member of the running array, the
reappearing part is OK to be automatically re-synced.

But if the reappearing disk claims a member of the already running array has
failed, or it reappeared with a different event count than its
failed_at_event_count field in the superblocks of the running array says, a
conflict has arisen and a sync may only be done with manual --force.

Cheers,
Chris
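The decision rules proposed above could be sketched roughly as follows. This is a purely hypothetical illustration: no failed_at_event_count field exists in any mdadm metadata format, and the function and parameter names are invented.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed conflict-detection rules.
# Inputs: the reappearing member's current event count, the
# failed_at_event_count the running array recorded for it, and whether
# the reappearing superblock claims any currently-running member failed.
readd_verdict() {
    reappearing_events=$1
    recorded_failed_at=$2
    claims_running_member_failed=$3

    if [ "$claims_running_member_failed" = "yes" ]; then
        # the reappearing disk thinks part of the running array is bad
        echo "conflict: re-sync only with manual --force"
    elif [ "$reappearing_events" -ne "$recorded_failed_at" ]; then
        # the member ran (or was written) separately after the failure
        echo "conflict: re-sync only with manual --force"
    else
        echo "clean: safe to re-sync automatically"
    fi
}

readd_verdict 4711 4711 no    # clean re-add
readd_verdict 4720 4711 no    # diverged event count
```

Only when both checks pass would an automatic (full, non-bitmap) re-sync be attempted; everything else would be left to the administrator.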
* Re: re-add POLICY
From: NeilBrown @ 2015-02-16 3:28 UTC (permalink / raw)
To: Chris; +Cc: linux-raid

On Sat, 14 Feb 2015 21:59:34 +0000 (UTC) Chris <email.bug@arcor.de> wrote:

> Hi all,
>
> I'd like mdadm to automatically attempt to re-sync raid members after they
> were temporarily removed from the system.
> [...]
> But more importantly, how can previously disconnected devices (marked failed
> with a non-future event count) get re-synced automatically when they are
> plugged in again?
> (avoiding the manual "mdadm /dev/mdX --add /dev/sdYZ" hassle)

Does your array have a write-intent bitmap configured?
If it does, then "POLICY action=re-add" really should work.

If it doesn't, then maybe you need "POLICY action=spare".
This isn't the default because, depending on exactly how/why the device
failed, it may not be safe to treat it as a spare.

If the above does not help, please report:
- kernel version
- mdadm version
- "mdadm --examine" output of at least one good drive and one failed drive.
NeilBrown
* Re: re-add POLICY
From: Chris @ 2015-02-16 12:23 UTC (permalink / raw)
To: linux-raid

NeilBrown <neilb <at> suse.de> writes:
> Does your array have a write-intent bitmap configured?
> If it does, then "POLICY action=re-add" really should work.

Thank you for your insight. You are correct, the array has no write-intent
bitmap.

> If it doesn't, then maybe you need "POLICY action=spare".

OK, I will test this when the notebook is back in the house.

Actually, the man page had kind of kept me from trying this, because it
mentions the condition "if the device is bare", and I didn't want an
arbitrary bare disk, partition, or free space to be automatically added, but
just to trigger an automatic retry with raid members that got pulled and are
safe to re-sync (e.g. after the occasional bad-block error that gets remapped
by the hard drive's firmware).

[man page: spare] "as above and additionally: if the device is bare it can
become a spare if there is any array that it is a candidate for based on
domains and metadata."

Also, I wouldn't want a temporarily removed raid member to be added as a
spare to some other array; I only want it added (re-synced, even if no bitmap
re-add is possible) to the array it belongs to according to its superblock.

> This isn't the default, because depending on exactly how/why the device
> failed, it may not be safe to treat it as a spare.

OK, I can imagine detecting the corner cases may require some intelligent
error logging.

What I am looking for is a safe re-sync configuration option between
bitmap-based re-add and treating a device as an arbitrary spare drive.
Practically, this could be something like an additional action=re-sync option
in between re-add/spare, or having the "re-add" action also do (non-bitmap)
full re-syncs if the device is in a clean state.
Might recording the fail-event count in the remaining superblocks, as
described in http://permalink.gmane.org/gmane.linux.raid/48077, help to
detect the clean state?

Kind Regards,
Chris
* Re: re-add POLICY
From: Phil Turmel @ 2015-02-16 13:17 UTC (permalink / raw)
To: Chris, linux-raid

Hi Chris,

On 02/16/2015 07:23 AM, Chris wrote:
> .... with raid members that got pulled and are safe to
> re-sync (e.g. after the occasional bad-block error that gets remapped by
> the hard drive's firmware)

This should not be part of your concern here, as MD will handle occasional
UREs by reconstructing them and rewriting them on the fly, without failing
the device.

If devices are failing after read errors, you have a different problem.
(Hint: look at recent threads for "timeout mismatch".)

Phil
* desktop disks' error recovery timeouts (was: re-add POLICY)
From: Chris @ 2015-02-16 16:15 UTC (permalink / raw)
To: linux-raid

Phil Turmel <philip <at> turmel.org> writes:
> This should not be part of your concern here, as MD will handle
> occasional UREs by reconstructing them and rewriting them on the fly,

Phil, thank you for dropping in with this hint. It very likely applies to the
disks in the docking station. I searched the mailing list; most hits said to
search for the keywords, though. ;-)

To understand the issue, I think
https://en.wikipedia.org/wiki/Error_recovery_control
was good. It would be good if this configuration information could be made
available there or at https://raid.wiki.kernel.org

Cheers,
Chris

----

I compiled some snippets from your messages that could serve as a basis for
correction/completion by someone knowledgeable:

The default Linux controller timeout is 30 seconds. Drives that spend longer
than the timeout in recovery will be reset. If they don't respond to the
reset (because they're busy in recovery) when the raid tries to write the
correct data back to them, they will be kicked out of the array.

You *must* set ERC shorter than the timeout, or set the driver timeout longer
than the drive's worst-case recovery time. The defaults for desktop drives
are *not* suitable for Linux software raid.

I strongly encourage you to run "smartctl -l scterc /dev/sdX" for each of
your drives.
For any drive that warns that it doesn't support SCT ERC, set the controller
device timeout to 180 like so:

echo 180 >/sys/block/sdX/device/timeout

If the report says read or write ERC is disabled, run
"smartctl -l scterc,70,70 /dev/sdX" to set it to 7.0 seconds.

You then set up a boot-time script to do these adjustments at every restart,
and make sure you perform regular scrub runs to ...?

You might not want that kind of long device timeout, but then you shouldn't
use desktop drives in md RAID. Anyone using desktop drives which don't
support SCT ERC in md RAID is liable to see long timeouts on the simplest bad
sector, and they probably prefer to keep the drive in the array AND have the
sector rewritten after reconstruction than to have the drive failed out of
the array.
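The timeout rule compiled above (ERC must complete well inside the driver timeout; without ERC, the timeout must cover the drive's worst-case recovery) can be captured in a tiny helper. The 30 and 180 second values come from the snippets above; the function name and the "erc"/"no-erc" classification are illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical helper encoding the timeout rule above: with 7.0s SCT
# ERC the stock 30s driver timeout is safe, without it the driver
# timeout must outlast worst-case in-drive recovery (~180s).
driver_timeout_for() {
    case "$1" in
        erc)    echo 30  ;;  # 7s ERC < 30s timeout: drive answers first
        no-erc) echo 180 ;;  # drive may grind for minutes; don't reset it
        *)      echo "unknown drive class" >&2; return 1 ;;
    esac
}

driver_timeout_for erc      # -> 30
driver_timeout_for no-erc   # -> 180
```

The key invariant either way is the same: the drive must report back (success or error) before the kernel gives up and resets it.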
* Re: desktop disks' error recovery timeouts
From: Phil Turmel @ 2015-02-16 17:19 UTC (permalink / raw)
To: Chris, linux-raid

On 02/16/2015 11:15 AM, Chris wrote:
> Phil, thank you for dropping in with this hint. It very likely applies to
> the disks in the docking station. I searched the mailing list; most hits
> said to search for the keywords, though. ;-)

I don't always have time to explain. :-(

> To understand the issue, I think
> https://en.wikipedia.org/wiki/Error_recovery_control
> was good.

Good starting points in the archives:

http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=139050322510249&w=2

There's useful info in each entire thread, though.

Phil
* What are mdadm maintainers to do? (was: desktop disks' error recovery timeouts)
From: Chris @ 2015-02-16 17:48 UTC (permalink / raw)
To: linux-raid

Thank you for the additional information; it calls for action.

OK, calling for a solution to stop desktop drives from causing data loss and
affecting mdadm's reputation:

I gather that mdadm could ship with one additional udev rule that calls a
script to check/set scterc, or falls back to increasing the system timeout.

Phil, you mentioned having posted such a script; could you prepare it for
addition to the mdadm package?

Would maintainers be OK with adding such a udev rule and script to the
package?

Kind Regards,
Chris
* Re: What are mdadm maintainers to do?
From: Phil Turmel @ 2015-02-16 19:44 UTC (permalink / raw)
To: Chris, linux-raid

On 02/16/2015 12:48 PM, Chris wrote:
> I gather that mdadm could ship with one additional udev rule that calls a
> script to check/set scterc, or falls back to increasing the system timeout.
>
> Phil, you mentioned having posted such a script; could you prepare it for
> addition to the mdadm package?

No, I've posted snippets for users to customize in their own rc.local or
distro equivalent. I vaguely recall posting a generic script for some common
cases, but I've personally converted to raid-rated drives everywhere in the
past couple of years. Somebody else will have to tackle this.

> Would maintainers be OK with adding such a udev rule and script to the
> package?

Not my call, but keep in mind that this will add a dependency on
smartmontools or whatever means is used to access/write to scterc.

Phil
* Re: What are mdadm maintainers to do? (was: desktop disks' error recovery timeouts)
From: NeilBrown @ 2015-02-16 23:49 UTC (permalink / raw)
To: Chris; +Cc: linux-raid

On Mon, 16 Feb 2015 17:48:50 +0000 (UTC) Chris <email.bug@arcor.de> wrote:

> Would maintainers be OK with adding such a udev rule and script to the
> package?

"maintainers"? Plural? That would be nice.
Unfortunately there is just the one singular me....

There are certainly other contributors who
- answer questions on the list
- provide bug reports
- provide bits of code
and I am very thankful to them. But I haven't found a likely co-maintainer
yet :-(

I'm certainly happy to consider any concrete proposal. The more concrete,
the better.

NeilBrown
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
From: Chris @ 2015-02-17 7:52 UTC (permalink / raw)
To: linux-raid

NeilBrown <neilb <at> suse.de> writes:
> "maintainers"? Plural? That would be nice.
> Unfortunately there is just the one singular me....

Yes, as Weedy said, I also referred to distro package maintainers. If we can
come up here with a udev rule and a script to call, then upstream (you) could
include this, and distro maintainers could make smartctl a suggested or
recommended package of the mdadm package.

I certainly have not understood the whole topic yet; what I got so far is
that the script should do something like the following, and I found some
implementation below. Everybody please answer with improved versions if you
can.

if smartctl tool is available
    if scterc is disabled
        /usr/sbin/smartctl -l scterc,70,70 ${DEVNAME}
else if scterc is not available
    echo 180 >/sys/block/${DEVNAME}/device/timeout

Found an older implementation that "seems to work fine":
http://article.gmane.org/gmane.linux.raid/44566

> contents of udev rule:
> ACTION=="add", SUBSYSTEM=="block", KERNEL=="[sh]d[a-z]", RUN+="/usr/local/bin/settimeout"
>
> contents of /usr/local/bin/settimeout:
> #!/bin/bash
>
> [ "${ACTION}" == "add" ] && {
>     /usr/sbin/smartctl -l scterc,70,70 ${DEVNAME} || echo 180 > /sys/${DEVPATH}/device/timeout
> }
>
> I guess what is missing is to connect the HDDs with a specific "mdadm"
> event, instead of running for each HDD. I'm not sure if this is already
> possible, since some "udev" rules for "md" already exist.
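The quoted settimeout helper can be reworked as a dry-run function so the policy is easy to inspect before wiring it into udev: it prints the command it would run rather than executing it. This is a sketch, not a drop-in replacement; the function name, the "smartctl_ok" stand-in for a successful "smartctl -l scterc,70,70" call, and the parameter order are all invented for illustration.

```shell
#!/bin/sh
# Hypothetical dry-run rework of the settimeout helper quoted above.
# Prints the action it would take: enable 7.0s SCT ERC when the drive
# accepts it, otherwise raise the driver timeout to 180s.
settimeout_plan() {
    action=$1 dev=$2 devpath=$3 smartctl_ok=$4

    # udev runs the helper for every event; only act on "add"
    [ "$action" = "add" ] || { echo "nothing to do"; return 0; }

    if [ "$smartctl_ok" = "yes" ]; then
        echo "smartctl -l scterc,70,70 $dev"
    else
        echo "echo 180 > /sys/$devpath/device/timeout"
    fi
}

settimeout_plan add /dev/sdb block/sdb no
```

A real version would replace the "smartctl_ok" parameter with an actual "command -v smartctl && smartctl -l scterc,70,70 $DEVNAME" check, which also covers the "if smartctl tool is available" condition from the pseudocode above.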
Let's get this disaster prevention into mdadm, even if just as an important
reference experience for solving a more general kernel timeout mismatch
problem ("symptom of a more generic issue",
http://article.gmane.org/gmane.linux.raid/44557).
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
From: Mikael Abrahamsson @ 2015-02-17 8:48 UTC (permalink / raw)
To: Chris; +Cc: linux-raid

On Tue, 17 Feb 2015, Chris wrote:

> Everybody please answer with improved versions if you can.
>
> if smartctl tool is available
>     if scterc is disabled
>         /usr/sbin/smartctl -l scterc,70,70 ${DEVNAME}
> else if scterc is not available
>     echo 180 >/sys/block/${DEVNAME}/device/timeout

Hi,

Generally I like this idea and agree it would be a good one, but if I was
running raid0 or linear, I might not want scterc to be enabled.

Also, what would the harm be in always bumping the timeout to 180 seconds?
Yes, drives would take longer to be kicked out in case of errors, but if
we're confident in scterc working, wouldn't we want to turn the timeout down
to 10-15 seconds then?

Personally I turn on scterc if available and turn up the timeout to 180
seconds, always, regardless of what drives I'm running. I'd rather wait
longer for a drive to be considered dead than have drives kicked due to some
hiccup in the system (controller or drive reset) that might rectify itself.

So I would suggest turning on scterc and turning up the timeout to 180
seconds as soon as mdadm is installed. This is the best tradeoff I can come
up with between stability and fast drive-dead-detection time.

Here on the list I see people coming in all the time with multiple drives
kicked due to controller resets and other intermittent flukes; I never see
people complaining that it took 30 seconds to detect a drive error. I doubt
there'd be much complaint about 180 seconds.
If someone needs faster detect times, then in my opinion they are in the
category who can be expected to tune this value for their application. 180
seconds works best for the "larger crowd" using mdadm.

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
From: Chris @ 2015-02-17 10:37 UTC (permalink / raw)
To: linux-raid

Mikael Abrahamsson <swmike <at> swm.pp.se> writes:
> if I was running raid0 or linear, I might not want scterc to be
> enabled.

Good point.

> Also, what would the harm be in always bumping the timeout to 180 seconds?

I don't know why the driver authors chose that Linux default, but the to-do,
with both your points worked in:

if the appearing device is an md member device (mdadm --examine?)
    if smartctl tool is available
        if scterc is disabled in containing ${HDD_DEV}
           AND added device is not raid0/linear
            /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV}
    echo 180 >/sys/block/${HDD_DEV}/device/timeout
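The raid-level condition in the pseudocode above can be isolated into a small helper. The function name and the level-to-policy mapping are illustrative assumptions; a real rule would obtain the level by parsing "mdadm --examine" output for the member device.

```shell
#!/bin/sh
# Hypothetical policy helper for the pseudocode above: redundant md
# levels want short in-drive ERC (md can reconstruct the sector from
# the other members), while raid0/linear want the drive's long internal
# recovery because there is no redundancy to fall back on.
erc_policy_for_level() {
    case "$1" in
        raid1|raid4|raid5|raid6|raid10) echo "enable-erc" ;;
        raid0|linear)                   echo "keep-long-recovery" ;;
        *)                              echo "not-an-md-member" ;;
    esac
}

erc_policy_for_level raid5   # -> enable-erc
erc_policy_for_level raid0   # -> keep-long-recovery
```

The 180-second driver timeout bump would then be applied unconditionally to md member devices, per Mikael's suggestion, while the scterc change is gated on the level.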
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
From: Chris Murphy @ 2015-02-17 19:33 UTC (permalink / raw)
To: linux-raid

It's not just mdadm. It likewise affects Btrfs, ZFS, and LVM.

Also, there's a lack of granularity: the Linux command timer and SCT ERC
apply only to the entire block device, not to partitions. So there's a
problem for mixed use cases. For example, two drives, each with two
partitions: sda1 and sdb1 are raid0, and sda2 and sdb2 are raid1. What's the
proper configuration for SCT ERC and the SCSI command timer? *shrug*

I don't think the automatic udev configuration idea is fail-safe. It sounds
too easy for it to automatically cause a misconfiguration.

And it also doesn't at all solve the problem that there's next to no error
reporting to user space. smartd does some, but it's narrow in scope and
entirely defers to the hard drive's self-assessment. There are all sorts of
problems that aren't in the domain of SMART that get reported in dmesg, but
there's no method for gnome-shell or KDE or any DE to surface them, or even
to send an email to a sysadmin, as an early warning. Instead, all too often
it's "WTF, XFS just corrupted itself!" while the real problem has been
happening for a week, dmesg/journal is full of errors indicating the nature
of those problems, but nothing bothered to inform a human being until the
file system face-planted.

Chris Murphy
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
From: Adam Goryachev @ 2015-02-17 22:47 UTC (permalink / raw)
To: Chris Murphy, linux-raid

On 18/02/15 06:33, Chris Murphy wrote:
> It's not just mdadm. It likewise affects Btrfs, ZFS, and LVM.
>
> Also, there's a lack of granularity with the Linux command timer and SCT
> ERC applying only to the entire block device, not partitions. So there's a
> problem for mixed use cases. For example, two drives, each with two
> partitions. sda1 and sdb1 are raid0, and sda2 and sdb2 are raid1. What's
> the proper configuration for SCT ERC and the SCSI command timer?

Umm, actually I don't know enough to disagree, but I'll ask some questions
which probably show both the assumptions I've made and might help others
understand the issue better.

If we enable SCT ERC on every drive that supports it, and we are using the
drive (only) in a RAID0/linear array, then what is the downside? As I
understand it, the drive will no longer try for > 120 sec to recover the data
stored in the "bad" sector, and will instead return an unreadable error in a
short amount of time (well below 30 seconds), which means the driver will be
able to return a read error to the application (or FS or MD) and the system
as a whole will carry on. If we didn't enable SCT ERC, then the entire drive
would vanish (because the timeout wasn't changed for the driver) and the
current read and every future read/write will all fail, and the system will
probably crash (well, depending on the application, FS layout, etc).

So, IMHO, it seems that by default, every SCT ERC capable drive should have
this enabled.
As a part of error recovery (i.e., "crap, that really important data is
stored on those few unreadable sectors"), the user could manually disable SCT
ERC and re-attempt to request the data from the drive (e.g., during dd_rescue
or similar).

Secondly, changing the timeout for those drives that don't support SCT ERC:
again, it is fairly similar to the above; we get the error from the drive
before the timeout, except we will avoid the only possible downside above
(failing to read a very unlikely, but possible to read, sector). Again, we
will avoid dropping the entire drive; even if all operations on this drive
stop for a longer period of time, that is probably better than stopping
permanently.

So, IMHO, every non SCT ERC capable drive should have the timeout extended to
120s/180s or whatever the appropriate time is that (most) drives will respond
within. That leaves only the most extremely brain-dead drives, which we can
simply ridicule on the list and anywhere and everywhere possible, to ensure
nobody will ever buy them (or the manufacturer will fix the problems).

Of course, it's quite possible I've totally oversimplified this and don't
understand the other repercussions.

> *shrug* I don't think the automatic udev configuration idea is fail
> safe. It sounds too easy for it to automatically cause a
> misconfiguration. And it also doesn't at all solve the problem that
> there's next to no error reporting to user space. [...] all too often
> it's "WTF XFS just corrupted itself!" meanwhile the real problem has
> been happening for a week, dmesg/journal is full of errors indicating
> the nature of those problems, but nothing bothered to inform a human
> being until the file system face planted.

Just because the solution doesn't solve the entire problem doesn't mean it
isn't worth doing; it does solve a part of it. So, IMHO, better to solve this
part of the problem, and then discuss/try to find a solution to the rest.
Unless you have a suggestion which can solve both parts of the problem?

I suppose that a "good" sysadmin should install some sort of log monitoring
software which will alert them to issues, whether via some desktop
application/popup, email, or something else. The problem is that most of
these issues come from "home" users who will never set up anything like log
file monitoring, or raid scrubs, or anything else. So even if we do decide
upon a generic solution that will work for almost everybody, we will still
need to rely on the distro maintainers to implement it.

PS: I suppose this is one of those cases of balancing "hide the gory details
that nobody understands" against "provide the information to the user so they
can do something about it". One more generic consideration would be to have
the kernel identify which messages are purely informational/debug and which
are errors. Normal syslog has support for many different levels, but AFAIK
all kernel messages end up in the same basket.
E.g., plugging in and removing a USB drive generated the following log
entries as seen from "dmesg":

[614977.802828] usb 3-3: new high-speed USB device number 5 using xhci_hcd
[614977.822724] usb 3-3: New USB device found, idVendor=0951, idProduct=1665
[614977.822729] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[614977.822732] usb 3-3: Product: DataTraveler 2.0
[614977.822735] usb 3-3: Manufacturer: Kingston
[614977.822737] usb 3-3: SerialNumber: 60A44C413CCBFE40AB4FFB3E
[614977.822899] usb 3-3: ep 0x81 - rounding interval to 128 microframes, ep desc says 255 microframes
[614977.822905] usb 3-3: ep 0x2 - rounding interval to 128 microframes, ep desc says 255 microframes
[614977.836547] usb-storage 3-3:1.0: USB Mass Storage device detected
[614977.836734] scsi6 : usb-storage 3-3:1.0
[614977.836819] usbcore: registered new interface driver usb-storage
[614978.854080] scsi 6:0:0:0: Direct-Access Kingston DataTraveler 2.0 1.00 PQ: 0 ANSI: 4
[614978.854493] sd 6:0:0:0: Attached scsi generic sg2 type 0
[614978.854658] sd 6:0:0:0: [sdb] 15131636 512-byte logical blocks: (7.74 GB/7.21 GiB)
[614978.854884] sd 6:0:0:0: [sdb] Write Protect is off
[614978.854888] sd 6:0:0:0: [sdb] Mode Sense: 45 00 00 00
[614978.855085] sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[614978.860015] sdb: sdb1
[614978.860864] sd 6:0:0:0: [sdb] Attached SCSI removable disk
[614979.061474] FAT-fs (sdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[615347.862058] usb 3-3: reset high-speed USB device number 5 using xhci_hcd
[615347.862111] usb 3-3: Device not responding to set address.
[615348.065856] usb 3-3: Device not responding to set address.
[615348.269944] usb 3-3: device not accepting address 5, error -71
[615348.326429] usb 3-3: USB disconnect, device number 5
[615348.334730] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff88011b1b2600
[615348.334744] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff88011b1b2640

Of the above, I would suggest most is "info", while the following line might
be a warning:

[614979.061474] FAT-fs (sdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

These might be error or critical:

[615347.862058] usb 3-3: reset high-speed USB device number 5 using xhci_hcd
[615347.862111] usb 3-3: Device not responding to set address.
[615348.065856] usb 3-3: Device not responding to set address.
[615348.269944] usb 3-3: device not accepting address 5, error -71

Of course, this will rely on every driver maintainer making a decision on
just how important each line they log may be.

Just my thoughts; hopefully they will be useful.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
From: Chris Murphy @ 2015-02-18 1:02 UTC (permalink / raw)
To: linux-raid

On Tue, Feb 17, 2015 at 3:47 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:

> If we enable SCT ERC on every drive that supports it, and we are using the
> drive (only) in a RAID0/linear array then what is the downside?

Unnecessary data loss.

> As I understand it, the drive will no longer try for > 120sec to recover
> the data stored in the "bad" sector, and instead return an unreadable
> error message in a short amount of time (well below 30 seconds) which
> means the driver will be able to return a read error to the application
> (or FS or MD) and the system as a whole will carry on.

Not necessarily; it depends what's in that sector. If it's user data, this
means a sector (or possibly more) of data loss. If it's file system metadata,
it means progressive file system corruption. Configuring the drive to give up
too soon is completely inappropriate for single, raid0, or linear
configurations.

Arguably the drive should have already recovered this data. If a longer
recovery can recover it, then why isn't the drive writing the data back to
that sector so that next time it isn't so ambiguous that it requires long
recovery? I can't answer that question; in some cases that appears to happen,
in other cases it doesn't. But the follow-up is that there really ought to be
some way for user space to get access to these kinds of errors, rather than
having them accumulate until disaster strikes. The contra argument to that
is: it's still cheaper to buy the drive specified for the use case.
>If we didn't enable SCT ERC, then the > entire drive would vanish, (because the timeout wasn't changed for the > driver) and the current read and every future read/write will all fail, and > the system will probably crash (well, depending on the application, FS > layout, etc). Umm no. If SCT ERC remains at a high value or disabled, while also increasing the kernel command timer, the drive has a longer chance to recover. That's the appropriate configuration for single, linear, and raid0. > > So, IMHO, it seems that by default, every SCT ERC capable drive should have > this enabled by default. As a part of error recovery (ie, crap that really > important data stored on those few unreadable sectors) the user could > manually disable SCT ERC and re-attempt to request the data from the drive > (eg, during dd_rescue or similar). If you do this for single, linear, or raid0 it will increase the incidence of data loss that would otherwise not occur if deep/long recovery times were available. Before changing these settings, there should be some better understanding of what the manufacturer-defined recovery times in the real world actually are, and whether or not these long recoveries are helpful. Presumably they'd say they are helpful, but I think we need facts to contradict their position before second-guessing the default settings. And we have such facts to do exactly that when it comes to raid1, 5, 6 with such drives, which is why the recommendation is to change SCT ERC if supported. > Secondly, changing the timeout for those drives that don't support SCT ERC, > again, it is fairly similar to above, we get the error from the drive before > the timeout, except we will avoid the only possible downside above (failing > to read a very unlikely but possible to read sector). Again, we will avoid > dropping the entire drive, even if all operations on this drive will stop > for a longer period of time, it is probably better than stopping > permanently. Not by default. 
You can't assume any drive hang is due to bad sectors that merely need a longer recovery time. It could be some other error condition, in which case doing a 120 or 180 second *by default* delay means no error messages at all for upwards of 3 minutes. And in any case the proper place to change the default kernel command timer value is in the kernel, not with a udev rule. I don't know if a udev rule can say "If the drive exclusively uses md, lvm, btrfs, zfs raid1, 4+ or nested of those, and if the drive does not support configurable SCT ERC, then change the kernel command timer for those devices to ~120 seconds", but if it can, that might be a plausible solution to use consumer drives the manufacturer rather explicitly proscribes from use in raid... But the contra argument to that is, why should anyone do this work for (sorry) basically cheap users who don't want to buy the proper drive for the specific use case? There are limited resources for this work. And in fact the problem has a workaround, if not a solution. What we still don't have is something that reports any such problems to user space. -- Chris Murphy ^ permalink raw reply [flat|nested] 24+ messages in thread
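Reduced to a lookup, the rule of thumb argued over in these last few messages (short drive-internal recovery for redundant md levels; long recovery plus a longer kernel command timer otherwise) would look roughly like this. The 70-decisecond and 180-second figures are just the values mentioned in this thread, not authoritative defaults:

```shell
#!/bin/sh
# Rule of thumb from this thread as a lookup table: for redundant md levels,
# cap drive-internal recovery (SCT ERC, in deciseconds) well below the kernel
# command timer; for single/linear/raid0, let the drive try for a long time
# and lengthen the kernel command timer instead.
recommend() {
    case "$1" in
        raid1|raid4|raid5|raid6|raid10) echo "scterc=70 timeout=30" ;;
        single|linear|raid0)            echo "scterc=off timeout=180" ;;
        *)                              echo "unknown" ;;
    esac
}

recommend raid6   # -> scterc=70 timeout=30
recommend raid0   # -> scterc=off timeout=180
```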
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss) 2015-02-18 1:02 ` Chris Murphy @ 2015-02-18 11:04 ` Chris 2015-02-19 6:12 ` Chris Murphy 0 siblings, 1 reply; 24+ messages in thread From: Chris @ 2015-02-18 11:04 UTC (permalink / raw) To: linux-raid > Hello all, the discussion about SCTERC boils down to letting the drive attempt ERC a little more or less. For any given disk, experience seems to show that the only slight difference is that if ERC is allowed longer, you may see the first unrecoverable errors (UREs) just a little (maybe only a month) later. UREs are inevitable. Thus, if I run a filesystem on just a single drive it will get corrupted at some point, nothing to do about it. Wait, except..., use a redundant raid! And here it makes a lot of difference that the drive's ERC actually terminates before the controller timeout, so as not to lose all your redundancy again and be at high risk of UREs showing up during the re-sync. So for a proper comparison we need to look at the difference it makes in the usage scenarios (error delay vs. losing redundant error resilience + URE triggering), not at the single recoverable/unrecoverable error incident. It looks to me that it makes a lot of difference to redundant raids and no qualitative difference to single-disk filesystems. And we need to keep in mind that single-disk filesystems also depend on the disk stopping its ERC attempts before the controller timeout. Otherwise a disk reset may make the system clear buffers and lose open files? Without prolonging the Linux default controller timeout, SCTERC can prevent that where supported. > in any case the proper place to change the default kernel command > timer value is in the kernel, not with a udev rule. Right. And as you write, increasing the controller timeout has clear downsides. 
Noting as well: as long as the proposed script (a temporary safety measure) maximizes the controller timeout as a remedy for disks that don't support SCTERC, this would even fix the timeout mismatch for single-disk filesystems. (Letting the controller wait until the disk finally succeeds or fails its recovery attempts.) So the proposed script actually provides a case that brings benefit for raid0 setups as well (as long as the Linux default is not adaptive to the disk parameters), whereas increasing the controller timeout in all cases would introduce long and unreported I/O blocking into all redundant setups. > I don't know if a udev rule can say "If the drive exclusively uses md, > lvm, btrfs, zfs raid1, 4+ or nested of those, and if the drive does > not support configurable SCT ERC, then change the kernel command timer > for those devices to ~120 seconds" then that might be a plausible > solution to use consumer drives the manufacturer rather explicitly > proscribes from use in raid... The script called by the udev rule could do that, but can be kept as simple as proposed, and can set SCTERC regardless, because setting SCTERC below the controller timeout makes a qualitative difference in running the redundant arrays and a marginal difference in running non-redundant filesystems. (And nevertheless, set a long controller timeout for devices that don't support SCTERC.) 
After all this, a fairly simple change looks appropriate: In udev-md-raid-assembly.rules, below LABEL="md_inc" (only handling all md-supported devices) add one rule: # fix timeouts for redundant raids, if possible TEST="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*", RUN+="/usr/bin/mdadm-erc-timout-fix" And in a new /usr/bin/mdadm-erc-timout-fix file implement: if smartctl -l scterc ${HDD_DEV} returns "Disabled" /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV} else if smartctl -l scterc ${HDD_DEV} does not return "seconds" echo 180 >/sys/block/${HDD_DEV}/device/timeout Regards, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
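For reference, the pseudocode above might be fleshed out along these lines. This is only a sketch: the `Disabled`/`seconds` string matching and the 70-decisecond/180-second values are taken from the proposal above, and the classification of the smartctl output is factored into a small function so the decision logic can be read (and tested) on its own:

```shell
#!/bin/sh
# Classify the output of `smartctl -l scterc <dev>`:
#   disabled    - drive supports SCT ERC but it is turned off: enable it (7.0s)
#   enabled     - drive already reports a timer "(... seconds)": leave it alone
#   unsupported - no SCT ERC at all: raise the kernel command timer instead
scterc_state() {
    case "$1" in
        *Disabled*) echo disabled ;;
        *seconds*)  echo enabled ;;
        *)          echo unsupported ;;
    esac
}

# Apply the fix to a whole-disk device such as /dev/sdc; skipped when no
# argument is given, so the classifier above can be exercised on its own.
if [ -n "${1:-}" ]; then
    dev=$1
    case "$(scterc_state "$(smartctl -l scterc "$dev")")" in
        disabled)    smartctl -l scterc,70,70 "$dev" ;;
        unsupported) echo 180 > "/sys/block/${dev#/dev/}/device/timeout" ;;
    esac
fi
```

The exact wording smartctl prints in each state is an assumption here; the classifier only relies on the `Disabled`/`seconds` markers the proposal already greps for.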
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss) 2015-02-18 11:04 ` Chris @ 2015-02-19 6:12 ` Chris Murphy 2015-02-20 5:12 ` Roger Heflin 0 siblings, 1 reply; 24+ messages in thread From: Chris Murphy @ 2015-02-19 6:12 UTC (permalink / raw) Cc: linux-raid On Wed, Feb 18, 2015 at 4:04 AM, Chris <email.bug@arcor.de> wrote: >> > > Hello all, > > the discussion about SCTERC boils down to letting the drive attempt ERC a > little more or less. For any given disk experience seems to tell the slight > difference is, that if ERC is allowed longer you may see the first > unrecoverable erros (UREs) just a little (maybe only a month) later. > > UREs are inevitable. Thus, if I run a filesystem on just a single drive it > will get corrupted at some point, nothing to do about it. For a single randomly selected drive, I disagree. In aggregate, that's true, eventually it will happen, you just won't know which drive or when it'll happen. I have a number of 5+ year old drives that have never reported a URE. Meanwhile another drive has so many bad sectors I only keep it around for abusive purposes. > > Wait, except..., use a redundant raid! And here it makes a lot of a > difference that the drive's ERC actually terminates before the controller > timeout, to not loose all your redundacy again and be in hight risk of UREs > showing up during the re-sync. > > So for a proper comparison we need to look at the difference it makes in the > usage scenarios (error delay vs. loosing redundant error resilence + URE > triggering), not at the single recoverable/unrecoverable error incidence. It > looks to me, that it makes a lot of a differnce to redundant raids and no > qualitative difference to single disk filesystems. > > And we need to keep in mind that single disk filesystems do also depend on > the disk to stop grinding away with ERC attempts before the controller > timout. Otherwise disk reset may make the system clear buffers and loose > open files? 
Without prolonging the linux default controller timout, SCTERC > can prevent that where supported. To get to one size fits all, where SCT ERC is disabled (consumer drive), and the kernel command timer is increased accordingly, we still need the delay reportable to user space. You can't have a by default 2-3 minute showstopper without an explanation so that the user can tune this back to 30 seconds or get rid of the drive or some other mitigation. Otherwise this is a 2-3 minute silent failure. I know a huge number of users who would assume this is a crash and force power off the system. The option where SCT ERC is configurable, you could also do this one size fits all by setting this to say 50-70 deciseconds, and for read failures to cause recovery if raid1+ is used, or cause a read retry if it's single, raid0, or linear. In other words, control the retries in software for these drives. >> I don't know if a udev rule can say "If the drive exclusively uses md, >> lvm, btrfs, zfs raid1, 4+ or nested of those, and if the drive does >> not support configurable SCT ERC, then change the kernel command timer >> for those devices to ~120 seconds" then that might be a plausible >> solution to use consumer drives the manufacturer rather explicitly >> proscribes from use in raid... > > The script called by the udev rule could do that, but can be kept as simple > as proposed, and can set SCTERC regardles, because setting SCTERC below the > controller timout makes a qualitative difference in running the redundant > arrays and a marginal difference in running non-redundant filesystems. (And > nevertheless, set long controller timout for devices that don's support SCTERC.) I can't agree at all, lacking facts, that this change is marginal for non-redundant configurations. I've seen no data how common long recovery incidents are, or how much more common data loss would be if long recovery were prevented. The mere fact they exist suggests they're necessary. 
It may very well be that the ECC code or hardware used is so slow that it really does take so unbelievably long (really 30 seconds is an eternity, a minute seems outrageous, and 2-3 minutes seems wholly ridiculous, as in worthy of brutal unrelenting ridicule); but that doesn't even matter even if it is true: that's the behavior of the ECC whether we like it or not, and we can't just willy-nilly turn these things off without understanding the consequences. Just saying it's marginal doesn't make it true. So if SCT ERC is short, now you have to have a mitigation for the possibly higher number of UREs this will result in, in the form of kernel-instigated read retries on read fail. And in fact, this may be false. The retries the drive does internally might be completely different than the kernel doing another read. The way data is encoded on the drive these days bears no resemblance to discrete 1s and 0s. And you also need a reliable opt-out for SSDs. Their failures seem rather different. -- Chris Murphy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss) 2015-02-19 6:12 ` Chris Murphy @ 2015-02-20 5:12 ` Roger Heflin 0 siblings, 0 replies; 24+ messages in thread From: Roger Heflin @ 2015-02-20 5:12 UTC (permalink / raw) To: Chris Murphy; +Cc: Linux RAID On Thu, Feb 19, 2015 at 12:12 AM, Chris Murphy <lists@colorremedies.com> wrote: > On Wed, Feb 18, 2015 at 4:04 AM, Chris <email.bug@arcor.de> wrote: >>> >> >> Hello all, >> > > On a single randomly selective drive, I disagree. In aggregate, that's > true, eventually it will happen, you just won't know which drive or > when it'll happen. I have a number of 5+ year old drives that have > never reported a URE. Meanwhile another drive has so many bad sectors > I only keep it around for abusive purposes. And I have seen the same. Not all will fail, even of a given type. It also appears that, if one was really worried, running smartctl -t long often (daily or weekly) can result in the disk finding and re-writing or moving the bad sector. I have a disk that started giving me trouble, and the bad block count has risen a few times without an OS-level error during the -t long test. > > > > > To get to one size fits all, where SCT ERC is disabled (consumer > drive), and the kernel command timer is increased accordingly, we > still need the delay reportable to user space. You can't have a by > default 2-3 minute showstopper without an explanation so that the user > can tune this back to 30 seconds or get rid of the drive or some other > mitigation. Otherwise this is a 2-3 minute silent failure. I know a > huge number of users who would assume this is a crash and force power > off the system. > > The option where SCT ERC is configurable, you could also do this one > size fits all by setting this to say 50-70 deciseconds, and for read > failures to cause recovery if raid1+ is used, or cause a read retry > if it's single, raid0, or linear. In other words, control the retries > in software for these drives. 
This gets more interesting. From what I can tell with my drives (Reds and Seagate video drives), some allow ERC to be set only to 7 seconds or higher, and some allow things to be set lower. I have been setting mine lower when it allows, since I have raid6 and expect to be able to get the data from the other disks. This minimum of 7 vs. a lower minimum may be a further distinction between the Green (none), Red (7), and Seagate VX (1.0 allowed). Mine has video recordings... when the video pauses I count how long. I almost always appear to see the full 7 seconds, so I suspect that if it does not recover in a short time it is unlikely to recover at all. Given the data corruption issue without raid, the vendors may have the thought that they cannot really do anything else but retry in the no-raid case. > > > > I can't agree at all, lacking facts, that this change is marginal for > non-redundant configurations. I've seen no data how common long > recovery incidents are, or how much more common data loss would be if > long recovery were prevented. > > The mere fact they exist suggests they're necessary. It may very well > be that the ECC code or hardware used is so slow that it really does > take so unbelievably long (really 30 seconds is an eternity, and a > minute seems outrageous, and 2-3 minutes seems wholly ridiculous as in > worthy of brutal unrelenting ridicule); but that doesn't even matter > even if it is true, that's the behavior of the ECC whether we like it > or not, we can't just willy nilly turn these things off without > understanding the consequences. Just saying it's marginal doesn't make > it true. > > So if SCT ERC is short, now you have to have a mitigation for the > possibly higher number of URE's this will result in, in the form of > kernel instigated read retries on read fail. And in fact, this may be > false. The retries the drive does internally might be completely > different than the kernel doing another read. 
The way data is encoded > on the drive these days bears no resemblance to discreet 1's and 0's. Given the drive likely has some ability to adjust the levels of the 0 and 1, I can see the disk retries possibly playing some games like that trying to get a better answer. It is worth noting that 7 seconds does mean around 70 retries of the read (data comes under the head 70 times). I doubt the ECC is so slow it takes more than 10-20 ms to calculate more extreme failures. So I am betting on the retries being what is recovering the data. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss) 2015-02-17 19:33 ` Chris Murphy 2015-02-17 22:47 ` Adam Goryachev @ 2015-02-17 23:33 ` Chris 1 sibling, 0 replies; 24+ messages in thread From: Chris @ 2015-02-17 23:33 UTC (permalink / raw) To: linux-raid Chris Murphy writes: > > It's not just mdadm. It likewise affects Btrfs, ZFS, and LVM. Do they have their own timeouts, or do they rely on the kernel? Maybe the kernel could read the SCTERC value from the drives (in lieu of some better retry timeout information) and set the controller timeout a little greater than that, or very large if SCTERC is disabled/not available. > sda1 and sdb1 are raid0, and sda2 and sdb2 are > raid1. What's the proper configuration for SCT ERC and the SCSI > command timer? guessing... For SCTERC-disabled drives: A compromise may be to stay with the Linux default controller timeout (it's 30s) and set the drives' SCTERC below 30s (maybe 27s), to avoid losing redundancy and risking data loss *AND* allow more of the available time for ERC. For longer error-correcting attempts (and just as long I/O controller blocking!) the controller timeout could be set to 180s, and SCTERC to 175s? BUT: If I chose to use a raid0 alongside a redundant raid, I already explicitly decided to take all the data loss the hardware throws at me. So I don't think it makes much of a difference if ERC times out after <30 secs or 180s; it's just more or fewer errors belonging to me. For SCTERC-enabled drives: 30s and 7s seems ok? > *shrug* I don't think the automatic udev configuration idea is fail > safe. It sounds too easy for it to automatically cause a > misconfiguration. A matching timeout configuration prevents the unavoidable unrecoverable read errors from taking down the redundancy for sure and causing a high risk of data loss during rebuild. 
It does fix a misconfiguration; however, it could possibly set SCTERC just below the (30s) controller timeout, to reduce the impact of SCTERC (e.g. make use of the small chance of error correction succeeding a couple of seconds later), given that the longer SCTERC timeout does not lead to subsequent read-error timeouts piling up. > And it also doesn't at all solve the problem that > there's next to no error reporting to user space. That is correct, but rather unrelated to the importance of fixing the timeout mismatch and reducing the risk, is it? The settings do solve unnecessary loss of redundancy on read errors that are sure to occur, unnecessary resyncing, and the high risk of data loss during all that. ^ permalink raw reply [flat|nested] 24+ messages in thread
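The "just below the controller timeout" pairing suggested here (27s under a 30s timer, 175s under a 180s one) could be computed rather than hard-coded. A sketch, where the 3-second margin and the 7.0s floor are arbitrary assumptions in the spirit of the thread rather than values from any standard:

```shell
#!/bin/sh
# Pick an SCT ERC value (deciseconds) a fixed margin below a given kernel
# command timer (seconds), flooring at the 7.0s commonly mentioned in this
# thread as a typical drive minimum.
erc_below_timeout() {
    ds=$(( ($1 - 3) * 10 ))
    [ "$ds" -lt 70 ] && ds=70
    echo "$ds"
}

erc_below_timeout 30    # -> 270  (27.0s under the 30s default timer)
erc_below_timeout 180   # -> 1770 (a few seconds under a 180s timer)
```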
* help with the little script (erc timout fix) 2015-02-16 23:49 ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown 2015-02-17 7:52 ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris @ 2015-02-18 15:04 ` Chris 2015-02-18 21:25 ` NeilBrown 1 sibling, 1 reply; 24+ messages in thread From: Chris @ 2015-02-18 15:04 UTC (permalink / raw) To: linux-raid Hello, by adapting what I could find, I compiled the following short snippet now. Could list members please look at this novice code and suggest a way to determine the containing disk device $HDD_DEV from the partition/disk, before I dare to test this. In udev-md-raid-assembly.rules, below LABEL="md_inc" (section only handling all md-supported devices) add: # fix timeouts for redundant raids, if possible IMPORT{program}="BINDIR/mdadm --examine --export $tempnode" TEST="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*", RUN+="BINDIR/mdadm-erc-timout-fix.sh $tempnode" And in a new mdadm-erc-timout-fix.sh file implement: #! /bin/sh HDD_DEV= $1 somehow stripping off the trailing numbers? if smartctl -l scterc ${HDD_DEV} | grep -q Disabled ; then /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV} else if ! smartctl -l scterc ${HDD_DEV} | grep -q seconds ; then echo 180 >/sys/block/${HDD_DEV}/device/timeout fi fi Correct execution during boot would seem to require that distro package managers hook smartctl and the script into the initramfs generation. Regards, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: help with the little script (erc timout fix) 2015-02-18 15:04 ` help with the little script (erc timout fix) Chris @ 2015-02-18 21:25 ` NeilBrown 0 siblings, 0 replies; 24+ messages in thread From: NeilBrown @ 2015-02-18 21:25 UTC (permalink / raw) To: Chris; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2459 bytes --] On Wed, 18 Feb 2015 15:04:53 +0000 (UTC) Chris <email.bug@arcor.de> wrote: > > Hello, > > by adapting what I could find, I compiled the following short snippet now. > > Could list members please look at this novice code and suggest a way to > determine the containing disk device $HDD_DEV from the partition/disk, > before I dare to test this. > > > > In udev-md-raid-assembly.rules, below LABEL="md_inc" (section only handling > all md-supported devices) add: > > # fix timeouts for redundant raids, if possible > IMPORT{program}="BINDIR/mdadm --examine --export $tempnode" > TEST="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*", > RUN+="BINDIR/mdadm-erc-timout-fix.sh $tempnode" It might make sense to have 2 rules, one for partitions and one for disks (based on ENV{DEVTYPE}). Then use $parent to get the device from the partition, and $devnode to get the device of the disk. > > And in a new mdadm-erc-timout-fix.sh file implement: > > #! /bin/sh > > HDD_DEV= $1 somehow stripping off the trailing numbers? > > if smartctl -l scterc ${HDD_DEV} | grep -q Disabled ; then > /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV} > else > if ! smartctl -l scterc ${HDD_DEV} | grep -q seconds ; then > echo 180 >/sys/block/${HDD_DEV}/device/timeout > fi > fi You should be consistent and use /usr/sbin/smartctl everywhere, or explicitly set $PATH and just use smartctl everywhere. > > Correct execution during boot would seem to require that distro > package managers hook smartctl and the script into the initramfs > generation. > > Regards, > Chris One problem with this approach is that it assumes circumstances don't change. 
If you have a working RAID1, then limiting the timeout on both devices makes sense. If you have a degraded RAID1 with only one device left, then you really want the drive to try as hard as it can to get the data. There is a "FAILFAST" mechanism in the kernel which allows the filesystem, md, etc. to indicate that it wants accesses to "fail fast", which presumably means to use a smaller timeout. I would rather md used this flag where appropriate, and the device responded to it by using suitable timeouts. The problem is that FAILFAST isn't documented usefully and it is very hard to figure out what exactly (if anything) it does. But until that is resolved, a fix like this is probably a good idea. NeilBrown [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
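On the still-open question of deriving the whole-disk device from a partition node: besides the `$parent` substitution NeilBrown suggests (which avoids the problem entirely on the udev side), a script-side fallback might munge the name. A sketch only: the `nvme`/`mmcblk` handling below is an assumption about kernel naming schemes, and digit-suffixed whole-disk names such as `loop2` would defeat the sed fallback:

```shell
#!/bin/sh
# Derive the parent disk name from a partition name, e.g. sdc7 -> sdc.
# A sysfs lookup (readlink -f /sys/class/block/<part> ends in <disk>/<part>)
# or udev's $parent would be more robust; this is pure name munging.
part_to_disk() {
    name=${1#/dev/}
    case "$name" in
        # names like nvme0n1p2 or mmcblk0p1: strip the "p<digits>" suffix
        nvme*p[0-9]*|mmcblk*p[0-9]*) echo "${name%p[0-9]*}" ;;
        # names like sdc7: strip trailing digits
        *) printf '%s\n' "$name" | sed 's/[0-9]*$//' ;;
    esac
}

part_to_disk /dev/sdc7    # -> sdc
part_to_disk nvme0n1p2    # -> nvme0n1
```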
* Re: re-add POLICY 2015-02-16 12:23 ` Chris 2015-02-16 13:17 ` Phil Turmel @ 2015-02-17 15:09 ` Chris 2015-02-22 13:23 ` Chris 1 sibling, 1 reply; 24+ messages in thread From: Chris @ 2015-02-17 15:09 UTC (permalink / raw) To: linux-raid > NeilBrown <neilb <at> suse.de> writes: > > > If it doesn't, then maybe you need "POLICY action=spare". > > OK, I will test this when the notebook is back in the house. I could test it on another system. Without adding a bitmap, it required configuring POLICY domain=default action=spare and calling mdadm --udev-rules, but then, after removing and inserting sdc again, only two out of six md partitions got synced. To see if there was something wrong, I then added the sdc1 md0 member manually, and it synced without failure. So I can't tell why the other partitions did not sync automatically. Some of the unsynced partition types are 83 (md0 member), but others are FD (md7 member) like the automatically synced ones. linux 3.2.0 mdadm v3.2.5 md7 : active raid1 sdc6[3] sda8[2] 14327680 blocks super 1.2 [3/2] [UU_] bitmap: 1/1 pages [4KB], 65536KB chunk md3 : active raid1 sdc8[4] sda10[3] 307011392 blocks super 1.2 [3/2] [UU_] bitmap: 3/3 pages [12KB], 65536KB chunk md6 : active raid1 sda7[2] 8695680 blocks super 1.2 [3/1] [_U_] md1 : active raid1 sda6[3](W) sdb2[1] 19513216 blocks super 1.2 [4/2] [_UU_] md2 : active raid1 sda9[3](W) sdb3[0] 97590144 blocks super 1.2 [4/2] [U_U_] md0 : active raid1 sdc1[4] sda5[2](W) sdb1[1] 340672 blocks super 1.2 [4/3] [UUU_] A partition that did not sync automatically: /dev/sdc7: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 7a5847cd:be0e8510:8e170bf5:5d40143f Name : name:2 (local to host name) Creation Time : Sun Dec 2 21:40:58 2012 Raid Level : raid1 Raid Devices : 4 Avail Dev Size : 195187135 (93.07 GiB 99.94 GB) Array Size : 97590144 (93.07 GiB 99.93 GB) Used Dev Size : 195180288 (93.07 GiB 99.93 GB) Data Offset : 131072 sectors Super Offset : 8 sectors State : clean Device UUID : 
b1a97d12:965e3d08:059acefb:6ac5b7e3 Update Time : Wed Dec 3 11:23:26 2014 Checksum : ac0ce511 - correct Events : 382479 Device Role : Active device 1 Array State : AAA. ('A' == active, '.' == missing) And a corresponding partition that is part of the running array: /dev/sda9: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 7a5847cd:be0e8510:8e170bf5:5d40143f Name : name:2 (local to host name) Creation Time : Sun Dec 2 21:40:58 2012 Raid Level : raid1 Raid Devices : 4 Avail Dev Size : 195182592 (93.07 GiB 99.93 GB) Array Size : 97590144 (93.07 GiB 99.93 GB) Used Dev Size : 195180288 (93.07 GiB 99.93 GB) Data Offset : 131072 sectors Super Offset : 8 sectors State : clean Device UUID : 4c191282:80769896:378abe34:aeb01b8d Flags : write-mostly Update Time : Mon Feb 16 17:41:54 2015 Checksum : 8cb4794c - correct Events : 384989 Device Role : Active device 2 Array State : A.A. ('A' == active, '.' == missing) BTW, looking at this data now, it seems to me the superblocks almost support the clean re-sync / conflict detection I was trying to explain. a) The removed device 1 does not claim that a member in the running array (0 and 2) has failed (AAA.) b) The Events count of device 1 is lower than in the running array. c) The running array's superblocks do not seem to keep a record of the Events count at which device 1 failed, as additional assurance that it has not been started separately. But b) and c) may not even be necessary, as starting device 1 separately would make device 1 claim that 0 and 2 have failed, right? Regards, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
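The criteria sketched in this message, (a) array-state agreement and (b) a strictly lower Events count on the returning member, amount to comparing two `mdadm --examine` dumps. A minimal sketch: the `Events` and `Array State` field names match the output quoted above; everything else is hypothetical helper code, not anything mdadm provides:

```shell
#!/bin/sh
# Pull the relevant fields out of `mdadm --examine` output passed in as a
# string (field names as in the superblock dumps quoted above).
events_of()      { printf '%s\n' "$1" | sed -n 's/^ *Events : *//p'; }
array_state_of() { printf '%s\n' "$1" | sed -n 's/^ *Array State : *\([A.]*\).*/\1/p'; }

# Criterion (b): the returning member must be strictly behind the running
# array's event count for an automatic re-sync to be considered clean.
behind() { [ "$(events_of "$1")" -lt "$(events_of "$2")" ]; }
```

Criterion (a) would additionally compare the returning member's `array_state_of` result against the running array's view of which slots are active.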
* Re: re-add POLICY 2015-02-17 15:09 ` re-add POLICY Chris @ 2015-02-22 13:23 ` Chris 0 siblings, 0 replies; 24+ messages in thread From: Chris @ 2015-02-22 13:23 UTC (permalink / raw) To: linux-raid Hello, I just noticed that I somehow overlooked that md3 and md7 on that old ubuntu system *did* have a write-intent bitmap. So in my tests action="spare" does not seem to allow automatic re-sync of arrays without a bitmap. To quote the man page again on "spare": "if the device is bare it can become a spare if there is any array that it is a candidate for based on domains and metadata." I am frankly not sure I fully understand that. A bare device has no superblock, so does mdadm only look at which array fits onto the device? Since the partitions on the removed disk contain superblocks they are not bare; may that be why action=spare does not apply, and would an automatic re-sync either require a new action="re-sync" or be done by "re-add" as well? Regards, Chris ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2015-02-22 13:23 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-02-14 21:59 re-add POLICY Chris 2015-02-15 19:03 ` re-add POLICY: conflict detection? Chris 2015-02-16 3:28 ` re-add POLICY NeilBrown 2015-02-16 12:23 ` Chris 2015-02-16 13:17 ` Phil Turmel 2015-02-16 16:15 ` desktop disk's error recovery timouts (was: re-add POLICY) Chris 2015-02-16 17:19 ` desktop disk's error recovery timouts Phil Turmel 2015-02-16 17:48 ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) Chris 2015-02-16 19:44 ` What are mdadm maintainers to do? Phil Turmel 2015-02-16 23:49 ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown 2015-02-17 7:52 ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris 2015-02-17 8:48 ` Mikael Abrahamsson 2015-02-17 10:37 ` Chris 2015-02-17 19:33 ` Chris Murphy 2015-02-17 22:47 ` Adam Goryachev 2015-02-18 1:02 ` Chris Murphy 2015-02-18 11:04 ` Chris 2015-02-19 6:12 ` Chris Murphy 2015-02-20 5:12 ` Roger Heflin 2015-02-17 23:33 ` Chris 2015-02-18 15:04 ` help with the little script (erc timout fix) Chris 2015-02-18 21:25 ` NeilBrown 2015-02-17 15:09 ` re-add POLICY Chris 2015-02-22 13:23 ` Chris