From: Chris Murphy
Date: Wed, 6 Jul 2016 10:43:57 -0600
Subject: Re: Adventures in btrfs raid5 disk recovery
To: "Austin S. Hemmelgarn"
Cc: Btrfs BTRFS

On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn wrote:
> On 2016-07-05 19:05, Chris Murphy wrote:
>>
>> Related:
>> http://www.spinics.net/lists/raid/msg52880.html
>>
>> Looks like there is some traction to figuring out what to do about
>> this, whether it's a udev rule or something that happens in the
>> kernel itself. Pretty much the only hardware setups unaffected by
>> this are those with enterprise or NAS drives. Every configuration of
>> a consumer drive, whether single, linear/concat, or any software
>> (mdadm, lvm, Btrfs) RAID level, is adversely affected by this.
>
> The thing I don't get about this is that while the per-device
> settings on a given system are policy, the default value is not, and
> should be expected to work correctly (but not necessarily optimally)
> on as many systems as possible, so any claim that this should be
> fixed in udev is bogus by the regular kernel rules.

Sure. But what other consequences does changing it in the kernel lead to? It fixes the problem under discussion, but what problems will it introduce? I think it's worth exploring this, at the least so that affected parties can be informed.

Also, the problem wasn't instigated by Linux, but by drive manufacturers introducing a whole new kind of error recovery with an order-of-magnitude longer recovery time. By now, most hardware in the field probably consists of such drives. Even SSDs like my Samsung 840 EVO that support SCT ERC ship with it disabled, so the top-end recovery time can't be discovered from the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default command timer to 180 seconds? Or is there a smarter way to do this? I don't know.
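A smarter way might look something like the rough, untested sketch below. It assumes smartmontools is installed and root privileges; the device list and the 7.0 second / 180 second values are placeholders, not a recommendation. The idea: cap SCT ERC well below the 30 second command timer on drives that support it, and only raise the command timer on drives that don't, so they at least get a chance to report the bad sector before the link is reset.

#!/usr/bin/env python3
# Rough sketch, not tested: per-device timeout policy for array members.
# Assumes smartmontools is installed and this runs as root.
# The device list and timeout values are placeholders.

import subprocess
from pathlib import Path

DEVICES = ["sda", "sdb", "sdc"]   # placeholder; enumerate the real array members
ERC_DECISECONDS = 70              # 7.0 seconds, the usual enterprise/NAS default
FALLBACK_TIMEOUT_S = 180          # for drives with no usable SCT ERC

def supports_scterc(dev):
    """True if 'smartctl -l scterc' reports the feature at all (even if disabled)."""
    out = subprocess.run(["smartctl", "-l", "scterc", "/dev/" + dev],
                         stdout=subprocess.PIPE, universal_newlines=True).stdout
    return "SCT Error Recovery Control" in out and "not supported" not in out

def set_scterc(dev, deciseconds):
    """Tell the drive to give up on a sector long before the command timer fires."""
    subprocess.run(["smartctl", "-l",
                    "scterc,{0},{0}".format(deciseconds), "/dev/" + dev],
                   check=True)

def set_command_timer(dev, seconds):
    """Raise the SCSI command timer so a long in-drive recovery isn't cut short."""
    Path("/sys/block/{}/device/timeout".format(dev)).write_text(str(seconds))

for dev in DEVICES:
    if supports_scterc(dev):
        set_scterc(dev, ERC_DECISECONDS)            # short recovery; raid does the repair
    else:
        set_command_timer(dev, FALLBACK_TIMEOUT_S)  # let the drive finish recovering

Something like that would have to run on every boot, since SCT ERC settings don't survive a power cycle on most drives (the OpenZFS wiki text below makes the same point), which is why a udev rule keeps coming up.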
>> I suspect, but haven't tested, that ZFS On Linux would be equally
>> affected, unless they're completely reimplementing their own block
>> layer (?). So there are quite a few parties now negatively impacted
>> by the current default behavior.
>
> OTOH, I would not be surprised if the stance there is 'you get no
> support if you're not using enterprise drives', not because of the
> project itself, but because it's ZFS. Part of their minimum
> recommended hardware requirements is ECC RAM, so it wouldn't surprise
> me if enterprise storage devices are there too.

http://open-zfs.org/wiki/Hardware

"Consistent performance requires hard drives that support error recovery control."

"Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds, until ZFS is modified to control it. This must be done on every boot."

They do not explicitly require enterprise drives, but they clearly expect SCT ERC to be set to some sane value.

At least for Btrfs and ZFS, mkfs is in a position to know all the parameters needed to set SCT ERC and the SCSI command timer correctly for every device. Maybe it could create the udev rule? Single and raid0 profiles need to permit long recoveries, whereas raid1, raid5, and raid6 need very short ones. Possibly the mdadm and lvm tools could do the same thing.

--
Chris Murphy