Subject: Re: Recommended why to use btrfs for production?
From: "Austin S. Hemmelgarn"
To: Chris Murphy, Nicholas D Steeves
Cc: Martin, Btrfs BTRFS
Date: Mon, 6 Jun 2016 09:29:47 -0400

On 2016-06-03 21:48, Chris Murphy wrote:
> On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves wrote:
>> On 3 June 2016 at 11:33, Austin S. Hemmelgarn wrote:
>>> On 2016-06-03 10:11, Martin wrote:
>>>>> Make certain the kernel command timer value is greater than the drive's error recovery timeout. The former is found in sysfs, per block device; the latter can be queried and set with smartctl. Wrong configuration is common (it's actually the default) when using consumer drives, and inevitably leads to problems, even the loss of the entire array. It really is a terrible default.
>>>>
>>>> Are nearline SAS drives considered consumer drives?
>>>>
>>> If it's a SAS drive, then no, especially when you start talking about things marketed as 'nearline'. Additionally, SCT ERC is entirely a SATA thing; I forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm pretty sure the kernel handles things differently there.
>>
>> For the purposes of BTRFS RAID1: for drives that ship with SCT ERC of 7sec, is the default kernel command timeout of 30sec appropriate, or should it be reduced?
>
> It's fine. But it depends on your use case: if it can tolerate a rare >7 second, <30 second hang, and you're prepared to start investigating the cause, then I'd leave it alone. If the use case prefers resetting the drive when it stops responding, then you'd go with something shorter.
>
> I'm fairly certain SAS's command queue doesn't get obliterated with such a link reset, just the hung command, whereas with SATA drives all information in the queue is lost. So resets on SATA are a much bigger penalty, if I have the correct understanding.

There's also more involved with an ATA link reset, because AHCI controllers aren't MP safe, so there's a global lock that has to be held while talking to them. Because of this, a link reset on an ATA drive (be it SATA or PATA) will cause performance degradation for all other devices on that controller as well until the reset is complete.
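
As a quick way to check for the mismatch Chris describes at the top of this mail, something along these lines works. It's just a sketch: the device name is an example and the smartctl output parsing is approximate.

#!/usr/bin/env python3
# Rough sketch only: compare the kernel command timer (sysfs, seconds)
# with the drive's SCT ERC read limit (smartctl, deciseconds).
# Device name is an example; output parsing is approximate.
import re
import subprocess
import sys

dev = sys.argv[1] if len(sys.argv) > 1 else "sda"

# Kernel command timer, per block device, in whole seconds.
with open(f"/sys/block/{dev}/device/timeout") as f:
    cmd_timer = int(f.read().strip())

# SCT ERC read limit; smartctl prints something like "Read: 70 (7.0 seconds)".
out = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{dev}"],
                     capture_output=True, text=True).stdout
m = re.search(r"Read:\s+(\d+)", out)

if m is None:
    print(f"{dev}: no SCT ERC value reported (unsupported or disabled); "
          f"the command timer ({cmd_timer}s) needs to be longer than the "
          "drive's internal retries")
else:
    erc = int(m.group(1)) / 10.0
    state = "OK" if cmd_timer > erc else "MISCONFIGURED"
    print(f"{dev}: command timer {cmd_timer}s, SCT ERC {erc}s -> {state}")

One thing to watch here: smartctl reports the ERC limits in tenths of a second, while the sysfs value is in whole seconds, which is an easy thing to trip over when comparing them.
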
>> For SATA drives that do not support SCT ERC, is it true that 120sec is a sane value? I forget where I got this value of 120sec;
>
> It's a good question. It's not well documented and not defined in the SATA spec, so it's probably make/model specific. The linux-raid@ list probably has the most information on this, just because their users get nailed by this problem often. And the recommendation does seem to vary around 120 to 180. That is of course a maximum; the drive could give up much sooner. But what you don't want is for the drive to be in recovery for a bad sector and have the command timer do a link reset, losing everything the drive was doing: all of which is replaceable except for one thing, which is what sector was having the problem. And right now there's no reporting from the drive for slow sectors. It only reports failed reads, and it's that failed read error that includes the sector, so that the raid mechanism can figure out what data is missing, reconstruct it from mirror or parity, and then fix the bad sector by writing to it.

FWIW, I usually go with 150 on the Seagate 'Desktop' drives I use. I've seen some cheap Hitachi and Toshiba disks that need it as high as 300 to work right, though. (A rough sketch of how I handle this is at the end of this mail.)

>> it might have been this list, it might have been an mdadm bug report. Also, in terms of tuning, I've been unable to find whether the ideal kernel timeout value changes depending on RAID type... is that a factor in selecting a sane kernel timeout value?
>
> No. It's strictly a value to make certain you get read errors from the drive rather than link resets.

You have to factor in how the controller handles things too. Some of them will retry just like a desktop drive, and you need to account for that.

> And that's why I think it's a bad default, because it totally thwarts attempts by manufacturers to recover marginal sectors, even in the single disk case.

That's debatable: by attempting to recover the bad sector, they're slowing down the whole system. The likelihood of recovering a bad sector falls off essentially linearly the longer you try, and not having the ability to choose when to report an error is the bigger issue here.
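
To make the above concrete, here's the rough shape of what I mean: cap SCT ERC at 7 seconds where the drive supports it, and fall back to a longer kernel command timer where it doesn't. This is a sketch, not a polished tool; the 70-decisecond ERC value and the 150s fallback are just the numbers from this thread, and the output parsing is approximate.

#!/usr/bin/env python3
# Rough sketch: try to cap SCT ERC at 7.0 seconds (smartctl takes
# deciseconds); if the drive won't take it, raise the kernel command
# timer instead.  70 and 150 are just the values discussed in this
# thread; the device name is an example.
import re
import subprocess
import sys

dev = sys.argv[1] if len(sys.argv) > 1 else "sda"

# Ask for a 7.0s read/write error recovery limit.
subprocess.run(["smartctl", "-l", "scterc,70,70", f"/dev/{dev}"],
               capture_output=True, text=True)

# Read it back; drives without SCT ERC support won't report a value.
out = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{dev}"],
                     capture_output=True, text=True).stdout

if re.search(r"Read:\s+70\b", out):
    print(f"{dev}: SCT ERC capped at 7.0s; the default 30s command timer is fine")
else:
    # No usable SCT ERC: make the kernel wait out the drive's internal
    # retries so we get a read error back instead of a link reset.
    with open(f"/sys/block/{dev}/device/timeout", "w") as f:
        f.write("150\n")
    print(f"{dev}: no SCT ERC; command timer raised to 150s")

Keep in mind neither setting is persistent: the sysfs value resets on reboot, and SCT ERC on most drives resets on a power cycle, so in practice something like this has to run from a udev rule or a boot-time script.
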