From: Chris Murphy <lists@colorremedies.com>
To: Fredrik Tolf <fredrik@dolda2000.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Rebalancing RAID1
Date: Wed, 13 Feb 2013 01:10:11 -0700
Message-ID: <5EEA9264-5DCA-4A3F-B305-F3E64E9A3CC5@colorremedies.com>
In-Reply-To: <alpine.DEB.2.02.1302130711400.8810@shack.dolda2000.com>


On Feb 12, 2013, at 11:18 PM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
> 
> 
>> smartctl -l scterc /dev/sdX
> 
> "Warning: device does not support SCT Error Recovery Control command"
> 
> Doesn't seem that way to me; partly because of the SMART data, and partly because of the errors that were logged as the drive failed:
> 
> Feb 12 16:36:49 nerv kernel: [36769.546522] ata6.00: Ata error. fis:0x21
> Feb 12 16:36:49 nerv kernel: [36769.550454] ata6: SError: { Handshk }
> Feb 12 16:36:51 nerv kernel: [36769.554129] ata6.00: failed command: WRITE FPDMA QUEUED
> Feb 12 16:36:51 nerv kernel: [36769.559375] ata6.00: cmd 61/00:00:00:ec:2e/04:00:cd:00:00/40 tag 0 ncq 524288 out
> Feb 12 16:36:51 nerv kernel: [36769.559375]          res 41/84:d0:00:98:2e/84:00:cd:00:00/40 Emask 0x10 (ATA bus error)
> Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
> Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }
> 
> That's not typical for actual media problems, in my experience. :)

Quite typical, because these drives don't support SCTERC, which almost certainly means their error recovery timeouts are well above the Linux SCSI layer's command timer, which defaults to 30 seconds. Their timeouts are likely around 2 minutes. So in fact they never report a URE (unrecoverable read error), because the command timer expires first and the kernel resets the drive.
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html
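
To see which command timer a given drive is actually running with, something along these lines should work. It is only a sketch: the function names are mine, `sdX` is a placeholder, and `SYSFS_ROOT` is a made-up variable that exists only so the functions can be tried against a scratch directory instead of the real /sys.

```shell
# Print the kernel's command timer, in seconds, for one block device.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
scsi_timeout() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/device/timeout"
}

# Set the command timer, in seconds, for one block device (needs root).
set_scsi_timeout() {
    echo "$2" > "${SYSFS_ROOT:-/sys}/block/$1/device/timeout"
}

# Example on a live system (sdX is a placeholder):
#   scsi_timeout sdX
#   set_scsi_timeout sdX 180
```

A value written this way does not survive a reboot, so it would have to be reapplied at boot if you rely on it.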

For your use case, I'd reject these drives and get WDC Red; reportedly even the Hitachi Deskstars still have a settable SCTERC. Set it to something like 70 deciseconds. Then if the drive's ECC hasn't recovered the sector within 7 seconds, the drive gives up and reports a read error with the problem LBA. Either btrfs or md can then recover the data from the other drive and rewrite it, which fixes the read error on the affected drive.
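
Since smartctl takes the SCTERC read and write limits in deciseconds, here is a small sketch of the conversion from seconds. The helper name `scterc_cmd` is mine, and it only prints the command rather than running it, so nothing touches a drive until you run the printed line yourself as root on a drive that supports SCTERC.

```shell
# Build the smartctl command that sets the SCTERC read and write limits.
# $1 = device, $2 = limit in seconds (converted to deciseconds here).
scterc_cmd() {
    ds=$(( $2 * 10 ))
    echo "smartctl -l scterc,${ds},${ds} $1"
}

scterc_cmd /dev/sdX 7
# prints: smartctl -l scterc,70,70 /dev/sdX
```

Afterwards, `smartctl -l scterc /dev/sdX` should show the new read and write values.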

However, in your case, with both the kernel message ICRC ABRT and the following SMART entry, this is your cable problem. The ICRC and UDMA_CRC errors are the same problem, reported by the devices at each end of the cable.

/dev/hdi
Serial Number:    WD-WMC1T1679668
199 UDMA_CRC_Error_Count    0x0032   200   192   000    Old_age   Always       -       91


So the question is whether the cable problem has actually been fixed, and whether you're still getting ICRC errors from the kernel. As this is hdi, I'm wondering how many drives are connected, and whether this could be power induced rather than just cable induced. Once that's solved, you should do a scrub rather than a rebalance.
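
One way to tell whether the CRC errors have stopped is to note the raw value of attribute 199 now and compare it again after some I/O. A rough sketch follows; the helper name `crc_count` is mine, and the device and mount point in the comments are placeholders.

```shell
# Read `smartctl -A` output on stdin and print the raw value (last column)
# of attribute 199, UDMA_CRC_Error_Count.
crc_count() {
    awk '$1 == 199 { print $NF }'
}

# Example on a live system (placeholders: /dev/sdX, /mnt):
#   smartctl -A /dev/sdX | crc_count    # note the value, repeat later
#   dmesg | grep -c ICRC                # count of ICRC lines in the kernel log
#   btrfs scrub start -B /mnt           # once the count has stopped rising
```

If the raw value keeps climbing between checks, the link is still bad and a scrub would be premature.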

Chris Murphy
