From: Nix <nix@esperi.org.uk>
To: Tim Small <tim@buttersideup.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: Fault tolerance with badblocks
Date: Tue, 09 May 2017 16:30:35 +0100
Message-ID: <874lwu59pg.fsf@esperi.org.uk>
In-Reply-To: <7dfa3eff-9194-002d-918b-42fbae865df3@buttersideup.com> (Tim Small's message of "Tue, 9 May 2017 13:15:46 +0100")

On 9 May 2017, Tim Small spake thusly:

> On 09/05/17 11:40, Nix wrote:
>> I've had disk failures without warning, and
>> non-failed disks with both read and write errors that would not go away,
>> but that SMART reallocation value just stayed stuck at zero through all
>> of it.
>
> Really?  I see them pretty frequently...  Let's see
>
> server1, RAID6 (4 disks), reallocated_sector_ct: 0 9 1 0
> server2, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
> server3, RAID6 (5 disks), reallocated_sector_ct: 34 754 15 115 1
> server4, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
> server5, RAID5 (4 disks), reallocated_sector_ct: 0 0 0 0
>
> Disk 2 in server3 (which has drives which are a bit long in the tooth)
> is scheduled to be replaced next time I visit that site.
>
> Are you looking at the 'raw' column in the smartctl output?

No, but since they all read zero:

  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0

this is pretty redundant.

I do see, on all my disks (regardless of hardware versus software RAID
or indeed age, and some of these disks are seven years old):

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0
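
To compare these counters across several disks, a minimal Python sketch
like the following can pull the raw value out of attribute lines in
either of the two layouts quoted above. It assumes the ID is the first
field, the name the second, and the raw value the last
whitespace-separated field; that holds for the lines shown here but not
for attributes whose raw field contains spaces. The name `parse_attr` is
purely illustrative.

```python
def parse_attr(line):
    """Return (id, name, raw) from one smartctl attribute line.

    Assumption: ID is the first field, the attribute name the second,
    and the raw value the last field, as in the lines quoted in this
    message. Raw fields containing spaces are not handled.
    """
    fields = line.split()
    return int(fields[0]), fields[1], int(fields[-1])


# Sample lines copied from the output quoted above.
lines = [
    "  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0",
    "197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0",
    "200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0",
]

for line in lines:
    attr_id, name, raw = parse_attr(line)
    print(f"{attr_id:>3} {name:<24} raw={raw}")
```

Feeding it the lines for one attribute from each disk gives a list of
raw values that can be checked for anything nonzero.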

One figure is much higher:

195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    2067212
195 Hardware_ECC_Recovered  -O-RC-   100   064   000    -    2088928
195 Hardware_ECC_Recovered  -O-RC-   082   064   000    -    156528817
195 Hardware_ECC_Recovered  -O-RC-   082   065   000    -    156513792

but this is on a bunch of three-month-old Seagate enterprise disks, and,
as with the seek error rate, Seagate use a deeply bizarre encoding for
this value that none of the SeaChest programs seem to be able to decode.

It appears that the lower the decoded value, the worse things are -- I
have no idea why two of my drives are doing so much worse than two
others on this score. I guess I should keep an eye on them. In any case,
it's going up fast on those two even when the drives are totally idle
and even when I forcibly spin them down... I don't trust this figure to
tell me anything useful at all. SMART, borderline useless as ever.

Aside: in hex these are

001f8b0c
001fdfe0
095470b1
09543600

which rather suggests to me that the drives use two distinct encodings,
two drives using one and the other two another, probably split at the
four-hex-digit mark -- but the drives have identical firmware and the
same model number...
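
The hex conversion and the suggested split can be checked with a short
Python sketch. Reading "the four-hex-digit mark" as a split into high
and low 16-bit halves is an assumption made here for illustration, as is
the helper name `split_raw`; it says nothing about what the drives
actually store in each half.

```python
# Raw Hardware_ECC_Recovered values from the four drives quoted above.
raw_values = [2067212, 2088928, 156528817, 156513792]


def split_raw(value, low_bits=16):
    """Split a raw value into (high, low) parts at `low_bits` bits.

    Assumption: the field boundary sits at 16 bits (four hex digits).
    """
    high = value >> low_bits
    low = value & ((1 << low_bits) - 1)
    return high, low


for value in raw_values:
    high, low = split_raw(value)
    print(f"{value:>10}  {value:08x}  high={high:04x}  low={low:04x}")
```

Under that split the first two drives share a high half of 001f and the
other two share 0954, while the low halves all differ.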
