From: Alexander Shenkin <al@shenkin.org>
To: Phil Turmel <philip@turmel.org>,
	Edward Kuns <eddie.kuns@gmail.com>,
	Mark Knecht <markknecht@gmail.com>
Cc: Wols Lists <antlists@youngman.org.uk>,
	Reindl Harald <h.reindl@thelounge.net>,
	Carsten Aulbert <carsten.aulbert@aei.mpg.de>,
	Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: SMART detects pending sectors; take offline?
Date: Tue, 19 Dec 2017 10:35:57 +0000
Message-ID: <d86c80ba-7703-1591-7816-00d0d9408386@shenkin.org>
In-Reply-To: <7bf0a71e-6cb7-59bc-695b-54ed6b08112b@turmel.org>



On 12/18/2017 4:09 PM, Phil Turmel wrote:
> Hi Alexander,
> 
> On 12/18/2017 10:51 AM, Alexander Shenkin wrote:
>> Hi all,
>>
>> I'm getting back to this now that I'll have time, apologies for the
>> delay.  So, is the following correct in the case of a read error?
> 
> Not quite.
> 
>> 1) System tries to read an unreadable sector
> 
>> 2) Drive timeout reports unreadable based on drive timeout setting.
> 
>> 2a) In this case, mdadm sees the sector is unreadable and rewrites it
>> elsewhere on that drive.
> 
> No.  MD reconstructs the sector from redundancy (mirror or reverse
> parity calc or reverse P+Q syndrome) and writes it back to the *same*
> sector.  Since the drive firmware reported an error here, it knows to
> verify the write as well.  If the verification fails, the drive firmware
> will relocate the sector in the background, invisible to the upper
> layers.  As far as MD is concerned, that sector address is fixed either
> way.  Relocations are handled entirely within the drive.  MD does not
> perform or track relocations.
> 
>> 3) If linux hangcheck timer runs out before the drive timeout, then
>> linux aborts the read, logs an error, and mdadm isn't given a chance
>> to rewrite elsewhere based on checksums.
> 
> No.  The hangcheck timer issue described in your forwarded email is
> unrelated.  And MD doesn't use checksums.
> 
> Each drive has a device driver timeout, as you note below, found at
> /sys/block/*/device/timeout, that linux's ATA/SCSI stack uses to cut off
> non-responsive controller cards and/or drives.  If that timer runs out
> on a read before the drive reports the read error, the low level
> *driver* reports a read error to the MD layer.  MD treats it the same as
> any other read error, locating or recomputing the sector from redundancy
> as above.  The difference in this case is that the physical drive isn't
> talking to the controller (link reset in progress, typically) and the
> corrective rewrite of the sector (to fix or relocate within the drive)
> is refused, and that write error causes MD to kick out the drive.  And
> the pending sector is also left unfixed.
> 
>> Given all this, it seems to me that I should now set the hangcheck
>> timer to something greater than drive timeout (180 seconds).  Does
>> that sound right?  Otherwise, linux will kill the rewrite again, no?
> 
> In and of itself, waiting on I/O is not a hang.  So it should not be
> applicable.
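
For reference, the "reverse parity calc" Phil describes above can be
sketched in a few lines.  This is only a toy illustration of
single-parity (RAID5-style) reconstruction, not MD's actual code: the
chunk contents and the helper names xor_blocks and reconstruct are made
up for the example, and the P+Q (RAID6) case is not covered.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR byte blocks together, byte by byte (assumes equal lengths)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_chunks, parity_chunk):
    """Recompute the one missing data chunk from the survivors plus parity."""
    return xor_blocks(list(surviving_chunks) + [parity_chunk])


# Three hypothetical data chunks of one stripe, and their parity chunk.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Pretend the read of chunk 1 failed: rebuild it from the other chunks.
rebuilt = reconstruct([data[0], data[2]], parity)
assert rebuilt == data[1]
```

The rebuilt bytes are what would then be written back to the same
sector address on the drive that reported the error.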

OK, so my understanding now is that I would normally be fine, having
set the driver timeout to 180 secs (thus giving the Seagate drive time
to report the read error back up to the MD layer before the 180 secs
are up).  In my case, however, the kernel hangcheck timer is
interrupting the process (md?) that is waiting on the sector read at
120 secs.  Therefore, the writeback doesn't happen.

Thus, I should set the hangcheck timeout to something > 120 secs (say,
180 secs; or should it be > 180 to let the driver time out first?).
Does this sound correct?  Apologies if I'm repeating info from before;
I'm just trying to be sure about what I'm doing before I go ahead and
do it.

If that's correct, I'll add the following line to /etc/sysctl.conf:

kernel.hung_task_timeout_secs = 180

I'll make sure the setting has taken effect, and then I'll run:

sudo /usr/share/mdadm/checkarray --idle --all
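
A minimal sketch of how the two timeouts discussed here could be read
back and compared once the sysctl change is in place.  It assumes the
driver timeout lives at /sys/block/*/device/timeout (as Phil notes) and
that the hung task timeout is exposed at
/proc/sys/kernel/hung_task_timeout_secs.  The helper names read_int and
timeout_report, and the root parameter, are hypothetical and only there
to keep the sketch testable.

```python
import glob
import os


def read_int(path):
    """Return the integer stored in a sysfs/procfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def timeout_report(root="/"):
    """Return (hung_task_timeout, {block_device: driver_timeout}) in seconds."""
    hung = read_int(os.path.join(root, "proc/sys/kernel/hung_task_timeout_secs"))
    drives = {}
    pattern = os.path.join(root, "sys/block/*/device/timeout")
    for path in sorted(glob.glob(pattern)):
        # Path ends in <device>/device/timeout, so the name is third from the end.
        name = path.split(os.sep)[-3]
        drives[name] = read_int(path)
    return hung, drives


if __name__ == "__main__":
    hung, drives = timeout_report()
    # A value of 0 here means hung task checking is disabled.
    print("hung_task_timeout_secs:", hung)
    for name, secs in drives.items():
        print(name, "driver timeout:", secs)
```

Reading the values back this way only confirms what the kernel
currently has set; it does not change anything.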

Thanks,
Allie


