All of lore.kernel.org
* SSD based sw RAID: is ERC/TLER really important?
@ 2021-07-24 18:41 Gianluca Frustagli
  2021-07-24 20:19 ` Peter Grandi
  2021-07-24 20:21 ` Andy Smith
  0 siblings, 2 replies; 10+ messages in thread
From: Gianluca Frustagli @ 2021-07-24 18:41 UTC (permalink / raw)
  To: linux-raid

Hi, 

nowadays we all know that the ERC/TLER capability is very important when
using spinning drives in RAID systems: especially on recent hard disks,
the recovery time after a media error can exceed the kernel timeout, get
the drive kicked out of the RAID set and, in turn, lead to the failure of
a RAID5 array upon a subsequent error on a second drive.

But in the case of SSD drives (where, presumably, the error recovery
performed by the drive firmware is very fast) does the presence of the
ERC/TLER capability really matter? Is the same scenario from the spinning
drives case actually probable, or only theoretical?

Thank you for any considerations and evaluations you care to share.

Gianluca 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-24 18:41 SSD based sw RAID: is ERC/TLER really important? Gianluca Frustagli
@ 2021-07-24 20:19 ` Peter Grandi
  2021-07-24 21:45   ` Phil Turmel
  2021-07-24 20:21 ` Andy Smith
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Grandi @ 2021-07-24 20:19 UTC (permalink / raw)
  To: list Linux RAID

> the recovery time after a media error can exceed the kernel
> timeout, get the drive kicked out of the RAID set and, in
> turn, lead to the failure of a RAID5 array upon a subsequent
> error on a second drive.

My understanding seems different:

* The purpose of a short device error retry period is the
  opposite: it is to fail a drive as fast as possible, in
  workloads where latency matters (or where there is also the
  risk of bus/link resets hitting multiple drives). In those
  cases error retry periods of 1-2 seconds (at most) are common,
  rather than the mid-way "7 seconds" copied-and-pasted from web
  pages.

* The purpose of a long device error retry period is instead to
  minimize the chances of declaring a drive failed, hoping that
  many retries succeed (but note the difference between reads
  and writes).

* It is possible to set the kernel timeouts higher than device
  retry periods, if one does not care about latency, to minimize
  the chances of declaring a drive failed (note the difference
  between Linux command timeouts and retry timeouts; the latter
  can also be long).
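
To make that third option concrete, raising the per-device kernel
command timeout looks something like this (the device name and the
180-second value are illustrative, not a recommendation):

```shell
# Raise the kernel's per-device command timeout (in seconds) well
# above any plausible drive-internal retry period, so the kernel
# never gives up first.  Needs root; 'sda' is an example device.
echo 180 > /sys/block/sda/device/timeout
cat /sys/block/sda/device/timeout
```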

> But in the case of SSD drives (where, possibly, the error
> recovery activities performed by the drive firmware are very
> fast) [...]

I guess that depends on the firmware: on one hand MLC cells can
become quite unreliable, especially at higher temperatures,
requiring many retries and lots of ECC; on the other hand, on a
"write", allocating a new erase block is easy because, unlike on
most HDDs, an SSD's FTL keeps logical and physical sector
locations independent. Unfortunately most flash SSD makers don't
supply technical information on details like error recovery
strategies.


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-24 18:41 SSD based sw RAID: is ERC/TLER really important? Gianluca Frustagli
  2021-07-24 20:19 ` Peter Grandi
@ 2021-07-24 20:21 ` Andy Smith
  1 sibling, 0 replies; 10+ messages in thread
From: Andy Smith @ 2021-07-24 20:21 UTC (permalink / raw)
  To: linux-raid

Hi Gianluca,

On Sat, Jul 24, 2021 at 08:41:06PM +0200, Gianluca Frustagli wrote:
> But in the case of SSD drives (where, possibly, the error recovery activities 
> performed by the drive firmware are very fast) does the presence of the 
> ERC/TLER capability really matter?

If the setting is there, why wouldn't you use it? If the error
recovery is always very fast, as you hypothesise, then the low
timeout you set with smartctl -l scterc will never be reached
anyway.
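
For reference, the sort of invocation meant here looks like this (the
device name is an example; scterc values are in deciseconds, so 70
means 7.0 seconds):

```shell
# Query the drive's current SCT Error Recovery Control limits
smartctl -l scterc /dev/sda

# Set read and write recovery limits to 7.0 seconds each, which is
# comfortably below the kernel's default 30-second command timeout
smartctl -l scterc,70,70 /dev/sda
```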

> Is the same scenario from the spinning 
> drives case actually even probable or only theorical? 

I don't know what the typical error recovery behaviour of an SSD is
because in years and years I haven't seen such problems with SSDs,
but where they offer the scterc setting I do use it anyway, on the
basis that it's not going to hurt anything.

I see it available on enterprise SSDs like the Samsung SM883 and
Intel D3-S4610. I don't see it on Supermicro DOM modules.

I also don't see it on NVMe drives at all (e.g. Samsung PM983), and
NVMe seems to be the future of flash so maybe this setting dies
soon…

Cheers,
Andy


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-24 20:19 ` Peter Grandi
@ 2021-07-24 21:45   ` Phil Turmel
  2021-07-25  7:00     ` Wols Lists
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Phil Turmel @ 2021-07-24 21:45 UTC (permalink / raw)
  To: Peter Grandi, list Linux RAID

On 7/24/21 4:19 PM, Peter Grandi wrote:

>> the recovery time after a media error can exceed the kernel
>> timeout, get the drive kicked out of the RAID set and, in turn,
>> lead to the failure of a RAID5 array upon a subsequent error on
>> a second drive.
> 
> My understanding seems different:

Your understanding is incorrect.

> * The purpose of a short device error retry period is the
> opposite: it is to fail a drive as fast as possible, in workloads
> where latency matters (or where there is also the risk of bus/link
> resets hitting multiple drives). In those cases error retry periods
> of 1-2 seconds (at most) are common, rather than the mid-way "7
> seconds" copied-and-pasted from web pages.

Yes, the short ERC setting helps latency, but the primary purpose is to
be shorter than the kernel timeout.

> * The purpose of a long device error retry period is instead to
> minimize the chances of declaring a drive failed, hoping that many
> retries succeed (but note the difference between reads and writes).

Read errors do *not*, by themselves, kick drives out.  It takes several
read errors in a short time to fail a drive out of an array.

A drive not responding before the kernel timeout *will* get it kicked,
though, because the kernel giving up propagates to the raid as a
read error (while the drive is off in la-la land), which then causes
the raid to *reconstruct* the missing sector and *write* it, along
with passing the reconstructed data up the chain.

That write will fail because the drive is still in la-la land, and any
write failure *does* kick the drive out.
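
A quick sanity check for that mismatch might look like this (a sketch:
'sda' and the ERC value are examples, and scterc reports deciseconds):

```shell
# Compare the drive's ERC limit against the kernel command timeout.
erc_ds=70   # e.g. taken from 'smartctl -l scterc /dev/sda', in deciseconds
kernel_s=$(cat /sys/block/sda/device/timeout 2>/dev/null || echo 30)
if [ $((erc_ds / 10)) -lt "$kernel_s" ]; then
    echo "ok: the drive gives up before the kernel does"
else
    echo "mismatch: the kernel may kick the drive while it is recovering"
fi
```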

> * It is possible to set the kernel timeouts higher than device retry
> periods, if one does not care about latency, to minimize the chances
> of declaring a drive failed (note the difference between Linux command
> timeouts and retry timeouts; the latter can also be long).
> 
>> But in the case of SSD drives (where, possibly, the error recovery
>> activities performed by the drive firmware are very fast) [...]
> 
> I guess that depends on the firmware: on one hand MLC cells can
> become quite unreliable, especially at higher temperatures, requiring
> many retries and lots of ECC; on the other hand, on a "write",
> allocating a new erase block is easy because, unlike on most HDDs, an
> SSD's FTL keeps logical and physical sector locations independent.
> Unfortunately most flash SSD makers don't supply technical
> information on details like error recovery strategies.
> 

I don't have data on SSD behavior without ERC.  If their retry cycle is 
exhausted within the kernel default 30 seconds, the timeout mismatch 
issue will *not* apply.

Phil


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-24 21:45   ` Phil Turmel
@ 2021-07-25  7:00     ` Wols Lists
  2021-07-25 10:28     ` Peter Grandi
  2021-07-25 11:04     ` Peter Grandi
  2 siblings, 0 replies; 10+ messages in thread
From: Wols Lists @ 2021-07-25  7:00 UTC (permalink / raw)
  To: Phil Turmel, Peter Grandi, list Linux RAID

On 24/07/21 22:45, Phil Turmel wrote:
> I don't have data on SSD behavior without ERC.  If their retry cycle is
> exhausted within the kernel default 30 seconds, the timeout mismatch
> issue will *not* apply.

I've also seen reports (with spinning rust) implying that the retry
cycle can hang: a read times out, then the next attempt to read the
same data works fine.

It's *possible* the same applies to SSDs, in which case shortening the
timeout could be worthwhile.

And as Phil says, the critical fact is that the drive MUST come back
from la-la-land BEFORE the kernel times out.

Cheers,
Wol


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-24 21:45   ` Phil Turmel
  2021-07-25  7:00     ` Wols Lists
@ 2021-07-25 10:28     ` Peter Grandi
  2021-07-26  1:06       ` Phil Turmel
  2021-07-25 11:04     ` Peter Grandi
  2 siblings, 1 reply; 10+ messages in thread
From: Peter Grandi @ 2021-07-25 10:28 UTC (permalink / raw)
  To: list Linux RAID

>> * The purpose of a long device error retry period is instead to
>> minimize the chances of declaring a drive failed, hoping that many
>> retries succeed (but note the difference between reads and writes).
>> * It is possible to set the kernel timeouts higher than device retry
>> periods, if one does not care about latency, to minimize the
>> chances of declaring a drive failed (not[e] the difference
>> between Linux command timeouts and retry timeouts, the latter
>> can also be long).

> Your understanding is incorrect.
> Read errors do *not* kick drives out. It takes several read
> errors in a short time to fail a drive out of an array.

I am sorry that I was not clear enough and therefore:

* You failed to understand the relevance of "note the difference
  between reads and writes", which I added precisely because I
  guessed that someone unfamiliar with storage devices would need
  that terse qualifier.

* You failed to understand the relevance of the "to minimize the
  chances of declaring a drive failed".

* You failed to realize that I was tersely addressing the
  original poster's case of a drive being declared failed
  because of a drive timeout longer than the kernel command
  timeout, without going into detail about all other possible
  cases.


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-24 21:45   ` Phil Turmel
  2021-07-25  7:00     ` Wols Lists
  2021-07-25 10:28     ` Peter Grandi
@ 2021-07-25 11:04     ` Peter Grandi
  2 siblings, 0 replies; 10+ messages in thread
From: Peter Grandi @ 2021-07-25 11:04 UTC (permalink / raw)
  To: list Linux RAID

>> * It is possible to set the kernel timeouts higher than device retry
>> periods, if one does not care about latency, to minimize the chances
>> of declaring a drive failed (not the difference between Linux command
>> timeouts and retry timeouts, the latter can also be long).

> I don't have data on SSD behavior without ERC. If their retry
> cycle is exhausted within the kernel default 30 seconds, the
> timeout mismatch issue will *not* apply.

That as written may confuse readers as to the difference between
the Linux command timeout and the Linux retry timeout:

  # grep -H . /sys/module/scsi_mod/parameters/eh_deadline 
  /sys/module/scsi_mod/parameters/eh_deadline:-1

  # grep -H . /sys/block/sda/device/*timeout*
  /sys/block/sda/device/eh_timeout:10
  /sys/block/sda/device/timeout:30

Things are different again with the NVMe subsystem:

  # grep -H . /sys/module/nvme_core/parameters/*{timeout,retries,latency}*
  /sys/module/nvme_core/parameters/admin_timeout:60
  /sys/module/nvme_core/parameters/io_timeout:30
  /sys/module/nvme_core/parameters/shutdown_timeout:5
  /sys/module/nvme_core/parameters/max_retries:5
  /sys/module/nvme_core/parameters/default_ps_max_latency_us:100000

Some relevant links:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/configuring-maximum-time-for-storage-error-recovery-with-eh_deadline_managing-storage-devices
https://unix.stackexchange.com/questions/541463/how-to-prevent-disk-i-o-timeouts-which-cause-disks-to-disconnect-and-data-corrup
https://elixir.bootlin.com/linux/v5.13.4/source/drivers/scsi/scsi_error.c


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-25 10:28     ` Peter Grandi
@ 2021-07-26  1:06       ` Phil Turmel
  2021-07-26  7:57         ` Peter Grandi
  0 siblings, 1 reply; 10+ messages in thread
From: Phil Turmel @ 2021-07-26  1:06 UTC (permalink / raw)
  To: Peter Grandi, list Linux RAID

Hi Peter,

On 7/25/21 6:28 AM, Peter Grandi wrote:
>>> * The purpose of a long device error retry period is instead to
>>> minimize the chances of declaring a drive failed, hoping that many
>>> retries succeed (but note the difference between reads and writes).
>>> * It is possible to set the kernel timeouts higher than device retry
>>> periods, if one does not care about latency, to minimize the
>>> chances of declaring a drive failed (not[e] the difference
>>> between Linux command timeouts and retry timeouts, the latter
>>> can also be long).
> 
>> Your understanding is incorrect.
>> Read errors do *not* kick drives out. It takes several read
>> errors in a short time to fail a drive out of an array.
> 
> I am sorry that I was not clear enough and therefore:
> 
> * You failed to understand the relevance of "note the difference
>    between reads and writes", which I added precisely because I
>    guessed that someone unfamiliar with storage devices would need
>    that terse qualifier.
> 
> * You failed to understand the relevance of the "to minimize the
>    chances of declaring a drive failed".
> 
> * You failed to realize that I was tersely addressing the
>    original poster's case of a drive being declared failed
>    because of a drive timeout longer than the kernel command
>    timeout, without going into detail about all other possible
>    cases.
> 

You have reminded me that your mail should have been blackholed--a rule
I put in place many years ago.  I have updated the rule to be more
inclusive.  I must not be the only one ignoring you, causing you to use
multiple subdomains.

No need to reply.  I won't see it.

Phil


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-26  1:06       ` Phil Turmel
@ 2021-07-26  7:57         ` Peter Grandi
  2021-07-26 16:12           ` Peter Grandi
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Grandi @ 2021-07-26  7:57 UTC (permalink / raw)
  To: list Linux RAID

[...]

> I must not be the only one ignoring you, causing you to use
> multiple subdomains.

It seems sad that my occasionally changing addresses, when they get
spammed after being "harvested", may tickle someone's foolish
arrogance into thinking themselves so important that it is done
just to get their worthless attention.
http://www.sabi.co.uk/blog/0705may.html?070527c#070527c

Unfortunately in some cases my technical comments therefore get
replies without any technical content. As to this though:

> without going in detail about all other possible cases.

There are indeed many twists and turns and "legacy" situations,
as in several places timeouts and retries are hardcoded, and
IIRC the "default" for retries is 5. But I have spent a bit of
time looking at some of the weirdness and it turns out that
nowadays the 'sd' module defines a 'max_retries' setting in the
device attributes (rather than a module parameter as in
'nvme_core'):

  https://elixir.bootlin.com/linux/latest/source/drivers/scsi/sd.c#L598

It is only available from kernel 5.10 onwards and would, for example, be at:

  /sys/class/scsi_disk/0:0:0:0/max_retries
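
It can be checked in the same style as the sysfs listings elsewhere in
this thread (the path is per-device and an example):

```shell
# On kernels >= 5.10, one entry per SCSI disk:
grep -H . /sys/class/scsi_disk/*/max_retries
```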

I have also noticed that XFS bizarrely has its own layer of
recovery on top of that of the Linux IO subsystems and of the
device itself:

  https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_file_systems/configuring-xfs-error-behavior_managing-file-systems
  https://elixir.bootlin.com/linux/v5.13.5/source/fs/xfs/xfs_buf.c#L1264


* Re: SSD based sw RAID: is ERC/TLER really important?
  2021-07-26  7:57         ` Peter Grandi
@ 2021-07-26 16:12           ` Peter Grandi
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Grandi @ 2021-07-26 16:12 UTC (permalink / raw)
  To: list Linux RAID

[...]
> I have also noticed that XFS bizarrely has its own layer of
> recovery on top of that of the Linux IO subsystems and of the
> device itself:

>   https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_file_systems/configuring-xfs-error-behavior_managing-file-systems
>   https://elixir.bootlin.com/linux/v5.13.5/source/fs/xfs/xfs_buf.c#L1264

Some people have been proposing to do the same with MD RAID, and a
similar layer is sometimes embedded in the firmware of hardware RAID
host adapters.

Consider this scenario (including the relevant error handling and
timeouts at each layer):

  * The device firmware does physical operation retries.
  * The Linux IO subsystem does device operation retries.
  * MD RAID does Linux IO operation retries.
  * XFS does MD RAID IO operation retries.

Is there something weird with this?
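
One way to see what is weird: the worst cases multiply. With purely
illustrative numbers (7 seconds per device attempt, 5 attempts at the
firmware level, 5 at the SCSI level, 2 in MD, 2 in XFS):

```shell
# Worst-case stacked-retry latency, illustrative numbers only:
# 7 s per device attempt x 5 firmware x 5 SCSI x 2 MD x 2 XFS attempts
echo "$((7 * 5 * 5 * 2 * 2)) seconds"   # prints: 700 seconds
```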

BTW there is something similar with read ahead or write behind:
sometimes several software layers assume that lower software
layers are doing too little of it, and do their own additional
read ahead or write behind.


end of thread, other threads:[~2021-07-26 16:12 UTC | newest]
