* md RAID with enterprise-class SATA or SAS drives
@ 2012-05-09 22:00 Daniel Pocock
  2012-05-09 22:33 ` Marcus Sorensen
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Pocock @ 2012-05-09 22:00 UTC (permalink / raw)
  To: linux-raid



There is various information about
- enterprise-class drives (either SAS or just enterprise SATA)
- the SCSI/SAS protocols themselves vs SATA
having more advanced features (e.g. for dealing with error conditions)
than the average block device

For example, Adaptec recommends that such drives will work better with
their hardware RAID cards:

http://ask.adaptec.com/cgi-bin/adaptec_tic.cfg/php/enduser/std_adp.php?p_faqid=14596
"Desktop class disk drives have an error recovery feature that will
result in a continuous retry of the drive (read or write) when an error
is encountered, such as a bad sector. In a RAID array this can cause the
RAID controller to time-out while waiting for the drive to respond."

and this blog:
http://www.adaptec.com/blog/?p=901
"major advantages to enterprise drives (TLER for one) ... opt for the
enterprise drives in a RAID environment no matter what the cost of the
drive over the desktop drive"

My question..

- does Linux md RAID actively use the more advanced features of these
drives, e.g. to work around errors?

- if a non-RAID SAS card is used, does it matter which card is chosen?
Does md work equally well with all of them?

- ignoring the better MTBF and seek times of these drives, do any of the
other features passively contribute to a better RAID experience when
using md?

- for someone using SAS or enterprise SATA drives with Linux, is there
any particular benefit to using md RAID, dmraid or filesystem (e.g.
btrfs) RAID (apart from the btrfs having checksums)?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-09 22:00 md RAID with enterprise-class SATA or SAS drives Daniel Pocock
@ 2012-05-09 22:33 ` Marcus Sorensen
  2012-05-10 13:34   ` Daniel Pocock
  2012-05-10 13:51   ` Phil Turmel
  0 siblings, 2 replies; 51+ messages in thread
From: Marcus Sorensen @ 2012-05-09 22:33 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: linux-raid

I can't speak to all of these, but...

On Wed, May 9, 2012 at 4:00 PM, Daniel Pocock <daniel@pocock.com.au> wrote:
>
>
> There is various information about
> - enterprise-class drives (either SAS or just enterprise SATA)
> - the SCSI/SAS protocols themselves vs SATA
> having more advanced features (e.g. for dealing with error conditions)
> than the average block device
>
> For example, Adaptec recommends that such drives will work better with
> their hardware RAID cards:
>
> http://ask.adaptec.com/cgi-bin/adaptec_tic.cfg/php/enduser/std_adp.php?p_faqid=14596
> "Desktop class disk drives have an error recovery feature that will
> result in a continuous retry of the drive (read or write) when an error
> is encountered, such as a bad sector. In a RAID array this can cause the
> RAID controller to time-out while waiting for the drive to respond."
>
> and this blog:
> http://www.adaptec.com/blog/?p=901
> "major advantages to enterprise drives (TLER for one) ... opt for the
> enterprise drives in a RAID environment no matter what the cost of the
> drive over the desktop drive"
>
> My question..
>
> - does Linux md RAID actively use the more advanced features of these
> drives, e.g. to work around errors?

TLER and its ilk simply give up quickly on errors. This may be good
for a RAID card that otherwise would reset itself if it doesn't get a
timely response from a drive, but it can be bad for md RAID. It
essentially increases the chance that you won't be able to rebuild:
you lose drive A of a 2 x 3TB RAID 1, and then during rebuild drive B
has an error and the disk gives up after 7 seconds, rather than doing
all of its fancy off-sector reads and whatever else it would normally
do to save your last good copy.

>
> - if a non-RAID SAS card is used, does it matter which card is chosen?
> Does md work equally well with all of them?

Yes, I believe md raid would work equally well on all SAS HBAs,
however the cards themselves vary in performance. Some cards that have
simple RAID built-in can be flashed to a dumb card in order to reclaim
more card memory (LSI "IR mode" cards), but the performance gain is
generally minimal

>
> - ignoring the better MTBF and seek times of these drives, do any of the
> other features passively contribute to a better RAID experience when
> using md?

Not that I know of, but I'd be interested in hearing what others think.

>
> - for someone using SAS or enterprise SATA drives with Linux, is there
> any particular benefit to using md RAID, dmraid or filesystem (e.g.
> btrfs) RAID (apart from the btrfs having checksums)?

As opposed to hardware RAID? The main thing I think of is freedom from
vendor lock-in. If you lose your card you don't have to run around
finding another that is compatible with the hardware RAID's on-disk
metadata format that was deprecated last year. Last I checked,
performance was pretty great with md, and you can get fancy and spread
your array across multiple controllers and things like that. Finally,
md RAID tends to have a better feature set than the hardware, for
example N-disk mirrors. I like running a 3 way mirror over 2 way +
hotspare.
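
For reference, creating a 3-way mirror is just something like this
(device names here are only an example):

  mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1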


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-09 22:33 ` Marcus Sorensen
@ 2012-05-10 13:34   ` Daniel Pocock
  2012-05-10 13:51   ` Phil Turmel
  1 sibling, 0 replies; 51+ messages in thread
From: Daniel Pocock @ 2012-05-10 13:34 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: linux-raid


>> My question..
>>
>> - does Linux md RAID actively use the more advanced features of these
>> drives, e.g. to work around errors?
> 
> TLER and its ilk simply give up quickly on errors. This may be good
> for a RAID card that otherwise would reset itself if it doesn't get a
> timely response from a drive, but it can be bad for md RAID. It
> essentially increases the chance that you won't be able to rebuild,
> you lose drive A of a 2 x 3TB RAID 1, and then during rebuild drive B
> has an error and the disk gives up after 7 seconds, rather than doing
> all of its fancy off-sector reads and whatever else it would normally
> do to save your last good copy.

Is TLER a feature that can be turned on and off, like write caches?

Or can the RAID solution (either md or hardware RAID cards) tell the
drive to keep trying the sector if it really can't find the same data
on another drive in the array?

>>
>> - for someone using SAS or enterprise SATA drives with Linux, is there
>> any particular benefit to using md RAID, dmraid or filesystem (e.g.
>> btrfs) RAID (apart from the btrfs having checksums)?
> 
> As opposed to hardware RAID? The main thing I think of is freedom from

Not quite... I was asking how the Linux RAID solutions (particularly for
RAID1) compare to each other

I'm aware that, theoretically, btrfs has the advantage of checksums: if
a disk reads successfully but the data has a bad checksum, btrfs will
look for the data elsewhere.

But what about other features: does btrfs work better with a TLER drive
than md with the same drive?  Or the other way around?

> vendor lock-in. If you lose your card you don't have to run around
> finding another that is compatible with the hardware RAID's on-disk
> metadata format that was deprecated last year. Last I checked,

I'm well aware of that one - HP has some great RAID cards, but I'm
nervous about the fact they don't let you access the raw drive.

Adaptec cards that advertise `JBOD' support apparently let you bypass
the RAID functions and access the disks directly (so you can use md or
btrfs directly on the disks).

> performance was pretty great with md, and you can get fancy and spread
> your array across multiple controllers and things like that. Finally,
> md RAID tends to have a better feature set than the hardware, for
> example N-disk mirrors. I like running a 3 way mirror over 2 way +
> hotspare.

I'd agree with those comments - I actually use md at the moment on an HP
Microserver; it just saved me from a dead drive this week, and I'm
weighing up whether to replace the drive with the same (a Barracuda) or
something better (e.g. a Seagate Constellation SATA or even SAS)



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-09 22:33 ` Marcus Sorensen
  2012-05-10 13:34   ` Daniel Pocock
@ 2012-05-10 13:51   ` Phil Turmel
  2012-05-10 14:59     ` Daniel Pocock
                       ` (2 more replies)
  1 sibling, 3 replies; 51+ messages in thread
From: Phil Turmel @ 2012-05-10 13:51 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Daniel Pocock, linux-raid

I'm afraid I have to disagree with Marcus ...

And other observations ...

On 05/09/2012 06:33 PM, Marcus Sorensen wrote:
> I can't speak to all of these, but...
> 
> On Wed, May 9, 2012 at 4:00 PM, Daniel Pocock <daniel@pocock.com.au> wrote:
>>
>>
>> There is various information about
>> - enterprise-class drives (either SAS or just enterprise SATA)
>> - the SCSI/SAS protocols themselves vs SATA
>> having more advanced features (e.g. for dealing with error conditions)
>> than the average block device
>>
>> For example, Adaptec recommends that such drives will work better with
>> their hardware RAID cards:
>>
>> http://ask.adaptec.com/cgi-bin/adaptec_tic.cfg/php/enduser/std_adp.php?p_faqid=14596
>> "Desktop class disk drives have an error recovery feature that will
>> result in a continuous retry of the drive (read or write) when an error
>> is encountered, such as a bad sector. In a RAID array this can cause the
>> RAID controller to time-out while waiting for the drive to respond."

Linux direct drivers will also time out in this case, although the
driver timeout is adjustable.  Default is 30 seconds, while desktop
drives usually keep trying to recover errors for minutes at a time.

>> and this blog:
>> http://www.adaptec.com/blog/?p=901
>> "major advantages to enterprise drives (TLER for one) ... opt for the
>> enterprise drives in a RAID environment no matter what the cost of the
>> drive over the desktop drive"

Unless you find drives that support SCTERC, which allows you to tell
the drives to use a more reasonable timeout (typically 7 seconds).

Unfortunately, SCTERC is not a persistent parameter, so it needs to be
set on every powerup (udev rule is the best).
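
For example, something along these lines should work (a sketch I haven't
tested; adjust the device match and the smartctl path for your system,
and drives without SCTERC support will simply reject the command):

  # /etc/udev/rules.d/60-scterc.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"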

>> My question..
>>
>> - does Linux md RAID actively use the more advanced features of these
>> drives, e.g. to work around errors?
> 
> TLER and its ilk simply give up quickly on errors. This may be good
> for a RAID card that otherwise would reset itself if it doesn't get a
> timely response from a drive, but it can be bad for md RAID. It
> essentially increases the chance that you won't be able to rebuild,
> you lose drive A of a 2 x 3TB RAID 1, and then during rebuild drive B
> has an error and the disk gives up after 7 seconds, rather than doing
> all of its fancy off-sector reads and whatever else it would normally
> do to save your last good copy.

Here is where Marcus and I part ways.  A very common report I see on
this mailing list is people who have lost arrays where the drives all
appear to be healthy.  Given the large size of today's hard drives,
even healthy drives will occasionally have an unrecoverable read error.

When this happens in a raid array with a desktop drive without SCTERC,
the driver times out and reports an error to MD.  MD proceeds to
reconstruct the missing data and tries to write it back to the bad
sector.  However, that drive is still trying to read the bad sector and
ignores the controller.  The write is immediately rejected.  BOOM!  The
*write* error ejects that member from the array.  And you are now
degraded.

If you don't notice the degraded array right away, you probably won't
notice until a URE on another drive pops up.  Once that happens, you
can't complete a resync to revive the array.

Running a "check" or "repair" on an array without TLER will have the
opposite of the intended effect: any URE will kick a drive out instead
of fixing it.

In the same scenario with an enterprise drive, or a drive with SCTERC
turned on, the drive read times out before the controller driver, the
controller never resets the link to the drive, and the followup write
succeeds.  (The sector is either successfully corrected in place, or
it is relocated by the drive.)  No BOOM.

>> - if a non-RAID SAS card is used, does it matter which card is chosen?
>> Does md work equally well with all of them?
> 
> Yes, I believe md raid would work equally well on all SAS HBAs,
> however the cards themselves vary in performance. Some cards that have
> simple RAID built-in can be flashed to a dumb card in order to reclaim
> more card memory (LSI "IR mode" cards), but the performance gain is
> generally minimal

Hardware RAID cards usually offer battery-backed write cache, which is
very valuable in some applications.  I don't have a need for that kind
of performance, so I can't speak to the details.  (Is Stan H.
listening?)

>> - ignoring the better MTBF and seek times of these drives, do any of the
>> other features passively contribute to a better RAID experience when
>> using md?
> 
> Not that I know of, but I'd be interested in hearing what others think.

They power up with TLER enabled, where the desktop drives don't.  You've
excluded the MTBF and seek performance as criteria, which I believe are
the only remaining advantages, and not that important to light-duty
users.

The drive manufacturers have noticed this, by the way.  Most of them
no longer offer SCTERC in their desktop products, as they want RAID
users to buy their more expensive (and profitable) drives.  I was burned
by this when I replaced some Seagate Barracuda 7200.11 1T drives (which
support SCTERC) with Seagate Barracuda Green 2T drives (which don't).

Neither Seagate nor Western Digital offer any desktop drive with any
form of time-limited error recovery.  Seagate and WD were my "go to"
brands for RAID.  I am now buying Hitachi, as they haven't (yet)
followed their peers.  The "I" in RAID stands for "inexpensive",
after all.

>> - for someone using SAS or enterprise SATA drives with Linux, is there
>> any particular benefit to using md RAID, dmraid or filesystem (e.g.
>> btrfs) RAID (apart from the btrfs having checksums)?
> 
> As opposed to hardware RAID? The main thing I think of is freedom from
> vendor lock-in. If you lose your card you don't have to run around
> finding another that is compatible with the hardware RAID's on-disk
> metadata format that was deprecated last year. Last I checked,
> performance was pretty great with md, and you can get fancy and spread
> your array across multiple controllers and things like that. Finally,
> md RAID tends to have a better feature set than the hardware, for
> example N-disk mirrors. I like running a 3 way mirror over 2 way +
> hotspare.

Concur.  Software RAID's feature set is impressive, with great
performance.

FWIW, I *always* use LVM on top of my arrays, simply for the flexibility
to re-arrange layouts on-the-fly.  Any performance impact from that has
never bothered my small systems.

HTH,

Phil

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 13:51   ` Phil Turmel
@ 2012-05-10 14:59     ` Daniel Pocock
  2012-05-10 15:15       ` Phil Turmel
  2012-05-10 15:26     ` Marcus Sorensen
  2012-05-10 21:15     ` Stan Hoeppner
  2 siblings, 1 reply; 51+ messages in thread
From: Daniel Pocock @ 2012-05-10 14:59 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Marcus Sorensen, linux-raid


> Here is where Marcus and I part ways.  A very common report I see on
> this mailing list is people who have lost arrays where the drives all
> appear to be healthy.  Given the large size of today's hard drives,
> even healthy drives will occasionally have an unrecoverable read error.
> 
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD.  MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector.  However, that drive is still trying to read the bad sector and
> ignores the controller.  The write is immediately rejected.  BOOM!  The
> *write* error ejects that member from the array.  And you are now
> degraded.
> 
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up.  Once that happens, you
> can't complete a resync to revive the array.

What action would you recommend for someone running md on desktop drives
today?  Can md be configured in some way to avoid such a disaster?

> 
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
> 
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds.  (The sector is either successfully corrected in place, or
> it is relocated by the drive.)  No BOOM.

I tend to agree with that approach, and I think that is what Adaptec is
proposing in their FAQ

Presumably, if you really do need one of those sectors, the SCTERC
timeout can be extended (e.g. by disk recovery software) to try harder?


>>> - if a non-RAID SAS card is used, does it matter which card is chosen?
>>> Does md work equally well with all of them?
>>
>> Yes, I believe md raid would work equally well on all SAS HBAs,
>> however the cards themselves vary in performance. Some cards that have
>> simple RAID built-in can be flashed to a dumb card in order to reclaim
>> more card memory (LSI "IR mode" cards), but the performance gain is
>> generally minimal
> 
> Hardware RAID cards usually offer battery-backed write cache, which is
> very valuable in some applications.  I don't have a need for that kind
> of performance, so I can't speak to the details.  (Is Stan H.
> listening?)

BBWC is not just expensive; it also has extra management overhead:
batteries need to have full discharges occasionally (at a time when the
cache is off), routine battery replacement, etc.
> 
> FWIW, I *always* use LVM on top of my arrays, simply for the flexibility
> to re-arrange layouts on-the-fly.  Any performance impact that has has
> never bothered my small systems.

I'm a big fan of volume managers too: each different type of data has
its own LV, which makes it very easy to look at the data at a high level
and see how many GB are used by photos, how many by software downloads,
etc.  Each LV typically has a specific backup regimen.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 14:59     ` Daniel Pocock
@ 2012-05-10 15:15       ` Phil Turmel
  0 siblings, 0 replies; 51+ messages in thread
From: Phil Turmel @ 2012-05-10 15:15 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Marcus Sorensen, linux-raid

On 05/10/2012 10:59 AM, Daniel Pocock wrote:
> 
>> Here is where Marcus and I part ways.  A very common report I see on
>> this mailing list is people who have lost arrays where the drives all
>> appear to be healthy.  Given the large size of today's hard drives,
>> even healthy drives will occasionally have an unrecoverable read error.
>>
>> When this happens in a raid array with a desktop drive without SCTERC,
>> the driver times out and reports an error to MD.  MD proceeds to
>> reconstruct the missing data and tries to write it back to the bad
>> sector.  However, that drive is still trying to read the bad sector and
>> ignores the controller.  The write is immediately rejected.  BOOM!  The
>> *write* error ejects that member from the array.  And you are now
>> degraded.
>>
>> If you don't notice the degraded array right away, you probably won't
>> notice until a URE on another drive pops up.  Once that happens, you
>> can't complete a resync to revive the array.
> 
> What action would you recommend for someone running md on desktop drives
> today?  Can md be configured in some way to avoid such a disaster?

You have to set the controller's link timeout greater than the worst-
case recovery time.  Unfortunately, that's generally not specified, and
therefore only discovered when you have a real URE.  In my experience,
it's on the order of two to three minutes.

One thing to keep in mind:  If you set the controller timeout that high,
you may encounter protocol timeouts in your services running on top of
those filesystems.  So it isn't a general solution.

FWIW:  /sys/block/sdX/device/timeout
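
For example, to allow three minutes (sdX as appropriate):

  echo 180 > /sys/block/sdX/device/timeout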

>> Running a "check" or "repair" on an array without TLER will have the
>> opposite of the intended effect: any URE will kick a drive out instead
>> of fixing it.
>>
>> In the same scenario with an enterprise drive, or a drive with SCTERC
>> turned on, the drive read times out before the controller driver, the
>> controller never resets the link to the drive, and the followup write
>> succeeds.  (The sector is either successfully corrected in place, or
>> it is relocated by the drive.)  No BOOM.
> 
> I tend to agree with that approach, and I think that is what Adaptec is
> proposing in their FAQ
> 
> Presumably, if you really do need one of those sectors, the SCTERC
> timeout can be extended (e.g. by disk recovery software) to try harder?

Sure.  SCTERC is set by the smartctl command.  If you need to run
dd_rescue or some other recovery tool on a disk, you can simply set
SCTERC back to zero (disabled).  Or cycle power on the drive.  But you
would also have to set the controller's timeout, or it is pointless.
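
Roughly, for a drive you are trying to rescue (300 is only an example
value):

  smartctl -l scterc,0,0 /dev/sdX            # disable ERC, let the drive retry as long as it needs
  echo 300 > /sys/block/sdX/device/timeout   # and give the driver a correspondingly long timeout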

I don't know what you'd do with an enterprise drive that has TLER by
default.

>>>> - if a non-RAID SAS card is used, does it matter which card is chosen?
>>>> Does md work equally well with all of them?
>>>
>>> Yes, I believe md raid would work equally well on all SAS HBAs,
>>> however the cards themselves vary in performance. Some cards that have
>>> simple RAID built-in can be flashed to a dumb card in order to reclaim
>>> more card memory (LSI "IR mode" cards), but the performance gain is
>>> generally minimal
>>
>> Hardware RAID cards usually offer battery-backed write cache, which is
>> very valuable in some applications.  I don't have a need for that kind
>> of performance, so I can't speak to the details.  (Is Stan H.
>> listening?)
> 
> BBWC is not just expensive, it also has an extra management overhead,
> batteries need to have full discharges occasionally (at a time when
> cache is off), routine battery replacement, etc

I haven't had to deal with this :-)

Phil

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 13:51   ` Phil Turmel
  2012-05-10 14:59     ` Daniel Pocock
@ 2012-05-10 15:26     ` Marcus Sorensen
  2012-05-10 16:04       ` Phil Turmel
  2012-05-10 21:43       ` Stan Hoeppner
  2012-05-10 21:15     ` Stan Hoeppner
  2 siblings, 2 replies; 51+ messages in thread
From: Marcus Sorensen @ 2012-05-10 15:26 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Daniel Pocock, linux-raid

On Thu, May 10, 2012 at 7:51 AM, Phil Turmel <philip@turmel.org> wrote:
> I'm afraid I have to disagree with Marcus ...
>
> And other observations ...
>
> On 05/09/2012 06:33 PM, Marcus Sorensen wrote:
>> I can't speak to all of these, but...
>>
>> On Wed, May 9, 2012 at 4:00 PM, Daniel Pocock <daniel@pocock.com.au> wrote:
>>>
>>>
>>> There is various information about
>>> - enterprise-class drives (either SAS or just enterprise SATA)
>>> - the SCSI/SAS protocols themselves vs SATA
>>> having more advanced features (e.g. for dealing with error conditions)
>>> than the average block device
>>>
>>> For example, Adaptec recommends that such drives will work better with
>>> their hardware RAID cards:
>>>
>>> http://ask.adaptec.com/cgi-bin/adaptec_tic.cfg/php/enduser/std_adp.php?p_faqid=14596
>>> "Desktop class disk drives have an error recovery feature that will
>>> result in a continuous retry of the drive (read or write) when an error
>>> is encountered, such as a bad sector. In a RAID array this can cause the
>>> RAID controller to time-out while waiting for the drive to respond."
>
> Linux direct drivers will also time out in this case, although the
> driver timeout is adjustable.  Default is 30 seconds, while desktop
> drives usually keep trying to recover errors for minutes at a time.
>
>>> and this blog:
>>> http://www.adaptec.com/blog/?p=901
>>> "major advantages to enterprise drives (TLER for one) ... opt for the
>>> enterprise drives in a RAID environment no matter what the cost of the
>>> drive over the desktop drive"
>
> Unless you find drives that support SCTERC, which allows you to tell
> the drives to use a more reasonable timeout (typically 7 seconds).
>
> Unfortunately, SCTERC is not a persistent parameter, so it needs to be
> set on every powerup (udev rule is the best).

See smartctl for more info on how to do this. I think smartctl 5.40 has
this, although as far as I'm aware, the only desktop drives that allow
you to set a timeout are the Hitachi Deskstars. Trunk also has an APM
patch (provided by me :-) that allows you to adjust head parking/drive
sleep times, if supported, for those who care.
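
To see what a given drive supports and what it is currently set to,
something like this should do it (assuming a recent enough smartctl):

  smartctl -l scterc /dev/sdX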

>
>>> My question..
>>>
>>> - does Linux md RAID actively use the more advanced features of these
>>> drives, e.g. to work around errors?
>>
>> TLER and its ilk simply give up quickly on errors. This may be good
>> for a RAID card that otherwise would reset itself if it doesn't get a
>> timely response from a drive, but it can be bad for md RAID. It
>> essentially increases the chance that you won't be able to rebuild,
>> you lose drive A of a 2 x 3TB RAID 1, and then during rebuild drive B
>> has an error and the disk gives up after 7 seconds, rather than doing
>> all of its fancy off-sector reads and whatever else it would normally
>> do to save your last good copy.
>
> Here is where Marcus and I part ways.  A very common report I see on
> this mailing list is people who have lost arrays where the drives all
> appear to be healthy.  Given the large size of today's hard drives,
> even healthy drives will occasionally have an unrecoverable read error.
>
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD.  MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector.  However, that drive is still trying to read the bad sector and
> ignores the controller.  The write is immediately rejected.  BOOM!  The
> *write* error ejects that member from the array.  And you are now
> degraded.
>
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up.  Once that happens, you
> can't complete a resync to revive the array.
>
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
>
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds.  (The sector is either successfully corrected in place, or
> it is relocated by the drive.)  No BOOM.
>

Agreed. In the past there has been some debate about this. I think it
comes down to your use case, the data involved and what you expect.
TLER/ERC can generally make your array more durable to minor hiccups,
and is likely preferred if you can stomach the cost, at the potential
risk that I described.  If the failure is a simple one-off read
failure, then Phil's scenario is very likely. If the drive is really
going bad (say hitting max_read_errors), then the disk won't try very
hard to recover your data, at which point you have to hope the other
drive doesn't have even a minor read error when rebuilding, because it
also will not try very hard. In the end it's up to you what behavior
you want.

Here are a few odd things to consider, if you're worried about this topic:

* Using smartctl to increase the ERC timeout on enterprise SATA
drives, say to 25 seconds, for use with md. I have no idea if this
will cause the drive to actually try different methods of recovery,
but it could be a good middle ground.

* Increasing max_read_errors in an attempt to keep a TLER/ERC disk in
the loop longer. The only reason to do this would be if you were
proactive in monitoring said errors and could add in more redundancy
before pulling the failing drive, thus increasing your chances that
the rebuild succeeds, having more *mostly* good copies.

* Increasing the SCSI timeout on your desktop drives to 60 seconds or
more, giving the drive a chance to succeed in deep recovery. This may
cause IO to block for a while, so again it depends on your usage
scenario.

* Frequent array checks - perhaps in combination with the above, can
increase the likelihood that you find errors in a timely manner and
increase the chances that the rebuild will succeed if you've only got
one good copy left.

I'm sure there's more, but you get the point. In the end it's simply
another testament to how flexible and configurable software RAID is.
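
To make a couple of those concrete (a very rough sketch; device names
and values are only examples, and max_read_errors needs a reasonably
recent kernel):

  # let md tolerate more corrected read errors before kicking a member (default is 20)
  echo 50 > /sys/block/md0/md/max_read_errors
  # give a desktop drive time for deep recovery
  echo 120 > /sys/block/sdb/device/timeout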

>>> - if a non-RAID SAS card is used, does it matter which card is chosen?
>>> Does md work equally well with all of them?
>>
>> Yes, I believe md raid would work equally well on all SAS HBAs,
>> however the cards themselves vary in performance. Some cards that have
>> simple RAID built-in can be flashed to a dumb card in order to reclaim
>> more card memory (LSI "IR mode" cards), but the performance gain is
>> generally minimal
>
> Hardware RAID cards usually offer battery-backed write cache, which is
> very valuable in some applications.  I don't have a need for that kind
> of performance, so I can't speak to the details.  (Is Stan H.
> listening?)

I'm not aware of non-RAID SAS cards that provide writeback cache.  At
least none that are battery backed. However, many RAID cards will
allow you to create 1 disk RAID arrays that can be battery backed.
Even better, newer hardware RAID cards offer capacitor backup. They
include a flash module, with a capacitor that has enough juice to
write the contents from RAM to flash. This allows them to not require
the maintenance of batteries and to have a far longer retention time.

>
>>> - ignoring the better MTBF and seek times of these drives, do any of the
>>> other features passively contribute to a better RAID experience when
>>> using md?
>>
>> Not that I know of, but I'd be interested in hearing what others think.
>
> They power up with TLER enabled, where the desktop drives don't.  You've
> excluded the MTBF and seek performance as criteria, which I believe are
> the only remaining advantages, and not that important to light-duty
> users.
>
> The drive manufacturers have noticed this, by the way.  Most of them
> no longer offer SCTERC in their desktop products, as they want RAID
> users to buy their more expensive (and profitable) drives.  I was burned
> by this when I replaced some Seagate Barracuda 7200.11 1T drives (which
> support SCTERC) with Seagate Barracuda Green 2T drives (which don't).
>
> Neither Seagate nor Western Digital offer any desktop drive with any
> form of time-limited error recovery.  Seagate and WD were my "go to"
> brands for RAID.  I am now buying Hitachi, as they haven't (yet)
> followed their peers.  The "I" in RAID stands for "inexpensive",
> after all.

I keep hearing that, and I was always under the impression that the
"I" stood for "Independent", as you can do RAID with any independent
disk, cheap or expensive. Seems it was changed mid-90's. I suppose
both are accepted, but perhaps the one we use says something about our
level of seniority :-)

>
>>> - for someone using SAS or enterprise SATA drives with Linux, is there
>>> any particular benefit to using md RAID, dmraid or filesystem (e.g.
>>> btrfs) RAID (apart from the btrfs having checksums)?
>>
>> As opposed to hardware RAID? The main thing I think of is freedom from
>> vendor lock-in. If you lose your card you don't have to run around
>> finding another that is compatible with the hardware RAID's on-disk
>> metadata format that was deprecated last year. Last I checked,
>> performance was pretty great with md, and you can get fancy and spread
>> your array across multiple controllers and things like that. Finally,
>> md RAID tends to have a better feature set than the hardware, for
>> example N-disk mirrors. I like running a 3 way mirror over 2 way +
>> hotspare.
>
> Concur.  Software RAID's feature set is impressive, with great
> performance.
>
> FWIW, I *always* use LVM on top of my arrays, simply for the flexibility
> to re-arrange layouts on-the-fly.  Any performance impact that has has
> never bothered my small systems.
>
> HTH,
>
> Phil

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 15:26     ` Marcus Sorensen
@ 2012-05-10 16:04       ` Phil Turmel
  2012-05-10 17:53         ` Keith Keller
  2012-05-10 18:42         ` Daniel Pocock
  2012-05-10 21:43       ` Stan Hoeppner
  1 sibling, 2 replies; 51+ messages in thread
From: Phil Turmel @ 2012-05-10 16:04 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Daniel Pocock, linux-raid

On 05/10/2012 11:26 AM, Marcus Sorensen wrote:
> On Thu, May 10, 2012 at 7:51 AM, Phil Turmel <philip@turmel.org> wrote:

[trim /]

>> Here is where Marcus and I part ways.  A very common report I see on
>> this mailing list is people who have lost arrays where the drives all
>> appear to be healthy.  Given the large size of today's hard drives,
>> even healthy drives will occasionally have an unrecoverable read error.
>>
>> When this happens in a raid array with a desktop drive without SCTERC,
>> the driver times out and reports an error to MD.  MD proceeds to
>> reconstruct the missing data and tries to write it back to the bad
>> sector.  However, that drive is still trying to read the bad sector and
>> ignores the controller.  The write is immediately rejected.  BOOM!  The
>> *write* error ejects that member from the array.  And you are now
>> degraded.
>>
>> If you don't notice the degraded array right away, you probably won't
>> notice until a URE on another drive pops up.  Once that happens, you
>> can't complete a resync to revive the array.
>>
>> Running a "check" or "repair" on an array without TLER will have the
>> opposite of the intended effect: any URE will kick a drive out instead
>> of fixing it.
>>
>> In the same scenario with an enterprise drive, or a drive with SCTERC
>> turned on, the drive read times out before the controller driver, the
>> controller never resets the link to the drive, and the followup write
>> succeeds.  (The sector is either successfully corrected in place, or
>> it is relocated by the drive.)  No BOOM.
>>
> 
> Agreed. In the past there has been some debate about this. I think it
> comes down to your use case, the data involved and what you expect.
> TLER/ERC can generally make your array more durable to minor hiccups,
> and is likely preferred if you can stomach the cost, at the potential
> risk that I described.  If the failure is a simple one-off read
> failure, then Phil's scenario is very likely. If the drive is really
> going bad (say hitting max_read_errors), then the disk won't try very
> hard to recover your data, at which point you have to hope the other
> drive doesn't have even a minor read error when rebuilding, because it
> also will not try very hard. In the end it's up to you what behavior
> you want.

Well, I approach this from the assumption that the normal condition
of a production RAID array is *non-degraded*.  You don't want isolated
read errors to hold up your application when the data can be quickly
reconstructed from the redundancy.  And you certainly don't want
transient errors to kick drives out of the array.

Coordinating the drive and the controller timeouts is the *only* way
to avoid the URE kickout scenario.

Changing TLER/ERC when an array becomes degraded for a real hardware
failure is a useful idea. I think I'll look at scripting that.

> Here are a few odd things to consider, if you're worried about this topic:
> 
> * Using smartctl to increase the ERC timeout on enterprise SATA
> drives, say to 25 seconds, for use with md. I have no idea if this
> will cause the drive to actually try different methods of recovery,
> but it could be a good middle ground.

For a healthy array, I think this is counter-productive, as you are
holding up your applications.  Any sector that is marginal and needs
that much time to recover really ought to be re-written anyways.

> * increasing max_read_errors in an attempt to keep a TLER/ERC disk in
> the loop longer. The only reason to do this would be if you were
> proactive in monitoring said errors and could add in more redundancy
> before pulling the failing drive, thus increasing your chances that
> the rebuild succeeds, having more *mostly* good copies.
> 
> * Increasing the SCSI timeout on your desktop drives to 60 seconds or
> more, giving the drive a chance to succeed in deep recovery. This may
> cause IO to block for awhile, so again it depends on your usage
> scenario.

I can understand using all available means to resync/rebuild a
degraded array, but I can't see leaving those settings on a healthy
array.

> * frequent array checks - perhaps in combination with the above, can
> increase the likelihood that you find errors in a timely manner and
> increase the chances that the rebuild will succeed if you've only got
> one good copy left.

Frequent array checks are not optional, if you want to flush out any UREs
in the making, and maximize your odds of successfully rebuilding after
a drive replacement.  If you are running RAID6 or a triple mirror, with
frequent checks, you are very safe.

[...]

>> Neither Seagate nor Western Digital offer any desktop drive with any
>> form of time-limited error recovery.  Seagate and WD were my "go to"
>> brands for RAID.  I am now buying Hitachi, as they haven't (yet)
>> followed their peers.  The "I" in RAID stands for "inexpensive",
>> after all.
> 
> I keep hearing that, and I was always under the impression that the
> "I" stood for "Independent", as you can do RAID with any independent
> disk, cheap or expensive. Seems it was changed mid-90's. I suppose
> both are accepted, but perhaps the one we use says something about our
> level of seniority :-)

Hmmm.  I hadn't noticed the change to "independent".  Can't allow any
premium technology to be inexpensive, can we?

And yes, there's grey in my beard.

Phil

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 16:04       ` Phil Turmel
@ 2012-05-10 17:53         ` Keith Keller
  2012-05-10 18:10           ` Mathias Burén
  2012-05-10 18:23           ` Phil Turmel
  2012-05-10 18:42         ` Daniel Pocock
  1 sibling, 2 replies; 51+ messages in thread
From: Keith Keller @ 2012-05-10 17:53 UTC (permalink / raw)
  To: linux-raid

On 2012-05-10, Phil Turmel <philip@turmel.org> wrote:
>
> Frequent array checks are not optional, if you want to flush out any UREs
> in the making, and maximize your odds of successfully rebuilding after
> a drive replacement.  If you are running RAID6 or a triple mirror, with
> frequent checks, you are very safe.

What do people define as "frequent"?  I typically do monthly
checks, but perhaps I should be doing them weekly.

--keith


-- 
kkeller@wombat.san-francisco.ca.us



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 17:53         ` Keith Keller
@ 2012-05-10 18:10           ` Mathias Burén
  2012-05-10 18:23           ` Phil Turmel
  1 sibling, 0 replies; 51+ messages in thread
From: Mathias Burén @ 2012-05-10 18:10 UTC (permalink / raw)
  To: Keith Keller; +Cc: linux-raid

On 10 May 2012 18:53, Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote:
> On 2012-05-10, Phil Turmel <philip@turmel.org> wrote:
>>
>> Frequent array checks are not optional, if you want to flush out any UREs
>> in the making, and maximize your odds of successfully rebuilding after
>> a drive replacement.  If you are running RAID6 or a triple mirror, with
>> frequent checks, you are very safe.
>
> How frequent do people define as "frequent"?  I typically do monthly
> checks, but perhaps I should be doing them weekly.
>
> --keith
>
>
> --
> kkeller@wombat.san-francisco.ca.us
>
>

RAID6 on 7 x 2TB drives here, weekly checks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 17:53         ` Keith Keller
  2012-05-10 18:10           ` Mathias Burén
@ 2012-05-10 18:23           ` Phil Turmel
  2012-05-10 19:15             ` Keith Keller
  1 sibling, 1 reply; 51+ messages in thread
From: Phil Turmel @ 2012-05-10 18:23 UTC (permalink / raw)
  To: Keith Keller; +Cc: linux-raid

On 05/10/2012 01:53 PM, Keith Keller wrote:
> On 2012-05-10, Phil Turmel <philip@turmel.org> wrote:
>>
>> Frequent array checks are not optional, if you want to flush out any UREs
>> in the making, and maximize your odds of successfully rebuilding after
>> a drive replacement.  If you are running RAID6 or a triple mirror, with
>> frequent checks, you are very safe.
> 
> How frequent do people define as "frequent"?  I typically do monthly
> checks, but perhaps I should be doing them weekly.

I do them weekly...  the following is called from my crontab:

#!/bin/bash
#
# Weekly Cron Job to initiate RAID scan/repair cycles
for x in /sys/block/md*/md/sync_action ; do
        echo check >$x
done
# Process occurs in background kernel tasks


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 16:04       ` Phil Turmel
  2012-05-10 17:53         ` Keith Keller
@ 2012-05-10 18:42         ` Daniel Pocock
  2012-05-10 19:09           ` Phil Turmel
  2012-05-21 14:19           ` Brian Candler
  1 sibling, 2 replies; 51+ messages in thread
From: Daniel Pocock @ 2012-05-10 18:42 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Marcus Sorensen, linux-raid



On 10/05/12 16:04, Phil Turmel wrote:
> On 05/10/2012 11:26 AM, Marcus Sorensen wrote:
>> On Thu, May 10, 2012 at 7:51 AM, Phil Turmel <philip@turmel.org> wrote:
> 
> [trim /]
> 
>>> Here is where Marcus and I part ways.  A very common report I see on
>>> this mailing list is people who have lost arrays where the drives all
>>> appear to be healthy.  Given the large size of today's hard drives,
>>> even healthy drives will occasionally have an unrecoverable read error.
>>>
>>> When this happens in a raid array with a desktop drive without SCTERC,
>>> the driver times out and reports an error to MD.  MD proceeds to
>>> reconstruct the missing data and tries to write it back to the bad
>>> sector.  However, that drive is still trying to read the bad sector and
>>> ignores the controller.  The write is immediately rejected.  BOOM!  The
>>> *write* error ejects that member from the array.  And you are now
>>> degraded.
>>>
>>> If you don't notice the degraded array right away, you probably won't
>>> notice until a URE on another drive pops up.  Once that happens, you
>>> can't complete a resync to revive the array.
>>>
>>> Running a "check" or "repair" on an array without TLER will have the
>>> opposite of the intended effect: any URE will kick a drive out instead
>>> of fixing it.
>>>
>>> In the same scenario with an enterprise drive, or a drive with SCTERC
>>> turned on, the drive read times out before the controller driver, the
>>> controller never resets the link to the drive, and the followup write
>>> succeeds.  (The sector is either successfully corrected in place, or
>>> it is relocated by the drive.)  No BOOM.
>>>
>>
>> Agreed. In the past there has been some debate about this. I think it
>> comes down to your use case, the data involved and what you expect.
>> TLER/ERC can generally make your array more durable to minor hiccups,
>> and is likely preferred if you can stomach the cost, at the potential
>> risk that I described.  If the failure is a simple one-off read
>> failure, then Phil's scenario is very likely. If the drive is really
>> going bad (say hitting max_read_errors), then the disk won't try very
>> hard to recover your data, at which point you have to hope the other
>> drive doesn't have even a minor read error when rebuilding, because it
>> also will not try very hard. In the end it's up to you what behavior
>> you want.
> 
> Well, I approach this from the assumption that the normal condition
> of a production RAID array is *non-degraded*.  You don't want isolated
> read errors to hold up your application when the data can be quickly
> reconstructed from the redundancy.  And you certainly don't want
> transient errors to kick drives out of the array.

I think you have to look at the average user's perspective: even most IT
people don't want to know everything about what goes on in their drives.
 They just expect stuff to work in a manner they consider `sensible'.
There is an expectation that if you have RAID you have more safety than
without RAID.  The idea that a whole array can go down because of
different sectors failing in each drive seems to violate that expectation.

> Coordinating the drive and the controller timeouts is the *only* way
> to avoid the URE kickout scenario.

I really think that is something that needs consideration.  As a
minimum, should md log a warning message if SCTERC is not supported and
configured in a satisfactory way?

> Changing TLER/ERC when an array becomes degraded for a real hardware
> failure is a useful idea. I think I'll look at scripting that.

Ok, so I bought an enterprise grade drive, the WD RE4 (2TB) and I'm
about to add it in place of the drive that failed.

I did a quick check with smartctl:

# smartctl -a /dev/sdb -l scterc
....
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

so the TLER feature appears to be there.  I haven't tried changing it.

For my old Barracuda 7200.12 that is still working, I see this:

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

and a diff between the full output for both drives reveals the following:

-SCT capabilities:             (0x103f) SCT Status supported.
+SCT capabilities:             (0x303f) SCT Status supported.
                                        SCT Error Recovery Control
supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.




>> Here are a few odd things to consider, if you're worried about this topic:
>>
>> * Using smartctl to increase the ERC timeout on enterprise SATA
>> drives, say to 25 seconds, for use with md. I have no idea if this
>> will cause the drive to actually try different methods of recovery,
>> but it could be a good middle ground.
> 

What are the consequences if I don't do that?  I currently have 7
seconds on my new drive.  If md can't read a sector from the drive, will
it fail the whole drive?  Will it automatically read the sector from the
other drive so the application won't know something bad happened?  Will
it automatically try to re-write the sector on the drive that couldn't
read it?

Would you know how btrfs behaves in that same scenario - does it try to
write out the sector to the drive that failed the read?  Does it also
try to write out the sector when a read came in with a bad checksum and
it got a good copy from the other drive?

> For a healthy array, I think this is counter-productive, as you are
> holding up your applications.  Any sector that is marginal and needs
> that much time to recover really ought to be re-written anyways.
> 
>> * increasing max_read_errors in an attempt to keep a TLER/ERC disk in
>> the loop longer. The only reason to do this would be if you were
>> proactive in monitoring said errors and could add in more redundancy
>> before pulling the failing drive, thus increasing your chances that
>> the rebuild succeeds, having more *mostly* good copies.
>>
>> * Increasing the SCSI timeout on your desktop drives to 60 seconds or
>> more, giving the drive a chance to succeed in deep recovery. This may
>> cause IO to block for awhile, so again it depends on your usage
>> scenario.
> 
> I can understand using all available means to resync/rebuild a
> degraded array, but I can't see leaving those settings on a healthy
> array.
> 
>> * frequent array checks - perhaps in combination with the above, can
>> increase the likelihood that you find errors in a timely manner and
>> increase the chances that the rebuild will succeed if you've only got
>> one good copy left.
> 
> Frequent array checks are not optional, if you want to flush out any UREs
> in the making, and maximize your odds of successfully rebuilding after
> a drive replacement.  If you are running RAID6 or a triple mirror, with
> frequent checks, you are very safe.
> 
> [...]
> 
>>> Neither Seagate nor Western Digital offer any desktop drive with any
>>> form of time-limited error recovery.  Seagate and WD were my "go to"
>>> brands for RAID.  I am now buying Hitachi, as they haven't (yet)
>>> followed their peers.  The "I" in RAID stands for "inexpensive",
>>> after all.
>>
>> I keep hearing that, and I was always under the impression that the
>> "I" stood for "Independent", as you can do RAID with any independent
>> disk, cheap or expensive. Seems it was changed mid-90's. I suppose
>> both are accepted, but perhaps the one we use says something about our
>> level of seniority :-)
> 
> Hmmm.  I hadn't noticed the change to "independent".  Can't allow any
> premium technology to be inexpensive, can we?
> 
> And yes, there's grey in my beard.
> 
> Phil

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 18:42         ` Daniel Pocock
@ 2012-05-10 19:09           ` Phil Turmel
  2012-05-10 20:30             ` Daniel Pocock
  2012-05-11  6:50             ` Michael Tokarev
  2012-05-21 14:19           ` Brian Candler
  1 sibling, 2 replies; 51+ messages in thread
From: Phil Turmel @ 2012-05-10 19:09 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Marcus Sorensen, linux-raid

On 05/10/2012 02:42 PM, Daniel Pocock wrote:
> 
> I think you have to look at the average user's perspective: even most IT
> people don't want to know everything about what goes on in their drives.
>  They just expect stuff to work in a manner they consider `sensible'.
> There is an expectation that if you have RAID you have more safety than
> without RAID.  The idea that a whole array can go down because of
> different sectors failing in each drive seems to violate that expectation.

You absolutely do have more safety; you just might not have as much more
safety as you think.  Modern distributions try hard to automate much of
this setup (e.g. Ubuntu tries to set up mdmon for you when you install
mdadm), but it is not 100%.

Expectations have also changed in the past few years, in opposing
ways.  One, hard drive capacities have skyrocketed (Yay!), but error
rate specs have not, so typical users are more likely to encounter UREs.

Two, Linux has gained much more acceptance from home users building
media servers and such, with much more exposure to non-enterprise
components.

Not to excuse the situation--just to explain it.  Coding in this
arena is mostly volunteers, too.

>> Coordinating the drive and the controller timeouts is the *only* way
>> to avoid the URE kickout scenario.
> 
> I really think that is something that needs consideration, as a minimum,
> should md log a warning message if SCTERC is not supported and
> configured in a satisfactory way?

This sounds useful.

>> Changing TLER/ERC when an array becomes degraded for a real hardware
>> failure is a useful idea. I think I'll look at scripting that.
> 
> Ok, so I bought an enterprise grade drive, the WD RE4 (2TB) and I'm
> about to add it in place of the drive that failed.
> 
> I did a quick check with smartctl:
> 
> # smartctl -a /dev/sdb -l scterc
> ....
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)
> 
> so the TLER feature appears to be there.  I haven't tried changing it.
> 
> For my old Barracuda 7200.12 that is still working, I see this:
> 
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled

You should try changing it.  Drives that don't support it won't even
show you that.

You can then put "smartctl -l scterc,70,70 /dev/sdX" in /etc/rc.local or
your distribution's equivalent.
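
For example, something like this in rc.local (list your actual member
drives):

  for d in /dev/sd[abcd] ; do
          smartctl -l scterc,70,70 $d
  done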

> and a diff between the full output for both drives reveals the following:
> 
> -SCT capabilities:             (0x103f) SCT Status supported.
> +SCT capabilities:             (0x303f) SCT Status supported.
>                                         SCT Error Recovery Control
> supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
> 
> 
> 
> 
>>> Here are a few odd things to consider, if you're worried about this topic:
>>>
>>> * Using smartctl to increase the ERC timeout on enterprise SATA
>>> drives, say to 25 seconds, for use with md. I have no idea if this
>>> will cause the drive to actually try different methods of recovery,
>>> but it could be a good middle ground.
>>
> 
> What are the consequences if I don't do that?  I currently have 7
> seconds on my new drive.  If md can't read a sector from the drive, will
> it fail the whole drive?  Will it automatically read the sector from the
> other drive so the application won't know something bad happened?  Will
> it automatically try to re-write the sector on the drive that couldn't
> read it?

MD fails drives on *write* errors.  It reconstructs from mirrors or
parity on read errors and writes the result back to the origin drive.
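
If you want to watch how often that is happening, md keeps a per-member
count of corrected read errors in sysfs (path from memory, so
double-check it on your kernel):

  cat /sys/block/md0/md/dev-sdb1/errors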

> Would you know how btrfs behaves in that same scenario - does it try to
> write out the sector to the drive that failed the read?  Does it also
> try to write out the sector when a read came in with a bad checksum and
> it got a good copy from the other drive?

I haven't experimented with btrfs yet.  It is still marked experimental.

Phil

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 18:23           ` Phil Turmel
@ 2012-05-10 19:15             ` Keith Keller
  0 siblings, 0 replies; 51+ messages in thread
From: Keith Keller @ 2012-05-10 19:15 UTC (permalink / raw)
  To: linux-raid

On 2012-05-10, Phil Turmel <philip@turmel.org> wrote:
>
> I do them weekly...  the following is called from my crontab:
>
> #!/bin/bash
> #
> # Weekly Cron Job to initiate RAID scan/repair cycles
> for x in /sys/block/md*/md/sync_action ; do
>         echo check >$x
> done
> # Process occurs in background kernel tasks

Actually RHEL/CentOS has a nice utility script that will check your
arrays and emit an error via stdout/stderr if any mismatches are found.
Obviously you could have a separate cron job to check mismatch_cnt but I
think it's handy to have the check and the report self-contained--if
your cron is properly configured you'll get email on any mismatches.
It also has a config file where you can choose to ignore some arrays,
run a check on others, and run a repair on others.  (Presumably other
distros may have something similar.)
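
If you don't have that script, a minimal sketch of the same idea looks
something like this (untested here, and the array selection is deliberately
crude; cron mails whatever it prints):

#!/bin/bash
# Start a check on every md array, wait for the scrubs to finish,
# then print any non-zero mismatch counts so cron mails them out.
for md in /sys/block/md*/md ; do
        echo check > "$md/sync_action"
done
while grep -qE 'check|resync' /proc/mdstat ; do
        sleep 60
done
for md in /sys/block/md*/md ; do
        count=$(cat "$md/mismatch_cnt")
        if [ "$count" -ne 0 ] ; then
                echo "${md%/md}: mismatch_cnt = $count"
        fi
done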

--keith

-- 
kkeller@wombat.san-francisco.ca.us



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 19:09           ` Phil Turmel
@ 2012-05-10 20:30             ` Daniel Pocock
  2012-05-11  6:50             ` Michael Tokarev
  1 sibling, 0 replies; 51+ messages in thread
From: Daniel Pocock @ 2012-05-10 20:30 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Marcus Sorensen, linux-raid



On 10/05/12 19:09, Phil Turmel wrote:
> On 05/10/2012 02:42 PM, Daniel Pocock wrote:
>>
>> I think you have to look at the average user's perspective: even most IT
>> people don't want to know everything about what goes on in their drives.
>>  They just expect stuff to work in a manner they consider `sensible'.
>> There is an expectation that if you have RAID you have more safety than
>> without RAID.  The idea that a whole array can go down because of
>> different sectors failing in each drive seems to violate that expectation.
> 
> You absolutely do have more safety, you just might not have as much more
> safety as you think.  Modern distributions try hard to automate much of
> this setup (e.g. Ubuntu tries to set up mdmon for you when you install
> mdadm), but it is not 100%.
> 
> Expectations have also changed in the past few years, too, in opposing
> ways.  One, hard drive capacities have skyrocketed (Yay!), but error
> rate specs have not, so typical users are more likely to encounter UREs.
> 
> Two, Linux has gained much more acceptance from home users building
> media servers and such, with much more exposure to non-enterprise
> components.
> 
> Not to excuse the situation--just to explain it.  Coding in this
> arena is mostly volunteers, too.

I understand what you mean, and some of those issues can't be solved
with some quick fix.

However, the degraded array situation where the user doesn't know what
to do is probably less of a problem for a highly technical user who can
identify the correct drive to rescue.

In the heat of battle (I've been in various corporate environments when
RAID systems have gone down) there is often tremendous pressure and
emotion.  In that scenario, someone might not have a lot of time to
investigate what is really wrong, and might form the conclusion that all
the drives are completely dead even though it is just a case of a few
bad sectors on each.

>>> Coordinating the drive and the controller timeouts is the *only* way
>>> to avoid the URE kickout scenario.
>>
>> I really think that is something that needs consideration, as a minimum,
>> should md log a warning message if SCTERC is not supported and
>> configured in a satisfactory way?
> 
> This sounds useful.

Maybe it could be checked periodically in case it changes, or in case
not all drives are present at boot time

>>> Changing TLER/ERC when an array becomes degraded for a real hardware
>>> failure is a useful idea. I think I'll look at scripting that.
>>
>> Ok, so I bought an enterprise grade drive, the WD RE4 (2TB) and I'm
>> about to add it in place of the drive that failed.
>>
>> I did a quick check with smartctl:
>>
>> # smartctl -a /dev/sdb -l scterc
>> ....
>> SCT Error Recovery Control:
>>            Read:     70 (7.0 seconds)
>>           Write:     70 (7.0 seconds)
>>
>> so the TLER feature appears to be there.  I haven't tried changing it.
>>
>> For my old Barracuda 7200.12 that is still working, I see this:
>>
>> SCT Error Recovery Control:
>>            Read: Disabled
>>           Write: Disabled
> 
> You should try changing it.  Drives that don't support it won't even
> show you that.
> 
> You can then put "smartctl -l scterc,70,70 /dev/sdX" in /etc/rc.local or
> your distribution's equivalent.

Done - it looks like the drive accepted it

This is what I put in rc.local.  I'm hoping that my drives always come up
as sd[ab], of course - are there other ways to do this using disk labels,
or does md have any type of callback/hook scripts (e.g. like ppp-up.d)?

echo -n "smartctl: Trying to enable SCTERC / TLER on main disks..."
/usr/sbin/smartctl -l scterc,70,70 /dev/sda > /dev/null
/usr/sbin/smartctl -l scterc,70,70 /dev/sdb > /dev/null
echo "."

I also have some /sbin/blockdev --setra calls in rc.local, do you have
any suggestions on how that should be optimized for the LVM/md
combination, e.g. I have

Raw partitions: /dev/sd[ab]2 as elements of the RAID1
MD: /dev/md2 as a PV for LVM
LVM: various LVs for different things (e.g. some for photos, some of
compiling large source code projects, very different IO patterns for
each LV)
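
i.e. things roughly like this (the LV names are made up for illustration):

/sbin/blockdev --setra 8192 /dev/vg0/photos   # large sequential reads
/sbin/blockdev --setra 256  /dev/vg0/build    # small random IO while compiling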

>> and a diff between the full output for both drives reveals the following:
>>
>> -SCT capabilities:             (0x103f) SCT Status supported.
>> +SCT capabilities:             (0x303f) SCT Status supported.
>>                                         SCT Error Recovery Control
>> supported.
>>                                         SCT Feature Control supported.
>>                                         SCT Data Table supported.
>>
>>
>>
>>
>>>> Here are a few odd things to consider, if you're worried about this topic:
>>>>
>>>> * Using smartctl to increase the ERC timeout on enterprise SATA
>>>> drives, say to 25 seconds, for use with md. I have no idea if this
>>>> will cause the drive to actually try different methods of recovery,
>>>> but it could be a good middle ground.
>>>
>>
>> What are the consequences if I don't do that?  I currently have 7
>> seconds on my new drive.  If md can't read a sector from the drive, will
>> it fail the whole drive?  Will it automatically read the sector from the
>> other drive so the application won't know something bad happened?  Will
>> it automatically try to re-write the sector on the drive that couldn't
>> read it?
> 
> MD fails drives on *write* errors.  It reconstructs from mirrors or
> parity on read errors and writes the result back to the origin drive.

Ok, that is re-assuring

>> Would you know how btrfs behaves in that same scenario - does it try to
>> write out the sector to the drive that failed the read?  Does it also
>> try to write out the sector when a read came in with a bad checksum and
>> it got a good copy from the other drive?
> 
> I haven't experimented with btrfs yet.  It is still marked experimental.

Apparently

a) it may be supported in the next round of major distributions (e.g.
Debian 7 is considering it)
b) the only reason it is still marked experimental (and this is what
I've read, it is not my opinion as I don't know enough about it) is
simply because btrfsck is not fully complete

Also, there is heavy competition from ZFS on FreeBSD; I hear a lot about
people using that combination because of the perceived lateness of btrfs
on Linux - but once again, I don't know how well the ZFS/FreeBSD
combination handles drive hardware.  All I know is that ZFS has the
checksum capability (which gives it an edge over any regular RAID1 like
mdraid).


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 13:51   ` Phil Turmel
  2012-05-10 14:59     ` Daniel Pocock
  2012-05-10 15:26     ` Marcus Sorensen
@ 2012-05-10 21:15     ` Stan Hoeppner
  2012-05-10 21:31       ` Daniel Pocock
                         ` (2 more replies)
  2 siblings, 3 replies; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-10 21:15 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Marcus Sorensen, Daniel Pocock, linux-raid

On 5/10/2012 8:51 AM, Phil Turmel wrote:

> Hardware RAID cards usually offer battery-backed write cache, which is
> very valuable in some applications.  I don't have a need for that kind
> of performance, so I can't speak to the details.  (Is Stan H.
> listening?)

Yes, I'm here to drop the hammer, and start a flame war. ;)  I've been
lurking and trying to stay out of the fray, but you "keep dragging me
back in!" --Michael Corleone

I find the mere existence of this thread a bit comical, as with all
others that have preceded it.  I made the comment on this list quite
some time ago that md raid is mostly used by hobbyists, and took a lot
of heat for that.  The existence of this thread adds ammunition to that
argument.

If not for the fact Western Digital added "TLER" to the spec sheet of
its RE and Raptor series drives many years ago, nobody would have ever
mentioned it.

WD did this because those in the "channel" marketplace weren't buying
the drives.  They saw no difference with these new "enterprise" drives
but the much higher price.  WD has never sold RE/Raptor drives to
server/storage OEMs.  WD has never had a presence in enterprise storage.
 Seagate, Hitachi/IBM, Fujitsu, and to a small degree Toshiba, have
owned that space for over a decade.

So in an attempt to drive sales, they added "TLER" to the sheet to
differentiate from their desktop drives.  So what happens?  All the
hobbyists immediately want to enable this "TLER" feature from the
"enterprise" drives on their consumer models, because "TLER" is all that
makes them "enterprise" drives, after all, "all WD drives are the same,
just with different firmware, right?".

Proof point:  Few write about this subject using the generic term "ERC",
which is used by Seagate, or the term Samsung uses, "CCTL".  Everyone
seems to talk about "TLER".  Hmmm...  Coincidence?  No, marketing.

You won't find a single discussion about ERC/TLER/CCTL on any enterprise
storage forum, unless it's brought up by someone desiring to cut cost
corners using consumer drives.

So if md raid is not limited to use by hobbyists, and is indeed used in
enterprise environments, then why aren't the enterprise boys discussing
"the problems w/TLER and enterprise drives"?  Because obviously md raid
has no issues when being used with enterprise (ERC/TLER/CCTL) drives.

Either that, or md raid is only used by hobbyists. ;)

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 21:15     ` Stan Hoeppner
@ 2012-05-10 21:31       ` Daniel Pocock
  2012-05-11  1:53         ` Stan Hoeppner
  2012-05-10 21:41       ` Phil Turmel
  2012-05-10 22:27       ` David Brown
  2 siblings, 1 reply; 51+ messages in thread
From: Daniel Pocock @ 2012-05-10 21:31 UTC (permalink / raw)
  To: stan; +Cc: Phil Turmel, Marcus Sorensen, linux-raid



On 10/05/12 21:15, Stan Hoeppner wrote:
> On 5/10/2012 8:51 AM, Phil Turmel wrote:
> 
>> Hardware RAID cards usually offer battery-backed write cache, which is
>> very valuable in some applications.  I don't have a need for that kind
>> of performance, so I can't speak to the details.  (Is Stan H.
>> listening?)
> 
> Yes, I'm here to drop the hammer, and start a flame war. ;)  I've been
> lurking and trying to stay out of the fray, but you "keep dragging me
> back in!" --Michael Corleone
> 
> I find the mere existence of this thread a bit comical, as with all
> others that have preceded it.  I made the comment on this list quite
> some time ago that md raid is mostly used by hobbyists, and took a lot
> of heat for that.  The existence of this thread adds ammunition to that
> argument.

Well, while we're talking about ammunition, did you know HP dropped some
nukes?

When they released the N36L Microserver, there was a statement on their
web site saying that there are no Linux drivers for the AMD (fake)RAID,
but that they weren't necessary because Linux has built-in RAID.

> If not for the fact Western Digital added "TLER" to the spec sheet of
> its RE and Raptor series drives many years ago, nobody would have ever
> mentioned it.
> 
> WD did this because those in the "channel" marketplace weren't buying
> the drives.  They saw no difference with these new "enterprise" drives
> but the much higher price.  WD has never sold RE/Raptor drives to
> server/storage OEMs.  WD has never had a presence in enterprise storage.
>  Seagate, Hitachi/IBM, Fujitsu, and to a small degree Toshiba, have
> owned that space for over a decade.
> 
> So in an attempt to drive sales, they added "TLER" to the sheet to
> differentiate from their desktop drives.  So what happens?  All the
> hobbyists immediately want to enable this "TLER" feature from the
> "enterprise" drives on their consumer models, because "TLER" is all that
> makes them "enterprise" drives, after all, "all WD drives are the same,
> just with different firmware, right?".
> 
> Proof point:  Few write about this subject using the generic term "ERC",
> which is used by Seagate, or the term Samsung uses, "CCTL".  Everyone
> seems to talk about "TLER".  Hmmm...  Coincidence?  No, marketing.

Actually, the TLER term is mentioned elsewhere, for example the Adaptec
blog I came across

Economists often talk about price selectivity, e.g. the coffee shops
that charge an extra pound/euro/dollar for `organic' coffee.  Does it
really cost an extra pound to produce one teaspoon of coffee in an
organic way?  Of course not, it's just a gimmick to extract an extra
pound from people who won't lose any sleep over spending an extra pound.

> You won't find a single discussion about ERC/TLER/CCTL on any enterprise
> storage forum, unless it's brought up by someone desiring to cut cost
> corners using consumer drives.

Not quite, I'm going the opposite direction, trying to move away from
cheap drives - but I don't want to invest heavily in something that is

a) just a marketing gimmick
b) not going to do me any good if md doesn't exercise the special
features of the hardware

> So if md raid is not limited to use by hobbyists, and is indeed used in
> enterprise environments, then why aren't the enterprise boys discussing
> "the problems w/TLER and enterprise drives"?  Because obviously md raid
> has no issues when being used with enterprise (ERC/TLER/CCTL) drives.
> 
> Either that, or md raid is only used by hobbyists. ;)
> 
Better a hobbyist running Linux than a professional running Windows with
fakeraid


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 21:15     ` Stan Hoeppner
  2012-05-10 21:31       ` Daniel Pocock
@ 2012-05-10 21:41       ` Phil Turmel
  2012-05-10 22:27       ` David Brown
  2 siblings, 0 replies; 51+ messages in thread
From: Phil Turmel @ 2012-05-10 21:41 UTC (permalink / raw)
  To: stan; +Cc: Marcus Sorensen, Daniel Pocock, linux-raid

On 05/10/2012 05:15 PM, Stan Hoeppner wrote:
> On 5/10/2012 8:51 AM, Phil Turmel wrote:
> 
>> Hardware RAID cards usually offer battery-backed write cache, which is
>> very valuable in some applications.  I don't have a need for that kind
>> of performance, so I can't speak to the details.  (Is Stan H.
>> listening?)
> 
> Yes, I'm here to drop the hammer, and start a flame war. ;)  I've been
> lurking and trying to stay out of the fray, but you "keep dragging me
> back in!" --Michael Corleone

Mission Accomplished!  I am very interested in what professionals like
you recommend, even if I am not in a position to buy the pieces.

> I find the mere existence of this thread a bit comical, as with all
> others that have preceded it.  I made the comment on this list quite
> some time ago that md raid is mostly used by hobbyists, and took a lot
> of heat for that.  The existence of this thread adds ammunition to that
> argument.

I think it is just us hobbyists who suffer these problems, trying to get
the most out of our limited budgets.

At least we are brightening your day with comedy.

> If not for the fact Western Digital added "TLER" to the spec sheet of
> its RE and Raptor series drives many years ago, nobody would have ever
> mentioned it.
> 
> WD did this because those in the "channel" marketplace weren't buying
> the drives.  They saw no difference with these new "enterprise" drives
> but the much higher price.  WD has never sold RE/Raptor drives to
> server/storage OEMs.  WD has never had a presence in enterprise storage.
>  Seagate, Hitachi/IBM, Fujitsu, and to a small degree Toshiba, have
> owned that space for over a decade.
> 
> So in an attempt to drive sales, they added "TLER" to the sheet to
> differentiate from their desktop drives.  So what happens?  All the
> hobbyists immediately want to enable this "TLER" feature from the
> "enterprise" drives on their consumer models, because "TLER" is all that
> makes them "enterprise" drives, after all, "all WD drives are the same,
> just with different firmware, right?".
> 
> Proof point:  Few write about this subject using the generic term "ERC",
> which is used by Seagate, or the term Samsung uses, "CCTL".  Everyone
> seems to talk about "TLER".  Hmmm...  Coincidence?  No, marketing.

I won't argue with successful marketing.  More power to them.  Of
course, they aren't offering this in their more affordable products, so
me and my well-informed hobbyist compatriots won't be buying them
either.

I learned ERC before TLER, not that it matters much.

> You won't find a single discussion about ERC/TLER/CCTL on any enterprise
> storage forum, unless it's brought up by someone desiring to cut cost
> corners using consumer drives.

Exactly.  But that doesn't provide evidence either way as to software
raid use in enterprises.

> So if md raid is not limited to use by hobbyists, and is indeed used in
> enterprise environments, then why aren't the enterprise boys discussing
> "the problems w/TLER and enterprise drives"?  Because obviously md raid
> has no issues when being used with enterprise (ERC/TLER/CCTL) drives.
> 
> Either that, or md raid is only used by hobbyists. ;)

Well, it is definitely used by us hobbyists.  :-)



Phil


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 15:26     ` Marcus Sorensen
  2012-05-10 16:04       ` Phil Turmel
@ 2012-05-10 21:43       ` Stan Hoeppner
  2012-05-10 23:00         ` Marcus Sorensen
  1 sibling, 1 reply; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-10 21:43 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Phil Turmel, Daniel Pocock, linux-raid

On 5/10/2012 10:26 AM, Marcus Sorensen wrote:

> * Using smartctl to increase the ERC timeout on enterprise SATA
> drives, say to 25 seconds, for use with md. I have no idea if this
> will cause the drive to actually try different methods of recovery,
> but it could be a good middle ground.

If a drive needs 25 seconds to recover from a read error it should have
been replaced long ago.

The only thing that increasing these timeouts to silly high numbers does
is, hopefully for those doing it anyway, prolong the replacement
interval of failing drives.

Can anyone guess what the big bear trap is that this places before you?
 The rest of the drives in the array have been held over much longer as
well.  So when you go to finally rebuild the replacement for this 25s
delay king, you'll be more likely to run into unrecoverable errors on
other array members.  Then you chance losing your entire array, and, for
many here, all of your data, as hobbyists don't do backups. ;)

First 2 rules of managing RAID systems:

1.  Monitor drives and preemptively replace those going downhill BEFORE
your RAID controllers or md raid kick them

1a. Don't wait for controllers/md raid to kick bad drives

2.  Data is always worth more than disk drives

2a. If drives cost more than your lost data, you're doing it wrong

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 21:15     ` Stan Hoeppner
  2012-05-10 21:31       ` Daniel Pocock
  2012-05-10 21:41       ` Phil Turmel
@ 2012-05-10 22:27       ` David Brown
  2012-05-10 22:37         ` Daniel Pocock
       [not found]         ` <CABYL=ToORULrdhBVQk0K8zQqFYkOomY-wgG7PpnJnzP9u7iBnA@mail.gmail.com>
  2 siblings, 2 replies; 51+ messages in thread
From: David Brown @ 2012-05-10 22:27 UTC (permalink / raw)
  To: stan; +Cc: Phil Turmel, Marcus Sorensen, Daniel Pocock, linux-raid

On 10/05/12 23:15, Stan Hoeppner wrote:
> On 5/10/2012 8:51 AM, Phil Turmel wrote:
>
>> Hardware RAID cards usually offer battery-backed write cache, which is
>> very valuable in some applications.  I don't have a need for that kind
>> of performance, so I can't speak to the details.  (Is Stan H.
>> listening?)
>
> Yes, I'm here to drop the hammer, and start a flame war. ;)  I've been
> lurking and trying to stay out of the fray, but you "keep dragging me
> back in!" --Michael Corleone
>
> I find the mere existence of this thread a bit comical, as with all
> others that have preceded it.  I made the comment on this list quite
> some time ago that md raid is mostly used by hobbyists, and took a lot
> of heat for that.  The existence of this thread adds ammunition to that
> argument.
>

I think you've got that a bit backwards.  Most hobbyists (or low-budget 
users) who use raid other than motherboard fakeraid will choose Linux md 
raid.  It may well be that most users of md raid /are/ hobby or 
low-budget users.  But your implication - that professionals don't use 
md raid - is completely wrong.

It's more likely that it is hobby users that discuss these sorts of 
things - professionals just pay the money that the server manufacturer 
asks for its supported disks, since paying that is cheaper than spending 
time discussing things.  I know I mostly follow lists like this in my 
free time (as a hobbyist) rather than in work time (as a professional).


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 22:27       ` David Brown
@ 2012-05-10 22:37         ` Daniel Pocock
       [not found]         ` <CABYL=ToORULrdhBVQk0K8zQqFYkOomY-wgG7PpnJnzP9u7iBnA@mail.gmail.com>
  1 sibling, 0 replies; 51+ messages in thread
From: Daniel Pocock @ 2012-05-10 22:37 UTC (permalink / raw)
  To: David Brown; +Cc: stan, Phil Turmel, Marcus Sorensen, linux-raid



On 10/05/12 22:27, David Brown wrote:
> On 10/05/12 23:15, Stan Hoeppner wrote:
>> On 5/10/2012 8:51 AM, Phil Turmel wrote:
>>
>>> Hardware RAID cards usually offer battery-backed write cache, which is
>>> very valuable in some applications.  I don't have a need for that kind
>>> of performance, so I can't speak to the details.  (Is Stan H.
>>> listening?)
>>
>> Yes, I'm here to drop the hammer, and start a flame war. ;)  I've been
>> lurking and trying to stay out of the fray, but you "keep dragging me
>> back in!" --Michael Corleone
>>
>> I find the mere existence of this thread a bit comical, as with all
>> others that have preceded it.  I made the comment on this list quite
>> some time ago that md raid is mostly used by hobbyists, and took a lot
>> of heat for that.  The existence of this thread adds ammunition to that
>> argument.
>>
> 
> I think you've got that a bit backwards.  Most hobbyists (or low-budget
> users) who use raid other than motherboard fakeraid will choose Linux md
> raid.  It may well be that most users of md raid /are/ hobby or
> low-budget users.  But your implication - that professionals don't use
> md raid - is completely wrong.
> 
> It's more likely that it is hobby users that discuss these sorts of
> things - professionals just pay the money that the server manufacturer
> asks for its supported disks, since paying that is cheaper than spending
> time discussing things.  I know I mostly follow lists like this in my
> free time (as a hobbyist) rather than in work time (as a professional).
> 

The scarier thing is that many professionals are afraid to look at
(or ask questions on) a list like this because they don't want to look
like they don't know something.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 21:43       ` Stan Hoeppner
@ 2012-05-10 23:00         ` Marcus Sorensen
  0 siblings, 0 replies; 51+ messages in thread
From: Marcus Sorensen @ 2012-05-10 23:00 UTC (permalink / raw)
  To: stan; +Cc: Phil Turmel, Daniel Pocock, linux-raid

You quoted me so I'll reply to this. Consider that most people use
drives with NO limit, and that the 7 second limit is standard only
because most RAID cards will freak and start sending resets at around
8 seconds. The danger you discuss is just as prevalent with a 7 second
limit; if a drive is repeatedly having to do *any* read correction
then it should be replaced, but that's a separate discussion on
monitoring. However the notion that a drive routinely doing error
correction within 5 seconds keeps you safer if called upon to do a
rebuild than one that routinely takes 11 seconds is spurious.

I agree that your assertion of "enterprise users don't use md RAID" is
false. Then again, perhaps we should only define enterprises as those
who don't use software RAID.

Regarding something someone else mentioned, as far as I'm aware md
raid kicks drives out based on a read error rate, not only on writes.
This has been the case since 2.6.33, and in the patched RHEL/CentOS 6
kernels; see drivers/md/md.c: "#define MD_DEFAULT_MAX_CORRECTED_READ_ERRORS 20"
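
If I remember right, the same limit is also exposed per array in sysfs
on those kernels, so it can be inspected or raised, e.g.:

cat /sys/block/md0/md/max_read_errors      # defaults to 20
echo 50 > /sys/block/md0/md/max_read_errors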

On Thu, May 10, 2012 at 3:43 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 5/10/2012 10:26 AM, Marcus Sorensen wrote:
>
>> * Using smartctl to increase the ERC timeout on enterprise SATA
>> drives, say to 25 seconds, for use with md. I have no idea if this
>> will cause the drive to actually try different methods of recovery,
>> but it could be a good middle ground.
>
> If a drive needs 25 seconds to recover from a read error it should have
> been replaced long ago.
>
> The only thing that increasing these timeouts to silly high numbers does
> is, hopefully for those doing it anyway, prolong the replacement
> interval of failing drives.
>
> Can anyone guess what the big bear trap is that this places before you?
>  The rest of the drives in the array have been held over much longer as
> well.  So when you go to finally rebuild the replacement for this 25s
> delay king, you'll be more likely to run into unrecoverable errors on
> other array members.  Then you chance losing your entire array, and, for
> many here, all of your data, as hobbyists don't do backups. ;)
>
> First 2 rules of managing RAID systems:
>
> 1.  Monitor drives and preemptively replace those going downhill BEFORE
> your RAID controllers or md raid kick them
>
> 1a. Don't wait for controllers/md raid to kick bad drives
>
> 2.  Data is always worth more than disk drives
>
> 2a. If drives cost more than your lost data, you're doing it wrong
>
> --
> Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 21:31       ` Daniel Pocock
@ 2012-05-11  1:53         ` Stan Hoeppner
  2012-05-11  8:31           ` Daniel Pocock
  0 siblings, 1 reply; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-11  1:53 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Phil Turmel, Marcus Sorensen, linux-raid

On 5/10/2012 4:31 PM, Daniel Pocock wrote:

> Actually, the TLER term is mentioned elsewhere, for example the Adaptec
> blog I came across

The term "Time-Limited Error Recovery" (TLER) was introduced to the
world by Western Digital on August 3, 2004, almost 8 years ago.  They
introduced the term in their press release announcing their (then) new
RAID Edition (RE) serial ATA drives.

http://www.wdc.com/en/company/pressroom/releases.aspx?release=3f62e91b-288b-4852-8f6c-5abe507ec8dd

This term is exclusive to Western Digital Corporation.  It does not
apply to any other vendors' hard drives, nor any other product of any
kind.  It is not a general term for a function.  The general term for
this function is called Error Recovery Control (ERC).  If anyone applies
this term to any drive other than a WDC model, using it as a general
term, then s/he is uninformed and using the term incorrectly.

I do not currently, nor have I ever, worked for WDC.  I simply hate
marketing buzzwords, and hate even more people's misuse of such
marketing buzzwords.

> Economists often talk about price selectivity, e.g. the coffee shops
> that charge an extra pound/euro/dollar for `organic' coffee.  Does it
> really cost an extra pound to produce one teaspoon of coffee in an
> organic way?  Of course not, it's just a gimmick to extract an extra
> pound from people who won't lose any sleep over spending an extra pound.

Price gouging for gourmet coffee isn't an apt analogue of the disk drive
business.

Manufacturers do make more profit per enterprise SATA drive than they do
desktop SATA drives.  But the overall cost difference has nothing to do
with price gouging, nor this additional profit.  The cost difference is
due to the following factors:

1.  Vastly lower numbers produced, on the order of 1000:1
    We're all familiar with economy of scale yes?

2.  Firmware features that are developed for a handful of drive
    models.  The R&D dollars expended for this are spread over far
    fewer units sold, yet require an order of magnitude more
    verification work

3.  Compatibility testing and verification with hundreds of PCIe RAID
    controllers and HBAs, standalone RAID enclosures w/onboard
    controllers, JBOD chassis with and without SAS expanders, iSCSI and
    Fiber Channel arrays, etc, etc, etc.

4.  A greatly enhanced and more time consuming QC process for the drive
    hardware and firmware

#3 and #4 account for the majority of the cost premium of an enterprise
SATA drive vs a desktop SATA drive.  Labor is typically the most costly
aspect of manufacturing for electronics products.  3/4 are extremely
labor/time intensive.

> I'm going the opposite direction, trying to move away from
> cheap drives - but I don't want to invest heavily in something that is
> 
> a) just a marketing gimmick

Enterprise drives aren't a marketing gimmick.  Some of the marketing
language surrounding them is, but that's always the case with marketing.

> b) not going to do me any good if md doesn't exercise the special
> features of the hardware

If you will not be using a server nor JBOD chassis with an SAS expander
backplane for which your hard drives are certified, then there is little
benefit for you WRT enterprise SATA drives and md raid, other than
overall increased quality and a longer warranty.

>> Either that, or md raid is only used by hobbyists. ;)
>>
> Better a hobbyist running Linux than a professional running Windows with
> fakeraid

Heheh, no doubt.

For those who don't grasp the tongue-in-cheek nature of it, my stating
"md raid is only used by hobbyists" is obviously not a literal face
value statement.  There are dozens of enterprise NAS products on the
market that ship using md raid, and there are plenty of enterprise
size/caliber IT shops that use mdraid, though probably not exclusively.
 There are many more that use md raid striping or concatenation to
stitch together multiple hardware RAID logical drives and/or SAN LUNs.

The statement is also meant to poke the ribs of the pure hobbyist users
in an attempt to get them more into an enterprise way of approaching
RAID implementation and management.

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 19:09           ` Phil Turmel
  2012-05-10 20:30             ` Daniel Pocock
@ 2012-05-11  6:50             ` Michael Tokarev
  1 sibling, 0 replies; 51+ messages in thread
From: Michael Tokarev @ 2012-05-11  6:50 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Daniel Pocock, Marcus Sorensen, linux-raid

On 10.05.2012 23:09, Phil Turmel wrote:
>> For my old Barracuda 7200.12 that is still working, I see this:
>> > 
>> > SCT Error Recovery Control:
>> >            Read: Disabled
>> >           Write: Disabled
> You should try changing it.  Drives that don't support it won't even
> show you that.

You made me curious and I checked 3 identical drives I have
in raid setup in my server.  These are (somewhat oldish by
now) WD Caviar Black 640Gb ones.  Here are the identifies:

Model Family:     Western Digital Caviar Black family

Device Model:     WDC WD6401AALS-00L3B2
Serial Number:    WD-WCASYA623503
Firmware Version: 01.03B01

Device Model:     WDC WD6401AALS-00L3B2
Serial Number:    WD-WCASY4134266
Firmware Version: 01.03B01

Device Model:     WDC WD6401AALS-00L3B2
Serial Number:    WD-WCASY4137254
Firmware Version: 01.03B01


The first one is more recent, bought about a year after the last 2.
As you can see, everything looks exactly the same, incl.
the exact model number and firmware version.  Yet, the more
recent (first) does NOT support SCT Error Recovery Control,
returning an error to any -l scterc command, while the other 2,
which are older, support setting these parameters (but
I've no idea if the timeouts will actually be used by the
firmware, that is an entirely different question :).  So,
3 identical drives bought within a year show identical
versions and models, but behave differently...

I also noticed that all recent desktop drives from WD do
NOT support scterc, while older ones tend to support it.
That is on par with what others are saying.  I wonder
how this goes with drives from other manufacturers...

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
       [not found]         ` <CABYL=ToORULrdhBVQk0K8zQqFYkOomY-wgG7PpnJnzP9u7iBnA@mail.gmail.com>
@ 2012-05-11  7:10           ` David Brown
  2012-05-11  8:16             ` Daniel Pocock
  2012-05-11 22:17             ` Stan Hoeppner
  0 siblings, 2 replies; 51+ messages in thread
From: David Brown @ 2012-05-11  7:10 UTC (permalink / raw)
  To: Roberto Spadim
  Cc: stan, Phil Turmel, Marcus Sorensen, Daniel Pocock, linux-raid

On 11/05/2012 00:49, Roberto Spadim wrote:
> i have dell servers and i use raid1 in every server; just raid10 or
> raid0 are in hardware because hotswap with hardware is easier to
> implement, but if mdraid could do the job, i don't see why use hardware
> raid devices (just if they have a battery)
>

I think for simple situations, such as just wanting a straight mirror of 
two disks, then hardware raid provided by the supplier is often a good 
choice.  As you say, it can make hotswap easier - you get things like 
little red and green lights on the disk drives.  And the vendor supports 
it and knows how it works.  Also if you've got a more serious hardware 
with BBWC or similar features, then these features may be the deciding 
points.

But there is no doubt that md raid is a lot more flexible than any other 
raid system, it is often faster (especially for smaller setups - 
raid10,far being a prime example), and the money you save on raid cards 
can be spent on extra disks, UPS, etc.

One thing that may be an advantage either way is ease of configuration, 
monitoring, maintenance, and transfer of disks between systems.  With md 
raid, you have a consistent system that is independent of the hardware 
and setup, while every hardware raid system has its own proprietary 
tools, setup, hardware, monitoring software, etc.  So this is often a 
win for md raid - but if you support several hardware raid arrays, and 
use the same vendor for them all, then you have a consistent system 
there too.



> 2012/5/10 David Brown <david.brown@hesbynett.no>
>
>     On 10/05/12 23:15, Stan Hoeppner wrote:
>
>         On 5/10/2012 8:51 AM, Phil Turmel wrote:
>
>             Hardware RAID cards usually offer battery-backed write
>             cache, which is
>             very valuable in some applications.  I don't have a need for
>             that kind
>             of performance, so I can't speak to the details.  (Is Stan H.
>             listening?)
>
>
>         Yes, I'm here to drop the hammer, and start a flame war. ;)
>           I've been
>         lurking and trying to stay out of the fray, but you "keep
>         dragging me
>         back in!" --Michael Corleone
>
>         I find the mere existence of this thread a bit comical, as with all
>         others that have preceded it.  I made the comment on this list quite
>         some time ago that md raid is mostly used by hobbyists, and took
>         a lot
>         of heat for that.  The existence of this thread adds ammunition
>         to that
>         argument.
>
>
>     I think you've got that a bit backwards.  Most hobbyists (or
>     low-budget users) who use raid other than motherboard fakeraid will
>     choose Linux md raid.  It may well be that most users of md raid
>     /are/ hobby or low-budget users.  But your implication - that
>     professionals don't use md raid - is completely wrong.
>
>     It's more likely that it is hobby users that discuss these sorts of
>     things - professionals just pay the money that the server
>     manufacturer asks for its supported disks, since paying that is
>     cheaper than spending time discussing things.  I know I mostly
>     follow lists like this in my free time (as a hobbyist) rather than
>     in work time (as a professional).
>
>
>     --
>     To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>     the body of a message to majordomo@vger.kernel.org
>     More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
>
> --
> Roberto Spadim
> Spadim Technology / SPAEmpresarial

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-11  7:10           ` David Brown
@ 2012-05-11  8:16             ` Daniel Pocock
  2012-05-11 22:28               ` Stan Hoeppner
  2012-05-11 22:17             ` Stan Hoeppner
  1 sibling, 1 reply; 51+ messages in thread
From: Daniel Pocock @ 2012-05-11  8:16 UTC (permalink / raw)
  To: David Brown
  Cc: Roberto Spadim, stan, Phil Turmel, Marcus Sorensen, linux-raid



On 11/05/12 07:10, David Brown wrote:
> On 11/05/2012 00:49, Roberto Spadim wrote:
>> i have dell servers and i use raid1 in every server; just raid10 or
>> raid0 are in hardware because hotswap with hardware is easier to
>> implement, but if mdraid could do the job, i don't see why use hardware
>> raid devices (just if they have a battery)
>>
> 
> I think for simple situations, such as just wanting a straight mirror of
> two disks, then hardware raid provided by the supplier is often a good
> choice.  As you say, it can make hotswap easier - you get things like
> little red and green lights on the disk drives.  And the vendor supports
> it and knows how it works.  Also if you've got a more serious hardware
> with BBWC or similar features, then these features may be the deciding
> points.

My understanding of BBWC:

- for things like an NFS server, where the OS and disk hardware can't
cache write data very aggressively (due to the contract between NFS
client and server), the BBWC allows you to enable more aggressive
behavior (e.g. setting barrier=0 on the filesystem level) and gain a
speed boost

- on the other hand, is BBWC universally effective?  E.g. does it just
write to disk in a crash scenario, or only in the event of an outright
power failure?  Does it depend on any hints from the OS or drivers to
know about system state?

> One thing that may be an advantage either way is ease of configuration,
> monitoring, maintenance, and transfer of disks between systems.  With md
> raid, you have a consistent system that is independent of the hardware
> and setup, while every hardware raid system has its own proprietary
> tools, setup, hardware, monitoring software, etc.  So this is often a
> win for md raid - but if you support several hardware raid arrays, and
> use the same vendor for them all, then you have a consistent system
> there too.

This is my main point - the fact that md RAID is hardware independent,
e.g. I can swap from HP to IBM servers and use the same disks.

If I wanted more than RAID1 (e.g. RAID6) maybe I would re-evaluate the
issue, but for RAID1, a software solution seems fine.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-11  1:53         ` Stan Hoeppner
@ 2012-05-11  8:31           ` Daniel Pocock
  2012-05-11 13:54             ` Pierre Beck
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Pocock @ 2012-05-11  8:31 UTC (permalink / raw)
  To: stan; +Cc: Phil Turmel, Marcus Sorensen, linux-raid



On 11/05/12 01:53, Stan Hoeppner wrote:
> On 5/10/2012 4:31 PM, Daniel Pocock wrote:
> 
>> Actually, the TLER term is mentioned elsewhere, for example the Adaptec
>> blog I came across
> 
> The term "Time-Limited Error Recovery" (TLER) was introduced to the
> world by Western Digital on August 3, 2004, almost 8 years ago.  They
> introduced the term in their press release announcing their (then) new
> RAID Edition (RE) serial ATA drives.
> 
> http://www.wdc.com/en/company/pressroom/releases.aspx?release=3f62e91b-288b-4852-8f6c-5abe507ec8dd
> 
> This term is exclusive to Western Digital Corporation.  It does not
> apply to any other vendors' hard drives, nor any other product of any
> kind.  It is not a general term for a function.  The general term for
> this function is called Error Recovery Control (ERC).  If anyone applies
> this term to any drive other than a WDC model, using it as a general
> term, then s/he is uninformed and using the term incorrectly.
> 
> I do not currently, nor have I ever, worked for WDC.  I simply hate
> marketing buzzwords, and hate even more people's misuse of such
> marketing buzzwords.

I agree, so I'll stop using that term for now

>> Economists often talk about price selectivity, e.g. the coffee shops
>> that charge an extra pound/euro/dollar for `organic' coffee.  Does it
>> really cost an extra pound to produce one teaspoon of coffee in an
>> organic way?  Of course not, it's just a gimmick to extract an extra
>> pound from people who won't lose any sleep over spending an extra pound.
> 
> Price gouging for gourmet coffee isn't an apt analogue of the disk drive
> business.

Actually, it is relevant

I realise many vendors are well-intentioned and really do give you the
extra things you pay for - but it should never be taken for granted.

That's why I ask questions about the drive hardware.

> 
>> I'm going the opposite direction, trying to move away from
>> cheap drives - but I don't want to invest heavily in something that is
>>
>> a) just a marketing gimmick
> 
> Enterprise drives aren't a marketing gimmick.  Some of the marketing
> language surrounding them is, but that's always the case with marketing.

There is also an `ambush' concept in marketing: if everyone is looking
for `enterprise' drives, then someone comes along and puts `enterprise'
stickers on a desktop drive for making a quick buck.  There are plenty
of examples of this in other domains.  So 9 out of 10 enterprise
products might really be what I want, but there may be 1 out of 10 that
is just a gimmick?

>>> Either that, or md raid is only used by hobbyists. ;)
>>>
>> Better a hobbyist running Linux than a professional running Windows with
>> fakeraid
> 
> Heheh, no doubt.
> 
> For those who don't grasp the tongue-in-cheek nature of it, my stating
> "md raid is only used by hobbyists" is obviously not a literal face
> value statement.  There are dozens of enterprise NAS products on the
> market that ship using md raid, and there are plenty of enterprise
> size/caliber IT shops that use mdraid, though probably not exclusively.
>  There are many more that use md raid striping or concatenation to
> stitch together multiple hardware RAID logical drives and/or SAN LUNs.

There is a whole world between enterprise and hobbyist

Think about:
- home users who want to make an extra effort to protect their digital
photo/video collection (RAID is no substitute for backup of course)
- small businesses - 30% of employment is in small business.  For
businesses with less than 10 staff, they really do look at IT costs
closely.  When I was a student I spent a lot of time selling Linux
solutions to such businesses, now I work with larger enterprises.  I had
one client who was offline for half a day having their server disk
replaced, when I offered them a hot swap solution they said it wasn't
worth the money, they could survive without email and catch up on some
paperwork for 2-3 hours as long as it didn't happen every other month.
- other budget-critical users: research, education, health care, etc.
They often work within fixed budgets, and if they can get an acceptable
IT facility for less money, the cash they save goes elsewhere (or buys
more drive space for example)

> The statement is also meant to poke the ribs of the pure hobbyist users
> in an attempt to get them more into an enterprise way of approaching
> RAID implementation and management.
> 
There are many things in Linux that are not exactly the way things are
expected to be in an enterprise product: but the same can be said for
Windows.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-11  8:31           ` Daniel Pocock
@ 2012-05-11 13:54             ` Pierre Beck
  0 siblings, 0 replies; 51+ messages in thread
From: Pierre Beck @ 2012-05-11 13:54 UTC (permalink / raw)
  To: linux-raid

I'd like to join the discussion here and contribute a few constructive 
thoughts I've had about timeout issues, as well as answer the original 
questions from Daniel.

The original questions:

- does Linux md RAID actively use the more advanced features of these
drives, e.g. to work around errors?

No. mdraid does not touch SCTERC. You have to set it yourself by
scripting / using smartctl or hdparm. The same goes for other settings
like noise reduction or power saving.
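
A rough best-effort version (just a sketch) would be:

# try to enable 7s ERC on every drive; drives without SCT ERC
# simply print a warning and keep their default behaviour
for dev in /dev/sd? ; do
        smartctl -l scterc,70,70 "$dev"
done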

- if a non-RAID SAS card is used, does it matter which card is chosen? 
Does md work equally well with all of them?

There's a nice post about available cards and software RAID in general: 
http://blog.zorinaq.com/?e=10

mdraid doesn't really care about the controller. Which card is best? The
one with the fastest SCSI resets. I have yet to see reset benchmarks,
though. On drive failure, you will want quick reset times because that's
how the current error handling fails a non-responding drive, and it will
suspend I/O on all drives attached to the controller until the reset
completes. SSDs are another story - IOPS are easily limited by the
controller.

- ignoring the better MTBF and seek times of these drives, do any of the
other features passively contribute to a better RAID experience when
using md?

10k+ RPM drives are built with less surface area (the platters are
smaller), a stiffer servo, etc. - all in all trying to make the RAID
experience one where you never have to replace a drive.

7k RPM drives are near-line SAS / enterprise SATA and built mechanically 
the same as desktop. Different board for SAS, only different firmware 
for SATA. Anything beyond that would surprise me.

Apart from that, vendors play cat and mouse with the ERC timeout
feature. Enterprise-level drives should always advertise and adhere to
the smartctl SCTERC setting.

- for someone using SAS or enterprise SATA drives with Linux, is there
any particular benefit to using md RAID, dmraid or filesystem (e.g.
btrfs) RAID (apart from the btrfs having checksums)?

dmraid is IMHO only a quick solution for fakeraid, not something I'd
rely on in a server. mdraid has monitoring, media error handling, and
write-intent bitmaps. btrfs has advantages like a faster integrity check
(it scans only used blocks) and no initial sync, but it may explode when
used - real-life testing is limited. So for now, mdraid is the only
choice in my opinion.
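
For reference, a write-intent bitmap can be added to an existing array
after the fact, which is what makes the quick re-add ideas below cheap:

mdadm --grow --bitmap=internal /dev/md0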

Since this thread also touched timeouts ...

Right now, Linux Software RAID prioritizes data / parity integrity above
everything else, which isn't a bad thing to do. If a request submitted 
to a drive takes minutes to complete, it waits patiently, because after 
ERC timeout, all but the requested sector tend to be intact and protect 
data on other drives. The bad sector is repaired by writing recovered 
data to it.

(SCSI timeout *before* ERC timeout is an unfortunate misconfiguration in 
that context and can be alleviated by increasing the SCSI timeout to a 
higher value than expected ERC timeout, or lowering ERC timeout if that 
is available on the drive)
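
Raising the SCSI timeout can be done per device through sysfs, roughly:

# value in seconds; the default is normally 30
for t in /sys/block/sd?/device/timeout ; do
        echo 180 > "$t"
done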

Of course, having a database or website stall I/O for a few minutes or 
more (if several bad sectors are found) is less than desirable.

How to avoid that?

The first option would be to work the SCSI layer error handling to be 
less aggressive (there's a controller reset in there!) and behave well 
in a 1 second or less timeout configuration. mdraid would get an error 
reply and kick the drive from the array soon, because ERC on the drive 
is still stalling the drive and any read / write to the drive would not 
complete within the SCSI timeout. Write-intent bitmaps to the rescue: 
Script some daemon to check when drive is done stalling, re-add to 
array, sync fast.
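
The re-add step itself is already easy to script, something along these
lines (a very rough sketch; the device names are assumptions):

# once the drive responds again, re-add it; with a write-intent
# bitmap only the blocks dirtied in the meantime are resynced
if dd if=/dev/sdb of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null ; then
        mdadm /dev/md0 --re-add /dev/sdb1
fi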

Issues with that option: I have no idea whether ultra low SCSI timeouts 
are practical. There might be non-I/O commands which should be excluded 
from that timeout. Or maybe it can't be implemented the way I think 
about it. Some dev from the SCSI layer may be able to answer that. Also, 
spare drives would have to be removed from the array, because otherwise 
right after kicking the timed out drive, a spare would replace it, 
rendering re-add impossible. mdraid could implement a user defined delay 
for spare activation to mitigate this.

The second option would be timeouts in mdraid, with a lot of associated 
work and a new drive state between failed and online. A drive that has 
outstanding I/O but timed out would be kept in the array in "indisposed" 
state and cared for in the background. All outstanding read I/O would be 
duplicated, then redirected to online drives / recovered from parity. 
All write I/O would miss the stalled drive. The queue on the stalled 
drive would be completed (with or without errors) and bad sectors should 
be repaired just like it was online. On queue completion, internally 
re-add the drive and resync fast with write-intent bitmap. If drive 
fails to recover, activate spare.

Either option would make all drives (both desktop and enterprise) much 
less of a pain on URE and maybe other temporary disconnects.

My 2ct,

Pierre Beck

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-11  7:10           ` David Brown
  2012-05-11  8:16             ` Daniel Pocock
@ 2012-05-11 22:17             ` Stan Hoeppner
  1 sibling, 0 replies; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-11 22:17 UTC (permalink / raw)
  To: David Brown
  Cc: Roberto Spadim, Phil Turmel, Marcus Sorensen, Daniel Pocock, linux-raid

On 5/11/2012 2:10 AM, David Brown wrote:

> Also if you've got a more serious hardware
> with BBWC or similar features, then these features may be the deciding
> points.

md RAID is used with BBWC raid controllers, both PCIe and SAN heads,
fairly widely.  I've discussed the benefits of such setups on this list
many times.  It's not an either/or decision.

> But there is no doubt that md raid is a lot more flexible than any other
> raid system, 

This is simply not true, not in the wholesale fashion you state.  For
many things md raid is more flexible.  For others definitely not.
Bootable arrays being one very important one.  The md raid/grub solution
is far too complicated, cumbersome, and unreliable.  A relatively low
performance and cheap 2/4 port real SATA raid card or real mobo based
raid such as an LSI SAS2008 is a far superior mirrored boot disk
solution, with a straight SAS/SATA multiport HBA with md managing the
data array.

> it is often faster (especially for smaller setups -
> raid10,far being a prime example), 

You need a lot of qualifiers embedded in that statement.  A decent raid
card w/ small drive count array will run circles around md raid w/ a
random write or streaming workload.  It may be slightly slower in a
streaming read workload compared to the 'optimized' md raid "10" layouts.
 Where hardware raid usually starts trailing md raid is with parity
arrays on large drive counts, starting at around 8-16 drives and up.

> and the money you save on raid cards
> can be spent on extra disks, UPS, etc.

Relatively speaking, in overall system cost terms, RAID HBAs aren't that
much more expensive than standard HBAs.  In the case of LSI, $240 vs
$480 for 8 port cards.  The cost is double, but the total cost of the 8
drives we'll connect is $3200-$5000.  That extra $250 is negligible in
the overall picture.

> One thing that may be an advantage either way is ease of configuration,
> monitoring, maintenance, and transfer of disks between systems.  With md
> raid, you have a consistent system that is independent of the hardware
> and setup, while every hardware raid system has its own proprietary
> tools, setup, hardware, monitoring software, etc.  So this is often a
> win for md raid - but if you support several hardware raid arrays, and
> use the same vendor for them all, then you have a consistent system
> there too.

Corporations have used SNMP for a consistent monitoring interface across
heterogeneous platforms for over a decade, including servers, switches,
routers, PBXes, APs, security cameras, electronics entry access, etc.
Every decent hardware RAID card is designed for corporate use, and
includes SNMP support and a MIB file.  So from a monitoring standpoint,
I disagree with your statement above.  And who transfers drives between
running systems?

Regarding proprietary tools, most corporate setups will have mobo RAID
(Dell, HP, IBM) for the boot drives, and will have an FC/iSCSI HBA for
connecting to one or more SAN controllers.  Most corporate setups don't
involve local RAID based data storage.  The single overriding reason for
this is the pervasiveness of SAN based snapshot backups and remote site
mirroring from SAN to SAN.  md raid has no comparable capability.

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-11  8:16             ` Daniel Pocock
@ 2012-05-11 22:28               ` Stan Hoeppner
  2012-05-21 15:20                 ` CoolCold
  0 siblings, 1 reply; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-11 22:28 UTC (permalink / raw)
  To: Daniel Pocock
  Cc: David Brown, Roberto Spadim, Phil Turmel, Marcus Sorensen, linux-raid

On 5/11/2012 3:16 AM, Daniel Pocock wrote:

> My understanding of BBWC:
> 
> - for things like an NFS server, where the OS and disk hardware can't
> cache write data very aggressively (due to the contract between NFS
> client and server), the BBWC allows you to enable more aggressive
> behavior (e.g. setting barrier=0 on the filesystem level) and gain a
> speed boost

RAID cache works just like CPU L2 cache. In this case it's a buffer
between the system block layer and the physical storage.  It can speed
up reads and writes.  In the case of writes, it can be configured to
return success to the block layer before it actually writes the data to
disk, speeding up random write IO tremendously in many cases.

> - on the other hand, is BBWC universally effective?  E.g. does it just
> write to disk in a crash scenario, or only in the event of an outright
> power failure?  Does it depend on any hints from the OS or drivers to
> know about system state?

You're concentrating here specifically on the "B" aspect, which is the
battery.  It protects you from power failure and system crashes that
cause a reset of the motherboard.  When the card is back up and its
firmware is executing, it checks that the drives are all up, then
writes the data in the cache to the target sectors of the drives.  It's
worked this way for over a decade.

> This is my main point - the fact that md RAID is hardware independent,
> e.g. I can swap from HP to IBM servers and use the same disks.
> 
> If I wanted more than RAID1 (e.g. RAID6) maybe I would re-evaluate the
> issue, but for RAID1, a software solution seems fine.

That's the one scenario where I abhor using md raid, as I mentioned.  At
least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
is a great solution for many workloads.  Ask me why I say raid 1 + 0
instead of raid 10.

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-10 18:42         ` Daniel Pocock
  2012-05-10 19:09           ` Phil Turmel
@ 2012-05-21 14:19           ` Brian Candler
  2012-05-21 14:29             ` Phil Turmel
  1 sibling, 1 reply; 51+ messages in thread
From: Brian Candler @ 2012-05-21 14:19 UTC (permalink / raw)
  To: Daniel Pocock; +Cc: Phil Turmel, Marcus Sorensen, linux-raid

On Thu, May 10, 2012 at 06:42:13PM +0000, Daniel Pocock wrote:
> Ok, so I bought an enterprise grade drive, the WD RE4 (2TB) and I'm
> about to add it in place of the drive that failed.
> 
> I did a quick check with smartctl:
> 
> # smartctl -a /dev/sdb -l scterc
> ....
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)
> 
> so the TLER feature appears to be there.  I haven't tried changing it.
> 
> For my old Barracuda 7200.12 that is still working, I see this:
> 
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
> 
> and a diff between the full output for both drives reveals the following:
> 
> -SCT capabilities:             (0x103f) SCT Status supported.
> +SCT capabilities:             (0x303f) SCT Status supported.
>                                         SCT Error Recovery Control
> supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.

But FYI, the new Seagate Barracuda 3TB ST3000DM001 drives I have here do
*not* support this feature.  Has Seagate started crippling its
consumer-grade drives?

    root@storage3:~# smartctl -l scterc /dev/sdb 
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.3.4-030304-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    Warning: device does not support SCT Error Recovery Control command

    root@storage3:~# smartctl -l scterc,70,70 /dev/sdb 
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.3.4-030304-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    Warning: device does not support SCT Error Recovery Control command

    root@storage3:~# smartctl -i /dev/sdb -q noserial
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.3.4-030304-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model:     ST3000DM001-9YN166
    Firmware Version: CC4C
    User Capacity:    3,000,592,982,016 bytes [3.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Mon May 21 15:16:53 2012 BST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

The Hitachi Deskstar HDS5C3030ALA630 *does* support scterc.
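
For drives that do accept it, my understanding is that the setting is
volatile on most models and has to be reapplied after every power cycle -
a minimal sketch, assuming /dev/sd[abcd] are the array members:

    for d in /dev/sd[abcd]; do smartctl -l scterc,70,70 $d; done

and for drives like the ST3000DM001 that reject the command, the usual
fallback is to raise the kernel's per-device command timeout instead,
e.g. "echo 180 > /sys/block/sdb/device/timeout", so the drive's lengthy
internal retries can finish before the SCSI layer resets it.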

Regards,

Brian.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 14:19           ` Brian Candler
@ 2012-05-21 14:29             ` Phil Turmel
  2012-05-26 21:58               ` Stefan *St0fF* Huebner
  0 siblings, 1 reply; 51+ messages in thread
From: Phil Turmel @ 2012-05-21 14:29 UTC (permalink / raw)
  To: Brian Candler; +Cc: Daniel Pocock, Marcus Sorensen, linux-raid

On 05/21/2012 10:19 AM, Brian Candler wrote:

[trim /]

> But FYI, the new Seagate Barracuda 3TB ST3000DM001 drives I have here do
> *not* support this feature.  Has Seagate started crippling its
> consumer-grade drives?

*YES*

They started sometime before June of last year, when I purchased some
Barracuda Greens to upgrade from the Barracuda 7200.12 models I had been
using.

[trim /]

> The Hitachi Deskstar HDS5C3030ALA630 *does* support scterc.

If you look up-thread to my first reply on May 10th, I reported exactly
this phenomenon, and that Hitachi is the only remaining player still
supporting it in consumer-grade drives.  (That I've found.  I'd love to
discover that other players are changing their minds.)

> Regards,
> 
> Brian.

Regards, and my condolences on your purchase of the ST3000DM001 drives.

Phil.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-11 22:28               ` Stan Hoeppner
@ 2012-05-21 15:20                 ` CoolCold
  2012-05-21 18:51                   ` Stan Hoeppner
  0 siblings, 1 reply; 51+ messages in thread
From: CoolCold @ 2012-05-21 15:20 UTC (permalink / raw)
  To: stan
  Cc: Daniel Pocock, David Brown, Roberto Spadim, Phil Turmel,
	Marcus Sorensen, linux-raid

On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>
[snip]
> That's the one scenario where I abhor using md raid, as I mentioned.  At
> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
> is a great solution for many workloads.  Ask me why I say raid 1 + 0
> instead of raid 10.
So, I'm asking - why?

>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best regards,
[COOLCOLD-RIPN]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 15:20                 ` CoolCold
@ 2012-05-21 18:51                   ` Stan Hoeppner
  2012-05-21 18:54                     ` Roberto Spadim
  2012-05-21 23:34                     ` NeilBrown
  0 siblings, 2 replies; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-21 18:51 UTC (permalink / raw)
  To: CoolCold
  Cc: Daniel Pocock, David Brown, Roberto Spadim, Phil Turmel,
	Marcus Sorensen, linux-raid

On 5/21/2012 10:20 AM, CoolCold wrote:
> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>>
> [snip]
>> That's the one scenario where I abhor using md raid, as I mentioned.  At
>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
>> instead of raid 10.
> So, I'm asking - why?

Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
as a single kernel thread.  Thus when running heavy IO workloads across
many rust disks or a few SSDs, the md thread becomes CPU bound, as it
can only execute on a single core, just as with any other single thread.

This issue is becoming more relevant as folks move to the latest
generation of server CPUs that trade clock speed for higher core count.
 Imagine the surprise of the op who buys a dual socket box with 2x 16
core AMD Interlagos 2.0GHz CPUs, 256GB RAM, and 32 SSDs in md RAID 10,
only to find he can only get a tiny fraction of the SSD throughput.
Upon investigation he finds a single md thread peaking one core while
the rest are relatively idle but for the application itself.
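
(Easy to confirm, since the array's thread is visible by name - something
like:

    ps -eo pid,psr,pcpu,comm | grep raid

will show md0_raid10 or similar pinned near 100% of one core.)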

As I understand Neil's explanation, the md RAID 0 and linear code don't
run as separate kernel threads, but merely pass offsets to the block
layer, which is fully threaded.  Thus, by layering md RAID 0 over md
RAID 1 pairs, the striping load is spread over all cores.  Same with
linear, avoiding the single thread bottleneck.

This layering can be done with any md RAID level, creating RAID50s and
RAID60s, or concatenations of RAID5/6, as well as of RAID 10.
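
A sketch of the layered setup, with placeholder device names:

    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
    # stripe over the mirrors (or use --level=linear to concatenate)
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2

as opposed to a single "mdadm --create /dev/md0 --level=10" across all
four drives.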

And it shouldn't take anywhere near 32 modern SSDs to saturate a single
2GHz core with md RAID 10.  It's likely less than 8 SSDs, which yield
~400K IOPS, but I haven't done verification testing myself at this point.

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 18:51                   ` Stan Hoeppner
@ 2012-05-21 18:54                     ` Roberto Spadim
  2012-05-21 19:05                       ` Stan Hoeppner
  2012-05-21 23:34                     ` NeilBrown
  1 sibling, 1 reply; 51+ messages in thread
From: Roberto Spadim @ 2012-05-21 18:54 UTC (permalink / raw)
  To: stan
  Cc: CoolCold, Daniel Pocock, David Brown, Phil Turmel,
	Marcus Sorensen, linux-raid

hum, could anyone explain how a 'multi thread' version of raid1
could be implemented?
for example, how would it scale? and why would this new implementation
scale better?

2012/5/21 Stan Hoeppner <stan@hardwarefreak.com>:
> On 5/21/2012 10:20 AM, CoolCold wrote:
>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>>>
>> [snip]
>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
>>> instead of raid 10.
>> So, I'm asking - why?
>
> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
> as a single kernel thread.  Thus when running heavy IO workloads across
> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
> can only execute on a single core, just as with any other single thread.
>
> This issue is becoming more relevant as folks move to the latest
> generation of server CPUs that trade clock speed for higher core count.
>  Imagine the surprise of the op who buys a dual socket box with 2x 16
> core AMD Interlagos 2.0GHz CPUs, 256GB RAM, and 32 SSDs in md RAID 10,
> only to find he can only get a tiny fraction of the SSD throughput.
> Upon investigation he finds a single md thread peaking one core while
> the rest are relatively idle but for the application itself.
>
> As I understand Neil's explanation, the md RAID 0 and linear code don't
> run as separate kernel threads, but merely pass offsets to the block
> layer, which is fully threaded.  Thus, by layering md RAID 0 over md
> RAID 1 pairs, the striping load is spread over all cores.  Same with
> linear, avoiding the single thread bottleneck.
>
> This layering can be done with any md RAID level, creating RAID50s and
> RAID60s, or concatenations of RAID5/6, as well as of RAID 10.
>
> And it shouldn't take anywhere near 32 modern SSDs to saturate a single
> 2GHz core with md RAID 10.  It's likely less than 8 SSDs, which yield
> ~400K IOPS, but I haven't done verufication testing myself at this point.
>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 18:54                     ` Roberto Spadim
@ 2012-05-21 19:05                       ` Stan Hoeppner
  2012-05-21 19:38                         ` Roberto Spadim
  0 siblings, 1 reply; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-21 19:05 UTC (permalink / raw)
  To: Roberto Spadim
  Cc: CoolCold, Daniel Pocock, David Brown, Phil Turmel,
	Marcus Sorensen, linux-raid

On 5/21/2012 1:54 PM, Roberto Spadim wrote:
> hum, does anyone could explain what a 'multi thread' version of raid1
> could be implemented?
> for example, how to scale it? and why this new implementation could
> scale it better

I just explained that below.  You layer a stripe over many RAID 1 pairs.  A single
md RAID 1 pair isn't enough to saturate a single core so there is no
gain to be had by trying to thread the RAID 1 code.

-- 
Stan


> 2012/5/21 Stan Hoeppner <stan@hardwarefreak.com>:
>> On 5/21/2012 10:20 AM, CoolCold wrote:
>>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>>>>
>>> [snip]
>>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
>>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
>>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
>>>> instead of raid 10.
>>> So, I'm asking - why?
>>
>> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
>> as a single kernel thread.  Thus when running heavy IO workloads across
>> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
>> can only execute on a single core, just as with any other single thread.
>>
>> This issue is becoming more relevant as folks move to the latest
>> generation of server CPUs that trade clock speed for higher core count.
>>  Imagine the surprise of the op who buys a dual socket box with 2x 16
>> core AMD Interlagos 2.0GHz CPUs, 256GB RAM, and 32 SSDs in md RAID 10,
>> only to find he can only get a tiny fraction of the SSD throughput.
>> Upon investigation he finds a single md thread peaking one core while
>> the rest are relatively idle but for the application itself.
>>
>> As I understand Neil's explanation, the md RAID 0 and linear code don't
>> run as separate kernel threads, but merely pass offsets to the block
>> layer, which is fully threaded.  Thus, by layering md RAID 0 over md
>> RAID 1 pairs, the striping load is spread over all cores.  Same with
>> linear, avoiding the single thread bottleneck.
>>
>> This layering can be done with any md RAID level, creating RAID50s and
>> RAID60s, or concatenations of RAID5/6, as well as of RAID 10.
>>
>> And it shouldn't take anywhere near 32 modern SSDs to saturate a single
>> 2GHz core with md RAID 10.  It's likely less than 8 SSDs, which yield
>> ~400K IOPS, but I haven't done verufication testing myself at this point.
>>
>> --
>> Stan
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 19:05                       ` Stan Hoeppner
@ 2012-05-21 19:38                         ` Roberto Spadim
  0 siblings, 0 replies; 51+ messages in thread
From: Roberto Spadim @ 2012-05-21 19:38 UTC (permalink / raw)
  To: stan
  Cc: CoolCold, Daniel Pocock, David Brown, Phil Turmel,
	Marcus Sorensen, linux-raid

hum nice
are raid10 and raid0 single threaded too?

2012/5/21 Stan Hoeppner <stan@hardwarefreak.com>:
> On 5/21/2012 1:54 PM, Roberto Spadim wrote:
>> hum, does anyone could explain what a 'multi thread' version of raid1
>> could be implemented?
>> for example, how to scale it? and why this new implementation could
>> scale it better
>
> I just did below.  You layer a stripe over many RAID 1 pairs.  A single
> md RAID 1 pair isn't enough to saturate a single core so there is no
> gain to be had by trying to thread the RAID 1 code.
>
> --
> Stan
>
>
>> 2012/5/21 Stan Hoeppner <stan@hardwarefreak.com>:
>>> On 5/21/2012 10:20 AM, CoolCold wrote:
>>>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>>>>>
>>>> [snip]
>>>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
>>>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
>>>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
>>>>> instead of raid 10.
>>>> So, I'm asking - why?
>>>
>>> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
>>> as a single kernel thread.  Thus when running heavy IO workloads across
>>> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
>>> can only execute on a single core, just as with any other single thread.
>>>
>>> This issue is becoming more relevant as folks move to the latest
>>> generation of server CPUs that trade clock speed for higher core count.
>>>  Imagine the surprise of the op who buys a dual socket box with 2x 16
>>> core AMD Interlagos 2.0GHz CPUs, 256GB RAM, and 32 SSDs in md RAID 10,
>>> only to find he can only get a tiny fraction of the SSD throughput.
>>> Upon investigation he finds a single md thread peaking one core while
>>> the rest are relatively idle but for the application itself.
>>>
>>> As I understand Neil's explanation, the md RAID 0 and linear code don't
>>> run as separate kernel threads, but merely pass offsets to the block
>>> layer, which is fully threaded.  Thus, by layering md RAID 0 over md
>>> RAID 1 pairs, the striping load is spread over all cores.  Same with
>>> linear, avoiding the single thread bottleneck.
>>>
>>> This layering can be done with any md RAID level, creating RAID50s and
>>> RAID60s, or concatenations of RAID5/6, as well as of RAID 10.
>>>
>>> And it shouldn't take anywhere near 32 modern SSDs to saturate a single
>>> 2GHz core with md RAID 10.  It's likely less than 8 SSDs, which yield
>>> ~400K IOPS, but I haven't done verufication testing myself at this point.
>>>
>>> --
>>> Stan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 18:51                   ` Stan Hoeppner
  2012-05-21 18:54                     ` Roberto Spadim
@ 2012-05-21 23:34                     ` NeilBrown
  2012-05-22  6:36                       ` Stan Hoeppner
  1 sibling, 1 reply; 51+ messages in thread
From: NeilBrown @ 2012-05-21 23:34 UTC (permalink / raw)
  To: stan
  Cc: CoolCold, Daniel Pocock, David Brown, Roberto Spadim,
	Phil Turmel, Marcus Sorensen, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2819 bytes --]

On Mon, 21 May 2012 13:51:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 5/21/2012 10:20 AM, CoolCold wrote:
> > On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> >> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
> >>
> > [snip]
> >> That's the one scenario where I abhor using md raid, as I mentioned.  At
> >> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
> >> is a great solution for many workloads.  Ask me why I say raid 1 + 0
> >> instead of raid 10.
> > So, I'm asking - why?
> 
> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
> as a single kernel thread.  Thus when running heavy IO workloads across
> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
> can only execute on a single core, just as with any other single thread.

This is not the complete truth.

For RAID1 and RAID10, successful IO requests do not involve the kernel
thread, so the fact that there is only one should be irrelevant.
Failed requests are retried using the thread, and it is also involved in
resync/recovery, so those processes may be limited by the single thread.

RAID5/6 does not use the thread for read requests on a non-degraded array.
However all write requests go through the single thread so there could be
issues there.

Have you  actually measured md/raid10 being slower than raid0 over raid1?

I have a vague memory from when this came up before that there was some extra
issue that I was missing, but I cannot recall it just now....

NeilBrown


> 
> This issue is becoming more relevant as folks move to the latest
> generation of server CPUs that trade clock speed for higher core count.
>  Imagine the surprise of the op who buys a dual socket box with 2x 16
> core AMD Interlagos 2.0GHz CPUs, 256GB RAM, and 32 SSDs in md RAID 10,
> only to find he can only get a tiny fraction of the SSD throughput.
> Upon investigation he finds a single md thread peaking one core while
> the rest are relatively idle but for the application itself.
> 
> As I understand Neil's explanation, the md RAID 0 and linear code don't
> run as separate kernel threads, but merely pass offsets to the block
> layer, which is fully threaded.  Thus, by layering md RAID 0 over md
> RAID 1 pairs, the striping load is spread over all cores.  Same with
> linear, avoiding the single thread bottleneck.
> 
> This layering can be done with any md RAID level, creating RAID50s and
> RAID60s, or concatenations of RAID5/6, as well as of RAID 10.
> 
> And it shouldn't take anywhere near 32 modern SSDs to saturate a single
> 2GHz core with md RAID 10.  It's likely less than 8 SSDs, which yield
> ~400K IOPS, but I haven't done verufication testing myself at this point.
> 


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 23:34                     ` NeilBrown
@ 2012-05-22  6:36                       ` Stan Hoeppner
  2012-05-22  7:29                         ` David Brown
  2012-05-24  2:10                         ` NeilBrown
  0 siblings, 2 replies; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-22  6:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: CoolCold, Daniel Pocock, David Brown, Roberto Spadim,
	Phil Turmel, Marcus Sorensen, linux-raid

On 5/21/2012 6:34 PM, NeilBrown wrote:
> On Mon, 21 May 2012 13:51:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
> 
>> On 5/21/2012 10:20 AM, CoolCold wrote:
>>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>>>>
>>> [snip]
>>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
>>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
>>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
>>>> instead of raid 10.
>>> So, I'm asking - why?
>>
>> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
>> as a single kernel thread.  Thus when running heavy IO workloads across
>> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
>> can only execute on a single core, just as with any other single thread.
> 
> This is not the complete truth.

Yes, I should have stipulated only writes are limited to a single thread.

> For RAID1 and RAID10, successful IO requests do not involved the kernel
> thread, so the fact that there is only one should be irrelevant.
> Failed requests are retried using the thread and it is also involved it
> resync/recovery so those processes may be limited by the single thread.
> 
> RAID5/6 does not use the thread for read requests on a non-degraded array.
> However all write requests go through the single thread so there could be
> issues there.

Thanks for clarifying this.  In your previous response to this issue
(quoted and linked below) you included RAID 1/10 with RAID 5/6 WRT
writes going through a single thread.

> Have you  actually measured md/raid10 being slower than raid0 over raid1?

I personally have not, as I don't have access to the storage hardware
necessary to sink a sufficiently large write stream to peak a core with
the md thread.

> I have a vague memory from when this came up before that there was some extra
> issue that I was missing, but I cannot recall it just now....

We're recalling the same thread, which was many months ago.  Here's your
post:  http://marc.info/?l=linux-raid&m=132616899005148&w=2

And here's the relevant section upon which I was basing my recent
statements:

"I think you must be misremembering.  Neither RAID0 or Linear have any
threads involved.  They just redirect the request to the appropriate
devices.  Multiple threads can submit multiple requests down through
RAID0 and Linear concurrently.

RAID1, RAID10, and RAID5/6 are different.  For reads they normally are
have no contention with other requests, but for writes things do get
single-threaded at some point."

--Neil Brown


-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-22  6:36                       ` Stan Hoeppner
@ 2012-05-22  7:29                         ` David Brown
  2012-05-23 13:14                           ` Stan Hoeppner
  2012-05-24  2:10                         ` NeilBrown
  1 sibling, 1 reply; 51+ messages in thread
From: David Brown @ 2012-05-22  7:29 UTC (permalink / raw)
  To: stan
  Cc: NeilBrown, CoolCold, Daniel Pocock, Roberto Spadim, Phil Turmel,
	Marcus Sorensen, linux-raid

On 22/05/2012 08:36, Stan Hoeppner wrote:
> On 5/21/2012 6:34 PM, NeilBrown wrote:
>> On Mon, 21 May 2012 13:51:21 -0500 Stan Hoeppner<stan@hardwarefreak.com>
>> wrote:
>>
>>> On 5/21/2012 10:20 AM, CoolCold wrote:
>>>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner<stan@hardwarefreak.com>  wrote:
>>>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
>>>>>
>>>> [snip]
>>>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
>>>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
>>>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
>>>>> instead of raid 10.
>>>> So, I'm asking - why?
>>>
>>> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
>>> as a single kernel thread.  Thus when running heavy IO workloads across
>>> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
>>> can only execute on a single core, just as with any other single thread.
>>
>> This is not the complete truth.
>
> Yes, I should have stipulated only writes are limited to a single thread.
>
>> For RAID1 and RAID10, successful IO requests do not involved the kernel
>> thread, so the fact that there is only one should be irrelevant.
>> Failed requests are retried using the thread and it is also involved it
>> resync/recovery so those processes may be limited by the single thread.
>>
>> RAID5/6 does not use the thread for read requests on a non-degraded array.
>> However all write requests go through the single thread so there could be
>> issues there.
>
> Thanks for clarifying this.  In your previous response to this issue
> (quoted and linked below) you included RAID 1/10 with RAID 5/6 WRT
> writes going through a single thread.
>
>> Have you  actually measured md/raid10 being slower than raid0 over raid1?
>
> I personally have not, as I don't have access to the storage hardware
> necessary to sink a sufficiently large write stream to peak a core with
> the md thread.
>
>> I have a vague memory from when this came up before that there was some extra
>> issue that I was missing, but I cannot recall it just now....
>
> We're recalling the same thread, which was many months ago.  Here's your
> post:  http://marc.info/?l=linux-raid&m=132616899005148&w=2
>
> And here's the relevant section upon which I was basing my recent
> statements:
>
> "I think you must be misremembering.  Neither RAID0 or Linear have any
> threads involved.  They just redirect the request to the appropriate
> devices.  Multiple threads can submit multiple requests down through
> RAID0 and Linear concurrently.
>
> RAID1, RAID10, and RAID5/6 are different.  For reads they normally are
> have no contention with other requests, but for writes things do get
> single-threaded at some point."
>

I would think that even if writes to raid1 and raid10 do go through a 
single thread, it is unlikely to be a bottleneck - after all, it will 
mostly just pass the write on to the block layer for the 2 (or more) disks.

As for how much single-threading limits raid5/6 writes, it comes down to 
a balance between memory bandwidth and processor speed.  I would imagine 
that for calculating the simple XOR for raid5, the limit is how fast the 
data can get on and off the chip, rather than how fast a single thread 
can chew through it.  If that's the case, then having two threads doing 
the same thing on different blocks will not run any faster.  If you have 
more than one chip, however, you might have more memory bandwidth - and 
raid6 calculations as well as degraded array access involve more 
processing, and will then be cpu limited.

And if the single thread has to block (such as while waiting for reads 
during a partial stripe update on raid5 or raid6), then it could quickly 
become a bottleneck.

But in general, it's important to do some real-world testing to 
establish whether or not there really is a bottleneck here.  It is 
counter-productive for Stan (or anyone else) to advise against raid10 or 
raid5/6 because of a single-thread bottleneck if it doesn't actually 
slow things down in practice.  On the other hand, if it /is/ a hindrance to
scaling, then it is important for Neil and other experts to think about 
how to change the architecture of md raid to scale better.  And 
somewhere in between there can be guidelines to help users - something 
like "for an average server, single-threading will saturate raid5 
performance at 8 disks, raid6 performance at 6 disks, and raid10 at 10 
disks, beyond which you should use raid0 or linear striping over two or 
more arrays".

Of course, to do such testing, someone would need a big machine with 
lots of disks, which is not otherwise in use!

mvh.,

David

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-22  7:29                         ` David Brown
@ 2012-05-23 13:14                           ` Stan Hoeppner
  2012-05-23 13:27                             ` Roberto Spadim
  2012-05-23 19:49                             ` David Brown
  0 siblings, 2 replies; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-23 13:14 UTC (permalink / raw)
  To: David Brown
  Cc: NeilBrown, CoolCold, Daniel Pocock, Roberto Spadim, Phil Turmel,
	Marcus Sorensen, linux-raid

On 5/22/2012 2:29 AM, David Brown wrote:

> But in general, it's important to do some real-world testing to
> establish whether or not there really is a bottleneck here.  It is
> counter-productive for Stan (or anyone else) to advise against raid10 or
> raid5/6 because of a single-thread bottleneck if it doesn't actually
> slow things down in practice.  

Please reread precisely what I stated earlier:

"Neil pointed out quite some time ago that the md RAID 1/5/6/10 code
runs as a single kernel thread.  Thus when running heavy IO workloads
across many rust disks or a few SSDs, the md thread becomes CPU bound,
as it can only execute on a single core, just as with any other single
thread."

Note "heavy IO workloads".  The real world testing upon which I based my
recommendation is in this previous thread on linux-raid, of which I was
a participant.

Mark Delfman did the testing which revealed this md RAID thread
scalability problem using 4 PCIe enterprise SSDs:

http://marc.info/?l=linux-raid&m=131307849530290&w=2

> On the other hand, if it /is/ a hinder to
> scaling, then it is important for Neil and other experts to think about
> how to change the architecture of md raid to scale better.  And

More thorough testing and identification of the problem is definitely
required.  Apparently few people are currently running md RAID 1/5/6/10
across multiple ultra high performance SSDs, people who actually need
every single ounce of IOPS out of each device in the array.  But this
trend will increase.  I'd guess those currently building md 1/5/6/10
arrays w/ many SSDs simply don't *need* every ounce of IOPS, or more
would be complaining about single core thread limit already.

> somewhere in between there can be guidelines to help users - something
> like "for an average server, single-threading will saturate raid5
> performance at 8 disks, raid6 performance at 6 disks, and raid10 at 10
> disks, beyond which you should use raid0 or linear striping over two or
> more arrays".

This isn't feasible due to the myriad possible combinations of hardware.
 And you simply won't see this problem with SRDs (spinning rust disks)
until you have hundreds of them in a single array.  It requires over 200
15K SRDs in RAID 10 to generate only 30K random IOPS.  Just about any
single x86 core can handle that, probably even a 1.6GHz Atom.  This
issue mainly affects SSD arrays, where even 8 midrange consumer SATA3
SSDs in RAID 10 can generate over 400K IOPS, 200K real and 200K mirror data.
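(Back of the envelope: figure roughly 150 random IOPS per 15K drive, so
200 x 150 = 30K, versus roughly 50K per midrange SSD, so 8 x 50K = 400K.)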

> Of course, to do such testing, someone would need a big machine with
> lots of disks, which is not otherwise in use!

Shouldn't require anything that heavy.  I would guess that one should be
able to reveal the thread bottleneck with a low freq dual core desktop
system with an HBA such as the LSI 9211-8i @320K IOPS, and 8 Sandforce
2200 based SSDs @40K write IOPS each.
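
If someone wants to try it, something like the following should do - the
fio parameters are only a starting point, and /dev/md0 must be a scratch
array, since this writes to the device directly:

    fio --name=mdtest --filename=/dev/md0 --rw=randwrite --bs=4k \
        --iodepth=64 --numjobs=4 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based

while watching whether the mdX_raidN kernel thread pegs a core in top.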

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-23 13:14                           ` Stan Hoeppner
@ 2012-05-23 13:27                             ` Roberto Spadim
  2012-05-23 19:49                             ` David Brown
  1 sibling, 0 replies; 51+ messages in thread
From: Roberto Spadim @ 2012-05-23 13:27 UTC (permalink / raw)
  To: stan
  Cc: David Brown, NeilBrown, CoolCold, Daniel Pocock, Phil Turmel,
	Marcus Sorensen, linux-raid

just to understand... i didn't think about an implementation yet...

what could be done to 'multi thread' md raid1,10,5,6?

i didn't understand why it is a problem, i think that the only cpu
time it needs is the time to tell which disk and which position must
be read for each i/o request

i'm just thinking about the normal read/write path without resync, check,
bad reads/writes, or other management features running


2012/5/23 Stan Hoeppner <stan@hardwarefreak.com>:
> On 5/22/2012 2:29 AM, David Brown wrote:
>
>> But in general, it's important to do some real-world testing to
>> establish whether or not there really is a bottleneck here.  It is
>> counter-productive for Stan (or anyone else) to advise against raid10 or
>> raid5/6 because of a single-thread bottleneck if it doesn't actually
>> slow things down in practice.
>
> Please reread precisely what I stated earlier:
>
> "Neil pointed out quite some time ago that the md RAID 1/5/6/10 code
> runs as a single kernel thread.  Thus when running heavy IO workloads
> across many rust disks or a few SSDs, the md thread becomes CPU bound,
> as it can only execute on a single core, just as with any other single
> thread."
>
> Note "heavy IO workloads".  The real world testing upon which I based my
> recommendation is in this previous thread on linux-raid, of which I was
> a participant.
>
> Mark Delfman did the testing which revealed this md RAID thread
> scalability problem using 4 PCIe enterprise SSDs:
>
> http://marc.info/?l=linux-raid&m=131307849530290&w=2
>
>> On the other hand, if it /is/ a hinder to
>> scaling, then it is important for Neil and other experts to think about
>> how to change the architecture of md raid to scale better.  And
>
> More thorough testing and identification of the problem is definitely
> required.  Apparently few people are currently running md RAID 1/5/6/10
> across multiple ultra high performance SSDs, people who actually need
> every single ounce of IOPS out of each device in the array.  But this
> trend will increase.  I'd guess those currently building md 1/5/6/10
> arrays w/ many SSDs simply don't *need* every ounce of IOPS, or more
> would be complaining about single core thread limit already.
>
>> somewhere in between there can be guidelines to help users - something
>> like "for an average server, single-threading will saturate raid5
>> performance at 8 disks, raid6 performance at 6 disks, and raid10 at 10
>> disks, beyond which you should use raid0 or linear striping over two or
>> more arrays".
>
> This isn't feasible due to the myriad possible combinations of hardware.
>  And you simply won't see this problem with SRDs (spinning rust disks)
> until you have hundreds of them in a single array.  It requires over 200
> 15K SRDs in RAID 10 to generate only 30K random IOPS.  Just about any
> single x86 core can handle that, probably even a 1.6GHz Atom.  This
> issue mainly affects SSD arrays, where even 8 midrange consumer SATA3
> SSDs in RAID 10 can generate over 400K IOPS, 200K real and 200K mirror data.
>
>> Of course, to do such testing, someone would need a big machine with
>> lots of disks, which is not otherwise in use!
>
> Shouldn't require anything that heavy.  I would guess that one should be
> able to reveal the thread bottleneck with a low freq dual core desktop
> system with an HBA such as the LSI 9211-8i @320K IOPS, and 8 Sandforce
> 2200 based SSDs @40K write IOPS each.
>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-23 13:14                           ` Stan Hoeppner
  2012-05-23 13:27                             ` Roberto Spadim
@ 2012-05-23 19:49                             ` David Brown
  2012-05-23 23:46                               ` Stan Hoeppner
  1 sibling, 1 reply; 51+ messages in thread
From: David Brown @ 2012-05-23 19:49 UTC (permalink / raw)
  To: stan
  Cc: NeilBrown, CoolCold, Daniel Pocock, Roberto Spadim, Phil Turmel,
	Marcus Sorensen, linux-raid

On 23/05/12 15:14, Stan Hoeppner wrote:
> On 5/22/2012 2:29 AM, David Brown wrote:
>
>> But in general, it's important to do some real-world testing to
>> establish whether or not there really is a bottleneck here.  It is
>> counter-productive for Stan (or anyone else) to advise against raid10 or
>> raid5/6 because of a single-thread bottleneck if it doesn't actually
>> slow things down in practice.
>
> Please reread precisely what I stated earlier:
>
> "Neil pointed out quite some time ago that the md RAID 1/5/6/10 code
> runs as a single kernel thread.  Thus when running heavy IO workloads
> across many rust disks or a few SSDs, the md thread becomes CPU bound,
> as it can only execute on a single core, just as with any other single
> thread."
>
> Note "heavy IO workloads".  The real world testing upon which I based my
> recommendation is in this previous thread on linux-raid, of which I was
> a participant.
>
> Mark Delfman did the testing which revealed this md RAID thread
> scalability problem using 4 PCIe enterprise SSDs:
>
> http://marc.info/?l=linux-raid&m=131307849530290&w=2
>
>> On the other hand, if it /is/ a hinder to
>> scaling, then it is important for Neil and other experts to think about
>> how to change the architecture of md raid to scale better.  And
>
> More thorough testing and identification of the problem is definitely
> required.  Apparently few people are currently running md RAID 1/5/6/10
> across multiple ultra high performance SSDs, people who actually need
> every single ounce of IOPS out of each device in the array.  But this
> trend will increase.  I'd guess those currently building md 1/5/6/10
> arrays w/ many SSDs simply don't *need* every ounce of IOPS, or more
> would be complaining about single core thread limit already.
>
>> somewhere in between there can be guidelines to help users - something
>> like "for an average server, single-threading will saturate raid5
>> performance at 8 disks, raid6 performance at 6 disks, and raid10 at 10
>> disks, beyond which you should use raid0 or linear striping over two or
>> more arrays".
>
> This isn't feasible due to the myriad possible combinations of hardware.
>   And you simply won't see this problem with SRDs (spinning rust disks)
> until you have hundreds of them in a single array.  It requires over 200
> 15K SRDs in RAID 10 to generate only 30K random IOPS.  Just about any
> single x86 core can handle that, probably even a 1.6GHz Atom.  This
> issue mainly affects SSD arrays, where even 8 midrange consumer SATA3
> SSDs in RAID 10 can generate over 400K IOPS, 200K real and 200K mirror data.
>
>> Of course, to do such testing, someone would need a big machine with
>> lots of disks, which is not otherwise in use!
>
> Shouldn't require anything that heavy.  I would guess that one should be
> able to reveal the thread bottleneck with a low freq dual core desktop
> system with an HBA such as the LSI 9211-8i @320K IOPS, and 8 Sandforce
> 2200 based SSDs @40K write IOPS each.
>

It looks like Shaohua Li has done some testing, found that there is a 
slow-down even with just 2 or 4 disks, and has written patches to fix it 
(for raid1 and raid10 so far), which is very nice.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-23 19:49                             ` David Brown
@ 2012-05-23 23:46                               ` Stan Hoeppner
  2012-05-24  1:18                                 ` Stan Hoeppner
  0 siblings, 1 reply; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-23 23:46 UTC (permalink / raw)
  To: David Brown
  Cc: NeilBrown, CoolCold, Daniel Pocock, Roberto Spadim, Phil Turmel,
	Marcus Sorensen, linux-raid

On 5/23/2012 2:49 PM, David Brown wrote:

> It looks like Shaohua Li has done some testing, found that there is a
> slow-down even with just 2 or 4 disks, and has written patches to fix it
> (for raid1 and raid10 so far), which is very nice.

Ahh, thanks David.  I wasn't aware of Shaohua Li's
work on this.  Got a link to his documentation and patches, or article,
by chance?

-- 
Stan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-23 23:46                               ` Stan Hoeppner
@ 2012-05-24  1:18                                 ` Stan Hoeppner
  2012-05-24  2:08                                   ` NeilBrown
  0 siblings, 1 reply; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-24  1:18 UTC (permalink / raw)
  To: stan
  Cc: David Brown, NeilBrown, CoolCold, Daniel Pocock, Roberto Spadim,
	Phil Turmel, Marcus Sorensen, linux-raid

On 5/23/2012 6:46 PM, Stan Hoeppner wrote:
> On 5/23/2012 2:49 PM, David Brown wrote:
> 
>> It looks like Shaohua Li has done some testing, found that there is a
>> slow-down even with just 2 or 4 disks, and has written patches to fix it
>> (for raid1 and raid10 so far), which is very nice.
> 
> Ahh, thanks David.  I wasn't aware of Shaohua Li's
> work on this.  Got a link to his documentation and patches, or article,
> by chance?

My Google searches only turn up Shaohua Li's TRIM patches.  I don't
believe the issue we're discussing has anything to do with TRIM, though
I can't be 100% sure as Mark didn't provide much detail of his thread
ceiling problem, and I don't have a suitable system to do proper testing.

Again it would be nice to have a link to Shaohua Li's work to which you
refer to determine if it's applicable to this issue.

-- 
Stan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-24  1:18                                 ` Stan Hoeppner
@ 2012-05-24  2:08                                   ` NeilBrown
  2012-05-24  6:16                                     ` Stan Hoeppner
  0 siblings, 1 reply; 51+ messages in thread
From: NeilBrown @ 2012-05-24  2:08 UTC (permalink / raw)
  To: stan
  Cc: David Brown, CoolCold, Daniel Pocock, Roberto Spadim,
	Phil Turmel, Marcus Sorensen, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1085 bytes --]

On Wed, 23 May 2012 20:18:06 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 5/23/2012 6:46 PM, Stan Hoeppner wrote:
> > On 5/23/2012 2:49 PM, David Brown wrote:
> > 
> >> It looks like Shaohua Li has done some testing, found that there is a
> >> slow-down even with just 2 or 4 disks, and has written patches to fix it
> >> (for raid1 and raid10 so far), which is very nice.
> > 
> > Ahh, thanks David.  I wasn't aware of Shaohua Li's
> > work on this.  Got a link to his documentation and patches, or article,
> > by chance?
> 
> My Google searches only turn up Shaohua Li's TRIM patches.  I don't
> believe the issue we're discussing has anything to do with TRIM, though
> I can't be 100% sure as Mark didn't provide much detail of his thread
> ceiling problem, and I don't have a suitable system to do proper testing.
> 
> Again it would be nice to have a link to Shaohua Li's work to which you
> refer to determine if it's applicable to this issue.
> 

It's probably in your INBOX.

http://www.spinics.net/lists/raid/msg38899.html

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-22  6:36                       ` Stan Hoeppner
  2012-05-22  7:29                         ` David Brown
@ 2012-05-24  2:10                         ` NeilBrown
  2012-05-24  2:55                           ` Roberto Spadim
  1 sibling, 1 reply; 51+ messages in thread
From: NeilBrown @ 2012-05-24  2:10 UTC (permalink / raw)
  To: stan
  Cc: CoolCold, Daniel Pocock, David Brown, Roberto Spadim,
	Phil Turmel, Marcus Sorensen, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3158 bytes --]

On Tue, 22 May 2012 01:36:06 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 5/21/2012 6:34 PM, NeilBrown wrote:
> > On Mon, 21 May 2012 13:51:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> > wrote:
> > 
> >> On 5/21/2012 10:20 AM, CoolCold wrote:
> >>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> >>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
> >>>>
> >>> [snip]
> >>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
> >>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
> >>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
> >>>> instead of raid 10.
> >>> So, I'm asking - why?
> >>
> >> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
> >> as a single kernel thread.  Thus when running heavy IO workloads across
> >> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
> >> can only execute on a single core, just as with any other single thread.
> > 
> > This is not the complete truth.
> 
> Yes, I should have stipulated only writes are limited to a single thread.
> 
> > For RAID1 and RAID10, successful IO requests do not involved the kernel
> > thread, so the fact that there is only one should be irrelevant.
> > Failed requests are retried using the thread and it is also involved it
> > resync/recovery so those processes may be limited by the single thread.
> > 
> > RAID5/6 does not use the thread for read requests on a non-degraded array.
> > However all write requests go through the single thread so there could be
> > issues there.
> 
> Thanks for clarifying this.  In your previous response to this issue
> (quoted and linked below) you included RAID 1/10 with RAID 5/6 WRT
> writes going through a single thread.
> 
> > Have you  actually measured md/raid10 being slower than raid0 over raid1?
> 
> I personally have not, as I don't have access to the storage hardware
> necessary to sink a sufficiently large write stream to peak a core with
> the md thread.
> 
> > I have a vague memory from when this came up before that there was some extra
> > issue that I was missing, but I cannot recall it just now....
> 
> We're recalling the same thread, which was many months ago.  Here's your
> post:  http://marc.info/?l=linux-raid&m=132616899005148&w=2
> 
> And here's the relevant section upon which I was basing my recent
> statements:
> 
> "I think you must be misremembering.  Neither RAID0 or Linear have any
> threads involved.  They just redirect the request to the appropriate
> devices.  Multiple threads can submit multiple requests down through
> RAID0 and Linear concurrently.
> 
> RAID1, RAID10, and RAID5/6 are different.  For reads they normally are
> have no contention with other requests, but for writes things do get
> single-threaded at some point."
> 

That's right - I keep forgetting about the single-threading caused by needing
to sync with bitmap updates.

So reads are fully parallel.  Writes are - for now - serialised to a single
processor for handling.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-24  2:10                         ` NeilBrown
@ 2012-05-24  2:55                           ` Roberto Spadim
  0 siblings, 0 replies; 51+ messages in thread
From: Roberto Spadim @ 2012-05-24  2:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: stan, CoolCold, Daniel Pocock, David Brown, Phil Turmel,
	Marcus Sorensen, linux-raid

nice, so...
writes are single threaded and reads are multi threaded
any performance tests / results?

2012/5/23 NeilBrown <neilb@suse.de>
>
> On Tue, 22 May 2012 01:36:06 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
> > On 5/21/2012 6:34 PM, NeilBrown wrote:
> > > On Mon, 21 May 2012 13:51:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> > > wrote:
> > >
> > >> On 5/21/2012 10:20 AM, CoolCold wrote:
> > >>> On Sat, May 12, 2012 at 2:28 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> > >>>> On 5/11/2012 3:16 AM, Daniel Pocock wrote:
> > >>>>
> > >>> [snip]
> > >>>> That's the one scenario where I abhor using md raid, as I mentioned.  At
> > >>>> least, a boot raid 1 pair.  Using layered md raid 1 + 0, or 1 + linear
> > >>>> is a great solution for many workloads.  Ask me why I say raid 1 + 0
> > >>>> instead of raid 10.
> > >>> So, I'm asking - why?
> > >>
> > >> Neil pointed out quite some time ago that the md RAID 1/5/6/10 code runs
> > >> as a single kernel thread.  Thus when running heavy IO workloads across
> > >> many rust disks or a few SSDs, the md thread becomes CPU bound, as it
> > >> can only execute on a single core, just as with any other single thread.
> > >
> > > This is not the complete truth.
> >
> > Yes, I should have stipulated only writes are limited to a single thread.
> >
> > > For RAID1 and RAID10, successful IO requests do not involved the kernel
> > > thread, so the fact that there is only one should be irrelevant.
> > > Failed requests are retried using the thread and it is also involved it
> > > resync/recovery so those processes may be limited by the single thread.
> > >
> > > RAID5/6 does not use the thread for read requests on a non-degraded array.
> > > However all write requests go through the single thread so there could be
> > > issues there.
> >
> > Thanks for clarifying this.  In your previous response to this issue
> > (quoted and linked below) you included RAID 1/10 with RAID 5/6 WRT
> > writes going through a single thread.
> >
> > > Have you  actually measured md/raid10 being slower than raid0 over raid1?
> >
> > I personally have not, as I don't have access to the storage hardware
> > necessary to sink a sufficiently large write stream to peak a core with
> > the md thread.
> >
> > > I have a vague memory from when this came up before that there was some extra
> > > issue that I was missing, but I cannot recall it just now....
> >
> > We're recalling the same thread, which was many months ago.  Here's your
> > post:  http://marc.info/?l=linux-raid&m=132616899005148&w=2
> >
> > And here's the relevant section upon which I was basing my recent
> > statements:
> >
> > "I think you must be misremembering.  Neither RAID0 or Linear have any
> > threads involved.  They just redirect the request to the appropriate
> > devices.  Multiple threads can submit multiple requests down through
> > RAID0 and Linear concurrently.
> >
> > RAID1, RAID10, and RAID5/6 are different.  For reads they normally are
> > have no contention with other requests, but for writes things do get
> > single-threaded at some point."
> >
>
> That's right - I keep forgetting about the single-threading caused by needing
> to sync with bitmap updates.
>
> So reads a fully parallel.  Writes are - for now - serialised to a single
> processor for handling.
>
> Thanks,
> NeilBrown




--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-24  2:08                                   ` NeilBrown
@ 2012-05-24  6:16                                     ` Stan Hoeppner
  0 siblings, 0 replies; 51+ messages in thread
From: Stan Hoeppner @ 2012-05-24  6:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Brown, CoolCold, Daniel Pocock, Roberto Spadim,
	Phil Turmel, Marcus Sorensen, linux-raid

On 5/23/2012 9:08 PM, NeilBrown wrote:
> On Wed, 23 May 2012 20:18:06 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
> 
>> On 5/23/2012 6:46 PM, Stan Hoeppner wrote:
>>> On 5/23/2012 2:49 PM, David Brown wrote:
>>>
>>>> It looks like Shaohua Li has done some testing, found that there is a
>>>> slow-down even with just 2 or 4 disks, and has written patches to fix it
>>>> (for raid1 and raid10 so far), which is very nice.
>>>
>>> Ahh, thanks David.  I wasn't aware of Shaohua Li's
>>> work on this.  Got a link to his documentation and patches, or article,
>>> by chance?
>>
>> My Google searches only turn up Shaohua Li's TRIM patches.  I don't
>> believe the issue we're discussing has anything to do with TRIM, though
>> I can't be 100% sure as Mark didn't provide much detail of his thread
>> ceiling problem, and I don't have a suitable system to do proper testing.
>>
>> Again it would be nice to have a link to Shaohua Li's work to which you
>> refer to determine if it's applicable to this issue.
>>
> 
> It's probably in your INBOX.
> 
> http://www.spinics.net/lists/raid/msg38899.html
> 
> NeilBrown

Indeed they are.  Well, in a sort folder actually.  I still shouldn't
have missed them.  Please feel free to plant virtual palms to my
forehead if you like. :)

-- 
Stan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
  2012-05-21 14:29             ` Phil Turmel
@ 2012-05-26 21:58               ` Stefan *St0fF* Huebner
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan *St0fF* Huebner @ 2012-05-26 21:58 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Brian Candler, Daniel Pocock, Marcus Sorensen, linux-raid

On 21.05.2012 16:29, Phil Turmel wrote:
> On 05/21/2012 10:19 AM, Brian Candler wrote:
>
> [trim /]
>
>> But FYI, the new Seagate Barracuda 3TB ST3000DM001 drives I have here do
>> *not* support this feature.  Has Seagate started crippling its
>> consumer-grade drives?
> *YES*
>
> They started sometime before June of last year, when I purchased some
> Barracuda Greens to upgrade from the Barracuda 7200.12 models I had been
> using.
>
> [trim /]
Even worse is WD: they started to remove SCT ERC around November 2009 - 
that was the time we went to Hitachi...
>
>> The Hitachi Deskstar HDS5C3030ALA630 *does* support scterc.
> If you look up-thread to my first reply on May 10th, I reported exactly
> this phenomenon, and that Hitachi is the only remaining player still
> supporting it in consumer-grade drives.  (That I've found.  I'd love to
> discover that other players are changing their minds.)
>
>> Regards,
>>
>> Brian.
> Regards, and my condolences on your purchase of the ST3000DM001 drives.
Mine too,
Stefan
>
> Phil.
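
(A practical footnote for anyone reading this in the archive: checking a
drive's SCT ERC support, and the usual fallback when the feature has been
removed, looks roughly like the following.  /dev/sdX is a placeholder, and
70 -- i.e. 7.0 seconds -- is just the conventional value, not anything md
itself requires.)

    # Query the drive's current SCT ERC setting; drives with the feature
    # stripped out will report the command as unsupported:
    smartctl -l scterc /dev/sdX

    # On drives that do support it, cap read/write error recovery at
    # 7.0 seconds (values are in tenths of a second).  The setting is
    # lost on power cycle on many models, so it normally goes in a boot
    # script:
    smartctl -l scterc,70,70 /dev/sdX

    # If SCT ERC is not supported at all, the common workaround is to
    # raise the kernel's per-device command timeout well above the
    # drive's worst-case internal retry time, so md sees a read error
    # instead of the whole device being kicked out of the array:
    echo 180 > /sys/block/sdX/device/timeout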


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: md RAID with enterprise-class SATA or SAS drives
@ 2012-05-10  1:29 Richard Scobie
  0 siblings, 0 replies; 51+ messages in thread
From: Richard Scobie @ 2012-05-10  1:29 UTC (permalink / raw)
  To: Linux RAID Mailing List

Daniel Pocock <daniel@pocock.com.au> wrote:

 > ignoring the better MTBF and seek times of these drives, do any of the
 > other features passively contribute to a better RAID experience when
 > using md?

An important one for large arrays is the ability to tolerate the chassis 
vibration/resonance produced by adjacent drives, which otherwise has a 
marked effect on seek times.  This feature is called RAFF (Rotary 
Acceleration Feed Forward) in Western Digital RAID Edition drives, and is 
not available in desktop drives.

Regards,

Richard

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2012-05-26 21:58 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-09 22:00 md RAID with enterprise-class SATA or SAS drives Daniel Pocock
2012-05-09 22:33 ` Marcus Sorensen
2012-05-10 13:34   ` Daniel Pocock
2012-05-10 13:51   ` Phil Turmel
2012-05-10 14:59     ` Daniel Pocock
2012-05-10 15:15       ` Phil Turmel
2012-05-10 15:26     ` Marcus Sorensen
2012-05-10 16:04       ` Phil Turmel
2012-05-10 17:53         ` Keith Keller
2012-05-10 18:10           ` Mathias Burén
2012-05-10 18:23           ` Phil Turmel
2012-05-10 19:15             ` Keith Keller
2012-05-10 18:42         ` Daniel Pocock
2012-05-10 19:09           ` Phil Turmel
2012-05-10 20:30             ` Daniel Pocock
2012-05-11  6:50             ` Michael Tokarev
2012-05-21 14:19           ` Brian Candler
2012-05-21 14:29             ` Phil Turmel
2012-05-26 21:58               ` Stefan *St0fF* Huebner
2012-05-10 21:43       ` Stan Hoeppner
2012-05-10 23:00         ` Marcus Sorensen
2012-05-10 21:15     ` Stan Hoeppner
2012-05-10 21:31       ` Daniel Pocock
2012-05-11  1:53         ` Stan Hoeppner
2012-05-11  8:31           ` Daniel Pocock
2012-05-11 13:54             ` Pierre Beck
2012-05-10 21:41       ` Phil Turmel
2012-05-10 22:27       ` David Brown
2012-05-10 22:37         ` Daniel Pocock
     [not found]         ` <CABYL=ToORULrdhBVQk0K8zQqFYkOomY-wgG7PpnJnzP9u7iBnA@mail.gmail.com>
2012-05-11  7:10           ` David Brown
2012-05-11  8:16             ` Daniel Pocock
2012-05-11 22:28               ` Stan Hoeppner
2012-05-21 15:20                 ` CoolCold
2012-05-21 18:51                   ` Stan Hoeppner
2012-05-21 18:54                     ` Roberto Spadim
2012-05-21 19:05                       ` Stan Hoeppner
2012-05-21 19:38                         ` Roberto Spadim
2012-05-21 23:34                     ` NeilBrown
2012-05-22  6:36                       ` Stan Hoeppner
2012-05-22  7:29                         ` David Brown
2012-05-23 13:14                           ` Stan Hoeppner
2012-05-23 13:27                             ` Roberto Spadim
2012-05-23 19:49                             ` David Brown
2012-05-23 23:46                               ` Stan Hoeppner
2012-05-24  1:18                                 ` Stan Hoeppner
2012-05-24  2:08                                   ` NeilBrown
2012-05-24  6:16                                     ` Stan Hoeppner
2012-05-24  2:10                         ` NeilBrown
2012-05-24  2:55                           ` Roberto Spadim
2012-05-11 22:17             ` Stan Hoeppner
2012-05-10  1:29 Richard Scobie
