* Question regarding XFS on LVM over hardware RAID.
@ 2014-01-29 14:26 C. Morgan Hamill
  2014-01-29 15:07 ` Eric Sandeen
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-01-29 14:26 UTC (permalink / raw)
  To: xfs

Howdy folks,

I understand that XFS should have its stripe unit and width configured according to
the underlying RAID device when using LVM, but I was wondering if this
is still the case when a given XFS-formatted logical volume takes up
only part of the available space on the RAID.  In particular, I could
imagine that stripe width would need to be modified proportionally with
the decrease in filesystem size.  My intuition says that's false, but
I wanted to check with folks who know for sure.

Thanks for any help!
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-29 14:26 Question regarding XFS on LVM over hardware RAID C. Morgan Hamill
@ 2014-01-29 15:07 ` Eric Sandeen
  2014-01-29 19:11   ` C. Morgan Hamill
  2014-01-29 22:40   ` Stan Hoeppner
  0 siblings, 2 replies; 27+ messages in thread
From: Eric Sandeen @ 2014-01-29 15:07 UTC (permalink / raw)
  To: C. Morgan Hamill, xfs

On 1/29/14, 8:26 AM, C. Morgan Hamill wrote:
> Howdy folks,
> 
> I understand that XFS should have its stripe unit and width configured according to
> the underlying RAID device when using LVM, but I was wondering if this
> is still the case when a given XFS-formatted logical volume takes up
> only part of the available space on the RAID.  In particular, I could
> imagine that stripe width would need to be modified proportionally with
> the decrease in filesystem size.  My intuition says that's false, but
> I wanted to check with folks who know for sure.

The stripe unit and width are units of geometry of the underlying
storage; a filesystem will span some number of stripe units, depending
on its size.

So no, the filesystem's notion of stripe geometry does not change
with the filesystem size.

You do want to make sure that stripe geometry is correct and aligned
from top to bottom.

I helped write up the RHEL storage admin guide, and there are some
nice words about geometry and alignment in there:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-iolimits.html

(Hopefully this is available w/o login, I think it is)

-Eric

> Thanks for any help!
> --
> Morgan Hamill
> 

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-29 15:07 ` Eric Sandeen
@ 2014-01-29 19:11   ` C. Morgan Hamill
  2014-01-29 23:55     ` Stan Hoeppner
  2014-01-29 22:40   ` Stan Hoeppner
  1 sibling, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-01-29 19:11 UTC (permalink / raw)
  To: xfs

Thanks for the quick reply.

Excerpts from Eric Sandeen's message of 2014-01-29 10:07:15 -0500:
> The stripe unit and width are units of geometry of the underlying
> storage; a filesystem will span some number of stripe units, depending
> on its size.
> 
> So no, the filesystem's notion of stripe geometry does not change
> with the filesystem size.
> 
> You do want to make sure that stripe geometry is correct and aligned
> from top to bottom.

Just to make sure I've understood, for 3 14-disk RAID 6 groups striped
together into a single RAID 60, with stripe units of 128k, split up into
some number of LVM logical volumes, I'd create the filesystems with the
following:

    mkfs.xfs -d su=128k,sw=36 ...

for all of the filesystems, regardless of how many and what size they
were.  Does that sound right?

> I helped write up the RHEL storage admin guide, and there are some
> nice words about geometry and alignment in there:
> 
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-iolimits.html

Thanks for the resource!
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-29 15:07 ` Eric Sandeen
  2014-01-29 19:11   ` C. Morgan Hamill
@ 2014-01-29 22:40   ` Stan Hoeppner
  1 sibling, 0 replies; 27+ messages in thread
From: Stan Hoeppner @ 2014-01-29 22:40 UTC (permalink / raw)
  To: Eric Sandeen, C. Morgan Hamill, xfs

On 1/29/2014 9:07 AM, Eric Sandeen wrote:
> On 1/29/14, 8:26 AM, C. Morgan Hamill wrote:
>> Howdy folks,
>>
>> I understand that XFS should have its stripe unit and width configured according to
>> the underlying RAID device when using LVM, but I was wondering if this
>> is still the case when a given XFS-formatted logical volume takes up
>> only part of the available space on the RAID.  In particular, I could
>> imagine that stripe width would need to be modified proportionally with
>> the decrease in filesystem size.  My intuition says that's false, but
>> I wanted to check with folks who know for sure.
> 
> The stripe unit and width are units of geometry of the underlying
> storage; a filesystem will span some number of stripe units, depending
> on its size.
> 
> So no, the filesystem's notion of stripe geometry does not change
> with the filesystem size.
> 
> You do want to make sure that stripe geometry is correct and aligned
> from top to bottom.

This is correct if indeed stripe alignment is beneficial to the
workload.  But not all workloads benefit from stripe alignment.  Some
may perform worse when XFS is stripe aligned to the underlying storage.

For instance, when a workload performs lots of allocations that are
significantly smaller than the RAID stripe width.  Here you end up with
a small file allocated at the start of each stripe and the rest of the
stripe left empty.  This can create an IO hot spot on the first one or
two drives in the array, and the others may sit idle.  This obviously
has a negative impact on throughput with such a workload.

Thus for a workload that performs lots of predominantly small
allocations, it is best to not align during mkfs.xfs with hardware RAID
that doesn't provide geometry to Linux.  If the underlying storage
device does do so, or if it is a striped md/RAID device, you will
want to manually specify 4K alignment, as mkfs.xfs will auto align to md
geometry.
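
If you do need to override auto-detection in that case, something along
these lines should do it (a sketch only; /dev/md0 is a placeholder
device path):

    # ignore any detected stripe geometry entirely
    mkfs.xfs -d noalign /dev/md0

    # or pin alignment to the 4K filesystem block size explicitly
    mkfs.xfs -d su=4k,sw=1 /dev/md0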

-- 
Stan


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-29 19:11   ` C. Morgan Hamill
@ 2014-01-29 23:55     ` Stan Hoeppner
  2014-01-30 14:28       ` C. Morgan Hamill
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2014-01-29 23:55 UTC (permalink / raw)
  To: C. Morgan Hamill, xfs

On 1/29/2014 1:11 PM, C. Morgan Hamill wrote:
> Thanks for the quick reply.
> 
> Excerpts from Eric Sandeen's message of 2014-01-29 10:07:15 -0500:
>> The stripe unit and width are units of geometry of the underlying
>> storage; a filesystem will span some number of stripe units, depending
>> on its size.
>>
>> So no, the filesystem's notion of stripe geometry does not change
>> with the filesystem size.
>>
>> You do want to make sure that stripe geometry is correct and aligned
>> from top to bottom.
> 
> Just to make sure I've understood, for 3 14-disk RAID 6 groups striped
> together into a single RAID 60, with stripe units of 128k, split up into
> some number of LVM logical volumes, I'd create the filesystems with the
> following:
> 
>     mkfs.xfs -d su=128k,sw=36 ...

This is not correct.  You must align to either the outer stripe or the
inner stripe when using a nested array.  In this case it appears your
inner stripe is RAID6 su 128KB * sw 12 = 1536KB.  You did not state your
outer RAID0 stripe geometry.  Which one you align to depends entirely on
your workload.
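
To make the two choices concrete (illustration only, not a
recommendation for either; the LV path is a placeholder):

    # align to the inner RAID6 stripe: 128KB unit, 12 data spindles
    mkfs.xfs -d su=128k,sw=12 /dev/some_vg/some_lv

    # align to the outer RAID0 stripe: each RAID6 width becomes the unit
    mkfs.xfs -d su=1536k,sw=3 /dev/some_vg/some_lv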

However, given that you currently intend to assemble one large array
from 3 smaller arrays, then immediately carve it into smaller pieces,
it seems that RAID60 is probably not the correct architecture for your
workload.  RAID60 is suitable for very large streaming write/read
workloads where you are evenly distributing filesystem blocks across a
very large spindle count, with a deterministic IO pattern, and with no
RMW.  It is not very suitable for consolidation workloads, as you seem
to be describing here.

Everything starts and ends with the workload.  You always design the
storage to meet the needs of the workload, not the other way round.  You
seem to be designing your system from the storage up.  This is often a
recipe for disaster.

Please describe your workload in more detail so we can provide better,
detailed, advice.

-- 
Stan


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-29 23:55     ` Stan Hoeppner
@ 2014-01-30 14:28       ` C. Morgan Hamill
  2014-01-30 20:28         ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-01-30 14:28 UTC (permalink / raw)
  To: stan; +Cc: xfs

First, thanks very much for your help.  We're weaning ourselves off
unnecessarily expensive storage and as such I unfortunately haven't had
as much experience with physical filesystems as I'd like.  I am also
unfamiliar with XFS.  I appreciate the help immensely.

Excerpts from Stan Hoeppner's message of 2014-01-29 18:55:48 -0500:
> This is not correct.  You must align to either the outer stripe or the
> inner stripe when using a nested array.  In this case it appears your
> inner stripe is RAID6 su 128KB * sw 12 = 1536KB.  You did not state your
> outer RAID0 stripe geometry.  Which one you align to depends entirely on
> your workload.

Ahh this makes sense; it had occurred to me that something like this
might be the case.  I'm not exactly sure what you mean by inner and
outer; I can imagine it going both ways.

Just to clarify, it looks like this:

     XFS     |      XFS    |     XFS      |      XFS
---------------------------------------------------------
                    LVM volume group
---------------------------------------------------------
                         RAID 0
---------------------------------------------------------
RAID 6 (14 disks) | RAID 6 (14 disks) | RAID 6 (14 disks)
---------------------------------------------------------
                    42 4TB SAS disks

...more or less.

I agree that it's quite weird, but I'll describe the workload and the
constraints.

We're using commercial backup software to provide backup needs for the
University I work at (CrashPlan Pro enterprisey whathaveyou server).
We've got perhaps 1200 or so user desktops and a few hundred servers on
top of that, all of which currently adds up to just under 100TB on our
old backup system which we're moving from (IBM Tivoli).

So this archive will be our primary store for on-site backups.
CrashPlan is more or less continually transferring some amount of data
from clients to itself, which it does all at once in a bundle after
determining what's changed. It ends up storing archives on disk as files
which look to max out at 4GB each before it opens up the next one.

Writes are probably more important than reads, as restores are
relatively infrequent, so I'd like to optimize for writes.  I expect the
bottleneck to be IO as the campus is predominantly 1Gbps throughout
and will become 10Gbps in the not-that-distant future, most likely.
I can virtually guarantee CPU will not be the bottleneck.

Now, here's the constraints, which is why I was planning on setting
things up as above:

  - This is a budget job, so sane things like RAID 10 are out.  RAID
    6 or 60 are (as far as I can tell, correct me if I'm wrong) our only
    real options here, as anything else either sacrifices too much
    storage or is too susceptible to failure from UREs.

  - I need to expose, in the end, three-ish (two or four would be OK)
    filesystems to the backup software, which should come fairly close
    to minimizing the effects of the archive maintenance jobs (integrity
    checks, mostly).  CrashPlan will spawn 2 jobs per store point, so
    a max of 8 at any given time should be a nice balance between
    under-utilizing and saturating the IO.

So I had thought LVM over RAID 60 would make sense because it would give
me the option of leaving a bit of disk unallocated and being able to
tweak filesystem sizes a bit as time goes on.

Now that I think of it though, perhaps something like 2 or 3 RAID6
volumes would make more sense, with XFS directly on top of them.  In
that case I have to balance number of volumes against the loss of
2 parity disks, however.

I'm not sure how best to proceed; any advice would be invaluable.
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-30 14:28       ` C. Morgan Hamill
@ 2014-01-30 20:28         ` Dave Chinner
  2014-01-31  5:58           ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2014-01-30 20:28 UTC (permalink / raw)
  To: C. Morgan Hamill; +Cc: stan, xfs

On Thu, Jan 30, 2014 at 09:28:45AM -0500, C. Morgan Hamill wrote:
> First, thanks very much for your help.  We're weaning ourselves off
> unnecessarily expensive storage and as such I unfortunately haven't had
> as much experience with physical filesystems as I'd like.  I am also
> unfamiliar with XFS.  I appreciate the help immensely.
> 
> Excerpts from Stan Hoeppner's message of 2014-01-29 18:55:48 -0500:
> > This is not correct.  You must align to either the outer stripe or the
> > inner stripe when using a nested array.  In this case it appears your
> > inner stripe is RAID6 su 128KB * sw 12 = 1536KB.  You did not state your
> > outer RAID0 stripe geometry.  Which one you align to depends entirely on
> > your workload.
> 
> Ahh this makes sense; it had occurred to me that something like this
> might be the case.  I'm not exactly sure what you mean by inner and
> outer; I can imagine it going both ways.
> 
> Just to clarify, it looks like this:
> 
>      XFS     |      XFS    |     XFS      |      XFS
> ---------------------------------------------------------
>                     LVM volume group
> ---------------------------------------------------------
>                          RAID 0
> ---------------------------------------------------------
> RAID 6 (14 disks) | RAID 6 (14 disks) | RAID 6 (14 disks)
> ---------------------------------------------------------
>                     42 4TB SAS disks

So optimised for sequential IO. The time-honoured method of setting
up XFS for this if the workload is large files is to use a stripe
unit that is equal to the width of the underlying RAID6 volumes with
a stripe width of 3. That way XFS tries to align files to the start
of each RAID6 volume, and allocate in full RAID6 stripe chunks. This
mostly avoids RMW cycles for large files and sequential IO. i.e. su
= 1536k, sw = 3.

> ...more or less.
> 
> I agree that it's quite weird, but I'll describe the workload and the
> constraints.

[snip]

summary: concurrent (initially slow) sequential writes of ~4GB files.

> Now, here's the constraints, which is why I was planning on setting
> things up as above:
> 
>   - This is a budget job, so sane things like RAID 10 are out.  RAID
>     6 or 60 are (as far as I can tell, correct me if I'm wrong) our only
>     real options here, as anything else either sacrifices too much
>     storage or is too susceptible to failure from UREs.

RAID6 is fine for this.

>   - I need to expose, in the end, three-ish (two or four would be OK)
>     filesystems to the backup software, which should come fairly close
>     to minimizing the effects of the archive maintenance jobs (integrity
>     checks, mostly).  CrashPlan will spawn 2 jobs per store point, so
>     a max of 8 at any given time should be a nice balance between
>     under-utilizing and saturating the IO.

So concurrency is up to 8 files being written at a time. That's
pretty much on the money for striped RAID. Much more than this and
you end up with performance being limited by seeking on the slowest
disk in the RAID sets.

> So I had thought LVM over RAID 60 would make sense because it would give
> me the option of leaving a bit of disk unallocated and being able to
> tweak filesystem sizes a bit as time goes on.

*nod*

And it allows you, in future, to add more disks and grow across them
via linear concatenation of more RAID60 luns of the same layout...

> Now that I think of it though, perhaps something like 2 or 3 RAID6
> volumes would make more sense, with XFS directly on top of them.  In
> that case I have to balance number of volumes against the loss of
> 2 parity disks, however.

Probably not worth the complexity.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-30 20:28         ` Dave Chinner
@ 2014-01-31  5:58           ` Stan Hoeppner
  2014-01-31 21:14             ` C. Morgan Hamill
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2014-01-31  5:58 UTC (permalink / raw)
  To: Dave Chinner, C. Morgan Hamill; +Cc: xfs

On 1/30/2014 2:28 PM, Dave Chinner wrote:
> On Thu, Jan 30, 2014 at 09:28:45AM -0500, C. Morgan Hamill wrote:
>> First, thanks very much for your help.  We're weaning ourselves off
>> unnecessarily expensive storage and as such I unfortunately haven't had
>> as much experience with physical filesystems as I'd like.  I am also
>> unfamiliar with XFS.  I appreciate the help immensely.
>>
>> Excerpts from Stan Hoeppner's message of 2014-01-29 18:55:48 -0500:
>>> This is not correct.  You must align to either the outer stripe or the
>>> inner stripe when using a nested array.  In this case it appears your
>>> inner stripe is RAID6 su 128KB * sw 12 = 1536KB.  You did not state your
>>> outer RAID0 stripe geometry.  Which one you align to depends entirely on
>>> your workload.
>>
>> Ahh this makes sense; it had occurred to me that something like this
>> might be the case.  I'm not exactly sure what you mean by inner and
>> outer; I can imagine it going both ways.
>>
>> Just to clarify, it looks like this:
>>
>>      XFS     |      XFS    |     XFS      |      XFS
>> ---------------------------------------------------------
>>                     LVM volume group
>> ---------------------------------------------------------
>>                          RAID 0
>> ---------------------------------------------------------
>> RAID 6 (14 disks) | RAID 6 (14 disks) | RAID 6 (14 disks)
>> ---------------------------------------------------------
>>                     42 4TB SAS disks

RAID60 is a nested RAID level just like RAID10 and RAID50.  It is a
stripe, or RAID0, across multiple primary array types, RAID6 in this
case.  The stripe width of each 'inner' RAID6 becomes the stripe unit of
the 'outer' RAID0 array:

RAID6 geometry	 128KB * 12 = 1536KB
RAID0 geometry  1536KB * 3  = 4608KB

If you are creating your RAID60 array with a proprietary hardware
RAID/SAN management utility it may not be clearly showing you the
resulting nested geometry I've demonstrated above, which is correct for
your RAID60.

It is possible with software RAID to continue nesting stripe upon stripe
to build infinitely large nested arrays.  It is not practical to do so
for many reasons, but I'll not express those here as it is out of scope
for this discussion.  I am simply attempting to explain how nested RAID
levels are constructed.
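
If the controller exports its geometry to Linux at all (many proprietary
controllers report nothing useful here), you can sanity-check what the
kernel thinks the stripe unit and width are with something like this
(substitute your actual block device for the placeholder sdX):

    blockdev --getiomin --getioopt /dev/sdX
    cat /sys/block/sdX/queue/minimum_io_size /sys/block/sdX/queue/optimal_io_size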

> So optimised for sequential IO. The time-honoured method of setting
> up XFS for this if the workload is large files is to use a stripe
> unit that is equal to the width of the underlying RAID6 volumes with
> a stripe width of 3. That way XFS tries to align files to the start
> of each RAID6 volume, and allocate in full RAID6 stripe chunks. This
> mostly avoids RMW cycles for large files and sequential IO. i.e. su
> = 1536k, sw = 3.

As Dave demonstrates, your hardware geometry is 1536*3=4608KB.  Thus,
when you create your logical volumes they each need to start and end on
a 4608KB boundary, and be evenly divisible by 4608KB.  This will ensure
that all of your logical volumes are aligned to the RAID60 geometry.
When formatting the LVs with XFS you will use:

~# mkfs.xfs -d su=1536k,sw=3 /dev/[lv_device_path]

This aligns XFS to the RAID60 geometry.  Geometry alignment must be
maintained throughout the entire storage stack.  If a single layer is
not aligned properly, every layer will be misaligned.  When this occurs
performance will suffer, and could suffer tremendously.

You'll want to add "inode64" to your fstab mount options for these
filesystems.  This has nothing to do with geometry, but how XFS
allocates inodes and how/where files are written to AGs.  It is the
default in very recent kernels but I don't know in which it was made so.
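
An fstab entry along these lines would do it (device path and mount
point are placeholders):

    /dev/vg_backup/lv_store1  /srv/backup/store1  xfs  defaults,inode64  0 0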

>> ...more or less.
>>
>> I agree that it's quite weird, but I'll describe the workload and the
>> constraints.
> 
> [snip]
> 
> summary: concurrent (initially slow) sequential writes of ~4GB files.
> 
>> Now, here's the constraints, which is why I was planning on setting
>> things up as above:
>>
>>   - This is a budget job, so sane things like RAID 10 are out.  RAID
>>     6 or 60 are (as far as I can tell, correct me if I'm wrong) our only
>>     real options here, as anything else either sacrifices too much
>>     storage or is too susceptible to failure from UREs.
> 
> RAID6 is fine for this.
>
>>   - I need to expose, in the end, three-ish (two or four would be OK)
>>     filesystems to the backup software, which should come fairly close
>>     to minimizing the effects of the archive maintenance jobs (integrity
>>     checks, mostly).  CrashPlan will spawn 2 jobs per store point, so
>>     a max of 8 at any given time should be a nice balance between
>>     under-utilizing and saturating the IO.
> 
> So concurrency is up to 8 files being written at a time. That's
> pretty much on the money for striped RAID. Much more than this and
> you end up with performance being limited by seeking on the slowest
> disk in the RAID sets.
> 
>> So I had thought LVM over RAID 60 would make sense because it would give
>> me the option of leaving a bit of disk unallocated and being able to
>> tweak filesystem sizes a bit as time goes on.
> 
> *nod*
> 
> And it allows you, in future, to add more disks and grow across them
> via linear concatenation of more RAID60 luns of the same layout...
> 
>> Now that I think of it though, perhaps something like 2 or 3 RAID6
>> volumes would make more sense, with XFS directly on top of them.  In
>> that case I have to balance number of volumes against the loss of
>> 2 parity disks, however.
> 
> Probably not worth the complexity.

You'll lose 2 disks to parity with RAID6 regardless.  Three standalone
arrays cost you 6 disks, same as making a RAID60 of those 3 arrays.
The problem you'll have with XFS directly on RAID6 is the inability to
easily expand.  The only way to do it is by adding disks to each
RAID6 and having the controller reshape the array.  Reshapes with 4TB
drives will take more than a day to complete and the array will be very
slow during the reshape.  Every time you reshape the array your geometry
will change.  XFS has the ability to align to a new geometry using a
mount option, but it's best to avoid this.

LVM typically affords you much more flexibility here than your RAID/SAN
controller.  Just be mindful that when you expand you need to keep your
geometry, i.e. stripe width, the same.  Let's say some time in the
future you want to expand but can only afford, or only need, one 14 disk
chassis at the time, not another 3 for another RAID60.  Here you could
create a single 14 drive RAID6 with stripe geometry 384KB * 12 = 4608KB.

You could then carve it up into 1-3 pieces, each aligned to the
start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
them to one of more of your LVs/XFS filesystems.  This maintains the
same overall stripe width geometry as the RAID60 to which all of your
XFS filesystems are already aligned.

The volume manager in your RAID hardware may not, probably won't, allow
doing this type of expansion after the fact, meaning after the original
RAID60 has been created.
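
As a rough sketch of that expansion path (all device, VG, and LV names
below are made up, and the sizes are only examples of keeping everything
a multiple of the 4608KB stripe width):

    pvcreate --dataalignment 4608k /dev/sdX       # new 14-drive RAID6 LUN, 384KB su * 12
    vgextend vg_backup /dev/sdX
    lvextend -L +9216m /dev/vg_backup/lv_store1   # grow by a multiple of 4608KB
    xfs_growfs /srv/backup/store1                 # then grow the filesystem to match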

If you remember only 3 words of my post, remember:

Alignment, alignment, alignment.

For a RAID60 setup such as you're describing, you'll want to use LVM,
and you must maintain consistent geometry throughout the stack, from
array to filesystem.  This means every physical volume you create must
start and end on a 4608KB stripe boundary.  Every volume group you
create must do the same.  And every logical volume must also start and
end on a 4608KB stripe boundary.  If you don't verify each layer is
aligned all of your XFS filesystems will likely be unaligned.  And
again, performance will suffer, possibly horribly so.
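
One way to double-check each layer after the fact (names are
placeholders; the exact fields available depend on your LVM version):

    pvs -o +pe_start                  # where the data area of each PV starts
    lvs -o +seg_start_pe,seg_size     # extent layout of each LV
    xfs_info /srv/backup/store1       # sunit/swidth as the filesystem sees them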

-- 
Stan


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-31  5:58           ` Stan Hoeppner
@ 2014-01-31 21:14             ` C. Morgan Hamill
  2014-02-01 21:06               ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-01-31 21:14 UTC (permalink / raw)
  To: stan; +Cc: xfs

Excerpts from Stan Hoeppner's message of 2014-01-31 00:58:46 -0500:
> RAID60 is a nested RAID level just like RAID10 and RAID50.  It is a
> stripe, or RAID0, across multiple primary array types, RAID6 in this
> case.  The stripe width of each 'inner' RAID6 becomes the stripe unit of
> the 'outer' RAID0 array:
> 
> RAID6 geometry     128KB * 12 = 1536KB
> RAID0 geometry  1536KB * 3  = 4608KB
> 
> If you are creating your RAID60 array with a proprietary hardware
> RAID/SAN management utility it may not be clearly showing you the
> resulting nested geometry I've demonstrated above, which is correct for
> your RAID60.
> 
> It is possible with software RAID to continue nesting stripe upon stripe
> to build infinitely large nested arrays.  It is not practical to do so
> for many reasons, but I'll not express those here as it is out of scope
> for this discussion.  I am simply attempting to explain how nested RAID
> levels are constructed.
> 
> > So optimised for sequential IO. The time-honoured method of setting
> > up XFS for this if the workload is large files is to use a stripe
> > unit that is equal to the width of the underlying RAID6 volumes with
> > a stripe width of 3. That way XFS tries to align files to the start
> > of each RAID6 volume, and allocate in full RAID6 stripe chunks. This
> > mostly avoids RMW cycles for large files and sequential IO. i.e. su
> > = 1536k, sw = 3.

Makes perfect sense.

> As Dave demonstrates, your hardware geometry is 1536*3=4608KB.  Thus,
> when you create your logical volumes they each need to start and end on
> a 4608KB boundary, and be evenly divisible by 4608KB.  This will ensure
> that all of your logical volumes are aligned to the RAID60 geometry.
> When formatting the LVs with XFS you will use:
> 
> ~# mkfs.xfs -d su=1536k,sw=3 /dev/[lv_device_path]

Noted.

> This aligns XFS to the RAID60 geometry.  Geometry alignment must be
> maintained throughout the entire storage stack.  If a single layer is
> not aligned properly, every layer will be misaligned.  When this occurs
> performance will suffer, and could suffer tremendously.
> 
> You'll want to add "inode64" to your fstab mount options for these
> filesystems.  This has nothing to do with geometry, but how XFS
> allocates inodes and how/where files are written to AGs.  It is the
> default in very recent kernels but I don't know in which it was made so.

Yes, I was aware of this.

> LVM typically affords you much more flexibility here than your RAID/SAN
> controller.  Just be mindful that when you expand you need to keep your
> geometry, i.e. stripe width, the same.  Let's say some time in the
> future you want to expand but can only afford, or only need, one 14 disk
> chassis at the time, not another 3 for another RAID60.  Here you could
> create a single 14 drive RAID6 with stripe geometry 384KB * 12 = 4608KB.
> 
> You could then carve it up into 1-3 pieces, each aligned to the
> start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
> them to one or more of your LVs/XFS filesystems.  This maintains the
> same overall stripe width geometry as the RAID60 to which all of your
> XFS filesystems are already aligned.

OK, so the upshot is that any additions to the volume group must be
arrays with su*sw=4608k, and all logical volumes and filesystems must
begin and end on multiples of 4608k from the start of the block device.

As long as these things hold true, is it all right for logical
volumes/filesystems to begin on one physical device and end on another?

> If you remember only 3 words of my post, remember:
> 
> Alignment, alignment, alignment.

Yes, I am hearing you. :-)

> For a RAID60 setup such as you're describing, you'll want to use LVM,
> and you must maintain consistent geometry throughout the stack, from
> array to filesystem.  This means every physical volume you create must
> start and end on a 4608KB stripe boundary.  Every volume group you
> create must do the same.  And every logical volume must also start and
> end on a 4608KB stripe boundary.  If you don't verify each layer is
> aligned all of your XFS filesystems will likely be unaligned.  And
> again, performance will suffer, possibly horribly so.

So, basically, --dataalignment is my friend during pvcreate and
lvcreate.

Thanks so much for your and Dave's help; this has been tremendously
helpful.
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-01-31 21:14             ` C. Morgan Hamill
@ 2014-02-01 21:06               ` Stan Hoeppner
  2014-02-02 21:21                 ` Dave Chinner
  2014-02-03 16:07                 ` C. Morgan Hamill
  0 siblings, 2 replies; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-01 21:06 UTC (permalink / raw)
  To: C. Morgan Hamill; +Cc: xfs

On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> Excerpts from Stan Hoeppner's message of 2014-01-31 00:58:46 -0500:
...
>> LVM typically affords you much more flexibility here than your RAID/SAN
>> controller.  Just be mindful that when you expand you need to keep your
>> geometry, i.e. stripe width, the same.  Let's say some time in the
>> future you want to expand but can only afford, or only need, one 14 disk
>> chassis at the time, not another 3 for another RAID60.  Here you could
>> create a single 14 drive RAID6 with stripe geometry 384KB * 12 = 4608KB.
>>
>> You could then carve it up into 1-3 pieces, each aligned to the
>> start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
>> them to one or more of your LVs/XFS filesystems.  This maintains the
>> same overall stripe width geometry as the RAID60 to which all of your
>> XFS filesystems are already aligned.
> 
> OK, so the upshot is that any additions to the volume group must be
> arrays with su*sw=4608k, and all logical volumes and filesystems must
> begin and end on multiples of 4608k from the start of the block device.
> 
> As long as these things hold true, is it all right for logical
> volumes/filesystems to begin on one physical device and end on another?

Yes, that's one of the beauties of LVM.  However, there are other
reasons you may not want to do this.  For example, if you have allocated
space from two different JBOD or SAN units to a single LVM volume, and
you lack multipath connections, then a cable, switch, HBA, or
other failure disconnecting one LUN will wreak havoc on your
mounted XFS filesystem.  If you have multipath and the storage device
disappears due to some other failure such as backplane,  UPS, etc, you
have the same problem.

This isn't a deal breaker.  There are many large XFS filesystems in
production that span multiple storage arrays.  You just need to be
mindful of your architecture at all times, and it needs to be
documented.  Scenario:  XFS unmounts due to an IO error.  You're not yet
aware an entire chassis is offline.  You can't remount the filesystem so
you start a destructive xfs_repair thinking that will fix the problem.
Doing so will wreck your filesystem and you'll likely lose access to all
the files on the offline chassis, with no ability to get it back short
of some magic and a full restore from tape or D2D backup server.  We had
a case similar to this reported a couple of years ago.

>> If you remember only 3 words of my post, remember:
>>
>> Alignment, alignment, alignment.
> 
> Yes, I am hearing you. :-)
> 
>> For a RAID60 setup such as you're describing, you'll want to use LVM,
>> and you must maintain consistent geometry throughout the stack, from
>> array to filesystem.  This means every physical volume you create must
>> start and end on a 4608KB stripe boundary.  Every volume group you
>> create must do the same.  And every logical volume must also start and
>> end on a 4608KB stripe boundary.  If you don't verify each layer is
>> aligned all of your XFS filesystems will likely be unaligned.  And
>> again, performance will suffer, possibly horribly so.
> 
> So, basically, --dataalignment is my friend during pvcreate and
> lvcreate.

If the logical sector size reported by your RAID controller is 512
bytes, then "--dataalignment=9216s" should start your data section on a
RAID60 stripe boundary after the metadata section.
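
In other words, building the stack might look roughly like this (device,
VG, and LV names are made up; the LV size is just an example that is an
exact multiple of 4608KB):

    pvcreate --dataalignment=9216s /dev/sdb       # 9216 sectors = 4608KB
    vgcreate vg_backup /dev/sdb
    lvcreate -n lv_store1 -L 9216g vg_backup      # 9216g is a multiple of 4608KB
    mkfs.xfs -d su=1536k,sw=3 /dev/vg_backup/lv_store1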

The PhysicalExtentSize should probably also match the 4608KB stripe
width, but this is apparently not possible.  PhysicalExtentSize must be
a power of 2 value.  I don't know if or how this will affect XFS aligned
write out.  You'll need to consult with someone more knowledgeable of LVM.

> Thanks so much for your and Dave's help; this has been tremendously
> helpful.

You bet.

-- 
Stan


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-01 21:06               ` Stan Hoeppner
@ 2014-02-02 21:21                 ` Dave Chinner
  2014-02-03 16:12                   ` C. Morgan Hamill
  2014-02-03 16:07                 ` C. Morgan Hamill
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2014-02-02 21:21 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: C. Morgan Hamill, xfs

On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
> On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> > So, basically, --dataalignment is my friend during pvcreate and
> > lvcreate.
> 
> If the logical sector size reported by your RAID controller is 512
> bytes, then "--dataalignment=9216s" should start your data section on a
> RAID60 stripe boundary after the metadata section.
> 
> The PhysicalExtentSize should probably also match the 4608KB stripe
> width, but this is apparently not possible.  PhysicalExtentSize must be
> a power of 2 value.  I don't know if or how this will affect XFS aligned
> write out.  You'll need to consult with someone more knowledgeable of LVM.

You can't do single IOs of that size, anyway, so this is where the
BBWC on the raid controller does its magic and caches sequential IOs
until it has full stripe writes cached....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-01 21:06               ` Stan Hoeppner
  2014-02-02 21:21                 ` Dave Chinner
@ 2014-02-03 16:07                 ` C. Morgan Hamill
  1 sibling, 0 replies; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-03 16:07 UTC (permalink / raw)
  To: stan; +Cc: xfs

Excerpts from Stan Hoeppner's message of 2014-02-01 16:06:17 -0500:
> Yes, that's one of the beauties of LVM.  However, there are other
> reasons you may not want to do this.  For example, if you have allocated
> space from two different JBOD or SAN units to a single LVM volume, and
> you lack multipath connections, then a cable, switch, HBA, or
> other failure disconnecting one LUN will wreak havoc on your
> mounted XFS filesystem.  If you have multipath and the storage device
> disappears due to some other failure such as backplane,  UPS, etc, you
> have the same problem.

Very true; I gather this would only take out any volumes which at least
partially rest on the failed device, however?  As in, I don't lose the
whole volume group, correct?

> This isn't a deal breaker.  There are many large XFS filesystems in
> production that span multiple storage arrays.  You just need to be
> mindful of your architecture at all times, and it needs to be
> documented.  Scenario:  XFS unmounts due to an IO error.  You're not yet
> aware an entire chassis is offline.  You can't remount the filesystem so
> you start a destructive xfs_repair thinking that will fix the problem.
> Doing so will wreck your filesystem and you'll likely lose access to all
> the files on the offline chassis, with no ability to get it back short
> of some magic and a full restore from tape or D2D backup server.  We had
> a case similar to this reported a couple of years ago.

Oh God, that sounds terrible.  My sysadmininess is wondering why the
chassis wasn't monitored, but hindsight, etc. etc. ;-)

> If the logical sector size reported by your RAID controller is 512
> bytes, then "--dataalignment=9216s" should start your data section on a
> RAID60 stripe boundary after the metadata section.

I see that 9216s == 4608k/512b, but I'm missing something: is the
default metadata size guaranteed to be less than a single stripe, or is
there more to it?

Oh, wait, I think I just got it: '--dataalignment' will take care to
start on some multiple of 9216 sectors, regardless of the size of the
metadata section. Doy.

> The PhysicalExtentSize should probably also match the 4608KB stripe
> width, but this is apparently not possible.  PhysicalExtentSize must be
> a power of 2 value.  I don't know if or how this will affect XFS aligned
> write out.  You'll need to consult with someone more knowledgeable of LVM.

Makes sense.  If it would have an impact, then I'd probably just end up
going with RAID 0 on top of 2 or 4 RAID 6 groups, where it looks like
the math would work out.

> You bet.

Honestly, this is the most helpful and straightforward I've ever found
any project's mailing list, so kudos++.
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-02 21:21                 ` Dave Chinner
@ 2014-02-03 16:12                   ` C. Morgan Hamill
  2014-02-03 21:41                     ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-03 16:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Stan Hoeppner, xfs

Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500:
> On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
> > On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> > > So, basically, --dataalignment is my friend during pvcreate and
> > > lvcreate.
> > 
> > If the logical sector size reported by your RAID controller is 512
> > bytes, then "--dataalignment=9216s" should start your data section on a
> > RAID60 stripe boundary after the metadata section.
> > 
> > The PhysicalExtentSize should probably also match the 4608KB stripe
> > width, but this is apparently not possible.  PhysicalExtentSize must be
> > a power of 2 value.  I don't know if or how this will affect XFS aligned
> > write out.  You'll need to consult with someone more knowledgeable of LVM.
> 
> You can't do single IOs of that size, anyway, so this is where the
> > BBWC on the raid controller does its magic and caches sequential IOs
> until it has full stripe writes cached....

So I am probably missing something here; could you clarify?  Are you
saying that I can't do single IOs of that size (by which I take your
meaning to be IOs as small as 9216 sectors) because my RAID controller
won't let me (i.e., it will cache anything smaller than the
stripe size anyway)?  Or are you saying that XFS with these given
settings won't make writes that small (which seems false, since I'm
essentially telling it to do writes of precisely that size).  I'm a bit
unclear on that.

In addition, does this in effect mean that when it comes to LVM, extent
size makes no difference for alignment purposes?  So I don't have to
worry about anything other than aligning the beginning and ending of
logical volumes, volume groups, etc. to 9216 sector multiples?

Thanks again!
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-03 16:12                   ` C. Morgan Hamill
@ 2014-02-03 21:41                     ` Dave Chinner
  2014-02-04  8:00                       ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2014-02-03 21:41 UTC (permalink / raw)
  To: C. Morgan Hamill; +Cc: Stan Hoeppner, xfs

On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote:
> Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500:
> > On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
> > > On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> > > > So, basically, --dataalignment is my friend during pvcreate and
> > > > lvcreate.
> > > 
> > > If the logical sector size reported by your RAID controller is 512
> > > bytes, then "--dataalignment=9216s" should start your data section on a
> > > RAID60 stripe boundary after the metadata section.
> > > 
> > > The PhysicalExtentSize should probably also match the 4608KB stripe
> > > width, but this is apparently not possible.  PhysicalExtentSize must be
> > > a power of 2 value.  I don't know if or how this will affect XFS aligned
> > > write out.  You'll need to consult with someone more knowledgeable of LVM.
> > 
> > You can't do single IOs of that size, anyway, so this is where the
> > > BBWC on the raid controller does its magic and caches sequential IOs
> > until it has full stripe writes cached....
> 
> So I am probably missing something here; could you clarify?  Are you
> saying that I can't do single IOs of that size (by which I take your
> meaning to be IOs as small as 9216 sectors) because my RAID controller
> won't let me (i.e., it will cache anything smaller than the
> stripe size anyway)?

Typical limitations on IO size are the size of the hardware DMA
scatter-gather rings of the HBA/raid controller. For example, the
two hardware RAID controllers in my largest test box have
limitations of 70 and 80 segments and maximum IO sizes of 280k and
320k.

And looking at the IO being dispatched with blktrace, I see the
maximum size is:

  8,80   2       61     0.769857112 44866  D  WS 12423408 + 560 [qemu-system-x86]
  8,80   2       71     0.769877563 44866  D  WS 12423968 + 560 [qemu-system-x86]
  8,80   2       72     0.769889767 44866  D  WS 12424528 + 560 [qemu-system-x86]
                                                            ^^^

560 sectors or 280k. So for this hardware, sequential 280k writes
are hitting the BBWC. And because they are sequential, the BBWC is
writing them back as full stripe writes after aggregating them in
NVRAM. Hence there are no performance diminishing RMW cycles
occurring, even though the individual IO size is much smaller than
the stripe unit/width....
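
For reference, a trace like the one above can be captured with something
along these lines (the device is a placeholder):

    blktrace -d /dev/sdX -o - | blkparse -i -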

> Or are you saying that XFS with these given
> settings won't make writes that small (which seems false, since I'm
> essentially telling it to do writes of precisely that size).  I'm a bit
> unclear on that.

What su/sw tells XFS is how to align allocation of files, so that
when we dispatch sequential IO to that file it is aligned to the
underlying storage because the extents that the filesystem allocated
for it are aligned. This means that if you write exactly one stripe
width of data, it will hit each disk exactly once. It might take 10
IOs to get the data to the storage, but it will only hit each disk
once.

The function of the stripe cache (in software raid) and the BBWC (in
hardware RAID) is to prevent RMW cycles while the
filesystem/hardware is still flinging data at the RAID lun. Only
once the controller has complete stripe widths will it calculate
parity and write back the data, thereby avoiding a RMW cycle....

> In addition, does this in effect mean that when it comes to LVM, extent
> size makes no difference for alignment purposes?  So I don't have to
> worry about anything other than aligning the beginning and ending of
> logical volumes, volume groups, etc. to 9216 sector multiples?

No, you still have to align everything to the underlying storage so
that the filesystem on top of the volumes is correctly aligned.
Where the data will be written (i.e. how the filesystem allocates the
underlying blocks) determines the IO alignment of sequential/large
user IOs, and that matters far more than the size of the sequential
IOs the kernel uses to write the data.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-03 21:41                     ` Dave Chinner
@ 2014-02-04  8:00                       ` Stan Hoeppner
  2014-02-18 19:44                         ` C. Morgan Hamill
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-04  8:00 UTC (permalink / raw)
  To: Dave Chinner, C. Morgan Hamill; +Cc: xfs

On 2/3/2014 3:41 PM, Dave Chinner wrote:
> On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote:
>> Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500:
>>> On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
>>>> On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
>>>>> So, basically, --dataalignment is my friend during pvcreate and
>>>>> lvcreate.
>>>>
>>>> If the logical sector size reported by your RAID controller is 512
>>>> bytes, then "--dataalignment=9216s" should start your data section on a
>>>> RAID60 stripe boundary after the metadata section.
>>>>
>>>> The PhysicalExtentSize should probably also match the 4608KB stripe
>>>> width, but this is apparently not possible.  PhysicalExtentSize must be
>>>> a power of 2 value.  I don't know if or how this will affect XFS aligned
>>>> write out.  You'll need to consult with someone more knowledgeable of LVM.
>>>
>>> You can't do single IOs of that size, anyway, so this is where the
>>> BBWC on the raid controller does its magic and caches sequential IOs
>>> until it has full stripe writes cached....
>>
>> So I am probably missing something here; could you clarify?  Are you
>> saying that I can't do single IOs of that size (by which I take your
>> meaning to be IOs as small as 9216 sectors) because my RAID controller
>> won't let me (i.e., it will cache anything smaller than the
>> stripe size anyway)?
> 
> Typical limitations on IO size are the size of the hardware DMA
> scatter-gather rings of the HBA/raid controller. For example, the
> two hardware RAID controllers in my largest test box have
> limitations of 70 and 80 segments and maximum IO sizes of 280k and
> 320k.
> 
> And looking at the IO being dispatched with blktrace, I see the
> maximum size is:
> 
>   8,80   2       61     0.769857112 44866  D  WS 12423408 + 560 [qemu-system-x86]
>   8,80   2       71     0.769877563 44866  D  WS 12423968 + 560 [qemu-system-x86]
>   8,80   2       72     0.769889767 44866  D  WS 12424528 + 560 [qemu-system-x86]
>                                                             ^^^
> 
> 560 sectors or 280k. So for this hardware, sequential 280k writes
> are hitting the BBWC. And because they are sequential, the BBWC is
> writing them back as full stripe writes after aggregating them in
> NVRAM. Hence there are no performance diminishing RMW cycles
> occurring, even though the individual IO size is much smaller than
> the stripe unit/width....
> 
>> Or are you saying that XFS with these given
>> settings won't make writes that small (which seems false, since I'm
>> essentially telling it to do writes of precisely that size).  I'm a bit
>> unclear on that.
> 
> What su/sw tells XFS is how to align allocation of files, so that
> when we dispatch sequential IO to that file it is aligned to the
> underlying storage because the extents that the filesystem allocated
> for it are aligned. This means that if you write exactly one stripe
> width of data, it will hit each disk exactly once. It might take 10
> IOs to get the data to the storage, but it will only hit each disk
> once.
> 
> The function of the stripe cache (in software raid) and the BBWC (in
> hardware RAID) is to prevent RMW cycles while the
> filesystem/hardware is still flinging data at the RAID lun. Only
> once the controller has complete stripe widths will it calculate
> parity and write back the data, thereby avoiding a RMW cycle....

-------
>> In addition, does this in effect mean that when it comes to LVM, extent
>> size makes no difference for alignment purposes?  So I don't have to
>> worry about anything other than aligning the beginning and ending of
>> logical volumes, volume groups, etc. to 9216 sector multiples?
> 
> No, you still have to align everything to the underlying storage so
> that the filesystem on top of the volumes is correctly aligned.
> Where the data will be written (i.e. how the filesystem allocates the
> underlying blocks) determines the IO alignment of sequential/large
> user IOs, and that matters far more than the size of the sequential
> IOs the kernel uses to write the data.

After a little digging and thinking this through...

The default PE size is 4MB but up to 16GB with LVM1, and apparently
unlimited size with LVM2.  It can be a few thousand times larger than
any sane stripe width.  This makes it pretty clear that PEs exist
strictly for volume management operations, used by the LVM tools, but
have no relationship to regular write IOs.  Thus the PE size need not
match nor be evenly divisible by the stripe width.  It's not part of the
alignment equation.

-- 
Stan


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-04  8:00                       ` Stan Hoeppner
@ 2014-02-18 19:44                         ` C. Morgan Hamill
  2014-02-18 23:07                           ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-18 19:44 UTC (permalink / raw)
  To: xfs

Howdy, sorry for digging up this thread, but I've run into an issue
again, and am looking for advice.

Excerpts from Stan Hoeppner's message of 2014-02-04 03:00:54 -0500:
> After a little digging and thinking this through...
> 
> The default PE size is 4MB but up to 16GB with LVM1, and apparently
> unlimited size with LVM2.  It can be a few thousand times larger than
> any sane stripe width.  This makes it pretty clear that PEs exist
> strictly for volume management operations, used by the LVM tools, but
> have no relationship to regular write IOs.  Thus the PE size need not
> match nor be evenly divisible by the stripe width.  It's not part of the
> alignment equation.

So in the course of actually going about this, I realized that this
actually is not true (I think).

Logical volumes can only have sizes that are a multiple of the physical
extent size (by definition, really), and so there's no way to have
logical volumes end on a multiple of the array's stripe width: given my
stripe width of 9216s, there doesn't seem to be an abundance of integer
solutions to 2^n mod 9216 = 0.

So my question is, then, does it matter if logical volumes (or, really,
XFS file systems) actually end right on a multiple of the stripe width,
or only that it _begin_ on a multiple of it (leaving a bit of dead space
before the next logical volume)?

If not, I'll tweak things to ensure my stripe width is a power of 2.

Thanks again!
--
Morgan Hamill


* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-18 19:44                         ` C. Morgan Hamill
@ 2014-02-18 23:07                           ` Stan Hoeppner
  2014-02-20 18:31                             ` C. Morgan Hamill
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-18 23:07 UTC (permalink / raw)
  To: C. Morgan Hamill, xfs

On 2/18/2014 1:44 PM, C. Morgan Hamill wrote:
> Howdy, sorry for digging up this thread, but I've run into an issue
> again, and am looking for advice.
> 
> Excerpts from Stan Hoeppner's message of 2014-02-04 03:00:54 -0500:
>> After a little digging and thinking this through...
>>
>> The default PE size is 4MB but up to 16GB with LVM1, and apparently
>> unlimited size with LVM2.  It can be a few thousand times larger than
>> any sane stripe width.  This makes it pretty clear that PEs exist
>> strictly for volume management operations, used by the LVM tools, but
>> have no relationship to regular write IOs.  Thus the PE size need not
>> match nor be evenly divisible by the stripe width.  It's not part of the
>> alignment equation.
> 
> So in the course of actually going about this, I realized that this
> actually is not true (I think).

Two different issues.

> Logical volumes can only have sizes that are a multiple of the physical
> extent size (by definition, really), and so there's no way to have
> logical volumes end on a multiple of the array's stripe width: given my
> stripe width of 9216s, there doesn't seem to be an abundance of integer
> solutions to 2^n mod 9216 = 0.
> 
> So my question is, then, does it matter if logical volumes (or, really,
> XFS file systems) actually end right on a multiple of the stripe width,
> or only that it _begin_ on a multiple of it (leaving a bit of dead space
> before the next logical volume)?

Create each LV starting on a stripe boundary.  There will be some
unallocated space between LVs.  Use the mkfs.xfs -d size= option to
create your filesystems inside of each LV such that the filesystem total
size is evenly divisible by the stripe width.  This results in an
additional small amount of unallocated space within, and at the end of,
each LV.
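
A sketch of that (the LV path is a placeholder, and the size is just an
example that divides evenly by the 4608KB stripe width):

    mkfs.xfs -d su=1536k,sw=3,size=9216g /dev/vg_backup/lv_store1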

It's nice if you can line everything up, but when using RAID6 and one or
two bays for hot spares, one rarely ends up with 8 or 16 data spindles.

> If not, I'll tweak things to ensure my stripe width is a power of 2.

That's not possible with 12 data spindles per RAID, not possible with 42
drives in 3 chassis.  Not without a bunch of idle drives.

I still don't understand why you believe you need LVM in the mix, and
more than one filesystem.

>  - I need to expose, in the end, three-ish (two or four would be OK)
>    filesystems to the backup software, which should come fairly close
>    to minimizing the effects of the archive maintenance jobs (integrity
>    checks, mostly).  CrashPlan will spawn 2 jobs per store point, so
>    a max of 8 at any given time should be a nice balance between
>    under-utilizing and saturating the IO.

Backup software is unaware of mount points.  It uses paths just like
every other program.  The number of XFS filesystems is irrelevant to
"minimizing the effects of the archive maintenance jobs".  You cannot
bog down XFS.  You will bog down the drives no matter how many
filesystems when using RAID60.

Here is what you should do:

Format the RAID60 directly with XFS.  Create 3 or 4 directories for
CrashPlan to use as its "store points".  If you need to expand in the
future, as I said previously, simply add another 14 drive RAID6 chassis,
format it directly with XFS, mount it at an appropriate place in the
directory tree and give that path to CrashPlan.  Does it have a limit on
the number of "store points"?
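
As a rough sketch of that layout (the device name, mount point and
directory names are placeholders; su/sw assume the 128k chunk and 36
total data spindles of the RAID60):

  mkfs.xfs -d su=128k,sw=36 /dev/sdX        # the RAID60 LUN
  mount -o inode64 /dev/sdX /srv/backup
  mkdir /srv/backup/store1 /srv/backup/store2 /srv/backup/store3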

-- 
Stan


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-18 23:07                           ` Stan Hoeppner
@ 2014-02-20 18:31                             ` C. Morgan Hamill
  2014-02-21  3:33                               ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-20 18:31 UTC (permalink / raw)
  To: xfs

Quoting Stan Hoeppner (2014-02-18 18:07:24)
> Create each LV starting on a stripe boundary.  There will be some
> unallocated space between LVs.  Use the mkfs.xfs -d size= option to
> create your filesystems inside of each LV such that the filesystem total
> size is evenly divisible by the stripe width.  This results in an
> additional small amount of unallocated space within, and at the end of,
> each LV.

Of course, this occurred to me just after sending the message... ;)

> It's nice if you can line everything up, but when using RAID6 and one or
> two bays for hot spares, one rarely ends up with 8 or 16 data spindles.
> 
> > If not, I'll tweak things to ensure my stripe width is a power of 2.
> 
> That's not possible with 12 data spindles per RAID, not possible with 42
> drives in 3 chassis.  Not without a bunch of idle drives.

The closest I can come is 4 RAID 6 arrays of 10 disks each, striped
together:

8 * 128k = 1024k
1024k * 4 = 4096k

Which leaves me with 5 disks unused.  I might be able to live with that
if it makes things work better.  Sounds like I won't have to.


> I still don't understand why you believe you need LVM in the mix, and
> more than one filesystem.

> Backup software is unaware of mount points.  It uses paths just like
> every other program.  The number of XFS filesystems is irrelevant to
> "minimizing the effects of the archive maintenance jobs".  You cannot
> bog down XFS.  You will bog down the drives no matter how many
> filesystems when using RAID60.

A limitation of the software in question is that placing multiple
archive paths onto a single filesystem is a bit ugly: the software does
not let you specify a maximum size for the archive paths, and so will
think all of them are the size of the filesystem.  This isn't an issue
in isolation, but we need to make use of a data-balancing feature the
software has, which will not work if we place multiple archive paths on
a single filesystem.  It's a stupid issue to have, but it is what it is.

> Here is what you should do:
> 
> Format the RAID60 directly with XFS.  Create 3 or 4 directories for
> CrashPlan to use as its "store points".  If you need to expand in the
> future, as I said previously, simply add another 14 drive RAID6 chassis,
> format it directly with XFS, mount it at an appropriate place in the
> directory tree and give that path to CrashPlan.  Does it have a limit on
> the number of "store points"?

Yes, this is what I *want* to do.  There's a limit to the number of
store points, but it's large, so this would work fine if not for the
multiple-stores-on-one-filesystem issue.  Which is frustrating.

The *only* reason for LVM in the middle is to allow some flexibility of
sizing without dealing with the annoyances of the partition table.
I want to intentionally under-provision to start with because we are
using a small corner of this storage for a separate purpose but do not
know precisely how much yet.  LVM lets me leave, say, 10TB empty, until
I know exactly how big things are going to be.

It's a pile of little annoyances, but so it goes with these kinds of things.

It sounds like the little empty spots method will be fine though.

Thanks, yet again, for all your help.
--
Morgan Hamill

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-20 18:31                             ` C. Morgan Hamill
@ 2014-02-21  3:33                               ` Stan Hoeppner
  2014-02-21  8:57                                 ` Emmanuel Florac
  2014-02-21 19:17                                 ` C. Morgan Hamill
  0 siblings, 2 replies; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-21  3:33 UTC (permalink / raw)
  To: C. Morgan Hamill, xfs

On 2/20/2014 12:31 PM, C. Morgan Hamill wrote:
> Quoting Stan Hoeppner (2014-02-18 18:07:24)
>> Create each LV starting on a stripe boundary.  There will be some
>> unallocated space between LVs.  Use the mkfs.xfs -d size= option to
>> create your filesystems inside of each LV such that the filesystem total
>> size is evenly divisible by the stripe width.  This results in an
>> additional small amount of unallocated space within, and at the end of,
>> each LV.
> 
> Of course, this occurred to me just after sending the message... ;)

That's the right way to do that, but you really don't want to do this
with LVM.  It's just a mess.  You can easily do this with a single XFS
filesystem and a concatenation, with none of these alignment and sizing
headaches.  Read on.

...
> 8 * 128k = 1024k
> 1024k * 4 = 4096k
> 
> Which leaves me with 5 disks unused.  I might be able to live with that
> if it makes things work better.  Sounds like I won't have to.

Forget all of this.  Forget RAID60.  I think you'd be best served by a
concatenation.

You have a RAID chassis with 15 drives and two 15 drive JBODs daisy
chained to it, all 4TB drives, correct?  Your original setup was 1 spare
and one 14 drive RAID6 array per chassis, 12 data spindles.  Correct?
Stick with that.

Export each RAID6 as a distinct LUN to the host.  Make an mdadm --linear
array of the 3 RAID6 LUNs, devices.  Then format the md linear device,
e.g. /dev/md0 using the geometry of a single RAID6 array.  We want to
make sure each allocation group is wholly contained within a RAID6
array.  You have 48TB per array and 3 arrays, 144TB total.  1TB=1000^4
and XFS deals with TebiBytes, or 1024^4.  Max agsize is 1TiB.  So to get
exactly 48 AGs per array, 144 total AGs, we'd format with

# mkfs.xfs -d su=128k,sw=12,agcount=144
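
Spelled out as a sketch, assuming the three RAID6 LUNs show up as
/dev/sdb, /dev/sdc and /dev/sdd (placeholder names):

  # stitch the three RAID6 LUNs together end-to-end
  mdadm --create /dev/md0 --level=linear --raid-devices=3 \
      /dev/sdb /dev/sdc /dev/sdd
  # single-RAID6 geometry, 48 AGs per array
  mkfs.xfs -d su=128k,sw=12,agcount=144 /dev/md0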

The --linear array, or generically concatenation, stitches the RAID6
arrays together end-to-end.  Here the filesystem starts at LBA0 on the
first array and ends on the last LBA of the 3rd array, hence "linear".
XFS performs all operations at the AG level.  Since each AG sits atop
only one RAID6, the filesystem alignment geometry is that of a single
RAID6.  Any individual write will peak at ~1.2GB/s.  Since you're
limited by the network to 100MB/s throughput this shouldn't be an issue.

Using an md linear array you can easily expand in the future without all
the LVM headaches, by simply adding another identical RAID6 array to the
linear array (see mdadm grow) and then growing the filesystem with
xfs_growfs.  In doing so, you will want to add the new chassis before
the filesystem reaches ~70% capacity.  If you let it grow past that
point, most of your new writes may go to only the new RAID6 where the
bulk of your large free space extents now exist.  This will create an IO
hotspot on the new chassis, while the original 3 will see fewer writes.
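
The expansion step itself is a two-liner, with /dev/sde standing in for
the new RAID6 LUN and /srv/backup for the mount point (both are
placeholders):

  mdadm --grow /dev/md0 --add /dev/sde   # append the new RAID6 to the linear array
  xfs_growfs /srv/backup                 # grow the mounted filesystem into it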

Also, don't forget to mount with the "inode64" option in fstab.
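
For example, an fstab entry along these lines (the mount point is a
placeholder):

  /dev/md0   /srv/backup   xfs   defaults,inode64   0  0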

...
> A limitation of the software in question is that placing multiple
> archive paths onto a single filesystem is a bit ugly: the software does
> not let you specifiy a maximum size for the archive paths, and so will
> think all of them are the size of the filesystem.  This isn't an issue
> in isolation, but we need to make use of a data-balancing feature the
> software has, which will not work if we place multiple archive paths on
> a single filesystem.  It's a stupid issue to have, but it is what it is.

So the problem is capacity reported to the backup application.  Easy to
address, see below.

...
> Yes, this is what I *want* to do.  There's a limit to the number of
> store points, but it's large, so this would work fine if not for the
> multiple-stores-on-one-filesystem issue.  Which is frustrating.

...
> The *only* reason for LVM in the middle is to allow some flexibility of
> sizing without dealing with the annoyances of the partition table.
> I want to intentionally under-provision to start with because we are
> using a small corner of this storage for a separate purpose but do not
> know precisely how much yet.  LVM lets me leave, say, 10TB empty, until
> I know exactly how big things are going to be.

XFS has had filesystem quotas for exactly this purpose, for almost as
long as it has existed, well over 15 years.  There are 3 types of
quotas: user, group, and project.  You must enable quotas with a mount
option.  You manipulate quotas with the xfs_quota command.  See

man xfs_quota
man mount

Project quotas are set on a directory tree level.  Set a soft and hard
project quota on a directory and the available space reported to any
process writing into it or its subdirectories is that of the project
quota, not the actual filesystem free space.  The quota can be increased
or decreased at will using xfs_quota.  That solves your "sizing" problem
rather elegantly.
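
A minimal project quota sketch (the project name/ID, paths and limits
are placeholders, and the filesystem must be mounted with the
prjquota/pquota option):

  # map a directory tree to a project ID and name
  echo '42:/srv/backup/store1' >> /etc/projects
  echo 'store1:42'             >> /etc/projid
  # initialize the project and set soft/hard block limits
  xfs_quota -x -c 'project -s store1' /srv/backup
  xfs_quota -x -c 'limit -p bsoft=9216g bhard=10240g store1' /srv/backup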


Now, when using a concatenation (md linear array), reaping the rewards
of parallelism requires that the application create lots of directories
with a fairly even spread of file IO.  In this case, to get
all 3 RAID6 arrays into play, that requires creation and use of at
minimum 97 directories.  Most backup applications make tons of
directories so you should be golden here.

> It's a pile of little annoyances, but so it goes with these kinds of things.
> 
> It sounds like the little empty spots method will be fine though.

No empty spaces required.  No LVM required.  XFS atop an md linear array
with project quotas should solve all of your problems.

> Thanks, yet again, for all your help.

You're welcome Morgan.  I hope this helps steer you towards what I think
is a much better architecture for your needs.

Dave and I both initially said RAID60 was an ok way to go, but the more
I think this through, considering ease of expansion, using a single
filesystem and project quotas, it's hard to beat the concat setup.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-21  3:33                               ` Stan Hoeppner
@ 2014-02-21  8:57                                 ` Emmanuel Florac
  2014-02-22  2:21                                   ` Stan Hoeppner
  2014-02-21 19:17                                 ` C. Morgan Hamill
  1 sibling, 1 reply; 27+ messages in thread
From: Emmanuel Florac @ 2014-02-21  8:57 UTC (permalink / raw)
  To: stan; +Cc: C. Morgan Hamill, xfs

On Thu, 20 Feb 2014 21:33:31 -0600, you wrote:

> Forget all of this.  Forget RAID60.  I think you'd be best served by a
> concatenation.

I fully agree, though I'd use... LVM to perform the concatenation;
it's much more convenient and easier to use than md, IMO.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-21  3:33                               ` Stan Hoeppner
  2014-02-21  8:57                                 ` Emmanuel Florac
@ 2014-02-21 19:17                                 ` C. Morgan Hamill
  1 sibling, 0 replies; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-21 19:17 UTC (permalink / raw)
  To: stan; +Cc: xfs


On Thu, February 20, 2014 10:33 pm, Stan Hoeppner wrote:
>  Forget all of this.  Forget RAID60.  I think you'd be best served by a
>  concatenation.
>
>  You have a RAID chassis with 15 drives and two 15 drive JBODs daisy
>  chained to it, all 4TB drives, correct?  Your original setup was 1 spare
>  and one 14 drive RAID6 array per chassis, 12 data spindles.  Correct?
>  Stick with that.

It's all in one chassis, but correct.

>  Export each RAID6 as a distinct LUN to the host.  Make an mdadm --linear
>  array of the 3 RAID6 LUNs, devices.  Then format the md linear device,
>  e.g. /dev/md0 using the geometry of a single RAID6 array.  We want to
>  make sure each allocation group is wholly contained within a RAID6
>  array.  You have 48TB per array and 3 arrays, 144TB total.  1TB=1000^4
>  and XFS deals with TebiBytes, or 1024^4.  Max agsize is 1TiB.  So to get
>  exactly 48 AGs per array, 144 total AGs, we'd format with
>
>  # mkfs.xfs -d su=128k,sw=12,agcount=144

I am intrigued...

>  The --linear array, or generically concatenation, stitches the RAID6
>  arrays together end-to-end.  Here the filesystem starts at LBA0 on the
>  first array and ends on the last LBA of the 3rd array, hence "linear".
>  XFS performs all operations at the AG level.  Since each AG sits atop
>  only one RAID6, the filesystem alignment geometry is that of a single
>  RAID6.  Any individual write will peak at ~1.2GB/s.  Since you're
>  limited by the network to 100MB/s throughput this shouldn't be an issue.
>
>  Using an md linear array you can easily expand in the future without all
>  the LVM headaches, by simply adding another identical RAID6 array to the
>  linear array (see mdadm grow) and then growing the filesystem with
>  xfs_growfs.

How does this differ from standard linear LVM? Is it simply that we avoid
the extent size issue?

>  In doing so, you will want to add the new chassis before
>  the filesystem reaches ~70% capacity.  If you let it grow past that
>  point, most of your new writes may go to only the new RAID6 where the
>  bulk of your large free space extents now exist.  This will create an IO
>  hotspot on the new chassis, while the original 3 will see fewer writes.

Good to know.

>  XFS has had filesystem quotas for exactly this purpose, for almost as
>  long as it has existed, well over 15 years.  There are 3 types of
>  quotas: user, group, and project.  You must enable quotas with a mount
>  option.  You manipulate quotas with the xfs_quota command.  See
>
>  man xfs_quota
>  man mount
>
>  Project quotas are set on a directory tree level.  Set a soft and hard
>  project quota on a directory and the available space reported to any
>  process writing into it or its subdirectories is that of the project
>  quota, not the actual filesystem free space.  The quota can be increased
>  or decreased at will using xfs_quota.  That solves your "sizing" problem
>  rather elegantly.

Oh, I was unaware of project quotas.

>  Now, when using a concatenation, md linear array, to reap the rewards of
>  parallelism the requirement is that the application creates lots of
>  directories with a fairly even spread of file IO.  In this case, to get
>  all 3 RAID6 arrays into play, that requires creation and use of at
>  minimum 97 directories.  Most backup applications make tons of
>  directories so you should be golden here.

Yes, quite a few directories are created.

>  You're welcome Morgan.  I hope this helps steer you towards what I think
>  is a much better architecture for your needs.
>
>  Dave and I both initially said RAID60 was an ok way to go, but the more
>  I think this through, considering ease of expansion, using a single
>  filesystem and project quotas, it's hard to beat the concat setup.

Seems like this will work quite well. Thanks so much for all your help.
-- 
Morgan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-21  8:57                                 ` Emmanuel Florac
@ 2014-02-22  2:21                                   ` Stan Hoeppner
  2014-02-25 17:04                                     ` C. Morgan Hamill
  0 siblings, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-22  2:21 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: C. Morgan Hamill, xfs

On 2/21/2014 2:57 AM, Emmanuel Florac wrote:
> On Thu, 20 Feb 2014 21:33:31 -0600, you wrote:
> 
>> Forget all of this.  Forget RAID60.  I think you'd be best served by a
>> concatenation.
> 
> I fully  agree, though I'd use... LVM to perform the concatenation,
> much more convenient and easy to use than md IMO.

Using md linear eliminates the LVM physical extent size misalignment
issue (the non-power-of-2 stripe width) we discussed at length
up-thread.  Using LVM makes things decidedly more difficult, for zero
gain.  LVM just isn't appropriate for Morgan's situation.

Now, it's possible he could do this entirely in the RAID firmware.
However he has not stated which storage product he has, and thus I don't
know its capabilities, whether it can create or seamlessly expand a
concatenation.  Linux md can do all of this very easily and is deployed
by many people in this exact scenario.

-- 
Stan


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-22  2:21                                   ` Stan Hoeppner
@ 2014-02-25 17:04                                     ` C. Morgan Hamill
  2014-02-25 17:17                                       ` Emmanuel Florac
  2014-02-25 20:08                                       ` Stan Hoeppner
  0 siblings, 2 replies; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-25 17:04 UTC (permalink / raw)
  To: stan; +Cc: xfs

Excerpts from Stan Hoeppner's message of 2014-02-21 21:21:27 -0500:
> Now, it's possible he could do this entirely in the RAID firmware.
> However he has not stated which storage product he has, and thus I don't
> know its capabilities, whether it can create or seamlessly expand a
> concatenation.  Linux md can do all of this very easily and is deployed
> by many people in this exact scenario.

On this note, I'm using an Areca ARC-1882.  I've been looking for
documentation regarding concatenation with this, and having a bit of
trouble.

Do you happen to be familiar with the product?
--
Morgan Hamill

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-25 17:04                                     ` C. Morgan Hamill
@ 2014-02-25 17:17                                       ` Emmanuel Florac
  2014-02-25 20:08                                       ` Stan Hoeppner
  1 sibling, 0 replies; 27+ messages in thread
From: Emmanuel Florac @ 2014-02-25 17:17 UTC (permalink / raw)
  To: xfs

On Tue, 25 Feb 2014 12:04:10 -0500,
"C. Morgan Hamill" <chamill@wesleyan.edu> wrote:

> On this note, I'm using an Areca ARC-1882.  I've been looking for
> documentation regarding concatenation with this, and having a bit of
> trouble.
> 

Unless Areca cards changed a lot in capabilities recently, it's not
possible at all. You can expand a RAID set but it's generally a bad
idea.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-25 17:04                                     ` C. Morgan Hamill
  2014-02-25 17:17                                       ` Emmanuel Florac
@ 2014-02-25 20:08                                       ` Stan Hoeppner
  2014-02-26 14:19                                         ` C. Morgan Hamill
  1 sibling, 1 reply; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-25 20:08 UTC (permalink / raw)
  To: C. Morgan Hamill; +Cc: xfs

On 2/25/2014 11:04 AM, C. Morgan Hamill wrote:
> Excerpts from Stan Hoeppner's message of 2014-02-21 21:21:27 -0500:
>> Now, it's possible he could do this entirely in the RAID firmware.
>> However he has not stated which storage product he has, and thus I don't
>> know its capabilities, whether it can create or seamlessly expand a
>> concatenation.  Linux md can do all of this very easily and is deployed
>> by many people in this exact scenario.
> 
> On this note, I'm using an Areca ARC-1882.  I've been looking for
> documentation regarding concatenation with this, and having a bit of
> trouble.
> 
> Do you happen to be familiar with the product?

Only enough to recommend you replace it immediately with an LSI or
Adaptec.  Areca is an absolutely tiny Taiwanese company with inferior
products and, from what I gather, horrible support for North American
customers, and Linux customers in general.  The vast majority of their
customers seem to be SOHOs and individuals using the boards in MS
Windows servers, with very few running more than a handful of drives,
and few running lots of drives doing serious work.  If you run into any
kind of performance issue with their board, and explain to them your
number of drives and arrays, OS platform and workload, they'll be
baffled like a 3rd grader and have no idea how to respond.

The odd thing is that this isn't reflected in the price of their
products, which are not substantially cheaper than the best-of-breed
LSI boards that come with LSI's phenomenal support structure.  And
there are plenty of LSI Linux customers running hundreds of drives with
real workloads.

Areca has no real presence in North America, or any other country for
that matter.  They're headquartered in Taiwan and have a "global office"
there.  Speaking of their "North American support", their ~1000 ft^2
office is in an industrial park in Walnut, CA, directly across the
street from "Steve's Refrigeration Supply".  Check out the Google street
view for their office address, 150 Commerce Way, Walnut, CA 91789

Now let's have a look at LSI's North American presence.
http://www.lsi.com/northamerica/pages/northamerica.aspx#tab/tab-contactus

Now let's look at prices for the ARC-1882 and LSI's fastest 8P card.

Areca ARC-1882I
PCIe 2.0/3.0 x8, 1GB DDR3-1333, 800 MHz Dual Core RAID-on-Chip ASIC, 2x
SFF-8088 6G SAS, supports 128 drives
http://www.newegg.com/Product/Product.aspx?Item=N82E16816151105
$620

Battery Backed Write Cache module, 72hr max backup time, ARC-6120-T121
$130

Solution cost:  $750


LSI 9361-8i

PCIe 3.0 x8, 1GB DDR3-1866, 1.2GHz LSISAS3108 dual core RAID-On-Chip
ASIC, 2x SFF-8643 12G SAS, supports 128 drives
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118230
$570

Flash Backed Write Cache module, LSICVM02, unlimited backup time
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118232
$190

Solution cost:  $760


The Areca uses inferior older technology, has inferior performance,
limited firmware feature set which doesn't support spans
(concatenation), near non-existent US support especially for advanced
Linux workloads/users, only offers battery cache backup, and is all of ...

$10 USD cheaper than the category equivalent yet vastly superior LSI.

By some off chance you don't already know, LSI is the industry gold
standard RAID HBA.  They are the sole RAID HBA OEM board supplier to
Dell, IBM, Intel, Lenovo, Fujitsu/Siemens, etc, and their ASICs are used
by many others on their in house designs.  LSI's ASICs and firmware have
seen more Linux workloads of all shapes and sizes than all other
vendors' RAID HBAs combined.

Given all of the above, and that there are at least 3 other LSI boards
of superior performance, in the same price range for the past year, why
did you go with Areca?

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-25 20:08                                       ` Stan Hoeppner
@ 2014-02-26 14:19                                         ` C. Morgan Hamill
  2014-02-26 17:49                                           ` Stan Hoeppner
  0 siblings, 1 reply; 27+ messages in thread
From: C. Morgan Hamill @ 2014-02-26 14:19 UTC (permalink / raw)
  To: stan; +Cc: xfs

Excerpts from Stan Hoeppner's message of 2014-02-25 15:08:44 -0500:
> Only enough to recommend you to replace it immediately with an LSI or
> Adaptec.  Areca is an absolutely tiny Taiwanese company with inferior
> product and, from what I gather, horrible support for North American
> customers, and Linux customers in general.  The vast majority of their
> customers seem to be SOHOs and individuals using the boards in MS
> Windows servers, with very few running more than a handful of drives,
> and few running lots of drives doing serious work.

Noted.

> If you run into any kind of performance issue with their board, and
> explain to them your number of drives and arrays, OS platform and
> workload, they'll be baffled like a 3rd grader and have no idea how to
> respond.

For better or worse, this will be in line with the "support" I've
experienced from the vast majority of vendors I've had to deal with.

> The Areca uses inferior older technology, has inferior performance,
> limited firmware feature set which doesn't support spans
> (concatenation), near non-existent US support especially for advanced
> Linux workloads/users, only offers battery cache backup, and is all of ...
> 
> $10 USD cheaper than the category equivalent yet vastly superior LSI.

Does seem to be the case.

> By some off chance you don't already know, LSI is the industry gold
> standard RAID HBA.  They are the sole RAID HBA OEM board supplier to
> Dell, IBM, Intel, Lenovo, Fujitsu/Siemens, etc, and their ASICs are used
> by many others on their in house designs.  LSI's ASICs and firmware have
> seen more Linux workloads of all shapes and sizes than all other
> vendors' RAID HBAs combined.

I am aware; all our servers have LSI in them for boot arrays and
whatnot.

> Given all of the above, and that there are at least 3 other LSI boards
> of superior performance, in the same price range for the past year, why
> did you go with Areca?

For better or worse, they're what we were able to get from our white box
vendor.  It will, unfortunately, have to do for now.  I'll be sure to
make a note for future expansion.

Until then, we'll just have to tread carefully.

Thanks again for all of your help.
--
Morgan Hamill

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Question regarding XFS on LVM over hardware RAID.
  2014-02-26 14:19                                         ` C. Morgan Hamill
@ 2014-02-26 17:49                                           ` Stan Hoeppner
  0 siblings, 0 replies; 27+ messages in thread
From: Stan Hoeppner @ 2014-02-26 17:49 UTC (permalink / raw)
  To: C. Morgan Hamill; +Cc: xfs

On 2/26/2014 8:19 AM, C. Morgan Hamill wrote:
> Excerpts from Stan Hoeppner's message of 2014-02-25 15:08:44 -0500:
>> Only enough to recommend you replace it immediately with an LSI or
>> Adaptec.  Areca is an absolutely tiny Taiwanese company with inferior
>> products and, from what I gather, horrible support for North American
>> customers, and Linux customers in general.  The vast majority of their
>> customers seem to be SOHOs and individuals using the boards in MS
>> Windows servers, with very few running more than a handful of drives,
>> and few running lots of drives doing serious work.
> 
> Noted.
> 
>> If you run into any kind of performance issue with their board, and
>> explain to them your number of drives and arrays, OS platform and
>> workload, they'll be baffled like a 3rd grader and have no idea how to
>> respond.
> 
> For better or worse, this will be in line with the "support" I've
> experienced from the vast majority of vendors I've had to deal with.

Edus often have tight(er) budgets, so this goes with the
territory, unfortunately.  On the bright side, one tends to learn quite
a bit about the hardware industry, the secret sauce that separates two
vendors using the same ASICs, where the value add comes from, etc.
This, out of necessity.

>> The Areca uses inferior older technology, has inferior performance,
>> limited firmware feature set which doesn't support spans
>> (concatenation), near non-existent US support especially for advanced
>> Linux workloads/users, only offers battery cache backup, and is all of ...
>>
>> $10 USD cheaper than the category equivalent yet vastly superior LSI.
> 
> Does seem to be the case.
> 
>> By some off chance you don't already know, LSI is the industry gold
>> standard RAID HBA.  They are the sole RAID HBA OEM board supplier to
>> Dell, IBM, Intel, Lenovo, Fujitsu/Siemens, etc, and their ASICs are used
>> by many others on their in house designs.  LSI's ASICs and firmware have
>> seen more Linux workloads of all shapes and sizes than all other
>> vendors' RAID HBAs combined.
> 
> I am aware; all our servers have LSI in them for boot arrays and
> whatnot.
> 
>> Given all of the above, and that there are at least 3 other LSI boards
>> of superior performance, in the same price range for the past year, why
>> did you go with Areca?
> 
> For better or worse, they're what we were able to get from our white box
> vendor.  It will, unfortunately, have to do for now.  I'll be sure to
> make a note for future expansion.

In that case, exercise it mercilessly with your workload to surface any
problems the firmware may have with the triple RAID6 setup.  Yank a
drive from each array while under full IO load, etc.  Even if Areca
can't provide answers or fixes to problems you uncover, if you can
identify problem spots before production, you can document these and
take steps to mitigate them.
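
If it helps, something as simple as an fio run against the mounted
filesystem will generate that kind of sustained load (the path and job
parameters below are arbitrary placeholders, not a tuned test):

  fio --name=stress --directory=/srv/backup --rw=write --bs=1M \
      --size=20G --numjobs=8 --ioengine=libaio --direct=1 \
      --time_based --runtime=600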

> Until then, we'll just have to tread carefully.

From what I understand their hardware QC is decent, so board failure
shouldn't be an issue.  The issues usually deal with firmware
immaturity.  They're a tiny company with limited resources, so they
simply can't do much workload testing with multiple array
configurations.  Thus their customers running higher end workloads often
end up being guinea pigs and identifying firmware deficiencies for them,
and suffering performance chasms in the process.

LSI, Adaptec, etc. do have firmware issues as well on occasion.  But
their test lab resources allow them to flush most of these out before
the boards reach customers.

> Thanks again for all of your help.

You bet.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2014-02-26 17:49 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-29 14:26 Question regarding XFS on LVM over hardware RAID C. Morgan Hamill
2014-01-29 15:07 ` Eric Sandeen
2014-01-29 19:11   ` C. Morgan Hamill
2014-01-29 23:55     ` Stan Hoeppner
2014-01-30 14:28       ` C. Morgan Hamill
2014-01-30 20:28         ` Dave Chinner
2014-01-31  5:58           ` Stan Hoeppner
2014-01-31 21:14             ` C. Morgan Hamill
2014-02-01 21:06               ` Stan Hoeppner
2014-02-02 21:21                 ` Dave Chinner
2014-02-03 16:12                   ` C. Morgan Hamill
2014-02-03 21:41                     ` Dave Chinner
2014-02-04  8:00                       ` Stan Hoeppner
2014-02-18 19:44                         ` C. Morgan Hamill
2014-02-18 23:07                           ` Stan Hoeppner
2014-02-20 18:31                             ` C. Morgan Hamill
2014-02-21  3:33                               ` Stan Hoeppner
2014-02-21  8:57                                 ` Emmanuel Florac
2014-02-22  2:21                                   ` Stan Hoeppner
2014-02-25 17:04                                     ` C. Morgan Hamill
2014-02-25 17:17                                       ` Emmanuel Florac
2014-02-25 20:08                                       ` Stan Hoeppner
2014-02-26 14:19                                         ` C. Morgan Hamill
2014-02-26 17:49                                           ` Stan Hoeppner
2014-02-21 19:17                                 ` C. Morgan Hamill
2014-02-03 16:07                 ` C. Morgan Hamill
2014-01-29 22:40   ` Stan Hoeppner
