* XFS on top RAID10 with odd drives count and 2 near copies
@ 2012-02-10 15:17 CoolCold
  2012-02-11  4:05 ` Stan Hoeppner
  0 siblings, 1 reply; 40+ messages in thread
From: CoolCold @ 2012-02-10 15:17 UTC (permalink / raw)
  To: Linux RAID

I've got a server with 7 SATA drives (Hetzner's XS13, to be precise)
and created an mdadm RAID10 with two near copies, then put LVM on it.
Now I'm planning to create an XFS filesystem, but I'm a bit confused
about the stripe width/stripe unit values.

As the drive count is 7 and the copy count is 2, a simple calculation
gives a data-drive count of "3.5", which looks ugly. If I understand
the whole idea of sunit/swidth correctly, XFS should fill (or buffer) a
full stripe (sunit * data disks) and then do the write, so the
optimization kicks in and all disks work at once.

I imagine the data distribution looks like this:

A1 A1 A2 A2 A3 A3 A4
A4 A5 A5 A6 A6 A7 A7
A8 A8 A9 A9 A10 A10 A11
A11 ...

So, there are two candidate "optimal" write sizes:
a) a 4-chunk write touches all 7 drives (one drive is touched twice)
b) a 7-chunk write touches all 7 drives (every drive is touched twice,
but maybe caching/merging will take place somehow)

My read load is going to be nearly random reads (sending pictures over
HTTP), so it looks like it doesn't matter much how sunit/swidth is set.

My current raid setup is:
    root@datastor1:~# cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
    md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1] sda5[0]
          10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7] [UUUUUUU]
          [>....................]  resync =  0.8%
(81543680/10106943808) finish=886.0min speed=188570K/sec
          bitmap: 76/76 pages [304KB], 65536KB chunk



Almost-default mkfs.xfs options produced the following:

    root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
    meta-data=/dev/data/db       isize=256    agcount=32, agsize=16777216 blks
             =                       sectsz=512   attr=2, projid32bit=0
    data     =                       bsize=4096   blocks=536870912, imaxpct=5
             =                       sunit=16     swidth=112 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=262144, version=2
             =                       sectsz=512   sunit=16 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0


As I can see, it created swidth = 112/16 = 7 chunks, which matches my
variant b), so I guess I will leave it this way.
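
If I wanted to set those values by hand instead of relying on
autodetection, I believe the mkfs.xfs su/sw options would look roughly
like this (untested sketch, same device as above):

    # keep what mkfs.xfs detected: 64K stripe unit, 7-chunk stripe width
    mkfs.xfs -f -l lazy-count=1 -d su=64k,sw=7 /dev/data/db

    # or force variant a): 64K stripe unit, 4-chunk stripe width
    mkfs.xfs -f -l lazy-count=1 -d su=64k,sw=4 /dev/data/db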

So, I'll be glad if anyone can review my thoughts and share yours.


-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-10 15:17 XFS on top RAID10 with odd drives count and 2 near copies CoolCold
@ 2012-02-11  4:05 ` Stan Hoeppner
  2012-02-11 14:32   ` David Brown
  2012-02-12 20:16   ` CoolCold
  0 siblings, 2 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-11  4:05 UTC (permalink / raw)
  To: CoolCold; +Cc: Linux RAID

On 2/10/2012 9:17 AM, CoolCold wrote:
> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
> and created mdadm's raid10 with two near copies, then put LVM on it.
> Now I'm planning to create xfs filesystem, but a bit confused about
> stripe width/stripe unit values.

Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
so it can't be for expansion flexibility.  If you don't 'need' LVM don't
use it.  It unnecessarily complicates your setup and can degrade
performance.

> As drives count is 7 and copies count is 2, so simple calculation
> gives me datadrives count "3.5" which looks ugly. If I understand the
> whole idea of sunit/swidth right, it should fill (or buffer) the full
> stripe (sunit * data disks) and then do write, so optimization takes
> place and all disks will work at once.

Pretty close.  Stripe alignment only applies to allocation, i.e. new
file creation, and to journal log writes, but not to file re-writes or
read ops.  Note that stripe alignment will gain you nothing if your
allocation workload doesn't match the stripe geometry.  For example,
writing a 32KB file every 20 seconds: it'll take too long to fill the
buffer before it's flushed, and it's a tiny file, so you'll end up with
many partial stripe width writes.

> My read load going be near random read ( sending pictures over http )
> and looks like it doesn't matter how it will be set with sunit/swidth.

~13TB of "pictures" to serve eh?  Average JPG file size will be
relatively small, correct?  Less than 1MB?  No, stripe alignment won't
really help this workload at all, unless you upload a million files in
one shot to populate the server.  In that case alignment will make the
process complete more quickly.

>     root@datastor1:~# cat /proc/mdstat
>     Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>     md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1] sda5[0]
>           10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7] [UUUUUUU]
>           [>....................]  resync =  0.8%
> (81543680/10106943808) finish=886.0min speed=188570K/sec
>           bitmap: 76/76 pages [304KB], 65536KB chunk

> Almost default mkfs.xfs creating options produced:
> 
>     root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
>     meta-data=/dev/data/db       isize=256    agcount=32, agsize=16777216 blks
>              =                       sectsz=512   attr=2, projid32bit=0
>     data     =                       bsize=4096   blocks=536870912, imaxpct=5
>              =                       sunit=16     swidth=112 blks
>     naming   =version 2              bsize=4096   ascii-ci=0
>     log      =internal log           bsize=4096   blocks=262144, version=2
>              =                       sectsz=512   sunit=16 blks, lazy-count=1
>     realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> As I can see, it is created 112/16 = 7 chunks swidth, which correlate
> with my version b) , and I guess I will leave it this way.

The default mkfs.xfs algorithms don't seem to play well with the
mdraid10 near/far copy layouts.  The above configuration is doing a 7
spindle stripe of 64KB, for a 448KB total stripe size.  This doesn't
seem correct, as I don't believe a 7 drive RAID10 near is giving you 7
spindles of stripe width.  I'm no expert on the near/far layouts, so I
could be wrong here.  If a RAID0 stripe would yield a 7 spindle stripe
width, I don't see how a RAID10/near would also be 7.  A straight RAID10
with 8 drives would give a 4 spindle stripe width.

> So, I'll be glad if anyone can review my thoughts and share yours.

To provide you with any kind of concrete real world advice we need more
details about your write workload/pattern.  In absence of that, and
given what you've already stated, that the application is "sending
pictures over http", then this seems to be a standard static web server
workload.  In that case disk access, especially write throughput, is
mostly irrelevant, as memory capacity becomes the performance limiting
factor.  Given that you have 12GB of RAM for Apache/nginx/Lighty and
buffer cache, how you setup the storage probably isn't going to make a
big difference from a performance standpoint.

That said, for this web server workload, you'll be better off if you
avoid any kind of striping altogether, especially if using XFS.  You'll
be dealing with millions of small picture files I assume, in hundreds or
thousands of directories?  In that case play to XFS' strengths.  Here's
how you do it:

1.  You chose mdraid10/near strictly because you have 7 disks and wanted
to use them all.  You must eliminate that mindset.  Redo the array with
6 disks leaving the 7th as a spare (smart thing to do anyway).  What can
you really do with 10.5TB that you can't with 9TB?

2.  Take your 6 disks and create 3 mdraid1 mirror pairs--don't use
partitions as these are surely Advanced Format drives.  Now take those 3
mdraid mirror devices and create a layered mdraid --linear array of the
three.  The result will be a ~9TB mdraid device.

3.  Using a linear concat of 3 mirrors with XFS will yield some
advantages over a striped array for this picture serving workload.
Format the array with:

/$ mkfs.xfs -d agcount=12 /dev/mdx

That will give you 12 allocation groups of 750GB each, 4 AGs per
effective spindle.  Using too many AGs will cause excessive head seeking
under load, especially with a low disk count in the array.  The mkfs.xfs
agcount default is 4 for this reason.  As a general rule you want a
lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
agcount with fast drives (10k, 15k).
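
Roughly, steps 2 and 3 would look something like the following (the
device names here are only placeholders--substitute your actual six
data disks and free md numbers):

    # three RAID1 pairs on whole disks
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdd /dev/sde
    mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sdf /dev/sdg

    # linear concatenation of the three mirrors (~9TB)
    mdadm --create /dev/md20 --level=linear --raid-devices=3 \
        /dev/md10 /dev/md11 /dev/md12

    # XFS with 12 allocation groups on top
    mkfs.xfs -d agcount=12 /dev/md20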

Directories drive XFS parallelism, with each directory being created in
a different AG, allowing XFS to write/read 12 files in parallel (far in
excess of the IO capabilities of the 3 drives) without having to worry
about stripe alignment.  Since your file layout will have many hundreds
or thousands of directories and millions of files, you'll get maximum
performance from this setup.

As I said, if I understand your workload correctly, array/filesystem
layout probably doesn't make much difference.  But if you're after
something optimal and less complicated, for peace of mind, etc, this is
a better solution than the 7 disk RAID10 near layout with XFS.

Oh, and don't forget to mount the XFS filesystem with the inode64
option in any case; otherwise performance will be much less than
optimal, and you may run out of directory inodes as the FS fills up.
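
For example (device and mount point are placeholders):

    mount -o inode64 /dev/md20 /data

    # or the equivalent /etc/fstab entry:
    /dev/md20   /data   xfs   inode64   0  0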

Hope this information was helpful.

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-11  4:05 ` Stan Hoeppner
@ 2012-02-11 14:32   ` David Brown
  2012-02-12 20:16   ` CoolCold
  1 sibling, 0 replies; 40+ messages in thread
From: David Brown @ 2012-02-11 14:32 UTC (permalink / raw)
  To: stan, CoolCold; +Cc: Linux RAID

On 11/02/12 05:05, Stan Hoeppner wrote:
> On 2/10/2012 9:17 AM, CoolCold wrote:
>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>> and created mdadm's raid10 with two near copies, then put LVM on it.
>> Now I'm planning to create xfs filesystem, but a bit confused about
>> stripe width/stripe unit values.
>

Why are you using "near" copies?  raid10,n2 is usually a little faster 
for writes (since there is less head movement between writing the two 
copies), but raid10,f2 (far layout) is a lot faster for reads (better 
striping for larger files, and most reads come from the faster outer 
halves of the disks).  So if you have a read-to-write ratio of more than 
about 2 or 3, you probably want far layout.
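
For reference, recreating the array with the far layout would be
something along these lines (destructive, of course; the devices are
just taken from your mdstat output):

    mdadm --stop /dev/md3
    mdadm --create /dev/md3 --level=10 --layout=f2 --chunk=64 \
        --raid-devices=7 /dev/sd[a-g]5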

> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
> use it.  It unnecessarily complicates your setup and can degrade
> performance.
>

I agree here.  LVM is wonderful if you have multiple logical partitions 
and filesystems on the array, or if you want to be able to expand the 
array later (growing with LVM is very fast, safe and easy, though 
seldom as optimal in speed as re-shaping the raid array).  However, if 
your array is fixed size and you only have one filesystem, it's 
typically best to keep it simple by omitting the LVM layer.
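
As an illustration of how easy growing is (LV name and mount point are
placeholders):

    lvextend -L +1T /dev/data/db
    xfs_growfs /srv/data        # grow the filesystem to fill the LV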

>> As drives count is 7 and copies count is 2, so simple calculation
>> gives me datadrives count "3.5" which looks ugly. If I understand the
>> whole idea of sunit/swidth right, it should fill (or buffer) the full
>> stripe (sunit * data disks) and then do write, so optimization takes
>> place and all disks will work at once.
>
> Pretty close.  Stripe alignment is only applicable to allocation i.e new
> file creation, and log journal writes, but not file re-write nor read
> ops.  Note that stripe alignment will gain you nothing if your
> allocation workload doesn't match the stripe alignment.  For example
> writing a 32KB file every 20 seconds.  It'll take too long to fill the
> buffer before it's flushed and it's a tiny file, so you'll end up with
> many partial stripe width writes.
>
>> My read load going be near random read ( sending pictures over http )
>> and looks like it doesn't matter how it will be set with sunit/swidth.
>
> ~13TB of "pictures" to serve eh?  Average JPG file size will be
> relatively small, correct?  Less than 1MB?  No, stripe alignment won't
> really help this workload at all, unless you upload a million files in
> one shot to populate the server.  In that case alignment will make the
> process complete more quickly.
>
>>      root@datastor1:~# cat /proc/mdstat
>>      Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>      md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1] sda5[0]
>>            10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7] [UUUUUUU]
>>            [>....................]  resync =  0.8%
>> (81543680/10106943808) finish=886.0min speed=188570K/sec
>>            bitmap: 76/76 pages [304KB], 65536KB chunk
>
>> Almost default mkfs.xfs creating options produced:
>>
>>      root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
>>      meta-data=/dev/data/db       isize=256    agcount=32, agsize=16777216 blks
>>               =                       sectsz=512   attr=2, projid32bit=0
>>      data     =                       bsize=4096   blocks=536870912, imaxpct=5
>>               =                       sunit=16     swidth=112 blks
>>      naming   =version 2              bsize=4096   ascii-ci=0
>>      log      =internal log           bsize=4096   blocks=262144, version=2
>>               =                       sectsz=512   sunit=16 blks, lazy-count=1
>>      realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>>
>> As I can see, it is created 112/16 = 7 chunks swidth, which correlate
>> with my version b) , and I guess I will leave it this way.
>
> The default mkfs.xfs algorithms don't seem to play well with the
> mdraid10 near/far copy layouts.  The above configuration is doing a 7
> spindle stripe of 64KB, for a 448KB total stripe size.  This doesn't
> seem correct, as I don't believe a 7 drive RAID10 near is giving you 7
> spindles of stripe width.  I'm no expert on the near/far layouts, so I
> could be wrong here.  If a RAID0 stripe would yield a 7 spindle stripe
> width, I don't see how a RAID10/near would also be 7.  A straight RAID10
> with 8 drives would give a 4 spindle stripe width.
>

The key point about Linux mdadm raid10 is that it works with any 
number of disks, and you /do/ get stripes across all spindles.  In 
particular, with raid10,far you get better read performance than with 
raid0 (especially for large streamed reads).

So dedicating one drive as a hot spare will reduce the throughput a 
little - but I'd agree with you that it is probably a good idea.

If the system is serving multiple concurrent small files, then your 
suggestion of 3 pairs linearly concatenated for XFS is not bad.  But I 
suspect performance would still be better with the 6 (or maybe 7) drive 
raid10,far, especially for read-heavy applications.


>> So, I'll be glad if anyone can review my thoughts and share yours.
>
> To provide you with any kind of concrete real world advice we need more
> details about your write workload/pattern.  In absence of that, and
> given what you've already stated, that the application is "sending
> pictures over http", then this seems to be a standard static web server
> workload.  In that case disk access, especially write throughput, is
> mostly irrelevant, as memory capacity becomes the performance limiting
> factor.  Given that you have 12GB of RAM for Apache/nginx/Lighty and
> buffer cache, how you setup the storage probably isn't going to make a
> big difference from a performance standpoint.
>
> That said, for this web server workload, you'll be better off it you
> avoid any kind of striping altogether, especially if using XFS.  You'll
> be dealing with millions of small picture files I assume, in hundreds or
> thousands of directories?  In that case play to XFS' strengths.  Here's
> how you do it:
>
> 1.  You chose mdraid10/near strictly because you have 7 disks and wanted
> to use them all.  You must eliminate that mindset.  Redo the array with
> 6 disks leaving the 7th as a spare (smart thing to do anyway).  What can
> you really to with 10.5TB that you can't with 9TB?
>
> 2.  Take your 6 disks and create 3 mdraid1 mirror pairs--don't use
> partitions as these are surely Advanced Format drives.  Now take those 3
> mdraid mirror devices and create a layered mdraid --linear array of the
> three.  The result will be a ~9TB mdraid device.
>
> 3.  Using a linear concat of 3 mirrors with XFS will yield some
> advantages over a striped array for this picture serving workload.
> Format the array with:
>
> /$ mkfs.xfs -d agcount=12 /dev/mdx
>
> That will give you 12 allocation groups of 750GB each, 4 AGs per
> effective spindle.  Using too many AGs will cause excessive head seeking
> under load, especially with a low disk count in the array.  The mkfs.xfs
> agcount default is 4 for this reason.  As a general rule you want a
> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
> agcount with fast drives (10k, 15k).
>
> Directories drive XFS parallelism, with each directory being created in
> a different AG, allowing XFS to write/read 12 files in parallel (far in
> excess of the IO capabilities of the 3 drives) without having to worry
> about stripe alignment.  Since your file layout will have many hundreds
> or thousands of directories and millions of files, you'll get maximum
> performance from this setup.
>
> As I said, if I understand your workload correctly, array/filesystem
> layout probably don't make much difference.  But if you're after
> something optimal and less complicated, for piece of mind, etc, this is
> a better solution than the 7 disk RAID10 near layout with XFS.
>
> Oh, and don't forget to mount the XFS filesystem with the inode64 option
> in any case, lest performance will be much less than optimal, and you
> may run out of directory inodes as the FS fills up.
>
> Hope this information was helpful.
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-11  4:05 ` Stan Hoeppner
  2012-02-11 14:32   ` David Brown
@ 2012-02-12 20:16   ` CoolCold
  2012-02-13  8:50     ` David Brown
                       ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: CoolCold @ 2012-02-12 20:16 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID

First of all, Stan, thanks for such a detailed answer, I greatly appreciate it!

On Sat, Feb 11, 2012 at 8:05 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 2/10/2012 9:17 AM, CoolCold wrote:
>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>> and created mdadm's raid10 with two near copies, then put LVM on it.
>> Now I'm planning to create xfs filesystem, but a bit confused about
>> stripe width/stripe unit values.
>
> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
> use it.  It unnecessarily complicates your setup and can degrade
> performance.
There are several reasons for this: 1) I've made a decision to use LVM
for all "data" volumes (everything except /, /boot, /home, etc.); 2)
there will be a MySQL database which will need backups via snapshots;
3) I often have several (0-3) virtual environments (OpenVZ based) which
live on ext3/ext4 (because extensive metadata updates on XFS make the
whole machine slow), and on a different LV because of this.
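
For example, the snapshot backup in 2) would be roughly like this
(names and sizes are invented):

    # quiesce or lock MySQL first, then:
    lvcreate -s -L 20G -n db-snap /dev/data/db
    mount -o ro,nouuid /dev/data/db-snap /mnt/db-snap  # nouuid for an XFS snapshot
    rsync -a /mnt/db-snap/ /backup/db/
    umount /mnt/db-snap
    lvremove -f /dev/data/db-snap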

>
>> As drives count is 7 and copies count is 2, so simple calculation
>> gives me datadrives count "3.5" which looks ugly. If I understand the
>> whole idea of sunit/swidth right, it should fill (or buffer) the full
>> stripe (sunit * data disks) and then do write, so optimization takes
>> place and all disks will work at once.
>
> Pretty close.  Stripe alignment is only applicable to allocation i.e new
> file creation, and log journal writes, but not file re-write nor read
> ops.  Note that stripe alignment will gain you nothing if your
> allocation workload doesn't match the stripe alignment.  For example
> writing a 32KB file every 20 seconds.  It'll take too long to fill the
> buffer before it's flushed and it's a tiny file, so you'll end up with
> many partial stripe width writes.
Okay, got it - I was thinking along similar lines.
>
>> My read load going be near random read ( sending pictures over http )
>> and looks like it doesn't matter how it will be set with sunit/swidth.
>
> ~13TB of "pictures" to serve eh?  Average JPG file size will be
> relatively small, correct?  Less than 1MB?  No, stripe alignment won't
> really help this workload at all, unless you upload a million files in
> one shot to populate the server.  In that case alignment will make the
> process complete more quickly.
Based on the current storage, estimates (df -h / df -i) show an
average file size of ~200KB. The inode count is near 15 million and it
will grow.
I've just thought that maybe I should change the chunk size to 256KB,
just to let one file be read from one disk; this may increase latency
but also increase throughput.
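
Something like this, I suppose (it would mean recreating the array;
the partitions are just the current ones):

    mdadm --create /dev/md3 --level=10 --layout=n2 --chunk=256 \
        --raid-devices=7 /dev/sd[a-g]5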

>
>>     root@datastor1:~# cat /proc/mdstat
>>     Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>     md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1] sda5[0]
>>           10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7] [UUUUUUU]
>>           [>....................]  resync =  0.8%
>> (81543680/10106943808) finish=886.0min speed=188570K/sec
>>           bitmap: 76/76 pages [304KB], 65536KB chunk
>
>> Almost default mkfs.xfs creating options produced:
>>
>>     root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
>>     meta-data=/dev/data/db       isize=256    agcount=32, agsize=16777216 blks
>>              =                       sectsz=512   attr=2, projid32bit=0
>>     data     =                       bsize=4096   blocks=536870912, imaxpct=5
>>              =                       sunit=16     swidth=112 blks
>>     naming   =version 2              bsize=4096   ascii-ci=0
>>     log      =internal log           bsize=4096   blocks=262144, version=2
>>              =                       sectsz=512   sunit=16 blks, lazy-count=1
>>     realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>>
>> As I can see, it is created 112/16 = 7 chunks swidth, which correlate
>> with my version b) , and I guess I will leave it this way.
>
> The default mkfs.xfs algorithms don't seem to play well with the
> mdraid10 near/far copy layouts.  The above configuration is doing a 7
> spindle stripe of 64KB, for a 448KB total stripe size.  This doesn't
> seem correct, as I don't believe a 7 drive RAID10 near is giving you 7
> spindles of stripe width.  I'm no expert on the near/far layouts, so I
> could be wrong here.  If a RAID0 stripe would yield a 7 spindle stripe
> width, I don't see how a RAID10/near would also be 7.  A straight RAID10
> with 8 drives would give a 4 spindle stripe width.

I drew a picture from my head in my original post; it was:

A1 A1 A2 A2 A3 A3 A4
A4 A5 A5 A6 A6 A7 A7
A8 A8 A9 A9 A10 A10 A11
A11 ...

Here A{X} is the chunk number on top of the 7 disks. As you can see, a
7-chunk write (A1 - A7) will fill two rows. That will take 2 disk head
movements to write a full stripe, though those moves may be very near
to each other. The real situation may differ of course, and I'm not
expert enough to make a bet either.

>
>> So, I'll be glad if anyone can review my thoughts and share yours.
>
> To provide you with any kind of concrete real world advice we need more
> details about your write workload/pattern.  In absence of that, and
> given what you've already stated, that the application is "sending
> pictures over http", then this seems to be a standard static web server
> workload.  In that case disk access, especially write throughput, is
> mostly irrelevant, as memory capacity becomes the performance limiting
> factor.  Given that you have 12GB of RAM for Apache/nginx/Lighty and
> buffer cache, how you setup the storage probably isn't going to make a
> big difference from a performance standpoint.
Yes, this is a standard static web server workload with nginx as the
frontend, with almost only reads.


>
> That said, for this web server workload, you'll be better off it you
> avoid any kind of striping altogether, especially if using XFS.  You'll
> be dealing with millions of small picture files I assume, in hundreds or
> thousands of directories?  In that case play to XFS' strengths.  Here's
> how you do it:
Hundreds of directories at least, yes.
After reading your ideas and refinements, I'm concluding that I need to
push the others [team members] harder to remove the MySQL instances
from the static-file-serving boxes entirely, to free RAM at least for
dcache entries.

About avoiding striping - later in the text.

>
> 1.  You chose mdraid10/near strictly because you have 7 disks and wanted
> to use them all.  You must eliminate that mindset.  Redo the array with
> 6 disks leaving the 7th as a spare (smart thing to do anyway).  What can
> you really to with 10.5TB that you can't with 9TB?
Hetzner's guys have been pretty fast at changing failed disks (one to
two days after the claim), so I may try without spares, I guess... I
just want to use more independent spindles here, but I'll think about
your suggestion one more time, thanks.

>
> 2.  Take your 6 disks and create 3 mdraid1 mirror pairs--don't use
> partitions as these are surely Advanced Format drives.  Now take those 3
> mdraid mirror devices and create a layered mdraid --linear array of the
> three.  The result will be a ~9TB mdraid device.
>
> 3.  Using a linear concat of 3 mirrors with XFS will yield some
> advantages over a striped array for this picture serving workload.
> Format the array with:
>
> /$ mkfs.xfs -d agcount=12 /dev/mdx
>
> That will give you 12 allocation groups of 750GB each, 4 AGs per
> effective spindle.  Using too many AGs will cause excessive head seeking
> under load, especially with a low disk count in the array.  The mkfs.xfs
> agcount default is 4 for this reason.  As a general rule you want a
> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
> agcount with fast drives (10k, 15k).
Good to know such details!

>
> Directories drive XFS parallelism, with each directory being created in
> a different AG, allowing XFS to write/read 12 files in parallel (far in
> excess of the IO capabilities of the 3 drives) without having to worry
> about stripe alignment.  Since your file layout will have many hundreds
> or thousands of directories and millions of files, you'll get maximum
> performance from this setup.

So, as I understand it, you are assuming that XFS's "internal
striping" via AGs will be better than MD/LVM striping here? I never
thought of XFS in this way, and it is an interesting point.

>
> As I said, if I understand your workload correctly, array/filesystem
> layout probably don't make much difference.  But if you're after
> something optimal and less complicated, for piece of mind, etc, this is
> a better solution than the 7 disk RAID10 near layout with XFS.
>
> Oh, and don't forget to mount the XFS filesystem with the inode64 option
> in any case, lest performance will be much less than optimal, and you
> may run out of directory inodes as the FS fills up.
Okay.

>
> Hope this information was helpful.
Yes, very helpful and refreshing, thanks for your comments!

P.S. As I've got a 2nd server with the same config, maybe I'll have
time to do fast & dirty tests of stripes vs. AGs.
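
Probably something quick along these lines, one reader per top-level
directory (paths invented), timed on both layouts:

    # drop caches between runs: echo 3 > /proc/sys/vm/drop_caches
    time (
        for d in /data/test/dir{0..11}; do
            ( find "$d" -type f -print0 | xargs -0 cat > /dev/null ) &
        done
        wait
    )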
>
> --
> Stan



-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-12 20:16   ` CoolCold
@ 2012-02-13  8:50     ` David Brown
  2012-02-13  9:46       ` CoolCold
  2012-02-13 13:46       ` Stan Hoeppner
  2012-02-13  8:54     ` David Brown
  2012-02-13 12:09     ` Stan Hoeppner
  2 siblings, 2 replies; 40+ messages in thread
From: David Brown @ 2012-02-13  8:50 UTC (permalink / raw)
  Cc: stan, Linux RAID


Comments at the bottom, as they are too mixed to put inline.

On 12/02/2012 21:16, CoolCold wrote:
> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!
>
> On Sat, Feb 11, 2012 at 8:05 AM, Stan Hoeppner<stan@hardwarefreak.com>  wrote:
>> On 2/10/2012 9:17 AM, CoolCold wrote:
>>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>>> and created mdadm's raid10 with two near copies, then put LVM on it.
>>> Now I'm planning to create xfs filesystem, but a bit confused about
>>> stripe width/stripe unit values.
>>
>> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
>> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
>> use it.  It unnecessarily complicates your setup and can degrade
>> performance.
> There are several reasons for this - 1) I've made decision to use LMV
> for all "data" volumes (those are except /, /boot, /home , etc)  2)
> there will be mysql database which will need backups with snapshots 3)
> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
> are living on ext3/ext4 (because of extensive metadata updates on xfs
> makes it the whole machine slow) filesystem and different LV because
> of this.
>
>>
>>> As drives count is 7 and copies count is 2, so simple calculation
>>> gives me datadrives count "3.5" which looks ugly. If I understand the
>>> whole idea of sunit/swidth right, it should fill (or buffer) the full
>>> stripe (sunit * data disks) and then do write, so optimization takes
>>> place and all disks will work at once.
>>
>> Pretty close.  Stripe alignment is only applicable to allocation i.e new
>> file creation, and log journal writes, but not file re-write nor read
>> ops.  Note that stripe alignment will gain you nothing if your
>> allocation workload doesn't match the stripe alignment.  For example
>> writing a 32KB file every 20 seconds.  It'll take too long to fill the
>> buffer before it's flushed and it's a tiny file, so you'll end up with
>> many partial stripe width writes.
> Okay, got it - I've thinked in similar way.
>>
>>> My read load going be near random read ( sending pictures over http )
>>> and looks like it doesn't matter how it will be set with sunit/swidth.
>>
>> ~13TB of "pictures" to serve eh?  Average JPG file size will be
>> relatively small, correct?  Less than 1MB?  No, stripe alignment won't
>> really help this workload at all, unless you upload a million files in
>> one shot to populate the server.  In that case alignment will make the
>> process complete more quickly.
> Basing on current storage, estimations show (df -h / df -i ) average
> file size is ~200kb . Inodes count is near 15 millions and it will be
> more.
> I've just thought that may be I should change chunk size to 256kb,
> just to let one file be read from one disk, this may increase latency
> and increase throughput too.
>
>>
>>>      root@datastor1:~# cat /proc/mdstat
>>>      Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>>      md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1] sda5[0]
>>>            10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7] [UUUUUUU]
>>>            [>....................]  resync =  0.8%
>>> (81543680/10106943808) finish=886.0min speed=188570K/sec
>>>            bitmap: 76/76 pages [304KB], 65536KB chunk
>>
>>> Almost default mkfs.xfs creating options produced:
>>>
>>>      root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
>>>      meta-data=/dev/data/db       isize=256    agcount=32, agsize=16777216 blks
>>>               =                       sectsz=512   attr=2, projid32bit=0
>>>      data     =                       bsize=4096   blocks=536870912, imaxpct=5
>>>               =                       sunit=16     swidth=112 blks
>>>      naming   =version 2              bsize=4096   ascii-ci=0
>>>      log      =internal log           bsize=4096   blocks=262144, version=2
>>>               =                       sectsz=512   sunit=16 blks, lazy-count=1
>>>      realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>
>>>
>>> As I can see, it is created 112/16 = 7 chunks swidth, which correlate
>>> with my version b) , and I guess I will leave it this way.
>>
>> The default mkfs.xfs algorithms don't seem to play well with the
>> mdraid10 near/far copy layouts.  The above configuration is doing a 7
>> spindle stripe of 64KB, for a 448KB total stripe size.  This doesn't
>> seem correct, as I don't believe a 7 drive RAID10 near is giving you 7
>> spindles of stripe width.  I'm no expert on the near/far layouts, so I
>> could be wrong here.  If a RAID0 stripe would yield a 7 spindle stripe
>> width, I don't see how a RAID10/near would also be 7.  A straight RAID10
>> with 8 drives would give a 4 spindle stripe width.
>
> I've drawn nice picture from my head in my original post, it was:
>
> A1 A1 A2 A2 A3 A3 A4
> A4 A5 A5 A6 A6 A7 A7
> A8 A8 A9 A9 A10 A10 A11
> A11 ...
>
> So here is A{X} is chunk number on top of 7 disks. As you can see, 7
> chunks write (A1 - A7) will fill two rows. And this will made 2 disk
> head movements to write full stripe, though that moves may be very
> near to each other. Real situation may differ of course, and I'm not
> expert to make a bet too.
>
>>
>>> So, I'll be glad if anyone can review my thoughts and share yours.
>>
>> To provide you with any kind of concrete real world advice we need more
>> details about your write workload/pattern.  In absence of that, and
>> given what you've already stated, that the application is "sending
>> pictures over http", then this seems to be a standard static web server
>> workload.  In that case disk access, especially write throughput, is
>> mostly irrelevant, as memory capacity becomes the performance limiting
>> factor.  Given that you have 12GB of RAM for Apache/nginx/Lighty and
>> buffer cache, how you setup the storage probably isn't going to make a
>> big difference from a performance standpoint.
> Yes, this is standard static webserver workload with nginx as frontend
> with almost only reads.
>
>
>>
>> That said, for this web server workload, you'll be better off it you
>> avoid any kind of striping altogether, especially if using XFS.  You'll
>> be dealing with millions of small picture files I assume, in hundreds or
>> thousands of directories?  In that case play to XFS' strengths.  Here's
>> how you do it:
> Hundreds directories at least, yes.
> After reading you ideas and refinements, I'm making conclusion that I
> need to push others [team members] harder to remove mysql instances
> from the static files serving boxes at all, to free RAM for least
> dcache entries.
>
> About avoiding striping - later in the text.
>
>>
>> 1.  You chose mdraid10/near strictly because you have 7 disks and wanted
>> to use them all.  You must eliminate that mindset.  Redo the array with
>> 6 disks leaving the 7th as a spare (smart thing to do anyway).  What can
>> you really to with 10.5TB that you can't with 9TB?
> Hetzner's guys were pretty fast on chaning failed disks (one - two
> days after claim) so I may try without spares I guess... I just wanna
> use more independent spindles here, but I'll think about your
> suggestion one more time, thanks.
>
>>
>> 2.  Take your 6 disks and create 3 mdraid1 mirror pairs--don't use
>> partitions as these are surely Advanced Format drives.  Now take those 3
>> mdraid mirror devices and create a layered mdraid --linear array of the
>> three.  The result will be a ~9TB mdraid device.
>>
>> 3.  Using a linear concat of 3 mirrors with XFS will yield some
>> advantages over a striped array for this picture serving workload.
>> Format the array with:
>>
>> /$ mkfs.xfs -d agcount=12 /dev/mdx
>>
>> That will give you 12 allocation groups of 750GB each, 4 AGs per
>> effective spindle.  Using too many AGs will cause excessive head seeking
>> under load, especially with a low disk count in the array.  The mkfs.xfs
>> agcount default is 4 for this reason.  As a general rule you want a
>> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
>> agcount with fast drives (10k, 15k).
> Good to know such details!
>
>>
>> Directories drive XFS parallelism, with each directory being created in
>> a different AG, allowing XFS to write/read 12 files in parallel (far in
>> excess of the IO capabilities of the 3 drives) without having to worry
>> about stripe alignment.  Since your file layout will have many hundreds
>> or thousands of directories and millions of files, you'll get maximum
>> performance from this setup.
>
> So, as I could understand, you are assuming that "internal striping"
> by using AGs of XFS will be better than MD/LVM striping here? Never
> thought of XFS in this way and it is interesting point.
>
>>
>> As I said, if I understand your workload correctly, array/filesystem
>> layout probably don't make much difference.  But if you're after
>> something optimal and less complicated, for piece of mind, etc, this is
>> a better solution than the 7 disk RAID10 near layout with XFS.
>>
>> Oh, and don't forget to mount the XFS filesystem with the inode64 option
>> in any case, lest performance will be much less than optimal, and you
>> may run out of directory inodes as the FS fills up.
> Okay.
>
>>
>> Hope this information was helpful.
> Yes, very helpful and refreshing, thanks for you comments!
>
> P.S. As I've got 2nd server of the same config, may be i'll have time
> and do fast&  dirty tests of stripes vs AGs.
>>
>> --
>> Stan
>

Here are a few general points:

XFS has a unique (AFAIK) feature of spreading allocation groups across 
the (logical) disk, and letting these AGs work almost independently. 
So if you have multiple disks (or raid arrays, such as raid1/raid10 
pairs), and the number of AGs is divisible by the number of disks, then 
a linear concatenation of the disks will work well with XFS.  Each 
access to a file will be handled within one AG, and therefore within 
one disk (or pair).  This means you don't get striping or other 
multiple-spindle benefits for that access - but it also means the 
access is almost entirely independent of other accesses to AGs on other 
disks.  In comparison, if you had a RAID6 setup, a single write would 
use /all/ the disks and mean that every other access is blocked for a 
bit.

But there are caveats.

Top level directories are spread out among the AGs, so it only works 
well if you have balanced access across a range of directories, such 
as a /home with a subdirectory per user, or a /var/mail with a 
subdirectory per email account.  If you have a /var/www with two 
subdirectories "main" and "testsite", it will be terrible.  And you 
must also remember that you don't get multi-spindle benefits for large 
streamed reads and writes - you need multiple concurrent accesses to 
see any benefits.
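
As a concrete illustration, a picture store would want lots of
balanced top-level directories, e.g. hash buckets (paths invented):

    mkdir -p /data/pics/{00..99}
    # then store each file under the bucket derived from its name, e.g.
    #   /data/pics/3f/3fa81c09.jpg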

If you have several filesystems on the same array (via LVM or other 
partitioning), you will lose most of the elegance and benefits of this 
type of XFS arrangement.  You really want to use it on a dedicated array.

It is also far from clear whether a linear concat XFS is better than a 
normal XFS on a raid0 of the same drives (or raid1 pairs).  I think it 
will have lower average latencies on small accesses if you also have big 
reads/writes mixed in, but you will also have lower throughput for 
larger accesses.  For some uses, this sort of XFS arrangement is ideal - 
a particular favourite is for mail servers.  But I suspect in many other 
cases you will stray enough from the ideal access patterns to lose any 
benefits it might have.

Stan is the expert on this, and can give advice on getting the best out 
of XFS.  But personally I don't think a linear concat there is the best 
way to go - especially when you want LVM and multiple filesystems on the 
array.


As another point, since you have mostly read accesses, you should 
probably use raid10,f2 far layout rather than near layout.  It's a bit 
slower for writes, but can be much faster for reads.

mvh.,

David







^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-12 20:16   ` CoolCold
  2012-02-13  8:50     ` David Brown
@ 2012-02-13  8:54     ` David Brown
  2012-02-13  9:49       ` CoolCold
  2012-02-13 12:09     ` Stan Hoeppner
  2 siblings, 1 reply; 40+ messages in thread
From: David Brown @ 2012-02-13  8:54 UTC (permalink / raw)
  To: CoolCold; +Cc: stan, Linux RAID

On 12/02/2012 21:16, CoolCold wrote:
> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!
>
> On Sat, Feb 11, 2012 at 8:05 AM, Stan Hoeppner<stan@hardwarefreak.com>  wrote:
>> On 2/10/2012 9:17 AM, CoolCold wrote:
>>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>>> and created mdadm's raid10 with two near copies, then put LVM on it.
>>> Now I'm planning to create xfs filesystem, but a bit confused about
>>> stripe width/stripe unit values.
>>
>> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
>> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
>> use it.  It unnecessarily complicates your setup and can degrade
>> performance.
> There are several reasons for this - 1) I've made decision to use LMV
> for all "data" volumes (those are except /, /boot, /home , etc)  2)
> there will be mysql database which will need backups with snapshots 3)
> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
> are living on ext3/ext4 (because of extensive metadata updates on xfs
> makes it the whole machine slow) filesystem and different LV because
> of this.
>

This is a bit off-topic, but do you know of any way to get OpenVZ 
running on a kernel newer than 2.6.32?  One important feature in 2.6.33 
is mergeable LVM snapshots, which would be particularly useful for 
OpenVZ, such as when updating, upgrading or otherwise changing a virtual 
machine.  With mergeable snapshots you could take a snapshot, apply the 
changes to the snapshot, and if it works you merge them back into the 
main logical partition.
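
The workflow I have in mind is roughly this (names invented; it needs
kernel >= 2.6.33 plus a reasonably recent lvm2):

    lvcreate -s -L 10G -n vz101-work /dev/vg0/vz101
    # run the container from the snapshot and apply the upgrade there;
    # if it works, fold the snapshot back into the origin:
    lvconvert --merge /dev/vg0/vz101-work
    # if it doesn't, simply discard it:
    #   lvremove /dev/vg0/vz101-work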

mvh.,

David


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13  8:50     ` David Brown
@ 2012-02-13  9:46       ` CoolCold
  2012-02-13 11:19         ` David Brown
  2012-02-13 13:46       ` Stan Hoeppner
  1 sibling, 1 reply; 40+ messages in thread
From: CoolCold @ 2012-02-13  9:46 UTC (permalink / raw)
  To: David Brown; +Cc: stan, Linux RAID

On Mon, Feb 13, 2012 at 12:50 PM, David Brown <david@westcontrol.com> wrote:
>
> Comments at the bottom, as they are too mixed to put inline.
>
>
> On 12/02/2012 21:16, CoolCold wrote:
>>
>> First of all, Stan, thanks for such detailed answer, I greatly appreciate
>> this!
>>
>> On Sat, Feb 11, 2012 at 8:05 AM, Stan Hoeppner<stan@hardwarefreak.com>
>>  wrote:
>>>
>>> On 2/10/2012 9:17 AM, CoolCold wrote:
>>>>
>>>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>>>> and created mdadm's raid10 with two near copies, then put LVM on it.
>>>> Now I'm planning to create xfs filesystem, but a bit confused about
>>>> stripe width/stripe unit values.
>>>
>>>
>>> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
>>> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
>>> use it.  It unnecessarily complicates your setup and can degrade
>>> performance.
>>
>> There are several reasons for this - 1) I've made decision to use LMV
>> for all "data" volumes (those are except /, /boot, /home , etc)  2)
>> there will be mysql database which will need backups with snapshots 3)
>> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
>> are living on ext3/ext4 (because of extensive metadata updates on xfs
>> makes it the whole machine slow) filesystem and different LV because
>> of this.
>>
>>>
>>>> As drives count is 7 and copies count is 2, so simple calculation
>>>> gives me datadrives count "3.5" which looks ugly. If I understand the
>>>> whole idea of sunit/swidth right, it should fill (or buffer) the full
>>>> stripe (sunit * data disks) and then do write, so optimization takes
>>>> place and all disks will work at once.
>>>
>>>
>>> Pretty close.  Stripe alignment is only applicable to allocation i.e new
>>> file creation, and log journal writes, but not file re-write nor read
>>> ops.  Note that stripe alignment will gain you nothing if your
>>> allocation workload doesn't match the stripe alignment.  For example
>>> writing a 32KB file every 20 seconds.  It'll take too long to fill the
>>> buffer before it's flushed and it's a tiny file, so you'll end up with
>>> many partial stripe width writes.
>>
>> Okay, got it - I've thinked in similar way.
>>>
>>>
>>>> My read load going be near random read ( sending pictures over http )
>>>> and looks like it doesn't matter how it will be set with sunit/swidth.
>>>
>>>
>>> ~13TB of "pictures" to serve eh?  Average JPG file size will be
>>> relatively small, correct?  Less than 1MB?  No, stripe alignment won't
>>> really help this workload at all, unless you upload a million files in
>>> one shot to populate the server.  In that case alignment will make the
>>> process complete more quickly.
>>
>> Basing on current storage, estimations show (df -h / df -i ) average
>> file size is ~200kb . Inodes count is near 15 millions and it will be
>> more.
>> I've just thought that may be I should change chunk size to 256kb,
>> just to let one file be read from one disk, this may increase latency
>> and increase throughput too.
>>
>>>
>>>>     root@datastor1:~# cat /proc/mdstat
>>>>     Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>>>     md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1]
>>>> sda5[0]
>>>>           10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7]
>>>> [UUUUUUU]
>>>>           [>....................]  resync =  0.8%
>>>> (81543680/10106943808) finish=886.0min speed=188570K/sec
>>>>           bitmap: 76/76 pages [304KB], 65536KB chunk
>>>
>>>
>>>> Almost default mkfs.xfs creating options produced:
>>>>
>>>>     root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
>>>>     meta-data=/dev/data/db       isize=256    agcount=32,
>>>> agsize=16777216 blks
>>>>              =                       sectsz=512   attr=2, projid32bit=0
>>>>     data     =                       bsize=4096   blocks=536870912,
>>>> imaxpct=5
>>>>              =                       sunit=16     swidth=112 blks
>>>>     naming   =version 2              bsize=4096   ascii-ci=0
>>>>     log      =internal log           bsize=4096   blocks=262144,
>>>> version=2
>>>>              =                       sectsz=512   sunit=16 blks,
>>>> lazy-count=1
>>>>     realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>>
>>>>
>>>> As I can see, it is created 112/16 = 7 chunks swidth, which correlate
>>>> with my version b) , and I guess I will leave it this way.
>>>
>>>
>>> The default mkfs.xfs algorithms don't seem to play well with the
>>> mdraid10 near/far copy layouts.  The above configuration is doing a 7
>>> spindle stripe of 64KB, for a 448KB total stripe size.  This doesn't
>>> seem correct, as I don't believe a 7 drive RAID10 near is giving you 7
>>> spindles of stripe width.  I'm no expert on the near/far layouts, so I
>>> could be wrong here.  If a RAID0 stripe would yield a 7 spindle stripe
>>> width, I don't see how a RAID10/near would also be 7.  A straight RAID10
>>> with 8 drives would give a 4 spindle stripe width.
>>
>>
>> I've drawn nice picture from my head in my original post, it was:
>>
>> A1 A1 A2 A2 A3 A3 A4
>> A4 A5 A5 A6 A6 A7 A7
>> A8 A8 A9 A9 A10 A10 A11
>> A11 ...
>>
>> So here is A{X} is chunk number on top of 7 disks. As you can see, 7
>> chunks write (A1 - A7) will fill two rows. And this will made 2 disk
>> head movements to write full stripe, though that moves may be very
>> near to each other. Real situation may differ of course, and I'm not
>> expert to make a bet too.
>>
>>>
>>>> So, I'll be glad if anyone can review my thoughts and share yours.
>>>
>>>
>>> To provide you with any kind of concrete real world advice we need more
>>> details about your write workload/pattern.  In absence of that, and
>>> given what you've already stated, that the application is "sending
>>> pictures over http", then this seems to be a standard static web server
>>> workload.  In that case disk access, especially write throughput, is
>>> mostly irrelevant, as memory capacity becomes the performance limiting
>>> factor.  Given that you have 12GB of RAM for Apache/nginx/Lighty and
>>> buffer cache, how you setup the storage probably isn't going to make a
>>> big difference from a performance standpoint.
>>
>> Yes, this is standard static webserver workload with nginx as frontend
>> with almost only reads.
>>
>>
>>>
>>> That said, for this web server workload, you'll be better off it you
>>> avoid any kind of striping altogether, especially if using XFS.  You'll
>>> be dealing with millions of small picture files I assume, in hundreds or
>>> thousands of directories?  In that case play to XFS' strengths.  Here's
>>> how you do it:
>>
>> Hundreds directories at least, yes.
>> After reading you ideas and refinements, I'm making conclusion that I
>> need to push others [team members] harder to remove mysql instances
>> from the static files serving boxes at all, to free RAM for least
>> dcache entries.
>>
>> About avoiding striping - later in the text.
>>
>>>
>>> 1.  You chose mdraid10/near strictly because you have 7 disks and wanted
>>> to use them all.  You must eliminate that mindset.  Redo the array with
>>> 6 disks leaving the 7th as a spare (smart thing to do anyway).  What can
>>> you really to with 10.5TB that you can't with 9TB?
>>
>> Hetzner's guys were pretty fast on chaning failed disks (one - two
>> days after claim) so I may try without spares I guess... I just wanna
>> use more independent spindles here, but I'll think about your
>> suggestion one more time, thanks.
>>
>>>
>>> 2.  Take your 6 disks and create 3 mdraid1 mirror pairs--don't use
>>> partitions as these are surely Advanced Format drives.  Now take those 3
>>> mdraid mirror devices and create a layered mdraid --linear array of the
>>> three.  The result will be a ~9TB mdraid device.
>>>
>>> 3.  Using a linear concat of 3 mirrors with XFS will yield some
>>> advantages over a striped array for this picture serving workload.
>>> Format the array with:
>>>
>>> /$ mkfs.xfs -d agcount=12 /dev/mdx
>>>
>>> That will give you 12 allocation groups of 750GB each, 4 AGs per
>>> effective spindle.  Using too many AGs will cause excessive head seeking
>>> under load, especially with a low disk count in the array.  The mkfs.xfs
>>> agcount default is 4 for this reason.  As a general rule you want a
>>> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
>>> agcount with fast drives (10k, 15k).
>>
>> Good to know such details!
>>
>>>
>>> Directories drive XFS parallelism, with each directory being created in
>>> a different AG, allowing XFS to write/read 12 files in parallel (far in
>>> excess of the IO capabilities of the 3 drives) without having to worry
>>> about stripe alignment.  Since your file layout will have many hundreds
>>> or thousands of directories and millions of files, you'll get maximum
>>> performance from this setup.
>>
>>
>> So, as I could understand, you are assuming that "internal striping"
>> by using AGs of XFS will be better than MD/LVM striping here? Never
>> thought of XFS in this way and it is interesting point.
>>
>>>
>>> As I said, if I understand your workload correctly, array/filesystem
>>> layout probably don't make much difference.  But if you're after
>>> something optimal and less complicated, for piece of mind, etc, this is
>>> a better solution than the 7 disk RAID10 near layout with XFS.
>>>
>>> Oh, and don't forget to mount the XFS filesystem with the inode64 option
>>> in any case, lest performance will be much less than optimal, and you
>>> may run out of directory inodes as the FS fills up.
>>
>> Okay.
>>
>>>
>>> Hope this information was helpful.
>>
>> Yes, very helpful and refreshing, thanks for you comments!
>>
>> P.S. As I've got 2nd server of the same config, may be i'll have time
>> and do fast&  dirty tests of stripes vs AGs.
>>>
>>>
>>> --
>>> Stan
>>
>>
>
> Here a few general points:
>
> XFS has a unique (AFAIK) feature of spreading allocation groups across the
> (logical) disk, and letting these AG's work almost independently. So if you
> have multiple disks (or raid arrays, such as raid1/raid10 pairs), and the
> number of AG's is divisible by the number of disks, then a linear
> concatenation of the disks will work well with XFS.  Each access to a file
> will be handled within one AG, and therefore within one disk (or pair).
>  This means you don't get striping or other multiple-spindle benefits for
> that access - but it also means the access is almost entirely independent of
> other accesses to AG's on other disks.  In comparison, if you had a RAID6
> setup, a single write would use /all/ the disks and mean that every other
> access is blocked for a bit.
>
> But there are caveats.
>
> Top level directories are spread out among the AG's, so it only works well
> if you have a balanced access through a range of directories, such asa /home
> with a subdirectory per user, or a /var/mail with a subdirectory per email
> account.  If you have a /var/www with two subdirectories "main" and
> "testsite", it will be terrible.  And you must also remember that you don't
> get multi-spindle benefits for large streamed reads and writes - you need
> multiple concurrent access to see any benefits.
>
> If you have several filesystems on the same array (via LVM or other
> partitioning), you will lose most of the elegance and benefits of this type
> of XFS arrangement.  You really want to use it on a dedicated array.
>
> It is also far from clear whether a linear concat XFS is better than a
> normal XFS on a raid0 of the same drives (or raid1 pairs).  I think it will
> have lower average latencies on small accesses if you also have big
> reads/writes mixed in, but you will also have lower throughput for larger
> accesses.  For some uses, this sort of XFS arrangement is ideal - a
> particular favourite is for mail servers.  But I suspect in many other cases
> you will stray enough from the ideal access patterns to lose any benefits it
> might have.
>
> Stan is the expert on this, and can give advice on getting the best out of
> XFS.  But personally I don't think a linear concat there is the best way to
> go - especially when you want LVM and multiple filesystems on the array.
>
>
> As another point, since you have mostly read accesses, you should probably
> use raid10,f2 far layout rather than near layout.  It's a bit slower for
> writes, but can be much faster for reads.
>
> mvh.,
>
> David
David, thank you too - you have formalized and written down what I had
only jumbled together in my head. Though I'm not going to have large
sequential writes/reads, the info about "far" layouts is useful and I
may use it later as a reference.

>
>
>
>
>



-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13  8:54     ` David Brown
@ 2012-02-13  9:49       ` CoolCold
  0 siblings, 0 replies; 40+ messages in thread
From: CoolCold @ 2012-02-13  9:49 UTC (permalink / raw)
  To: David Brown; +Cc: stan, Linux RAID

On Mon, Feb 13, 2012 at 12:54 PM, David Brown <david@westcontrol.com> wrote:
> On 12/02/2012 21:16, CoolCold wrote:
>>
>> First of all, Stan, thanks for such detailed answer, I greatly appreciate
>> this!
>>
>> On Sat, Feb 11, 2012 at 8:05 AM, Stan Hoeppner<stan@hardwarefreak.com>
>>  wrote:
>>>
>>> On 2/10/2012 9:17 AM, CoolCold wrote:
>>>>
>>>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>>>> and created mdadm's raid10 with two near copies, then put LVM on it.
>>>> Now I'm planning to create xfs filesystem, but a bit confused about
>>>> stripe width/stripe unit values.
>>>
>>>
>>> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
>>> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
>>> use it.  It unnecessarily complicates your setup and can degrade
>>> performance.
>>
>> There are several reasons for this - 1) I've made a decision to use LVM
>> for all "data" volumes (those are except /, /boot, /home , etc)  2)
>> there will be mysql database which will need backups with snapshots 3)
>> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
>> are living on ext3/ext4 (because of extensive metadata updates on xfs
>> makes it the whole machine slow) filesystem and different LV because
>> of this.
>>
>
> This is a bit off-topic, but do you know of any way to get OpenVZ running on
> a kernel newer than 2.6.32?  One important feature in 2.6.33 is mergeable
> LVM snapshots, which would be particularly useful for OpenVZ, such as when
> updating, upgrading or otherwise changing a virtual machine.  With mergeable
> snapshots you could take a snapshot, apply the changes to the snapshot, and
> if it works you merge them back into the main logical partition.
No, I do not know of such solutions, except _maybe_ using RHEL
(CentOS/SL) kernels, since the OpenVZ team targets RHEL kernels as its
primary base and RH may backport some features into their .32 kernel.
So if RHEL does (or will) have mergeable LVM snapshots, then you only
need to wait some time and the OpenVZ kernels will have this too.

>
> mvh.,
>
> David
>



-- 
Best regards,
[COOLCOLD-RIPN]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13  9:46       ` CoolCold
@ 2012-02-13 11:19         ` David Brown
  0 siblings, 0 replies; 40+ messages in thread
From: David Brown @ 2012-02-13 11:19 UTC (permalink / raw)
  To: CoolCold; +Cc: stan, Linux RAID

On 13/02/2012 10:46, CoolCold wrote:
> On Mon, Feb 13, 2012 at 12:50 PM, David Brown<david@westcontrol.com>  wrote:

>>
>> As another point, since you have mostly read accesses, you should probably
>> use raid10,f2 far layout rather than near layout.  It's a bit slower for
>> writes, but can be much faster for reads.
>>
>> mvh.,
>>
>> David
> David, thank you too - you have formalized and written down what I had
> jumbled in my head. Though I'm not going to have large sequential
> writes/reads, the info about "far" layouts is useful and I may use it
> later as a reference.
>

Far layout of raid10 is also faster than near for small reads, though 
the difference is less dramatic.

The layout you drew for raid10,n2 is:

A1 A1 A2 A2 A3 A3 A4
A4 A5 A5 A6 A6 A7 A7
A8 A8 A9 A9 A10 A10 A11
A11 ...


For raid10,f2 it is:

A1 A2 A3 A4 A5 A6 A7
A8 A9 A10 A11 A12 A13 A14
....

A7 A1 A2 A3 A4 A5 A6
A14 A8 A9 A10 A11 A12 A13
....

(I'm not too sure of the details of which drives the second copies go on 
when there are more than 2 drives, but the main point is that each copy is 
always on a different drive from the original.)

This layout has two advantages.  First, if you /are/ doing a large read, 
you'll get full raid0 striped performance.  Secondly, since each block 
has two copies, one on the inner half of a disk, and one on the outer 
half, reads will normally be handled by only the outer halves of the 
disks.  This means half the average head movement, and faster throughput 
- the outer halves of disk drives are significantly faster than the 
inner halves.  But if the system is already reading from the outer half 
of a drive, it can still use the inner halves to access a copy of the 
data in parallel if that helps overall.

This combines to give raid10,f2 an average read performance that can be 
quite a lot higher than with a pure raid0 setup, and lower read latency 
than raid10,n2 even for small reads.

The disadvantage is greater head movement during writes, so writes have 
longer latency.  But overall it is almost certainly the best choice for 
your read-heavy usage.
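
If you do rebuild the array with the far layout, it is just a matter of 
the --layout flag at creation time - roughly something like this (device 
names and chunk size here are assumptions, not your exact setup):

mdadm --create /dev/md3 --level=10 --layout=f2 --chunk=64 \
      --raid-devices=7 /dev/sd[a-g]5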


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-12 20:16   ` CoolCold
  2012-02-13  8:50     ` David Brown
  2012-02-13  8:54     ` David Brown
@ 2012-02-13 12:09     ` Stan Hoeppner
  2012-02-13 12:42       ` David Brown
  2012-02-13 21:40       ` CoolCold
  2 siblings, 2 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-13 12:09 UTC (permalink / raw)
  To: CoolCold; +Cc: Linux RAID

On 2/12/2012 2:16 PM, CoolCold wrote:
> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!

You're welcome.  You may or may not appreciate this reply.  It got
really long.  I tried to better explain the XFS+md linear array setup.

> There are several reasons for this - 1) I've made a decision to use LVM
> for all "data" volumes (those are except /, /boot, /home , etc)  2)
> there will be mysql database which will need backups with snapshots 3)

So you need LVM for snaps, got it.

> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
> are living on ext3/ext4 (because of extensive metadata updates on xfs
> makes it the whole machine slow) filesystem and different LV because
> of this.

This is no longer the case as of kernel 2.6.35+ with Dave Chinner's
delayed logging patch.  It's enabled by default in 2.6.39+ and XFS now
has equal or superior metadata performance to all other Linux
filesystems.  This presentation is about an hour long, but it's super
interesting and very informative:
http://www.youtube.com/watch?v=FegjLbCnoBw

> Basing on current storage, estimations show (df -h / df -i ) average
> file size is ~200kb . Inodes count is near 15 millions and it will be
> more.

You definitely need the inode64 allocator with that many inodes.  You
need it anyway for performance.
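
For example (device and mount point here are assumptions - put it in
fstab so it is used from the first mount onward):

mount -o inode64 /dev/data/db /srv/pictures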

> I've just thought that may be I should change chunk size to 256kb,
> just to let one file be read from one disk, this may increase latency
> and increase throughput too.

Why would you do that instead of simply using XFS on a linear array?

> I've drawn nice picture from my head in my original post, it was:
> 
> A1 A1 A2 A2 A3 A3 A4
> A4 A5 A5 A6 A6 A7 A7
> A8 A8 A9 A9 A10 A10 A11
> A11 ...

> So here is A{X} is chunk number on top of 7 disks. As you can see, 7
> chunks write (A1 - A7) will fill two rows. And this will made 2 disk
> head movements to write full stripe, though that moves may be very
> near to each other. Real situation may differ of course, and I'm not
> expert to make a bet too.

xfs_info does show some wonky numbers for sunit/swidth in your example
output, but the overall number of write bytes is correct, at 448KB,
matching the array's apparent 7*64KB.  This is double your average file
size so you'll likely have many partial stripe writes.  And you won't
get any advantage from device read ahead.  You'll actually be wasting
buffer cache memory, since each disk will read an extra 128KB.  So if a
stripe is actually across 7 spindles, for a single 200KB file read, the
kernel will read an additional 7*128KB=896KB of data into the buffer
cache.  Given the RAID layout and file access pattern, these extra
cached sectors may not get used right away, simply wasting RAM.  To
alleviate this you'd need to decrease

/sys/block/sdX/queue/read_ahead_kb

accordingly, down to something like 32KB or less, to prevent wasting
RAM.  You may need to tweak other kernel block device queue parameters
as well.
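
Something along these lines would do it (drive letters are assumptions):

for d in sda sdb sdc sdd sde sdf sdg; do
    echo 32 > /sys/block/$d/queue/read_ahead_kb
done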

If you use a linear concat with XFS, you don't have to worry about any
of these issues because one file goes on one disk (spindle, mirror
pair).  Read ahead works as it should, no stripe alignment issues,
maximum performance for your small file workload.

> Hundreds directories at least, yes.

So as long as the most popular content is not put in a single directory
or two that reside on the same disk causing an IO hotspot, XFS+linear
will work very well for this workload.  The key is spreading the
frequently accessed files across all the allocation groups.  But, if the
popular content all gets cached in RAM, it doesn't matter.  Any other
content accesses will be random, so you're fine there.  Note that the
first 4 directories you create will be in the first 4 AGs on the first
disk, so don't concentrate all your frequently accessed stuff in the
first 4 dirs.  With the XFS+linear setup I described before, you end up
with an on disk filesystem layout like this:

         -------         -------         -------
        |  AG1  |       |  AG5  |       |  AG9  |
        |  AG2  |       |  AG6  |       |  AG10 |
        |  AG3  |       |  AG7  |       |  AG11 |
        |  AG4	|       |  AG8  |       |  AG12 |
         -------         -------         -------
         disk 1          disk 2          disk 3


This AG layout is a direct result of the linear array.  If this were a 3
spindle striped array, each AG would span all 3 disks horizontally, and
you'd have AGs 1-12 in a vertically column, one 3rd of each AG on each
disk.  If you're thinking ahead you may already see one of the
advantages of this setup WRT metadata performance.

Using the inode64 allocator, directory creation will occur in allocation
group order, putting the first 4 directories you create in the first
four respective AGs on disk 1.  Directory 13 will be created in AG1, as
will dir25 and dir37, and so on.  Each file created in a directory will
reside within the AG where its parent dir resides.

This is primarily what allows XFS+linear to have fantastic parallel
small file random access performance.
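
For reference, a rough sketch of that kind of setup - the partition names
here are assumptions, and you would size agcount to your own drive count:

mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdc5 /dev/sdd5
mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sde5 /dev/sdf5
mdadm --create /dev/md6 --level=linear --raid-devices=3 \
      /dev/md3 /dev/md4 /dev/md5
mkfs.xfs -d agcount=12 /dev/md6
mount -o inode64 /dev/md6 /mountpt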

> After reading you ideas and refinements, I'm making conclusion that I
> need to push others [team members] harder to remove mysql instances
> from the static files serving boxes at all, to free RAM for least
> dcache entries.

That or simply limit the amount of memory mysql is allowed to allocate.
 If you're serving mostly static content, what's the database for?  User
accounts and login processing?  Interactive forum like phpBB?

> About avoiding striping - later in the text.

> Hetzner's guys were pretty fast on chaning failed disks (one - two
> days after claim) so I may try without spares I guess... I just wanna
> use more independent spindles here, 

If the box came with only 6 data drives would you be asking them to add
a seventh?  I believe you have a personality type that makes that 7th
odd drive an itch you must simply scratch. ;)  "It's there so I MUST use
it!"  Make it your snap target then.  It'll keep tape/etc IO off the
array when you backup the snaps.  There, itch scratched. ;)

>> That will give you 12 allocation groups of 750GB each, 4 AGs per
>> effective spindle.  Using too many AGs will cause excessive head seeking
>> under load, especially with a low disk count in the array.  The mkfs.xfs
>> agcount default is 4 for this reason.  As a general rule you want a
>> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
>> agcount with fast drives (10k, 15k).
> Good to know such details!

There's a little black magic to manual AG creation, but that's the
basics.  But it depends quite a bit on the workload.

> So, as I could understand, you are assuming that "internal striping"
> by using AGs of XFS will be better than MD/LVM striping here? Never
> thought of XFS in this way and it is interesting point.

Most people know very very little about XFS, which is ironic given its
capabilities dwarf those of EXT, Reiser, JFS, etc.  That will start to
change as Red Hat and other distros make it the default filesystem.

There is no striping involved as noted in my diagram and explanation
above.  This is an md _linear_ array.  You've probably never read of the
md --linear option.  Nobody (few) uses it because they've simply had
"striping, striping, striping" drilled into their skulls, and they use
EXT filesystems, which absolutely REQUIRE striping to get decent
performance.  XFS has superior technology, has since the 90s, and does
not necessarily require striping to get decent performance.  As always,
it depends on the workload and access pattern.

http://linux.die.net/man/4/md

And I'm sure you've never read about XFS internal structure:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html

XFS + md linear array-- Let me repeat this so there is no
misunderstanding, that we're talking about one of many possible XFS
configurations:   *XFS + md linear array*  is extremely fast for the
highly parallel small file random access workload because it:

1.  Eliminates the complexities and buffering delays of data alignment
to an md striped array.  While fast and trivial, these operations add
more and more overhead as the workload increases.  At high IOPS they are
no longer trivial.  Here, XFS instead simply sends the data directly to md,
which calculates the sector offset in the linear array and writes the
blocks to disk.

2.  mdraid doesn't have to perform any striping offset calculations.
Again, while trivial, these calculations add overhead as workload
increases.  And unlike the linear and RAID0 drivers, the mdraid10 driver
has a single master thread, meaning absolute IO performance can be
limited by a single CPU if there are enough fast disks in the RAID10
array and the CPUs in the system aren't fast enough to keep up.  Search
the list archives for instances of this issue.

3.  I already mentioned the disk read ahead advantage vs mdraid10.  It
can be significant, in terms of file access latency, and memory
consumption due to wasted buffer cache space.  If one is using a
hardware RAID solution this advantage disappears, as the read ahead hits
the RAID cache once per request.  It's no longer per drive as with mdraid
because Linux treats the hardware RAID as a single block device.  It
can't see the individual drives behind the controller, in this regard
anyway, thus doesn't perform read ahead on each drive.

4.  Fewer disk seeks for metadata reads are required.  With a striped
array + XFS a single metadata lookup can potentially cause a seek on
every spindle in the array because each AG and its metadata span all
spindles in the stripe.  With XFS + linear a given metadata lookup for
a file generates one seek in only one AG on one spindle.

There are other advantages but I'm getting tired of typing. ;)  If
you're truly curious and wish to learn there is valuable information in
the mdraid kernel documentation, as well as at xfs.org.  You probably
won't find much on this specific combination, but you can learn enough
about each mdraid and xfs to better understand why this combo works.
This stuff isn't beginner level reading, mind you.  You need a pretty
deep technical background in Linux and storage technology.  Which is
maybe why I've done such a poor job explaining this. ;)

> P.S. As I've got 2nd server of the same config, may be i'll have time
> and do fast & dirty tests of stripes vs AGs.

Fast and dirty tests will not be sufficient to know how either will
perform with your actual workload.  And if by fast & dirty you mean
something like

$ dd if=/dev/md2 of=/dev/null bs=8192 count=800000

then you will be superbly disappointed.  What makes the XFS linear array
very fast with huge amounts of random small file IO makes it very slow
with large single file reads/writes, because each file resides on a
single spindle, limiting you to ~120MB/s.  Again, this is not striped
RAID.  This linear array setup is designed for maximum parallel small
file throughput.  So if you want to see those "big dd" numbers that make
folks salivate, you'd need something like

dd if=/mountpt/directory1/bigfile.test of=/dev/null &
dd if=/mountpt/directory5/bigfile.test of=/dev/null &
dd if=/mountpt/directory9/bigfile.test of=/dev/null &

and then sum the 3 results.  Again, XFS speed atop the linear array
comes from concurrent file access, which is exactly what your stated
workload is, and thus why I recommended this setup.  To properly test
this synthetically may likely require something other than vanilla
benchies such as bonnie or iozone.

I would recommend copying 48k of those actual picture files evenly
across 12 directories, for 4K files per dir.  Then use something like
curl-loader with a whole lot of simulated clients to hammer on the
files.  This allows you to test web server performance and IO
performance simultaneously.
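
A crude way to spread the sample files across the 12 directories (the
paths here are made up):

i=0
find /source/pictures -type f | head -n 48000 | while read -r f; do
    i=$(( i % 12 + 1 ))
    cp "$f" "/mountpt/directory$i/"
done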

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 12:09     ` Stan Hoeppner
@ 2012-02-13 12:42       ` David Brown
  2012-02-13 14:46         ` Stan Hoeppner
  2012-02-13 21:40       ` CoolCold
  1 sibling, 1 reply; 40+ messages in thread
From: David Brown @ 2012-02-13 12:42 UTC (permalink / raw)
  To: stan; +Cc: CoolCold, Linux RAID

On 13/02/2012 13:09, Stan Hoeppner wrote:
> On 2/12/2012 2:16 PM, CoolCold wrote:
>> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!
>
> You're welcome.  You may or may not appreciate this reply.  It got
> really long.  I tried to better explain the XFS+md linear array setup.
>
>> There are several reasons for this - 1) I've made a decision to use LVM
>> for all "data" volumes (those are except /, /boot, /home , etc)  2)
>> there will be mysql database which will need backups with snapshots 3)
>
> So you need LVM for snaps, got it.
>
>> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
>> are living on ext3/ext4 (because of extensive metadata updates on xfs
>> makes it the whole machine slow) filesystem and different LV because
>> of this.
>
> This is no longer the case as of kernel 2.6.35+ with Dave Chinner's
> delayed logging patch.  It's enabled by default in 2.6.39+ and XFS now
> has equal or superior metadata performance to all other Linux
> filesystems.  This presentation is about an hour long, but it's super
> interesting and very informative:
> http://www.youtube.com/watch?v=FegjLbCnoBw
>

OpenVZ is great for many purposes, but one unfortunate point is that 
because it is based on patches to a number of key parts of the kernel, 
it is only rarely re-synced to new kernels.  It is currently stuck on 
2.6.32, which means he can't use this feature (and nor can I - I also 
use OpenVZ and sometimes XFS, though I'm not too bothered about 
squeezing the last drops of performance out of the system).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13  8:50     ` David Brown
  2012-02-13  9:46       ` CoolCold
@ 2012-02-13 13:46       ` Stan Hoeppner
  1 sibling, 0 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-13 13:46 UTC (permalink / raw)
  To: David Brown; +Cc: CoolCold, Linux RAID

On 2/13/2012 2:50 AM, David Brown wrote:

> It is also far from clear whether a linear concat XFS is better than a
> normal XFS on a raid0 of the same drives (or raid1 pairs).  I think it

As always the answer depends on the workload.  As you correctly stated
above (I snipped it) you'll end up with less head seeks with the linear
array than with the RAID0.  How many less depends on the workload,
again, as always.

I need to correct something I stated in my previous post that's relevant
here.  I forgot that the per drive read_ahead_kb value is ignored when a
filesystem resides on an md device.  Read ahead works at the file
descriptor level, not at the block device level.  So when using mdraid
the read_ahead_kb value of the md device is used and the per drive
settings are ignored.  Thus kernel read ahead efficiency doesn't suffer
on striped mdraid as I previously stated.  Apologies for the error.
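
So the value to check and tune is the md device's own, e.g. (device name
is an assumption):

blockdev --getra /dev/md6                      # readahead in 512-byte sectors
cat /sys/block/md6/queue/read_ahead_kb         # the same setting in KB
echo 256 > /sys/block/md6/queue/read_ahead_kb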

> will have lower average latencies on small accesses if you also have big
> reads/writes mixed in, but you will also have lower throughput for
> larger accesses.  For some uses, this sort of XFS arrangement is ideal -
> a particular favourite is for mail servers.  But I suspect in many other
> cases you will stray enough from the ideal access patterns to lose any
> benefits it might have.

Yeah, if one will definitely have a mixed workload including
reading/writing sufficiently large files (more than a few MB) where
striping would be of benefit, then using RAID0 over mirror would be
better.  Once you go there though you may as well go RAID10 with a fast
layout, unless your workload is such that a single md thread eats a CPU.
 Then the layered RAID0 over mirror may be a better option.

> Stan is the expert on this, and can give advice on getting the best out
> of XFS.  But personally I don't think a linear concat there is the best
> way to go - especially when you want LVM and multiple filesystems on the
> array.

I'm no XFS expert.  The experts are the devs.  As far as users go, I
probably know some of the XFS internals and theory better than many others.

For the primary workload as stated, XFS over linear is a perfect fit.
WRT doing thin provisioning with virtual machines on this host, using
sparse files to create virtual disks for the VMs and the like, I'm not
sure how well that would work on a linear array with a single XFS
filesystem.  As David mentions, I def wouldn't put multiple XFS
filesystems on the array, with or without LVM.  This can lead to excess
head seeking, and you don't have the spindle RPM for lots of seeks.

WRT sparse file virtual disks, it would depend a lot on the IO access
patterns of the VM guests and their total IO load.  If it's minimal then
the XFS + linear would be fine.  If the guests do a lot of IO, and their
disk files all end up in the same AG, that wouldn't be so good.  Without
more information it's hard to say.

> As another point, since you have mostly read accesses, you should
> probably use raid10,f2 far layout rather than near layout.  It's a bit
> slower for writes, but can be much faster for reads.

Near.. far.. whereeeeever you are...

Neil must have watched Titanic just before he came up with these labels. ;)

-- 
Stan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 12:42       ` David Brown
@ 2012-02-13 14:46         ` Stan Hoeppner
  0 siblings, 0 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-13 14:46 UTC (permalink / raw)
  To: David Brown; +Cc: CoolCold, Linux RAID

On 2/13/2012 6:42 AM, David Brown wrote:

> OpenVZ is great for many purposes, but one unfortunate point is that
> because it is based on patches to a number of key parts of the kernel,
> it is only rarely re-synced to new kernels.  It is currently stuck on
> 2.6.32, which means he can't use this feature (and nor can I - I also
> use OpenVZ and sometimes XFS, though I'm not too bothered about
> squeezing the last drops of performance out of the system).

RHEL 6.2 has all of the XFS improvements.  Is there an OpenVZ kernel for
Red Hat 6.2?

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 12:09     ` Stan Hoeppner
  2012-02-13 12:42       ` David Brown
@ 2012-02-13 21:40       ` CoolCold
  2012-02-13 23:02         ` keld
  2012-02-14  2:49         ` Stan Hoeppner
  1 sibling, 2 replies; 40+ messages in thread
From: CoolCold @ 2012-02-13 21:40 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID

On Mon, Feb 13, 2012 at 4:09 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 2/12/2012 2:16 PM, CoolCold wrote:
>> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!
>
> You're welcome.  You may or may not appreciate this reply.  It got
> really long.  I tried to better explain the XFS+md linear array setup.
>
>> There are several reasons for this - 1) I've made a decision to use LVM
>> for all "data" volumes (those are except /, /boot, /home , etc)  2)
>> there will be mysql database which will need backups with snapshots 3)
>
> So you need LVM for snaps, got it.
>
>> I often have several ( 0-3 ) virtual environments (OpenVZ based) which
>> are living on ext3/ext4 (because of extensive metadata updates on xfs
>> makes it the whole machine slow) filesystem and different LV because
>> of this.
>
> This is no longer the case as of kernel 2.6.35+ with Dave Chinner's
> delayed logging patch.  It's enabled by default in 2.6.39+ and XFS now
> has equal or superior metadata performance to all other Linux
> filesystems.  This presentation is about an hour long, but it's super
> interesting and very informative:
> http://www.youtube.com/watch?v=FegjLbCnoBw
Yeah, I've seen that video and read the LWN article (
http://lwn.net/Articles/476263/ )

>
>> Basing on current storage, estimations show (df -h / df -i ) average
>> file size is ~200kb . Inodes count is near 15 millions and it will be
>> more.
>
> You definitely need the inode64 allocator with that many inodes.  You
> need it anyway for performance.
>
>> I've just thought that may be I should change chunk size to 256kb,
>> just to let one file be read from one disk, this may increase latency
>> and increase throughput too.
>
> Why would you do that instead of simply using XFS on a linear array?
>
>> I've drawn nice picture from my head in my original post, it was:
>>
>> A1 A1 A2 A2 A3 A3 A4
>> A4 A5 A5 A6 A6 A7 A7
>> A8 A8 A9 A9 A10 A10 A11
>> A11 ...
>
>> So here is A{X} is chunk number on top of 7 disks. As you can see, 7
>> chunks write (A1 - A7) will fill two rows. And this will made 2 disk
>> head movements to write full stripe, though that moves may be very
>> near to each other. Real situation may differ of course, and I'm not
>> expert to make a bet too.
>
> xfs_info does show some wonky numbers for sunit/swidth in your example
> output, but the overall number of write bytes is correct, at 448KB,
> matching the array's apparent 7*64KB.  This is double your average file
> size so you'll likely have many partial stripe writes.  And you won't
> get any advantage from device read ahead.  You'll actually be wasting
> buffer cache memory, since each disk will read an extra 128KB.  So if a
> stripe is actually across 7 spindles, for a single 200KB file read, the
> kernel will read an additional 7*128KB=896KB of data into the buffer
> cache.  Given the RAID layout and file access pattern, these extra
> cached sectors may not get used right away, simply wasting RAM.  To
> alleviate this you'd need to decrease
>
> /sys/block/sdX/queue/read_ahead_kb
>
> accordingly, down to something like 32KB or less, to prevent wasting
> RAM.  You may need to tweak other kernel block device queue parameters
> as well.
Wasting RAM is not good in any case, but I'm worried about disk seeks even more.
On the setup with raid10 over 7 drives, I see a readahead of 448kb on the raid device:
root@datastor1:/# cat /sys/block/md3/queue/read_ahead_kb
448

On the linear raid (/dev/md6) over 3 mirrors (md3,md4,md5), I see a 128kb
readahead, and 128kb on the individual raid arrays. If I understand
correctly, in the first case any read request to /dev/md3 will cause a
read of the full stripe and make every drive move its heads?

>
> If you use a linear concat with XFS, you don't have to worry about any
> of these issues because one file goes on one disk (spindle, mirror
> pair).  Read ahead works as it should, no stripe alignment issues,
> maximum performance for your small file workload.

>
>> Hundreds directories at least, yes.
>
> So as long as the most popular content is not put in a single directory
> or two that reside on the same disk causing an IO hotspot, XFS+linear
> will work very well for this workload.  The key is spreading the
> frequently accessed files across all the allocation groups.  But, if the
> popular content all gets cached in RAM, it doesn't matter.  Any other
> content accesses will be random, so you're fine there.
I guess with only 12gb of ram, every access is going to be random ;)

> Note that the
> first 4 directories you create will be in the first 4 AGs on the first
> disk, so don't concentrate all your frequently accessed stuff in the
> first 4 dirs.  With the XFS+linear setup I described before, you end up
> with an on disk filesystem layout like this:
>
>         -------         -------         -------
>        |  AG1  |       |  AG5  |       |  AG9  |
>        |  AG2  |       |  AG6  |       |  AG10 |
>        |  AG3  |       |  AG7  |       |  AG11 |
>        |  AG4  |       |  AG8  |       |  AG12 |
>         -------         -------         -------
>         disk 1          disk 2          disk 3
>
>
> This AG layout is a direct result of the linear array.  If this were a 3
> spindle striped array, each AG would span all 3 disks horizontally, and
> you'd have AGs 1-12 in a vertically column, one 3rd of each AG on each
> disk.  If you're thinking ahead you may already see one of the
> advantages of this setup WRT metadata performance.
Pretty clear & self-explanatory picture, thanks.

>
> Using the inode64 allocator, directory creation will occur in allocation
> group order, putting the first 4 directories you create in the first
> four respective AGs on disk 1.  Directory 13 will be created in AG1, as
> will dir25 and dir37, and so on.  Each file created in a directory will
> reside within the AG where its parent dir resides.
>
> This is primarily what allows XFS+linear to have fantastic parallel
> small file random access performance.
>
>> After reading you ideas and refinements, I'm making conclusion that I
>> need to push others [team members] harder to remove mysql instances
>> from the static files serving boxes at all, to free RAM for least
>> dcache entries.
>
> That or simply limit the amount of memory mysql is allowed to allocate.
>  If you're serving mostly static content, what's the database for?  User
> accounts and login processing?  Interactive forum like phpBB?
In short - the database stores pages (content). There are several pros
for keeping a database on every server - 1) the servers stay fully
independent; 2) if we shared the database over the network, it would cost
additional money in traffic payments (traffic may be billed even between
datacenters of the same hoster, once it leaves the switch).

>
>> About avoiding striping - later in the text.
>
>> Hetzner's guys were pretty fast on chaning failed disks (one - two
>> days after claim) so I may try without spares I guess... I just wanna
>> use more independent spindles here,
>
> If the box came with only 6 data drives would you be asking them to add
> a seventh?  I believe you have a personality type that makes that 7th
> odd drive an itch you must simply scratch. ;)  "It's there so I MUST use
> it!"  Make it your snap target then.  It'll keep tape/etc IO off the
> array when you backup the snaps.  There, itch scratched. ;)
>
>>> That will give you 12 allocation groups of 750GB each, 4 AGs per
>>> effective spindle.  Using too many AGs will cause excessive head seeking
>>> under load, especially with a low disk count in the array.  The mkfs.xfs
>>> agcount default is 4 for this reason.  As a general rule you want a
>>> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
>>> agcount with fast drives (10k, 15k).
>> Good to know such details!
>
> There's a little black magic to manual AG creation, but that's the
> basics.  But it depends quite a bit on the workload.
>
>> So, as I could understand, you are assuming that "internal striping"
>> by using AGs of XFS will be better than MD/LVM striping here? Never
>> thought of XFS in this way and it is interesting point.
>
> Most people know very very little about XFS, which is ironic given its
> capabilities dwarf those of EXT, Reiser, JFS, etc.  That will start to
> change as Red Hat and other distros make it the default filesystem.
>
> There is no striping involved as noted in my diagram and explanation
> above.  This is an md _linear_ array.  You've probably never read of the
> md --linear option.  Nobody (few) uses it because they've simply had
> "striping, striping, striping" drilled into their skulls, and they use
> EXT filesystems, which absolutely REQUIRE striping to get decent
> performance.  XFS has superior technology, has since the 90s, and does
> not necessarily require striping to get decent performance.  As always,
> it depends on the workload and access pattern.
By striping, in general, I mean the common idea of distributing data in
portions over several devices, not actual byte-level granularity. So
if one dir keeps data on DISK1, another dir on DISK2 and so on, I'm
calling it "striped over DISK1, DISK2..."

>
> http://linux.die.net/man/4/md
>
> And I'm sure you've never read about XFS internal structure:
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html
No, I hadn't come across that link before, thanks. I do lurk on #xfs @
freenode and read something useful there sometimes, though.

>
> XFS + md linear array-- Let me repeat this so there is no
> misunderstanding, that we're talking about one of many possible XFS
> configurations:   *XFS + md linear array*  is extremely fast for the
> highly parallel small file random access workload because it:
>
> 1.  Eliminates the complexities and buffering delays of data alignment
> to an md striped array.  While fast and trivial, these operations add
> more and more overhead as the workload increases.  At high IOPS they are
> no longer trivial.  Here, XFS instead simply sends the data directly to md,
> which calculates the sector offset in the linear array and writes the
> blocks to disk.
>
> 2.  mdraid doesn't have to perform any striping offset calculations.
> Again, while trivial, these calculations add overhead as workload
> increases.  And unlike the linear and RAID0 drivers, the mdraid10 driver
> has a single master thread, meaning absolute IO performance can be
> limited by a single CPU if there are enough fast disks in the RAID10
> array and the CPUs in the system aren't fast enough to keep up.  Search
> the list archives for instances of this issue.
>
> 3.  I already mentioned the disk read ahead advantage vs mdraid10.  It
> can be significant, in terms of file access latency, and memory
> consumption due to wasted buffer cache space.  If one is using a
> hardware RAID solution this advantage disappears, as the read ahead hits
> the RAID cache once per request.  It's no longer per drive as with mdraid
> because Linux treats the hardware RAID as a single block device.  It
> can't see the individual drives behind the controller, in this regard
> anyway, thus doesn't perform read ahead on each drive.
>
> 4.  Fewer disk seeks for metadata reads are required.  With a striped
> array + XFS a single metadata lookup can potentially cause a seek on
> every spindle in the array because each AG and its metadata span all
> spindles in the stripe.  With XFS + linear a given metadata lookup for
> a file generates one seek in only one AG on one spindle.
Mmm, clear, got it.

>
> There are other advantages but I'm getting tired of typing. ;)  If
> you're truly curious and wish to learn there is valuable information in
> the mdraid kernel documentation, as well as at xfs.org.  You probably
> won't find much on this specific combination, but you can learn enough
> about each mdraid and xfs to better understand why this combo works.
> This stuff isn't beginner level reading, mind you.  You need a pretty
> deep technical background in Linux and storage technology.  Which is
> maybe why I've done such a poor job explaining this. ;)
>
>> P.S. As I've got 2nd server of the same config, may be i'll have time
>> and do fast & dirty tests of stripes vs AGs.
>
> Fast and dirty tests will not be sufficient to know how either will
> perform with your actual workload.  And if by fast & dirty you mean
> something like
>
> $ dd if=/dev/md2 of=/dev/null bs=8192 count=800000
>
> then you will be superbly disappointed.  What makes the XFS linear array
> very fast with huge amounts of random small file IO makes it very slow
> with large single file reads/writes, because each file resides on a
> single spindle, limiting you to ~120MB/s.  Again, this is not striped
> RAID.  This linear array setup is designed for maximum parallel small
> file throughput.  So if you want to see those "big dd" numbers that make
> folks salivate, you'd need something like
>
> dd if=/mountpt/directory1/bigfile.test of=/dev/null &
> dd if=/mountpt/directory5/bigfile.test of=/dev/null &
> dd if=/mountpt/directory9/bigfile.test of=/dev/null &
>
Okay, this is clear.

> and then sum the 3 results.  Again, XFS speed atop the linear array
> comes from concurrent file access, which is exactly what your stated
> workload is, and thus why I recommended this setup.  To properly test
> this synthetically may likely require something other than vanilla
> benchies such as bonnie or iozone.

Yes, by "quick & dirty" test I usually mean iozone tests like "iozone
-s 1g -I -i 0 -i 1 -i 2 -r 64 -t 16 -F file1 file2 file3 file4 file5
file6 file7 file8 file9 file10 file11 file12 file13 file14 file15
file16" or at least "dd if=/db of=/dev/null iflag=direct bs=512k". May
be will try fs_mark, mentioned by Dave Chinner.
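
Something like this, probably (path and counts are just guesses for now):

fs_mark -d /mountpt/fsmark-test -n 10000 -s 204800 -t 16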

I'm writing down results here (not in aggregated form, just raw data) -
https://docs.google.com/document/d/1PXRCjcVWaxzFCOFFbv812gDUkcMk2-lvpeHdtroN1uw/edit

While doing these dirty tests, I've found that linear md over 3
subvolumes doesn't support barriers, and XFS reports this:
Feb 13 21:39:41 sigma2 kernel: [22336.925917] Filesystem "md6":
Disabling barriers, trial barrier write failed
though this doesn't help on the iozone random write tests

>
> I would recommend copying 48k of those actual picture files evenly
> across 12 directories, for 4K files per dir.  Then use something like
> curl-loader with a whole lot of simulated clients to hammer on the
> files.  This allows you to test web server performance and IO
> performance simultaneously.

Yes, this will be more realistic, of course.

>
> --
> Stan



-- 
Best regards,
[COOLCOLD-RIPN]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 21:40       ` CoolCold
@ 2012-02-13 23:02         ` keld
  2012-02-14  3:49           ` Stan Hoeppner
  2012-02-14  7:31           ` CoolCold
  2012-02-14  2:49         ` Stan Hoeppner
  1 sibling, 2 replies; 40+ messages in thread
From: keld @ 2012-02-13 23:02 UTC (permalink / raw)
  To: CoolCold; +Cc: stan, Linux RAID

On Tue, Feb 14, 2012 at 01:40:25AM +0400, CoolCold wrote:
> On Mon, Feb 13, 2012 at 4:09 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> > On 2/12/2012 2:16 PM, CoolCold wrote:
> >> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!
> >
> > You're welcome.  You may or may not appreciate this reply.  It got
> > really long.  I tried to better explain the XFS+md linear array setup.
> >
> >> There are several reasons for this - 1) I've made a decision to use LVM
> >> for all "data" volumes (those are except /, /boot, /home , etc)  2)
> >> there will be mysql database which will need backups with snapshots 3)
> >
> > So you need LVM for snaps, got it.


Well, I do not think an LVM snapshot alone gives you a usable backup. I think you need to
quiesce or close down the mysql database to get a consistent DB, then take the backup, then
reactivate mysql. I may be wrong, tho.
LVM is great if you want to resize partitions. XFS cannot be shrunk, tho, only grown.
For filesystem-level snapshots you need something like btrfs. But to have the DB consistent
you need to quiesce it before taking a backup.
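
The LVM mechanics themselves are simple enough - roughly (VG/LV names and
snapshot size here are assumptions, and mysql has to be quiesced, e.g.
with FLUSH TABLES WITH READ LOCK, while the snapshot is taken):

lvcreate --snapshot --size 10G --name db-snap /dev/data/db
mount -o ro,nouuid /dev/data/db-snap /mnt/db-backup
# ... copy the backup off, then:
umount /mnt/db-backup
lvremove -f /dev/data/db-snap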

And anyway, I think a 7 spindle raid10,f2 would be much faster than 
a md linear array setup, both for small files and for largish
sequential files. But try it out and report to us what you find.

I would expect a linear md, and also most other MD raids, to tend to perform better in
the almost empty state, as the files will be placed on the faster parts of the spindles.
raid10,f2 would have more uniform performance as it gets filled, because read access to
files would still go to the faster parts of the spindles.


best regards
Keld


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 21:40       ` CoolCold
  2012-02-13 23:02         ` keld
@ 2012-02-14  2:49         ` Stan Hoeppner
  1 sibling, 0 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-14  2:49 UTC (permalink / raw)
  To: CoolCold; +Cc: Linux RAID

On 2/13/2012 3:40 PM, CoolCold wrote:

> While doing this dirty tests, I've seen found that linear md over 3
> subvolumes doesn't support barriers and XFS states this:
> Feb 13 21:39:41 sigma2 kernel: [22336.925917] Filesystem "md6":
> Disabling barriers, trial barrier write failed
> though this doesn't help on iozone random write tests

When using mdraid and a disk controller without BBWC, write barriers
need to be, must be, enabled and working to guarantee journal
consistency.  If barriers are disabled here you're risking the integrity
of the entire filesystem.  Whatever is causing barriers to be disabled
needs to be fixed.  You should definitely ask about this on the XFS
mailing list.  You will want to post your complete mdraid
configuration--the 3 RAID1s and the linear, and your xfs_info output,
and describe the underlying storage hardware--controller(s), disks, etc.
 Maybe Neil might have some insight here as well.
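
Something like this should gather most of what's needed (device names and
mount point are assumptions):

cat /proc/mdstat
mdadm --detail /dev/md3 /dev/md4 /dev/md5 /dev/md6
xfs_info /mountpt
dmesg | grep -i barrier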

>> I would recommend copying 48k of those actual picture files evenly
>> across 12 directories, for 4K files per dir.  Then use something like
>> curl-loader with a whole lot of simulated clients to hammer on the
>> files.  This allows you to test web server performance and IO
>> performance simultaneously.
> 
> Yes, this will be more realistic, of course.

By far.  Definitely more pain to set up though.  If you can get a
synthetic benchy to batch create the ~50K files randomly across the 12
dirs and then randomly read them out after flushing the caches, that
should be pretty close to real world use as well.
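
A rough shell sketch of such a synthetic test (paths, file count, and file
size are only examples); run as root because of the cache drop:

  mkdir -p /mountpt/dir{1..12}
  for i in $(seq 1 48000); do
      dd if=/dev/urandom of=/mountpt/dir$((RANDOM % 12 + 1))/pic$i \
         bs=100k count=1 2>/dev/null              # fake ~100KB "pictures"
  done
  sync && echo 3 > /proc/sys/vm/drop_caches       # flush the page cache
  find /mountpt -type f | shuf | xargs -n 16 cat > /dev/null   # random-order reads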

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 23:02         ` keld
@ 2012-02-14  3:49           ` Stan Hoeppner
  2012-02-14  8:58             ` David Brown
  2012-02-14 11:38             ` keld
  2012-02-14  7:31           ` CoolCold
  1 sibling, 2 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-14  3:49 UTC (permalink / raw)
  To: keld; +Cc: CoolCold, Linux RAID

On 2/13/2012 5:02 PM, keld@keldix.com wrote:

> And anyway, I think a 7 spindle raid10,f2 would be much faster than 
> a md linear array setup, both for small files and for largish
> sequential files. But try it out and report to us what you find.

The results of the target workload should be interesting, given the
apparent 7 spindles of stripe width of mdraid10,f2, and only 3 effective
spindles with the linear array of mirror pairs, an apparent 4 spindle
deficit.

> I would expect  a linear md, and also most other MD raids would tend to perform better in 
> the almost empty state, as the files will be placed on the faster parts of the spindles.

This is not the case with XFS.

> raid10,f2 would have a more uniform performance as it gets filled, because read access to 
> files would still be to the faster parts of the spindles.

This may be the case with EXTx, Reiser, etc, but not with XFS.

XFS creates its allocation groups uniformly across the storage device.
So assuming your filesystem contains more than a handful of directories,
even a very young XFS will have directories and files stored from outer
to inner tracks.

This layout of AGs, and the way XFS makes use of them, is directly
responsible for much of XFS' high performance.  For example, a single
file create operation on a full EXTx filesystem will exhibit a ~30ms
combined seek delay with an average 3.5" SATA disk.  With XFS it will be
~10ms.  This is because with EXTx the directories are at the outer edge
and the free space is on the far inner edge.  With XFS the directory and
free space are a few tracks apart within the same allocation group.  Once
you seek to the directory in the AG, the seek latency from there to the
track with the free space may be less than 1ms.  The seek distance
principle here is the same for single disks and RAID.
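
As a quick illustration, the AG geometry can be read straight off the
superblock (device path is just an example): agcount AGs of agblocks
filesystem blocks each, laid out back to back from the start of the device.

  xfs_db -r -c 'sb 0' -c 'p agcount agblocks blocksize' /dev/data/db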

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-13 23:02         ` keld
  2012-02-14  3:49           ` Stan Hoeppner
@ 2012-02-14  7:31           ` CoolCold
  2012-02-14  9:05             ` David Brown
  1 sibling, 1 reply; 40+ messages in thread
From: CoolCold @ 2012-02-14  7:31 UTC (permalink / raw)
  To: keld; +Cc: stan, Linux RAID

On Tue, Feb 14, 2012 at 3:02 AM,  <keld@keldix.com> wrote:
> On Tue, Feb 14, 2012 at 01:40:25AM +0400, CoolCold wrote:
>> On Mon, Feb 13, 2012 at 4:09 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> > On 2/12/2012 2:16 PM, CoolCold wrote:
>> >> First of all, Stan, thanks for such detailed answer, I greatly appreciate this!
>> >
>> > You're welcome.  You may or may not appreciate this reply.  It got
>> > really long.  I tried to better explain the XFS+md linear array setup.
>> >
>> >> There are several reasons for this - 1) I've made the decision to use LVM
>> >> for all "data" volumes (those are except /, /boot, /home , etc)  2)
>> >> there will be mysql database which will need backups with snapshots 3)
>> >
>> > So you need LVM for snaps, got it.
>
>
> Well, I do not think LVM gives you snaps. I think you need to close down the mysql database
> to have a consistent DB, then make backup, then reactivate mysql. I may be wrong, tho.
You are a bit wrong here. MySQL in general offers two main storage engines
- MyISAM & InnoDB. While InnoDB is an ACID transactional engine, MyISAM
isn't.
So, one should be able to back up InnoDB with snapshots without
interrupting the workload, and it will do recovery/transaction rollback
on startup.
For the MyISAM engine, snapshots will produce unpredictable results as
partial updates may happen. But snapshots are useful in any case,
because they allow doing a backup in 4 steps:
1) "FLUSH TABLES WITH READ LOCK" - this will flush all buffers and
close the tables
2) lvcreate -s && unlock tables.... - take a snapshot of the data and release the lock
3) copy it somewhere
4) lvremove .... - release the snapshot

So, in such a situation, work only stops for 1) & 2), not consuming time for 3).
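
A rough command-level sketch of those 4 steps (snapshot name, size, mount
point, and backup host are hypothetical; the VG/LV path follows the
/dev/data/db volume from earlier in the thread). Note the read lock only
lasts as long as the client session that took it, so steps 1) and 2) have to
run from the same mysql session:

  mysql> FLUSH TABLES WITH READ LOCK;
  mysql> system lvcreate --snapshot --size 10G --name dbsnap /dev/data/db
  mysql> UNLOCK TABLES;
  # 3) mount the snapshot read-only and copy it off the box
  #    (nouuid because the XFS snapshot carries the same UUID as the origin)
  mount -o ro,nouuid /dev/data/dbsnap /mnt/dbsnap
  rsync -a /mnt/dbsnap/ backuphost:/backup/db/
  # 4) drop the snapshot
  umount /mnt/dbsnap
  lvremove -f /dev/data/dbsnap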


> LVM is great if you want to resize partitions. XFS cannot be shrunk, tho, only grown.
> For snapshots you need something like btrfs. But to have the DB consistent you need to close it
> before taking a backup.
>
> And anyway, I think a 7 spindle raid10,f2 would be much faster than
> a md linear array setup, both for small files and for largish
> sequential files. But try it out and report to us what you find.
Quick & dirty iozone doesn't show this yet ...

>
> I would expect  a linear md, and also most other MD raids would tend to perform better in
> the almost empty state, as the files will be placed on the faster parts of the spindles.
> raid10,f2 would have a more uniform performance as it gets filled, because read access to
> files would still be to the faster parts of the spindles.
>
>
> best regards
> Keld
>
> ---

[snip]

-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14  3:49           ` Stan Hoeppner
@ 2012-02-14  8:58             ` David Brown
  2012-02-14 11:38             ` keld
  1 sibling, 0 replies; 40+ messages in thread
From: David Brown @ 2012-02-14  8:58 UTC (permalink / raw)
  To: stan; +Cc: keld, CoolCold, Linux RAID

On 14/02/2012 04:49, Stan Hoeppner wrote:
> On 2/13/2012 5:02 PM, keld@keldix.com wrote:
>
>> And anyway, I think a 7 spindle raid10,f2 would be much faster
>> than a md linear array setup, both for small files and for largish
>> sequential files. But try it out and report to us what you find.
>
> The results of the target workload should be interesting, given the
> apparent 7 spindles of stripe width of mdraid10,f2, and only 3
> effective spindles with the linear array of mirror pairs, an apparent
> 4 spindle deficit.
>

If you try to make two simultaneous reads to the same "effective 
spindle", i.e., the same raid pair, won't you get simultaneous reads - 
one from each half of the mirror?  So even though XFS thinks it is 
sitting on three "disks", you'll still get much of the 6 spindle speed? 
  Certainly if the pairs are raid10,f2 then larger reads from the pairs 
will go at double speed as the data is striped in the pair.

>> I would expect  a linear md, and also most other MD raids would
>> tend to perform better in the almost empty state, as the files will
>> be placed on the faster parts of the spindles.
>
> This is not the case with XFS.
>
>> raid10,f2 would have a more uniform performance as it gets filled,
>> because read access to files would still be to the faster parts of
>> the spindles.
>
> This may be the case with EXTx, Reiser, etc, but not with XFS.
>
> XFS creates its allocation groups uniformly across the storage
> device. So assuming your filesystem contains more than a handful of
> directories, even a very young XFS will have directories and files
> stored from outer to inner tracks.
>
> This layout of AGs, and the way XFS makes use of them, is directly
> responsible for much of XFS' high performance.  For example, a
> single file create operation on a full EXTx filesystem will exhibit a
> ~30ms combined seek delay with an average 3.5" SATA disk.  With XFS
> it will be ~10ms.  This is because with EXTx the directories are at
> the outer edge and the free space is on the far inner edge.  With XFS
> the directory and free space are a few tracks apart within the same
> allocation group.  Once you seek to the directory in the AG, the seek
> latency from there to the track with the free space may be less than
> 1ms.  The seek distance principle here is the same for single disks
> and RAID.
>

For some workloads, the closeness of the data and the metadata will give 
you much lower latencies.  For other workloads, the difference in the 
disk speed between the inner and outer areas will be more significant, 
especially if the metadata is already cached by the system.  For 
metadata-heavy operations, having it all in one place (like with ext) 
will be more efficient.  But for operations involving multiple large 
writes, XFS split over allocation groups will help keep the 
fragmentation low, which is probably part of why it has a good 
reputation for speed when working with very large files.

There is no "one size fits all" filesystem - there are always tradeoffs. 
  That's part of the fun of Linux - you can use the standard systems, or 
you can have your favourite filesystem that you know well and use 
everywhere, or you can learn about them all and try and choose the 
absolute best for each job.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14  7:31           ` CoolCold
@ 2012-02-14  9:05             ` David Brown
  2012-02-14 11:10               ` Stan Hoeppner
  0 siblings, 1 reply; 40+ messages in thread
From: David Brown @ 2012-02-14  9:05 UTC (permalink / raw)
  To: CoolCold; +Cc: keld, stan, Linux RAID

On 14/02/2012 08:31, CoolCold wrote:
> On Tue, Feb 14, 2012 at 3:02 AM,<keld@keldix.com>  wrote:
>> On Tue, Feb 14, 2012 at 01:40:25AM +0400, CoolCold wrote:
>>> On Mon, Feb 13, 2012 at 4:09 PM, Stan
>>> Hoeppner<stan@hardwarefreak.com>  wrote:
>>>> On 2/12/2012 2:16 PM, CoolCold wrote:
>>>>> First of all, Stan, thanks for such detailed answer, I
>>>>> greatly appreciate this!
>>>>
>>>> You're welcome.  You may or may not appreciate this reply.  It
>>>> got really long.  I tried to better explain the XFS+md linear
>>>> array setup.
>>>>
>>>>> There are several reasons for this - 1) I've made the decision to
>>>>> use LVM for all "data" volumes (those are except /, /boot,
>>>>> /home , etc)  2) there will be mysql database which will need
>>>>> backups with snapshots 3)
>>>>
>>>> So you need LVM for snaps, got it.
>>
>>
>> Well, I do not think LVM gives you snaps. I think you need to close
>> down the mysql database to have a consistent DB, then make backup,
>> then reactivate mysql. I may be wrong, tho.
> You are a bit wrong here. MySQL in general supports two storage
> types - MyISAM&  InnoDB. While InnoDB is ACID transactional engine,
> MyISAM isn't. So, one should be able to backup InnoDB with snapshots
> without interrupting workload and it will do recovery/transaction
> rollback on startup. For MyISAM engine, snapshots will produce
> unpredictable results as partial update may happen. But, snapshots
> are useful in any case, because they allow to do backup in 4 steps:
> 1) "FLUSH TABLES WITH READ LOCK" - this will flush all buffers and
> close databases 2) lvcreate -s&&  unlock tables.... - doing snapshot
> of data and releasing lock 3) copying it somewhere 4) lvremove .... -
> releasing snapshot
>
> So, in such a situation, work will stop for 1) & 2), not consuming time
> for 3) .
>
>

Very roughly speaking, taking an LVM snapshot is like pulling the plug 
on the system - if the database engine is able to recover reliably from 
a power fail (by replaying logs, or whatever), then it can restore data 
copied by a snapshot.  I don't know MySQL well enough to say if it can 
do such recovery.

I believe that if you have XFS on LVM, then making a snapshot will first 
"freeze" the filesystem, take the snapshot, then "thaw" the filesystem. 
  This process will sync the system, flushing out outstanding writes, 
and delay new writes until the "thaw" - thus you get a bit better than a 
"power-off copy".  In particular, you don't get zeroed-out files no 
matter what you've done with your barrier settings.

I think ext4 also freezes in the same way, but only with later kernels.
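
A minimal sketch of doing that freeze explicitly around a snapshot, for
kernels where the automatic freeze is not there (mount point, VG/LV names and
snapshot size are hypothetical):

  xfs_freeze -f /mnt/data                   # block new writes, flush pending ones
  lvcreate --snapshot --size 5G --name datasnap /dev/vg0/data
  xfs_freeze -u /mnt/data                   # thaw the filesystem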

mvh.,

David



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14  9:05             ` David Brown
@ 2012-02-14 11:10               ` Stan Hoeppner
  0 siblings, 0 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-14 11:10 UTC (permalink / raw)
  To: David Brown; +Cc: CoolCold, keld, Linux RAID

On 2/14/2012 3:05 AM, David Brown wrote:

> I believe that if you have XFS on LVM, then making a snapshot will first
> "freeze" the filesystem, take the snapshot, then "thaw" the filesystem.
>  This process will sync the system, flushing out outstanding writes, and
> delay new writes until the "thaw" - thus you get a bit better than a
> "power-off copy".  In particular, you don't get zeroed-out files no
> matter what you've done with your barrier settings.

xfs_freeze[1] was a feature/command carried over from IRIX to Linux
during the XFS port.  This functionality was later moved from the XFS
layer into the VFS layer.  Now FS freezing is fully automatic when an
LVM snapshot is taken.  Now in the VFS layer, this works with any
filesystem type, EXT2/3/4, Reiser, JFS, XFS.  I don't have the date or
kernel rev handy where this was integrated in mainline.  It's been at
least a couple/few years.  I'm not inclined to dig it up.


[1] xfs_freeze(8)

NAME
       xfs_freeze - suspend access to an XFS filesystem

SYNOPSIS
       xfs_freeze -f | -u mount-point

DESCRIPTION
       xfs_freeze suspends and resumes access to an XFS filesystem (see
xfs(5)).
[...]


-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14  3:49           ` Stan Hoeppner
  2012-02-14  8:58             ` David Brown
@ 2012-02-14 11:38             ` keld
  2012-02-14 23:27               ` Stan Hoeppner
  1 sibling, 1 reply; 40+ messages in thread
From: keld @ 2012-02-14 11:38 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: CoolCold, Linux RAID

On Mon, Feb 13, 2012 at 09:49:06PM -0600, Stan Hoeppner wrote:
> On 2/13/2012 5:02 PM, keld@keldix.com wrote:
> 
> > And anyway, I think a 7 spindle raid10,f2 would be much faster than 
> > a md linear array setup, both for small files and for largish
> > sequential files. But try it out and report to us what you find.
> 
> The results of the target workload should be interesting, given the
> apparent 7 spindles of stripe width of mdraid10,f2, and only 3 effective
> spindles with the linear array of mirror pairs, an apparent 4 spindle
> deficit.
> 
> > I would expect  a linear md, and also most other MD raids would tend to perform better in 
> > the almost empty state, as the files will be placed on the faster parts of the spindles.
> 
> This is not the case with XFS.
> 
> > raid10,f2 would have a more uniform performance as it gets filled, because read access to 
> > files would still be to the faster parts of the spindles.
> 
> This may be the case with EXTx, Reiser, etc, but not with XFS.
> 
> XFS creates its allocation groups uniformly across the storage device.
> So assuming your filesystem contains more than a handful of directories,
> even a very young XFS will have directories and files stored from outer
> to inner tracks.

Would not even XFS allocate lower AGs (on faster tracks) first?

> This layout of AGs, and the way XFS makes use of them, is directly
> responsible for much of XFS' high performance.  For example, a single
> file create operation on a full EXTx filesystem will exhibit a ~30ms
> combined seek delay with an average 3.5" SATA disk.  With XFS it will be
> ~10ms.  This is because with EXTx the directories are at the outer edge
> and the free space is on the far inner edge.  With XFS the directory and
> free space are a few tracks apart within the same allocation group.  Once
> you seek to the directory in the AG, the seek latency from there to the
> track with the free space may be less than 1ms.  The seek distance
> principle here is the same for single disks and RAID.


Well, I was talking about a given FS, including XFS. As raid10,f2 limits read access to the
faster halves of the spindles, reads will never go to the slower halves.

On RAID types other than raid10,far, with regular use the AGs in use and the data will be spread
randomly over the disks, including the slower inner tracks. Here raid10,far will only
use the outer tracks for reading, with some speed-up as a consequence.

best regards
keld

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14 11:38             ` keld
@ 2012-02-14 23:27               ` Stan Hoeppner
  2012-02-15  8:30                 ` Robin Hill
                                   ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-14 23:27 UTC (permalink / raw)
  To: keld; +Cc: CoolCold, Linux RAID

On 2/14/2012 5:38 AM, keld@keldix.com wrote:
> On Mon, Feb 13, 2012 at 09:49:06PM -0600, Stan Hoeppner wrote:
>> On 2/13/2012 5:02 PM, keld@keldix.com wrote:
>>
>>> And anyway, I think a 7 spindle raid10,f2 would be much faster than 
>>> a md linear array setup, both for small files and for largish
>>> sequential files. But try it out and report to us what you find.
>>
>> The results of the target workload should be interesting, given the
>> apparent 7 spindles of stripe width of mdraid10,f2, and only 3 effective
>> spindles with the linear array of mirror pairs, an apparent 4 spindle
>> deficit.
>>
>>> I would expect  a linear md, and also most other MD raids would tend to perform better in 
>>> the almost empty state, as the files will be placed on the faster parts of the spindles.
>>
>> This is not the case with XFS.
>>
>>> raid10,f2 would have a more uniform performance as it gets filled, because read access to 
>>> files would still be to the faster parts of the spindles.
>>
>> This may be the case with EXTx, Reiser, etc, but not with XFS.
>>
>> XFS creates its allocation groups uniformly across the storage device.
>> So assuming your filesystem contains more than a handful of directories,
>> even a very young XFS will have directories and files stored from outer
>> to inner tracks.
> 
> Would not even XFS allocate lower AGs (on faster tracks) first?
> 
>> This layout of AGs, and the way XFS makes use of them, is directly
>> responsible for much of XFS' high performance.  For example, a single
>> file create operation on a full EXTx filesystem will exhibit a ~30ms
>> combined seek delay with an average 3.5" SATA disk.  With XFS it will be
>> ~10ms.  This is because with EXTx the directories are at the outer edge
>> and the free space is on the far inner edge.  With XFS the directory and
>> free space are a few tracks apart within the same allocation group.  Once
>> you seek to the directory in the AG, the seek latency from there to the
>> track with the free space may be less than 1ms.  The seek distance
>> principle here is the same for single disks and RAID.
> 
> 
> Well, I was talking for a given FS, including XFS. As raid10,f2 limits the read access to the
> faster halves of the spindles, reads will never go to the slower halves. 
> 
> On other raid types than raid10,far with regular use, AGs in use and  data will be spread
> randomly over the disks, including the slower inner tracks. Here raid10,far will only
> use the outer tracks for reading, with some speed-up as a consequence.

Maybe I simply don't understand this 'magic' of the f2 and far layouts.
 If you only read the "faster half" of a spindle, does this mean writes
go to the slower half?  If that's the case, how can you read data that's
never been written?

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14 23:27               ` Stan Hoeppner
@ 2012-02-15  8:30                 ` Robin Hill
  2012-02-15 13:30                   ` Stan Hoeppner
  2012-02-15  9:24                 ` keld
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Robin Hill @ 2012-02-15  8:30 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: keld, CoolCold, Linux RAID

[-- Attachment #1: Type: text/plain, Size: 923 bytes --]

On Tue Feb 14, 2012 at 05:27:43PM -0600, Stan Hoeppner wrote:

> Maybe I simply don't understand this 'magic' of the f2 and far layouts.
>  If you only read the "faster half" of a spindle, does this mean writes
> go to the slower half?  If that's the case, how can you read data that's
> never been written?
> 
Writes go to both halves, as normal for a mirrored setup, which is why
its write performance is lower than that of a near layout array (more
head movement required). Reads will (normally) come from the faster
(outer) half of the disk though, so read performance is better. In most
cases workloads are read-heavy, so this comes out as a significant gain.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14 23:27               ` Stan Hoeppner
  2012-02-15  8:30                 ` Robin Hill
@ 2012-02-15  9:24                 ` keld
  2012-02-15 12:10                 ` David Brown
  2012-02-17 18:44                 ` Peter Grandi
  3 siblings, 0 replies; 40+ messages in thread
From: keld @ 2012-02-15  9:24 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: CoolCold, Linux RAID

On Tue, Feb 14, 2012 at 05:27:43PM -0600, Stan Hoeppner wrote:
> Maybe I simply don't understand this 'magic' of the f2 and far layouts.
>  If you only read the "faster half" of a spindle, does this mean writes
> go to the slower half?  If that's the case, how can you read data that's
> never been written?

Think of raid10,f2 as two raid0's - the first raid0 is on the outer faster
tracks, the second on the slower inner tracks. Reads are always from the outer tracks
so the performance is that of raid0, including striping, and because it is only the
outer tracks, faster and with less head movement than a raid0 on the full set of spindles.

Data is then written to both raid0's (so you don't read data that was never written :-).
Writes are said to be a bit slower, as there should be more head movement, but that is
largely compensated by the per-spindle elevator algorithm for writes, which collects the output data
and orders the sectors so that they can be written in sequence, minimizing head movement.
With the elevator you can actually get striped writes too.
Also, writes are less time-critical than reads, as the CPUs do not have to wait on them to keep processing.
You just collect the output buffers and then periodically flush the data.
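
For reference, requesting that layout from md directly looks roughly like this
(array name and chunk size are examples; the partitions follow the 7-drive
setup discussed in this thread):

  mdadm --create /dev/md10 --level=10 --layout=f2 --chunk=64 \
        --raid-devices=7 /dev/sd[a-g]5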

Best regards
Keld

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14 23:27               ` Stan Hoeppner
  2012-02-15  8:30                 ` Robin Hill
  2012-02-15  9:24                 ` keld
@ 2012-02-15 12:10                 ` David Brown
  2012-02-15 13:08                   ` keld
  2012-02-17 18:44                 ` Peter Grandi
  3 siblings, 1 reply; 40+ messages in thread
From: David Brown @ 2012-02-15 12:10 UTC (permalink / raw)
  To: stan; +Cc: keld, CoolCold, Linux RAID

On 15/02/2012 00:27, Stan Hoeppner wrote:
>
> Maybe I simply don't understand this 'magic' of the f2 and far layouts.
>   If you only read the "faster half" of a spindle, does this mean writes
> go to the slower half?  If that's the case, how can you read data that's
> never been written?
>

Imagine you have disk A with partitions 1 and 2 (1 being the outer 
faster half).  Similarly, disk B is partitioned into 1 and 2.

Take A1 and B2 and tie them together with raid1 as md0.
Take B1 and A2 and tie them together with raid1 as md1.
Take md0 and md1 and tie them together with raid0 as md2.

Then md2 is pretty much a "raid10,f2" of disk A and disk B.
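
In mdadm terms, that construction would look roughly like this (with disk A as
/dev/sda and disk B as /dev/sdb, each pre-partitioned into two halves):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb2
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sda2
  mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1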


So all data is written twice, with one copy on each disk - that's the 
"raid1" mirroring part.  And each time you need to read a large block of 
data, you can get parts of it in parallel from both disks at once, with 
contiguous reads - blocks 0, 2, 4, ... come from A1, while blocks 1, 3, 
5, ... come from B1.

Reads will normally be taken from the outer halves (A1 and B1), since 
these have faster throughput, and keeping the heads there means half the 
head movement (actually less than that on average).  But if the system 
happens to be reading from (or writing to) A1, and needs access to data 
that is also on A1, it can read it from B2 in parallel.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-15 12:10                 ` David Brown
@ 2012-02-15 13:08                   ` keld
  0 siblings, 0 replies; 40+ messages in thread
From: keld @ 2012-02-15 13:08 UTC (permalink / raw)
  To: David Brown; +Cc: stan, CoolCold, Linux RAID

On Wed, Feb 15, 2012 at 01:10:32PM +0100, David Brown wrote:
> 
> Reads will normally be taken from the outer halves (A1 and B1), since 
> these have faster throughput, and keeping the heads there means half the 
> head movement (actually less than that on average).  But if the system 
> happens to be reading from (or writing to) A1, and needs access to data 
> that is also on A1, it can read it from B2 in parallel.

Reads in raid10,far will always come from the copy at the lowest available block,
that is, the fastest area on most hard disks. This is to prevent
disks with odd characteristics from defeating the striping: when two disks with e.g. slightly
different access times were read from, the IO driver could tend to favour
the faster one, and thus make overall operations slower. This also works in
degraded mode. I wrote the one-line patch :-)

Best regards
Keld

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-15  8:30                 ` Robin Hill
@ 2012-02-15 13:30                   ` Stan Hoeppner
  2012-02-15 14:03                     ` Robin Hill
  2012-02-15 15:40                     ` David Brown
  0 siblings, 2 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-15 13:30 UTC (permalink / raw)
  To: keld, CoolCold, Linux RAID

On 2/15/2012 2:30 AM, Robin Hill wrote:
> On Tue Feb 14, 2012 at 05:27:43PM -0600, Stan Hoeppner wrote:
> 
>> Maybe I simply don't understand this 'magic' of the f2 and far layouts.
>>  If you only read the "faster half" of a spindle, does this mean writes
>> go to the slower half?  If that's the case, how can you read data that's
>> never been written?
>>
> Writes go to both halves, as normal for a mirrored setup, which is why

Huh?  A 'normal' RAID setup mirrors one disk to another.  You're
describing data being mirrored from the outer half of a single disk to
the inner half.  Where's the Redundancy in this?  This doesn't make sense.

> its write performance is lower than that of a near layout array (more
> head movement required). Reads will (normally) come from the faster
> (outer) half of the disk though, so read performance is better. In most
> cases workloads are read-heavy, so this comes out as a significant gain.

Again, this makes no sense.  You're simply repeating what David said.
Neither of you seem to really understand this, or are simply unable to
explain it correctly, technically.

Maybe Neil will jump into the fray and answer my original question.

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-15 13:30                   ` Stan Hoeppner
@ 2012-02-15 14:03                     ` Robin Hill
  2012-02-15 15:40                     ` David Brown
  1 sibling, 0 replies; 40+ messages in thread
From: Robin Hill @ 2012-02-15 14:03 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: keld, CoolCold, Linux RAID

[-- Attachment #1: Type: text/plain, Size: 1362 bytes --]

On Wed Feb 15, 2012 at 07:30:45 -0600, Stan Hoeppner wrote:

> On 2/15/2012 2:30 AM, Robin Hill wrote:
> > On Tue Feb 14, 2012 at 05:27:43PM -0600, Stan Hoeppner wrote:
> > 
> >> Maybe I simply don't understand this 'magic' of the f2 and far layouts.
> >>  If you only read the "faster half" of a spindle, does this mean writes
> >> go to the slower half?  If that's the case, how can you read data that's
> >> never been written?
> >>
> > Writes go to both halves, as normal for a mirrored setup, which is why
> 
> Huh?  A 'normal' RAID setup mirrors one disk to another.  You're
> describing data being mirrored from the outer half of a single disk to
> the inner half.  Where's the Redundancy in this?  This doesn't make sense.
> 
No, the outer half of one disk is mirrored to the inner half of the
next (for an f2 layout anyway - an f3 will split it into thirds). Its
outer half is in turn mirrored to the inner half of the next one, and so
on until the outer half of the last disk is mirrored to the inner half
of the first one. You can lose any non-adjacent disks without losing the
data.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-15 13:30                   ` Stan Hoeppner
  2012-02-15 14:03                     ` Robin Hill
@ 2012-02-15 15:40                     ` David Brown
  2012-02-17 13:16                       ` Stan Hoeppner
  1 sibling, 1 reply; 40+ messages in thread
From: David Brown @ 2012-02-15 15:40 UTC (permalink / raw)
  To: stan; +Cc: keld, CoolCold, Linux RAID

On 15/02/2012 14:30, Stan Hoeppner wrote:
> On 2/15/2012 2:30 AM, Robin Hill wrote:
>> On Tue Feb 14, 2012 at 05:27:43PM -0600, Stan Hoeppner wrote:
>>
>>> Maybe I simply don't understand this 'magic' of the f2 and far layouts.
>>>   If you only read the "faster half" of a spindle, does this mean writes
>>> go to the slower half?  If that's the case, how can you read data that's
>>> never been written?
>>>
>> Writes go to both halves, as normal for a mirrored setup, which is why
>
> Huh?  A 'normal' RAID setup mirrors one disk to another.  You're
> describing data being mirrored from the outer half of a single disk to
> the inner half.  Where's the Redundancy in this?  This doesn't make sense.
>

Like Robin said, and like I said in my earlier post, the second copy is 
on a different disk.

>> its write performance is lower than that of a near layout array (more
>> head movement required). Reads will (normally) come from the faster
>> (outer) half of the disk though, so read performance is better. In most
>> cases workloads are read-heavy, so this comes out as a significant gain.
>
> Again, this makes no sense.  You're simply repeating what David said.
> Neither of you seem to really understand this, or are simply unable to
> explain it correctly, technically.
>

As far as I can see, you are the only one in this thread who doesn't 
understand this.  I'm not sure where the problem lies, as several people 
(including me) have given you explanations that seem pretty clear to me. 
  But maybe there is some fundamental point that we are assuming is 
obvious, but you don't get - hopefully it will suddenly click in place 
for you.

Have a look at this:

<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>

Forget writes for a moment.  Forget the second copy on the inner halves 
of the disks.  Then as far as reading is concerned, the raid10,f2 layout 
looks /exactly/ like a raid0 stripe, using only the outer halves of the 
disks.  That means large reads use all the spindles as they read whole 
stripes at a time, just like raid0.  And because all data is on the 
outer halves of the disks, the average bandwidth is higher than if it 
were spread over the whole disk, and the average seek is faster because 
the head movement is smaller - this is standard "short stroking" speed 
improvement.

Write performance is lower because it needs more head movement to make 
the second copy in the inner halves of the disks (as well as the copy to 
the outer halves).  But if you have a reasonable read-to-write ratio, 
the total performance is higher overall.

> Maybe Neil will jump into the fray and answer my original question.
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-15 15:40                     ` David Brown
@ 2012-02-17 13:16                       ` Stan Hoeppner
  2012-02-17 14:57                         ` David Brown
  2012-02-17 19:03                         ` Peter Grandi
  0 siblings, 2 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-17 13:16 UTC (permalink / raw)
  To: David Brown; +Cc: keld, CoolCold, Linux RAID

On 2/15/2012 9:40 AM, David Brown wrote:

> Like Robin said, and like I said in my earlier post, the second copy is
> on a different disk.

We've ended up too deep in the mud here.  Keld's explanation didn't make
sense resulting in my "huh" reply.  Let's move on from there back to the
real question.

You guys seem to assume that since I asked a question about the near,far
layouts that I'm ignorant of them.  These layouts are the SNIA
integrated adjacent stripe and offset stripe mirroring.  They are well
known.  This is not what I asked about.

> As far as I can see, you are the only one in this thread who doesn't
> understand this.  I'm not sure where the problem lies, as several people
> (including me) have given you explanations that seem pretty clear to me.
>  But maybe there is some fundamental point that we are assuming is
> obvious, but you don't get - hopefully it will suddenly click in place
> for you.

Again, the problem is you're assuming I'm ignorant of the subject, and
are simply repeating the boiler plate.

> Forget writes for a moment.[snip]

This saga is all about writes.  The fact you're running away from writes
may be part of the problem.

Back to the original issue.  Coolcold and I were trying to figure out
what the XFS write stripe alignment should be for a 7 disk mdraid10 near
layout array.

After multiple posts from David, Robin, and Keld attempting to 'educate'
me WRT the mdraid driver read tricks which yield an "effective RAID0
stripe", nobody has yet answered my question:

What is the stripe spindle width of a 7 drive mdraid near array?

Do note that stripe width is specific to writes.  It has nothing to do
with reads, from the filesystem perspective anyway.  For internal array
operations it will.

So let's take a look at two 4-drive RAIDs, a standard RAID10 and a
RAID10,n/f.  The standard RAID10 array has a stripe across two drives.
Each drive has a mirror.  Stripe writes are two devices wide.  There are
a total of 4 write operations to the drives, 2 data and 2 mirror data.
 Stripe width concerns only data.

The n/f layouts rotate the data and mirror data writes around the 4 drives.  So
it is possible, and I assume this is the case, to write data and mirror
data 4 times, making the stripe width 4, even though this takes twice as
many RAID IOs compared to the standard RAID10 layout.  If this is the
case this is what we'd tell mkfs.xfs.  So in the 7 drive case it would
be seven.  This is the only thing I'm unclear about WRT the near/far
layouts, thus my original question.  I believe Neil will be definitively
answering this shortly.
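
If that assumption holds, the corresponding mkfs.xfs alignment for the 7-drive
array, assuming the 64K chunk used earlier in the thread, would be roughly
(device name hypothetical):

  mkfs.xfs -d su=64k,sw=7 /dev/vg/datalv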

There is a potential problem with this though, if my assumption about
write behavior of n/f is correct.  We've now done 8 RAID IOs to the 4
drives in a single RAID operation.  There should only be 4 RAID IOs in
this case, one to each disk.  This tends to violate some long accepted
standards/behavior WRT RAID IO write patterns.  Traditionally, one RAID
IO meant only one set of sector operations per disk, dictated by the
chunk/strip size.  Here we'll have twice as many, but should
theoretically also be able to push twice as much data per RAID write
operation since our stripe width would be doubled, negating the double
write IOs.  I've not tested these head to head myself.  Such results
with a high IOPS random write workload would be interesting.

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 13:16                       ` Stan Hoeppner
@ 2012-02-17 14:57                         ` David Brown
  2012-02-17 19:30                           ` Peter Grandi
  2012-02-19 14:46                           ` Peter Grandi
  2012-02-17 19:03                         ` Peter Grandi
  1 sibling, 2 replies; 40+ messages in thread
From: David Brown @ 2012-02-17 14:57 UTC (permalink / raw)
  To: stan; +Cc: keld, CoolCold, Linux RAID

On 17/02/2012 14:16, Stan Hoeppner wrote:
> On 2/15/2012 9:40 AM, David Brown wrote:
>
>> Like Robin said, and like I said in my earlier post, the second copy is
>> on a different disk.
>
> We've ended up too deep in the mud here.  Keld's explanation didn't make
> sense resulting in my "huh" reply.  Let's move on from there back to the
> real question.
>
> You guys seem to assume that since I asked a question about the near,far
> layouts that I'm ignorant of them.  These layouts are the SNIA
> integrated adjacent stripe and offset stripe mirroring.  They are well
> known.  This is not what I asked about.
>

As far as I can see (from the SNIA DDF Technical Position v. 2.0), md 
raid10,n2 is roughly SNIA RAID-1E "integrated adjacent stripe 
mirroring", while raid10,o2 (offset layout) is roughly SNIA RAID-1E 
"Integrated offset stripe mirroring".  I say roughly, because I don't 
know if SNIA covers raid10 with only 2 disks, and I am not 100% sure 
about whether the choice of which disk mirrors which other disk is the same.

I can't see any SNIA level that remotely matches md raid10,far layout.

>> As far as I can see, you are the only one in this thread who doesn't
>> understand this.  I'm not sure where the problem lies, as several people
>> (including me) have given you explanations that seem pretty clear to me.
>>   But maybe there is some fundamental point that we are assuming is
>> obvious, but you don't get - hopefully it will suddenly click in place
>> for you.
>
> Again, the problem is you're assuming I'm ignorant of the subject, and
> are simply repeating the boiler plate.
>
>> Forget writes for a moment.[snip]
>
> This saga is all about writes.  The fact you're running away from writes
> may be part of the problem.
>

The whole point of raid10,far is to improve read speed compared to other 
layouts - even though it is slower for writes.  Obviously you /can/ do 
writes, and obviously they are safe and mirrored - but for this 
read-heavy application the speed of writes should not be the main issue. 
  The point is that raid10,far will give faster /reads/ than other 
layouts.  No one is "running away" from writes - I am just putting them 
aside to help the explanation.

> Back to the original issue.  Coolcold and I were trying to figure out
> what the XFS write stripe alignment should be for a 7 disk mdraid10 near
> layout array.
>

That is certainly one issue - and it's something you know a lot more 
about than me.  So I am not getting involved in that (but I am listening 
in and learning).

But I can't sit by idly while you discuss details of the xfs striping 
over raid10,near when I believe a change to raid10,far will make a lot 
bigger difference to this read-heavy application.

> After multiple posts from David, Robin, and Keld attempting to 'educate'
> me WRT the mdraid driver read tricks which yield an "effective RAID0
> stripe", nobody has yet answered my question:
>
> What is the stripe spindle width of a 7 drive mdraid near array?

With "near" layout, it is basically 3.5 spindles.  raid10,n2 is the same 
layout as normal raid10 if the number of disks is a multiple of 2.  (See 
later before you react to the "3.5 spindles".)

With "far" or "offset" layout it is clearly 7 spindles.

As you say, md raid10 gives an "effective raid0 stripe" for offset and 
far layouts.

The difference with raid10,far compared to raid10,offset is that each of 
these raid0 stripe reads comes from the fastest half of the disk, with 
minimal head movement (while reading), and with better use of disk 
read-ahead.

>
> Do note that stripe width is specific to writes.  It has nothing to do
> with reads, from the filesystem perspective anyway.  For internal array
> operations it will.
>

I don't understand that at all.

To my mind, stripe width applies to reads and writes.  For reads, it is 
the number of spindles that are used in parallel while reading larger 
blocks of data.  For writes, it is in addition the width of a parity 
stripe for raid5 or raid6.

Normally, the filesystem does not care about stripe widths, either for 
reading or writing, just as it does not care whether you have one disk, 
an array, local disks, iSCSI disks, or whatever.  Some filesystems care 
a /little/ about stripe width in that they align certain structures to 
stripe boundaries to make accesses more efficient.

> So lets take a look at two 4 drive RAIDs, a standard RAID10 and a
> RAID10,n/f.  The standard RAID10 array has a stripe across two drives.
> Each drive has a mirror.  Stripe writes are two device wide.  There are
> a total of 4 write operations to the drives, 2 data and two mirror data.
>   Stripe width concerns only data.
>

Fine so far.  In pictures, we have this:

Given data blocks 0, 1, 2, 3, ...., with copies "a" and "b", you have:

Standard raid10:

disk0 = 0a 2a 4a 6a 8a
disk1 = 0b 2b 4b 6b 8b
disk2 = 1a 3a 5a 7a 9a
disk3 = 1b 3b 5b 7b 9b

The stripe width is 2 - if you try to do a large read, you will get data 
from two drives in parallel.

Small writes (a single chunk) will involve 2 write operations - one to 
the "a" copy, and one to the "b" copy of each block, and will be done in 
parallel as they are on different disks.  Large writes will also be two 
copies, and will go to all disks in parallel.

"raid10,n2" layout is exactly the same as standard "raid10" - i.e., a 
stripe of mirrors - when there is a multiple of 2 disks.  For seven 
disks, the layout would be:

disk0 = 0a 3b 7a
disk1 = 0b 4a 7b
disk2 = 1a 4b 8a
disk3 = 1b 5a 8b
disk4 = 2a 5b 9a
disk5 = 2b 6a 9b
disk6 = 3a 6b 10a


> The n,r rotate the data and mirror data writes around the 4 drives.  So
> it is possible, and I assume this is the case, to write data and mirror
> data 4 times, making the stripe width 4, even though this takes twice as
> many RAID IOs compared to the standard RAID10 lyout.  If this is the
> case this is what we'd tell mkfs.xfs.  So in the 7 drive case it would
> be seven.  This is the only thing I'm unclear about WRT the near/far
> layouts, thus my original question.  I believe Neil will be definitively
> answering this shortly.
>

I think you are probably right here - it doesn't make sense to talk 
about a "3.5" spindle width.  If you call it 7, then it should work well 
even though each write takes two operations.


Let me draw the pictures of 4 and 7 disk layouts for raid10,f2 (far) and 
raid10,o2 (offset) to show what is going on:


Raid10,offset:

disk0 = 0a 3b 4a 7b 8a  11b
disk1 = 1a 0b 5a 4b 9a  8b
disk2 = 2a 1b 6a 5b 10a 9b
disk3 = 3a 2b 7a 6b 11a 10b

disk0 = 0a 6b 7a  13b
disk1 = 1a 0b 8a  7b
disk2 = 2a 1b 9a  8b
disk3 = 3a 2b 10a 9b
disk4 = 4a 3b 11a 10b
disk5 = 5a 4b 12a 11b
disk6 = 6a 5b 13a 12b

As you can guess, this gives good read speeds (7 spindles in parallel, 
though not ideal read-ahead usage), and write speeds are also good 
(again, all 7 spindles can be used in parallel, and head movement 
between the two copies is minimal).  This layout is faster than standard 
raid10 or raid10,n2 in most use cases, though for lots of small parallel 
accesses (where striped reads don't occur) there will be no difference.


Raid10,far:

disk0 = 0a 4a 8a  ... 3b 7b 11b ...
disk1 = 1a 5a 9a  ... 0b 4b 8b  ...
disk2 = 2a 6a 10a ... 1b 5b 9b  ...
disk3 = 3a 7a 11a ... 2b 6b 10b ...

disk0 = 0a 7a  ... 6b 13b ...
disk1 = 1a 8a  ... 0b 7b  ...
disk2 = 2a 9a  ... 1b 8b  ...
disk3 = 3a 10a ... 2b 9b  ...
disk4 = 4a 11a ... 3b 10b ...
disk5 = 5a 12a ... 4b 11b ...
disk6 = 6a 13a ... 5b 12b ...

This gives optimal read speeds (7 spindles in parallel, ideal read-ahead 
usage, and all data taken from the faster half of the disks).  Write 
speeds are not bad (again, all 7 spindles can be used in parallel, but 
you have large head movements between writing each copy of the data). 
For reads, this layout is faster than standard raid10, raid10,n2, 
raid10,o2, and even standard raid0 (since the average bandwidth is 
higher on the outer halves, and the average head movement during read 
seeks is lower).  But writes have longer latencies.


When you are dealing with multiple parallel small reads, many of the 
differences here disappear.  But there is still nothing to lose by using 
raid10,far if you have read-heavy applications - and the shorter head 
movements will still make it faster.  If the longer write operations are 
a concern, raid10,offset may be a better compromise - it is certainly 
still better than raid10,near.


> There is a potential problem with this though, if my assumption about
> write behavior of n/f is correct.  We've now done 8 RAID IOs to the 4
> drives in a single RAID operation.  There should only be 4 RAID IOs in
> this case, one to each disk.  This tends to violate some long accepted
> standards/behavior WRT RAID IO write patterns.  Traditionally, one RAID
> IO meant only one set of sector operations per disk, dictated by the
> chunk/strip size.  Here we'll have twice as many, but should
> theoretically also be able to push twice as much data per RAID write
> operation since our stripe width would be doubled, negating the double
> write IOs.  I've not tested these head to head myself.  Such results
> with a high IOPS random write workload would be interesting.
>

Most of my comments here are based on understanding the theory, rather 
than the practice - it's been a while since I did any benchmarking with 
different layouts and that was not very scientific testing.  I certainly 
agree it would be interesting to see test results.

I can't say if the extra writes will be an issue - it may conceivably 
affect speeds if the filesystem is optimised on the assumption that a 
write to 7 spindles means only 7 head movements and 7 write operations. 
  But this is the same issue as you always get with layered raid - 
logically speaking, Linux raid10 (regardless of layout) appears as a 
stripe of mirrors just like traditional layered raid10.

mvh.,

David

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-14 23:27               ` Stan Hoeppner
                                   ` (2 preceding siblings ...)
  2012-02-15 12:10                 ` David Brown
@ 2012-02-17 18:44                 ` Peter Grandi
  2012-02-18 17:39                   ` Peter Grandi
  3 siblings, 1 reply; 40+ messages in thread
From: Peter Grandi @ 2012-02-17 18:44 UTC (permalink / raw)
  To: Linux RAID

>>> The results of the target workload should be interesting,
>>> given the apparent 7 spindles of stripe width of
>>> mdraid10,f2, and only 3 effective spindles with the linear
>>> array of mirror pairs, an apparent 4 spindle deficit.
[ ... ]
 
>>> raid10,f2 would have a more uniform performance as it gets
>>> filled, because read access to files would still be to the
>>> faster parts of the spindles.
[ ... ]
>> Well, I was talking for a given FS, including XFS. As
>> raid10,f2 limits the read access to the faster halves of the
>> spindles, reads will never go to the slower halves. [ ... ]

That's not how I understand the 'far' layout and its
consequences, as described in 'man 4 md':

  "The first copy of all data blocks will be striped across the
   early part of all drives in RAID0 fashion, and then the next
   copy of all blocks will be striped across a later section of
   all drives, always ensuring that all copies of any given
   block are on different drives.

   The 'far' arrangement can give sequential read performance
   equal to that of a RAID0 array, but at the cost of degraded
   write performance."


 and I understand this skepticism:

> Maybe I simply don't understand this 'magic' of the f2 and far
> layouts.  If you only read the "faster half" of a spindle,
> does this mean writes go to the slower half?  If that's the
> case, how can you read data that's never been written?

The 'f2' layout is based on the idea of splitting each disk in
two (or more...), and putting the first copy of each chunk in
the first halves, and the second copy of each chunk in the
second halves (of the next disk to uncorrelate storage device
failures).

The main difference is not at all that reads become faster
because they happen in the first halves, but because they become
more parallel *for single threaded* reads. Consider, for example,
six drives in 3 pairs:

  * With 'n2', traditional RAID0 of RAID1, the maximum degree of
    parallelism is 6 chunks read in parallel only if *two
    threads* are reading, because while one thread can read 6
    chunks in parallel, half of those chunks are useless to that
    thread because they are copies.

  * With 'f2' a *single thread* can read 6 chunks in parallel
    because it can read 6 different chunks from all the first
    halves or all the second halves.

  * With 'f2' the main price to pay is that *peak* writing speed
    is lower because each drive is shared between copies
    of two different chunks in the same stripe, not because of
    speed difference between outer and inner tracks. The issue
    is lower parallelism, plus extra arm seeking in many cases.
    Consider the case of writing two consecutive chunks at the
    beginning of a stripe:
    - With 'f2' the first chunk gets written to the top of drive
      1, and bottom of drive 2. Then the next chunk is written
      to the top of drive 2, and the bottom of drive 3. Drive 2
      writes must be serialized and arm must move half a disk.
    - With 'n2' the first chunk goes to drives 1 and 2, and the
      second to drives 3 and 4, so there is no serialization of
      writes and no arm movement.
    With 'f2' writing 2 chunks means spreading the writes to 3
    drives instead of 4, and this reduces the throughput, but
    the real issue is the extra seeking, which also increases
    latency.

Something very close to RAID10 'f2' is fairly easy to build
manually, for example for two drives:

  mdadm -C /dev/pair1 -l raid1 -n 2 /dev/sda1 /dev/sdb2
  mdadm -C /dev/pair2 -l raid1 -n 2 /dev/sdb1 /dev/sda2
  mdadm -C /dev/r10f2 -l raid0 -n 2 /dev/pair1 /dev/pair2

If one really wants for all reads to go preferentially to
'/dev/sda1' and '/dev/sdb1' one can add '-W' as in:

  mdadm -C /dev/pair1 -l raid1 -n 2 /dev/sda1 -W /dev/sdb2
  mdadm -C /dev/pair2 -l raid1 -n 2 /dev/sdb1 -W /dev/sda2
  mdadm -C /dev/r10f2 -l raid0 -n 2 /dev/pair1 /dev/pair2

The same effect can be obtained with an 'n2' over the same four
partitions, listing them in the appropriate order:

  mdadm -C /dev/r10f2 -l raid10 -n 4 \
    /dev/sda1 /dev/sdb2 \
    /dev/sdb1 /dev/sda2

With 3 mirrors on 3 drives:

  mdadm -C /dev/mirr1 -l raid1 -n 3 /dev/sda1 /dev/sdb2 /dev/sdc3
  mdadm -C /dev/mirr2 -l raid1 -n 3 /dev/sdb1 /dev/sdc2 /dev/sda3
  mdadm -C /dev/mirr3 -l raid1 -n 3 /dev/sdc1 /dev/sda2 /dev/sdb3
  mdadm -C /dev/r10f2 -l raid0 -n 3 /dev/mirr1 /dev/mirr2 /dev/mirr3

The 'f2' RAID10 layout is very advantageous with mostly-read
data, and especially in the 2-drive case, where it still behaves like RAID10,
because the 'n2' layout in the 2-drive case is just RAID1.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 13:16                       ` Stan Hoeppner
  2012-02-17 14:57                         ` David Brown
@ 2012-02-17 19:03                         ` Peter Grandi
  2012-02-17 22:12                           ` Stan Hoeppner
  2012-02-18 17:09                           ` Peter Grandi
  1 sibling, 2 replies; 40+ messages in thread
From: Peter Grandi @ 2012-02-17 19:03 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> Back to the original issue.  Coolcold and I were trying to
> figure out what the XFS write stripe alignment should be for a
> 7 disk mdraid10 near layout array. After multiple posts from
> David, Robin, and Keld attempting to 'educate' me WRT the
> mdraid driver read tricks which yield an "effective RAID0
> stripe", nobody has yet answered my question:

> What is the  stripe spindle width of a 7 drive mdraid near array?

As I have repeated many many many times to you in past XFS
discussions, and please take note, stripe alignment matters ONLY
AND SOLELY IF READ-MODIFY-WRITE is involved, and RAID10 never
requires read-modify-write.

> Do note that stripe width is specific to writes.  It has
> nothing to do with reads, from the filesystem perspective
> anyway. For internal array operations it will.

Again, stripe alignment only matters for writes ONLY AND SOLELY
IF READ-MODIFY-WRITE is involved. This never happens for RAID0,
RAID1 or RAID10, because there is no parity to update; chunks
within a stripe are wholly independent of each other.

Not all parity RAID involves read-modify-write either, for
example RAID2 and RAID3 (bit and byte parallel parity RAID under
the SNIA taxonomy) never do read-modify-write either, so stripe
alignment does not matter for those either. Note: this is
because RAID setups where the physical sector (bit or byte) is
smaller than the logical sector always do whole-stripe reads or
writes, which is a special case.

Disclaimer: using stripe alignment even when it is not required
may help a bit with scheduling, it being slightly akin to a
larger block size, but not quite, but that is a secondary
effect.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 14:57                         ` David Brown
@ 2012-02-17 19:30                           ` Peter Grandi
  2012-02-18 13:59                             ` David Brown
  2012-02-19 14:46                           ` Peter Grandi
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Grandi @ 2012-02-17 19:30 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> To my mind, stripe width applies to reads and writes.  For
> reads, it is the number of spindles that are used in parallel
> while reading larger blocks of data.  For writes, it is in
> addition the width of a parity stripe for raid5 or raid6.

In the XFS case that's completely wrong, and irrelevant: in the
XFS case it is the number of sectors/blocks that IO has to be
_aligned_ to in order to avoid read-modify-write, if there is a risk of
that.

The stripe width per se matters less than aligned writes as to
avoiding read-modify-write impact: if one does IO in stripe
width units but they are not aligned, performance will be
terrible as double read-modify-write will not be prevented.

What the stripe width is does not matter to applications like a
filesystem, other than for read-modify-write avoidance, because
how many sectors/blocks are/can be read in parallel depends
primarily on application access patterns, and secondarily on how
good the IO subsystem scheduling is.

> [ ... ] Some filesystems care a /little/ about stripe width in
> that they align certain structures to stripe boundaries to
> make accesses more efficient.

That is in the case where read-modify-write cannot happen; if
read-modify-write can happen, unaligned or non-full-width writes are
very costly, and not just for arrays: the same happens in RAM, and
for 4KiB-physical-sector drives simulating 512B logical sectors.
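
A crude way to see that last effect is to compare sub-physical-sector
and physical-sector-sized direct writes; on a 4KiB-physical/512B-logical
drive the firmware has to read-modify-write each physical sector in
the first case. Untested sketch, /dev/sdX being a hypothetical scratch
device:

    # WARNING: overwrites the start of /dev/sdX -- scratch device only
    time dd if=/dev/zero of=/dev/sdX bs=512  count=20000 oflag=direct
    time dd if=/dev/zero of=/dev/sdX bs=4096 count=2500  oflag=direct

Both runs write about 10MiB; the 512B case is typically much slower on
such drives.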

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 19:03                         ` Peter Grandi
@ 2012-02-17 22:12                           ` Stan Hoeppner
  2012-02-18 17:09                           ` Peter Grandi
  1 sibling, 0 replies; 40+ messages in thread
From: Stan Hoeppner @ 2012-02-17 22:12 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On 2/17/2012 1:03 PM, Peter Grandi wrote:

> As I have repeated many many many times to you in past XFS
> discussions, and please take note, stripe alignment matters ONLY
> AND SOLELY IF READ-MODIFY-WRITE is involved, and RAID10 never
> requires read-modify-write.

A wise Jedi once contradicted himself when he advised me to "never speak
in absolute terms".  A wiser Jedi would have said "don't speak in
absolute terms".

[...]
> Disclaimer: using stripe alignment even when it is not required
> may help a bit with scheduling, it being slightly akin to a
> larger block size, but not quite, but that is a secondary
> effect.

And the inevitable future contradiction is the reason.  The wisest of
Jedi once told me "do not leave performance on the table".  Whether you
consider this "scheduling" effect, or others, of write alignment on
non-RMW devices to be secondary, it does have positive performance
implications, and should thus not be left on the table.

More importantly, and I could be mistaken, but IIRC, even absent an
underlying RMW block device, XFS journal writes benefit from alignment
because the larger stripe-width write-out lowers the ratio of write
barriers issued to blocks written.  And as we all know (or should),
write barriers can murder performance, especially for metadata-heavy
workloads, as each barrier operation typically flushes all the drives'
caches.  Most mdraid users probably don't run BBWC RAID cards in JBOD
mode to avoid barriers, though I know of a few who do.  So XFS write
alignment on mdraid should be a consideration for almost everyone using
XFS, regardless of whether their array is parity RMW or not.

Both unaligned RMW and write barriers are performance killers.  Which
one has more blood on its hands I can't say.  I've never done such
testing nor come across a related paper.
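
For reference, not a recommendation: the usual knobs in this era were
the XFS 'nobarrier' and 'logbsize' mount options -- only sane with a
battery/flash-backed write cache, since 'nobarrier' stops the
filesystem from issuing cache flushes. A hypothetical, untested
example, with device and mount point as placeholders:

    mount -o noatime,nobarrier,logbsize=256k /dev/VG/LV /srv/pictures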

-- 
Stan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 19:30                           ` Peter Grandi
@ 2012-02-18 13:59                             ` David Brown
  0 siblings, 0 replies; 40+ messages in thread
From: David Brown @ 2012-02-18 13:59 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On 17/02/12 20:30, Peter Grandi wrote:
> [ ... ]
>
>> To my mind, stripe width applies to reads and writes.  For
>> reads, it is the number of spindles that are used in parallel
>> while reading larger blocks of data.  For writes, it is in
>> addition the width of a parity stripe for raid5 or raid6.
>
> In the XFS case that's completely wrong, and irrelevant: there the
> stripe width is the number of sectors/blocks to which IO has to be
> _aligned_ in order to avoid read-modify-write, if there is a risk of
> that.
>
> The stripe width per se matters less than write alignment in
> avoiding the read-modify-write impact: if one does IO in stripe-width
> units but the writes are not aligned, performance will be terrible,
> as each write straddles two stripes and read-modify-write is not
> prevented.
>
> The stripe width itself does not matter to applications like a
> filesystem other than for read-modify-write avoidance, because how
> many sectors/blocks can be read in parallel depends primarily on
> application access patterns, and secondarily on how good the IO
> subsystem's scheduling is.
>
>> [ ... ] Some filesystems care a /little/ about stripe width in
>> that they align certain structures to stripe boundaries to
>> make accesses more efficient.
>
> That is in the case where read-modify-write cannot happen; if
> read-modify-write can happen, unaligned or non-full-width writes are
> very costly, and not just for arrays: the same happens in RAM, and
> for 4KiB-physical-sector drives simulating 512B logical sectors.

I see your point here - when using a raid that requires 
read-modify-write (such as raid5 or raid6), then having the filesystem 
optimise writes by aligning them to RMW stripes is critical to avoiding 
poor write performance.  I'll remember that for future cases with XFS 
over raid5 or raid6.

In this case (raid10), RMW is not relevant.  So the effect of stripes is 
to allow single reads (or writes) to make use of as many spindles as 
possible in parallel.  Since this is a read-heavy application, the speed 
of reads is important - thus stripe widths /are/ important for performance.

As far as I understand you, XFS doesn't care about the stripe width for 
reading, so it doesn't matter whether you give it the correct width when 
creating the filesystem.  But that's a different matter from saying the 
/actual/ stripe width is relevant or not for read performance - it is 
just XFS's idea of the stripe width that is irrelevant for reading, 
while the real-world underlying stripe width on the raid array /is/ 
relevant.

mvh.,

David

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 19:03                         ` Peter Grandi
  2012-02-17 22:12                           ` Stan Hoeppner
@ 2012-02-18 17:09                           ` Peter Grandi
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Grandi @ 2012-02-18 17:09 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> [ ... ] XFS write stripe alignment should be for a 7 disk
>> mdraid10 near layout array.

> [ ... ] stripe alignment matters ONLY AND SOLELY IF
> READ-MODIFY-WRITE is involved, and RAID10 never requires
> read-modify-write.

>> Do note that stripe width is specific to writes.  It has
>> nothing to do with reads, from the filesystem perspective
>> anyway. For internal array operations it will.

> Again, stripe alignment only matters for writes ONLY AND SOLELY
> IF READ-MODIFY-WRITE is involved. This never happens for RAID0,
> RAID1 or RAID10, because there is no parity to update; chunks
> within a stripe are wholly independent of each other.

There is a subtlety here... at times I am excessively precise in my
wording :-) but that is also because it matters.

XFS stripe geometry is specified as 'su'/'sunit', which is the MD
chunk size, and 'sw'/'swidth', which is the logical stripe width,
and similarly for 'ext3' and 'ext4'.

The reason is that while *stripe* alignment (and size) does not
matter if there is no risk of RMW in the underlying storage system,
with XFS, as with 'ext3' and 'ext4', the *chunk* alignment (and size)
matters in every case where there is parallelism in the underlying
storage system; it matters for both reads and writes, and on all RAID
layouts.

That is because the filesystem will try to allocate _metadata_
chunk-aligned, so that reading/writing metadata can take advantage of
the parallelism of the array. From 'man mke2fs':

  "stride=stride-size
    Configure the filesystem for a RAID array with stride-size
    filesystem blocks. This is the number of blocks read or
    written to disk before moving to the next disk, which is
    sometimes referred to as the chunk size.
    This mostly affects placement of filesystem meta-data like
    bitmaps at mke2fs time to avoid placing them on a single
    disk, which can hurt performance. It may also be used by
    the block allocator."

But note that this is a different discussion from one about
*stripes*, and IO from applications above the filesystem. This is
the filesystem as an application itself optimizing its own data
given a hint about device geometry.
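
To put a number on the above (untested, device name a placeholder):
with a 64KiB MD chunk and 4KiB filesystem blocks the chunk is 16
blocks, so the hint from the man page excerpt would be passed as

    mke2fs -t ext4 -b 4096 -E stride=16 /dev/VG/LV

and the corresponding XFS knob is 'su=64k' (or 'sunit' in 512B
sectors).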

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 18:44                 ` Peter Grandi
@ 2012-02-18 17:39                   ` Peter Grandi
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Grandi @ 2012-02-18 17:39 UTC (permalink / raw)
  To: Linux RAID

>>> On Fri, 17 Feb 2012 18:44:30 +0000, pg@lxra2.sabi.co.UK
>>> (Peter Grandi) said:

[ ... ]

> I have a few corrections and extensions for this message:
>   * With 'f2' the main price to pay is that *peak* writing speed

And the average. The peak writing speed is impaired by writing
two chunks to each drive, the average by the seeking.

Another way of looking at 'far' layouts is that, given the same
number of disks as 'near', they have twice the stripe width on reads,
but the same stripe width on writes, plus the seeking.

[ ... ]
>     - With 'f2' the first chunk gets written to the top of drive
>       1, and bottom of drive 2. Then the next chunk is written
>       to the top of drive 2, and the bottom of drive 3. Drive 2
>       writes must be serialized and arm must move half a disk.

Also note what happens on a whole-stripe write, which for
RAID10,n2 on 6 disks would have been 3 chunks, the width of the
RAID0 layer.

With RAID10,f2 the _apparent_ width of the RAID0 layer is 6
chunks, but since we need to write each chunk twice, we end up
writing two chunks per drive, with a half-disk seek between them,
which gives the same parallelism as RAID10,n2 plus the cost of the
seeks. So writes are slower than RAID10,n2 writes, with how much
slower depending on the frequency of the seeks.

[ ... ]

>   mdadm -C /dev/r10f2 -n raid10 -n 4 \
>     /dev/sda1 /dev/sdb2 \
>     /dev/sdb1 /dev/sda2

Should be '-l raid10 -p n2'.

[ ... ]

>   mdadm -C /dev/r10f2 -l raid0 -n 3 /dev/mirr1 /dev/mirr2 /dev/mirr3

Should be '/dev/raid10f3'. The mostly equivalent RAID10,n3
layout:

   mdadm -C /dev/r10f3 -l raid10 -p n3 -n 9  \
     /dev/sda1 /dev/sdb2 /dev/sdc3 \
     /dev/sdb1 /dev/sdc2 /dev/sda3 \
     /dev/sdc1 /dev/sda2 /dev/sdb3

(BTW I haven't tried any of these commands)
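
Equally untested: after creating such an array, the layout that was
actually applied can be checked with something like

    mdadm --detail /dev/r10f3 | grep -E 'Level|Layout|Devices'
    cat /proc/mdstat

which should report raid10, the 'near=3' layout and the device count.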

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: XFS on top RAID10 with odd drives count and 2 near copies
  2012-02-17 14:57                         ` David Brown
  2012-02-17 19:30                           ` Peter Grandi
@ 2012-02-19 14:46                           ` Peter Grandi
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Grandi @ 2012-02-19 14:46 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> What is the stripe spindle width of a 7 drive mdraid near array?

> With "near" layout, it is basically 3.5 spindles. [ ... ]

[ ... ]

>> [ ... ] The n,r rotate the data and mirror data writes around
>> the 4 drives.  So it is possible, and I assume this is the
>> case, to write data and mirror data 4 times, making the
>> stripe width 4, even though this takes twice as many RAID IOs
>> compared to the standard RAID10 layout. [ ... ]

> I think you are probably right here - it doesn't make sense to
> talk about a "3.5" spindle width.

As per my previous argument, the XFS stripe width here does not
matter, so the question here is really:

 "What is the IO transaction size that gives best sustained
  sequential single thread performance with 'O_DIRECT'?"

Note that the "with 'O_DIRECT'" qualification matters a great
deal because otherwise the page cache makes the application's IO
size largely irrelevant, making relevant the rate at which it
would issue read or write requests.

While I understand why a tempting answer is 3.5 *chunks*, the
best answer is 7 chunks because:

  * With a '2' layout the chunk placement pattern repeats every 14
    chunk slots, i.e. every two stripes (more generally over the
    least common multiple of the number of copies and the number of
    devices in a stripe), so we need only consider two stripes.

  * In the first stripe there are 3 pairs and, at the end, one half
    pair; in the second stripe there is one half pair followed by 3
    pairs.

  * It is pointless for *single threaded* access to read both
    chunks in a mirror pair. Without loss of generality, let's
    assume that we read just the first.

  * Then in the first stripe we can read 4 chunks in parallel,
    and in the second stripe 3, as the first chunk of that stripe
    is a copy of one we already read in the first row.

  * We don't need to choose between 3, 4 or 3.5 chunks, because if
    we read 7 chunks at a time we end up reading two full stripes,
    in the shortest time possible for two stripes.

  * The same argument applies to both reads and writes, even if
    writes have to write both members of each pair.

  * Stripe boundaries don't matter, but chunk boundaries matter for
    maximizing the transfer per disk, which matters if transfers
    have noticeable fixed costs.

Thus in some way it is 3.5 chunks per stripe "on average".
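
The placement behind that arithmetic can be sketched with a quick,
untested shell loop, numbering the 7 drives 0-6 and putting the two
copies of each logical chunk on consecutive devices:

    for c in $(seq 0 6); do
        echo "chunk $c -> drives $(( (2*c) % 7 )) and $(( (2*c + 1) % 7 ))"
    done

This prints 0/1, 2/3, 4/5, 6/0, 1/2, 3/4, 5/6 and then the pattern
repeats: seven logical chunks fill exactly two stripes, with four
first copies in the first stripe and three in the second.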

> If you call it 7, then it should work well even though each
> write takes two operations.

The number is right, but per the argument above this applies to
both reads and writes (with the given qualifications), and the
"two operations" probably means "over two stripes".

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2012-02-19 14:46 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-10 15:17 XFS on top RAID10 with odd drives count and 2 near copies CoolCold
2012-02-11  4:05 ` Stan Hoeppner
2012-02-11 14:32   ` David Brown
2012-02-12 20:16   ` CoolCold
2012-02-13  8:50     ` David Brown
2012-02-13  9:46       ` CoolCold
2012-02-13 11:19         ` David Brown
2012-02-13 13:46       ` Stan Hoeppner
2012-02-13  8:54     ` David Brown
2012-02-13  9:49       ` CoolCold
2012-02-13 12:09     ` Stan Hoeppner
2012-02-13 12:42       ` David Brown
2012-02-13 14:46         ` Stan Hoeppner
2012-02-13 21:40       ` CoolCold
2012-02-13 23:02         ` keld
2012-02-14  3:49           ` Stan Hoeppner
2012-02-14  8:58             ` David Brown
2012-02-14 11:38             ` keld
2012-02-14 23:27               ` Stan Hoeppner
2012-02-15  8:30                 ` Robin Hill
2012-02-15 13:30                   ` Stan Hoeppner
2012-02-15 14:03                     ` Robin Hill
2012-02-15 15:40                     ` David Brown
2012-02-17 13:16                       ` Stan Hoeppner
2012-02-17 14:57                         ` David Brown
2012-02-17 19:30                           ` Peter Grandi
2012-02-18 13:59                             ` David Brown
2012-02-19 14:46                           ` Peter Grandi
2012-02-17 19:03                         ` Peter Grandi
2012-02-17 22:12                           ` Stan Hoeppner
2012-02-18 17:09                           ` Peter Grandi
2012-02-15  9:24                 ` keld
2012-02-15 12:10                 ` David Brown
2012-02-15 13:08                   ` keld
2012-02-17 18:44                 ` Peter Grandi
2012-02-18 17:39                   ` Peter Grandi
2012-02-14  7:31           ` CoolCold
2012-02-14  9:05             ` David Brown
2012-02-14 11:10               ` Stan Hoeppner
2012-02-14  2:49         ` Stan Hoeppner
