* xfs_fsr, sunit, and swidth
@ 2013-03-13 18:11 Dave Hall
  2013-03-13 23:57 ` Dave Chinner
  2013-03-14  0:03 ` Stan Hoeppner
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hall @ 2013-03-13 18:11 UTC (permalink / raw)
  To: xfs

Does xfs_fsr react in any way to the sunit and swidth attributes of the
file system?  In other words, with an XFS filesystem set up directly on a
hardware RAID, it is recommended that the mount command be changed to
specify sunit and swidth values that reflect the new geometry of the
RAID.  In my case, these values were not specified on the mkfs.xfs of a
rather large file system running on a RAID 6 array.  I am wondering
whether adding sunit and swidth parameters to the fstab will cause
xfs_fsr to do anything different than it is already doing.  Most
importantly, will it improve performance in any way?

Thanks.

-Dave

-- 
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)	


* Re: xfs_fsr, sunit, and swidth
  2013-03-13 18:11 xfs_fsr, sunit, and swidth Dave Hall
@ 2013-03-13 23:57 ` Dave Chinner
  2013-03-14  0:03 ` Stan Hoeppner
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-03-13 23:57 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On Wed, Mar 13, 2013 at 02:11:19PM -0400, Dave Hall wrote:
> Does xfs_fsr react in any way to the sunit and swidth attributes of
> the file system?

Not directly.

> In other words, with an XFS filesystem set up
> directly on a hardware RAID, it is recommended that the mount
> command be changed to specify sunit and swidth values that reflect
> the new geometry of the RAID.

The mount option does nothing if sunit/swidth weren't specified at
mkfs time. sunit/swidth affect the initial layout of the filesystem,
and that cannot be altered after the fact. Hence you can't
arbitrarily change sunit/swidth after mkfs - you are limited to
changes that are compatible with the existing alignment. If you have
no alignment specified, then there isn't a new alignment that can be
verified as compatible with the existing layout.....
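
(A quick way to check, with the device name as a placeholder: xfs_info
reports the alignment the filesystem was created with, and sunit=0 /
swidth=0 means none was recorded at mkfs time.)

~$ xfs_info /dev/<device> | grep -E 'sunit|swidth'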

> In my case, these values were not
> specified on the mkfs.xfs of a rather large file system running on a
> RAID 6 array.

Which means the mount option won't work.

> I am wondering whether adding sunit and swidth parameters to
> the fstab will cause xfs_fsr to do anything different than it is
> already doing.  Most importantly, will it improve performance in any
> way?

It will make no difference at all.

A more important question: why do you even need to run xfs_fsr?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_fsr, sunit, and swidth
  2013-03-13 18:11 xfs_fsr, sunit, and swidth Dave Hall
  2013-03-13 23:57 ` Dave Chinner
@ 2013-03-14  0:03 ` Stan Hoeppner
       [not found]   ` <514153ED.3000405@binghamton.edu>
  1 sibling, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-14  0:03 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On 3/13/2013 1:11 PM, Dave Hall wrote:

> Does xfs_fsr react in any way to the sunit and swidth attributes of the
> file system?  

No, manually remounting with new stripe alignment and then running
xfs_fsr is not going to magically reorganize your filesystem.

> In other words, with an XFS filesystem set up directly on a
> hardware RAID, it is recommended that the mount command be changed to
> specify sunit and swidth values that reflect the new geometry of the
> RAID.  

This recommendation (as well as most things storage related) is workload
dependent.  A common misconception many people have is that XFS simply
needs to be aligned to the RAID stripe.  In reality, it's more critical
that XFS write out be aligned to the application's write pattern, and
thus, the hardware RAID stripe needs to be as well.  Another common
misconception is that simply aligning XFS to the RAID stripe will
automagically yield fully filled hardware stripes.  This is entirely
dependent on matching the hardware RAID stripe to the application's write
pattern.
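
(For illustration only, with placeholders rather than numbers for your
array: alignment is normally handed to mkfs as the RAID chunk size and
the number of data spindles.)

~$ mkfs.xfs -d su=<chunk_size>,sw=<nr_data_spindles> /dev/<device>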

> In my case, these values were not specified on the mkfs.xfs of a
> rather large file system running on a RAID 6 array.  I am wondering
> whether adding sunit and swidth parameters to the fstab will cause
> xfs_fsr to do anything different than it is already doing.

No, see above.  And read this carefully:  Aligning XFS affects write out
only during allocation.  It does not affect xfs_fsr.  Nor does it affect
non-allocation workloads, e.g. database inserts, writing new mail to
mbox files, etc.

> Most importantly, will it
> improve performance in any way?

You provided insufficient information for us to help you optimize
performance.  For us to even take a stab at answering this we need to
know at least:

1.  application/workload write pattern(s)  Is it allocation heavy?
        a.  small random IO
        b.  large streaming
        c.  If mixed, what is the ratio

2.  current hardware RAID parameters
        a.  strip/chunk size
        b.  # of effective spindles (RAID6 minus 2)

3.  Current percentage of filesystem bytes and inodes used
        a.  ~$ df /dev/[mount_point]
        b.  ~$ df -i /dev/[mount_point]

FWIW, parity RAID is abysmal with random writes, and especially so if
the hardware stripe width is larger than the workload's write IOs.
Thus, optimizing performance with hardware RAID and filesystems must be
done during the design phase of the storage.  For instance if you have a
RAID6 chunk/strip size of 512K and 8 spindles that's a 4MB stripe width.
 If your application is doing random allocation write out in 256K
chunks, you simply can't optimize performance without blowing away the
array and recreating.  For this example you'd need a chunk/strip of 32K
with 8 effective spindles which equals 256K.
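
A quick sanity check of those example numbers, as shell arithmetic
(purely illustrative):

~$ echo $((512 * 8))    # 4096 KB = 4MB stripe, far wider than 256K writes
~$ echo $((32 * 8))     # 256 KB stripe, matches the 256K write size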

Now, there is a possible silver lining here.  If your workload is doing
mostly large streaming writes, allocation or not, that are many
multiples of your current hardware RAID stripe, it doesn't matter if
your XFS is doing default 4K writes or if it has been aligned to the
RAID stripe.  In this case the controller's BBWC is typically going to
take the successive XFS 4K IOs and fill hardware stripes automatically.

So again, as always, the answer depends on your workload.

-- 
Stan


* Re: xfs_fsr, sunit, and swidth
       [not found]   ` <514153ED.3000405@binghamton.edu>
@ 2013-03-14 12:26     ` Stan Hoeppner
  2013-03-14 12:55       ` Stan Hoeppner
  0 siblings, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-14 12:26 UTC (permalink / raw)
  To: Dave Hall, xfs

On 3/13/2013 11:37 PM, Dave Hall wrote:
> Stan,
> 
> If you'd rather I can re-post this to xfs@oss.sgi.com, but I'm not clear
> on exactly where this address leads.  I am grateful for your response.

No need, I'm CC'ing the list address.  Read this entirely before hitting
reply.

> So the details are that this is a 16 x 2TB 7200 rpm SATA drive array in
> a RAID enclosure.   The array is configured RAID6 (so 14 data spindles)
> with a chunk size of 128k.  The XFS formatted size is 26TB with 19TB
> currently used.

So your RAID6 stripe width is 14 * 128KB = 1,792KB.

> The workload is a backup program called rsnapshot.  If you're not
> familiar, this program uses cp -al to create a linked copy of the
> previous backup, and then rsync -av --del to copy in any changes. The
> current snapshots contain about 14.8 million files.  The total number of
> snapshots is about 600.

So you've got a metadata heavy workload with lots of links being created.

> The performance problem that led me to investigate XFS is that some
> time around mid-November the cp -al step started running very long -
> sometimes over 48 hours.  Sometimes it runs in just a few hours. Prior
> to then the entire backup consistently finished in less than 12 hours.
> When the cp -al is running long the output of dstat indicates that the
> I/O to the fs is fairly light.

The 'cp -al' command is a pure metadata workload, which means lots of
writes to the filesystem directory trees, but not into files.  And if
your kernel is lower than 2.6.39 your log throughput would be pretty
high as well.  But given this is RAID6 you'll have significant RMW for
these directory writes, maybe overwhelming RMW, driving latency up and
thus actual bandwidth down.  So dstat bytes throughput may be low, but
%wa may be through the roof, making the dstat data you're watching
completely misleading as to what's really going on, what's causing the
problem.
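
One way to see that (a suggestion only; iostat is in the sysstat
package and sdb is an assumed device name) is to watch per-device
latency and utilization rather than raw bytes while the cp -al runs:

~$ iostat -x sdb 5      # watch the await and %util columns, not just kB/s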

> Please let me know if you need any further information.  

Yes,  please provide the output of the following commands:

~$ grep xfs /etc/fstab
~$ xfs_info /dev/[mount-point]
~$ df /dev/[mount_point]
~$ df -i /dev/[mount_point]
~$ xfs_db -r -c freesp /dev/[mount-point]

Also please provide the make/model of the RAID controller, the write
cache size and if it is indeed enabled and working, as well as any
errors, if any, logged by the controller in dmesg or elsewhere in Linux,
or in the controller firmware.

> Also, again, I
> can post this to xfs@oss.sgi.com but I'd really like to know more about
> the address.

Makes me wonder where you obtained the list address.  Apparently not from the
official websites or you'd not have to ask.  Maybe this will assuage
your fears. ;)

xfs@oss.sgi.com is the official XFS mailing list submission address for
the XFS developers and users.  oss.sgi.com is the server provided and
managed by SGI (www.sgi.com) that houses the XFS open source project.
SGI created the XFS filesystem first released on their proprietary
IRIX/MIPS computers in 1994.  SGI open sourced XFS and ported it to
Linux in the early 2000s.

XFS is actively developed by a fairly large group of people, and AFAIK
most of them are currently employed by Red Hat, including Dave Chinner,
who also replied to your post.  Dave wrote the delaylog code which will
probably go a long way toward fixing your problem, if you're currently
using 2.6.38 or lower and not mounting with this option enabled.  It
didn't become the default until 2.6.39.

More info here http://www.xfs.org and here http://oss.sgi.com/projects/xfs/

> Thanks.

You bet.

-- 
Stan


> -Dave
> 
> 
> On 3/13/2013 8:03 PM, Stan Hoeppner wrote:
>> On 3/13/2013 1:11 PM, Dave Hall wrote:
>>
>>> Does xfs_fsr react in any way to the sunit and swidth attributes of the
>>> file system?
>> No, manually remounting with new stripe alignment and then running
>> xfs_fsr is not going to magically reorganize your filesystem.
>>
>>> In other words, with an XFS filesystem set up directly on a
>>> hardware RAID, it is recommended that the mount command be changed to
>>> specify sunit and swidth values that reflect the new geometry of the
>>> RAID.
>> This recommendation (as well as most things storage related) is workload
>> dependent.  A common misconception many people have is that XFS simply
>> needs to be aligned to the RAID stripe.  In reality, it's more critical
>> that XFS write out be aligned to the application's write pattern, and
>> thus, the hardware RAID stripe needs to be as well.  Another common
>> misconception is that simply aligning XFS to the RAID stripe will
>> automagically yield fully filled hardware stripes.  This is entirely
>> dependent on matching the hardware RAID stripe to the application's write
>> pattern.
>>
>>> In my case, these values were not specified on the mkfs.xfs of a
>>> rather large file system running on a RAID 6 array.  I am wondering
>>> whether adding sunit and swidth parameters to the fstab will cause
>>> xfs_fsr to do anything different than it is already doing.
>> No, see above.  And read this carefully:  Aligning XFS affects write out
>> only during allocation.  It does not affect xfs_fsr.  Nor does it affect
>> non-allocation workloads, e.g. database inserts, writing new mail to
>> mbox files, etc.
>>
>>> Most importantly, will it
>>> improve performance in any way?
>> You provided insufficient information for us to help you optimize
>> performance.  For us to even take a stab at answering this we need to
>> know at least:
>>
>> 1.  application/workload write pattern(s)  Is it allocation heavy?
>>          a.  small random IO
>>          b.  large streaming
>>          c.  If mixed, what is the ratio
>>
>> 2.  current hardware RAID parameters
>>          a.  strip/chunk size
>>          b.  # of effective spindles (RAID6 minus 2)
>>
>> 3.  Current percentage of filesystem bytes and inodes used
>>          a.  ~$ df /dev/[mount_point]
>>          b.  ~$ df -i /dev/[mount_point]
>>
>> FWIW, parity RAID is abysmal with random writes, and especially so if
>> the hardware stripe width is larger than the workload's write IOs.
>> Thus, optimizing performance with hardware RAID and filesystems must be
>> done during the design phase of the storage.  For instance if you have a
>> RAID6 chunk/strip size of 512K and 8 spindles that's a 4MB stripe width.
>>   If your application is doing random allocation write out in 256K
>> chunks, you simply can't optimize performance without blowing away the
>> array and recreating.  For this example you'd need a chunk/strip of 32K
>> with 8 effective spindles which equals 256K.
>>
>> Now, there is a possible silver lining here.  If your workload is doing
>> mostly large streaming writes, allocation or not, that are many
>> multiples of your current hardware RAID stripe, it doesn't matter if
>> your XFS is doing default 4K writes or if it has been aligned to the
>> RAID stripe.  In this case the controller's BBWC is typically going to
>> take the successive XFS 4K IOs and fill hardware stripes automatically.
>>
>> So again, as always, the answer depends on your workload.
>>
> 


* Re: xfs_fsr, sunit, and swidth
  2013-03-14 12:26     ` Stan Hoeppner
@ 2013-03-14 12:55       ` Stan Hoeppner
  2013-03-14 14:59         ` Dave Hall
  0 siblings, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-14 12:55 UTC (permalink / raw)
  To: stan; +Cc: Dave Hall, xfs

Quick note below, need one more bit of info.

On 3/14/2013 7:26 AM, Stan Hoeppner wrote:
> On 3/13/2013 11:37 PM, Dave Hall wrote:
>> Stan,
>>
>> If you'd rather I can re-post this to xfs@oss.sgi.com, but I'm not clear
>> on exactly where this address leads.  I am grateful for your response.
> 
> No need, I'm CC'ing the list address.  Read this entirely before hitting
> reply.
> 
>> So the details are that this is a 16 x 2TB 7200 rpm SATA drive array in
>> a RAID enclosure.   The array is configured RAID6 (so 14 data spindles)
>> with a chunk size of 128k.  The XFS formatted size is 26TB with 19TB
>> currently used.
> 
> So your RAID6 stripe width is 14 * 128KB = 1,792KB.
> 
>> The workload is a backup program called rsnapshot.  If you're not
>> familiar, this program uses cp -al to create a linked copy of the
>> previous backup, and then rsync -av --del to copy in any changes. The
>> current snapshots contain about 14.8 million files.  The total number of
>> snapshots is about 600.
> 
> So you've got a metadata heavy workload with lots of links being created.
> 
>> The performance problem that led me to investigate XFS is that some
>> time around mid-November the cp -al step started running very long -
>> sometimes over 48 hours.  Sometimes it runs in just a few hours. Prior
>> to then the entire backup consistently finished in less than 12 hours. 
>> When the cp -al is running long the output of dstat indicates that the
>> I/O to the fs is fairly light.
> 
> The 'cp -al' command is a pure metadata workload, which means lots of
> writes to the filesystem directory trees, but not into files.  And if
> your kernel is lower than 2.6.39 your log throughput would be pretty
> high as well.  But given this is RAID6 you'll have significant RMW for
> these directory writes, maybe overwhelming RMW, driving latency up and
> thus actual bandwidth down.  So dstat bytes throughput may be low, but
> %wa may be through the roof, making the dstat data you're watching
> completely misleading as to what's really going on, what's causing the
> problem.
> 
>> Please let me know if you need any further information.  
> 
> Yes,  please provide the output of the following commands:

~$ uname -a

> ~$ grep xfs /etc/fstab
> ~$ xfs_info /dev/[mount-point]
> ~$ df /dev/[mount_point]
> ~$ df -i /dev/[mount_point]
> ~$ xfs_db -r -c freesp /dev/[mount-point]
> 
> Also please provide the make/model of the RAID controller, the write
> cache size and if it is indeed enabled and working, as well as any
> errors, if any, logged by the controller in dmesg or elsewhere in Linux,
> or in the controller firmware.
> 
>> Also, again, I
>> can post this to xfs@oss.sgi.com but I'd really like to know more about
>> the address.
> 
> Makes me wonder where you obtained the list address.  Apparently not from the
> official websites or you'd not have to ask.  Maybe this will assuage
> your fears. ;)
> 
> xfs@oss.sgi.com is the official XFS mailing list submission address for
> the XFS developers and users.  oss.sgi.com is the server provided and
> managed by SGI (www.sgi.com) that houses the XFS open source project.
> SGI created the XFS filesystem first released on their proprietary
> IRIX/MIPS computers in 1994.  SGI open sourced XFS and ported it to
> Linux in the early 2000s.
> 
> XFS is actively developed by a fairly large group of people, and AFAIK
> most of them are currently employed by Red Hat, including Dave Chinner,
> who also replied to your post.  Dave wrote the delaylog code which will
> probably go a long way toward fixing your problem, if you're currently
> using 2.6.38 or lower and not mounting with this option enabled.  It
> didn't become the default until 2.6.39.
> 
> More info here http://www.xfs.org and here http://oss.sgi.com/projects/xfs/
> 
>> Thanks.
> 
> You bet.
> 


* Re: xfs_fsr, sunit, and swidth
  2013-03-14 12:55       ` Stan Hoeppner
@ 2013-03-14 14:59         ` Dave Hall
  2013-03-14 18:07           ` Stefan Ring
  2013-03-15  5:14           ` Stan Hoeppner
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hall @ 2013-03-14 14:59 UTC (permalink / raw)
  To: stan; +Cc: xfs


Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 03/14/2013 08:55 AM, Stan Hoeppner wrote:
>> Yes,  please provide the output of the following commands:
>>
> ~$ uname -a
>
Linux decoy 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64 
GNU/Linux
>> ~$ grep xfs /etc/fstab
LABEL=backup        /infortrend    xfs    
inode64,noatime,nodiratime,nobarrier    0    0
(cat /proc/mounts:  /dev/sdb1 /infortrend xfs 
rw,noatime,nodiratime,attr2,delaylog,nobarrier,inode64,noquota 0 0)

Note that there is also a second XFS on a separate 3ware raid card, but 
the I/O traffic on that one is fairly low.  It is used as a staging area 
for a Debian mirror that is hosted on another server.
>> ~$ xfs_info /dev/[mount-point]
# xfs_info /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=26, 
agsize=268435455 blks
          =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=6836364800, imaxpct=5
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
>> ~$ df /dev/[mount_point]
# df /dev/sdb1
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            27343372288 20432618356 6910753932  75% /infortrend
>> ~$ df -i /dev/[mount_point]
# df -i /dev/sdb1
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb1            5469091840 1367746380 4101345460   26% /infortrend
>> ~$ xfs_db -r -c freesp /dev/[mount-point]
# xfs_db -r -c freesp /dev/sdb1
    from      to extents  blocks    pct
       1       1  832735  832735   0.05
       2       3  432183 1037663   0.06
       4       7  365573 1903965   0.11
       8      15  352402 3891608   0.23
      16      31  332762 7460486   0.43
      32      63  300571 13597941   0.79
      64     127  233778 20900655   1.21
     128     255  152003 27448751   1.59
     256     511  112673 40941665   2.37
     512    1023   82262 59331126   3.43
    1024    2047   53238 76543454   4.43
    2048    4095   34092 97842752   5.66
    4096    8191   22743 129915842   7.52
    8192   16383   14453 162422155   9.40
   16384   32767    8501 190601554  11.03
   32768   65535    4695 210822119  12.20
   65536  131071    2615 234787546  13.59
  131072  262143    1354 237684818  13.76
  262144  524287     470 160228724   9.27
  524288 1048575      74 47384798   2.74
1048576 2097151       1 2097122   0.12
>>
>> Also please provide the make/model of the RAID controller, the write
>> cache size and if it is indeed enabled and working, as well as any
>> errors, if any, logged by the controller in dmesg or elsewhere in Linux,
>> or in the controller firmware.
>>
The RAID box is an Infortrend S16S-G1030 with 512MB cache and a fully 
functional battery.  I couldn't find any details about the internal
RAID implementation used by Infortrend.  The array is SAS attached to
an LSI HBA (SAS2008 PCI-Express Fusion-MPT SAS-2).

The system hardware is a SuperMicro quad 8-core Xeon E7-4820 2.0GHz with
128 GB of RAM, hyper-threading enabled.  (This is something that I
inherited.  There is no doubt that it is overkill.)
>>> >>  
Another bit of information that you didn't ask about is the I/O 
scheduler algorithm.  I just checked and found it set to 'cfq', although 
I thought I had set it to 'noop' via a kernel parameter in GRUB.

Also, some observations about the cp -al:  In parallel to investigating
the hardware/OS/filesystem issue I have done some experiments with cp -al.
It hurts to have 64 cores available and see cp -al running the wheels 
off just one, with a couple others slightly active with system level 
duties.  So I tried some experiments where I copied smaller segments of 
the file tree in parallel (using make -j).  I haven't had the chance to 
fully play this out, but these parallel cp invocations completed very 
quickly.  So it would appear that the cp command itself may bog down 
with such a large file tree.  I haven't had a chance to tear apart the 
source code or do any profiling to see if there are any obvious problems 
there.

Lastly, I will mention that I see almost 0% wa when watching top.


* Re: xfs_fsr, sunit, and swidth
  2013-03-14 14:59         ` Dave Hall
@ 2013-03-14 18:07           ` Stefan Ring
  2013-03-15  5:14           ` Stan Hoeppner
  1 sibling, 0 replies; 32+ messages in thread
From: Stefan Ring @ 2013-03-14 18:07 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

> Lastly, I will mention that I see almost 0% wa when watching top.

I notice that XFS in general will report less % wa than ext4, although
it exercises the disks a bit more when traversing a large directory
tree, for example. But with 64 cores, you will see at most 1.5% in top
anyway, if one process is doing nothing but waiting on the disk.
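
(That ceiling is simply one CPU's worth of wait spread across 64:)

~$ echo "scale=2; 100/64" | bc    # 1.56 -- the most one blocked thread can add to %wa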


* Re: xfs_fsr, sunit, and swidth
  2013-03-14 14:59         ` Dave Hall
  2013-03-14 18:07           ` Stefan Ring
@ 2013-03-15  5:14           ` Stan Hoeppner
  2013-03-15 11:45             ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-15  5:14 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On 3/14/2013 9:59 AM, Dave Hall wrote:

> Linux decoy 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64
> GNU/Linux

Ok, so you're already on a recent kernel with delaylog.

>>> ~$ grep xfs /etc/fstab
>>>
> LABEL=backup        /infortrend    xfs   
> inode64,noatime,nodiratime,nobarrier    0    0

XFS uses relatime by default, so noatime/nodiratime are useless, though
not part of the problem.  inode64 is good as your files and metadata
have locality.  Nobarrier is good with functioning BBWC.

> meta-data=/dev/sdb1              isize=256    agcount=26,
> agsize=268435455 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=6836364800, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Standard internal log, no alignment.  With delaylog, 512MB BBWC, and a
nearly pure metadata workload, this should be fine.

> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sdb1            27343372288 20432618356 6910753932  75% /infortrend

Looks good.  75% is close to tickling the free space fragmentation
dragon but you're not there yet.

> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> /dev/sdb1            5469091840 1367746380 4101345460   26% /infortrend

Plenty of free inodes.

> # xfs_db -r -c freesp /dev/sdb1
>    from      to extents  blocks    pct
>       1       1  832735  832735   0.05
>       2       3  432183 1037663   0.06
>       4       7  365573 1903965   0.11
>       8      15  352402 3891608   0.23
>      16      31  332762 7460486   0.43
>      32      63  300571 13597941   0.79
>      64     127  233778 20900655   1.21
>     128     255  152003 27448751   1.59
>     256     511  112673 40941665   2.37
>     512    1023   82262 59331126   3.43
>    1024    2047   53238 76543454   4.43
>    2048    4095   34092 97842752   5.66
>    4096    8191   22743 129915842   7.52
>    8192   16383   14453 162422155   9.40
>   16384   32767    8501 190601554  11.03
>   32768   65535    4695 210822119  12.20
>   65536  131071    2615 234787546  13.59
>  131072  262143    1354 237684818  13.76
>  262144  524287     470 160228724   9.27
>  524288 1048575      74 47384798   2.74
> 1048576 2097151       1 2097122   0.12

Your free space map isn't completely horrible given you're at 75%
capacity.  Looks like most of it is in chunks 32MB and larger.  Those
14.8m files have a mean size of ~1.22MB which suggests most of the files
are small, so you shouldn't be having high seek load (thus latency)
during allocation.

> The RAID box is an Infortrend S16S-G1030 with 512MB cache and a fully
> functional battery.  I couldn't find  any details about the internal
> RAID implementation used by Infortrend.   The array is SAS attached to
> an LSI HBA (SAS2008 PCI-Express Fusion-MPT SAS-2).

It's an older unit, definitely not the fastest in its class, but unless
the firmware is horrible the 512MB BBWC should handle this metadata
workload with aplomb.  With 128GB RAM and Linux read-ahead caching you
don't need the RAID controller to be doing read caching.  Go into the
SANWatch interface and make sure you're dedicating all the cache to
writes not reads.  This may or may not be configurable.  Some firmware
will simply drop read cache lines dynamically when writes come in.  Some
let you manually tweak the ratio.  I'm not that familiar with the
Infortrend units.  But again, this is a minor optimization, and I don't
think this is part of the problem.

> The system hardware is a SuperMicro quad 8-core XEON E7-4820 2.0GHz with
> 128 GB of ram, hyper-theading enabled.  (This is something that I
> inherited.  There is no doubt that it is overkill.)

Just a bit.  64 hardware threads, 72MB of L3 cache, and 128GB RAM for a
storage server with two storage HBAs and low throughput disk arrays.
Apparently running a Debian mirror is more compute intensive than I
previously thought...

> Another bit of information that you didn't ask about is the I/O
> scheduler algorithm.  

Didn't get that far yet. ;)

> I just checked and found it set to 'cfq', although
> I though I had set it to 'noop' via a kernel parameter in GRUB.

As you're using a distro kernel, I recommend simply doing it in root's
crontab.  That way it can't get 'lost' during kernel upgrades due to
grub update problems, etc.  The scheduler can be changed on the fly so
it doesn't matter where you set it in the boot sequence.

@reboot		/bin/echo noop > /sys/block/sdb/queue/scheduler
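
(To confirm it took effect, check sysfs; the active scheduler is the
bracketed entry:)

~$ cat /sys/block/sdb/queue/scheduler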

> Also, some observations about the cp -al:  In parallel to investigating
> hardware/OS/filesystem issue I have done some experiments with cp -al. 
> It hurts to have 64 cores available and see cp -al running the wheels
> off just one, with a couple others slightly active with system level
> duties.  

This tends to happen when one runs single threaded user space code on a
large multiprocessor.

> So I tried some experiments where I copied smaller segments of
> the file tree in parallel (using make -j).  I haven't had the chance to
> fully play this out, but these parallel cp invocations completed very
> quickly.  So it would appear that the cp command itself may bog down
> with such a large file tree.  I haven't had a chance to tear apart the
> source code or do any profiling to see if there are any obvious problems
> there.
> 
> Lastly, I will mention that I see almost 0% wa when watching top.

So it's probably safe to say at this point that XFS and IO in general
are not the problem.

One thing you did not mention is how you are using rsnapshot.  If you
are using it as most folks do, to back up remote filesystems of other
machines over ethernet, what happens when you simply schedule multiple
rsnapshot processes concurrently, targeting each at a different remote
machine?
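
Something like this crontab sketch, say, where the per-host config
files and the 'daily' interval name are assumptions about how your
rsnapshot is set up:

# one rsnapshot config per remote host, all kicked off together
0 1 * * *   /usr/bin/rsnapshot -c /etc/rsnapshot-hostA.conf daily
0 1 * * *   /usr/bin/rsnapshot -c /etc/rsnapshot-hostB.conf daily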

If you're using rsnapshot strictly locally, you should take a hard look
at xfsdump.  It exists specifically for backing up XFS filesystems/files,
has been around a very long time, and is very mature.  It's not quite as
flexible as rsnapshot and may require more disk space, but it is
lightning fast, even though limited to a single thread on Linux.  Why is
it lightning fast?  Because the bulk of the work is performed in kernel
space by the XFS driver, directly manipulating the filesystem--no user
space execution or system calls.  See 'man xfsdump'.

Familiarize yourself with it and perform a test dump, to a file, of a
large (~1TB) directory/tree.  You'll see what we mean by lightning fast,
compared to rsnapshot and other user space methods.  And you'll actually
see some IO throughput with this. ;)
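
Something along these lines, where the destination path and the
snapshot directory name are only placeholders, and the dump file should
live on a different filesystem than the one being dumped:

~$ xfsdump -l 0 -s daily.0 -L testdump -M testmedia \
       -f /some/other/fs/infortrend-test.dump /infortrend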

-- 
Stan


* Re: xfs_fsr, sunit, and swidth
  2013-03-15  5:14           ` Stan Hoeppner
@ 2013-03-15 11:45             ` Dave Chinner
  2013-03-16  4:47               ` Stan Hoeppner
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2013-03-15 11:45 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Dave Hall, xfs

On Fri, Mar 15, 2013 at 12:14:40AM -0500, Stan Hoeppner wrote:
> On 3/14/2013 9:59 AM, Dave Hall wrote:
> Looks good.  75% is close to tickling the free space fragmentation
> dragon but you're not there yet.

Don't be so sure ;)

> 
> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> > /dev/sdb1            5469091840 1367746380 4101345460   26% /infortrend
> 
> Plenty of free inodes.
> 
> > # xfs_db -r -c freesp /dev/sdb1
> >    from      to extents  blocks    pct
> >       1       1  832735  832735   0.05
> >       2       3  432183 1037663   0.06
> >       4       7  365573 1903965   0.11
> >       8      15  352402 3891608   0.23
> >      16      31  332762 7460486   0.43
> >      32      63  300571 13597941   0.79
> >      64     127  233778 20900655   1.21
> >     128     255  152003 27448751   1.59
> >     256     511  112673 40941665   2.37
> >     512    1023   82262 59331126   3.43
> >    1024    2047   53238 76543454   4.43
> >    2048    4095   34092 97842752   5.66
> >    4096    8191   22743 129915842   7.52
> >    8192   16383   14453 162422155   9.40
> >   16384   32767    8501 190601554  11.03
> >   32768   65535    4695 210822119  12.20
> >   65536  131071    2615 234787546  13.59
> >  131072  262143    1354 237684818  13.76
> >  262144  524287     470 160228724   9.27
> >  524288 1048575      74 47384798   2.74
> > 1048576 2097151       1 2097122   0.12
> 
> Your free space map isn't completely horrible given you're at 75%
> capacity.  Looks like most of it is in chunks 32MB and larger.  Those
> 14.8m files have a mean size of ~1.22MB which suggests most of the files
> are small, so you shouldn't be having high seek load (thus latency)
> during allocation.

FWIW, you can't really tell how bad the freespace fragmentation is
from the global output like this. All of the large contiguous free
space might be in one or two AGs, and the others might be badly
fragmented. Hence you need to at least sample a few AGs to determine
if this is representative of the freespace in each AG....
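
For example, to get a quick per-AG picture (freesp's -s flag prints
totals and an average free extent size):

~$ xfs_db -r -c "freesp -s -a 0" /dev/sdb1   # repeat for a handful of AGs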

As it is, the above output raises alarms for me. What I see is that
the number of small extents massively outnumbers the large extents.
The fact that there are roughly 2.5 million extents smaller than 63
blocks and that there is only one freespace extent larger than 4GB
indicates to me that free space is substantially fragmented. At 25%
free space, that's 250GB per AG, and if the largest freespace in
most AGs is less than 4GB in length, then free space is not
contiguous. i.e.  Free space appears to be heavily weighted towards
small extents...

So, the above output would lead me to investigate the freespace
layout more deeply to determine if this is going to affect the
workload that is being run...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_fsr, sunit, and swidth
  2013-03-15 11:45             ` Dave Chinner
@ 2013-03-16  4:47               ` Stan Hoeppner
  2013-03-16  7:21                 ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-16  4:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Hall, xfs

On 3/15/2013 6:45 AM, Dave Chinner wrote:
> On Fri, Mar 15, 2013 at 12:14:40AM -0500, Stan Hoeppner wrote:
>> On 3/14/2013 9:59 AM, Dave Hall wrote:
>> Looks good.  75% is close to tickling the free space fragmentation
>> dragon but you're not there yet.
> 
> Don't be so sure ;)

The only thing I'm sure of is that I'll always be learning something new
about XFS and how to troubleshoot it. ;)

>>
>>> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>>> /dev/sdb1            5469091840 1367746380 4101345460   26% /infortrend
>>
>> Plenty of free inodes.
>>
>>> # xfs_db -r -c freesp /dev/sdb1
>>>    from      to extents  blocks    pct
>>>       1       1  832735  832735   0.05
>>>       2       3  432183 1037663   0.06
>>>       4       7  365573 1903965   0.11
>>>       8      15  352402 3891608   0.23
>>>      16      31  332762 7460486   0.43
>>>      32      63  300571 13597941   0.79
>>>      64     127  233778 20900655   1.21
>>>     128     255  152003 27448751   1.59
>>>     256     511  112673 40941665   2.37
>>>     512    1023   82262 59331126   3.43
>>>    1024    2047   53238 76543454   4.43
>>>    2048    4095   34092 97842752   5.66
>>>    4096    8191   22743 129915842   7.52
>>>    8192   16383   14453 162422155   9.40
>>>   16384   32767    8501 190601554  11.03
>>>   32768   65535    4695 210822119  12.20
>>>   65536  131071    2615 234787546  13.59
>>>  131072  262143    1354 237684818  13.76
>>>  262144  524287     470 160228724   9.27
>>>  524288 1048575      74 47384798   2.74
>>> 1048576 2097151       1 2097122   0.12
>>
>> Your free space map isn't completely horrible given you're at 75%
>> capacity.  Looks like most of it is in chunks 32MB and larger.  Those
>> 14.8m files have a mean size of ~1.22MB which suggests most of the files
>> are small, so you shouldn't be having high seek load (thus latency)
>> during allocation.
> 
> FWIW, you can't really tell how bad the freespace fragmentation is
> from the global output like this. 

True.

> All of the large contiguous free
> space might be in one or two AGs, and the others might be badly
> fragmented. Hence you need to at least sample a few AGs to determine
> if this is representative of the freespace in each AG....

What would be representative of 26 AGs?  First, middle, last?  So Mr.
Hall would execute:

~$ xfs_db -r /dev/sdb1
xfs_db> freesp -a0
...
xfs_db> freesp -a13
...
xfs_db> freesp -a26
...
xfs_db> quit

> As it is, the above output raises alarms for me. What I see is that
> the number of small extents massively outnumbers the large extents.
> The fact that there are roughly 2.5 million extents smaller than 63
> blocks and that there is only one freespace extent larger than 4GB
> indicates to me that free space is substantially fragmented. At 25%
> free space, that's 250GB per AG, and if the largest freespace in
> most AGs is less than 4GB in length, then free space is not
> contiguous. i.e.  Free space appears to be heavily weighted towards
> small extents...

It didn't raise alarms for me.  This is an rsnapshot workload with
millions of small files.  For me it was a foregone conclusion he'd have
serious fragmentation.  What I was looking at is whether it's severe
enough to be a factor in his stated problem.  I don't think it is.  In
fact I think it's completely unrelated, which is why I didn't go into
deeper analysis of this.  Though I could be incorrect. ;)

> So, the above output would lead me to investigate the freespace
> layout more deeply to determine if this is going to affect the
> workload that is being run...

May be time to hold class again, Dave, as I'm probably missing something.
 His slowdown is serial hardlink creation with "cp -al" of many millions
of files.  Hardlinks are metadata structures, which means this workload
modifies btrees and inodes, not extents, right?

XFS directory metadata is stored closely together in each AG, correct?
'cp -al' is going to walk directories in order, which means we're going
have good read caching of the directory information thus little to no
random read IO.  The cp is then going to create a hardlink per file.
Now, even with the default 4KB write alignment, we should be getting a
large bundle of hardlinks per write.  And I would think the 512MB BBWC
on the array controller, if firmware is decent, should do a good job of
merging these to mitigate RMW cycles.

The OP is seeing 100% CPU for the cp operation, almost no IO, and no
iowait.  If XFS or RMW were introducing any latency I'd think we'd see
some iowait.

Thus I believe at this point, the problem is those millions of serial
user space calls in a single Perl thread causing the high CPU burn,
little IO, and long run time, not XFS nor the storage.  And I think the
OP came to this conclusion as well, without waiting on our analysis of
his filesystem.

Regardless of the OP's course of action, I of course welcome critique of
my analysis, so I learn new things and improve for future cases.
Specifically WRT high metadata modification workloads on parity RAID
storage.  Which is what this OP could actually have if he runs many
rsnaphots in parallel.  With 32 cores/64 threads and 128GB RAM he can
certainly generate much higher rsnapshot load on his filesystem and
storage, if he chooses to.

-- 
Stan


* Re: xfs_fsr, sunit, and swidth
  2013-03-16  4:47               ` Stan Hoeppner
@ 2013-03-16  7:21                 ` Dave Chinner
  2013-03-16 11:45                   ` Stan Hoeppner
  2013-03-25 17:00                   ` Dave Hall
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2013-03-16  7:21 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Dave Hall, xfs

On Fri, Mar 15, 2013 at 11:47:08PM -0500, Stan Hoeppner wrote:
> On 3/15/2013 6:45 AM, Dave Chinner wrote:
> > On Fri, Mar 15, 2013 at 12:14:40AM -0500, Stan Hoeppner wrote:
> >> On 3/14/2013 9:59 AM, Dave Hall wrote:
> >> Looks good.  75% is close to tickling the free space fragmentation
> >> dragon but you're not there yet.
> > 
> > Don't be so sure ;)
> 
> The only thing I'm sure of is that I'll always be learning something new
> about XFS and how to troubleshoot it. ;)
> 
> >>
> >>> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >>> /dev/sdb1            5469091840 1367746380 4101345460   26% /infortrend
> >>
> >> Plenty of free inodes.
> >>
> >>> # xfs_db -r -c freesp /dev/sdb1
> >>>    from      to extents  blocks    pct
> >>>       1       1  832735  832735   0.05
> >>>       2       3  432183 1037663   0.06
> >>>       4       7  365573 1903965   0.11
> >>>       8      15  352402 3891608   0.23
> >>>      16      31  332762 7460486   0.43
> >>>      32      63  300571 13597941   0.79
> >>>      64     127  233778 20900655   1.21
> >>>     128     255  152003 27448751   1.59
> >>>     256     511  112673 40941665   2.37
> >>>     512    1023   82262 59331126   3.43
> >>>    1024    2047   53238 76543454   4.43
> >>>    2048    4095   34092 97842752   5.66
> >>>    4096    8191   22743 129915842   7.52
> >>>    8192   16383   14453 162422155   9.40
> >>>   16384   32767    8501 190601554  11.03
> >>>   32768   65535    4695 210822119  12.20
> >>>   65536  131071    2615 234787546  13.59
> >>>  131072  262143    1354 237684818  13.76
> >>>  262144  524287     470 160228724   9.27
> >>>  524288 1048575      74 47384798   2.74
> >>> 1048576 2097151       1 2097122   0.12
> >>
> >> Your free space map isn't completely horrible given you're at 75%
> >> capacity.  Looks like most of it is in chunks 32MB and larger.  Those
> >> 14.8m files have a mean size of ~1.22MB which suggests most of the files
> >> are small, so you shouldn't be having high seek load (thus latency)
> >> during allocation.
> > 
> > FWIW, you can't really tell how bad the freespace fragmentation is
> > from the global output like this. 
> 
> True.
> 
> > All of the large contiguous free
> > space might be in one or two AGs, and the others might be badly
> > fragmented. Hence you need to at least sample a few AGs to determine
> > if this is representative of the freespace in each AG....
> 
> What would be representative of 26AGs?  First, middle, last?  So Mr.
> Hall would execute:
> 
> ~$ xfs_db -r /dev/sdb1
> xfs_db> freesp -a0
> ...
> xfs_db> freesp -a13
> ...
> xfs_db> freesp -a26
> ...
> xfs_db> quit

Yup, though I normally just  run something like:

# for i in `seq 0 1 <agcount - 1>`; do
> xfs_db -c "freesp -a $i" <dev>
> done

To look at the them all quickly...

> > As it is, the above output raises alarms for me. What I see is that
> > the number of small extents massively outnumbers the large extents.
> > The fact that there are roughly 2.5 million extents smaller than 63
> > blocks and that there is only one freespace extent larger than 4GB
> > indicates to me that free space is substantially fragmented. At 25%
> > free space, that's 250GB per AG, and if the largest freespace in
> > most AGs is less than 4GB in length, then free space is not
> > contiguous. i.e.  Free space appears to be heavily weighted towards
> > small extents...
> 
> It didn't raise alarms for me.  This is an rsnapshot workload with
> millions of small files.  For me it was a foregone conclusion he'd have
> serious fragmentation.  What I was looking at is whether it's severe
> enough to be a factor in his stated problem.  I don't think it is.  In
> fact I think it's completely unrelated, which is why I didn't go into
> deeper analysis of this.  Though I could be incorrect. ;)

Ok, so what size blocks are the metadata held in? 1-4 filesystem
block extents. So, when we do a by-size freespace btree lookup, we
don't find a large freespace to allocate from. So we fall back to a
by-blkno search down the freespace btree to find a nearby block of
sufficient size. That search runs until we run off one end of the
freespace btree. And when this might have to walk along several tens
of thousand of btree records, each allocation will consume a *lot*
of CPU time. How much? well, compared to finding a large freespace
extent, think orders of magnitude more CPU overhead per
allocation...

> > So, the above output would lead me to investigate the freespace
> > layout more deeply to determine if this is going to affect the
> > workload that is being run...
> 
> May be time to hold class again Dave as I'm probably missing something.
>  His slowdown is serial hardlink creation with "cp -al" of many millions
> of files.  Hardlinks are metadata structures, which means this workload
> modifies btrees and inodes, not extents, right?

It modifies directories and inodes, and adding directory entries
requires allocation of new directory blocks, and that requires
scanning of the freespace trees....

> XFS directory metadata is stored closely together in each AG, correct?
> 'cp -al' is going to walk directories in order, which means we're going
> have good read caching of the directory information thus little to no
> random read IO. 

not if the directory is fragmented. If freespace is fragmented, then
there's a good chance that directory blocks are not going to have
good locality, though the effect of that will be minimised by the
directory block readahead that is done.

> The cp is then going to create a hardlink per file.
> Now, even with the default 4KB write alignment, we should be getting a
> large bundle of hardlinks per write.  And I would think the 512MB BBWC
> on the array controller, if firmware is decent, should do a good job of
> merging these to mitigate RMW cycles.

it's possible, but I would expect the lack of IO to be caused by the
fact modification is CPU bound. i.e. it's taking so long for every
hard link to be created (on average) that the IO subsystem can
handle the read/write IO demands with ease because there is
relatively little IO being issued.

> The OP is seeing 100% CPU for the cp operation, almost no IO, and no
> iowait.  If XFS or RMW were introducing any latency I'd think we'd see
> some iowait.

Right, so that leads to the conclusion that the freespace
fragmentation is definitely a potential cause of the excessive CPU
usage....

> Thus I believe at this point, the problem is those millions of serial
> user space calls in a single Perl thread causing the high CPU burn,
> little IO, and long run time, not XFS nor the storage.  And I think the
> OP came to this conclusion as well, without waiting on our analysis of
> his filesystem.

Using perf to profile the kernel while the cp -al workload is
running will tell us exactly where the CPU is being burnt. That
will confirm the analysis, or point us at some other issue that is
causing excessive CPU burn...
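
Something along these lines would do (the exact options are only a
suggestion; run it while the cp -al is active):

~$ perf record -a -g -- sleep 60    # system-wide sample with call graphs
~$ perf report --sort symbol        # then see where the CPU time goes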

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_fsr, sunit, and swidth
  2013-03-16  7:21                 ` Dave Chinner
@ 2013-03-16 11:45                   ` Stan Hoeppner
  2013-03-25 17:00                   ` Dave Hall
  1 sibling, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-16 11:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Hall, xfs

On 3/16/2013 2:21 AM, Dave Chinner wrote:
> On Fri, Mar 15, 2013 at 11:47:08PM -0500, Stan Hoeppner wrote:
>> On 3/15/2013 6:45 AM, Dave Chinner wrote:
>>> On Fri, Mar 15, 2013 at 12:14:40AM -0500, Stan Hoeppner wrote:
>>>> On 3/14/2013 9:59 AM, Dave Hall wrote:
>>>> Looks good.  75% is close to tickling the free space fragmentation
>>>> dragon but you're not there yet.
>>>
>>> Don't be so sure ;)
>>
>> The only thing I'm sure of is that I'll always be learning something new
>> about XFS and how to troubleshoot it. ;)
>>
>>>>
>>>>> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>>>>> /dev/sdb1            5469091840 1367746380 4101345460   26% /infortrend
>>>>
>>>> Plenty of free inodes.
>>>>
>>>>> # xfs_db -r -c freesp /dev/sdb1
>>>>>    from      to extents  blocks    pct
>>>>>       1       1  832735  832735   0.05
>>>>>       2       3  432183 1037663   0.06
>>>>>       4       7  365573 1903965   0.11
>>>>>       8      15  352402 3891608   0.23
>>>>>      16      31  332762 7460486   0.43
>>>>>      32      63  300571 13597941   0.79
>>>>>      64     127  233778 20900655   1.21
>>>>>     128     255  152003 27448751   1.59
>>>>>     256     511  112673 40941665   2.37
>>>>>     512    1023   82262 59331126   3.43
>>>>>    1024    2047   53238 76543454   4.43
>>>>>    2048    4095   34092 97842752   5.66
>>>>>    4096    8191   22743 129915842   7.52
>>>>>    8192   16383   14453 162422155   9.40
>>>>>   16384   32767    8501 190601554  11.03
>>>>>   32768   65535    4695 210822119  12.20
>>>>>   65536  131071    2615 234787546  13.59
>>>>>  131072  262143    1354 237684818  13.76
>>>>>  262144  524287     470 160228724   9.27
>>>>>  524288 1048575      74 47384798   2.74
>>>>> 1048576 2097151       1 2097122   0.12
>>>>
>>>> Your free space map isn't completely horrible given you're at 75%
>>>> capacity.  Looks like most of it is in chunks 32MB and larger.  Those
>>>> 14.8m files have a mean size of ~1.22MB which suggests most of the files
>>>> are small, so you shouldn't be having high seek load (thus latency)
>>>> during allocation.
>>>
>>> FWIW, you can't really tell how bad the freespace fragmentation is
>>> from the global output like this. 
>>
>> True.
>>
>>> All of the large contiguous free
>>> space might be in one or two AGs, and the others might be badly
>>> fragmented. Hence you need to at least sample a few AGs to determine
>>> if this is representative of the freespace in each AG....
>>
>> What would be representative of 26AGs?  First, middle, last?  So Mr.
>> Hall would execute:
>>
>> ~$ xfs_db -r /dev/sdb1
>> xfs_db> freesp -a0
>> ...
>> xfs_db> freesp -a13
>> ...
>> xfs_db> freesp -a26
>> ...
>> xfs_db> quit
> 
> Yup, though I normally just  run something like:
> 
> # for i in `seq 0 1 <agcount - 1>`; do
>> xfs_db -c "freesp -a $i" <dev>
>> done
> 
> To look at the them all quickly...

Ahh, you have to put the xfs_db command in quotes if it has args.  I
kept getting an error when using -a in my command line.  Thanks.

Your command line will give histograms for all 26 AGs.  This isn't
sampling just a few as you suggested.  Do we generally want to have
users dump histograms of all their AGs to the mailing list?  Or will
sampling do?  In this case something like this?

~$ for i in 0 8 17 26; do xfs_db -r -c "freesp -a $i" /dev/sdb1; done

>>> As it is, the above output raises alarms for me. What I see is that
>>> the number of small extents massively outnumbers the large extents.
>>> The fact that there are roughly 2.5 million extents smaller than 63
>>> blocks and that there is only one freespace extent larger than 4GB
>>> indicates to me that free space is substantially fragmented. At 25%
>>> free space, that's 250GB per AG, and if the largest freespace in
>>> most AGs is less than 4GB in length, then free space is not
>>> contiguous. i.e.  Free space appears to be heavily weighted towards
>>> small extents...
>>
>> It didn't raise alarms for me.  This is an rsnapshot workload with
>> millions of small files.  For me it was a foregone conclusion he'd have
>> serious fragmentation.  What I was looking at is whether it's severe
>> enough to be a factor in his stated problem.  I don't think it is.  In
>> fact I think it's completely unrelated, which is why I didn't go into
>> deeper analysis of this.  Though I could be incorrect. ;)
> 
> Ok, so what size blocks are the metadata held in? 1-4 filesystem
> block extents. 

So, 4KB to 16KB.  How many of the hard links being created can we store
in each?

> So, when we do a by-size freespace btree lookup, we
> don't find a large freespace to allocate from. So we fall back to a
> by-blkno search down the freespace btree to find a nearby block of
> sufficient size. 

If we only need a free block of 4-16KB for our hardlinks, nearly any of
his free space would be usable wouldn't it?

> That search runs until we run off one end of the
> freespace btree. And when this might have to walk along several tens
> of thousand of btree records, each allocation will consume a *lot*
> of CPU time. How much? well, compared to finding a large freespace
> extent, think orders of magnitude more CPU overhead per
> allocation...

I follow you, up to a point.  I'm disconnected between the free block
size requirements for metadata, and having to potentially walk two
entire btrees looking for a free chunk of sufficient size.  Seems to me
every free extent in his histogram is usable for hardlink metadata if
our minimum is one filesystem block, or 4KB.

WRT CPU burn, I'll address my thoughts on that much further below.

>>> So, the above output would lead me to investigate the freespace
>>> layout more deeply to determine if this is going to affect the
>>> workload that is being run...
>>
>> May be time to hold class again Dave as I'm probably missing something.
>>  His slowdown is serial hardlink creation with "cp -al" of many millions
>> of files.  Hardlinks are metadata structures, which means this workload
>> modifies btrees and inodes, not extents, right?
> 
> It modifies directories and inodes, and adding directory entries
> requires allocation of new directory blocks, and that requires
> scanning of the freespace trees....

Got it.

>> XFS directory metadata is stored closely together in each AG, correct?
>> 'cp -al' is going to walk directories in order, which means we're going
>> to have good read caching of the directory information thus little to no
>> random read IO. 
> 
> not if the directory is fragmented. If freespace is fragmented, then
> there's a good chance that directory blocks are not going to have
> good locality, though the effect of that will be minimised by the
> directory block readahead that is done.

Got it.  And given this box has 128GB of RAM there's probably a lot of
directory metadata already in cache.

>> The cp is then going to create a hardlink per file.
>> Now, even with the default 4KB write alignment, we should be getting a
>> large bundle of hardlinks per write.  And I would think the 512MB BBWC
>> on the array controller, if firmware is decent, should do a good job of
>> merging these to mitigate RMW cycles.
> 
> it's possible, but I would expect the lack of IO to be caused by the
> fact modification is CPU bound. i.e. it's taking so long for every
> hard link to be created (on average) that the IO subsystem can
> handle the read/write IO demands with ease because there is
> relatively little IO being issued.

The OP stated one CPU is throttled, two have very light load, the other
29 are idle.  The throttled core must be the one on which the cp code is
executing.  The kernel isn't going to schedule the XFS btree walking
thread(s) on the same core, is it?  So if no other cores are anywhere
near peak, isn't it safe to assume the workload isn't CPU bound due to
free space btree walking?

I should have thought of this earlier when he described the load on his
cores...

>> The OP is seeing 100% CPU for the cp operation, almost no IO, and no
>> iowait.  If XFS or RMW were introducing any latency I'd think we'd see
>> some iowait.
> 
> Right, so that leads to the conclusion that the freespace
> fragmentation is definitely a potential cause of the excessive CPU
> usage....

Is it still a candidate, given what I describe above WRT XFS thread
scheduling, and that only one core is hammered?

>> Thus I believe at this point, the problem is those millions of serial
>> user space calls in a single Perl thread causing the high CPU burn,
>> little IO, and long run time, not XFS nor the storage.  And I think the
>> OP came to this conclusion as well, without waiting on our analysis of
>> his filesystem.
> 
> Using perf to profile the kernel while the cp -al workload is
> running will tell us exactly where the CPU is being burnt. That
> will confirm the analysis, or point us at some other issue that is
> causing excessive CPU burn...

I'd like to see this as well.  Because if the bottleneck isn't XFS, I'd
like to understand how a 2GHz core with 18MB of L3 cache is being
completely consumed by a cp command which is doing nothing but creating
hardlinks--while the IO rate is almost nothing.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-16  7:21                 ` Dave Chinner
  2013-03-16 11:45                   ` Stan Hoeppner
@ 2013-03-25 17:00                   ` Dave Hall
  2013-03-27 21:16                     ` Stan Hoeppner
  2013-03-28  1:38                     ` xfs_fsr, sunit, and swidth Dave Chinner
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Hall @ 2013-03-25 17:00 UTC (permalink / raw)
  To: xfs


Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 03/16/2013 03:21 AM, Dave Chinner wrote:
> Using perf to profile the kernel while the cp -al workload is
> running will tell us exactly where the CPU is being burnt. That
> will confirm the analysis, or point us at some other issue that is
> causing excessive CPU burn...
>
Dave, which perf command(s) would you like me to run?  (I'm familiar
with the concept behind this kind of tool, but I haven't worked with 
this one before).

Also, what would you like me to do with the xfs_db freesp output for 26 
agroups?

-Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-25 17:00                   ` Dave Hall
@ 2013-03-27 21:16                     ` Stan Hoeppner
  2013-03-29 19:59                       ` Dave Hall
  2013-03-28  1:38                     ` xfs_fsr, sunit, and swidth Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-03-27 21:16 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On 3/25/2013 12:00 PM, Dave Hall wrote:
> On 03/16/2013 03:21 AM, Dave Chinner wrote:
>> Using perf to profile the kernel while the cp -al workload is
>> running will tell us exactly where the CPU is being burnt. That
>> will confirm the analysis, or point us at some other issue that is
>> causing excessive CPU burn...
>>
> Dave, which perf command(s) would you like me to run?  (I'm familiar
> with the concept behind this kind of tool, but I haven't worked with
> this one before).

I'll let Dave answer this one.

> Also, what would you like me to do with the xfs_db freesp output for 26
> agroups?

A pastebin link should be fine.  Only a couple of people will be looking
at it.  I don't see value in free space maps of 26 AGs being archived.

FWIW, it's probably best to reply-all instead of just to the list.
Sometimes posts get lost in the noise.  Not sure if that's the case
here, but it's been a couple of days with no response from Dave C, and
the answers to these questions are very short.  Thus I'm guessing he
missed your post, so I'm CC'ing him here.

-- 
Stan



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-25 17:00                   ` Dave Hall
  2013-03-27 21:16                     ` Stan Hoeppner
@ 2013-03-28  1:38                     ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-03-28  1:38 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On Mon, Mar 25, 2013 at 01:00:51PM -0400, Dave Hall wrote:
> 
> Dave Hall
> Binghamton University
> kdhall@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
> 
> 
> On 03/16/2013 03:21 AM, Dave Chinner wrote:
> >Using perf to profile the kernel while the cp -al workload is
> >running will tell us exactly where the CPU is being burnt. That
> >will confirm the analysis, or point us at some other issue that is
> >causing excessive CPU burn...
> >
> Dave, which perf command(s) would you like me to run?  (I'm
> familiar with the concept behind this kind of tool, but I haven't
> worked with this one before).

Just run 'perf top -U' for 10s while the problem is occurring and
pastebin the output....
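
If grabbing the interactive output is awkward, a roughly equivalent
non-interactive capture works too; a sketch, assuming a perf build
matching the running kernel (file names are just examples):

    # sample all CPUs with call graphs for ~10s while cp -al is running
    perf record -a -g -o perf.data -- sleep 10

    # dump the hottest symbols to a text file for pastebin
    perf report --stdio -i perf.data | head -n 60 > perf-report.txt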

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-27 21:16                     ` Stan Hoeppner
@ 2013-03-29 19:59                       ` Dave Hall
  2013-03-31  1:22                         ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hall @ 2013-03-29 19:59 UTC (permalink / raw)
  To: stan; +Cc: xfs

Dave, Stan,

Here is the link for perf top -U:  http://pastebin.com/JYLXYWki.  The ag 
report is at http://pastebin.com/VzziSa4L.  Interestingly, the backups 
ran fast a couple times this week.  Once under 9 hours.  Today it looks 
like it's running long again.

-Dave

Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 03/27/2013 05:16 PM, Stan Hoeppner wrote:
> On 3/25/2013 12:00 PM, Dave Hall wrote:
>    
>> On 03/16/2013 03:21 AM, Dave Chinner wrote:
>>      
>>> Using perf to profile the kernel while the cp -al workload is
>>> running will tell us exactly where the CPU is being burnt. That
>>> will confirm the analysis, or point us at some other issue that is
>>> causing excessive CPU burn...
>>>
>>>        
>> Dave, which perf command(s) would you like me to run?  (I'm familiar
>> with the concept behind this kind of tool, but I haven't worked with
>> this one before).
>>      
> I'll let Dave answer this one.
>
>    
>> Also, what would you like me to do with the xfs_db freesp output for 26
>> agroups?
>>      
> A pastebin link should be fine.  Only a couple of people will be looking
> at it.  I don't see value in free space maps of 26 AGs being archived.
>
> FWIW, it's probably best to reply-all instead of just to the list.
> Sometimes posts get lost in the noise.  Not sure if that's the case
> here, but it's been a couple of days with no response from Dave C, and
> the answers to these questions are very short.  Thus I'm guessing he
> missed your post, so I'm CC'ing him here.
>
>    

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-29 19:59                       ` Dave Hall
@ 2013-03-31  1:22                         ` Dave Chinner
  2013-04-02 10:34                           ` Hans-Peter Jansen
  2013-04-03 14:25                           ` Dave Hall
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2013-03-31  1:22 UTC (permalink / raw)
  To: Dave Hall; +Cc: stan, xfs

On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote:
> Dave, Stan,
> 
> Here is the link for perf top -U:  http://pastebin.com/JYLXYWki.
> The ag report is at http://pastebin.com/VzziSa4L.  Interestingly,
> the backups ran fast a couple times this week.  Once under 9 hours.
> Today it looks like it's running long again.

    12.38%  [xfs]     [k] xfs_btree_get_rec
    11.65%  [xfs]     [k] _xfs_buf_find
    11.29%  [xfs]     [k] xfs_btree_increment
     7.88%  [xfs]     [k] xfs_inobt_get_rec
     5.40%  [kernel]  [k] intel_idle
     4.13%  [xfs]     [k] xfs_btree_get_block
     4.09%  [xfs]     [k] xfs_dialloc
     3.21%  [xfs]     [k] xfs_btree_readahead
     2.00%  [xfs]     [k] xfs_btree_rec_offset
     1.50%  [xfs]     [k] xfs_btree_rec_addr

Inode allocation searches, looking for an inode near to the parent
directory.

What this indicates is that you have lots of sparsely allocated inode
chunks on disk. i.e. each 64-inode chunk has some free inodes in it,
and some used inodes. This is likely due to random removal of inodes
as you delete old backups and link counts drop to zero. Because we
only index inodes on "allocated chunks", finding a chunk that has a
free inode can be like finding a needle in a haystack. There are
heuristics used to stop searches from consuming too much CPU, but it
still can be quite slow when you repeatedly hit those paths....
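
You can get a rough feel for how sparse the inode chunks are from the
per-AG inode counters; a sketch, assuming the filesystem sits on
/dev/sdb1 and has 26 AGs ("count" is inodes in allocated chunks,
"freecount" the free slots scattered among them):

    for ag in $(seq 0 25); do
        xfs_db -r -c "agi $ag" -c "p count" -c "p freecount" /dev/sdb1
    done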

I don't have an answer that will magically speed things up for
you right now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-31  1:22                         ` Dave Chinner
@ 2013-04-02 10:34                           ` Hans-Peter Jansen
  2013-04-03 14:25                           ` Dave Hall
  1 sibling, 0 replies; 32+ messages in thread
From: Hans-Peter Jansen @ 2013-04-02 10:34 UTC (permalink / raw)
  To: xfs; +Cc: Dave Hall, stan

On Sunday, 31 March 2013 12:22:31 Dave Chinner wrote:
> On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote:
> > Dave, Stan,
> > 
> > Here is the link for perf top -U:  http://pastebin.com/JYLXYWki.
> > The ag report is at http://pastebin.com/VzziSa4L.  Interestingly,
> > the backups ran fast a couple times this week.  Once under 9 hours.
> > Today it looks like it's running long again.
> 
>     12.38%  [xfs]     [k] xfs_btree_get_rec
>     11.65%  [xfs]     [k] _xfs_buf_find
>     11.29%  [xfs]     [k] xfs_btree_increment
>      7.88%  [xfs]     [k] xfs_inobt_get_rec
>      5.40%  [kernel]  [k] intel_idle
>      4.13%  [xfs]     [k] xfs_btree_get_block
>      4.09%  [xfs]     [k] xfs_dialloc
>      3.21%  [xfs]     [k] xfs_btree_readahead
>      2.00%  [xfs]     [k] xfs_btree_rec_offset
>      1.50%  [xfs]     [k] xfs_btree_rec_addr
> 
> Inode allocation searches, looking for an inode near to the parent
> directory.
> 
> What this indicates is that you have lots of sparsely allocated inode
> chunks on disk. i.e. each 64-inode chunk has some free inodes in it,
> and some used inodes. This is likely due to random removal of inodes
> as you delete old backups and link counts drop to zero. Because we
> only index inodes on "allocated chunks", finding a chunk that has a
> free inode can be like finding a needle in a haystack. There are
> heuristics used to stop searches from consuming too much CPU, but it
> still can be quite slow when you repeatedly hit those paths....
> 
> I don't have an answer that will magically speed things up for
> you right now...

Hmm, unfortunately, this access pattern is pretty common; at least all "cp -al 
& rsync" based backup solutions will suffer from it after a while. I noticed 
that the "removing old backups" part also takes *ages* in this scenario.

I had to manually remove parts of a backup (subtrees with a few million 
ordinary files, massively hardlinked as usual); that took 4-5 hours for each 
run on a Hitachi Ultrastar 7K4000 drive. For the 8 subtrees, that finally took 
one and a half days, freeing about 500 GB of space. Oh well.

The question is: is it (logically) possible to reorganize the fragmented inode 
allocation space with a specialized tool (to be implemented) that lays out 
the allocation space in such a way that it matches XFS's earliest "expectations", 
or does that violate some deeper FS logic I'm not aware of?

I have to mention that I haven't run any tests with other file systems, as 
playing games with backups ranks very low on my scale of sensible tests, but 
experience has shown that XFS usually sucks less than its alternatives, even 
if the access pattern doesn't match its primary optimization domain.

Hence, implementing such a tool makes sense, where "least sucking" should be 
aimed for.

Cheers,
Pete

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-03-31  1:22                         ` Dave Chinner
  2013-04-02 10:34                           ` Hans-Peter Jansen
@ 2013-04-03 14:25                           ` Dave Hall
  2013-04-12 17:25                             ` Dave Hall
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hall @ 2013-04-03 14:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: stan, xfs

So, assuming entropy has reached critical mass and that there is no easy 
fix for this physical file system, what would happen if I replicated 
this data to a new disk array?  When I say 'replicate', I'm not talking 
about xfsdump.  I'm talking about running a series of cp -al/rsync 
operations (or maybe rsync with --link-dest) that will precisely 
reproduce the linked data on my current array.  All of the inodes would 
be re-allocated.  There wouldn't be any (or at least not many) deletes.
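
Roughly what I have in mind, oldest snapshot first so each newer one
can hard-link against the copy already present on the target (the
paths here are only examples):

    # oldest snapshot: plain copy, preserving hard links within it
    rsync -aH /old/daily.6/ /new/daily.6/

    # each newer snapshot: link unchanged files to the one just copied,
    # then repeat with the previous target as --link-dest up to daily.0
    rsync -aH --link-dest=/new/daily.6 /old/daily.5/ /new/daily.5/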

I am hoping that if I do this the inode fragmentation will be 
significantly reduced on the target as compared to the source.  Of 
course over time it may re-fragment, but with two arrays I can always 
wipe one and reload it.

-Dave

Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 03/30/2013 09:22 PM, Dave Chinner wrote:
> On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote:
>    
>> Dave, Stan,
>>
>> Here is the link for perf top -U:  http://pastebin.com/JYLXYWki.
>> The ag report is at http://pastebin.com/VzziSa4L.  Interestingly,
>> the backups ran fast a couple times this week.  Once under 9 hours.
>> Today it looks like it's running long again.
>>      
>      12.38%  [xfs]     [k] xfs_btree_get_rec
>      11.65%  [xfs]     [k] _xfs_buf_find
>      11.29%  [xfs]     [k] xfs_btree_increment
>       7.88%  [xfs]     [k] xfs_inobt_get_rec
>       5.40%  [kernel]  [k] intel_idle
>       4.13%  [xfs]     [k] xfs_btree_get_block
>       4.09%  [xfs]     [k] xfs_dialloc
>       3.21%  [xfs]     [k] xfs_btree_readahead
>       2.00%  [xfs]     [k] xfs_btree_rec_offset
>       1.50%  [xfs]     [k] xfs_btree_rec_addr
>
> Inode allocation searches, looking for an inode near to the parent
> directory.
>
> What this indicates is that you have lots of sparsely allocated inode
> chunks on disk. i.e. each 64-inode chunk has some free inodes in it,
> and some used inodes. This is likely due to random removal of inodes
> as you delete old backups and link counts drop to zero. Because we
> only index inodes on "allocated chunks", finding a chunk that has a
> free inode can be like finding a needle in a haystack. There are
> heuristics used to stop searches from consuming too much CPU, but it
> still can be quite slow when you repeatedly hit those paths....
>
> I don't have an answer that will magically speed things up for
> you right now...
>
> Cheers,
>
> Dave.
>    

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-04-03 14:25                           ` Dave Hall
@ 2013-04-12 17:25                             ` Dave Hall
  2013-04-13  0:45                               ` Dave Chinner
  2013-04-13  0:51                               ` Stan Hoeppner
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hall @ 2013-04-12 17:25 UTC (permalink / raw)
  To: stan; +Cc: xfs

Stan,

Did this post get lost in the shuffle?  Looking at it I think it could 
have been a bit unclear.  What I need to do anyways is have a second, 
off-site copy of my backup data.  So I'm going to be building a second 
array.  In copying, in order to preserve the hard link structure of the 
source array I'd have to run a sequence of cp -al / rsync calls that 
would mimic what rsnapshot did to get me to where I am right now.  (Note 
that I could also potentially use rsync --link-dest.)

So the question is how would the target xfs file system fare as far as 
my inode fragmentation situation is concerned?  I'm hoping that since 
the target would be a fresh file system, and since during the 'copy' 
phase I'd only be adding inodes, that the inode allocation would be more 
compact and orderly than what I have on the source array.  What do 
you think?

Thanks.

-Dave

Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 04/03/2013 10:25 AM, Dave Hall wrote:
> So, assuming entropy has reached critical mass and that there is no 
> easy fix for this physical file system, what would happen if I 
> replicated this data to a new disk array?  When I say 'replicate', I'm 
> not talking about xfsdump.  I'm talking about running a series of cp 
> -al/rsync operations (or maybe rsync with --link-dest) that will 
> precisely reproduce the linked data on my current array.  All of the 
> inodes would be re-allocated.  There wouldn't be any (or at least not 
> many) deletes.
>
> I am hoping that if I do this the inode fragmentation will be 
> significantly reduced on the target as compared to the source.  Of 
> course over time it may re-fragment, but with two arrays I can always 
> wipe one and reload it.
>
> -Dave
>
> Dave Hall
> Binghamton University
> kdhall@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
>
>
> On 03/30/2013 09:22 PM, Dave Chinner wrote:
>> On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote:
>>> Dave, Stan,
>>>
>>> Here is the link for perf top -U:  http://pastebin.com/JYLXYWki.
>>> The ag report is at http://pastebin.com/VzziSa4L.  Interestingly,
>>> the backups ran fast a couple times this week.  Once under 9 hours.
>>> Today it looks like it's running long again.
>>      12.38%  [xfs]     [k] xfs_btree_get_rec
>>      11.65%  [xfs]     [k] _xfs_buf_find
>>      11.29%  [xfs]     [k] xfs_btree_increment
>>       7.88%  [xfs]     [k] xfs_inobt_get_rec
>>       5.40%  [kernel]  [k] intel_idle
>>       4.13%  [xfs]     [k] xfs_btree_get_block
>>       4.09%  [xfs]     [k] xfs_dialloc
>>       3.21%  [xfs]     [k] xfs_btree_readahead
>>       2.00%  [xfs]     [k] xfs_btree_rec_offset
>>       1.50%  [xfs]     [k] xfs_btree_rec_addr
>>
>> Inode allocation searches, looking for an inode near to the parent
>> directory.
>>
>> What this indicates is that you have lots of sparsely allocated inode
>> chunks on disk. i.e. each 64-inode chunk has some free inodes in it,
>> and some used inodes. This is likely due to random removal of inodes
>> as you delete old backups and link counts drop to zero. Because we
>> only index inodes on "allocated chunks", finding a chunk that has a
>> free inode can be like finding a needle in a haystack. There are
>> heuristics used to stop searches from consuming too much CPU, but it
>> still can be quite slow when you repeatedly hit those paths....
>>
>> I don't have an answer that will magically speed things up for
>> you right now...
>>
>> Cheers,
>>
>> Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-04-12 17:25                             ` Dave Hall
@ 2013-04-13  0:45                               ` Dave Chinner
  2013-04-13  0:51                               ` Stan Hoeppner
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-04-13  0:45 UTC (permalink / raw)
  To: Dave Hall; +Cc: stan, xfs

On Fri, Apr 12, 2013 at 01:25:22PM -0400, Dave Hall wrote:
> Stan,
> 
> Did this post get lost in the shuffle?  Looking at it I think it
> could have been a bit unclear.  What I need to do anyways is have a
> second, off-site copy of my backup data.  So I'm going to be
> building a second array.  In copying, in order to preserve the hard
> link structure of the source array I'd have to run a sequence of cp
> -al / rsync calls that would mimic what rsnapshot did to get me to
> where I am right now.  (Note that I could also potentially use rsync
> --link-dest.)
> So the question is how would the target xfs file system fare as far
> as my inode fragmentation situation is concerned?  I'm hoping that
> since the target would be a fresh file system, and since during the
> 'copy' phase I'd only be adding inodes, that the inode allocation
> would be more compact and orderly than what I have on the source
> array.  What do you think?

Sure, it would be to start with, but you'll eventually end up in the
same place. Removing links from the forest is what leads to the
sparse free inode space, so even starting with a dense inode
allocation pattern, it'll turn sparse the moment you remove backups
from the forest....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-04-12 17:25                             ` Dave Hall
  2013-04-13  0:45                               ` Dave Chinner
@ 2013-04-13  0:51                               ` Stan Hoeppner
  2013-04-15 20:35                                 ` Dave Hall
  1 sibling, 1 reply; 32+ messages in thread
From: Stan Hoeppner @ 2013-04-13  0:51 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On 4/12/2013 12:25 PM, Dave Hall wrote:
> Stan,
> 
> Did this post get lost in the shuffle?  Looking at it I think it could
> have been a bit unclear.  What I need to do anyways is have a second,
> off-site copy of my backup data.  So I'm going to be building a second
> array.  In copying, in order to preserve the hard link structure of the
> source array I'd have to run a sequence of cp -al / rsync calls that
> would mimic what rsnapshot did to get me to where I am right now.  (Note
> that I could also potentially use rsync --link-dest.)
> 
> So the question is how would the target xfs file system fare as far as
> my inode fragmentation situation is concerned?  I'm hoping that since
> the target would be a fresh file system, and since during the 'copy'
> phase I'd only be adding inodes, that the inode allocation would be more
> compact and orderly than what I have on the source array.  What do
> you think?

The question isn't what it will look like initially, as your inodes
shouldn't be sparsely allocated as with your current aged filesystem.

The question is how quickly the problem will arise on the new filesystem
as you free inodes.  I don't have the answer to that question.  There's
no way to predict this that I know of.

-- 
Stan

> Thanks.
> 
> -Dave
> 
> Dave Hall
> Binghamton University
> kdhall@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
> 
> 
> On 04/03/2013 10:25 AM, Dave Hall wrote:
>> So, assuming entropy has reached critical mass and that there is no
>> easy fix for this physical file system, what would happen if I
>> replicated this data to a new disk array?  When I say 'replicate', I'm
>> not talking about xfsdump.  I'm talking about running a series of cp
>> -al/rsync operations (or maybe rsync with --link-dest) that will
>> precisely reproduce the linked data on my current array.  All of the
>> inodes would be re-allocated.  There wouldn't be any (or at least not
>> many) deletes.
>>
>> I am hoping that if I do this the inode fragmentation will be
>> significantly reduced on the target as compared to the source.  Of
>> course over time it may re-fragment, but with two arrays I can always
>> wipe one and reload it.
>>
>> -Dave
>>
>> Dave Hall
>> Binghamton University
>> kdhall@binghamton.edu
>> 607-760-2328 (Cell)
>> 607-777-4641 (Office)
>>
>>
>> On 03/30/2013 09:22 PM, Dave Chinner wrote:
>>> On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote:
>>>> Dave, Stan,
>>>>
>>>> Here is the link for perf top -U:  http://pastebin.com/JYLXYWki.
>>>> The ag report is at http://pastebin.com/VzziSa4L.  Interestingly,
>>>> the backups ran fast a couple times this week.  Once under 9 hours.
>>>> Today it looks like it's running long again.
>>>      12.38%  [xfs]     [k] xfs_btree_get_rec
>>>      11.65%  [xfs]     [k] _xfs_buf_find
>>>      11.29%  [xfs]     [k] xfs_btree_increment
>>>       7.88%  [xfs]     [k] xfs_inobt_get_rec
>>>       5.40%  [kernel]  [k] intel_idle
>>>       4.13%  [xfs]     [k] xfs_btree_get_block
>>>       4.09%  [xfs]     [k] xfs_dialloc
>>>       3.21%  [xfs]     [k] xfs_btree_readahead
>>>       2.00%  [xfs]     [k] xfs_btree_rec_offset
>>>       1.50%  [xfs]     [k] xfs_btree_rec_addr
>>>
>>> Inode allocation searches, looking for an inode near to the parent
>>> directory.
>>>
>>> What this indicates is that you have lots of sparsely allocated inode
>>> chunks on disk. i.e. each 64-inode chunk has some free inodes in it,
>>> and some used inodes. This is likely due to random removal of inodes
>>> as you delete old backups and link counts drop to zero. Because we
>>> only index inodes on "allocated chunks", finding a chunk that has a
>>> free inode can be like finding a needle in a haystack. There are
>>> heuristics used to stop searches from consuming too much CPU, but it
>>> still can be quite slow when you repeatedly hit those paths....
>>>
>>> I don't have an answer that will magically speed things up for
>>> you right now...
>>>
>>> Cheers,
>>>
>>> Dave.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-04-13  0:51                               ` Stan Hoeppner
@ 2013-04-15 20:35                                 ` Dave Hall
  2013-04-16  1:45                                   ` Stan Hoeppner
  2013-04-16 16:18                                   ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hall @ 2013-04-15 20:35 UTC (permalink / raw)
  To: stan; +Cc: xfs

Stan,

I understand that this will be an ongoing problem.  It seems like all I 
could do at this point would be to 'manually defrag' my inodes the hard 
way by doing this 'copy' operation whenever things slow down.  (Either 
that or go get my PhD in file systems and try to come up with a better 
inode management algorithm.)  I will be keeping two copies of this data 
going forward anyways.

Are there any other suggestions you might have at this time - xfs or 
otherwise?

-Dave

Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 04/12/2013 08:51 PM, Stan Hoeppner wrote:
> On 4/12/2013 12:25 PM, Dave Hall wrote:
>    
>> Stan,
>>
>> Did this post get lost in the shuffle?  Looking at it I think it could
>> have been a bit unclear.  What I need to do anyways is have a second,
>> off-site copy of my backup data.  So I'm going to be building a second
>> array.  In copying, in order to preserve the hard link structure of the
>> source array I'd have to run a sequence of cp -al / rsync calls that
>> would mimic what rsnapshot did to get me to where I am right now.  (Note
>> that I could also potentially use rsync --link-dest.)
>>
>> So the question is how would the target xfs file system fare as far as
>> my inode fragmentation situation is concerned?  I'm hoping that since
>> the target would be a fresh file system, and since during the 'copy'
>> phase I'd only be adding inodes, that the inode allocation would be more
>> compact and orderly than what I have on the source array.  What do
>> you think?
>>      
> The question isn't what it will look like initially, as your inodes
> shouldn't be sparsely allocated as with your current aged filesystem.
>
> The question is how quickly the problem will arise on the new filesystem
> as you free inodes.  I don't have the answer to that question.  There's
> no way to predict this that I know of.
>
>    

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-04-15 20:35                                 ` Dave Hall
@ 2013-04-16  1:45                                   ` Stan Hoeppner
  2013-04-16 16:18                                   ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: Stan Hoeppner @ 2013-04-16  1:45 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

On 4/15/2013 3:35 PM, Dave Hall wrote:
> Stan,
> 
> I understand that this will be an ongoing problem.  It seems like all I could do at this point would be to ' manually defrag' my inodes the hard way by doing this 'copy' operation whenever things slow down.  (Either that or go get my PHD in file systems and try to come up with a better inode management algorithm.)  I will be keeping two copies of this data going forward anyways.
> 
> Are there any other suggestions you might have at this time - xfs or otherwise?

I'm no expert in this particular area, so I'll simply give the sysadmin 101 perspective:

Always pick the right tool for the job.  If XFS isn't working satisfactorily for this job and no fix is forthcoming, I'd test EXT4 and JFS to see if either of them is more suitable.

The other option is to switch to a backup job that doesn't create/delete millions of hard links.

There are likely other possibilities.

-- 
Stan


> -Dave
> 
> Dave Hall
> Binghamton University
> kdhall@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
> 
> 
> On 04/12/2013 08:51 PM, Stan Hoeppner wrote:
>> On 4/12/2013 12:25 PM, Dave Hall wrote:
>>   
>>> Stan,
>>>
>>> Did this post get lost in the shuffle?  Looking at it I think it could
>>> have been a bit unclear.  What I need to do anyways is have a second,
>>> off-site copy of my backup data.  So I'm going to be building a second
>>> array.  In copying, in order to preserve the hard link structure of the
>>> source array I'd have to run a sequence of cp -al / rsync calls that
>>> would mimic what rsnapshot did to get me to where I am right now.  (Note
>>> that I could also potentially use rsync --link-dest.)
>>>
>>> So the question is how would the target xfs file system fare as far as
>>> my inode fragmentation situation is concerned?  I'm hoping that since
>>> the target would be a fresh file system, and since during the 'copy'
>>> phase I'd only be adding inodes, that the inode allocation would be more
>>> compact and orderly than what I have on the source array.  What do
>>> you think?
>>>      
>> The question isn't what it will look like initially, as your inodes
>> shouldn't be sparsely allocated as with your current aged filesystem.
>>
>> The question is how quickly the problem will arise on the new filesystem
>> as you free inodes.  I don't have the answer to that question.  There's
>> no way to predict this that I know of.
>>
>>    
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: xfs_fsr, sunit, and swidth
  2013-04-15 20:35                                 ` Dave Hall
  2013-04-16  1:45                                   ` Stan Hoeppner
@ 2013-04-16 16:18                                   ` Dave Chinner
  2015-02-22 23:35                                     ` XFS/LVM/Multipath on a single RAID volume Dave Hall
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2013-04-16 16:18 UTC (permalink / raw)
  To: Dave Hall; +Cc: stan, xfs

On Mon, Apr 15, 2013 at 04:35:38PM -0400, Dave Hall wrote:
> Stan,
> 
> I understand that this will be an ongoing problem.  It seems like
> all I could do at this point would be to ' manually defrag' my
> inodes the hard way by doing this 'copy' operation whenever things
> slow down.  (Either that or go get my PHD in file systems and try to
> come up with a better inode management algorithm.)

No need, I know how to fix it for good. Just add a new btree that
tracks free inodes, rather than having to scan the allocated inode
tree to find free inodes. Shouldn't actually be too difficult to do,
as it's a generic btree and the code to keep both btrees in sync is
a copy of the way the two freespace btrees are kept in sync....
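
If/when something like that lands, enabling it would presumably just
be a mkfs-time feature flag; a purely hypothetical sketch (the
"finobt" option name and the -m syntax are assumptions here, not
something that exists today):

    # hypothetical: recreate the filesystem with a free inode btree,
    # on a kernel and xfsprogs new enough to support the feature
    mkfs.xfs -m finobt=1 /dev/sdc1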

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* XFS/LVM/Multipath on a single RAID volume
  2013-04-16 16:18                                   ` Dave Chinner
@ 2015-02-22 23:35                                     ` Dave Hall
  2015-02-23 11:18                                       ` Emmanuel Florac
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hall @ 2015-02-22 23:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: stan, xfs

Dave, Stan,

Not sure if you remember, but we corresponded for a while a couple years 
ago about some performance problems I was having with XFS on a 26TB 
SAS-attached RAID box.  If either of you is still working on XFS, I've 
got some new questions.  Actually, what I've got is a new array to set 
up.  Same size, but faster disks and a faster controller.  It will 
replace the existing array as the primary backup volume.

So since I have a fresh array that's not in production yet I was hoping 
to get some pointers on how to configure it to maximize XFS 
performance.  In particular, I've seen a suggestion that a multipathed 
array should be sliced up into logical drives and pasted back together 
with LVM.  Wondering also about putting the journal in a separate 
logical drive on the same array.

I am able to set up a 2-way multipath right now, and I might be able to 
justify adding a second controller to the array to get a 4-way multipath 
going.

Even if the LVM approach is the wrong one, I clearly have a rare chance 
to set this array up the right way.  Please let me know if you have any 
suggestions.

Thanks.

-Dave

Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: XFS/LVM/Multipath on a single RAID volume
  2015-02-22 23:35                                     ` XFS/LVM/Multipath on a single RAID volume Dave Hall
@ 2015-02-23 11:18                                       ` Emmanuel Florac
  2015-02-24 22:04                                         ` Dave Hall
  0 siblings, 1 reply; 32+ messages in thread
From: Emmanuel Florac @ 2015-02-23 11:18 UTC (permalink / raw)
  To: Dave Hall; +Cc: stan, xfs

On Sun, 22 Feb 2015 18:35:19 -0500
Dave Hall <kdhall@binghamton.edu> wrote:

> So since I have a fresh array that's not in production yet I was
> hoping to get some pointers on how to configure it to maximize XFS 
> performance.  In particular, I've seen a suggestion that a
> multipathed array should be sliced up into logical drives and pasted
> back together with LVM.  Wondering also about putting the journal in
> a separate logical drive on the same array.

What's the hardware configuration like? Before multipathing, you need
to know if your RAID controller and disks can actually saturate your
link. Generally SAS-attached enclosures are driven through a 4 way
SFF-8088 cable, with a bandwidth of 4x 6Gbps (maximum throughput per
link: 3 GB/s) or 4 x 12 Gbps (max thruput: 6 GB/s).
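
The back-of-the-envelope arithmetic behind those figures, for
reference (usable bandwidth drops a little further once 8b/10b line
encoding is taken into account):

    awk 'BEGIN {
        lanes = 4; gbps = 6              # one 4-lane 6Gbps wide port
        raw = lanes * gbps / 8           # ~3 GB/s raw
        usable = raw * 8 / 10            # ~2.4 GB/s after 8b/10b
        printf "raw %.1f GB/s, usable ~%.1f GB/s\n", raw, usable
    }'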

> I am able to set up a 2-way multipath right now, and I might be able
> to justify adding a second controller to the array to get a 4-way
> multipath going.

A multipath can double the throughput, provided that you have enough
drives: you'll need about 24 7k RPM drives to saturate _one_ 4x6Gbps
SAS link. If you have only 12 drives, dual attachment probably won't
yield much.

> Even if the LVM approach is the wrong one, I clearly have a rare
> chance to set this array up the right way.  Please let me know if you
> have any suggestions.

In my experience, software RAID-0 with md gives slightly better
performance than LVM, though not much.


-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: XFS/LVM/Multipath on a single RAID volume
  2015-02-23 11:18                                       ` Emmanuel Florac
@ 2015-02-24 22:04                                         ` Dave Hall
  2015-02-24 22:33                                           ` Dave Chinner
  2015-02-25 11:21                                           ` Emmanuel Florac
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Hall @ 2015-02-24 22:04 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: stan, xfs


Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 02/23/2015 06:18 AM, Emmanuel Florac wrote:
> On Sun, 22 Feb 2015 18:35:19 -0500
> Dave Hall<kdhall@binghamton.edu> wrote:
>
>    
>> So since I have a fresh array that's not in production yet I was
>> hoping to get some pointers on how to configure it to maximize XFS
>> performance.  In particular, I've seen a suggestion that a
>> multipathed array should be sliced up into logical drives and pasted
>> back together with LVM.  Wondering also about putting the journal in
>> a separate logical drive on the same array.
>>      
> What's the hardware configuration like? before multipathing, you need
> to know if your RAID controller and disks can actually saturate your
> link. Generally SAS-attached enclosures are driven through a 4 way
> SFF-8088 cable, with a bandwidth of 4x 6Gbps (maximum throughput per
> link: 3 GB/s) or 4 x 12 Gbps (max thruput: 6 GB/s).
>
>    
The new hardware is an Infortrend with 16 x 2TB 6Gbps SAS drives.  It 
has one controller with dual 6Gbps SAS ports.  The server currently has 
two 3Gbps SAS HBAs.

On an existing array based on similar but slightly slower hardware, I'm 
getting miserable performance.  The bottleneck seems to be on the server 
side.  For specifics, the array is laid out as a single 26TB volume and 
attached by a single 3Gbps SAS.  The server is a quad 8-core Xeon with 
128GB RAM.  The networking is all 10GbE.  The application is rsnapshot 
which is essentially a series of rsync copies where the unchanged files 
are hard-linked from one snapshot to the next.  CPU utilization is very 
low and only a few cores seem to be active.  Yet the operation is taking 
hours to complete.

The premise that was presented to me by someone in the storage business 
is that with 'many' processor cores one should slice a large array up 
into segments, multipath the whole deal, and then mash the segments back 
together with LVM (or MD).  Since the kernel would ultimately see a 
bunch of smaller storage segments that were all getting activity, it 
should dispatch a set of cores for each storage segment and get the job 
done faster.  I think in theory this would even work to some extent on a 
single-path SAS connection.
>> I am able to set up a 2-way multipath right now, and I might be able
>> to justify adding a second controller to the array to get a 4-way
>> multipath going.
>>      
> A multipath can double the throughput, provided that you have enough
> drives: you'll need about 24 7k RPM drives to saturate _one_ 4x6Gbps
> SAS link. If you have only 12 drives, dual attachment probably won't
> yield much.
>
>    
>> Even if the LVM approach is the wrong one, I clearly have a rare
>> chance to set this array up the right way.  Please let me know if you
>> have any suggestions.
>>      
> In my experience, software RAID-0 with md gives slightly better
> performance than LVM, though not much.
>
>
>    
MD RAID-0 seems as likely as LVM, so I'd probably try that first.  The 
big question is how to size the slices of the array to make XFS happy 
and then how to make sure XFS knows about it.  Secondly, there is the 
question of the log volume.  Seems that with multipath there might be 
some possible advantage to putting this in it's on slice on the array so 
that log writes could be in an I/O stream that is managed separately 
from the rest.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: XFS/LVM/Multipath on a single RAID volume
  2015-02-24 22:04                                         ` Dave Hall
@ 2015-02-24 22:33                                           ` Dave Chinner
       [not found]                                             ` <54ED01BC.6080302@binghamton.edu>
  2015-02-25 11:49                                             ` Emmanuel Florac
  2015-02-25 11:21                                           ` Emmanuel Florac
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2015-02-24 22:33 UTC (permalink / raw)
  To: Dave Hall; +Cc: stan, xfs

On Tue, Feb 24, 2015 at 05:04:35PM -0500, Dave Hall wrote:
> 
> Dave Hall
> Binghamton University
> kdhall@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
> 
> 
> On 02/23/2015 06:18 AM, Emmanuel Florac wrote:
> >On Sun, 22 Feb 2015 18:35:19 -0500
> >Dave Hall<kdhall@binghamton.edu> wrote:
> >
> >>So since I have a fresh array that's not in production yet I was
> >>hoping to get some pointers on how to configure it to maximize XFS
> >>performance.  In particular, I've seen a suggestion that a
> >>multipathed array should be sliced up into logical drives and pasted
> >>back together with LVM.  Wondering also about putting the journal in
> >>a separate logical drive on the same array.
> >What's the hardware configuration like? before multipathing, you need
> >to know if your RAID controller and disks can actually saturate your
> >link. Generally SAS-attached enclosures are driven through a 4 way
> >SFF-8088 cable, with a bandwidth of 4x 6Gbps (maximum throughput per
> >link: 3 GB/s) or 4 x 12 Gbps (max thruput: 6 GB/s).
> >
> The new hardware is an Infortrend with 16 x 2TB 6Gbps SAS drives.
> It has one controller with dual 6Gbps SAS ports.  The server
> currently has two 3Gbps SAS HBAs.
> 
> On an existing array based on similar but slightly slower hardware,
> I'm getting miserable performance.  The bottleneck seems to be on
> the server side.  For specifics, the array is laid out as a single
> 26TB volume and attached by a single 3Gbps SAS. 

So, 300MB/s max throughput.

> The server is quad
> 8-core Xeon with 128GB RAM.  The networking is all 10GB.  The
> application is rsnapshot which is essentially a series of rsync
> copies where the unchanged files are hard-linked from one snapshot
> to the next.  CPU utilization is very low and only a few cores seem
> to be active.  Yet the operation is taking hours to complete.

rsync is likely limited by network throughput and round trip
latency. Test your storage performance locally first, see if it
performs as expected.
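
A crude streaming test is enough to see whether the array itself is
the bottleneck; a sketch, assuming the array is mounted at /mnt/array:

    # sequential write, bypassing the page cache
    dd if=/dev/zero of=/mnt/array/ddtest bs=1M count=8192 oflag=direct
    # sequential read back
    dd if=/mnt/array/ddtest of=/dev/null bs=1M iflag=direct
    rm /mnt/array/ddtest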

> The premise that was presented to me by someone in the storage
> business is that with 'many' processor cores one should slice a
> large array up into segments, multipath the whole deal, and then
> mash the segments back together with LVM (or MD).

No, that's just a bad idea. CPU and memory locality is the least of
your worries, and won't have any influence on performance at such low
speeds. When you start getting up into the multiple-GB/s of
throughput (note, GB/s not Gbps) locality matters more, but not for
what you are doing. And multipathing should be ignored until you've
characterised and understood single-port LUN performance.

> Since the kernel
> would ultimately see a bunch of smaller storage segments that were
> all getting activity, it should dispatch a set of cores for each
> storage segment and get the job done faster.  I think in theory this
> would even work to some extent on a single-path SAS connection.

The kernel already does most of the necessary locality stuff for
optimal performance for you.

> >>I am able to set up a 2-way multipath right now, and I might be able
> >>to justify adding a second controller to the array to get a 4-way
> >>multipath going.
> >A multipath can double the throughput, provided that you have enough
> >drives: you'll need about 24 7k RPM drives to saturate _one_ 4x6Gbps
> >SAS link. If you have only 12 drives, dual attachment probably won't
> >yield much.
> >
> >>Even if the LVM approach is the wrong one, I clearly have a rare
> >>chance to set this array up the right way.  Please let me know if you
> >>have any suggestions.
> >In my experience, software RAID-0 with md gives slightly better
> >performance than LVM, though not much.
> >
> >
> MD RAID-0 seems as likely as LVM, so I'd probably try that first.
> The big question is how to size the slices of the array

Doesn't really matter for RAID 0.

> to make XFS
> happy and then how to make sure XFS knows about it.

IF you are using MD, then mkfs.xfs will pick up the config
automatically from the MD device.
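
For example (device names here are only placeholders), striping two
LUNs together with md and letting mkfs.xfs read the geometry from the
md device:

    mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=256 \
          /dev/mapper/mpatha /dev/mapper/mpathb
    mkfs.xfs /dev/md0    # sunit/swidth are derived from the md layout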

> Secondly, there
> is the question of the log volume.  Seems that with multipath there
> might be some possible advantage to putting this in it's on slice on
> the array so that log writes could be in an I/O stream that is
> managed separately from the rest.

There are very few workloads where an external log makes any sense
these days. Log bandwidth is generally a minor part of any workload,
and non-volatile write caches aggregate the sequential writes to
the point where they impose very little physical IO overhead on the
array...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: XFS/LVM/Multipath on a single RAID volume
       [not found]                                             ` <54ED01BC.6080302@binghamton.edu>
@ 2015-02-24 23:33                                               ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2015-02-24 23:33 UTC (permalink / raw)
  To: Dave Hall; +Cc: xfs

[cc the XFS list again]

On Tue, Feb 24, 2015 at 05:57:00PM -0500, Dave Hall wrote:
> Dave,
> 
> I'm not going to post any more of my noob questions. 

Which defeats the purpose of having a public, archived list - other
people can find your questions and the answers through search
engines like Google.

> Sounds like
> about the best I could do would be to get a faster HBA (planned) and
> just go for it.  Also sounds like I might want to look at breaking
> up some of the large rsyncs that are running inside rsnapshot.  Perhaps
> it's just the directory tree traversal that's killing my
> performance.

Most likely - that's small, random IO and will almost always be seek
bound on spinning disks.

> One last question - format options:  I seem to recall that there are
> some parameters on the mkfs - su, sw, etc.  Do I need to specify
> those when I set up this new volume or can mkfs.xfs calculate them
> correctly, now?

XFS has calculated them correctly for years when you are using MD or
LVM for software striping. Nowadays it even works with some hardware
RAID, but support is still vendor and hardware specific. That's when
you may have to specify it manually, as per the FAQ:

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
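
If you do end up on hardware RAID that isn't detected, it boils down
to passing the geometry by hand; a sketch with made-up numbers (a
16-disk RAID6, i.e. 14 data disks, with a 128k per-disk stripe unit):

    mkfs.xfs -d su=128k,sw=14 /dev/sdc1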

> Also, I saw something about formatting differently
> for a workload like email with many small files, vs. a media
> workload that's focused on large files.  Since rsnapshot has to
> create a new directory tree for every snapshot I'm going to say it's
> closer to the email workload.  Any guidance on that?

Set up your storage config to be optimal for your workload, and XFS
should set its defaults appropriately. If you have a random seek
bound workload, though, there's very little you can tweak at the
filesystem level that will make any significant difference to
performance. In these cases, it's better to buy big, cheap SSDs than
expensive spinning disks if you need better performance for this
sort of workload.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: XFS/LVM/Multipath on a single RAID volume
  2015-02-24 22:04                                         ` Dave Hall
  2015-02-24 22:33                                           ` Dave Chinner
@ 2015-02-25 11:21                                           ` Emmanuel Florac
  1 sibling, 0 replies; 32+ messages in thread
From: Emmanuel Florac @ 2015-02-25 11:21 UTC (permalink / raw)
  To: Dave Hall; +Cc: stan, xfs

On Tue, 24 Feb 2015 17:04:35 -0500
Dave Hall <kdhall@binghamton.edu> wrote:

> The new hardware is an Infortrend with 16 x 2TB 6Gbps SAS drives.  It 
> has one controller with dual 6Gbps SAS ports.  The server currently
> has two 3Gbps SAS HBAs.

In my experience with these kinds of controllers, they perform quite
poorly with more than 1 RAID-6 array. I'd go for a single RAID-6
array. Then, as you said, you'll have to do multipath LVM to create two
LVs to stripe together to use both your HBAs and get some more
performance.

However, with only 16 7k RPM drives you can't hope for much more than 1.5
GByte/s, which is achievable with only one 3Gb SAS HBA...


-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: XFS/LVM/Multipath on a single RAID volume
  2015-02-24 22:33                                           ` Dave Chinner
       [not found]                                             ` <54ED01BC.6080302@binghamton.edu>
@ 2015-02-25 11:49                                             ` Emmanuel Florac
  1 sibling, 0 replies; 32+ messages in thread
From: Emmanuel Florac @ 2015-02-25 11:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Hall, stan, xfs

On Wed, 25 Feb 2015 09:33:44 +1100
Dave Chinner <david@fromorbit.com> wrote:

> > On an existing array based on similar but slightly slower hardware,
> > I'm getting miserable performance.  The bottleneck seems to be on
> > the server side.  For specifics, the array is laid out as a single
> > 26TB volume and attached by a single 3Gbps SAS.   
> 
> So, 300MB/s max throughput.
> 

Ah yes, maybe external RAID controllers can only use one SAS channel
out of the 4 available; that would definitely limit performance badly.
This limitation doesn't apply to internal RAID controllers (Adaptec, LSI,
Areca) driving a JBOD though.

I'll do a short digression on external storage enclosures: they're
mostly useful to provide redundant controllers. If you're using only one
controller, cheap ones (such as Infortrend, Promise and the like) will
always perform poorly compared to a modern PCIe RAID controller.

High-end storage enclosures (DotHill, NetApp, etc) with high-bandwidth
attachments (FC or IB) provide better performance AND redundancy, but
at a hefty price.

So if you want fast, cheap arrays, definitely use Adaptec/LSI/Areca and
simple JBOD chassis like supermicro's.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-02-25 11:49 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-13 18:11 xfs_fsr, sunit, and swidth Dave Hall
2013-03-13 23:57 ` Dave Chinner
2013-03-14  0:03 ` Stan Hoeppner
     [not found]   ` <514153ED.3000405@binghamton.edu>
2013-03-14 12:26     ` Stan Hoeppner
2013-03-14 12:55       ` Stan Hoeppner
2013-03-14 14:59         ` Dave Hall
2013-03-14 18:07           ` Stefan Ring
2013-03-15  5:14           ` Stan Hoeppner
2013-03-15 11:45             ` Dave Chinner
2013-03-16  4:47               ` Stan Hoeppner
2013-03-16  7:21                 ` Dave Chinner
2013-03-16 11:45                   ` Stan Hoeppner
2013-03-25 17:00                   ` Dave Hall
2013-03-27 21:16                     ` Stan Hoeppner
2013-03-29 19:59                       ` Dave Hall
2013-03-31  1:22                         ` Dave Chinner
2013-04-02 10:34                           ` Hans-Peter Jansen
2013-04-03 14:25                           ` Dave Hall
2013-04-12 17:25                             ` Dave Hall
2013-04-13  0:45                               ` Dave Chinner
2013-04-13  0:51                               ` Stan Hoeppner
2013-04-15 20:35                                 ` Dave Hall
2013-04-16  1:45                                   ` Stan Hoeppner
2013-04-16 16:18                                   ` Dave Chinner
2015-02-22 23:35                                     ` XFS/LVM/Multipath on a single RAID volume Dave Hall
2015-02-23 11:18                                       ` Emmanuel Florac
2015-02-24 22:04                                         ` Dave Hall
2015-02-24 22:33                                           ` Dave Chinner
     [not found]                                             ` <54ED01BC.6080302@binghamton.edu>
2015-02-24 23:33                                               ` Dave Chinner
2015-02-25 11:49                                             ` Emmanuel Florac
2015-02-25 11:21                                           ` Emmanuel Florac
2013-03-28  1:38                     ` xfs_fsr, sunit, and swidth Dave Chinner
