* XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
@ 2012-04-05 18:10 Stefan Ring
  2012-04-05 19:56 ` Peter Grandi
                   ` (3 more replies)
  0 siblings, 4 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-05 18:10 UTC (permalink / raw)
  To: xfs

Encouraged by reading about the recent improvements to XFS, I decided
to give it another try on a new server machine. I am happy to report
that compared to my previous tests a few years ago, performance has
progressed from unusably slow to barely acceptable, but still lagging
behind ext4, which is a noticeable (and notable) improvement indeed
;).

The filesystem operations I care about most are those that involve
thousands of small files across lots of directories, like
large trees of source code. For my test, I created a tarball of a
finished IcedTea6 build, about 2.5 GB in size. It contains roughly
200,000 files in 20,000 directories. The test I want to report about
here was extracting this tarball onto an XFS filesystem. I tested
other actions as well, but they didn't reveal anything too noticeable.

So the test consists of nothing but un-tarring the archive, followed
by a "sync" to make sure that the time-to-disk is measured. Prior to
running it, I had populated the filesystem in the following way:

I created two directory hierarchies, each containing the unpacked
tarball 20 times, which I rsynced simultaneously to the target
filesystem. When this was done, I deleted one half of them, creating
some free space fragmentation, and what I hoped would mimic real-world
conditions to some degree.

So now to the test itself -- the tar "x" command returned quite fast
(on the order of only a few seconds), but the following sync took
ages. I created a diagram using seekwatcher, and it reveals that the
disk head jumps about wildly between four zones which are written to
in almost perfectly linear fashion.
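
For reference, the measured part boils down to something like the
following; the tarball path, the mount point and the traced device are
placeholders, and the seekwatcher invocation is quoted from memory:

  cd /mnt/test
  time tar xf /tmp/icedtea6-build.tar   # returns after a few seconds
  time sync                             # this is where the minutes go

  # the graphs come from a blktrace capture of the same run:
  blktrace -d /dev/sdb -o trace &
  # ... run the test, then stop blktrace ...
  seekwatcher -t trace -o xfs-ag4.png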

When I reran the test with only a single allocation group, behavior
was much better (about twice as fast).

OTOH, when I continuously extracted the same tarball in a loop without
syncing in between, it would progressively slow down in the ag=1 case
to the point of being unacceptably slow. The same behavior did not
occur with ag=4.

I am aware that no filesystem can be optimal, but given that the
entire write set -- all 2.5 GB of it -- is "known" to the file system,
that is, in memory, wouldn't it be possible to write it out to disk in
a somewhat more reasonable fashion?

This is the seekwatcher graph:
http://dl.dropbox.com/u/5338701/dev/xfs/xfs-ag4.png

And for comparison, the same on ext4, on the same partition primed in
the same way (parallel rsyncs mentioned above):
http://dl.dropbox.com/u/5338701/dev/xfs/ext4.png

As can be seen from the time scale in the bottom part, the ext4
version performed about 5 times as fast because of a much more
disk-friendly write pattern.

I ran the tests with a current RHEL 6.2 kernel and also with a 3.3rc2
kernel. Both of them exhibited the same behavior. The disk hardware
used was a SmartArray p400 controller with 6x 10k rpm 300GB SAS disks
in RAID 6. The server has plenty of RAM (64 GB).


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
@ 2012-04-05 19:56 ` Peter Grandi
  2012-04-05 22:41   ` Peter Grandi
  2012-04-06 14:36   ` Peter Grandi
  2012-04-05 21:37 ` Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 64+ messages in thread
From: Peter Grandi @ 2012-04-05 19:56 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> The filesystem operations I care about the most are the likes which
> involve thousands of small files across lots of directories, like
> large trees of source code. For my test, I created a tarball of a
> finished IcedTea6 build, about 2.5 GB in size. It contains roughly
> 200,000 files in 20,000 directories.

Ah, another totally inappropriate "test" of something (euphemism)
insipid. The XFS mailing list regularly gets queries on this topic.

Apparently not many people have figured out in the Linux culture
that general purpose filesystems cannot handle well large groups
of small files, and since the beginning of computing various
forms of "aggregate" files have been used for that, like 'ar'
('.a') files from UNIX, which should have been used far more
commonly than has happened, and never mind things like BDB/GDBM
databases.

But many lazy application programmers like to use the filesystem
as a small-record database; it is so easy...

> [ ... ] I ran the tests with a current RHEL 6.2 kernel and
> also with a 3.3rc2 kernel. Both of them exhibited the same
> behavior. The disk hardware used was a SmartArray p400
> controller with 6x 10k rpm 300GB SAS disks in RAID 6. The
> server has plenty of RAM (64 GB). [ ... ]

Huge hardware, but (euphemism) imaginative setup, as among its
many defects RAID6 is particularly inappropriate for most small
file/metadata heavy operations.

> [ ... ] I created two directory hierarchies, each containing
> the unpacked tarball 20 times, which I rsynced simultaneously
> to the target filesystem. When this was done, I deleted one
> half of them, creating some free space fragmentation, and what
> I hoped would mimic real-world conditions to some degree.

Your test is less (euphemism) insignificant than most because you
tried to cope with filetree lifetime issues.

> [ ... ] disk head jumps about wildly between four zones which
> are written to in almost perfectly linear fashion.

> [ ... ] I am aware that no filesystem can be optimal,

Every filesystem can be close to optimal, just not for every
workload.

> but given that the entire write set -- all 2.5 GB of it -- is
> "known" to the file system, that is, in memory, wouldn't it be
> possible to write it out to disk in a somewhat more reasonable
> fashion?

That sounds to me like a (euphemism) strategic aim: why ever
should a filesystem optimize that special case? Especially given
that XFS does spread file allocations across AGs because it aims
for multithreaded operations, especially on RAID sets with several
independent (that is, not RAID6 with small writes) arms.

Unfortunately filesystems are not psychic and cannot use
predictive allocation policies, and have to cope with poorly
written applications that don't do advising (or, even worse, don't
'fsync' properly). So some policies get hard-wired into the
filesystem "flavor".

Your remedy, as you have noticed, is to tweak the filesystem
logic by changing the number of AGs, and you might also want to
experiment with the elevator (you seem to have forgotten about
that) and other block subsystem policies, and/or with the safety
vs.  latency tradeoffs available at the filesystem and storage
system levels.
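
For instance, switching the elevator is just a matter of (the device
name below is only a placeholder):

  cat /sys/block/sdb/queue/scheduler
  echo deadline > /sys/block/sdb/queue/scheduler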

There are many annoying details, and recentish versions of XFS
try to help with the hideous hack of building an elevator inside
the filesystem code itself:

  http://oss.sgi.com/archives/xfs/2010-01/msg00011.html
  http://oss.sgi.com/archives/xfs/2010-01/msg00008.html

which however is sort of effective, because the Linux block IO
subsystem has several (euphemism) appalling issues.

> As can be seen from the time scale in the bottom part, the ext4
> version performed about 5 times as fast because of a much more
> disk-friendly write pattern.

Is it really disk friendly for every workload? Think about what
happens on 'ext4' there when it jumps between block groups: it is
in effect doing commits in a different order. What 'ext4' does
costs dearly on other workload types.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
  2012-04-05 19:56 ` Peter Grandi
@ 2012-04-05 21:37 ` Christoph Hellwig
  2012-04-06  1:09   ` Peter Grandi
  2012-04-06  8:25   ` Stefan Ring
  2012-04-05 22:32 ` Roger Willcocks
  2012-04-05 23:07 ` Peter Grandi
  3 siblings, 2 replies; 64+ messages in thread
From: Christoph Hellwig @ 2012-04-05 21:37 UTC (permalink / raw)
  To: Stefan Ring; +Cc: xfs

Hi Stefan,

thanks for the detailed report.

The seekwatcher graph makes it very clear that XFS is spreading I/O over
the 4 allocation groups, while ext4 isn't.  There are a couple of reasons
why XFS is doing that, including to max out multiple devices in a
multi-device setup, and not totally killing read speed.

Can you try a few mount options for me, both all together and, if you
have some time, also individually?

 -o inode64

	This allows inodes to be close to data even for >1TB
	filesystems.  It's something we hope to make the default soon.

 -o filestreams

	This keeps data written in a single directory group together.
	Not sure your directories are large enough to really benefit
	from it, but it's worth a try.

 -o allocsize=4k

	This disables the aggressive file preallocation we do in XFS,
	which sounds like it's not useful for your workload.
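
All three together would look something like this (device and mount
point are placeholders for your actual ones):

  umount /mnt/test
  mount -o inode64,filestreams,allocsize=4k /dev/sdb1 /mnt/test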

> I ran the tests with a current RHEL 6.2 kernel and also with a 3.3rc2
> kernel. Both of them exhibited the same behavior. The disk hardware
> used was a SmartArray p400 controller with 6x 10k rpm 300GB SAS disks
> in RAID 6. The server has plenty of RAM (64 GB).

For metadata intensive workloads like yours you would be much better
off using a non-striping raid, e.g. concatenation and mirroring instead
of raid 5 or raid 6.  I know this has a cost in terms of "wasted" space,
but for IOPS-bound workloads the difference is dramatic.


P.S. Please ignore Peter - he has made a name for himself as not only
technically incompetent but also extremely abrasive.  He is in no way
associated with the XFS development team.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
  2012-04-05 19:56 ` Peter Grandi
  2012-04-05 21:37 ` Christoph Hellwig
@ 2012-04-05 22:32 ` Roger Willcocks
  2012-04-06  7:11   ` Stefan Ring
  2012-04-05 23:07 ` Peter Grandi
  3 siblings, 1 reply; 64+ messages in thread
From: Roger Willcocks @ 2012-04-05 22:32 UTC (permalink / raw)
  To: Stefan Ring; +Cc: xfs

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s10.html


On 5 Apr 2012, at 19:10, Stefan Ring wrote:

> Encouraged by reading about the recent improvements to XFS, I decided
> to give it another try on a new server machine. [ ... ]


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 19:56 ` Peter Grandi
@ 2012-04-05 22:41   ` Peter Grandi
  2012-04-06 14:36   ` Peter Grandi
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Grandi @ 2012-04-05 22:41 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> Apparently not many people have figured out in the Linux
> culture that general purpose filesystems cannot handle well
> large groups of small files, and since the beginning of
> computing various forms of "aggregate" files have been used
> for that, like 'ar' ('.a') files from UNIX, which should have
> been used far more commonly than has happened, and never mind
> things like BDB/GDBM databases.

As to this, another filesystem strongly oriented towards massive
streaming, Lustre, is sometimes used for small-file workloads,
and one of the suggestions given for that is to put the small
files inside an 'ext2' filesystem in a file, and mount it via
'loop'.  That amounts to using 'ext2' (or some other filesystem
type) as an archive format, which is less crazy than it seems.

[ ... ]


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
                   ` (2 preceding siblings ...)
  2012-04-05 22:32 ` Roger Willcocks
@ 2012-04-05 23:07 ` Peter Grandi
  2012-04-06  0:13   ` Peter Grandi
                     ` (2 more replies)
  3 siblings, 3 replies; 64+ messages in thread
From: Peter Grandi @ 2012-04-05 23:07 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> [ ... ] tarball of a finished IcedTea6 build, about 2.5 GB in
> size. It contains roughly 200,000 files in 20,000 directories.
> [ ... ] given that the entire write set -- all 2.5 GB of it --
> is "known" to the file system, that is, in memory, wouldn't it
> be possible to write it out to disk in a somewhat more
> reasonable fashion?  [ ... ] The disk hardware used was a
> SmartArray p400 controller with 6x 10k rpm 300GB SAS disks in
> RAID 6. The server has plenty of RAM (64 GB).

On reflection this triggers an aside for me: traditional
filesystem types are designed for the case where the ratio is
the opposite, something like a 64GB data collection to process
and 2.5GB of RAM, and where therefore the issue is minimizing
ongoing disk accesses, not the upload from memory to disk of a
bulk sparse set of stuff.

The Sprite Log-structured File System was a design targeted at
large-memory systems, assuming that writes are then the issue
(especially as Sprite was network-based), and reads would mostly
happen from RAM, as in your (euphemism) insipid test.

I suspect that if the fundamental tradeoffs are inverted, then a
completely different design like a LFS might be appropriate.

But the above has a relationship to your (euphemism) unwise
concerns: the case where 200,000 files for 2.5GB are completely
written to RAM and then flushed as a whole to disk is not only
"untraditional", it is also (euphemism) peculiar: try setting
the flusher to run rather often so that not more than 100-300MB
of dirty pages are left at any one time.
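
One way of trying that, with thresholds that are purely illustrative
rather than a tuned recommendation:

  sysctl -w vm.dirty_background_bytes=$((100*1024*1024))
  sysctl -w vm.dirty_bytes=$((300*1024*1024))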

Which brings another subject: usually hw RAID host adapter have
cache, and have firmware that cleverly rearranges writes.

Looking at the specs of the P400:

  http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/

it seems to me that it has standard 256MB of cache, and only
supports RAID6 with a battery backed write cache (wise!).

Which means that your Linux-level seek graphs may be not so
useful, because the host adapter may be drastically rearranging
the seek patterns, and you may need to tweak the P400 elevator,
rather than or in addition to the Linux elevator.

Unless possibly barriers are enabled, and even with a BBWC the
P400 writes through on receiving a barrier request. IIRC XFS is
rather stricter in issuing barrier requests than 'ext4', and you
may be seeing more the effect of that than the effect of aiming
to splitting the access patterns between 4 AGs to improve the
potential for multithreading (which you deny because you are
using what is most likely a large RAID6 stripe size with a small
IO intensive write workload, as previously noted).


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 23:07 ` Peter Grandi
@ 2012-04-06  0:13   ` Peter Grandi
  2012-04-06  7:27     ` Stefan Ring
  2012-04-06  0:53   ` Peter Grandi
  2012-04-06  5:53   ` Stefan Ring
  2 siblings, 1 reply; 64+ messages in thread
From: Peter Grandi @ 2012-04-06  0:13 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> Which means that your Linux-level seek graphs may be not so
> useful, because the host adapter may be drastically rearranging
> the seek patterns, and you may need to tweak the P400 elevator,
> rather than or in addition to the Linux elevator.

> Unless possibly barriers are enabled, and even with a BBWC the
> P400 writes through on receiving a barrier request. IIRC XFS is
> rather stricter in issuing barrier requests than 'ext4', and you
> may be seeing more the effect of that than the effect of aiming
> to splitting the access patterns between 4 AGs [ ... ]

As to this, in theory even having split the files among 4 AGs,
the upload from system RAM to host adapter RAM and then to disk
could happen by writing first all the dirty blocks for one AG,
then a long seek to the next AG, and so on, and the additional
cost of 3 long seeks would be negligible. 

That you report a significant slowdown indicates that this is
not happening, and that likely XFS flushing is happening not in
spacewise order but in timewise order.

The seek graphs you have gathered indeed indicate that with
'ext4' there is a spacewise flush, while with XFS the flush
alternates constantly among the 4 AGs, instead of doing each AG
in turn. Which seems to indicate an elevator issue or a barrier
issue after the delayed allocator has assigned block addresses
to the various pages being flushed.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 23:07 ` Peter Grandi
  2012-04-06  0:13   ` Peter Grandi
@ 2012-04-06  0:53   ` Peter Grandi
  2012-04-06  7:32     ` Stefan Ring
  2012-04-06  5:53   ` Stefan Ring
  2 siblings, 1 reply; 64+ messages in thread
From: Peter Grandi @ 2012-04-06  0:53 UTC (permalink / raw)
  To: Linux fs XFS

> [ ... ] Which brings another subject: usually hw RAID host
> adapter have cache, and have firmware that cleverly rearranges
> writes. Looking at the specs of the P400: [ ... ] it seems to
> me that it has standard 256MB of cache, and only supports
> RAID6 with a battery backed write cache (wise!). [ ... ]

Uhm, looking further into the P400 an interesting detail:

http://hardforum.com/showpost.php?s=c19964285e760bee47b8558ae82899d5&p=1033958051&postcount=4
 «One is a stick of memory with a battery attached to it and one
  without. The one without is what the basic models ship with
  and usually has either 256 or 512Mb of memory, it supports
  caching for read operations only. [ ... ] You need the battery
  backed write cache module if you want to be able to use/turn
  on write caching on the array controller which makes a huge
  difference for write performance in general and is pretty much
  critical for raid 5 performance on writes.»

It may be worthwhile to check whether there is an enabled BBWC,
because if there is, the host adapter should be buffering writes
up to 256MiB/512MiB and sorting them, so long inter-AG seeks
should be happening only 10 or 5 times, or not much more than
that (the minimum being 4 times). Instead it may be that the
P400 is doing write-through, which would reflect the unsorted
seek pattern at the Linux->host adapter level into the host
adapter->disk drive level.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 21:37 ` Christoph Hellwig
@ 2012-04-06  1:09   ` Peter Grandi
  2012-04-06  8:25   ` Stefan Ring
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Grandi @ 2012-04-06  1:09 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> For metadata intensive workloads like yours you would be much
> better using a non-striping raid, e.g. concatentation and
> mirroring instead of raid 5 or raid 6. I know this has a cost
> in terms of "wasted" space, but for IOPs bound workload the
> difference is dramatic.

The problem with parity RAIDs and small-write-IO intensive
workloads is not striping as such, it is with *large* stripes.
That is a detail that matters quite a lot.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 23:07 ` Peter Grandi
  2012-04-06  0:13   ` Peter Grandi
  2012-04-06  0:53   ` Peter Grandi
@ 2012-04-06  5:53   ` Stefan Ring
  2012-04-06 15:35     ` Peter Grandi
  2012-04-07 19:11     ` Peter Grandi
  2 siblings, 2 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-06  5:53 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS

> Which brings another subject: usually hw RAID host adapter have
> cache, and have firmware that cleverly rearranges writes.
>
> Looking at the specs of the P400:
>
>  http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/
>
> it seems to me that it has standard 256MB of cache, and only
> supports RAID6 with a battery backed write cache (wise!).
>
> Which means that your Linux-level seek graphs may be not so
> useful, because the host adapter may be drastically rearranging
> the seek patterns, and you may need to tweak the P400 elevator,
> rather than or in addition to the Linux elevator.
>
> Unless possibly barriers are enabled, and even with a BBWC the
> P400 writes through on receiving a barrier request. IIRC XFS is
> rather stricter in issuing barrier requests than 'ext4', and you
> may be seeing more the effect of that than the effect of aiming
> to splitting the access patterns between 4 AGs to improve the
> potential for multithreading (which you deny because you are
> using what is most likely a large RAID6 stripe size with a small
> IO intensive write workload, as previously noted).

Yes, it does have 256 MB BBWC, and it is enabled. When I disabled it,
the time needed would rise from 120 sec in the BBWC case to a whopping
330 sec.

IIRC, I did the benchmark with barrier=0, but changing this did not
make a big difference. Nothing did; that’s what frustrated me a bit
;). I also tried different Linux IO elevators, as you suggested in
your other response, without any measurable effect.

The stripe size is this, btw.: su=16k,sw=4
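
For reference, that geometry expressed as mkfs.xfs options would be
(the device name here is just a placeholder):

  mkfs.xfs -d su=16k,sw=4 /dev/sdb1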


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 22:32 ` Roger Willcocks
@ 2012-04-06  7:11   ` Stefan Ring
  2012-04-06  8:24     ` Stefan Ring
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-06  7:11 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: xfs

On Fri, Apr 6, 2012 at 12:32 AM, Roger Willcocks <roger@filmlight.ltd.uk> wrote:
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s10.html

This sounds like it could help very much. I'll try that. Thanks!


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  0:13   ` Peter Grandi
@ 2012-04-06  7:27     ` Stefan Ring
  2012-04-06 23:28       ` Stan Hoeppner
  2012-04-07 16:50       ` Peter Grandi
  0 siblings, 2 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-06  7:27 UTC (permalink / raw)
  To: Linux fs XFS

> As to this, in theory even having split the files among 4 AGs,
> the upload from system RAM to host adapter RAM and then to disk
> could happen by writing first all the dirty blocks for one AG,
> then a long seek to the next AG, and so on, and the additional
> cost of 3 long seeks would be negligible.

Yes, that’s exactly what I had in mind, and what prompted me to write
this post. It would be about 10 times as fast. That’s what bothers me
so much.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  0:53   ` Peter Grandi
@ 2012-04-06  7:32     ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-06  7:32 UTC (permalink / raw)
  To: Linux fs XFS

> http://hardforum.com/showpost.php?s=c19964285e760bee47b8558ae82899d5&p=1033958051&postcount=4
>  «One is a stick of memory with a battery attached to it and one
>  without. The one without is what the basic models ship with
>  and usually has either 256 or 512Mb of memory, it supports
>  caching for read operations only. [ ... ] You need the battery
>  backed write cache module if you want to be able to use/turn
>  on write caching on the array controller which makes a huge
>  difference for write performance in general and is pretty much
>  critical for raid 5 performance on writes.»
>
> It may be worthwhile to check if there is an enabled BBWC
> because if there is the BBWC the host adapter should be
> buffering writes up to 256MiB/512MiB and sorting them thus long
> inter-AG seeks should be happening only 10 or 5 times or not
> much more (4 times) that. Instead it may be happening that the
> P400 is doing write-through, which would reflect the unsorted
> seek pattern at the Linux->host adapter level into the host
> adapter->disk drive level.

The write cache can also be enabled without a battery present (at
considerable risk), but I insisted on getting a battery. It is enabled,
and it makes a noticeable difference. Without it, it’s even slower
(by more than a factor of 2).


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  7:11   ` Stefan Ring
@ 2012-04-06  8:24     ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-06  8:24 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: xfs

On Fri, Apr 6, 2012 at 9:11 AM, Stefan Ring <stefanrin@gmail.com> wrote:
> On Fri, Apr 6, 2012 at 12:32 AM, Roger Willcocks <roger@filmlight.ltd.uk> wrote:
>> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s10.html
>
> This sounds like it could help very much. I'll try that. Thanks!

Unfortunately, the documentation says it’s only effective for
filesystems > 1TB, which mine isn’t. I tried it, and it doesn’t make a
difference, which is to be expected.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 21:37 ` Christoph Hellwig
  2012-04-06  1:09   ` Peter Grandi
@ 2012-04-06  8:25   ` Stefan Ring
  2012-04-07 18:57     ` Martin Steigerwald
  1 sibling, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-06  8:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

> thanks for the detailed report.

Thanks for the detailed and kind answer.

> Can you try a few mount options for me both all together and if you have
> some time also individually.
>
>  -o inode64
>
>        This allows inodes to be close to data even for >1TB
>        filesystems.  It's something we hope to make the default soon.

The filesystem is not that large. It’s only 400GB. I turned it on
anyway. No difference.

>  -o filestreams
>
>        This keeps data written in a single directory group together.
>        Not sure your directories are large enough to really benefit
>        from it, but it's worth a try.
>  -o allocsize=4k
>
>        This disables the agressive file preallocation we do in XFS,
>        which sounds like it's not useful for your workload.

inode64+filestreams: no difference
inode64+allocsize: no difference
inode64+filestreams+allocsize: no difference :(

> For metadata intensive workloads like yours you would be much better
> using a non-striping raid, e.g. concatentation and mirroring instead of
> raid 5 or raid 6.  I know this has a cost in terms of "wasted" space,
> but for IOPs bound workload the difference is dramatic.

Hmm, I’m sure you’re right, but I’m out of luck here. If I had 24
drives, I could think about a different organization. But with only 6
bays, I cannot give up all that space.

*In theory*, though, it *should* be possible to run fast for
write-only workloads. The stripe size is 64 KB (4x16), and it’s not
like data is written all over the place. So it should very well be
possible to write the data out in some reasonably sized and aligned
chunks. The filesystem partition itself is nicely aligned.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-05 19:56 ` Peter Grandi
  2012-04-05 22:41   ` Peter Grandi
@ 2012-04-06 14:36   ` Peter Grandi
  2012-04-06 15:37     ` Stefan Ring
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Grandi @ 2012-04-06 14:36 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> [ ... ]  general purpose filesystems cannot handle well large
> groups of small files,
[ ... ]

>> As can be seen from the time scale in the bottom part, the
>> ext4 version performed about 5 times as fast because of a much
>> more disk-friendly write pattern.

As to 'ext4' and doing (euphemism) insipid tests involving
peculiar setups, there is an interesting story in this post:

  http://oss.sgi.com/archives/xfs/2012-03/msg00465.html

on the perils of using 'tar x' as a "test" of something
meaningful (illustrated using a much smaller "test" than
yours).

The telling detail was that there was a ratio of 227 times (6
seconds versus 23 minutes) between running 'tar x' without any
safety and with most safeties.

A ratio of 227 times indicates that there is something big going
on, which is that contemporary disk drives have 2 orders of
magnitude between bulk sequential and small random "speed" (which
is the major reason why «general purpose filesystems cannot
handle well large groups of small files»), and that in between
one can choose a vast number of different safety/speed tradeoffs
(or introduce performance problems :->).

Does that mean that 'ext4' has "Abysmal write performance" in
the 23 minutes case? No, just a different tradeoff.

Similarly XFS has had for a long time a mostly undeserved
reputation for being "slow" on small-IO/metadata intensive
workloads, in large part because traditionally it has been
designed to deliver a higher level of (implicit, metadata) safety
than other filesystems; for good reasons.

Therefore, as I argued in other comments, the «excessive seeking»
you report seems to me due more to storage layer issues and
perhaps a stricter interpretation of safety by XFS than to
something really wrong with XFS, which is a tool that has to be
deployed with consideration.

As to that, a comparison that does point a finger at the
underlying storage system:

  * In your graphs 'ext4' writes out 2.5GB of small files at
    around 100MB/s (and with relatively few long seeks on that
    workload) on an "enterprise" storage system that has 4+2
    disks each capable of 130MB/s.

  * In the 6s "test" I reported above, in a similar situation
    'ext4' wrote out 370MB also at not much less than 100MB/s,
    but on a single "consumer" disk on a much slower desktop.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  5:53   ` Stefan Ring
@ 2012-04-06 15:35     ` Peter Grandi
  2012-04-10 14:05       ` Stefan Ring
  2012-04-07 19:11     ` Peter Grandi
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Grandi @ 2012-04-06 15:35 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> Yes, it does have 256 MB BBWC, and it is enabled. When I
> disabled it, the time needed would rise from 120 sec in the
> BBWC case to a whopping 330 sec.

> IIRC, I did the benchmark with barrier=0, but changing this did not
> make a big difference.

Note that the syntax is slightly different between 'ext4' and
XFS, and if you use 'barrier=0' with XFS, it will mount the
filetree *with* barriers (just double checked).
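
To illustrate, with placeholder device and mount point: on 'ext4' it
would be

  mount -o barrier=0 /dev/sdb1 /mnt/test

while the XFS spelling for actually disabling barriers is

  mount -o nobarrier /dev/sdb1 /mnt/test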

> Nothing did; that’s what frustrated me a bit ;). I also tried
> different Linux IO elevators, as you suggested in your other
> response, without any measurable effect.

Here a lot depends on the firmware of the P400, because with a
BBWC in theory it can completely ignore the request ordering and
the barriers it receives from the Linux side (and barriers
*might* be disabled), so by and large the Linux elevator should
not matter.

  Note: what would look good in this very narrow example is
  something like 'anticipatory', and while that has disappeared
  IIRC there is a way to tweak 'cfq' to behave like that.

But the times you report are consistent with the notion that
your Linux side seek graph is what happens at the P400 level
too, which is something that should not happen with a BBWC.

If that's the case, tweaking the Linux side scheduling might
help, for example greatly increasing 'queue/nr_requests' and
'device/queue_depth' ('nr_requests' should apparently be at
least twice 'queue_depth' in most cases).
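
For example (device path and values are only illustrative):

  echo 512 > /sys/block/sdb/queue/nr_requests
  echo 254 > /sys/block/sdb/device/queue_depth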

Or else ensuring that the P400 does reorder requests and does
not write-through, as it has a BBWC.

Overall your test does not seem very notable to me, except that
it is strange that in the XFS 4 AG case the generated IO stream
(at the Linux level) is seeking incessantly between the 4 AGs
instead of in phases, and this apparently gets reflected to the
disks by the P400 even if it has a BBWC.

It is not clear to me why the seeks among the 4 AGs happen in
such a tightly interleaved way (barriers? The way journaling
works?) instead of a more bulky way.

The suggestion by another commenter to use 'rotorstep' (probably
set to a high value) may help then, as it bunches files in AGs.
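
That would be something along the lines of (the value is only an
example; the tunable ranges from 1 to 255):

  sysctl -w fs.xfs.rotorstep=255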

> The stripe size is this, btw.: su=16k,sw=4

BTW congratulations for limiting your RAID6 set to 4+2, and
using a relatively small chunk size compared to that chosen by
many others.

But it is still pretty large for this case: 64KiB when your
average file size is around 12KiB. Potentially lots of RMW, and
little opportunity to take advantage of the higher parallelism
of having 4 AGs with 4 independent streams of data.

As mentioned in another comment, I got nearly the same 'ext4'
writeout rate on a single disk...


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06 14:36   ` Peter Grandi
@ 2012-04-06 15:37     ` Stefan Ring
  2012-04-07 13:33       ` Peter Grandi
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-06 15:37 UTC (permalink / raw)
  To: Linux fs XFS

> As to 'ext4' and doing (euphemism) insipid tests involving
> peculiar setups, there is an interesting story in this post:
>
>  http://oss.sgi.com/archives/xfs/2012-03/msg00465.html

I really don't see the connection to this thread. You're advocating
mostly that tar use fsync on every file, which to me seems absurd. If
the system goes down halfway through tar extraction, I would delete
the tree and untar again. What do I care if some files are corrupt,
when the entire tree is incomplete anyway?

Despite the somewhat inflammatory thread subject, I don't want to bash
anyone. It's just that untarring large source trees is a very typical
workload for me. And I just don't want to accept that XFS cannot do
better than being several orders of magnitude slower than ext4
(speaking of binary orders of magnitude). As I see it, both file
systems give the same guarantees:

1) That upon completion of sync, all data is readily available on
permanent storage.
2) That the file system metadata doesn't suffer corruption, should the
system lose power during the operation.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  7:27     ` Stefan Ring
@ 2012-04-06 23:28       ` Stan Hoeppner
  2012-04-07  7:27         ` Stefan Ring
                           ` (2 more replies)
  2012-04-07 16:50       ` Peter Grandi
  1 sibling, 3 replies; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-06 23:28 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 4/6/2012 2:27 AM, Stefan Ring wrote:
>> As to this, in theory even having split the files among 4 AGs,
>> the upload from system RAM to host adapter RAM and then to disk
>> could happen by writing first all the dirty blocks for one AG,
>> then a long seek to the next AG, and so on, and the additional
>> cost of 3 long seeks would be negligible.
> 
> Yes, that’s exactly what I had in mind, and what prompted me to write
> this post. It would be about 10 times as fast. That’s what bothers me
> so much.

XFS is still primarily a "lots and large" filesystem.  Its allocation
group based design is what facilitates this.  Very wide stripe arrays
have horrible performance for most workloads, especially random IOPS
heavy workloads, and you won't see hardware that will allow arrays of
hundreds, let alone dozens of spindles in a RAID stripe set.  Say one
needs a high IOPS single 50TB filesystem.  We could use 4 Nexsan E60
arrays each containing 60 15k SAS drives of 450GB each, 240 drives
total.  Creating four 60 drive RAID10 arrays, let alone 60 drive RAID6
arrays, would be silly.

Instead, a far more optimal solution would be to set aside 4 spares per
chassis and create 14 four drive RAID10 arrays.  This would yield ~600
seeks/sec and ~400MB/s sequential throughput performance per 2 spindle
array.  We'd stitch the resulting 56 hardware RAID10 arrays together in
an mdraid linear (concatenated) array.  Then we'd format this 112
effective spindle linear array with simply:

$ mkfs.xfs -d agcount=56 /dev/md0

Since each RAID10 is 900GB capacity, we have 56 AGs of just under the
1TB limit, 1 AG per 2 physical spindles.  Due to the 2 stripe spindle
nature of the constituent hardware RAID10 arrays, we don't need to worry
about aligning XFS writes to the RAID stripe width.  The hardware cache
will take care of filling the small stripes.  Now we're in the opposite
situation of having too many AGs per spindle.  We've put 2 spindles in a
single AG and turned the seek starvation issues on its head.
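
The concatenation step above, sketched with placeholder names for the
56 hardware RAID10 member devices:

  mdadm --create /dev/md0 --level=linear --raid-devices=56 /dev/mapper/r10_{01..56}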

Given a workload with at least 56 threads, we can write 56 files in
parallel at ~400MB/s each, one to each AG, 22.4GB/s aggregate
throughput.  With this particular hardware, the 16x8Gb FC ports limit
total one way bandwidth to 12.8GB/s aggregate, or "only" 228MB/s per AG.
 Not too shabby.  But streaming bandwidth isn't the workload here.  This
setup will allow for ~30,000 random write IOPS with 56 writers. Not that
impressive compared to SSD, but you've got 50TB of space instead of a
few hundred gigs.

The moral of this story is this:  If XFS behaved the way you opine
above, each of these 56 AGs would be written in a serial fashion,
basically limiting the throughput of 112 effective 15k SAS spindles to
something along the lines of only ~400MB/s and ~600 random IOPS.  Note
that this hypothetical XFS storage system is tiny compared to some of
those in the wild.  NASA's Advanced Supercomputing Division alone has
deployed 500TB+ XFS filesystems on nested concatenated/striped arrays.

So while the XFS AG architecture may not be perfectly suited to your
single 6 drive RAID6 array, it still gives rather remarkable performance
given that the same architecture can scale pretty linearly to the
heights above, and far beyond.  Something EXTx and others could never
dream of.  Some of the SGI guys might be able to confirm deployed single
XFS filesystems spanning 1000+ drives in the past.  Today we'd probably
only see that scale with CXFS.

-- 
Stan


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06 23:28       ` Stan Hoeppner
@ 2012-04-07  7:27         ` Stefan Ring
  2012-04-07  8:53           ` Emmanuel Florac
                             ` (2 more replies)
  2012-04-07  8:49         ` Emmanuel Florac
  2012-04-09 14:21         ` Geoffrey Wehrman
  2 siblings, 3 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-07  7:27 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> Instead, a far more optimal solution would be to set aside 4 spares per
> chassis and create 14 four drive RADI10 arrays.  This would yield ~600
> seeks/sec and ~400MB/s sequential throughput performance per 2 spindle
> array.  We'd stitch the resulting 56 hardware RAID10 arrays together in
> an mdraid linear (concatenated) array.  Then we'd format this 112
> effective spindle linear array with simply:
>
> $ mkfs.xfs -d agcount=56 /dev/md0
>
> Since each RAID10 is 900GB capacity, we have 56 AGs of just under the
> 1TB limit, 1 AG per 2 physical spindles.  Due to the 2 stripe spindle
> nature of the constituent hardware RAID10 arrays, we don't need to worry
> about aligning XFS writes to the RAID stripe width.  The hardware cache
> will take care of filling the small stripes.  Now we're in the opposite
> situation of having too many AGs per spindle.  We've put 2 spindles in a
> single AG and turned the seek starvation issues on its head.

So it sounds like, for poor guys like us who can’t afford the
hardware to have dozens of spindles, the best option would be to
create the XFS file system with agcount=1? That seems to be the only
reasonable conclusion to me, since a single RAID device, like a single
disk, cannot write in parallel anyway.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06 23:28       ` Stan Hoeppner
  2012-04-07  7:27         ` Stefan Ring
@ 2012-04-07  8:49         ` Emmanuel Florac
  2012-04-08 20:33           ` Stan Hoeppner
  2012-04-09 14:21         ` Geoffrey Wehrman
  2 siblings, 1 reply; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-07  8:49 UTC (permalink / raw)
  To: stan; +Cc: Stefan Ring, Linux fs XFS

Le Fri, 06 Apr 2012 18:28:37 -0500 vous écriviez:

> Creating four 60 drive RAID10 arrays, let alone 60 drive RAID6
> arrays, would be silly.

From my experience, with modern arrays it doesn't make much of a
difference. I've reached decent IOPS (i.e. about 4000 IOPS) on large
arrays of up to 46 drives provided there are enough threads -- more
threads than spindles, preferably.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07  7:27         ` Stefan Ring
@ 2012-04-07  8:53           ` Emmanuel Florac
  2012-04-07 14:57           ` Stan Hoeppner
  2012-04-09  0:19           ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-07  8:53 UTC (permalink / raw)
  To: Stefan Ring; +Cc: stan, Linux fs XFS

Le Sat, 7 Apr 2012 09:27:50 +0200 vous écriviez:

> So it sounds like that for poor guys like us, who can’t afford the
> hardware to have dozens of spindles, the best option would be to
> create the XFS file system with agcount=1? That seems to be the only
> reasonable conclusion to me, since a single RAID device, like a single
> disk, cannot write in parallel anyway.

Your best option is to buy an SSD. Seriously, even a basic decent
consumer model will bury your array in the dust. Also, recent RAID
controllers from LSI and Adaptec are able to "enhance" a spinning rust
array by using an SSD as a cache.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06 15:37     ` Stefan Ring
@ 2012-04-07 13:33       ` Peter Grandi
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Grandi @ 2012-04-07 13:33 UTC (permalink / raw)
  To: Linux fs XFS

>> As to 'ext4' and doing (euphemism) insipid tests involving
>> peculiar setups, there is an interesting story in this post:

>>  http://oss.sgi.com/archives/xfs/2012-03/msg00465.html

> I really don't see the connection to this thread. You're
> advocating mostly that tar use fsync on every file, which to
> me seems absurd.

Rather different: I am pointing out that there is a fundamental
problem, that the spectrum of safety/speed tradeoffs covers 2
orders of magnitude as to speed, and that for equivalent points
XFS and 'ext4' don't perform that differently (factor of 2 in
this particular "test", which is sort of "noise").

  Note: it is Schilling who advocates for 'tar' to 'fsync' every
    file, and he gives some pretty good reasons why that should
    be the default, and why that should not be that expensive
    (which I think is a bit optimistic). My advocacy in that thread
    was that having different safety/speed tradeoffs is a good
    thing, if they are honestly represented as tradeoffs.

So it is likely that, if there is a significant difference, you are
getting a different tradeoff even if you may not *want* a
different tradeoff.

  Note: JFS and XFS are more or less as good as it gets as to
    "general purpose" filesystems, and when people complain
    about their "speed" the odds are that they are using them
    either improperly or in corner cases, or there is a problem
    in the application or storage layer. To get something better
    than JFS or XFS one must look at filesystems based on
    radically different tradeoffs, like NILFS2 (log), OCFS2
    (shareable) or BTRFS (COW). In your case perhaps NILFS2 would
    give the best results.

And that's what seems to be happening: 'ext4' seems to commit
metadata and data in spacewise order, XFS in timewise order,
because the seek order on writeout probably reflects the order
in which files were extracted from the 'tar' file.

> If the system goes down halfway through tar extraction, I
> would delete the tree and untar again. What do I care if some
> files are corrupt, when the entire tree is incomplete anyway?

Maybe you don't care; but filesystems are not psychic (they use
hardwired and adaptive policy, not predictive) and given that
most people seem to care the default for XFS is to try harder to
keep metadata durable.

Also various versions of 'tar' have options that allow
continuing rather than restarting an extraction because some
people prefer that.

> [ ... ] It's just that untarring large source trees is a very
> typical workload for me.

Well, it makes a lot of difference whether you are creating an
extreme corner case just to see what happens, or whether you
have a real problem, even a corner case problem, about which you
have to make some compromise.

The problem you have described seems rather strange:

  * You write a lot of little files to memory, as you have way
    more memory than data.
  * The whole is written out to the RAID6 in one go, on a storage
    layer that can do 500-700MB/s but does 1/5th of that.
  * You don't do anything else with the files.

> And I just don't want to accept that XFS cannot do better than
> being several orders of magnitude slower than ext4 (speaking
> of binary orders of magnitude).

> As I see it, both file systems give the same guarantees:
> 1) That upon completion of sync, all data is readily available
>    on permanent storage.
> 2) That the file system metadata doesn't suffer corruption,
>    should the system lose power during the operation.

Yes, but they also give you some *implicit* guarantees that are
different. For example that:

  * XFS spreads out files for you so you can better take
    advantage of parallelism in your storage layer, and further
    allocations are more resistant to fragmentation.

  * 'ext4' probably commits in a different and less safe order
    from XFS. If the storage layer rearranged IO order this
    might matter a lot less.

You may not care about either, but then you are doing something
very special.

For example, if you were to use your freshly written sources to
do a build, then conceivably spreading the files over 4 AGs
means that the builds can be much quicker on a system with
available hardware parallelism.

Also, *you* don't care about the order in which losses would
happen, and how much, if the system crashes, but most users tend
to want to avoid repeating work, because either they are not
copying data, or the copy is huge and they don't want to restart
it from the beginning.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07  7:27         ` Stefan Ring
  2012-04-07  8:53           ` Emmanuel Florac
@ 2012-04-07 14:57           ` Stan Hoeppner
  2012-04-09 11:02             ` Stefan Ring
  2012-04-09  0:19           ` Dave Chinner
  2 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-07 14:57 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 4/7/2012 2:27 AM, Stefan Ring wrote:
>> Instead, a far more optimal solution would be to set aside 4 spares per
>> chassis and create 14 four drive RADI10 arrays.  This would yield ~600
>> seeks/sec and ~400MB/s sequential throughput performance per 2 spindle
>> array.  We'd stitch the resulting 56 hardware RAID10 arrays together in
>> an mdraid linear (concatenated) array.  Then we'd format this 112
>> effective spindle linear array with simply:
>>
>> $ mkfs.xfs -d agcount=56 /dev/md0
>>
>> Since each RAID10 is 900GB capacity, we have 56 AGs of just under the
>> 1TB limit, 1 AG per 2 physical spindles.  Due to the 2 stripe spindle
>> nature of the constituent hardware RAID10 arrays, we don't need to worry
>> about aligning XFS writes to the RAID stripe width.  The hardware cache
>> will take care of filling the small stripes.  Now we're in the opposite
>> situation of having too many AGs per spindle.  We've put 2 spindles in a
>> single AG and turned the seek starvation issues on its head.
> 
> So it sounds like that for poor guys like us, who can’t afford the
> hardware to have dozens of spindles, the best option would be to
> create the XFS file system with agcount=1? 

Not at all.  You can achieve this performance with the 6 300GB spindles
you currently have, as Christoph and I both mentioned.  You simply lose
one spindle of capacity, 300GB, vs your current RAID6 setup.  Make 3
RAID1 pairs in the p400 and concatenate them.  If the p400 can't do this,
concat the mirror pair devices with md --linear.  Format the resulting
Linux block device with the following and mount with inode64.

$ mkfs.xfs -d agcount=3 /dev/[device]

That will give you 1 AG per spindle, 3 horizontal AGs total instead of 4
vertical AGs as you get with default striping setup.  This is optimal
for your high IOPS workload as it eliminates all 'extraneous' seeks
yielding a per disk access pattern nearly identical to EXT4.  And it
will almost certainly outrun EXT4 on your RAID6 due mostly to the
eliminated seeks, but also to elimination of parity calculations.
You've wiped the array a few times in your testing already, right? So one
or two more test setups should be no sweat.  Give it a go.  The results
will be pleasantly surprising.

> That seems to be the only
> reasonable conclusion to me, since a single RAID device, like a single
> disk, cannot write in parallel anyway.

It's not a reasonable conclusion.  And both striping and concat arrays
write in parallel, just a different kind of parallel.  The very coarse
description (for which I'll likely take heat) is that striping 'breaks
up' one file into stripe_width number of blocks, then writes all the
blocks, one to each disk, in parallel, until all the blocks of the file
are written.  Conversely, with a concatenated array, since XFS writes
each file to a different AG, and each spindle is 1 AG in this case, each
file's blocks are written serially to one disk.  But we can have 3 of
these going in parallel with 3 disks.

The former method relies on being able to neatly pack a file's blocks
into stripes that are written in parallel, to get max write performance.
 This is irrelevant with a concat.  We write all the blocks until the
file is written, and we waste no rotation or seeks in the process as can
be the case with partial stripe width writes on striped arrays.  The
only thing we "waste" is some disk space.  Everyone knows parity equals
lower write IOPS, and knows of the disk space tradeoff with non-parity
RAID to get maximum IOPS.  And since we're talking EXT4 vs XFS, make the
playing field level by testing EXT4 on a p400 based RAID10 of these 6
drives and compare the results to the concat.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  7:27     ` Stefan Ring
  2012-04-06 23:28       ` Stan Hoeppner
@ 2012-04-07 16:50       ` Peter Grandi
  2012-04-07 17:10         ` Joe Landman
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Grandi @ 2012-04-07 16:50 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

>> As to this, in theory even having split the files among 4
>> AGs, the upload from system RAM to host adapter RAM and then
>> to disk could happen by writing first all the dirty blocks
>> for one AG, then a long seek to the next AG, and so on, and
>> the additional cost of 3 long seeks would be negligible.

> Yes, that’s exactly what I had in mind, and what prompted me
> to write this post. It would be about 10 times as fast.

Ahhh yes, but let's go back to this and summarize some of my
previous observations:

  * If the scheduling order was by AG, and the hardware was
    parallel, the available parallelism would not be exploited,
    (and fragmentation might be worse) as if there were only a
    single AG. And XFS does let you configure the number of AGs
    in part for that reason.

  * Your storage layer does not seem to deliver parallel
    operations: as the ~100MB/s overall 'ext4' speed and the
    seek graphs show, in effect your 4+2 RAID6 performs in this
    case as if it were a single drive with a single arm.

  * Even with the actual scheduling at the Linux level being by
    interleaving AGs in XFS, your host adapter with a BBWC
    should be able to reorder them, in 256MiB lots, ignoring
    Linux level barriers and ordering, but it seems that this is
    not happening.

So the major things to look into seem to me:

  * Ensure that your RAID set can deliver the parallelism at
    which XFS is targeted, with the bulk transfer rates that it
    can do.

  * Otherwise figure out ways to ensure that the IO transactions
    generated by XFS are not in interleave-AG order.

  * Otherwise figure out ways to get the XFS IO ordering
    rearranged at the storage layer in spacewise order.

Summarizing some of the things to try, and some of them are
rather tentative, because you have a rather peculiar corner
case:

  * Change the flusher to write out incrementally instead of just
    at 'sync' time, e.g. every 1-2 seconds. In some similar
    cases this makes things a lot better, as large 'uploads' to
    the storage layer from the page cache can cause damaging
    latencies. But the success of this may depend on having a
    properly parallel storage layer, at least for XFS (a rough
    sketch of the relevant knobs follows after this list).

  * Use a different RAID setup. If the RAID set is used only for
    reproducible data, a RAID0, else a RAID10, or even a RAID5
    with a small chunk size.

  * Check the elevator and cache policy on the P400, if they are
    settable. Too bad many RAID host adapters have (euphemism)
    hideous fw (many older 3ware models come to mind) with some
    undocumented (euphemism) peculiarities as to scheduling.

  * Tweak 'queue/nr_requests' and 'device/queue_depth'. Probably
    they should be big (hundreds/thousands), but various
    settings should be tried as fw sometimes is so weird.

  * Given that it is now established that your host adapter has
    BBWC, consider switching the Linux elevator to 'noop', so as
    to leave IO scheduling to the host adapter fw, and reduce
    issue latency. 'queue/nr_requests' may be set to a very low
    number here perhaps, but my guess is that it shouldn't matter.

  * Alternatively if the host adapter fw insists on not
    reordering IO from the Linux level, use Linux elevator
    settings that behave similarly to 'anticipatory'.
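
For reference, a rough sketch of the knobs mentioned above; the
values and the sdX device name are only illustrative starting
points, not tested recommendations:

  # flusher: wake the writeback threads every ~1s and expire dirty
  # pages after ~2s instead of letting them pile up until 'sync'
  $ sysctl -w vm.dirty_writeback_centisecs=100
  $ sysctl -w vm.dirty_expire_centisecs=200

  # per block device: elevator, queue length, HBA queue depth
  $ echo noop > /sys/block/sdX/queue/scheduler
  $ echo 1024 > /sys/block/sdX/queue/nr_requests
  $ cat /sys/block/sdX/device/queue_depth   # inspect before changing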

It may help to use Bonnie (Garloff's 1.4 version with
'-o_direct') to give a rough feel of filetree speed profile, for
example I tend to use these options:

  Bonnie -y -u -o_direct -s 2000 -v 2 -d "$DIR"

Ultimately even 'ext4' does not seem the right filesystem for
this workload either, because all these "legacy" filesystems are
targeted at situations where data is much bigger than memory,
and you are trying to fit them into a very specific corner case
where the opposite is true.

Making my fantasy run wild, my guess is that your workload is
not 'tar x', but release building, where sources and objects fit
entirely in memory, and you are only concerned with persisting
the sources because you want to do several builds from that set
of sources without re-tar-x-ing them, and ideally you would like
to reduce build times by building several objects in parallel.

  BTW your corner case then has another property here: that disk
  writes greatly exceed disk reads, because you would only write
  once the sources and then read them from cache every time
  thereafter while the system is up. I doubt also that you would
  want to persist the generated objects themselves, but only the
  generated final "package" containing them, which might suggest
  building the objects to a 'tmpfs', unless you want them
  persisted (a bit) to make builds restartable.
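
  For what it's worth, a build-to-'tmpfs' setup can be as simple as
  something like this (size and mount point are made up for
  illustration):

    $ mount -t tmpfs -o size=16g tmpfs /path/to/objdir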

If that's the case, and you cannot fix the storage layer to be
more suitable for 'ext4' or XFS, consider using NILFS2, or even
'ext2' (with a long flusher interval perhaps).

  Note: or "cheat" and do your builds to a flash SSD, as they
    both run a fw layer that implements a COW/logging allocation
    strategy, and have nicer seek times :-).

> That’s what bothers me so much.

And in case you did not get this before, I have a long standing
pet peeve about abusing filesystems for small file IO, or other
ways of going against the grain of what is plausible, which I
call the "syntactic approach" (every syntactically valid system
configuration is assumed to work equally well...).

Some technical postscripts:

  * It seems that most if not all RAID6 implementations don't do
    shortened RMWs, where only the updated blocks and the PQ
    blocks are involved, but they always do full stripe RMW.
    Even with a BBWC in the host adapter this is one major
    reason to avoid RAID6 in favor of at least RAID5, for your
    setup in particular. But hey, RAID6 setups are all
    syntactically valid! :-)

  * The 'ext3' on disk layout and allocation policies seem to
    deliver very good compact locality on bulk writeouts and
    on relatively fresh filetrees, but then locality can degrade
    apocalyptically over time, by as much as a factor of seven:
      http://www.sabi.co.uk/blog/anno05-3rd.html#050913
    I suspect that the same applies to 'ext4', even if perhaps a
    bit less. You have tried to "age" the filetree a bit, but I
    suspect you did not succeed enough, as the graphed Linux
    level seek patterns with 'ext4' shows a mostly-linear write.

  * Hopefully your storage layer does not use DM/LVMs...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07 16:50       ` Peter Grandi
@ 2012-04-07 17:10         ` Joe Landman
  2012-04-08 21:42           ` Stan Hoeppner
  0 siblings, 1 reply; 64+ messages in thread
From: Joe Landman @ 2012-04-07 17:10 UTC (permalink / raw)
  To: xfs

On 04/07/2012 12:50 PM, Peter Grandi wrote:

>    * Your storage layer does not seem to deliver parallel
>      operations: as the ~100MB/s overall 'ext4' speed and the
>      seek graphs show, in effect your 4+2 RAID6 performs in this
>      case as if it were a single drive with a single arm.

This is what leapt out at me.  I retried a very similar test (pulled 
Icedtea 2.1, compiled it, tarred it, measured untar on our boxen).  I 
was getting a fairly consistent 4 +/- delta seconds.

Ignoring the rest of your post for brevity (basically to focus upon this 
one issue), I suspect that the observed performance issue has more to do 
with the RAID card, the disks, and the server than the file system.

100MB/s on some supposedly fast drives with a RAID card indicates that 
either the RAID is badly implemented, the RAID layout is suspect, or 
similar.  He should be getting closer to N(data disks) * BW(single disk) 
for something "close" to a streaming operation.

This isn't to suggest that he didn't hit some bug which happens to
over-specify use of AG 0, but he definitely had a weak RAID system (at best).

If he retries with a more capable system, or one with a saner RAID 
layout (16k chunk size?  For spinning rust?  Seriously?  Short stroking 
DB layout?), an agcount of 32 or higher, and still sees similar issues, 
then I'd be more suspicious of a bug.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  8:25   ` Stefan Ring
@ 2012-04-07 18:57     ` Martin Steigerwald
  2012-04-10 14:02       ` Stefan Ring
  0 siblings, 1 reply; 64+ messages in thread
From: Martin Steigerwald @ 2012-04-07 18:57 UTC (permalink / raw)
  To: xfs; +Cc: Stefan Ring, Christoph Hellwig

On Friday, 6 April 2012, Stefan Ring wrote:
> > thanks for the detailed report.
> 
> Thanks for the detailed and kind answer.
> 
> > Can you try a few mount options for me both all together and if you
> > have some time also individually.
> > 
> >  -o inode64
> > 
> >        This allows inodes to be close to data even for >1TB
> >        filesystems.  It's something we hope to make the default soon.
> 
> The filesystem is not that large. It’s only 400GB. I turned it on
> anyway. No difference.
> 
> >  -o filestreams
> > 
> >        This keeps data written in a single directory group together.
> >        Not sure your directories are large enough to really benefit
> >        from it, but it's worth a try.
> >  -o allocsize=4k
> > 
> >        This disables the agressive file preallocation we do in XFS,
> >        which sounds like it's not useful for your workload.
> 
> inode64+filestreams: no difference
> inode64+allocsize: no difference
> inode64+filestreams+allocsize: no difference :(
> 
> > For metadata intensive workloads like yours you would be much better
> > using a non-striping raid, e.g. concatentation and mirroring instead
> > of raid 5 or raid 6.  I know this has a cost in terms of "wasted"
> > space, but for IOPs bound workload the difference is dramatic.
> 
> Hmm, I’m sure you’re right, but I’m out of luck here. If I had 24
> drives, I could think about a different organization. But with only 6
> bays, I cannot give up all that space.
> 
> Although *in theory*, it *should* be possible to run fast for
> write-only workloads. The stripe size is 64 KB (4x16), and it’s not
> like data is written all over the place. So it should very well be
> possible to write the data out in some reasonably sized and aligned
> chunks. The filesystem partition itself is nicely aligned.

And is XFS aligned to the RAID 6?

What does xfs_info display on it?

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06  5:53   ` Stefan Ring
  2012-04-06 15:35     ` Peter Grandi
@ 2012-04-07 19:11     ` Peter Grandi
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Grandi @ 2012-04-07 19:11 UTC (permalink / raw)
  To: Linux fs XFS

> [ ... ] I also tried different Linux IO elevators, as you
> suggested in your other response, without any measurable
> effect. [ ... ]

That's probably because of the RAID6 host adapter being
uncooperative, but I wondered whether this might apply in some
form:

http://xfs.org/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
 «As of kernel 3.2.12, the default i/o scheduler, CFQ, will
  defeat much of the parallelization in XFS.»

BTW earlier 'cfq' versions have been reported to have huge
problems with workloads involving writes and reads, and only
'deadline' (which is quite unsuitable for some workloads) seems
to be fairly reliable.

http://www.webhostingtalk.com/showthread.php?t=727173

«Anyway, we have found HUGE problems with CFQ in many different
 scenarios and many different hardware setups. If it was only an
 issue with our configuration I would have foregone posting this
 message and simply informed those kernel developers responsible
 for the fix.

 Two scenarios where CFQ has a severe problem - When you are
 running a single block device (1 drive, or a raid 1 scenario)
 under certain circumstances where heavy sustained writes are
 occurring the CFQ scheduler will behave very strangely. It will
 begin to give all access to reads and limit all writes to the
 point of allowing only 0-2 I/O write operations being allowed
 per second vs 100-180 read operations per second. This condition
 will persist indefinitely until the sustained write process
 completes. This is VERY bad for a shared environment where you
 need reads and writes to complete regardless of increased reads
 or writes. This behavior goes beyond what CFQ says it is
 supposed to do in this situation - meaning this is a bug, and a
 serious one at that. We can reproduce this EVERY TIME.

 The second scenario occurs when you have two or more block
 devices, either single drives, or any type of raid array
 including raid 0,1,0+1,1+0,5 and 6. (We never tested 3,4 who
 uses raid 3 or 4 anymore anyway?!!). This case is almost exactly
 opposite of what happens with only one block device. In this
 case if one of more of the drives is blocked with heavy writes
 for a sustained period of time CFQ will block reads from the
 other devices or severely limit the reads until the writes have
 completed. We can also reproduce this behavior with test
 software we have written on a 100% consistent basis.»

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07  8:49         ` Emmanuel Florac
@ 2012-04-08 20:33           ` Stan Hoeppner
  2012-04-08 21:45             ` Emmanuel Florac
  0 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-08 20:33 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Stefan Ring, Linux fs XFS

On 4/7/2012 3:49 AM, Emmanuel Florac wrote:
> On Fri, 06 Apr 2012 18:28:37 -0500, you wrote:
> 
>> Creating four 60 drive RAID10 arrays, let alone 60 drive RAID6
>> arrays, would be silly.
> 
> From my experience, with modern arrays it doesn't make much of a difference.
> I've reached decent IOPS (i. e. about 4000 IOPS) on large arrays of up
> to 46 drives provided there are enough threads -- more threads than
> spindles, preferably.

Are you speaking of a mixed metadata/data heavy IOPS workload similar to
that which is the focus of this thread, or another type of workload?  Is
this 46 drive array RAID10 or RAID6?

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07 17:10         ` Joe Landman
@ 2012-04-08 21:42           ` Stan Hoeppner
  2012-04-09  5:13             ` Stan Hoeppner
  2012-04-09  9:23             ` Stefan Ring
  0 siblings, 2 replies; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-08 21:42 UTC (permalink / raw)
  To: xfs

On 4/7/2012 12:10 PM, Joe Landman wrote:
> On 04/07/2012 12:50 PM, Peter Grandi wrote:
> 
>>    * Your storage layer does not seem to deliver parallel
>>      operations: as the ~100MB/s overall 'ext4' speed and the
>>      seek graphs show, in effect your 4+2 RAID6 performs in this
>>      case as if it were a single drive with a single arm.
> 
> This is what leapt out at me.  I retried a very similar test (pulled
> Icedtea 2.1, compiled it, tarred it, measured untar on our boxen).  I
> was getting a fairly consistent 4 +/- delta seconds.

That's an interesting point.  I guess I'd chalked the low throughput up
to high seeks.

> 100MB/s on some supposedly fast drives with a RAID card indicates that
> either the RAID is badly implemented, the RAID layout is suspect, or
> similar.  He should be getting closer to N(data disks) * BW(single disk)
> for something "close" to a streaming operation.

Reading this thread seems to indicate you're onto something Joe:
http://h30499.www3.hp.com/t5/System-Administration/Extremely-slow-io-on-cciss-raid6/td-p/4214888

Add this to the mix:
"The HP Smart Array P400 is HP's first PCI-Express (PCIe) serial
attached SCSI (SAS) RAID controller"

That's from:
http://h18000.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/index.html

First gen products aren't always duds, but the likelihood is often much
higher.  Everyone posting to that forum is getting low throughput, and
most of them are testing streaming reads/writes, not massively random IO
as is Stefan's case.

> This isn't suggesting that he didn't hit some bug which happens to over
> specify use of ag=0, but he definitely had a weak RAID system (at best).
> 
> If he retries with a more capable system, or one with a saner RAID
> layout (16k chunk size?  For spinning rust?  Seriously?  Short stroking
> DB layout?), an agcount of 32 or higher, and still sees similar issues,
> then I'd be more suspicious of a bug.

Or merely a weak/old product.  The P400 was an entry level RAID HBA,
HP's first PCIe/SAS RAID card.  It was discontinued quite some time ago.
 The use of DDR2/533 memory indicates its design stage started probably
somewhere around 2004, 8 years ago.

Now that I've researched the P400, and assuming Stefan currently has the
card firmware optimally configured, I'd bet this workload is simply
overwhelming the RAID ASIC.  To confirm this, simply configure each
drive as a RAID0 array, so all 6 drives are exported as block devices.
Configure them as an md RAID6 and test the workload.  Be sure to change
the Linux elevator to noop first since you're using hardware write cache:

$ echo noop > /sys/block/sdX/queue/scheduler

Execute this 6 times, once for each of the 6 drives, changing the device
name each time, obviously.  This is not a persistent change.
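
The md RAID6 step itself would then look roughly like this (device
names and chunk size are only illustrative, assuming the six exported
single-drive volumes appear as sda through sdf):

$ mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=256 /dev/sd[a-f]
$ mkfs.xfs /dev/md0      # stripe geometry should be picked up from md
$ mount -o inode64 /dev/md0 /mnt/test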

The gap between EXT4 and XFS will likely still exist, but overall
numbers should jump substantially Northward, if the problem is indeed a
slow RAID ASIC.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-08 20:33           ` Stan Hoeppner
@ 2012-04-08 21:45             ` Emmanuel Florac
  2012-04-09  5:27               ` Stan Hoeppner
  0 siblings, 1 reply; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-08 21:45 UTC (permalink / raw)
  To: stan; +Cc: Stefan Ring, Linux fs XFS

On Sun, 08 Apr 2012 15:33:01 -0500, you wrote:

> > 
> > From my experience, with modern arrays it doesn't make much of a
> > difference. I've reached decent IOPS (i. e. about 4000 IOPS) on
> > large arrays of up to 46 drives provided there are enough threads
> > -- more threads than spindles, preferably.  
> 
> Are you speaking of a mixed metadata/data heavy IOPS workload similar
> to that which is the focus of this thread, or another type of
> workload?  Is this 46 drive array RAID10 or RAID6?

Pure random access, 8K IO benchmark (database simulation). RAID-6
performs about the same in pure reading tests, but stinks terribly at
writing of course.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07  7:27         ` Stefan Ring
  2012-04-07  8:53           ` Emmanuel Florac
  2012-04-07 14:57           ` Stan Hoeppner
@ 2012-04-09  0:19           ` Dave Chinner
  2012-04-09 11:39             ` Emmanuel Florac
  2 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2012-04-09  0:19 UTC (permalink / raw)
  To: Stefan Ring; +Cc: stan, Linux fs XFS

On Sat, Apr 07, 2012 at 09:27:50AM +0200, Stefan Ring wrote:
> > Instead, a far more optimal solution would be to set aside 4 spares per
> > chassis and create 14 four drive RAID10 arrays.  This would yield ~600
> > seeks/sec and ~400MB/s sequential throughput performance per 2 spindle
> > array.  We'd stitch the resulting 56 hardware RAID10 arrays together in
> > an mdraid linear (concatenated) array.  Then we'd format this 112
> > effective spindle linear array with simply:
> >
> > $ mkfs.xfs -d agcount=56 /dev/md0
> >
> > Since each RAID10 is 900GB capacity, we have 56 AGs of just under the
> > 1TB limit, 1 AG per 2 physical spindles.  Due to the 2 stripe spindle
> > nature of the constituent hardware RAID10 arrays, we don't need to worry
> > about aligning XFS writes to the RAID stripe width.  The hardware cache
> > will take care of filling the small stripes.  Now we're in the opposite
> > situation of having too many AGs per spindle.  We've put 2 spindles in a
> > single AG and turned the seek starvation issues on its head.
> 
> So it sounds like that for poor guys like us, who can’t afford the
> hardware to have dozens of spindles, the best option would be to
> create the XFS file system with agcount=1?

No, because then you have no redundancy in metadata structures, so
if you lose/corrupt the superblock you can more easily lose the entire
filesystem.  Not to mention you have no allocation parallelism in the
filesystem, so you'll get terrible performance in many common
workloads. IO fairness will also be a big problem.

> That seems to be the only reasonable conclusion to me, since a
> single RAID device, like a single disk, cannot write in parallel
> anyway.

A decent RAID controller with a BBWC and a single LUN benefits from
parallelism just as much as large disk arrays do, because the BBWC
minimises the write IO latency and allows the controller to do a
better job of scheduling its IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-08 21:42           ` Stan Hoeppner
@ 2012-04-09  5:13             ` Stan Hoeppner
  2012-04-09 11:52               ` Stefan Ring
  2012-04-09  9:23             ` Stefan Ring
  1 sibling, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-09  5:13 UTC (permalink / raw)
  To: xfs

On 4/8/2012 4:42 PM, Stan Hoeppner wrote:
> On 4/7/2012 12:10 PM, Joe Landman wrote:
>> On 04/07/2012 12:50 PM, Peter Grandi wrote:
>>
>>>    * Your storage layer does not seem to deliver parallel
>>>      operations: as the ~100MB/s overall 'ext4' speed and the
>>>      seek graphs show, in effect your 4+2 RAID6 performs in this
>>>      case as if it were a single drive with a single arm.
>>
>> This is what leapt out at me.  I retried a very similar test (pulled
>> Icedtea 2.1, compiled it, tarred it, measured untar on our boxen).  I
>> was getting a fairly consistent 4 +/- delta seconds.
> 
> That's an interesting point.  I guess I'd chalked the low throughput up
> to high seeks.
> 
>> 100MB/s on some supposedly fast drives with a RAID card indicates that
>> either the RAID is badly implemented, the RAID layout is suspect, or
>> similar.  He should be getting closer to N(data disks) * BW(single disk)
>> for something "close" to a streaming operation.
> 
> Reading this thread seems to indicate you're onto something Joe:
> http://h30499.www3.hp.com/t5/System-Administration/Extremely-slow-io-on-cciss-raid6/td-p/4214888

The P400 uses the LSISAS1078 chip, PowerPC 500MHz core, "2 hardware
RAID5/6 processors".  Some sequential benchmarks under Windows with
8x750GB SATA drives on an LSI 1078 based card show sequential RAID6
write rates of ~100MB/s.  RAID0 write rate of this card for 8 drives is
350MB/s.  These drives are capable of 50MB/s sustained writes, so the
RAID0 performance isn't far off the hardware max.

It seems the 1078 is simply not that quick with anything but pure
striping.  Hardware RAID10 write performance appears only about 50%
faster than RAID6.  The RAID6 speed is roughly 1/3rd of the RAID0 speed.
 So exporting the individual drives as I previously mentioned and using
mdraid6 should yield at least  a 3x improvement, assuming your CPUs
aren't already loaded down.

Or, as others have mentioned, simply install an MLC SSD and get 10-100x
more random throughput with XFS if you match the agcount to the number
of flash chips in the SSD.  XFS parallelism flexing its muscles once
again.  EXT4 won't improve as much as it will tend to write the flash
chips sequentially.  Newegg currently has two Mushkin 120GB models for
$120 each, both with 4/5 eggs.

http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=100008120+600038484+50001504&QksAutoSuggestion=&ShowDeactivatedMark=False&Configurator=&IsNodeId=1&Subcategory=636&description=&hisInDesc=&Ntk=&CFG=&SpeTabStoreType=&AdvancedSearch=1&srchInDesc=

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-08 21:45             ` Emmanuel Florac
@ 2012-04-09  5:27               ` Stan Hoeppner
  2012-04-09 12:45                 ` Emmanuel Florac
  0 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-09  5:27 UTC (permalink / raw)
  To: xfs

On 4/8/2012 4:45 PM, Emmanuel Florac wrote:
> On Sun, 08 Apr 2012 15:33:01 -0500, you wrote:
> 
>>>
>>> From my experience, with modern arrays it doesn't make much of a
>>> difference. I've reached decent IOPS (i. e. about 4000 IOPS) on
>>> large arrays of up to 46 drives provided there are enough threads
>>> -- more threads than spindles, preferably.  
>>
>> Are you speaking of a mixed metadata/data heavy IOPS workload similar
>> to that which is the focus of this thread, or another type of
>> workload?  Is this 46 drive array RAID10 or RAID6?
> 
> Pure random access, 8K IO benchmark (database simulation). RAID-6
> performs about the same in pure reading tests, but stinks terribly at
> writing of course.

In your RAID10 random write testing, was this with a filesystem or doing
direct block IO?  If the latter, I wonder if its write pattern is
anything like the access pattern we'd see hitting dozens of AGs while
creating 10s of thousands of files.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-08 21:42           ` Stan Hoeppner
  2012-04-09  5:13             ` Stan Hoeppner
@ 2012-04-09  9:23             ` Stefan Ring
  2012-04-09 23:06               ` Stan Hoeppner
  1 sibling, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-09  9:23 UTC (permalink / raw)
  To: stan; +Cc: xfs

> Or merely a weak/old product.  The P400 was an entry level RAID HBA,
> HP's first PCIe/SAS RAID card.  It was discontinued quite some time ago.
>>  The use of DDR2/533 memory indicates its design stage started probably
> somewhere around 2004, 8 years ago.

It was what you got when you bought a direct-attach storage blade from
HP until a few months ago. Apparently, they changed it to P410i very
recently: <http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/3709945-3709945-3710114-3722820-3722776-4304942.html?dnr=1>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07 14:57           ` Stan Hoeppner
@ 2012-04-09 11:02             ` Stefan Ring
  2012-04-09 12:48               ` Emmanuel Florac
  2012-04-09 23:38               ` Stan Hoeppner
  0 siblings, 2 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-09 11:02 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> Not at all.  You can achieve this performance with the 6 300GB spindles
> you currently have, as Christoph and I both mentioned.  You simply lose
> one spindle of capacity, 300GB, vs your current RAID6 setup.  Make 3
> RAID1 pairs in the p400 and concatenate them.  If the p400 can't do this
> concat the mirror pair devices with md --linear.  Format the resulting
> Linux block device with the following and mount with inode64.
>
> $ mkfs.xfs -d agcount=3 /dev/[device]
>
> That will give you 1 AG per spindle, 3 horizontal AGs total instead of 4
> vertical AGs as you get with default striping setup.  This is optimal
> for your high IOPS workload as it eliminates all 'extraneous' seeks
> yielding a per disk access pattern nearly identical to EXT4.  And it
> will almost certainly outrun EXT4 on your RAID6 due mostly to the
> eliminated seeks, but also to elimination of parity calculations.
> You've wiped the array a few times in your testing already right, so one
> or two more test setups should be no sweat.  Give it a go.  The results
> will be pleasantly surprising.

Well I had to move around quite a bit of data, but for the sake of
completeness, I had to give it a try.

With a nice and tidy fresh XFS file system, performance is indeed
impressive – about 16 sec for the same task that would take 2 min 25
before. So that’s about 150 MB/sec, which is not great, but for many
tiny files it would perhaps be a bit unreasonable to expect more. A
simple copy of the tar onto the XFS file system yields the same linear
performance, the same as with ext4, btw. So 150 MB/sec seems to be the
best these disks can do, meaning that theoretically, with 3 AGs, it
should be able to reach 450 MB/sec under optimal conditions.

I will still do a test with the free space fragmentation priming on
the concatenated AG=3 volume, because it seems to be rather slow as
well.

But then I guess I’m back to ext4 land. XFS just doesn’t offer enough
benefits in this case to justify the hassle.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09  0:19           ` Dave Chinner
@ 2012-04-09 11:39             ` Emmanuel Florac
  2012-04-09 21:47               ` Dave Chinner
  0 siblings, 1 reply; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-09 11:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Stefan Ring, stan, Linux fs XFS

On Mon, 9 Apr 2012 10:19:43 +1000, you wrote:

> A decent RAID controller with a BBWC and a single LUN benefits from
> parallelism just as much as large disk arrays do, because the BBWC
> minimises the write IO latency and allows the controller to do a
> better job of scheduling its IO.

BTW recently I've found that for storage servers, noop io scheduler
often is the best choice, I suppose precisely because it doesn't try to
outsmart the RAID controller logic...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09  5:13             ` Stan Hoeppner
@ 2012-04-09 11:52               ` Stefan Ring
  2012-04-10  7:34                 ` Stan Hoeppner
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-09 11:52 UTC (permalink / raw)
  To: stan; +Cc: xfs

> It seems the 1078 is simply not that quick with anything but pure
> striping.  Hardware RAID10 write performance appears only about 50%
> faster than RAID6.  The RAID6 speed is roughly 1/3rd of the RAID0 speed.
>  So exporting the individual drives as I previously mentioned and using
> mdraid6 should yield at least  a 3x improvement, assuming your CPUs
> aren't already loaded down.

Whatever the problem with the controller may be, it behaves quite
nicely usually. It seems clear though, that, regardless of the storage
technology, it cannot be a good idea to schedule tiny blocks in the
order that XFS schedules them in my case.

This:
AG0 *   *   *
AG1  *   *   *
AG2   *   *   *
AG3    *   *   *

cannot be better than this:

AG0 ***
AG1    ***
AG2       ***
AG3          ***

Yes, in theory, a good cache controller should be able to sort this
out. But at least this particular controller is not able to do so and
could use a little help. Also, a single consumer-grade drive is
certainly not helped by this write ordering.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09  5:27               ` Stan Hoeppner
@ 2012-04-09 12:45                 ` Emmanuel Florac
  2012-04-13 19:36                   ` Stefan Ring
  0 siblings, 1 reply; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-09 12:45 UTC (permalink / raw)
  To: stan; +Cc: xfs

On Mon, 09 Apr 2012 00:27:29 -0500, you wrote:

> In your RAID10 random write testing, was this with a filesystem or
> doing direct block IO? 

Doing random IO in a file lying on an XFS filesystem.

> If the latter, I wonder if its write pattern
> is anything like the access pattern we'd see hitting dozens of AGs
> while creating 10s of thousands of files.

I suppose the file creation process hits a few well-defined hot spots
rather than being purely random access.

I just have a machine for testing purposes with 15 4TB drives in
RAID-6, not exactly an IOPS demon :)

So I've built a tar file to make it somewhat similar to the OP's
problem:

root@3[raid]# ls -lh test.tar 
-rw-r--r-- 1 root root 2,6G  9 avril 13:52 test.tar
root@3[raid]# tar tf test.tar | wc -l
234318

# echo 3 > /proc/sys/vm/drop_caches
# time tar xf test.tar

real    1m2.584s
user    0m1.376s
sys     0m13.643s

Let's rerun it with files cached (the machine has 16 GB RAM, so
every single file must be cached):

# time tar xf test.tar

real    0m50.842s
user    0m0.809s
sys     0m13.767s

Typical IOs during unarchiving: no read, write IO bound.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0,00  1573,50    0,00  480,50     0,00    36,96   157,52    60,65  124,45   2,08 100,10
dm-0              0,00     0,00    0,00 2067,00     0,00    39,56    39,20   322,55  151,62   0,48 100,10

The OP's setup, being 6 15k drives, should provide roughly the same
number of true IOPS (1200) as my slow-as-hell bunch of 7200RPM
4TB drives (1500). I suppose the write cache makes up most of
the difference; or else 15K drives are overrated :)

Alas, I can't run the test on this machine with ext4: I can't 
get mkfs.ext4 to swallow my big device. 

mkfs -t ext4 -v -b 4096 -n /dev/dm-0 2147483647

should work (though drastically limiting the filesystem size),
but dies miserably when removing the -n flag. Mmmph, I suppose it's 
production ready if you don't have much data to store.

JFS doesn't work either. And I was wondering why I'm using XFS?  :)

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 11:02             ` Stefan Ring
@ 2012-04-09 12:48               ` Emmanuel Florac
  2012-04-09 12:53                 ` Stefan Ring
  2012-04-09 23:38               ` Stan Hoeppner
  1 sibling, 1 reply; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-09 12:48 UTC (permalink / raw)
  To: Stefan Ring; +Cc: stan, Linux fs XFS

On Mon, 9 Apr 2012 13:02:27 +0200, you wrote:

> So 150 MB/sec seems to be the
> best these disks can do,

Definitely NOT right. I mean I routinely get 600 MB/s from 8-drive
7.2K RAID-6 arrays. Unless the cache is off, which is obviously not
what RAID-6 is meant for.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 12:48               ` Emmanuel Florac
@ 2012-04-09 12:53                 ` Stefan Ring
  2012-04-09 13:03                   ` Emmanuel Florac
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-09 12:53 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: stan, Linux fs XFS

>> So 150 MB/sec seems to be the
>> best these disks can do,
>
> Definitely NOT right. I mean I've got routinely 600 MB/s from 8 7.2K
> drives RAID-6 arrays. Unless cache is off, which RAID-6 isn't obviously
> thought for.

In this case it was a 2-disk RAID 1, so it’s 150 MB/s per disk. Seems
quite right to me.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 12:53                 ` Stefan Ring
@ 2012-04-09 13:03                   ` Emmanuel Florac
  0 siblings, 0 replies; 64+ messages in thread
From: Emmanuel Florac @ 2012-04-09 13:03 UTC (permalink / raw)
  To: Stefan Ring; +Cc: stan, Linux fs XFS

On Mon, 9 Apr 2012 14:53:14 +0200, you wrote:

> In this case it was a 2-disk RAID 1, so it’s 150 MB/s per disk. Seems
> quite right to me.

Yes, sorry, I thought it was the 6-drive array speed. I saw it just
after hitting "send", but there is no "supersede" on mailing lists :-)

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06 23:28       ` Stan Hoeppner
  2012-04-07  7:27         ` Stefan Ring
  2012-04-07  8:49         ` Emmanuel Florac
@ 2012-04-09 14:21         ` Geoffrey Wehrman
  2012-04-10 19:30           ` Stan Hoeppner
  2 siblings, 1 reply; 64+ messages in thread
From: Geoffrey Wehrman @ 2012-04-09 14:21 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Stefan Ring, Linux fs XFS

On Fri, Apr 06, 2012 at 06:28:37PM -0500, Stan Hoeppner wrote:
| So while the XFS AG architecture may not be perfectly suited to your
| single 6 drive RAID6 array, it still gives rather remarkable performance
| given that the same architecture can scale pretty linearly to the
| heights above, and far beyond.  Something EXTx and others could never
| dream of.  Some of the SGI guys might be able to confirm deployed single
| XFS filesystems spanning 1000+ drives in the past.  Today we'd probably
| only see that scale with CXFS.

With an SGI IS16000 array, which supports up to 1,200 drives, building
filesystems with large numbers of drives isn't difficult.  Most configurations
using the IS16000 have 8+2 RAID6 luns.  I've seen sustained 15 GB/s to
a single filesystem on one of the arrays with a 600 drive configuration.
The scalability of XFS is impressive.


-- 
Geoffrey Wehrman
SGI Building 10                             Office: (651)683-5496
2750 Blue Water Road                           Fax: (651)683-5098
Eagan, MN 55121                             E-mail: gwehrman@sgi.com
	  http://www.sgi.com/products/storage/software/

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 11:39             ` Emmanuel Florac
@ 2012-04-09 21:47               ` Dave Chinner
  0 siblings, 0 replies; 64+ messages in thread
From: Dave Chinner @ 2012-04-09 21:47 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Stefan Ring, stan, Linux fs XFS

On Mon, Apr 09, 2012 at 01:39:13PM +0200, Emmanuel Florac wrote:
> On Mon, 9 Apr 2012 10:19:43 +1000, you wrote:
> 
> > A decent RAID controller with a BBWC and a single LUN benefits from
> > parallelism just as much as large disk arrays do, because the BBWC
> > minimises the write IO latency and allows the controller to do a
> > better job of scheduling its IO.
> 
> BTW recently I've found that for storage servers, noop io scheduler
> often is the best choice, I suppose precisely because it doesn't try to
> outsmart the RAID controller logic...

We've been recommending the use of the no-op (or worst case,
deadline) scheduler for XFS on hardware RAID for quite a few years.
I only test against the no-op scheduler, because I got sick of
having to track down regressions caused by "smart" CFQ heuristics....
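
(For anyone who wants to make that choice persistent across reboots,
one common approach is a udev rule along these lines, or simply
passing elevator=noop on the kernel command line. The rule file name
and match pattern below are only illustrative.)

  # /etc/udev/rules.d/60-ioscheduler.rules
  ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="noop"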

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09  9:23             ` Stefan Ring
@ 2012-04-09 23:06               ` Stan Hoeppner
  0 siblings, 0 replies; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-09 23:06 UTC (permalink / raw)
  To: Stefan Ring; +Cc: xfs

On 4/9/2012 4:23 AM, Stefan Ring wrote:
>> Or merely a weak/old product.  The P400 was an entry level RAID HBA,
>> HP's first PCIe/SAS RAID card.  It was discontinued quite some time ago.
>>  The use of DDR2/533 memory indicates its design stage started probably
>> somewhere around 2004, 8 years ago.
> 
> It was what you got when you bought a direct-attach storage blade from
> HP until a few months ago. Apparently, they changed it to P410i very
> recently: <http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/3709945-3709945-3710114-3722820-3722776-4304942.html?dnr=1>

Nonetheless, its performance is quite bad with RAID6 (RAID10 as well).
 If you're happy with EXT4 on the P400 based RAID6, you'll be much
happier with 3-4x more performance using md for the RAID6.  If it was
worth your time to test the XFS concat I would think this test would be
even more so, as it appears you'll be sticking with EXT4.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 11:02             ` Stefan Ring
  2012-04-09 12:48               ` Emmanuel Florac
@ 2012-04-09 23:38               ` Stan Hoeppner
  2012-04-10  6:11                 ` Stefan Ring
  1 sibling, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-09 23:38 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 4/9/2012 6:02 AM, Stefan Ring wrote:
>> Not at all.  You can achieve this performance with the 6 300GB spindles
>> you currently have, as Christoph and I both mentioned.  You simply lose
>> one spindle of capacity, 300GB, vs your current RAID6 setup.  Make 3
>> RAID1 pairs in the p400 and concatenate them.  If the p400 can't do this
>> concat the mirror pair devices with md --linear.  Format the resulting
>> Linux block device with the following and mount with inode64.
>>
>> $ mkfs.xfs -d agcount=3 /dev/[device]
>>
>> That will give you 1 AG per spindle, 3 horizontal AGs total instead of 4
>> vertical AGs as you get with default striping setup.  This is optimal
>> for your high IOPS workload as it eliminates all 'extraneous' seeks
>> yielding a per disk access pattern nearly identical to EXT4.  And it
>> will almost certainly outrun EXT4 on your RAID6 due mostly to the
>> eliminated seeks, but also to elimination of parity calculations.
>> You've wiped the array a few times in your testing already right, so one
>> or two more test setups should be no sweat.  Give it a go.  The results
>> will be pleasantly surprising.
> 
> Well I had to move around quite a bit of data, but for the sake of
> completeness, I had to give it a try.
> 
> With a nice and tidy fresh XFS file system, performance is indeed
> impressive – about 16 sec for the same task that would take 2 min 25
> before. So that’s about 150 MB/sec, which is not great, but for many
> tiny files it would perhaps be a bit unreasonable to expect more. A

150MB/s isn't correct.  Should be closer to 450MB/s.  This makes it
appear that you're writing all these files to a single directory.  If
you're writing them fairly evenly to 3 directories or a multiple of 3,
you should see close to 450MB/s, if using mdraid linear over 3 P400
RAID1 pairs.  If this is what you're doing then something seems wrong
somewhere.  Try unpacking a kernel tarball.  Lots of subdirectories to
exercise all 3 AGs thus all 3 spindles.
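
A quick way to drive all three AGs (and thus all three spindles) at
once would be something along these lines; the paths and tarball name
are made up:

$ for i in 1 2 3; do mkdir -p dir$i; tar xf /tmp/icedtea6.tar -C dir$i & done
$ wait; sync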

> simple copy of the tar onto the XFS file system yields the same linear
> performance, the same as with ext4, btw. So 150 MB/sec seems to be the
> best these disks can do, meaning that theoretically, with 3 AGs, it
> should be able to reach 450 MB/sec under optimal conditions.

The optimal condition, again, requires writing 3 of this file to 3
directories to hit ~450MB/s, which you should get close to if using
mdraid linear over RAID1 pairs.  XFS is a filesystem after all, so its
parallelism must come from manipulating usage of filesystem structures.
 I thought I explained all of this previously when I introduced the "XFS
concat" into this thread.

> I will still do a test with the free space fragmentation priming on
> the concatenated AG=3 volume, because it seems to be rather slow as
> well.

> But then I guess I’m back to ext4 land. XFS just doesn’t offer enough
> benefits in this case to justify the hassle.

If you were writing to only one directory I can understand this
sentiment.  Again, if you were writing 3 directories fairly evenly, with
the md concat, then your sentiment here should be quite different.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 23:38               ` Stan Hoeppner
@ 2012-04-10  6:11                 ` Stefan Ring
  2012-04-10 20:29                   ` Stan Hoeppner
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-10  6:11 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> 150MB/s isn't correct.  Should be closer to 450MB/s.  This makes it
> appear that you're writing all these files to a single directory.  If
> you're writing them fairly evenly to 3 directories or a multiple of 3,
> you should see close to 450MB/s, if using mdraid linear over 3 P400
> RAID1 pairs.  If this is what you're doing then something seems wrong
> somewhere.  Try unpacking a kernel tarball.  Lots of subdirectories to
> exercise all 3 AGs thus all 3 spindles.

The spindles were exercised; I watched it with iostat. Maybe I could
have reached more with more parallelism, but that wasn’t my goal at
all. Although, over the course of these experiments, I came to doubt
that the controller could even handle this data rate.

>> simple copy of the tar onto the XFS file system yields the same linear
>> performance, the same as with ext4, btw. So 150 MB/sec seems to be the
>> best these disks can do, meaning that theoretically, with 3 AGs, it
>> should be able to reach 450 MB/sec under optimal conditions.
>
> The optimal condition, again, requires writing 3 of this file to 3
> directories to hit ~450MB/s, which you should get close to if using
> mdraid linear over RAID1 pairs.  XFS is a filesystem after all, so it's
> parallelism must come from manipulating usage of filesystem structures.
>  I thought I explained all of this previously when I introduced the "XFS
> concat" into this thread.

The optimal condition would be 3 parallel writes of huge files, which
can be easily written linearly. Not thousands of tiny files.
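
That case would look more like three parallel writes of big files
(file names and sizes here are purely illustrative):

$ for i in 1 2 3; do dd if=/dev/zero of=dir$i/big.img bs=1M count=4096 oflag=direct & done
$ wait

That is a very different access pattern from 200,000 tiny files.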

>> But then I guess I’m back to ext4 land. XFS just doesn’t offer enough
>> benefits in this case to justify the hassle.
>
> If you were writing to only one directory I can understand this
> sentiment.  Again, if you were writing 3 directories fairly evenly, with
> the md concat, then your sentiment here should be quite different.

Haha, I made a U-turn on this one. XFS is back on the table (and on
the disks now) ;). When I thought I was done, I wanted to restore a
few large KVM images which were on the disks prior to the RAID
reconfiguration. With ext4, I watched iostat writing at 130MB/s for a
while. After 2 or 3 minutes, it broke down completely and languished
at 30-40MB/s for many minutes, even after I had SIGSTOPed the writing
process, during which it was nearly impossible to use vim to edit a
file on the ext4 partition. It would pause for tens of seconds all the
time. It’s not even clear why it broke down so badly. From another
seekwatcher sample I took, it looked like fairly linear writing.

So I threw XFS back in, restarted the restore, and it went very
smoothly while still providing acceptable interactivity.

XFS is not a panacea (obviously), and it may be a bit slower in many
cases, and doesn’t seem to cope well with fragmented free space (which
is what this entire thread is really about), but overall it feels more
well-rounded. After all, I don’t really care how much it writes per
time unit, as long as it’s not ridiculously little and it doesn’t
bring everything else to a halt.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 11:52               ` Stefan Ring
@ 2012-04-10  7:34                 ` Stan Hoeppner
  2012-04-10 13:59                   ` Stefan Ring
  0 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-10  7:34 UTC (permalink / raw)
  To: xfs

On 4/9/2012 6:52 AM, Stefan Ring wrote:

> Whatever the problem with the controller may be, it behaves quite
> nicely usually. It seems clear though, that, regardless of the storage
> technology, it cannot be a good idea to schedule tiny blocks in the
> order that XFS schedules them in my case.
> 
> This:
> AG0 *   *   *
> AG1  *   *   *
> AG2   *   *   *
> AG3    *   *   *
> 
> cannot be better than this:
> 
> AG0 ***
> AG1    ***
> AG2       ***
> AG3          ***

With 4 AGs this must represent the RAID6 or RAID10 case.  Those don't
seem to show any overlapping concurrency.  Maybe I'm missing something,
but it should look more like this, at least in the concat case:

AG0 ***
AG1 ***
AG2 ***

> Yes, in theory, a good cache controller should be able to sort this
> out. But at least this particular controller is not able to do so and
> could use a little help. 

Is the cache in write-through or write-back mode?  The latter should
allow for aggressive reordering.  The former none, or very little.  And
is all of it dedicated to writes, or is it split?  If split, dedicate it
all to writes.  Linux is going to cache block reads anyway, so it makes
little sense to cache them in the controller as well.

> Also, a single consumer-grade drive is
> certainly not helped by this write ordering.

Are you referring to the Mushkin SSD I mentioned?  The SandForce 2281
onboard the Enhanced Chronos Deluxe is capable of a *sustained* 20,000
4KB random write IOPS, 60,000 peak.  Mushkin states 90,000, which may be
due to their use of Toggle Mode NAND instead of ONFi, and/or they're simply
fudging.  Regardless, 20K real write IOPS is enough to make
scheduling/ordering mostly irrelevant I'd think.  Just format with 8 AGs
to be on the safe side for DLP (directory level parallelism), and you're
off to the races.  The features of the SF2000 series make MLC SSDs based
on it much more like 'enterprise' SLC SSDs in most respects.  The lines
between "consumer" and "enterprise" SSDs have already been blurred as
many vendors have already been selling "enterprise" MLC SSDs for a while
now, including Intel, Kingston, OCZ, PNY, and Seagate.  All are based on
the same SandForce 2281 as in this Mushkin, or the 2282, which is
required for devices over 512GB.

-- 
Stan


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10  7:34                 ` Stan Hoeppner
@ 2012-04-10 13:59                   ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-10 13:59 UTC (permalink / raw)
  To: Linux fs XFS

> With 4 AGs this must represent the RAID6 or RAID10 case.

Yes, the original RAID 6 case.

>> Yes, in theory, a good cache controller should be able to sort this
>> out. But at least this particular controller is not able to do so and
>> could use a little help.
>
> Is the cache in write-through or write-back mode?  The latter should
> allow for aggressive reordering.  The former none, or very little.  And
> is all of it dedicated to writes, or is it split?  If split, dedicate it
> all to writes.  Linux is going to cache block reads anyway, so it makes
> little sense to cache them in the controller as well.

The cache is a write-back cache. Yes, it’s split 75% write / 25% read.
Changing to 100% write does not make a difference.

I can imagine that the small read cache might be beneficial for
partial stripe writes, when the stripe contents from the untouched
drives are in cache.

>> Also, a single consumer-grade drive is
>> certainly not helped by this write ordering.
>
> Are you referring to the Mushkin SSD I mentioned?

No, I meant rotational storage. But even SSDs should gain at least a
little from a linear write pattern.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-07 18:57     ` Martin Steigerwald
@ 2012-04-10 14:02       ` Stefan Ring
  2012-04-10 14:32         ` Joe Landman
                           ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-10 14:02 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Christoph Hellwig, xfs

> And is XFS aligned to the RAID 6?
>
> What does xfs_info display on it?

Yes, it’s aligned.

meta-data=/dev/mapper/vg_data-lvhome isize=256    agcount=4, agsize=73233656 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=292934624, imaxpct=5
         =                       sunit=8      swidth=32 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=143040, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I changed the stripe size to 32kb in the meantime. This way, it
performs slightly better.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-06 15:35     ` Peter Grandi
@ 2012-04-10 14:05       ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-10 14:05 UTC (permalink / raw)
  To: Linux fs XFS

> BTW congratulations for limiting your RAID6 set to 4+2, and
> using a relatively small chunk size compared to that chosen by
> many others.

Interestingly, it performs better with a larger stripe size, though.
Probably because it’s better able to combine writes when the blocks
are larger.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 14:02       ` Stefan Ring
@ 2012-04-10 14:32         ` Joe Landman
  2012-04-10 15:56           ` Stefan Ring
  2012-04-10 18:13         ` Martin Steigerwald
  2012-04-10 20:44         ` Stan Hoeppner
  2 siblings, 1 reply; 64+ messages in thread
From: Joe Landman @ 2012-04-10 14:32 UTC (permalink / raw)
  To: xfs

On 04/10/2012 10:02 AM, Stefan Ring wrote:
>> And is XFS aligned to the RAID 6?
>>
>> What does xfs_info display on it?
>
> Yes, it’s aligned.
>
> meta-data=/dev/mapper/vg_data-lvhome isize=256    agcount=4, agsize=73233656 blks
>           =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=292934624, imaxpct=5
>           =                       sunit=8      swidth=32 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=143040, version=2
>           =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> I changed the stripe size to 32kb in the meantime. This way, it
> performs slightly better.

Try 128k to 512k for the stripe size.  And try to increase your agcount by
(nearly) an order of magnitude.
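
In mkfs terms that would be something like the following -- LV path taken
from your xfs_info output, 4 data disks assumed, and su adjusted to
whatever stripe size you settle on:

  $ mkfs.xfs -f -d agcount=32,su=256k,sw=4 /dev/mapper/vg_data-lvhome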


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 14:32         ` Joe Landman
@ 2012-04-10 15:56           ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-10 15:56 UTC (permalink / raw)
  To: Linux fs XFS

> try 128k to 512k for stripe size.  And try to increase your agcount by
> (nearly) an order of magnitude.

Would that be of any real value to anyone here, except for satisfying
curiosity (which I feel as well ;))? Because frankly, it’s a lot of
work, and I’m quite through with this tedious kind of activity…

My conclusion is that everything would work well if the levels below
the file system behaved the way they should and brought the writes
into a sane order. Apparently, both the RAID controller and the
Linux block scheduler fail to do so. Despite the annoying nature of
this state of affairs, I do believe that file systems should be able
to count on the lower levels of the stack for such low-level work and
not work around them, but apparently, they are often failed. Probably
that’s one of the reasons why almost every file system acquires some
sort of block scheduling over time. Maybe some day, the Linux IO
scheduler will do a better job. Unfortunately, by then, this entire
issue will be irrelevant because nobody will be using rotational
storage anymore, at least not for everyday work.
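
(In case anyone wants to experiment with the elevator, it can be switched
at runtime via sysfs -- sdb here is only a placeholder for the actual
device node:

  # cat /sys/block/sdb/queue/scheduler
  # echo deadline > /sys/block/sdb/queue/scheduler
)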


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 14:02       ` Stefan Ring
  2012-04-10 14:32         ` Joe Landman
@ 2012-04-10 18:13         ` Martin Steigerwald
  2012-04-10 20:44         ` Stan Hoeppner
  2 siblings, 0 replies; 64+ messages in thread
From: Martin Steigerwald @ 2012-04-10 18:13 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Christoph Hellwig, xfs

Am Dienstag, 10. April 2012 schrieb Stefan Ring:
> > And is XFS aligned to the RAID 6?
> > 
> > What does xfs_info display on it?
> 
> Yes, it’s aligned.
> 
> meta-data=/dev/mapper/vg_data-lvhome isize=256    agcount=4, agsize=73233656 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=292934624, imaxpct=5
>          =                       sunit=8      swidth=32 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=143040, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Hmmm, so it's not the alignment. The xfs_info output looks sane otherwise.

I have no further ideas for now, but others had some, it seems. (Reading the
rest of the new messages in the thread.)

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 14:21         ` Geoffrey Wehrman
@ 2012-04-10 19:30           ` Stan Hoeppner
  2012-04-11 22:19             ` Geoffrey Wehrman
  0 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-10 19:30 UTC (permalink / raw)
  To: Geoffrey Wehrman; +Cc: Stefan Ring, Linux fs XFS

On 4/9/2012 9:21 AM, Geoffrey Wehrman wrote:
> On Fri, Apr 06, 2012 at 06:28:37PM -0500, Stan Hoeppner wrote:
> | So while the XFS AG architecture may not be perfectly suited to your
> | single 6 drive RAID6 array, it still gives rather remarkable performance
> | given that the same architecture can scale pretty linearly to the
> | heights above, and far beyond.  Something EXTx and others could never
> | dream of.  Some of the SGI guys might be able to confirm deployed single
> | XFS filesystems spanning 1000+ drives in the past.  Today we'd probably
> | only see that scale with CXFS.

Good to hear from you Geoffrey.

> With an SGI IS16000 array which supports up to 1,200 drives, filesystems
> with large numbers of drives isn't difficult.  Most configurations
> using the IS16000 have 8+2 RAID6 luns.  

Is the concatenation of all these RAID6 LUNs performed within the
IS16000, or with md/lvm, or?

> I've seen sustained 15 GB/s to
> a single filesystem on one of the arrays with a 600 drive configuration.

To be clear, this is a single Linux XFS filesystem on a single host, not
multiple CXFS clients, correct?  If so, out of curiosity, is the host in
this case an old Itanium Altix or the newer Xeon based Altix UV?  And
finally, is this example system using FC or Infiniband connectivity?
How many ports?

> The scalability of XFS is impressive.

Quite impressive.  And there's nothing in XFS itself preventing
scalability of a single filesystem over 4 IS16000s w/4800 total drives,
although one might run into some limitations when attempting to
concatenate that many LUNs.  I've never attempted that scale with md or
lvm, and I've never had my hands on an IS16000.

-- 
Stan


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10  6:11                 ` Stefan Ring
@ 2012-04-10 20:29                   ` Stan Hoeppner
  2012-04-10 20:43                     ` Stefan Ring
  0 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-10 20:29 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 4/10/2012 1:11 AM, Stefan Ring wrote:
>> 150MB/s isn't correct.  Should be closer to 450MB/s.  This makes it
>> appear that you're writing all these files to a single directory.  If
>> you're writing them fairly evenly to 3 directories or a multiple of 3,
>> you should see close to 450MB/s, if using mdraid linear over 3 P400
>> RAID1 pairs.  If this is what you're doing then something seems wrong
>> somewhere.  Try unpacking a kernel tarball.  Lots of subdirectories to
>> exercise all 3 AGs thus all 3 spindles.
> 
> The spindles were exercised; I watched it with iostat. Maybe I could
> have reached more with more parallelism, but that wasn’t my goal at
> all. Although, over the course of these experiments, I got to doubt
> that the controller could even handle this data rate.

Hmm.  We might need to see more detail of what your workload is actually
doing.  It's possible that 3 AGs is too few.  Going with more will cause
more head seeking, but it might also alleviate some bottlenecks within
XFS itself that we may be creating by using only 3 AGs.  I don't know
XFS internals well enough to say.  Dave can surely tell us if 3 may be
too few.

And yes, that controller doesn't seem to be the speediest with a huge
random IO workload.

>>> simple copy of the tar onto the XFS file system yields the same linear
>>> performance, the same as with ext4, btw. So 150 MB/sec seems to be the
>>> best these disks can do, meaning that theoretically, with 3 AGs, it
>>> should be able to reach 450 MB/sec under optimal conditions.
>>
>> The optimal condition, again, requires writing 3 of this file to 3
>> directories to hit ~450MB/s, which you should get close to if using
>> mdraid linear over RAID1 pairs.  XFS is a filesystem after all, so it's
>> parallelism must come from manipulating usage of filesystem structures.
>>  I thought I explained all of this previously when I introduced the "XFS
>> concat" into this thread.
> 
> The optimal condition would be 3 parallel writes of huge files, which
> can be easily written linearly. Not thousands of tiny files.

That was my point.  You mentioned copying a single tar file.  A single
file write to a concatenated XFS will hit only one AG, thus only one
spindle.  If you launch 3 parallel copies of that file to 3 different
directories, each one on a different AG, then you should hit close to
450.  The trick is knowing which directories are on which AGs.  If you
manually create 3 directories right after making the filesystem, each
one will be on a different AG.  Write a file to each of these dirs in
parallel and you should hit ~450MB/s.
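
A rough way to try it -- the md device, mount point and source file are
just placeholders:

  # mkfs.xfs -d agcount=3 /dev/md0
  # mount /dev/md0 /mnt/test
  # mkdir /mnt/test/d1 /mnt/test/d2 /mnt/test/d3
  # for d in d1 d2 d3; do cp /path/to/bigfile /mnt/test/$d/ & done; wait

The three directories created right after mkfs should land in AGs 0, 1
and 2, so the three copies run against different spindles of the concat.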

>>> But then I guess I’m back to ext4 land. XFS just doesn’t offer enough
>>> benefits in this case to justify the hassle.
>>
>> If you were writing to only one directory I can understand this
>> sentiment.  Again, if you were writing 3 directories fairly evenly, with
>> the md concat, then your sentiment here should be quite different.
> 
> Haha, I made a U-turn on this one. XFS is back on the table (and on
> the disks now) ;). When I thought I was done, I wanted to restore a
> few large KVM images which were on the disks prior to the RAID
> reconfiguration. With ext4, I watched iostat writing at 130MB/s for a
> while. After 2 or 3 minutes, it broke down completely and languished
> at 30-40MB/s for many minutes, even after I had SIGSTOPed the writing
> process, during which it was nearly impossible to use vim to edit a
> file on the ext4 partition. It would pause for tens of seconds all the
> time. It’s not even clear why it broke down so badly. From another
> seekwatcher sample I took, it looked like fairly linear writing.

What was the location of the KVM images you were copying?  Is it
possible the source device simply slowed down?  Or network congestion if
this was an NFS copy?

> So I threw XFS back in, restarted the restore, and it went very
> smoothly while still providing acceptable interactivity.

It's nice to know XFS "saved the day" but I'm not so sure XFS deserves
the credit here.  The EXT4 driver itself/alone shouldn't cause the lack
of responsiveness behavior you saw.  I'm guessing something went wrong
on the source side of these file copies, given your report of dropping
to 30-40MB/s on the writeout.

> XFS is not a panacea (obviously), and it may be a bit slower in many
> cases, and doesn’t seem to cope well with fragmented free space (which
> is what this entire thread is really about), 

Did you retest fragmented freespace writes with the linear concat or
RAID10?  If not you're drawing incorrect conclusions due to not having
all the facts.  RAID6 can cause tremendous overhead with writes into
fragmented free space because of RMW, same with RAID5.  And given the
P400's RAID6 performance it's not at all surprising XFS would appear to
perform poorly here.  And my suggestion of using only 3 AGs to minimize
seeks may actually be detrimental here as well.  6 AGs may perform
better, and overall, than 3 AGs.
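
For reference, the linear concat I keep referring to is nothing more
exotic than mdraid linear over the three P400 RAID1 LUNs; roughly like
this, with the LUNs showing up as /dev/sd[bcd] purely as an example:

  # mdadm --create /dev/md0 --level=linear --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd
  # mkfs.xfs -d agcount=3 /dev/md0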

> but overall it feels more
> well-rounded. After all, I don’t really care how much it writes per
> time unit, as long as it’s not ridiculously little and it doesn’t
> bring everything else to a halt.

And you should be discovering by now that while XFS may not be a
"panacea" of a filesystem, it has unbelievable flexibility in allowing
you to tune it for specific storage layouts and workloads to wring out
its maximum performance.  Even with optimum tuning, it may not match the
performance of other filesystems for specific workloads, but you can
tune it to get damn close with ALL workloads, and also trounce all others
with very large workloads.  No other filesystem can do this.  Note
Geoffrey's example of an XFS on 600 disks with 15GB/s throughput.  Name
another FS that can perform acceptably with your workload, and also that
workload. ;)

-- 
Stan



* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 20:29                   ` Stan Hoeppner
@ 2012-04-10 20:43                     ` Stefan Ring
  2012-04-10 21:29                       ` Stan Hoeppner
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-10 20:43 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> What was the location of the KVM images you were copying?  Is it
> possible the source device simply slowed down?  Or network congestion if
> this was an NFS copy?

Piped via ssh from another host. No, everything was completely idle otherwise.

>> So I threw XFS back in, restarted the restore, and it went very
>> smoothly while still providing acceptable interactivity.
>
> It's nice to know XFS "saved the day" but I'm not so sure XFS deserves
> the credit here.  The EXT4 driver itself/alone shouldn't cause the lack
> of responsiveness behavior you saw.  I'm guessing something went wrong
> on the source side of these file copies, given your report of dropping
> to 30-40MB/s on the writeout.

Maybe it shouldn’t, but something sure did. And the circumstances seem
to point at ext4. Since the situation persisted for minutes after I
had stopped the transfer, it cannot possibly have been related to the
source.

I have a feeling that with appropriate vm.dirty_ratio tuning (and
probably related settings), I could have remedied this. But that’s
just one more thing I’d have to tinker with just to get
acceptable behavior out of this machine. I don’t mind if I don’t get
top-notch performance out of the box, but this is simply too much. I
don’t want to be expected to hand-tune every damn thing.
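
(The knobs I have in mind are along these lines -- the numbers are only a
guess, not something I have actually tested:

  # sysctl -w vm.dirty_background_ratio=5
  # sysctl -w vm.dirty_ratio=10

That would start background writeback earlier and cap the amount of dirty
memory, which should keep such writeback stalls shorter.)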

>> XFS is not a panacea (obviously), and it may be a bit slower in many
>> cases, and doesn’t seem to cope well with fragmented free space (which
>> is what this entire thread is really about),
>
> Did you retest fragmented freespace writes with the linear concat or
> RAID10?  If not you're drawing incorrect conclusions due to not having
> all the facts.

Yes, I did this. It performed very well. Only slightly slower than on
a completely empty file system.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 14:02       ` Stefan Ring
  2012-04-10 14:32         ` Joe Landman
  2012-04-10 18:13         ` Martin Steigerwald
@ 2012-04-10 20:44         ` Stan Hoeppner
  2012-04-10 21:00           ` Stefan Ring
  2 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-10 20:44 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Christoph Hellwig, xfs

On 4/10/2012 9:02 AM, Stefan Ring wrote:
>> And is XFS aligned to the RAID 6?
>>
>> What does xfs_info display on it?
> 
> Yes, it’s aligned.
> 
> meta-data=/dev/mapper/vg_data-lvhome 

Is the LVM volume aligned to the RAID stripe?  Is there a partition atop
the RAID LUN and under LVM?  Is the partition aligned?  Why LVM anyway?

>                                  isize=256    agcount=4, agsize=73233656 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=292934624, imaxpct=5
>          =                       sunit=8      swidth=32 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=143040, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> I changed the stripe size to 32kb in the meantime. This way, it
> performs slightly better.

The devil is always in the details.  Were you using partitions and LVM
with the RAID1 concat testing?  With the free space testing?

I assumed you were directly formatting the LUN with XFS.  With LVM and
possibly partitions involved here, that could explain some of the
mediocre performance across the board, with both EXT4 and XFS.  If one
wants maximum performance from their filesystem, one should typically
stay away from partitions and LVM, and any other layers that can slow IO
down.

-- 
Stan


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 20:44         ` Stan Hoeppner
@ 2012-04-10 21:00           ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-10 21:00 UTC (permalink / raw)
  To: stan; +Cc: Christoph Hellwig, xfs

> Is the LVM volume aligned to the RAID stripe?  Is their a partition atop
> the RAID LUN and under LVM?  Is the partition aligned?  Why LVM anyway?

Yes, it is aligned. I followed the advice from
<http://www.mysqlperformanceblog.com/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/>.

Why LVM? Because we use it on lots of servers, and there is some value
to having a somewhat similar setup in development as in production.
I’ve done similar tests time and again with LVM and without, and I’ve
never ever measured a significant difference. I haven’t re-tested it
this time, true, but I would be surprised if it would magically behave
completely differently this time.

> The devil is always in the details.  Were you using partitions and LVM
> with the RAID1 concat tesing?  With the free space testing?

I used LVM linear for the concatenation – one volume group made from 3
physical volumes. The pvols were on primary partitions. The one-volume
RAID 6 is set up similarly; from only one pvol of course.

> I assumed you were directly formatting the LUN with XFS.  With LVM and
> possibly partitions involved here, that could explain some of the
> mediocre performance across the board, with both EXT4 and XFS.  If one
> wants maximum performance from their filesystem, one should typically
> stay away from partitions and LVM, and any other layers that can slow IO
> down.

I don’t want maximum performance, I want acceptable performance ;).
This means, I am satisfied with 80% or more of what’s possible, but
I’m not satisfied with 15%.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 20:43                     ` Stefan Ring
@ 2012-04-10 21:29                       ` Stan Hoeppner
  0 siblings, 0 replies; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-10 21:29 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 4/10/2012 3:43 PM, Stefan Ring wrote:
> I don’t want to be expected to hand-tune every damn thing.

You don't.

>> $ mkfs.xfs -d agcount=3 /dev/[device]

> With a nice and tidy fresh XFS file system, performance is indeed
> impressive – about 16 sec for the same task that would take 2 min 25
> before.

9x improvement in your workload.  First problem down.  What was the
runtime for EXT4 here?  Less than 16 seconds?

>>> and doesn’t seem to cope well with fragmented free space (which
>>> is what this entire thread is really about),

>> Did you retest fragmented freespace writes

> Yes, I did this. It performed very well. Only slightly slower than on
> a completely empty file system.

2nd problem down.  So the concat is your solution, no?  If not, what's
still missing?

BTW, concats don't have parity thus no RMW, so with the concat setup you
should set 100% of the P400 cache to writes.  The 25% you had for reads
definitely helps RAID6 RMW, but yields no benefit for concat.  Bump
write cache to 100% and you'll gain a little more XFS concat
performance.  And if by chance there is some weird logic in the P400
firmware, dedicating 100% to write cache may magically blow the doors
off.  I'm guessing I'm not the only one here to have seen odd magical
settings values like this at least once, though not necessarily with
RAID cache.

Even if not magical, in addition to increasing write cache size by 25%,
you will also increase write cache bandwidth with your high allocation
workload, as metadata free space lookups won't get cached by the
controller.  And given that sector write ordering is an apparent problem
currently, having this extra size and bandwidth may put you over the top.

-- 
Stan


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-10 19:30           ` Stan Hoeppner
@ 2012-04-11 22:19             ` Geoffrey Wehrman
  0 siblings, 0 replies; 64+ messages in thread
From: Geoffrey Wehrman @ 2012-04-11 22:19 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Stefan Ring, Linux fs XFS

On Tue, Apr 10, 2012 at 02:30:39PM -0500, Stan Hoeppner wrote:
| On 4/9/2012 9:21 AM, Geoffrey Wehrman wrote:
| > On Fri, Apr 06, 2012 at 06:28:37PM -0500, Stan Hoeppner wrote:
| > | So while the XFS AG architecture may not be perfectly suited to your
| > | single 6 drive RAID6 array, it still gives rather remarkable performance
| > | given that the same architecture can scale pretty linearly to the
| > | heights above, and far beyond.  Something EXTx and others could never
| > | dream of.  Some of the SGI guys might be able to confirm deployed single
| > | XFS filesystems spanning 1000+ drives in the past.  Today we'd probably
| > | only see that scale with CXFS.
| 
| Good to hear from you Geoffrey.
| 
| > With an SGI IS16000 array which supports up to 1,200 drives, filesystems
| > with large numbers of drives isn't difficult.  Most configurations
| > using the IS16000 have 8+2 RAID6 luns.  
| 
| Is the concatenation of all these RAID6 LUNs performed within the
| IS16000, or with md/lvm, or?

The LUNs were concatenated with XVM which is SGI's md/lvm equivalent.
The filesystem was then constructed so that the LUN boundaries matched
AG boundaries in the filesystem.  The filesystem was mounted with the
inode64 mount option.  inode64 rotors new directories across AGs, and then
attempts to allocate space for files in the AG containing their parent
directory.  Utilizing this behavior allowed the generated load to be
spread across the entire set of LUNs.
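
In practice that just means matching the AG count to the number of LUNs at
mkfs time and mounting with inode64; a much simplified sketch, with made-up
paths and a 60 LUN configuration assumed:

  # mkfs.xfs -d agcount=60 /dev/lxvm/bigvol
  # mount -o inode64 /dev/lxvm/bigvol /mnt/big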

| > I've seen sustained 15 GB/s to
| > a single filesystem on one of the arrays with a 600 drive configuration.
| 
| To be clear, this is a single Linux XFS filesystem on a single host, not
| multiple CXFS clients, correct?  If so, out of curiosity, is the host in
| this case an old Itanium Altix or the newer Xeon based Altix UV?  And
| finally, is this example system using FC or Infiniband connectivity?
| How many ports?

This was a single Linux XFS filesystem, but with two CXFS client hosts.
They were both rather ordinary dual socket x86_64 Xeon systems using
FC connectivity.  I fully expect that the same results could be obtained
from a single host with enough I/O bandwidth.


-- 
Geoffrey Wehrman


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-09 12:45                 ` Emmanuel Florac
@ 2012-04-13 19:36                   ` Stefan Ring
  2012-04-14  7:32                     ` Stan Hoeppner
  0 siblings, 1 reply; 64+ messages in thread
From: Stefan Ring @ 2012-04-13 19:36 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: stan, xfs

> Let's rerun it with files cached (the machine has 16 GB RAM, so
> every single file must be cached):
>
> # time tar xf test.tar
>
> real    0m50.842s
> user    0m0.809s
> sys     0m13.767s

That’s about the same time I’m getting on a fresh (non-fragmented)
file system with the RAID 6 volume.

Interestingly, the P400’s successor, the P410, does recognize a setting
that the P400 lacks, which is called elevatorsort. It sounds like this
could make all the difference. Unfortunately, the P400 doesn’t have
it. I don’t have a P410 with more than 2 drives to test this, but some
effect should definitely be measurable.
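
For anyone with a P410 at hand, the setting shows up in hpacucli roughly
like this -- I haven't verified the exact modify syntax myself:

  # hpacucli ctrl slot=1 show config detail | grep -i elevator
  # hpacucli ctrl slot=1 modify elevatorsort=enable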

Since this finding has piqued my interest again, I’m willing to invest
a little more time, but I’m completely occupied for the next few days,
so it will have to wait a while.


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-13 19:36                   ` Stefan Ring
@ 2012-04-14  7:32                     ` Stan Hoeppner
  2012-04-14 11:30                       ` Stefan Ring
  0 siblings, 1 reply; 64+ messages in thread
From: Stan Hoeppner @ 2012-04-14  7:32 UTC (permalink / raw)
  To: xfs

On 4/13/2012 2:36 PM, Stefan Ring wrote:
>> Let's rerun it with files cached (the machine has 16 GB RAM, so
>> every single file must be cached):
>>
>> # time tar xf test.tar
>>
>> real    0m50.842s
>> user    0m0.809s
>> sys     0m13.767s
> 
> That’s about the same time I’m getting on a fresh (non-fragmented)
> file system with the RAID 6 volume.
> 
> Interestingly, the P400’s successor, the P410 does recognize a setting
> that the P400 lacks, which is called elevatorsort. It sounds like this
> could make all the difference. Unfortunately, the P400 doesn’t have
> it. I don’t have a P410 with more than 2 drives to test this, but some
> effect should definitely be measurable.
> 
> Since this finding has piqued my interest again, I’m willing to invest
> a little more time, but I’m completely occupied for the next few days,
> so it will have to wait a while.

What configuration are you running right now Stefan?  You said you went
back to XFS due to the EXT4 lockups, but I can't recall what RAID config
you put underneath it this time.

-- 
Stan


* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
  2012-04-14  7:32                     ` Stan Hoeppner
@ 2012-04-14 11:30                       ` Stefan Ring
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Ring @ 2012-04-14 11:30 UTC (permalink / raw)
  To: stan; +Cc: xfs

On Sat, Apr 14, 2012 at 9:32 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 4/13/2012 2:36 PM, Stefan Ring wrote:
>>> Let's rerun it with files cached (the machine has 16 GB RAM, so
>>> every single file must be cached):
>>>
>>> # time tar xf test.tar
>>>
>>> real    0m50.842s
>>> user    0m0.809s
>>> sys     0m13.767s
>>
>> That’s about the same time I’m getting on a fresh (non-fragmented)
>> file system with the RAID 6 volume.
>>
> What configuration are you running right now Stefan?  You said you went
> back to XFS due to the EXT4 lockups, but I can't recall what RAID config
> you put underneath it this time.

RAID 6 4+2, LVM (single volume), 32kb stripe size (=> full stripe:
128kb), agcount=4

Except for the stripe size, the same config I had originally. The only
instance of really poor behavior is with the (artificially) fragmented
free space.
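
In mkfs terms the geometry corresponds to roughly the following -- the LV
path is the one from my earlier xfs_info output:

  $ mkfs.xfs -d agcount=4,su=32k,sw=4 /dev/mapper/vg_data-lvhome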

I have moved everything elsewhere for a while, so I can once again do
some testing that involves destroying and rebuilding everything.


end of thread, other threads:[~2012-04-14 11:30 UTC | newest]

Thread overview: 64+ messages
2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
2012-04-05 19:56 ` Peter Grandi
2012-04-05 22:41   ` Peter Grandi
2012-04-06 14:36   ` Peter Grandi
2012-04-06 15:37     ` Stefan Ring
2012-04-07 13:33       ` Peter Grandi
2012-04-05 21:37 ` Christoph Hellwig
2012-04-06  1:09   ` Peter Grandi
2012-04-06  8:25   ` Stefan Ring
2012-04-07 18:57     ` Martin Steigerwald
2012-04-10 14:02       ` Stefan Ring
2012-04-10 14:32         ` Joe Landman
2012-04-10 15:56           ` Stefan Ring
2012-04-10 18:13         ` Martin Steigerwald
2012-04-10 20:44         ` Stan Hoeppner
2012-04-10 21:00           ` Stefan Ring
2012-04-05 22:32 ` Roger Willcocks
2012-04-06  7:11   ` Stefan Ring
2012-04-06  8:24     ` Stefan Ring
2012-04-05 23:07 ` Peter Grandi
2012-04-06  0:13   ` Peter Grandi
2012-04-06  7:27     ` Stefan Ring
2012-04-06 23:28       ` Stan Hoeppner
2012-04-07  7:27         ` Stefan Ring
2012-04-07  8:53           ` Emmanuel Florac
2012-04-07 14:57           ` Stan Hoeppner
2012-04-09 11:02             ` Stefan Ring
2012-04-09 12:48               ` Emmanuel Florac
2012-04-09 12:53                 ` Stefan Ring
2012-04-09 13:03                   ` Emmanuel Florac
2012-04-09 23:38               ` Stan Hoeppner
2012-04-10  6:11                 ` Stefan Ring
2012-04-10 20:29                   ` Stan Hoeppner
2012-04-10 20:43                     ` Stefan Ring
2012-04-10 21:29                       ` Stan Hoeppner
2012-04-09  0:19           ` Dave Chinner
2012-04-09 11:39             ` Emmanuel Florac
2012-04-09 21:47               ` Dave Chinner
2012-04-07  8:49         ` Emmanuel Florac
2012-04-08 20:33           ` Stan Hoeppner
2012-04-08 21:45             ` Emmanuel Florac
2012-04-09  5:27               ` Stan Hoeppner
2012-04-09 12:45                 ` Emmanuel Florac
2012-04-13 19:36                   ` Stefan Ring
2012-04-14  7:32                     ` Stan Hoeppner
2012-04-14 11:30                       ` Stefan Ring
2012-04-09 14:21         ` Geoffrey Wehrman
2012-04-10 19:30           ` Stan Hoeppner
2012-04-11 22:19             ` Geoffrey Wehrman
2012-04-07 16:50       ` Peter Grandi
2012-04-07 17:10         ` Joe Landman
2012-04-08 21:42           ` Stan Hoeppner
2012-04-09  5:13             ` Stan Hoeppner
2012-04-09 11:52               ` Stefan Ring
2012-04-10  7:34                 ` Stan Hoeppner
2012-04-10 13:59                   ` Stefan Ring
2012-04-09  9:23             ` Stefan Ring
2012-04-09 23:06               ` Stan Hoeppner
2012-04-06  0:53   ` Peter Grandi
2012-04-06  7:32     ` Stefan Ring
2012-04-06  5:53   ` Stefan Ring
2012-04-06 15:35     ` Peter Grandi
2012-04-10 14:05       ` Stefan Ring
2012-04-07 19:11     ` Peter Grandi
