From: pg_xf2@xf2.for.to.sabi.co.UK (Peter Grandi)
To: Linux fs XFS <xfs@oss.sgi.com>
Subject: Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
Date: Fri, 6 Apr 2012 00:07:23 +0100
Message-ID: <20350.9643.379841.771496@tree.ty.sabi.co.UK>
In-Reply-To: <CAAxjCEwBMbd0x7WQmFELM8JyFu6Kv_b+KDe3XFqJE6shfSAfyQ@mail.gmail.com>

[ ... ]

> [ ... ] tarball of a finished IcedTea6 build, about 2.5 GB in
> size. It contains roughly 200,000 files in 20,000 directories.
> [ ... ] given that the entire write set -- all 2.5 GB of it --
> is "known" to the file system, that is, in memory, wouldn't it
> be possible to write it out to disk in a somewhat more
> reasonable fashion?  [ ... ] The disk hardware used was a
> SmartArray p400 controller with 6x 10k rpm 300GB SAS disks in
> RAID 6. The server has plenty of RAM (64 GB).

On reflection this triggers an aside for me: traditional
filesystem types are designed for the case where the ratio is
the opposite, something like a 64GB data collection to process
and 2.5GB of RAM, and where the issue is therefore minimizing
ongoing disk accesses, not the bulk upload of a large, sparse
set of data from memory to disk.

The Sprite Log-structured File System was a design targeted at
large-memory systems, on the assumption that writes then become
the issue (especially as Sprite was network-based) and that
reads would mostly be satisfied from RAM, as in your (euphemism)
insipid test.

I suspect that if the fundamental tradeoffs are inverted, then a
completely different design like an LFS might be appropriate.

But the above has a bearing on your (euphemism) unwise concerns:
the case where 200,000 files totalling 2.5GB are written
entirely to RAM and then flushed as a whole to disk is not only
"untraditional", it is also (euphemism) peculiar: try setting
the flusher to run rather often, so that no more than 100-300MB
of dirty pages are outstanding at any one time.
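
As a sketch (assuming a reasonably recent kernel that has the
'vm.dirty_*_bytes' sysctls; the exact numbers are only a
starting point, not a recommendation):

  # start background writeback once ~100MB of pages are dirty, and
  # block writers outright at ~300MB outstanding
  sysctl -w vm.dirty_background_bytes=$((100*1024*1024))
  sysctl -w vm.dirty_bytes=$((300*1024*1024))
  # wake the flusher threads more often (values are in centiseconds)
  sysctl -w vm.dirty_writeback_centisecs=200
  sysctl -w vm.dirty_expire_centisecs=500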

Which brings up another subject: usually hw RAID host adapters
have a cache, and firmware that cleverly rearranges writes.

Looking at the specs of the P400:

  http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/

it seems to me that it has 256MB of cache as standard, and only
supports RAID6 with a battery-backed write cache (wise!).

Which means that your Linux-level seek graphs may not be so
useful, because the host adapter may be drastically rearranging
the seek patterns, and you may need to tweak the P400's own
elevator, rather than or in addition to the Linux elevator.
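
The P400 side has to be done with HP's own tools, and I won't
guess at those, but the Linux side can be inspected and changed
at runtime; a sketch, assuming the array shows up as 'sda'
(substitute your device):

  # see which elevator is in use and which are available
  cat /sys/block/sda/queue/scheduler
  # e.g. switch to 'deadline', or 'noop' to leave reordering to the P400
  echo deadline > /sys/block/sda/queue/scheduler
  # optionally let more requests queue up so the adapter sees a bigger window
  echo 512 > /sys/block/sda/queue/nr_requests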

Unless, possibly, barriers are enabled and the P400 writes
through on receiving a barrier request even with a BBWC. IIRC
XFS is rather stricter than 'ext4' in issuing barrier requests,
and you may be seeing more the effect of that than the effect of
splitting the access patterns between 4 AGs to improve the
potential for multithreading (which you negate anyway, because
you are using what is most likely a large RAID6 stripe size with
a small-IO-intensive write workload, as previously noted).
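
One way to tell the two effects apart is to compare a run with
barriers off, which is only tolerable here because the P400
cache is battery backed; a sketch, assuming the filesystem is
mounted at /mnt/test:

  # XFS may report barrier status at mount time; look for it in the kernel log
  dmesg | grep -i barrier
  # temporarily remount without barriers (only with a working BBWC!)
  mount -o remount,nobarrier /mnt/test
  # ... rerun the untar test, then restore the default
  mount -o remount,barrier /mnt/test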

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

Thread overview: 64+ messages
2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
2012-04-05 19:56 ` Peter Grandi
2012-04-05 22:41   ` Peter Grandi
2012-04-06 14:36   ` Peter Grandi
2012-04-06 15:37     ` Stefan Ring
2012-04-07 13:33       ` Peter Grandi
2012-04-05 21:37 ` Christoph Hellwig
2012-04-06  1:09   ` Peter Grandi
2012-04-06  8:25   ` Stefan Ring
2012-04-07 18:57     ` Martin Steigerwald
2012-04-10 14:02       ` Stefan Ring
2012-04-10 14:32         ` Joe Landman
2012-04-10 15:56           ` Stefan Ring
2012-04-10 18:13         ` Martin Steigerwald
2012-04-10 20:44         ` Stan Hoeppner
2012-04-10 21:00           ` Stefan Ring
2012-04-05 22:32 ` Roger Willcocks
2012-04-06  7:11   ` Stefan Ring
2012-04-06  8:24     ` Stefan Ring
2012-04-05 23:07 ` Peter Grandi [this message]
2012-04-06  0:13   ` Peter Grandi
2012-04-06  7:27     ` Stefan Ring
2012-04-06 23:28       ` Stan Hoeppner
2012-04-07  7:27         ` Stefan Ring
2012-04-07  8:53           ` Emmanuel Florac
2012-04-07 14:57           ` Stan Hoeppner
2012-04-09 11:02             ` Stefan Ring
2012-04-09 12:48               ` Emmanuel Florac
2012-04-09 12:53                 ` Stefan Ring
2012-04-09 13:03                   ` Emmanuel Florac
2012-04-09 23:38               ` Stan Hoeppner
2012-04-10  6:11                 ` Stefan Ring
2012-04-10 20:29                   ` Stan Hoeppner
2012-04-10 20:43                     ` Stefan Ring
2012-04-10 21:29                       ` Stan Hoeppner
2012-04-09  0:19           ` Dave Chinner
2012-04-09 11:39             ` Emmanuel Florac
2012-04-09 21:47               ` Dave Chinner
2012-04-07  8:49         ` Emmanuel Florac
2012-04-08 20:33           ` Stan Hoeppner
2012-04-08 21:45             ` Emmanuel Florac
2012-04-09  5:27               ` Stan Hoeppner
2012-04-09 12:45                 ` Emmanuel Florac
2012-04-13 19:36                   ` Stefan Ring
2012-04-14  7:32                     ` Stan Hoeppner
2012-04-14 11:30                       ` Stefan Ring
2012-04-09 14:21         ` Geoffrey Wehrman
2012-04-10 19:30           ` Stan Hoeppner
2012-04-11 22:19             ` Geoffrey Wehrman
2012-04-07 16:50       ` Peter Grandi
2012-04-07 17:10         ` Joe Landman
2012-04-08 21:42           ` Stan Hoeppner
2012-04-09  5:13             ` Stan Hoeppner
2012-04-09 11:52               ` Stefan Ring
2012-04-10  7:34                 ` Stan Hoeppner
2012-04-10 13:59                   ` Stefan Ring
2012-04-09  9:23             ` Stefan Ring
2012-04-09 23:06               ` Stan Hoeppner
2012-04-06  0:53   ` Peter Grandi
2012-04-06  7:32     ` Stefan Ring
2012-04-06  5:53   ` Stefan Ring
2012-04-06 15:35     ` Peter Grandi
2012-04-10 14:05       ` Stefan Ring
2012-04-07 19:11     ` Peter Grandi
