* Re: A little RAID experiment
@ 2012-04-26 22:33 Richard Scobie
  2012-04-27 21:30 ` Emmanuel Florac
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Scobie @ 2012-04-26 22:33 UTC (permalink / raw)
  To: stefanrin; +Cc: xfs

I know you were interested in hardware RAID controllers, but out of 
curiosity, this is the result on a 16 x 1TB SATA linux md software RAID6 
array.

Formatted xfs, with external journal on an independent SATA device, 
mounted delaylog,inode64,logbsize=256k,logdev=/dev/md0,noatime,pquota.
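
For reference, the mount invocation for such a setup would look roughly
like this (a sketch only; /dev/md1 as the 16-disk data array and /mnt/test
as the mount point are assumptions, not taken from the report above):

  mount -o delaylog,inode64,logbsize=256k,logdev=/dev/md0,noatime,pquota \
        /dev/md1 /mnt/test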

Operations performed:  0 Read, 26065 Write, 0 Other = 26065 Total
Read 0b  Written 203.63Mb  Total transferred 203.63Mb  (13.575Mb/sec)
  1737.65 Requests/sec executed

Filesystem is 44% full, kernel 2.6.39.2.

xfs_bmap test_file.0
test_file.0:
         0: [0..8388607]: 9565100544..9573489151
         1: [8388608..16777215]: 9578354176..9586742783

Regards,

Richard


* Re: A little RAID experiment
  2012-04-26 22:33 A little RAID experiment Richard Scobie
@ 2012-04-27 21:30 ` Emmanuel Florac
  2012-04-28  4:15   ` Richard Scobie
  0 siblings, 1 reply; 41+ messages in thread
From: Emmanuel Florac @ 2012-04-27 21:30 UTC (permalink / raw)
  To: Richard Scobie; +Cc: stefanrin, xfs

On Fri, 27 Apr 2012 10:33:39 +1200 you wrote:

> Formatted xfs, with external journal on an independent SATA device, 
> mounted delaylog,inode64,logbsize=256k,logdev=/dev/md0,noatime,pquota.

Wouldn't it be preferable to use the RAID controller to host the log
device? That way you benefit from the write cache, as the log easily
fits in it.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: A little RAID experiment
  2012-04-27 21:30 ` Emmanuel Florac
@ 2012-04-28  4:15   ` Richard Scobie
  0 siblings, 0 replies; 41+ messages in thread
From: Richard Scobie @ 2012-04-28  4:15 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: stefanrin, xfs

Emmanuel Florac wrote:

> Wouldn't it be preferable to use the RAID controller to host the log
> device? That way you benefit from the write cache, as the log easily
> fits in it.

This setup is md software RAID ;). The controller is an LSI 1068 using 
initiator-target firmware. There is no write cache I am aware of.

Regards,

Richard


* Re: A little RAID experiment
  2012-10-10 21:27   ` Dave Chinner
@ 2012-10-10 22:01     ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-10-10 22:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linux fs XFS

> Just indicates that the working set for your test is much more
> resident in the controller cache - has nothing to do with the disk
> speeds. Run a larger set of files/workload and the results will end
> up a lot closer to disk speed instead of cache speed...

That's indeed a valid objection, but I just verified that with the
working set size multiplied by the relative cache size difference
(64GB instead of 8GB), the performance stays exactly the same. The new
controller seems to run much better cache control algorithms.
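
For the record, scaling the working set like that with stock sysbench
would look roughly as follows (a sketch; the option names are those of
the unmodified fileio test, and the file count is an assumption rather
than the exact invocation used here):

  sysbench --test=fileio --file-num=128 --file-total-size=64G prepare
  sysbench --test=fileio --file-num=128 --file-total-size=64G \
           --file-test-mode=rndwr run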


* Re: A little RAID experiment
  2012-10-10 14:57 ` Stefan Ring
@ 2012-10-10 21:27   ` Dave Chinner
  2012-10-10 22:01     ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2012-10-10 21:27 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On Wed, Oct 10, 2012 at 04:57:47PM +0200, Stefan Ring wrote:
> Btw, one of our customers recently acquired new gear with HP SmartArray
> Gen8 controllers. Now they are something to get excited about! This is
> the kind of write performance I would expect from an expensive server
> product. Check this out (this is again my artificial benchmark as well
> as random write of 4K blocks):
> 
> SmartArray P400, 6 300G disks (10k, SAS) RAID 6, 256M BBWC:
                                                   ^^^^
.....

> SmartArray Gen8, 8 300G disks (15k, SAS) RAID 5, 2GB FBWC:
                                                   ^^^^

That's the reason for the difference in performance...

> So yeah, the disks are a bit faster. But what does that matter when
> there is such a huge difference otherwise?

Just indicates that the working set for your test is much more
resident in the controller cache - has nothing to do with the disk
speeds. Run a larger set of files/workload and the results will end
up a lot closer to disk speed instead of cache speed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: A little RAID experiment
  2012-04-25  8:07 Stefan Ring
                   ` (2 preceding siblings ...)
  2012-07-16 19:57 ` Stefan Ring
@ 2012-10-10 14:57 ` Stefan Ring
  2012-10-10 21:27   ` Dave Chinner
  3 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-10-10 14:57 UTC (permalink / raw)
  To: Linux fs XFS

Btw, one of our customers recently acquired new gear with HP SmartArray
Gen8 controllers. Now they are something to get excited about! This is
the kind of write performance I would expect from an expensive server
product. Check this out (this is again my artificial benchmark as well
as random write of 4K blocks):

SmartArray P400, 6 300G disks (10k, SAS) RAID 6, 256M BBWC:

ag4
Read 0b  Written 161.56Mb  Total transferred 161.56Mb  (5.3853Mb/sec)
 1378.63 Requests/sec executed

random write
Read 0b  Written 97.578Mb  Total transferred 97.578Mb  (3.2526Mb/sec)
  832.66 Requests/sec executed

SmartArray Gen8, 8 300G disks (15k, SAS) RAID 5, 2GB FBWC:

ag4
Read 0b  Written 2.4575Gb  Total transferred 2.4575Gb  (83.883Mb/sec)
21474.03 Requests/sec executed

random write
Read 0b  Written 343.86Mb  Total transferred 343.86Mb  (11.462Mb/sec)
 2934.24 Requests/sec executed

So yeah, the disks are a bit faster. But what does that matter when
there is such a huge difference otherwise?

Unfortunately, while composing this text, I noticed that the new one
is configured as RAID 5, and I cannot change it because of HP's
licensing policy. That keeps it from being a meaningful comparison, although
extrapolation from previous SmartArray controllers would suggest that
the RAID5 and RAID6 performance is comparable.

My subjective impression is still a very good one!


* Re: A little RAID experiment
  2012-07-26  8:32                             ` Dave Chinner
@ 2012-09-11 16:37                               ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-09-11 16:37 UTC (permalink / raw)
  To: Linux fs XFS

On Thu, Jul 26, 2012, Dave Chinner <david@fromorbit.com> wrote:
>> 10001
>> 20001
>> 30001
>> 40001
>> 10002
>> 20002
>> 30002
>> 40002
>> 10003
>> 20003
>> ...
>
> That's the problem you should have reported.

I did, but then I got bashed for using RAID 5/6 and about hardware
specifics and everything else, which shouldn't even matter, and I let
myself get dragged into that discussion.

Anyway, in the meantime I had a closer look at the actual block trace,
and it looks a bit different from the way I interpreted it at first.
It sends runs of 30-50 writes with holes in them, like so:

2, 4-5, 7, 10-12, 14, 16-17

and so on. These holes seem to be caused by the free space
fragmentation. Every once in a while -- somewhat frequently, after 30
or so blocks, as mentioned -- it switches to another allocation group.
If these blocks were contiguous, then the elevator should be able to
merge them, but the tiny holes make this impossible. So I guess
there's nothing that can be substantially improved here. The frequent
AG switches are a bit difficult for the controller to handle, but
different controllers struggle under different workloads, and there's
nothing that can be done about that. I noticed just today that the HP
SmartArray controllers handle truly random writes better than the
MegaRAID variety that I praised so much in my postings.
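
For anyone who wants to look at the same thing, the write pattern can be
pulled out of a blktrace along these lines (a rough sketch; it assumes
blkparse's default output format with the action in column 6, the RWBS
flags in column 7 and the sector number in column 8, and "tracefile" is a
placeholder for the trace's base name):

  blkparse -i tracefile | awk '$6 == "D" && $7 ~ /W/ { print $8 }' | less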


* Re: A little RAID experiment
  2012-07-25  9:29                           ` Stefan Ring
  2012-07-25 10:00                             ` Stan Hoeppner
@ 2012-07-26  8:32                             ` Dave Chinner
  2012-09-11 16:37                               ` Stefan Ring
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2012-07-26  8:32 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On Wed, Jul 25, 2012 at 11:29:58AM +0200, Stefan Ring wrote:
> In this particular case, performance was conspicuously poor, and after
> some digging with blktrace and seekwatcher, I identified the cause of
> this slowness to be a write pattern that looked like this (in block
> numbers), where the step width (arbitrarily displayed as 10000 here
> for illustration purposes) was 1/4 of the size of the volume, clearly
> because the volume had 4 allocation groups (the default). Of course it
> was not entirely regular, but overall it was very similar to this:
> 
> 10001
> 20001
> 30001
> 40001
> 10002
> 20002
> 30002
> 40002
> 10003
> 20003
> ...

That's the problem you should have reported, not something
artificial from a benchmark. What you seemed to report was "random
writes behave differently on different RAID setups", not that
"writeback is not sorting efficiently".

Indeed, if the above is metadata, then there's something really
weird going on, because metadata writeback is not sorted that way by
XFS, and nothing should cause writeback in that style. i.e. if it is
metadata, it should be:

10001 (queue)
10002 (merge)
10003 (merge)
....
20001 (queue)
20002 (merge)
20003 (merge)
....

and so on for any metadata dispatched in close temporal proximity.

If it is data writeback, then there's still something funny going on
as it implies that the temporal data locality the allocator
provides is non-existent. i.e. inodes that are dirtied sequentially
in the same directory should be written in the same order and
allocation should be to a similar region on disk. Hence you should
get similar IO patterns to the metadata, though not as well formed.

Using xfs_bmap will tell you where the files are located, and often
comparing c/mtime will tell you the order in which files were
written. That can tell you whether data allocation was jumping all
over the place or not...
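
Something along these lines is usually enough to eyeball that (a rough
sketch; GNU stat is assumed, and the glob is a placeholder):

  # print mtime, file name and first extent of each file, sorted by mtime
  for f in /some/dir/*; do
      printf '%s %s %s\n' "$(stat -c %Y "$f")" "$f" \
          "$(xfs_bmap "$f" | sed -n '2s/.*: //p')"
  done | sort -n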

> It has been pointed out that XFS schedules the writes like this on
> purpose so that they can be done in parallel,

XFS doesn't schedule writes like that - it only spreads the
allocation out. Writeback and the IO elevators are what do the IO
scheduling, and sometimes they don't play nicely with XFS.

If you create files in this manner:

/a/file1
/b/file1
/c/file1
/d/file1
/a/file2
/b/file2
....

Then writeback is going to schedule them in the same order, and that
will result in IO being rotored across all AGs because writeback
retains the creation/dirtying order. There's only so much reordering
that can be done when writes are scheduled like this.

If you create files like this:

/a/file1
/a/file2
/a/file3
.....
/b/file1
/b/file2
/b/file3
.....

The writeback will issue them in that order, and data allocation
will be contiguous and hence writes much more sequential.

This is often a problem with naive multi-threaded applications - the
thought that more IO in flight will be faster than what a single
thread can do. If you cause IO to interleave like above, then it
won't go faster and could turn sequential workloads into random IO
workloads.

OTOH, well designed applications can take advantage of XFS's
segregation and scale IO linearly by a combination of careful
placement and scalable block device design (e.g. a concat rather
than a flat stripe).
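
As a rough illustration of the concat idea (device names, disk count and
agcount here are made up, not a recommendation for any particular box):

  # one linear (concat) md device over four disks, one AG per disk
  mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/sd[bcde]
  mkfs.xfs -d agcount=4 /dev/md0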

But I really don't know what your application is - all I know
is that you used sysbench to generate random IO that showed similar
problems. Posting the blktraces for us to analyse ourselves
(I can tell an awful lot from repeating patterns of block
numbers and IO sizes) rather than telling us what you saw is an
example of what we need to see to understand your problem. This
pretty much says it all:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> and that I should create
> a concatenated volume with physical devices matching the allocation
> groups. I actually went through this exercise, and yes, it was very
> beneficial, but that's not the point. I don't want to (have to) do
> that.

If you want to maximise storage performance, then that's what you do
for certain workloads. Saying "I want" followed by "I'm too lazy to
do that, but I still want" won't get you very far....

> And it's not always feasible, anyway. What about home usage with
> a single SATA disk? Is it not worthwhile to perform well on low-end
> devices?

Not really.  XFS is mostly optimised for large scale HPC and enterprise
workloads and hardware. The only small scale system optimisations we
make are generally for your cheap 1-4 disk ARM/MIPS based NAS
devices. The workloads on those are effectively a server workload
anyway, so most of the optimisations we make benefit them as well.

As for desktops, well, it's fast enough for my workstation and
laptop, so I don't really care much more than that.. ;)

> You might ask then, why even bother using XFS instead of ext4?

No, I don't. If ext4 is better or XFS is too much trouble for you,
then it is better for you to use ext4. No-one here will argue
against you doing that - use what works for you.

However, if you do use XFS, and ask for advice, then it pays to
listen to the people who respond because they tend to be power users
with lots of experience or subject matter experts.....

> I care about the multi-user case. The problem I have with ext is that
> it is unbearably unresponsive when someone writes a semi-large amount
> of data (a few gigs) at once -- like extracting a large-ish tarball.
> Just using vim, even with :set nofsync, is almost impossible during
> that time. I have adopted various disgusting hacks like extracting to
> a ramdisk instead and rsyncing the lot over to the real disk with a
> very low --bwlimit, but I'm thoroughly fed up with this kind of crap,
> and in general, XFS works very well.
> 
> If no one cares about my findings, I will henceforth be quiet on this topic.

I care about the problems you are having, but I don't care about a
-simulation- of what you think is the problem. Report the real
problem (data allocation or writeback is not sequential when it
should be) and we might be able to get to the bottom of your issue.

Report a simulation of an issue, and we'll just tell you what is
wrong with your simulation (i.e. random IO and RAID5/6 don't mix. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: A little RAID experiment
  2012-07-25 10:08                               ` Stefan Ring
@ 2012-07-25 11:00                                 ` Stan Hoeppner
  0 siblings, 0 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-25 11:00 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 7/25/2012 5:08 AM, Stefan Ring wrote:
>> You simply must present complete information.  The omission of such is
>> likely why most ignored your post but for me.  I'm the hardwarefreak
>> after all, so I'm always game for RAID discussions. ;)
>>
>> If you can represent with complete specs and data, so that it paints a
>> coherent picture, you may see more willing participation.
> 
> I agree, no offense taken. I will respond to your previous message
> individually, although it won't get as complete as you'd like because
> I simply don't know and cannot find out much about some of the
> systems.

No need for such a response, nobody would read it.  Instead, just
present all the relevant hardware and config specs for the two HBA RAID
cards (as you have nothing on the P2000), and provide that link again to
the sysbench output.  Short, sweet, simple.  If you make posts too long
and rambling you get ignored.  I know from experience. :(

Model:
Cache Size:
#/type of drives:
RAID level:
Cache mode:

That should be sufficient.

-- 
Stan



* Re: A little RAID experiment
  2012-07-25 10:00                             ` Stan Hoeppner
@ 2012-07-25 10:08                               ` Stefan Ring
  2012-07-25 11:00                                 ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-25 10:08 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> You simply must present complete information.  The omission of such is
> likely why most ignored your post but for me.  I'm the hardwarefreak
> after all, so I'm always game for RAID discussions. ;)
>
> If you can represent with complete specs and data, so that it paints a
> coherent picture, you may see more willing participation.

I agree, no offense taken. I will respond to your previous message
individually, although it won't get as complete as you'd like because
I simply don't know and cannot find out much about some of the
systems.


* Re: A little RAID experiment
  2012-07-25  9:29                           ` Stefan Ring
@ 2012-07-25 10:00                             ` Stan Hoeppner
  2012-07-25 10:08                               ` Stefan Ring
  2012-07-26  8:32                             ` Dave Chinner
  1 sibling, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-25 10:00 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

Hi Stefan,

On 7/25/2012 4:29 AM, Stefan Ring wrote:
> There appears to be a bit of tension in this thread, and I have the
> suspicion that it's a case of mismatched presumed expectations. The
> sole purpose of my activity here over the last months was to present
> some findings which I thought would be interesting to XFS developers.
> If I were working on XFS, I would be interested. From most of the
> answers, though, I get the impression that I am perceived as looking
> for help tuning my XFS setup, which is not the case at all. In fact,
> I'm quite happy with it. Let me recap just to give this thread the
> intended tone:

I don't want to top post, but I don't want to trim a bunch lest it
appear I'm ignoring significant points you make, so I'll start here, and
flow, but maybe not respond to each point.

I didn't intend to create tension, and I apologize for any sarcasm in my
last point.  I think you may be on to something, and I do find your
research efforts worthwhile.  However...

The single point I was attempting to make in my last post was that for
your data and conclusions to have any validity, you need to provide all
of the details of your testing environment.  You made head-to-head
comparisons and performance conclusions of 3 RAID systems, but omitted
critical details that are needed to interpret and compare the
performance data.  Some of this data you simply didn't have access to.
In a situation like that, you simply shouldn't include that system in
your presentation.  WRT the LSI controller, you didn't mention RAID
level or number of disks.

You simply must present complete information.  The omission of such is
likely why most ignored your post but for me.  I'm the hardwarefreak
after all, so I'm always game for RAID discussions. ;)

If you can represent with complete specs and data, so that it paints a
coherent picture, you may see more willing participation.

> This episode of my journey with XFS started when I read that there had
> been recent significant performance improvements to XFS' metadata
> operations. Having tried XFS every couple of years or so before, and
> always with the same verdict -- horribly slow -- I was curious if it
> had finally become usable.
> 
> A new server machine arriving just at the right time would serve as
> the perfect testbed. I threw some workloads at it, which I hoped would
> resemble my typical workload, and I focussed especially on areas which
> bothered me the most on our current development server running ext3.
> Everything worked more or less satisfactorily, except for the case of
> un-tarring a metadata-heavy tarball in the presence of considerable
> free-space fragmentation.
> 
> In this particular case, performance was conspicuously poor, and after
> some digging with blktrace and seekwatcher, I identified the cause of
> this slowness to be a write pattern that looked like this (in block
> numbers), where the step width (arbitrarily displayed as 10000 here
> for illustration purposes) was 1/4 of the size of the volume, clearly
> because the volume had 4 allocation groups (the default). Of course it
> was not entirely regular, but overall it was very similar to this:
> 
> 10001
> 20001
> 30001
> 40001
> 10002
> 20002
> 30002
> 40002
> 10003
> 20003
> ...
> 
> I tuned and tweaked everything I could think of -- elevator settings,
> readahead, su/sw, barrier, RAID hardware cache --, but the behavior
> would always be the same. It just so happens that the RAID controller
> in this machine (HP SmartArray P400) doesn't cope very well with a
> write pattern like this. To it, the sequence appears to be random, and
> it performs even worse than it would if it were actually random.
> 
> Going by what I think I know about the topic, it struck me as odd
> that blocks would be sent to disk in this very unfavorable order. To
> my mind, three entities had failed at sanitizing the write sequence:
> the filesystem, the block layer and the RAID controller. My opinion is
> still unchanged regarding the latter two.
> 
> The strikingly bad performance on the RAID controller piqued my
> interest, and I went on a different journey investigating this oddity
> and created a minor sysbench modification that would just measure
> performance for this particular pattern. Not many people helped with
> my experiment, and I was accused of wanting ponies. If I'm the only
> one who is curious about this, then so be it. I deemed it worthwile
> sharing my experience and pointing out that a sequence like the one
> above is a death blow to all HP gear I've got my hands on so far.
> 
> It has been pointed out that XFS schedules the writes like this on
> purpose so that they can be done in parallel, and that I should create
> a concatenated volume with physical devices matching the allocation
> groups. I actually went through this exercise, and yes, it was very
> beneficial, but that's not the point. I don't want to (have to) do
> that. And it's not always feasible, anyway. What about home usage with
> a single SATA disk? Is it not worthwile to perform well on low-end
> devices?
> 
> You might ask then, why even bother using XFS instead of ext4?
> 
> I care about the multi-user case. The problem I have with ext is that
> it is unbearably unresponsive when someone writes a semi-large amount
> of data (a few gigs) at once -- like extracting a large-ish tarball.
> Just using vim, even with :set nofsync, is almost impossible during
> that time. I have adopted various disgusting hacks like extracting to
> a ramdisk instead and rsyncing the lot over to the real disk with a
> very low --bwlimit, but I'm thoroughly fed up with this kind of crap,
> and in general, XFS works very well.
> 
> If no one cares about my findings, I will henceforth be quiet on this topic.

Again, it's not that nobody cares.  It's that your findings have no
weight, no merit, in absence of complete storage system and software
stack configuration specs.

-- 
Stan


* Re: A little RAID experiment
  2012-07-19  3:08                         ` Stan Hoeppner
@ 2012-07-25  9:29                           ` Stefan Ring
  2012-07-25 10:00                             ` Stan Hoeppner
  2012-07-26  8:32                             ` Dave Chinner
  0 siblings, 2 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-25  9:29 UTC (permalink / raw)
  To: Linux fs XFS

There appears to be a bit of tension in this thread, and I have the
suspicion that it's a case of mismatched presumed expectations. The
sole purpose of my activity here over the last months was to present
some findings which I thought would be interesting to XFS developers.
If I were working on XFS, I would be interested. From most of the
answers, though, I get the impression that I am perceived as looking
for help tuning my XFS setup, which is not the case at all. In fact,
I'm quite happy with it. Let me recap just to give this thread the
intended tone:

This episode of my journey with XFS started when I read that there had
been recent significant performance improvements to XFS' metadata
operations. Having tried XFS every couple of years or so before, and
always with the same verdict -- horribly slow -- I was curious if it
had finally become usable.

A new server machine arriving just at the right time would serve as
the perfect testbed. I threw some workloads at it, which I hoped would
resemble my typical workload, and I focussed especially on areas which
bothered me the most on our current development server running ext3.
Everything worked more or less satisfactorily, except for the case of
un-tarring a metadata-heavy tarball in the presence of considerable
free-space fragmentation.

In this particular case, performance was conspicuously poor, and after
some digging with blktrace and seekwatcher, I identified the cause of
this slowness to be a write pattern that looked like this (in block
numbers), where the step width (arbitrarily displayed as 10000 here
for illustration purposes) was 1/4 of the size of the volume, clearly
because the volume had 4 allocation groups (the default). Of course it
was not entirely regular, but overall it was very similar to this:

10001
20001
30001
40001
10002
20002
30002
40002
10003
20003
...
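
To make that concrete, the pattern is roughly what the following would
produce if aimed directly at a block device (purely illustrative and
destructive; /dev/sdX, the 10000-block step and the 4K block size are
all placeholders):

  for i in $(seq 0 999); do
      for ag in 0 1 2 3; do
          # one 4K block into each quarter of the device, then advance one block
          dd if=/dev/zero of=/dev/sdX bs=4k count=1 oflag=direct \
             seek=$(( ag * 10000 + i )) 2>/dev/null
      done
  done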

I tuned and tweaked everything I could think of -- elevator settings,
readahead, su/sw, barrier, RAID hardware cache --, but the behavior
would always be the same. It just so happens that the RAID controller
in this machine (HP SmartArray P400) doesn't cope very well with a
write pattern like this. To it, the sequence appears to be random, and
it performs even worse than it would if it were actually random.

Going by what I think I know about the topic, it struck me as odd
that blocks would be sent to disk in this very unfavorable order. To
my mind, three entities had failed at sanitizing the write sequence:
the filesystem, the block layer and the RAID controller. My opinion is
still unchanged regarding the latter two.

The strikingly bad performance on the RAID controller piqued my
interest, and I went on a different journey investigating this oddity
and created a minor sysbench modification that would just measure
performance for this particular pattern. Not many people helped with
my experiment, and I was accused of wanting ponies. If I'm the only
one who is curious about this, then so be it. I deemed it worthwile
sharing my experience and pointing out that a sequence like the one
above is a death blow to all HP gear I've got my hands on so far.

It has been pointed out that XFS schedules the writes like this on
purpose so that they can be done in parallel, and that I should create
a concatenated volume with physical devices matching the allocation
groups. I actually went through this exercise, and yes, it was very
beneficial, but that's not the point. I don't want to (have to) do
that. And it's not always feasible, anyway. What about home usage with
a single SATA disk? Is it not worthwhile to perform well on low-end
devices?

You might ask then, why even bother using XFS instead of ext4?

I care about the multi-user case. The problem I have with ext is that
it is unbearably unresponsive when someone writes a semi-large amount
of data (a few gigs) at once -- like extracting a large-ish tarball.
Just using vim, even with :set nofsync, is almost impossible during
that time. I have adopted various disgusting hacks like extracting to
a ramdisk instead and rsyncing the lot over to the real disk with a
very low --bwlimit, but I'm thoroughly fed up with this kind of crap,
and in general, XFS works very well.

If no one cares about my findings, I will henceforth be quiet on this topic.


* Re: A little RAID experiment
  2012-07-18 12:37                       ` Stefan Ring
@ 2012-07-19  3:08                         ` Stan Hoeppner
  2012-07-25  9:29                           ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-19  3:08 UTC (permalink / raw)
  To: xfs

Sorry for any potential dups.  Mail log shows this msg was accepted 3.5
hours ago but it hasn't spit back to me yet and no bounce.  Resending.

On 7/18/2012 7:37 AM, Stefan Ring wrote:
>> At least I have some multi-threaded results from the other two machines:
>>
>> LSI:
>>
>> 4 threads
>>
>> [   2s] reads: 0.00 MB/s writes: 63.08 MB/s fsyncs: 0.00/s response
>> time: 0.452ms (95%)
>> [   4s] reads: 0.00 MB/s writes: 34.26 MB/s fsyncs: 0.00/s response
>> time: 1.660ms (95%)
> 
> And because of the bad formatting:
> https://github.com/Ringdingcoder/sysbench/blob/master/mail2.txt

And this is why people publishing real, useable benchmark results
publish all specs of the hardware/software environment being tested.  I
think I've mentioned once or twice how critical accurate/complete
information is.

Looking at the table linked above, two things become clear:

1.  The array spindle config of the 3 systems is wildly different.

   a.  P400  = 6x  10K  SAS  RAID6
   b.  P2000 = 12x 7.2k SATA RAID6
   c.  LSI   = unknown

2.  The LSI outperforms the other two by a wide margin, yet we know
nothing of the disks attached.  At first blush, and assuming disk config
is similar to the other two systems, the controller firmware *appears*
to perform magic.  But without knowing the spindle config of the LSI we
simply can't draw any conclusions yet.

This benchmark test seems to involve no or little metadata IO, so few
RMW cycles, and RAID6 doesn't kill us.  So if the LSI has the common 24
bay 2.5" JBOD shelf attached, with 2 spares and 22x 15K SAS drives (20
stripe spindles) in RAID6, this alone may fully explain the performance
gap, due to 6.7x the seek performance against the 6x 10k drives (4
spindles) in RAID6 on the P400.  This would also equal 4x the seek
performance of the 12 disks (10 spindles) of the P2000.

Given the results for the P2000, it seems clear that the LUN you're
hitting is not striped across 10 spindles.  It would seem that the 12
drives have been split up into two or more RAID arrays, probably 2x 6
drive RAID6s, and your test LUN sits on one of them, yielding 4x 7.2k
stripe spindles.  If it spanned 10 of 12 drives in a RAID6, it shouldn't
stall as shown in your data.  The "tell" here is that the P2000 with 10
7.2k drives has 1.7x the seek performance of the 4 spindles in your
P400, which outruns the P2000 once cache is full.  The P2000 controller
has over 4x the write cache of the P400, which is clearly demonstrated
in your data:

From 2s to 8s, the P2000 averages ~25MB/s throughput with sub 10ms
latency.  At 10s and up, latency jumps to multiple *seconds* and
throughput drops to "zero".  This clearly shows that when cache is full
and must flush, the drives are simply overwhelmed.  10x 7.2k striped
SATA spindles would not perform this badly.  Thus it seems clear your
LUN sits on only 4 of the 12 spindles.

The cached performance of the P2000 is about 50% of the LSI's, even
though the LSI has only a quarter of the cache memory.  This could be
due to cache mirroring between
the two controllers eating 50% of the cache RAM bandwidth.

So in summary, it would be nice to know the disk config of the LSI.
Once we have complete hardware information, it may likely turn out that
the bulk of the performance differences simply come down to what disks
are attached to each controller.  BTW, you provided lspci output of the
chip on the RAID card.  Please provide the actual model# of the LSI
card.  Dozens of LSI and OEM cards on the market have used the SAS1078
ASIC.  The card you have may not even be an LSI card, or may even be
embedded.  We can't tell from the info given.

The devil is always in the details Stefan. ;)

-- 
Stan


* Re: A little RAID experiment
  2012-07-18 12:32                     ` Stefan Ring
@ 2012-07-18 12:37                       ` Stefan Ring
  2012-07-19  3:08                         ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18 12:37 UTC (permalink / raw)
  To: xfs

> At least I have some multi-threaded results from the other two machines:
>
> LSI:
>
> 4 threads
>
> [   2s] reads: 0.00 MB/s writes: 63.08 MB/s fsyncs: 0.00/s response
> time: 0.452ms (95%)
> [   4s] reads: 0.00 MB/s writes: 34.26 MB/s fsyncs: 0.00/s response
> time: 1.660ms (95%)

And because of the bad formatting:
https://github.com/Ringdingcoder/sysbench/blob/master/mail2.txt


* Re: A little RAID experiment
  2012-07-18 10:24                   ` Stan Hoeppner
@ 2012-07-18 12:32                     ` Stefan Ring
  2012-07-18 12:37                       ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18 12:32 UTC (permalink / raw)
  To: stan; +Cc: xfs

> Given the LSI 1078 based RAID card with 1 thread runs circles around the
> P2000 with 4, 8, or 16 threads, and never stalls, with responses less
> than 1ms, meaning all writes hit cache, it would seem other workloads
> are hitting the P2000 simultaneously with your test, limiting your
> performance.  Either that or some kind of quotas have been set on the
> LUNs to prevent one host from saturating the controllers.  Or both.

Maybe a load quota of some kind exists, as you suggest, but from the
screenshots I've seen in the installation manuals, I don't remember
anything like that.

> This is why I asked about exclusive access.  Without it your results for
> the P2000 are literally worthless.  Lacking complete configuration info
> puts you in the same boat.  You simply can't draw any realistic
> conclusions about the P2000 performance without having complete control
> of the device for dedicated testing purposes.

That's a reasonable suggestion. Alas, I'm not expecting to get that
level of access to the device. I know for a fact, though, that it is
only connected to a single machine, which is otherwise completely idle
and controlled by "us" (the company I work for). But even so, I cannot
set up XFS there on a whim because it's in preparation for production
use.

> You have such control of the P400 and LSI do you not?  Concentrate your
> testing and comparisons on those.

The period of full control over the P400 is over, but at least I know
how it is configured. The LSI is in production (meaning: untouchable),
but seems reasonably configured.

At least I have some multi-threaded results from the other two machines:

LSI:

4 threads

[   2s] reads: 0.00 MB/s writes: 63.08 MB/s fsyncs: 0.00/s response
time: 0.452ms (95%)
[   4s] reads: 0.00 MB/s writes: 34.26 MB/s fsyncs: 0.00/s response
time: 1.660ms (95%)
[   6s] reads: 0.00 MB/s writes: 33.92 MB/s fsyncs: 0.00/s response
time: 1.478ms (95%)
[   8s] reads: 0.00 MB/s writes: 36.34 MB/s fsyncs: 0.00/s response
time: 1.589ms (95%)
[  10s] reads: 0.00 MB/s writes: 34.99 MB/s fsyncs: 0.00/s response
time: 1.621ms (95%)
[  12s] reads: 0.00 MB/s writes: 36.41 MB/s fsyncs: 0.00/s response
time: 1.639ms (95%)

8 threads

[   2s] reads: 0.00 MB/s writes: 45.34 MB/s fsyncs: 0.00/s response
time: 2.749ms (95%)
[   4s] reads: 0.00 MB/s writes: 32.15 MB/s fsyncs: 0.00/s response
time: 4.579ms (95%)
[   6s] reads: 0.00 MB/s writes: 33.64 MB/s fsyncs: 0.00/s response
time: 4.644ms (95%)
[   8s] reads: 0.00 MB/s writes: 35.20 MB/s fsyncs: 0.00/s response
time: 4.131ms (95%)
[  10s] reads: 0.00 MB/s writes: 33.88 MB/s fsyncs: 0.00/s response
time: 3.876ms (95%)
[  12s] reads: 0.00 MB/s writes: 33.65 MB/s fsyncs: 0.00/s response
time: 4.929ms (95%)

16 threads

[   2s] reads: 0.00 MB/s writes: 36.90 MB/s fsyncs: 0.00/s response
time: 3.510ms (95%)
[   4s] reads: 0.00 MB/s writes: 35.36 MB/s fsyncs: 0.00/s response
time: 8.629ms (95%)
[   6s] reads: 0.00 MB/s writes: 32.27 MB/s fsyncs: 0.00/s response
time: 10.091ms (95%)
[   8s] reads: 0.00 MB/s writes: 34.79 MB/s fsyncs: 0.00/s response
time: 9.499ms (95%)
[  10s] reads: 0.00 MB/s writes: 35.62 MB/s fsyncs: 0.00/s response
time: 8.801ms (95%)
[  12s] reads: 0.00 MB/s writes: 34.64 MB/s fsyncs: 0.00/s response
time: 9.488ms (95%)

... and so on. Nothing noteworthy after that.

Response time is higher, throughput stays the same.

P400:

4 threads

[   2s] reads: 0.00 MB/s writes: 33.59 MB/s fsyncs: 0.00/s response
time: 0.255ms (95%)
[   4s] reads: 0.00 MB/s writes: 5.11 MB/s fsyncs: 0.00/s response
time: 12.853ms (95%)
[   6s] reads: 0.00 MB/s writes: 5.45 MB/s fsyncs: 0.00/s response
time: 0.677ms (95%)
[   8s] reads: 0.00 MB/s writes: 5.16 MB/s fsyncs: 0.00/s response
time: 0.902ms (95%)
[  10s] reads: 0.00 MB/s writes: 4.56 MB/s fsyncs: 0.00/s response
time: 58.242ms (95%)
[  12s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response
time: 0.669ms (95%)
[  14s] reads: 0.00 MB/s writes: 5.22 MB/s fsyncs: 0.00/s response
time: 0.743ms (95%)
[  16s] reads: 0.00 MB/s writes: 4.73 MB/s fsyncs: 0.00/s response
time: 57.877ms (95%)
[  18s] reads: 0.00 MB/s writes: 4.39 MB/s fsyncs: 0.00/s response
time: 58.417ms (95%)
[  20s] reads: 0.00 MB/s writes: 4.56 MB/s fsyncs: 0.00/s response
time: 57.704ms (95%)
[  22s] reads: 0.00 MB/s writes: 4.81 MB/s fsyncs: 0.00/s response
time: 57.429ms (95%)
[  24s] reads: 0.00 MB/s writes: 4.53 MB/s fsyncs: 0.00/s response
time: 57.895ms (95%)

Some response time fluctuation at first, but it settles quickly.

8 threads

[   2s] reads: 0.00 MB/s writes: 38.61 MB/s fsyncs: 0.00/s response
time: 0.969ms (95%)
[   4s] reads: 0.00 MB/s writes: 4.98 MB/s fsyncs: 0.00/s response
time: 59.886ms (95%)
[   6s] reads: 0.00 MB/s writes: 4.69 MB/s fsyncs: 0.00/s response
time: 60.300ms (95%)
[   8s] reads: 0.00 MB/s writes: 4.57 MB/s fsyncs: 0.00/s response
time: 60.246ms (95%)
[  10s] reads: 0.00 MB/s writes: 4.46 MB/s fsyncs: 0.00/s response
time: 60.626ms (95%)
[  12s] reads: 0.00 MB/s writes: 4.46 MB/s fsyncs: 0.00/s response
time: 60.445ms (95%)
[  14s] reads: 0.00 MB/s writes: 4.61 MB/s fsyncs: 0.00/s response
time: 60.662ms (95%)
[  16s] reads: 0.00 MB/s writes: 4.35 MB/s fsyncs: 0.00/s response
time: 60.571ms (95%)
[  18s] reads: 0.00 MB/s writes: 4.87 MB/s fsyncs: 0.00/s response
time: 60.156ms (95%)
[  20s] reads: 0.00 MB/s writes: 4.77 MB/s fsyncs: 0.00/s response
time: 60.210ms (95%)
[  22s] reads: 0.00 MB/s writes: 4.58 MB/s fsyncs: 0.00/s response
time: 60.463ms (95%)
[  24s] reads: 0.00 MB/s writes: 4.65 MB/s fsyncs: 0.00/s response
time: 60.264ms (95%)

16 threads

[   2s] reads: 0.00 MB/s writes: 17.35 MB/s fsyncs: 0.00/s response
time: 7.764ms (95%)
[   4s] reads: 0.00 MB/s writes: 5.17 MB/s fsyncs: 0.00/s response
time: 62.655ms (95%)
[   6s] reads: 0.00 MB/s writes: 5.15 MB/s fsyncs: 0.00/s response
time: 62.749ms (95%)
[   8s] reads: 0.00 MB/s writes: 4.89 MB/s fsyncs: 0.00/s response
time: 63.258ms (95%)
[  10s] reads: 0.00 MB/s writes: 4.98 MB/s fsyncs: 0.00/s response
time: 62.862ms (95%)
[  12s] reads: 0.00 MB/s writes: 5.26 MB/s fsyncs: 0.00/s response
time: 63.032ms (95%)
[  14s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response
time: 62.599ms (95%)
[  16s] reads: 0.00 MB/s writes: 4.80 MB/s fsyncs: 0.00/s response
time: 63.088ms (95%)
[  18s] reads: 0.00 MB/s writes: 4.84 MB/s fsyncs: 0.00/s response
time: 63.239ms (95%)
[  20s] reads: 0.00 MB/s writes: 5.24 MB/s fsyncs: 0.00/s response
time: 62.712ms (95%)
[  22s] reads: 0.00 MB/s writes: 4.25 MB/s fsyncs: 0.00/s response
time: 63.619ms (95%)
[  24s] reads: 0.00 MB/s writes: 4.90 MB/s fsyncs: 0.00/s response
time: 63.202ms (95%)

Pretty boring.


* Re: A little RAID experiment
  2012-07-18  7:22                 ` Stefan Ring
@ 2012-07-18 10:24                   ` Stan Hoeppner
  2012-07-18 12:32                     ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-18 10:24 UTC (permalink / raw)
  To: xfs

On 7/18/2012 2:22 AM, Stefan Ring wrote:

> Because it was XFS originally which hammered the controller with this
> disadvantageous pattern. 

Do you feel you have researched and tested this theory thoroughly enough
to draw such a conclusion?  Note the LSI numbers with a single thread
compared to the P400.  It seems at this point the LSI has no problem
with the pattern.  How about threaded results?

> Except for the concurrency, it doesn't matter
> much on which filesystem sysbench operates. I've previously verified
> this on another system.

It's hard to believe a 4-generation-old (6-7 years) LSI ASIC with
256/512MB cache is able to sink this workload without ever stalling when
flushing to rust, whereas the HP P2000 FC SAN array shows pretty sad
performance.  I'd really like to see the threaded results for the LSI at
this point.  I think that would be informative.

> It was the Fibre Channel controller, the one with the collapsing
> throughput. (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
> Channel to PCI Express HBA)

Given the LSI 1078 based RAID card with 1 thread runs circles around the
P2000 with 4, 8, or 16 threads, and never stalls, with responses less
than 1ms, meaning all writes hit cache, it would seem other workloads
are hitting the P2000 simultaneously with your test, limiting your
performance.  Either that or some kind of quotas have been set on the
LUNs to prevent one host from saturating the controllers.  Or both.

This is why I asked about exclusive access.  Without it your results for
the P2000 are literally worthless.  Lacking complete configuration info
puts you in the same boat.  You simply can't draw any realistic
conclusions about the P2000 performance without having complete control
of the device for dedicated testing purposes.

You have such control of the P400 and LSI do you not?  Concentrate your
testing and comparisons on those.

-- 
Stan


* Re: A little RAID experiment
  2012-07-18  7:09               ` Stan Hoeppner
@ 2012-07-18  7:22                 ` Stefan Ring
  2012-07-18 10:24                   ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18  7:22 UTC (permalink / raw)
  To: stan; +Cc: xfs

> *Gasp*  EXT3?  Not XFS?  Why are you posting this thread on XFS?  The two
> will likely have (significantly) different behavior.

Because it was XFS originally which hammered the controller with this
disadvantageous pattern. Except for the concurrency, it doesn't matter
much on which filesystem sysbench operates. I've previously verified
this on another system.

> Also, to make any meaningful comparison, we kinda need to know which
> controller was targeted by these 3 runs below. ;)

It was the Fibre Channel controller, the one with the collapsing
throughput. (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
Channel to PCI Express HBA)


* Re: A little RAID experiment
  2012-07-18  6:44             ` Stefan Ring
@ 2012-07-18  7:09               ` Stan Hoeppner
  2012-07-18  7:22                 ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-18  7:09 UTC (permalink / raw)
  To: xfs

On 7/18/2012 1:44 AM, Stefan Ring wrote:
> On Wed, Jul 18, 2012 at 4:18 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 7/17/2012 12:26 AM, Dave Chinner wrote:
>> ...
>>> I bet it's single threaded, which means it is:
>>
>> The data given seems to strongly suggest a single thread.
>>
>>> Which means throughput is limited by IO latency, not bandwidth.
>>> If it takes 10us to do the write(2), issue and process the IO
>>> completion, and it takes 10us for the hardware to do the IO, you're
>>> limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
>>> is around 35MB/s, you're looking at around 10,000 IOPS of 100us
>>> round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
>>> trip.
>>>
>>> That's why you get different performance from the different raid
>>> controllers - some process cache hits a lot faster than others.
>> ...
>>> IOWs, welcome to Understanding RAID Controller Caching Behaviours
>>> 101 :)
>>
>> It would be somewhat interesting to see Stefan's latency and throughput
>> numbers for 4/8/16 threads.  Maybe the sysbench "--num-threads=" option
>> is the ticket.  The docs state this is for testing scheduler
>> performance, and it's not clear whether this actually does threaded IO.
>>  If not, time for a new IO benchmark.
> 
> Yes, it is intentionally single-threaded and round-trip-bound, as that
> is exactly the kind of behavior that XFS chose to display.

You're referring to your original huge-metadata problem?  IIRC your
workload there was a single thread, wasn't it?

> I tested with more threads now. It is initially faster, which only
> serves to hasten the tanking, and the response time goes through the
> roof. I also needed to increase the --file-num. Apparently the
> filesystem (ext3) in this case cannot handle concurrent accesses to
> the same file.

*Gasp*  EXT3?  Not XFS?  Why are you posting this thread on XFS?  The two
will likely have (significantly) different behavior.

Also, to make any meaningful comparison, we kinda need to know which
controller was targeted by these 3 runs below. ;)

> 4 threads:
> 
> [   2s] reads: 0.00 MB/s writes: 23.55 MB/s fsyncs: 0.00/s response
> time: 1.171ms (95%)
> [   4s] reads: 0.00 MB/s writes: 24.35 MB/s fsyncs: 0.00/s response
> time: 1.129ms (95%)
> [   6s] reads: 0.00 MB/s writes: 24.55 MB/s fsyncs: 0.00/s response
> time: 1.141ms (95%)
> [   8s] reads: 0.00 MB/s writes: 25.73 MB/s fsyncs: 0.00/s response
> time: 1.088ms (95%)
> [  10s] reads: 0.00 MB/s writes: 6.14 MB/s fsyncs: 0.00/s response
> time: 0.994ms (95%)
> [  12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 2735.611ms (95%)
> [  14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 3800.107ms (95%)
> [  16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 4404.397ms (95%)
> [  18s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 3153.588ms (95%)
> [  20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 4769.433ms (95%)
> 
> 
> 8 threads:
> 
> [   2s] reads: 0.00 MB/s writes: 26.99 MB/s fsyncs: 0.00/s response
> time: 2.451ms (95%)
> [   4s] reads: 0.00 MB/s writes: 28.12 MB/s fsyncs: 0.00/s response
> time: 3.153ms (95%)
> [   6s] reads: 0.00 MB/s writes: 25.97 MB/s fsyncs: 0.00/s response
> time: 2.965ms (95%)
> [   8s] reads: 0.00 MB/s writes: 23.23 MB/s fsyncs: 0.00/s response
> time: 2.560ms (95%)
> [  10s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 791.041ms (95%)
> [  12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 3458.162ms (95%)
> [  14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 5519.598ms (95%)
> [  16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 3219.401ms (95%)
> [  18s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 10235.289ms (95%)
> [  20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 3765.007ms (95%)
> 
> 16 threads:
> 
> [   2s] reads: 0.00 MB/s writes: 34.27 MB/s fsyncs: 0.00/s response
> time: 3.899ms (95%)
> [   4s] reads: 0.00 MB/s writes: 28.62 MB/s fsyncs: 0.00/s response
> time: 6.910ms (95%)
> [   6s] reads: 0.00 MB/s writes: 27.94 MB/s fsyncs: 0.00/s response
> time: 6.869ms (95%)
> [   8s] reads: 0.00 MB/s writes: 13.50 MB/s fsyncs: 0.00/s response
> time: 7.594ms (95%)
> [  10s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 2308.573ms (95%)
> [  12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 4811.016ms (95%)
> [  14s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 4635.714ms (95%)
> [  16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 3200.185ms (95%)
> [  18s] reads: 0.00 MB/s writes: 0.03 MB/s fsyncs: 0.00/s response
> time: 9623.207ms (95%)
> [  20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 8053.211ms (95%)

-- 
Stan


* Re: A little RAID experiment
  2012-07-18  2:18           ` Stan Hoeppner
@ 2012-07-18  6:44             ` Stefan Ring
  2012-07-18  7:09               ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18  6:44 UTC (permalink / raw)
  To: stan; +Cc: xfs

On Wed, Jul 18, 2012 at 4:18 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 7/17/2012 12:26 AM, Dave Chinner wrote:
> ...
>> I bet it's single threaded, which means it is:
>
> The data given seems to strongly suggest a single thread.
>
>> Which means throughput is limited by IO latency, not bandwidth.
>> If it takes 10us to do the write(2), issue and process the IO
>> completion, and it takes 10us for the hardware to do the IO, you're
>> limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
>> is around 35MB/s, you're looking at around 10,000 IOPS of 100us
>> round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
>> trip.
>>
>> That's why you get different performance from the different raid
>> controllers - some process cache hits a lot faster than others.
> ...
>> IOWs, welcome to Understanding RAID Controller Caching Behaviours
>> 101 :)
>
> It would be somewhat interesting to see Stefan's latency and throughput
> numbers for 4/8/16 threads.  Maybe the sysbench "--num-threads=" option
> is the ticket.  The docs state this is for testing scheduler
> performance, and it's not clear whether this actually does threaded IO.
>  If not, time for a new IO benchmark.

Yes, it is intentionally single-threaded and round-trip-bound, as that
is exactly the kind of behavior that XFS chose to display.

I tested with more threads now. It is initially faster, which only
serves to hasten the tanking, and the response time goes through the
roof. I also needed to increase the --file-num. Apparently the
filesystem (ext3) in this case cannot handle concurrent accesses to
the same file.

4 threads:

[   2s] reads: 0.00 MB/s writes: 23.55 MB/s fsyncs: 0.00/s response
time: 1.171ms (95%)
[   4s] reads: 0.00 MB/s writes: 24.35 MB/s fsyncs: 0.00/s response
time: 1.129ms (95%)
[   6s] reads: 0.00 MB/s writes: 24.55 MB/s fsyncs: 0.00/s response
time: 1.141ms (95%)
[   8s] reads: 0.00 MB/s writes: 25.73 MB/s fsyncs: 0.00/s response
time: 1.088ms (95%)
[  10s] reads: 0.00 MB/s writes: 6.14 MB/s fsyncs: 0.00/s response
time: 0.994ms (95%)
[  12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 2735.611ms (95%)
[  14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 3800.107ms (95%)
[  16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 4404.397ms (95%)
[  18s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
time: 3153.588ms (95%)
[  20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 4769.433ms (95%)


8 threads:

[   2s] reads: 0.00 MB/s writes: 26.99 MB/s fsyncs: 0.00/s response
time: 2.451ms (95%)
[   4s] reads: 0.00 MB/s writes: 28.12 MB/s fsyncs: 0.00/s response
time: 3.153ms (95%)
[   6s] reads: 0.00 MB/s writes: 25.97 MB/s fsyncs: 0.00/s response
time: 2.965ms (95%)
[   8s] reads: 0.00 MB/s writes: 23.23 MB/s fsyncs: 0.00/s response
time: 2.560ms (95%)
[  10s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
time: 791.041ms (95%)
[  12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 3458.162ms (95%)
[  14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 5519.598ms (95%)
[  16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 3219.401ms (95%)
[  18s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 10235.289ms (95%)
[  20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 3765.007ms (95%)

16 threads:

[   2s] reads: 0.00 MB/s writes: 34.27 MB/s fsyncs: 0.00/s response
time: 3.899ms (95%)
[   4s] reads: 0.00 MB/s writes: 28.62 MB/s fsyncs: 0.00/s response
time: 6.910ms (95%)
[   6s] reads: 0.00 MB/s writes: 27.94 MB/s fsyncs: 0.00/s response
time: 6.869ms (95%)
[   8s] reads: 0.00 MB/s writes: 13.50 MB/s fsyncs: 0.00/s response
time: 7.594ms (95%)
[  10s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 2308.573ms (95%)
[  12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 4811.016ms (95%)
[  14s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
time: 4635.714ms (95%)
[  16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 3200.185ms (95%)
[  18s] reads: 0.00 MB/s writes: 0.03 MB/s fsyncs: 0.00/s response
time: 9623.207ms (95%)
[  20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
time: 8053.211ms (95%)


* Re: A little RAID experiment
  2012-07-17  5:26         ` Dave Chinner
@ 2012-07-18  2:18           ` Stan Hoeppner
  2012-07-18  6:44             ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-18  2:18 UTC (permalink / raw)
  To: xfs

On 7/17/2012 12:26 AM, Dave Chinner wrote:
...
> I bet it's single threaded, which means it is:

The data given seems to strongly suggest a single thread.

> Which means throughput is limited by IO latency, not bandwidth.
> If it takes 10us to do the write(2), issue and process the IO
> completion, and it takes 10us for the hardware to do the IO, you're
> limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
> is around 35MB/s, you're looking at around 10,000 IOPS of 100us
> round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
> trip.
> 
> That's why you get different performance from the different raid
> controllers - some process cache hits a lot faster than others.
...
> IOWs, welcome to Understanding RAID Controller Caching Behaviours
> 101 :)

It would be interesting to see Stefan's latency and throughput
numbers for 4/8/16 threads.  Maybe the sysbench "--num-threads=" option
is the ticket.  The docs state this option is for testing scheduler
performance, and it's not clear whether it actually does threaded IO.
If not, it's time for a new IO benchmark.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-17  1:39       ` Stan Hoeppner
@ 2012-07-17  5:26         ` Dave Chinner
  2012-07-18  2:18           ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2012-07-17  5:26 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Mon, Jul 16, 2012 at 08:39:15PM -0500, Stan Hoeppner wrote:
> It depends on the one, and what the one expects.  Most people on this
> list would never expect parity RAID to perform well with the workloads
> you're throwing at it.  Your expectations are clearly different than
> most on this list.

Rule of thumb: don't use RAID5/6 for small random write workloads.

> The kicker here is that most of the data you presented shows almost all
> writes being acked by cache, in which case RAID level should be
> irrelevant, but at the same time showing abysmal throughput.  When all
> writes hit cache, throughput should be through the roof.

I bet it's single threaded, which means it is:

	sysbench		kernel
	write(2)
				issue io
				wait for completion
	write(2)
				issue io
				wait for completion
	write(2)
	.....

Which means throughput is limited by IO latency, not bandwidth.
If it takes 10us to do the write(2), issue and process the IO
completion, and it takes 10us for the hardware to do the IO, you're
limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
is around 35MB/s, you're looking at around 10,000 IOPS of 100us
round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
trip.
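
The same arithmetic as a quick sketch that can be replayed for other
block sizes and latencies (plain Python, nothing sysbench-specific;
the 4k block size and the round trip times are just the example
numbers above):

# One write(2) in flight at a time: each request pays the full
# software + hardware round trip before the next one is issued.
def single_thread(block_bytes, round_trip_us):
    iops = 1_000_000 / round_trip_us
    return iops, iops * block_bytes / 1e6      # IOPS, decimal MB/s

for label, rt_us in [("cache hit", 20), ("best observed", 100),
                     ("worst observed", 800)]:
    iops, mbs = single_thread(4096, rt_us)
    print(f"{label:>14}: {iops:8.0f} IOPS  {mbs:6.1f} MB/s")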

That's why you get different performance from the different raid
controllers - some process cache hits a lot faster than others.

As to the one that stalled - when the cache hits a certain level of
dirtiness (say 50%), it will start flushing cached writes and
depending on the algorithm may start behaving like a FIFO to new
requests. i.e. each new request coming in needs to wait for one to
drain. At that point, the write rate will tank to maybe 50 IOPS,
which will barely register on the benchmark throughput. (just look
at what happens to the IO latency that is measured...)
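
A toy model of that cliff -- purely illustrative, the cache size,
front-end rate and drain rate below are made-up numbers, not any
particular controller's algorithm -- produces the same shape as the
2-second reports in this thread:

# Write-back cache toy: absorb 4k writes at cache speed until half the
# cache is dirty, then admit new writes only as dirty blocks drain.
BLOCK        = 4096
CACHE_BLOCKS = 128 * 1024          # assume 512MB worth of 4k blocks
DIRTY_LIMIT  = CACHE_BLOCKS // 2   # assume flushing starts at 50% dirty
CACHE_IOPS   = 10_000              # front-end rate while absorbing writes
DRAIN_IOPS   = 50                  # what the spindles retire for this pattern

dirty = 0
for t in range(2, 32, 2):                      # 2-second reporting slices
    if dirty < DIRTY_LIMIT:
        accepted = min(CACHE_IOPS * 2, DIRTY_LIMIT - dirty + DRAIN_IOPS * 2)
    else:
        accepted = DRAIN_IOPS * 2              # FIFO: one in per one drained
    dirty = max(0, dirty + accepted - DRAIN_IOPS * 2)
    print(f"[{t:4d}s] writes: {accepted * BLOCK / 2 / 1e6:6.2f} MB/s")

It prints a few intervals at cache speed and then a flat crawl at the
drain rate -- the same shape as the traces being discussed.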

IOWs, welcome to Understanding RAID Controller Caching Behaviours
101 :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-16 21:58     ` Stefan Ring
@ 2012-07-17  1:39       ` Stan Hoeppner
  2012-07-17  5:26         ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-17  1:39 UTC (permalink / raw)
  To: xfs

On 7/16/2012 4:58 PM, Stefan Ring wrote:

> I'm pretty sure that the data is correct, and the test is not flawed.

That may be true.  Nonetheless, what you presented does not paint a
coherent picture.

> The only relevant omission is that I've run the test a few times in a
> row. 

You also omitted whether you had exclusive access to the P2000 array.
The P2000 has 2GB write cache.  The numbers you report are far below
what this unit is capable of.  Your data suggests

1.  You didn't have exclusive access during testing
2.  A configuration issue

>> Again, the response times suggest all these writes are being
>> acknowledged by BBWC.  Given this is a PCIe RAID HBA, the throughput
>> numbers to BBWC should be hundreds of megs per second.
> 
> It's semi-random, quite small writes -- actually not very random, but
> still not exactly linear --, so some performance degradation is
> expected.

The data set I commented on here shows all responses were from BBWC.
How can you "expect" degradation from cache?

>> Again, due to the response times, all the writes appear acknowledged by
>> BBWC.  While the LSI throughput is better, it is still far far lower
>> than what it should be, i.e. hundreds of megs per second to BBWC.
> 
> The cache gets filled up quickly in this case, so it can only accept
> as much data as it manages to write out to the disks.

This is not what the data I quoted shows Stefan.  The data shows all the
writes were acked by cache, according to response times.

> Maybe so, but it might also be worthwhile to point out flaws with
> current real hardware, when it does not behave the way one would
> expect.

The only "flaw" you've identified, long ago, is that low end HP hardware
based RAID5/6 is not suitable for metadata heavy workloads.  Everyone
here told you RAID5/6, whether hardware or software, was not a good
candidate for such workloads.  You played with RAID10 and a concat
setup, and received greatly enhanced performance.

It depends on the one, and what the one expects.  Most people on this
list would never expect parity RAID to perform well with the workloads
you're throwing at it.  Your expectations are clearly different than
most on this list.

The kicker here is that most of the data you presented shows almost all
writes being acked by cache, in which case RAID level should be
irrelevant, but at the same time showing abysmal throughput.  When all
writes hit cache, throughput should be through the roof.

So again, something is amiss here.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-16 21:27   ` Stan Hoeppner
  2012-07-16 21:58     ` Stefan Ring
@ 2012-07-16 22:16     ` Stefan Ring
  1 sibling, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 22:16 UTC (permalink / raw)
  To: stan; +Cc: xfs

>> If I thought that the internal RAID was bad, that's only because I
>> have not yet experienced an external enclosure from HP attached via
>> FibreChannel (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
>> Channel to PCI Express HBA). Unfortunately, I don't have detailed
>> information about the configuration of this enclosure, except that
>> it's a RAID6 volume, with 10 or 12 disks, I believe.
>
> Without that information the numbers below may tend to be a bit meaningless.

Yes, probably, but I likely won't get at the information, let alone
change or tweak anything. But even with the most naive setup, a good
storage stack "should" not exhibit this kind of behavior.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-16 21:27   ` Stan Hoeppner
@ 2012-07-16 21:58     ` Stefan Ring
  2012-07-17  1:39       ` Stan Hoeppner
  2012-07-16 22:16     ` Stefan Ring
  1 sibling, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 21:58 UTC (permalink / raw)
  To: stan; +Cc: xfs

> These writes appear to all be larger than the BBWC, according to the
> response times.  It's odd that the data written is 0.00MB/s, meaning
> nothing was actually written.  How does writing nothing takes over 1 second?

The writes are 4KB all the time, but at this point the FBWC has been
filled up. I guess it's not "nothing", but close to it, and the MB/s
figure is rounded. If it takes > 1 sec for a single write to get
through, not much gets written in a 2 second interval.
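
To put a number on it (my arithmetic, just plugging a stalled-interval
response time into the block size):

block_bytes = 4096
response_s  = 1.435     # a 95% response time of ~1.4s, as in the stalled intervals
print(f"{block_bytes / response_s / 1e6:.4f} MB/s")   # ~0.0029 MB/s -> reported as 0.00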

> Either there is something wrong with your test, critical data omitted
> from these reports, it isn't reporting coherent data, or I'm simply not
> "trained" to read this output.  The output doesn't make any sense.

I'm pretty sure that the data is correct, and the test is not flawed.
The only relevant omission is that I've run the test a few times in a
row. That should explain the first "0.07MB/s" line, because the cache
was already loaded. The output does make sense, it's just the
controller that's behaving erratically. It seems to accept data into
the cache up to a point, then it starts writing it out to disk and not
doing much else during that time.

>> [  30s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response
>> time: 0.254ms (95%)
>> Operations performed:  0 reads, 42890 writes, 0 Other = 42890 Total
>> Read 0b  Written 167.54Mb  Total transferred 167.54Mb  (5.5773Mb/sec)
>>  1427.80 Requests/sec executed
>
> Again, the response times suggest all these writes are being
> acknowledged by BBWC.  Given this is a PCIe RAID HBA, the throughput
> numbers to BBWC should be hundreds of megs per second.

It's semi-random, quite small writes -- actually not very random, but
still not exactly linear either -- so some performance degradation is
expected.

>> [  28s] reads: 0.00 MB/s writes: 36.15 MB/s fsyncs: 0.00/s response
>> time: 0.232ms (95%)
>> Operations performed:  0 reads, 284087 writes, 0 Other = 284087 Total
>> Read 0b  Written 1.0837Gb  Total transferred 1.0837Gb  (36.99Mb/sec)
>>  9469.55 Requests/sec executed
>
> Again, due to the response times, all the writes appear acknowledged by
> BBWC.  While the LSI throughput is better, it is still far far lower
> than what it should be, i.e. hundreds of megs per second to BBWC.

The cache gets filled up quickly in this case, so it can only accept
as much data as it manages to write out to the disks.

> I'm not familiar with sysbench.  That said, your command line seems to
> be specifying 8GB files.  Your original issue reported here long ago was
> low performance with huge metadata, i.e. deleting kernel trees etc.
> What storage characteristics is the command above supposed to test?

You're right. When I had the issue with a metadata-intensive workload
-- it was mostly free space fragmentation that caused trouble,
apparently -- I ran seekwatcher and noticed a pattern that I tried to
illustrate in <http://oss.sgi.com/pipermail/xfs/2012-April/018231.html>.
The SmartArray controller was not able to make sense of this pattern,
although in theory it would be very easy to optimize. I was familiar
with sysbench, which offers a handy random write test with a
selectable block size, and I modified it so it would write out the
blocks in the order suggested by the pattern.
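
Roughly, the pattern treats the file as four equal regions -- one per
allocation group -- and rotates the writes across them, with each
region advancing mostly sequentially. A stand-alone sketch of that
idea (plain Python with O_DIRECT; only an approximation, the patched
sysbench is the actual reference and its ordering is a bit messier):

import os, mmap

PATH, BLOCK, REGIONS = "test_file.0", 4096, 4    # the fallocate'd test file

fd = os.open(PATH, os.O_WRONLY | os.O_DIRECT)
region = os.fstat(fd).st_size // REGIONS   # assumes size is a multiple of 4*BLOCK
buf = mmap.mmap(-1, BLOCK)                 # page-aligned buffer, needed for O_DIRECT
buf.write(b"\0" * BLOCK)

offsets = [r * region for r in range(REGIONS)]
for i in range(100_000):
    r = i % REGIONS                        # rotate across the four "allocation groups"
    os.pwrite(fd, buf, offsets[r])
    offsets[r] += BLOCK                    # each region advances sequentially
    if offsets[r] >= (r + 1) * region:
        offsets[r] = r * region            # wrap around within the region
os.close(fd)

Run against the fallocate'd test_file.0, this generates roughly the
kind of four-way interleaved seek pattern the controller is being
asked to absorb.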

> I'd like a pony.  If anyone here were to give me a pony, that would
> satisfy one desire of one person.  Ergo, if others performing your test
> would have a positive impact on the XFS code and user base, and not
> simply serve to satisfy the curiosity of one user, I'm sure they would
> be glad to run such tests.  At this point though it seems such testing
> would only satisfy the latter, and not the former.

Maybe so, but it might also be worthwhile to point out flaws with
current real hardware, when it does not behave the way one would
expect.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-16 19:57 ` Stefan Ring
  2012-07-16 20:03   ` Stefan Ring
@ 2012-07-16 21:27   ` Stan Hoeppner
  2012-07-16 21:58     ` Stefan Ring
  2012-07-16 22:16     ` Stefan Ring
  1 sibling, 2 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-16 21:27 UTC (permalink / raw)
  To: xfs

On 7/16/2012 2:57 PM, Stefan Ring wrote:

> If I thought that the internal RAID was bad, that's only because I
> have not yet experienced an external enclosure from HP attached via
> FibreChannel (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
> Channel to PCI Express HBA). Unfortunately, I don't have detailed
> information about the configuration of this enclosure, except that
> it's a RAID6 volume, with 10 or 12 disks, I believe.

Without that information the numbers below may tend to be a bit meaningless.

> Witness this horrendous tanking of write throughput:
> 
> [   2s] reads: 0.00 MB/s writes: 0.07 MB/s fsyncs: 0.00/s response
> time: 0.616ms (95%)
> [   4s] reads: 0.00 MB/s writes: 14.10 MB/s fsyncs: 0.00/s response
> time: 0.481ms (95%)
> [   6s] reads: 0.00 MB/s writes: 15.28 MB/s fsyncs: 0.00/s response
> time: 0.458ms (95%)
> [   8s] reads: 0.00 MB/s writes: 14.65 MB/s fsyncs: 0.00/s response
> time: 0.464ms (95%)
> [  10s] reads: 0.00 MB/s writes: 15.32 MB/s fsyncs: 0.00/s response
> time: 0.447ms (95%)
> [  12s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response
> time: 0.460ms (95%)
> [  14s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response
> time: 0.471ms (95%)
> [  16s] reads: 0.00 MB/s writes: 14.06 MB/s fsyncs: 0.00/s response
> time: 0.468ms (95%)

Up to this point it appears the BBWC is acknowledging write completion,
as the response times are less than 1ms, 16-40 times lower than a disk
drive response.  If this is the case, the transfer rates should be close
to 800MB/s, the limit for 8Gb FC.

> [  18s] reads: 0.00 MB/s writes: 0.43 MB/s fsyncs: 0.00/s response
> time: 3.933ms (95%)
> [  20s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 985.122ms (95%)
> [  22s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 1435.164ms (95%)
> [  24s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 1194.568ms (95%)
> [  26s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 1112.091ms (95%)
> [  28s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response
> time: 1443.350ms (95%)
> [  30s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response
> time: 1078.972ms (95%)

These writes appear to all be larger than the BBWC, according to the
response times.  It's odd that the data written is 0.00MB/s, meaning
nothing was actually written.  How does writing nothing take over 1 second?

Either there is something wrong with your test, critical data omitted
from these reports, it isn't reporting coherent data, or I'm simply not
"trained" to read this output.  The output doesn't make any sense.

> Operations performed:  0 reads, 53413 writes, 0 Other = 53413 Total
> Read 0b  Written 208.64Mb  Total transferred 208.64Mb  (6.8007Mb/sec)
>  1740.98 Requests/sec executed
> 
> For comparison, this is the SmartArray P400 RAID6 that I initially
> complained about:
> 
> [   2s] reads: 0.00 MB/s writes: 6.34 MB/s fsyncs: 0.00/s response
> time: 0.219ms (95%)
> [   4s] reads: 0.00 MB/s writes: 5.35 MB/s fsyncs: 0.00/s response
> time: 0.217ms (95%)
> [   6s] reads: 0.00 MB/s writes: 5.48 MB/s fsyncs: 0.00/s response
> time: 0.208ms (95%)
> [   8s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response
> time: 0.228ms (95%)
> [  10s] reads: 0.00 MB/s writes: 5.81 MB/s fsyncs: 0.00/s response
> time: 0.226ms (95%)
> [  12s] reads: 0.00 MB/s writes: 6.01 MB/s fsyncs: 0.00/s response
> time: 0.223ms (95%)
> [  14s] reads: 0.00 MB/s writes: 5.39 MB/s fsyncs: 0.00/s response
> time: 0.212ms (95%)
> [  16s] reads: 0.00 MB/s writes: 5.21 MB/s fsyncs: 0.00/s response
> time: 0.225ms (95%)
> [  18s] reads: 0.00 MB/s writes: 5.16 MB/s fsyncs: 0.00/s response
> time: 0.224ms (95%)
> [  20s] reads: 0.00 MB/s writes: 5.97 MB/s fsyncs: 0.00/s response
> time: 0.217ms (95%)
> [  22s] reads: 0.00 MB/s writes: 4.28 MB/s fsyncs: 0.00/s response
> time: 0.228ms (95%)
> [  24s] reads: 0.00 MB/s writes: 7.44 MB/s fsyncs: 0.00/s response
> time: 0.191ms (95%)
> [  26s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response
> time: 0.250ms (95%)
> [  28s] reads: 0.00 MB/s writes: 5.45 MB/s fsyncs: 0.00/s response
> time: 0.258ms (95%)
> [  30s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response
> time: 0.254ms (95%)
> Operations performed:  0 reads, 42890 writes, 0 Other = 42890 Total
> Read 0b  Written 167.54Mb  Total transferred 167.54Mb  (5.5773Mb/sec)
>  1427.80 Requests/sec executed

Again, the response times suggest all these writes are being
acknowledged by BBWC.  Given this is a PCIe RAID HBA, the throughput
numbers to BBWC should be hundreds of megs per second.

> Slow, but at least it's consistent.
> 
> And that's what I would expect, and which a decent RAID controller
> manages to provide (LSI Logic / Symbios Logic MegaRAID SAS 1078):
> 
> [   2s] reads: 0.00 MB/s writes: 56.65 MB/s fsyncs: 0.00/s response
> time: 0.117ms (95%)
> [   4s] reads: 0.00 MB/s writes: 37.15 MB/s fsyncs: 0.00/s response
> time: 0.221ms (95%)
> [   6s] reads: 0.00 MB/s writes: 35.92 MB/s fsyncs: 0.00/s response
> time: 0.225ms (95%)
> [   8s] reads: 0.00 MB/s writes: 34.15 MB/s fsyncs: 0.00/s response
> time: 0.239ms (95%)
> [  10s] reads: 0.00 MB/s writes: 33.19 MB/s fsyncs: 0.00/s response
> time: 0.221ms (95%)
> [  12s] reads: 0.00 MB/s writes: 34.02 MB/s fsyncs: 0.00/s response
> time: 0.229ms (95%)
> [  14s] reads: 0.00 MB/s writes: 36.61 MB/s fsyncs: 0.00/s response
> time: 0.233ms (95%)
> [  16s] reads: 0.00 MB/s writes: 37.62 MB/s fsyncs: 0.00/s response
> time: 0.232ms (95%)
> [  18s] reads: 0.00 MB/s writes: 35.75 MB/s fsyncs: 0.00/s response
> time: 0.228ms (95%)
> [  20s] reads: 0.00 MB/s writes: 35.42 MB/s fsyncs: 0.00/s response
> time: 0.233ms (95%)
> [  22s] reads: 0.00 MB/s writes: 34.63 MB/s fsyncs: 0.00/s response
> time: 0.233ms (95%)
> [  24s] reads: 0.00 MB/s writes: 34.83 MB/s fsyncs: 0.00/s response
> time: 0.230ms (95%)
> [  26s] reads: 0.00 MB/s writes: 36.84 MB/s fsyncs: 0.00/s response
> time: 0.229ms (95%)
> [  28s] reads: 0.00 MB/s writes: 36.15 MB/s fsyncs: 0.00/s response
> time: 0.232ms (95%)
> Operations performed:  0 reads, 284087 writes, 0 Other = 284087 Total
> Read 0b  Written 1.0837Gb  Total transferred 1.0837Gb  (36.99Mb/sec)
>  9469.55 Requests/sec executed

Again, due to the response times, all the writes appear acknowledged by
BBWC.  While the LSI throughput is better, it is still far far lower
than what it should be, i.e. hundreds of megs per second to BBWC.

> The command line used was: sysbench --test=fileio --max-time=30
> --max-requests=10000000 --file-num=1 --file-extra-flags=direct
> --file-total-size=8G --file-block-size=4k --file-fsync-all=off
> --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
> --file-test-mode=ag4 --report-interval=2 run

I'm not familiar with sysbench.  That said, your command line seems to
be specifying 8GB files.  Your original issue reported here long ago was
low performance with huge metadata, i.e. deleting kernel trees etc.
What storage characteristics is the command above supposed to test?

> I have not yet uploaded my patched version of the development
> sysbench, but I'm planning to do so, and I'd be really interested if
> someone could run it on a really high-end storage system.

I'd like a pony.  If anyone here were to give me a pony, that would
satisfy one desire of one person.  Ergo, if others performing your test
would have a positive impact on the XFS code and user base, and not
simply serve to satisfy the curiosity of one user, I'm sure they would
be glad to run such tests.  At this point though it seems such testing
would only satisfy the latter, and not the former.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-16 20:03   ` Stefan Ring
@ 2012-07-16 20:05     ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 20:05 UTC (permalink / raw)
  To: Linux fs XFS

On Mon, Jul 16, 2012 at 10:03 PM, Stefan Ring <stefanrin@gmail.com> wrote:
> Damn, the formatting has been broken. For easier readability, I've
> uploaded the text here:

https://github.com/Ringdingcoder/sysbench/blob/master/mail1.txt

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-07-16 19:57 ` Stefan Ring
@ 2012-07-16 20:03   ` Stefan Ring
  2012-07-16 20:05     ` Stefan Ring
  2012-07-16 21:27   ` Stan Hoeppner
  1 sibling, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 20:03 UTC (permalink / raw)
  To: Linux fs XFS

Damn, the formatting has been broken. For easier readability, I've
uploaded the text here:
https://github.com/Ringdingcoder/sysbench/blob/0dd3e1797ee5b847f0877144a6e0cd9de60ae7c3/mail1.txt

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-25  8:07 Stefan Ring
  2012-04-25 14:17 ` Roger Willcocks
  2012-04-27 13:50 ` Stan Hoeppner
@ 2012-07-16 19:57 ` Stefan Ring
  2012-07-16 20:03   ` Stefan Ring
  2012-07-16 21:27   ` Stan Hoeppner
  2012-10-10 14:57 ` Stefan Ring
  3 siblings, 2 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 19:57 UTC (permalink / raw)
  To: Linux fs XFS

On Wed, Apr 25, 2012 at 10:07 AM, Stefan Ring <stefanrin@gmail.com> wrote:
> This grew out of the discussion in my other thread ("Abysmal write
> performance because of excessive seeking (allocation groups to
> blame?)") -- that should in fact have been called "Free space
> fragmentation causes excessive seeks".
>
> Could someone with a good hardware RAID (5 or 6, but also mirrored
> setups would be interesting) please conduct a little experiment for
> me?
>
> I've put up a modified sysbench here:
> <https://github.com/Ringdingcoder/sysbench>. This tries to simulate
> the write pattern I've seen with XFS. It would be really interesting
> to know how different RAID controllers cope with this.
>
> - Checkout (or download tarball):
> https://github.com/Ringdingcoder/sysbench/tarball/master
> - ./configure --without-mysql && make
> - fallocate -l 8g test_file.0
> - ./sysbench/sysbench --test=fileio --max-time=15
> --max-requests=10000000 --file-num=1 --file-extra-flags=direct
> --file-total-size=8G --file-block-size=8192 --file-fsync-all=off
> --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
> --file-test-mode=ag4 run
>
> If you don't have fallocate, you can also use the last line with "run"
> replaced by "prepare" to create the file. Run the benchmark a few
> times to check if the numbers are somewhat stable. When doing a few
> runs in direct succession, the first one will likely be faster because
> the cache has not been loaded up yet. The interesting part of the
> output is this:
>
> Read 0b  Written 64.516Mb  Total transferred 64.516Mb  (4.301Mb/sec)
>   550.53 Requests/sec executed
>
> That's a measurement from my troubled RAID 6 volume (SmartArray P400,
> 6x 10k disks).
>
> From the other controller in this machine (RAID 1, SmartArray P410i,
> 2x 15k disks), I get:
>
> Read 0b  Written 276.85Mb  Total transferred 276.85Mb  (18.447Mb/sec)
>  2361.21 Requests/sec executed
>
> The better result might be caused by the better controller or the RAID
> 1, with the latter reason being more likely.

In the meantime, the very useful --report-interval switch has been
added to development versions of sysbench, and I've had access to one
additional system.

If I thought that the internal RAID was bad, that's only because I
had not yet experienced an external enclosure from HP attached via
FibreChannel (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
Channel to PCI Express HBA). Unfortunately, I don't have detailed
information about the configuration of this enclosure, except that
it's a RAID6 volume, with 10 or 12 disks, I believe.

Witness this horrendous tanking of write throughput:

[   2s] reads: 0.00 MB/s writes: 0.07 MB/s fsyncs: 0.00/s response time: 0.616ms (95%)
[   4s] reads: 0.00 MB/s writes: 14.10 MB/s fsyncs: 0.00/s response time: 0.481ms (95%)
[   6s] reads: 0.00 MB/s writes: 15.28 MB/s fsyncs: 0.00/s response time: 0.458ms (95%)
[   8s] reads: 0.00 MB/s writes: 14.65 MB/s fsyncs: 0.00/s response time: 0.464ms (95%)
[  10s] reads: 0.00 MB/s writes: 15.32 MB/s fsyncs: 0.00/s response time: 0.447ms (95%)
[  12s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response time: 0.460ms (95%)
[  14s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response time: 0.471ms (95%)
[  16s] reads: 0.00 MB/s writes: 14.06 MB/s fsyncs: 0.00/s response time: 0.468ms (95%)
[  18s] reads: 0.00 MB/s writes: 0.43 MB/s fsyncs: 0.00/s response time: 3.933ms (95%)
[  20s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 985.122ms (95%)
[  22s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 1435.164ms (95%)
[  24s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1194.568ms (95%)
[  26s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1112.091ms (95%)
[  28s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 1443.350ms (95%)
[  30s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1078.972ms (95%)
Operations performed:  0 reads, 53413 writes, 0 Other = 53413 Total
Read 0b  Written 208.64Mb  Total transferred 208.64Mb  (6.8007Mb/sec)
 1740.98 Requests/sec executed

For comparison, this is the SmartArray P400 RAID6 that I initially
complained about:

[   2s] reads: 0.00 MB/s writes: 6.34 MB/s fsyncs: 0.00/s response time: 0.219ms (95%)
[   4s] reads: 0.00 MB/s writes: 5.35 MB/s fsyncs: 0.00/s response time: 0.217ms (95%)
[   6s] reads: 0.00 MB/s writes: 5.48 MB/s fsyncs: 0.00/s response time: 0.208ms (95%)
[   8s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
[  10s] reads: 0.00 MB/s writes: 5.81 MB/s fsyncs: 0.00/s response time: 0.226ms (95%)
[  12s] reads: 0.00 MB/s writes: 6.01 MB/s fsyncs: 0.00/s response time: 0.223ms (95%)
[  14s] reads: 0.00 MB/s writes: 5.39 MB/s fsyncs: 0.00/s response time: 0.212ms (95%)
[  16s] reads: 0.00 MB/s writes: 5.21 MB/s fsyncs: 0.00/s response time: 0.225ms (95%)
[  18s] reads: 0.00 MB/s writes: 5.16 MB/s fsyncs: 0.00/s response time: 0.224ms (95%)
[  20s] reads: 0.00 MB/s writes: 5.97 MB/s fsyncs: 0.00/s response time: 0.217ms (95%)
[  22s] reads: 0.00 MB/s writes: 4.28 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
[  24s] reads: 0.00 MB/s writes: 7.44 MB/s fsyncs: 0.00/s response time: 0.191ms (95%)
[  26s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.250ms (95%)
[  28s] reads: 0.00 MB/s writes: 5.45 MB/s fsyncs: 0.00/s response time: 0.258ms (95%)
[  30s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response time: 0.254ms (95%)
Operations performed:  0 reads, 42890 writes, 0 Other = 42890 Total
Read 0b  Written 167.54Mb  Total transferred 167.54Mb  (5.5773Mb/sec)
 1427.80 Requests/sec executed

Slow, but at least it's consistent.

And that's what I would expect, and what a decent RAID controller
manages to provide (LSI Logic / Symbios Logic MegaRAID SAS 1078):

[   2s] reads: 0.00 MB/s writes: 56.65 MB/s fsyncs: 0.00/s response time: 0.117ms (95%)
[   4s] reads: 0.00 MB/s writes: 37.15 MB/s fsyncs: 0.00/s response time: 0.221ms (95%)
[   6s] reads: 0.00 MB/s writes: 35.92 MB/s fsyncs: 0.00/s response time: 0.225ms (95%)
[   8s] reads: 0.00 MB/s writes: 34.15 MB/s fsyncs: 0.00/s response time: 0.239ms (95%)
[  10s] reads: 0.00 MB/s writes: 33.19 MB/s fsyncs: 0.00/s response time: 0.221ms (95%)
[  12s] reads: 0.00 MB/s writes: 34.02 MB/s fsyncs: 0.00/s response time: 0.229ms (95%)
[  14s] reads: 0.00 MB/s writes: 36.61 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
[  16s] reads: 0.00 MB/s writes: 37.62 MB/s fsyncs: 0.00/s response time: 0.232ms (95%)
[  18s] reads: 0.00 MB/s writes: 35.75 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
[  20s] reads: 0.00 MB/s writes: 35.42 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
[  22s] reads: 0.00 MB/s writes: 34.63 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
[  24s] reads: 0.00 MB/s writes: 34.83 MB/s fsyncs: 0.00/s response time: 0.230ms (95%)
[  26s] reads: 0.00 MB/s writes: 36.84 MB/s fsyncs: 0.00/s response time: 0.229ms (95%)
[  28s] reads: 0.00 MB/s writes: 36.15 MB/s fsyncs: 0.00/s response time: 0.232ms (95%)
Operations performed:  0 reads, 284087 writes, 0 Other = 284087 Total
Read 0b  Written 1.0837Gb  Total transferred 1.0837Gb  (36.99Mb/sec)
 9469.55 Requests/sec executed

The command line used was: sysbench --test=fileio --max-time=30
--max-requests=10000000 --file-num=1 --file-extra-flags=direct
--file-total-size=8G --file-block-size=4k --file-fsync-all=off
--file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
--file-test-mode=ag4 --report-interval=2 run

I have not yet uploaded my patched version of the development
sysbench, but I'm planning to do so, and I'd be really interested if
someone could run it on a really high-end storage system.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-05-31  1:30       ` Stan Hoeppner
@ 2012-05-31  6:44         ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-05-31  6:44 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> You now have persistent write cache.  Did you test with XFS barriers
> disabled?  If not you should.  You'll likely see a decent, possibly
> outstanding, performance improvement with your huge metadata
> modification workload, as XFS will no longer flush the cache frequently
> when writing to the journal log.

I've already done that test with the previous controller. It had a
BBWC, so it was persistent as well. And it was easy to enable or
disable it. Yes, everything was always done with barrier=0. And yes,
the cache made a big difference (about 3x).

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-05-30 11:07     ` Stefan Ring
@ 2012-05-31  1:30       ` Stan Hoeppner
  2012-05-31  6:44         ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-05-31  1:30 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 5/30/2012 6:07 AM, Stefan Ring wrote:
> On Tue, May 1, 2012 at 12:46 PM, Stefan Ring <stefanrin@gmail.com> wrote:
>>> Stefan, you should be able to simply clear the P410i configuration in
>>> the BIOS, power down, then simply connect the 6 drive backplane cable to
>>> the 410i, load the config from the disks, and go.  This allows head to
>>> head RAID6 comparison between the P400 and P410i.  No doubt the 410i
>>> will be quicker.  This procedure will tell you how much quicker.
>>
>> Unfortunately, the server is located at a hosting facility at the
>> opposite end of town, and I'd spend an entire day just traveling to
>> and fro, so that's not currently an option. I might get lucky though,
>> because we should soon get another server with an external P410i.
> 
> The new storage blade has only been upgraded to the P410i controller,
> and even though there is a new setting called "elevatorsort", which is
> enabled, the performance is just as bad. The new one has a
> flash-writeback cache and may be faster by a few percent ticks, but
> that's it. It doesn't even make sense to compare the two in-depth, as
> they perform almost identically.

You now have persistent write cache.  Did you test with XFS barriers
disabled?  If not you should.  You'll likely see a decent, possibly
outstanding, performance improvement with your huge metadata
modification workload, as XFS will no longer flush the cache frequently
when writing to the journal log.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-05-01 10:46   ` Stefan Ring
@ 2012-05-30 11:07     ` Stefan Ring
  2012-05-31  1:30       ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-05-30 11:07 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

On Tue, May 1, 2012 at 12:46 PM, Stefan Ring <stefanrin@gmail.com> wrote:
>> Stefan, you should be able to simply clear the P410i configuration in
>> the BIOS, power down, then simply connect the 6 drive backplane cable to
>> the 410i, load the config from the disks, and go.  This allows head to
>> head RAID6 comparison between the P400 and P410i.  No doubt the 410i
>> will be quicker.  This procedure will tell you how much quicker.
>
> Unfortunately, the server is located at a hosting facility at the
> opposite end of town, and I'd spend an entire day just traveling to
> and fro, so that's not currently an option. I might get lucky though,
> because we should soon get another server with an external P410i.

The new storage blade has only been upgraded to the P410i controller,
and even though there is a new setting called "elevatorsort", which is
enabled, the performance is just as bad. The new one has a
flash-writeback cache and may be faster by a few percentage points, but
that's it. It doesn't even make sense to compare the two in-depth, as
they perform almost identically.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-27 13:50 ` Stan Hoeppner
@ 2012-05-01 10:46   ` Stefan Ring
  2012-05-30 11:07     ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-05-01 10:46 UTC (permalink / raw)
  To: stan; +Cc: Linux fs XFS

> Stefan, you should be able to simply clear the P410i configuration in
> the BIOS, power down, then simply connect the 6 drive backplane cable to
> the 410i, load the config from the disks, and go.  This allows head to
> head RAID6 comparison between the P400 and P410i.  No doubt the 410i
> will be quicker.  This procedure will tell you how much quicker.

Unfortunately, the server is located at a hosting facility at the
opposite end of town, and I'd spend an entire day just traveling to
and fro, so that's not currently an option. I might get lucky though,
because we should soon get another server with an external P410i.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-27 15:28     ` Joe Landman
@ 2012-04-28  4:42       ` Stan Hoeppner
  0 siblings, 0 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-04-28  4:42 UTC (permalink / raw)
  To: landman; +Cc: xfs

On 4/27/2012 10:28 AM, Joe Landman wrote:
> On 04/26/2012 04:53 AM, Stefan Ring wrote:
> 
> 
>> I just want to stress that our machine with the SmartArray controller
>> is not a cheap old dusty leftover, but a recently-bought (December
>> 2011) not exactly cheap Blade server, and that’s all you get from HP.
> 
> We have an anecdote about something akin to this which happened 2 years
> ago.  A potential customer was testing a <insert large multi-letter
> acronym brand name here> machine to run a specific set of software which
> tightly coupled to its disks.  Performance was terrible.  Our partner
> (the software vendor) contacted us and asked us to help.  We'd suggested
> that the partner loan them the machine they had bought from us 2 years
> earlier (at the time) and try that.
> 
> Our 2 year old machine (actually 2 generations back at the time of test,
> now 5 generations behind our current kit) wound up being more than an
> order of magnitude faster than the (then) latest and greatest kit from
> <insert large multi-letter acronym brand name here>.
> 
> The lesson is this.  Latest and greatest doesn't mean fastest.  Design,
> and implementation matter.  Brand names don't.
> 
> To this day, we still see machines being pushed out with PCIx technology
> for networking, or disk, or ...

I've seen this as well.  A vendor gets comfortable and confident with a
particular main board, RAID card, NIC, etc, that demonstrates uber
reliability in the field and is easy to work on/with.  They continue
selling it as long as they can still get their hands on it, even though
much better technology has long been available.  It's the "stick with
what we know works" mentality.  Sometimes this is a good strategy.  If a
customer constantly needs maximum performance, obviously not.

> ... and customers buy it up, for reasons that have little to do with
> performance, suitability to the task, etc.
> 
> If you need performance, its important to focus some effort upon
> locating systems/vendors capable of performing where you need them to
> perform.  Otherwise you may wind up with a warmed over web server with
> some random card and a few "fast" disks tossed in.
> 
> I don't mean to be blunt, but this is basically what you were sold. Note
> also, I see this in cluster file system bits all the time.  We get calls
> from people, who describe a design, and ask us for help making them go
> fast.  We discover that they've made some deep fundamental design
> decisions very poorly (usually upon the basis of what <insert large
> multi-letter acronym brand name here> told them were options), and there
> was no way to get between point A (their per unit performance) and point
> B (what they were hoping for as an aggregate system performance).
> 
> At the most basic level, your performance will be modulated by your
> slowest performing part.  You can put infinitely fast disks on a slow
> controller, and your performance will be terrible.  You can put slow
> disks on a very fast controller, and you will likely have better luck.

I generally agree with this last statement, but I think it's most
relevant to parity arrays.  In general RAID1/10 performance tends to be
less impacted by controller speed.  But yes, a really poor slow
controller is going to limit anything you try to do with any disks.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-26  8:53   ` Stefan Ring
  2012-04-27 15:10     ` Stan Hoeppner
@ 2012-04-27 15:28     ` Joe Landman
  2012-04-28  4:42       ` Stan Hoeppner
  1 sibling, 1 reply; 41+ messages in thread
From: Joe Landman @ 2012-04-27 15:28 UTC (permalink / raw)
  To: xfs

On 04/26/2012 04:53 AM, Stefan Ring wrote:


> I just want to stress that our machine with the SmartArray controller
> is not a cheap old dusty leftover, but a recently-bought (December
> 2011) not exactly cheap Blade server, and that’s all you get from HP.

We have an anecdote about something akin to this which happened 2 years 
ago.  A potential customer was testing a <insert large multi-letter 
acronym brand name here> machine to run a specific set of software which
coupled tightly to its disks.  Performance was terrible.  Our partner
(the software vendor) contacted us and asked us to help.  We'd suggested 
that the partner loan them the machine they had bought from us 2 years 
earlier (at the time) and try that.

Our 2 year old machine (actually 2 generations back at the time of test, 
now 5 generations behind our current kit) wound up being more than an 
order of magnitude faster than the (then) latest and greatest kit from 
<insert large multi-letter acronym brand name here>.

The lesson is this.  Latest and greatest doesn't mean fastest.  Design, 
and implementation matter.  Brand names don't.

To this day, we still see machines being pushed out with PCIx technology 
for networking, or disk, or ...

... and customers buy it up, for reasons that have little to do with 
performance, suitability to the task, etc.

If you need performance, its important to focus some effort upon 
locating systems/vendors capable of performing where you need them to 
perform.  Otherwise you may wind up with a warmed over web server with 
some random card and a few "fast" disks tossed in.

I don't mean to be blunt, but this is basically what you were sold. 
Note also, I see this in cluster file system bits all the time.  We get 
calls from people, who describe a design, and ask us for help making 
them go fast.  We discover that they've made some deep fundamental 
design decisions very poorly (usually upon the basis of what <insert 
large multi-letter acronym brand name here> told them were options), and 
there was no way to get between point A (their per unit performance) and 
point B (what they were hoping for as an aggregate system performance).

At the most basic level, your performance will be modulated by your 
slowest performing part.  You can put infinitely fast disks on a slow 
controller, and your performance will be terrible.  You can put slow 
disks on a very fast controller, and you will likely have better luck.

/Hoping this lesson is not lost ...

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-26  8:53   ` Stefan Ring
@ 2012-04-27 15:10     ` Stan Hoeppner
  2012-04-27 15:28     ` Joe Landman
  1 sibling, 0 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-04-27 15:10 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Roger Willcocks, Linux fs XFS

On 4/26/2012 3:53 AM, Stefan Ring wrote:
>> Read 0b  Written 995.77Mb  Total transferred 995.77Mb  (66.337Mb/sec)
>>  8491.11 Requests/sec executed]
> 
> I was a bit sceptical towards your measurement at first, especially
> since your xfs_bmap shows that the file is split into 4 regions which
> nicely aligns (almost) with the agcount=4 setup that the benchmark
> emulates, but this seems to be just a coincidence.
> 
> Meanwhile, I've found a customer's system, where we have a MegaRAID
> SAS 1078 with a 6-disk RAID 6 volume, and this one delivers 54MB/sec,
> which really puts the SmartArray controller to shame at its measly
> 4MB/sec.

That's interesting considering the MegaRAID 8708/8880, which I assume is
the 1078 based card above, and the P400 are of roughly the same IC
generation.  Both use PowerPC cores, the P400 at 440MHz and the 1078 at
500MHz, both with DDR2 DRAM, the P400 @533 and the 8880 @667.  On paper
they're very similar.  I'd guess the cause of the big performance
difference is that the 1078 has dedicated parity circuitry and the P400
likely calculates parity in software on the PPC core.  FYI, the parity
engines on the 2208 dual core ASIC are apparently lightning fast
compared to any previous generations.  This chip is found on the
MegaRAID 9265, 9266, 9285, each board having 1GB DDR3-1333 cache.

> I just want to stress that our machine with the SmartArray controller
> is not a cheap old dusty leftover, but a recently-bought (December
> 2011) not exactly cheap Blade server, and that’s all you get from HP.

The fact they still sell a product doesn't mean it's recent technology.
 On the contrary, HP, IBM, and to a lesser extent Dell, tend to keep
some models on the shelf for a very long time, often 4 years or more.
The P400 has likely been around even longer, given its DDR2 memory and
SATA 3Gb interfaces.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-25 16:23   ` Stefan Ring
@ 2012-04-27 14:03     ` Stan Hoeppner
  0 siblings, 0 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-04-27 14:03 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Roger Willcocks, Linux fs XFS

On 4/25/2012 11:23 AM, Stefan Ring wrote:
>> Result (seems reasonably consistent):
>>
>> Operations performed:  0 Read, 127458 Write, 0 Other = 127458 Total
>> Read 0b  Written 995.77Mb  Total transferred 995.77Mb  (66.337Mb/sec)
>>  8491.11 Requests/sec executed]
> 
> Holy moly, this is an entirely different game you're playing here! I
> suppose that you're using a battery backed write cache?

He's running a 20 data spindle RAID60, across two decent hardware RAID
cards each with 512MB write cache, so of course it's going to be much
faster than your 4 data spindle RAID6, even with slightly slower spindles.

Note that 8x 15K drives in RAID10 on your P410i should slightly surpass
Roger's RAID60 performance, ~70MB/s vs 66MB/s.  3x fewer drives for
roughly equal performance, but obviously less capacity.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-25  8:07 Stefan Ring
  2012-04-25 14:17 ` Roger Willcocks
@ 2012-04-27 13:50 ` Stan Hoeppner
  2012-05-01 10:46   ` Stefan Ring
  2012-07-16 19:57 ` Stefan Ring
  2012-10-10 14:57 ` Stefan Ring
  3 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-04-27 13:50 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

On 4/25/2012 3:07 AM, Stefan Ring wrote:
> This grew out of the discussion in my other thread ("Abysmal write
> performance because of excessive seeking (allocation groups to
> blame?)") -- that should in fact have been called "Free space
> fragmentation causes excessive seeks".
> 
> Could someone with a good hardware RAID (5 or 6, but also mirrored
> setups would be interesting) please conduct a little experiment for
> me?
> 
> I've put up a modified sysbench here:
> <https://github.com/Ringdingcoder/sysbench>. This tries to simulate
> the write pattern I've seen with XFS. It would be really interesting
> to know how different RAID controllers cope with this.
> 
> - Checkout (or download tarball):
> https://github.com/Ringdingcoder/sysbench/tarball/master
> - ./configure --without-mysql && make
> - fallocate -l 8g test_file.0
> - ./sysbench/sysbench --test=fileio --max-time=15
> --max-requests=10000000 --file-num=1 --file-extra-flags=direct
> --file-total-size=8G --file-block-size=8192 --file-fsync-all=off
> --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
> --file-test-mode=ag4 run
> 
> If you don't have fallocate, you can also use the last line with "run"
> replaced by "prepare" to create the file. Run the benchmark a few
> times to check if the numbers are somewhat stable. When doing a few
> runs in direct succession, the first one will likely be faster because
> the cache has not been loaded up yet. The interesting part of the
> output is this:
> 
> Read 0b  Written 64.516Mb  Total transferred 64.516Mb  (4.301Mb/sec)
>   550.53 Requests/sec executed
> 
> That's a measurement from my troubled RAID 6 volume (SmartArray P400,
> 6x 10k disks).
> 
> From the other controller in this machine (RAID 1, SmartArray P410i,
> 2x 15k disks), I get:
> 
> Read 0b  Written 276.85Mb  Total transferred 276.85Mb  (18.447Mb/sec)
>  2361.21 Requests/sec executed
> 
> The better result might be caused by the better controller or the RAID
> 1, with the latter reason being more likely.

Stefan, you should be able to simply clear the P410i configuration in
the BIOS, power down, then simply connect the 6 drive backplane cable to
the 410i, load the config from the disks, and go.  This allows head to
head RAID6 comparison between the P400 and P410i.  No doubt the 410i
will be quicker.  This procedure will tell you how much quicker.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-25 14:17 ` Roger Willcocks
  2012-04-25 16:23   ` Stefan Ring
@ 2012-04-26  8:53   ` Stefan Ring
  2012-04-27 15:10     ` Stan Hoeppner
  2012-04-27 15:28     ` Joe Landman
  1 sibling, 2 replies; 41+ messages in thread
From: Stefan Ring @ 2012-04-26  8:53 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: Linux fs XFS

> Read 0b  Written 995.77Mb  Total transferred 995.77Mb  (66.337Mb/sec)
>  8491.11 Requests/sec executed]

I was a bit sceptical of your measurement at first, especially
since your xfs_bmap shows that the file is split into 4 regions, which
aligns (almost) nicely with the agcount=4 setup that the benchmark
emulates, but this seems to be just a coincidence.

Meanwhile, I've found a customer's system, where we have a MegaRAID
SAS 1078 with a 6-disk RAID 6 volume, and this one delivers 54MB/sec,
which really puts the SmartArray controller to shame at its measly
4MB/sec.

I just want to stress that our machine with the SmartArray controller
is not a cheap old dusty leftover, but a recently-bought (December
2011) not exactly cheap Blade server, and that’s all you get from HP.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-25 14:17 ` Roger Willcocks
@ 2012-04-25 16:23   ` Stefan Ring
  2012-04-27 14:03     ` Stan Hoeppner
  2012-04-26  8:53   ` Stefan Ring
  1 sibling, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-04-25 16:23 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: Linux fs XFS

> Result (seems reasonably consistent):
>
> Operations performed:  0 Read, 127458 Write, 0 Other = 127458 Total
> Read 0b  Written 995.77Mb  Total transferred 995.77Mb  (66.337Mb/sec)
>  8491.11 Requests/sec executed]

Holy moly, this is an entirely different game you're playing here! I
suppose that you're using a battery backed write cache?

Thanks a lot for trying!

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: A little RAID experiment
  2012-04-25  8:07 Stefan Ring
@ 2012-04-25 14:17 ` Roger Willcocks
  2012-04-25 16:23   ` Stefan Ring
  2012-04-26  8:53   ` Stefan Ring
  2012-04-27 13:50 ` Stan Hoeppner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 41+ messages in thread
From: Roger Willcocks @ 2012-04-25 14:17 UTC (permalink / raw)
  To: Stefan Ring; +Cc: Linux fs XFS

I've tried this on a system with two x 3ware (lsi) 9750-16i4e each
having 12 x Hitachi Deskstar 1TB disks formatted as raid 6; the two
hardware raids combined as a software raid 0 [*].

xfs formatted and mounted with 'noatime,noalign,nobarrier'.

Result (seems reasonably consistent):

Operations performed:  0 Read, 127458 Write, 0 Other = 127458 Total
Read 0b  Written 995.77Mb  Total transferred 995.77Mb  (66.337Mb/sec)
 8491.11 Requests/sec executed]

This is with the CentOS 5.8 kernel. Note that the 17TB volume is 92% full.

# xfs_bmap test_file.0 
test_file.0:
        0: [0..4963295]: 21324666648..21329629943
        1: [4963296..9919871]: 22779572824..22784529399
        2: [9919872..14871367]: 22769382704..22774334199
        3: [14871368..16777215]: 22767476856..22769382703

--
Roger

[*] actually a variation of raid 0 which distributes the blocks in a
pattern which compensates for the units being much faster at their edge
than at their center, to give a flatter performance curve.
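
(For the curious: the sketch below is only a rough picture of the idea
-- pair an outer-zone chunk on one unit with an inner-zone chunk on
the other, so every logical stripe sees one fast and one slow zone --
and not the actual mapping in use here.)

# Illustration only: pair fast (outer) chunks on one unit with slow (inner)
# chunks on the other, so every logical stripe averages a fast and a slow zone.
def stripe_layout(stripe, chunks_per_unit):
    k = stripe // 2
    if stripe % 2 == 0:
        return (0, k), (1, chunks_per_unit - 1 - k)     # (unit, chunk) pairs
    return (0, chunks_per_unit - 1 - k), (1, k)

for s in range(6):
    print(s, stripe_layout(s, chunks_per_unit=1_000_000))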


On Wed, 2012-04-25 at 10:07 +0200, Stefan Ring wrote:
> This grew out of the discussion in my other thread ("Abysmal write
> performance because of excessive seeking (allocation groups to
> blame?)") -- that should in fact have been called "Free space
> fragmentation causes excessive seeks".
> 
> Could someone with a good hardware RAID (5 or 6, but also mirrored
> setups would be interesting) please conduct a little experiment for
> me?
> 
> I've put up a modified sysbench here:
> <https://github.com/Ringdingcoder/sysbench>. This tries to simulate
> the write pattern I've seen with XFS. It would be really interesting
> to know how different RAID controllers cope with this.
> 
> - Checkout (or download tarball):
> https://github.com/Ringdingcoder/sysbench/tarball/master
> - ./configure --without-mysql && make
> - fallocate -l 8g test_file.0
> - ./sysbench/sysbench --test=fileio --max-time=15
> --max-requests=10000000 --file-num=1 --file-extra-flags=direct
> --file-total-size=8G --file-block-size=8192 --file-fsync-all=off
> --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
> --file-test-mode=ag4 run
> 
> If you don't have fallocate, you can also use the last line with "run"
> replaced by "prepare" to create the file. Run the benchmark a few
> times to check if the numbers are somewhat stable. When doing a few
> runs in direct succession, the first one will likely be faster because
> the cache has not been loaded up yet. The interesting part of the
> output is this:
> 
> Read 0b  Written 64.516Mb  Total transferred 64.516Mb  (4.301Mb/sec)
>   550.53 Requests/sec executed
> 
> That's a measurement from my troubled RAID 6 volume (SmartArray P400,
> 6x 10k disks).
> 
> From the other controller in this machine (RAID 1, SmartArray P410i,
> 2x 15k disks), I get:
> 
> Read 0b  Written 276.85Mb  Total transferred 276.85Mb  (18.447Mb/sec)
>  2361.21 Requests/sec executed
> 
> The better result might be caused by the better controller or the RAID
> 1, with the latter reason being more likely.
> 
> Regards,
> Stefan
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 
-- 
Roger Willcocks <roger@filmlight.ltd.uk>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* A little RAID experiment
@ 2012-04-25  8:07 Stefan Ring
  2012-04-25 14:17 ` Roger Willcocks
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Stefan Ring @ 2012-04-25  8:07 UTC (permalink / raw)
  To: Linux fs XFS

This grew out of the discussion in my other thread ("Abysmal write
performance because of excessive seeking (allocation groups to
blame?)") -- that should in fact have been called "Free space
fragmentation causes excessive seeks".

Could someone with a good hardware RAID (5 or 6, but also mirrored
setups would be interesting) please conduct a little experiment for
me?

I've put up a modified sysbench here:
<https://github.com/Ringdingcoder/sysbench>. This tries to simulate
the write pattern I've seen with XFS. It would be really interesting
to know how different RAID controllers cope with this.

- Checkout (or download tarball):
https://github.com/Ringdingcoder/sysbench/tarball/master
- ./configure --without-mysql && make
- fallocate -l 8g test_file.0
- ./sysbench/sysbench --test=fileio --max-time=15
--max-requests=10000000 --file-num=1 --file-extra-flags=direct
--file-total-size=8G --file-block-size=8192 --file-fsync-all=off
--file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
--file-test-mode=ag4 run

If you don't have fallocate, you can also use the last line with "run"
replaced by "prepare" to create the file. Run the benchmark a few
times to check if the numbers are somewhat stable. When doing a few
runs in direct succession, the first one will likely be faster because
the cache has not been loaded up yet. The interesting part of the
output is this:

Read 0b  Written 64.516Mb  Total transferred 64.516Mb  (4.301Mb/sec)
  550.53 Requests/sec executed

That's a measurement from my troubled RAID 6 volume (SmartArray P400,
6x 10k disks).

From the other controller in this machine (RAID 1, SmartArray P410i,
2x 15k disks), I get:

Read 0b  Written 276.85Mb  Total transferred 276.85Mb  (18.447Mb/sec)
 2361.21 Requests/sec executed

The better result might be caused by the better controller or the RAID
1, with the latter reason being more likely.

Regards,
Stefan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-10-10 22:00 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-26 22:33 A little RAID experiment Richard Scobie
2012-04-27 21:30 ` Emmanuel Florac
2012-04-28  4:15   ` Richard Scobie
  -- strict thread matches above, loose matches on Subject: below --
2012-04-25  8:07 Stefan Ring
2012-04-25 14:17 ` Roger Willcocks
2012-04-25 16:23   ` Stefan Ring
2012-04-27 14:03     ` Stan Hoeppner
2012-04-26  8:53   ` Stefan Ring
2012-04-27 15:10     ` Stan Hoeppner
2012-04-27 15:28     ` Joe Landman
2012-04-28  4:42       ` Stan Hoeppner
2012-04-27 13:50 ` Stan Hoeppner
2012-05-01 10:46   ` Stefan Ring
2012-05-30 11:07     ` Stefan Ring
2012-05-31  1:30       ` Stan Hoeppner
2012-05-31  6:44         ` Stefan Ring
2012-07-16 19:57 ` Stefan Ring
2012-07-16 20:03   ` Stefan Ring
2012-07-16 20:05     ` Stefan Ring
2012-07-16 21:27   ` Stan Hoeppner
2012-07-16 21:58     ` Stefan Ring
2012-07-17  1:39       ` Stan Hoeppner
2012-07-17  5:26         ` Dave Chinner
2012-07-18  2:18           ` Stan Hoeppner
2012-07-18  6:44             ` Stefan Ring
2012-07-18  7:09               ` Stan Hoeppner
2012-07-18  7:22                 ` Stefan Ring
2012-07-18 10:24                   ` Stan Hoeppner
2012-07-18 12:32                     ` Stefan Ring
2012-07-18 12:37                       ` Stefan Ring
2012-07-19  3:08                         ` Stan Hoeppner
2012-07-25  9:29                           ` Stefan Ring
2012-07-25 10:00                             ` Stan Hoeppner
2012-07-25 10:08                               ` Stefan Ring
2012-07-25 11:00                                 ` Stan Hoeppner
2012-07-26  8:32                             ` Dave Chinner
2012-09-11 16:37                               ` Stefan Ring
2012-07-16 22:16     ` Stefan Ring
2012-10-10 14:57 ` Stefan Ring
2012-10-10 21:27   ` Dave Chinner
2012-10-10 22:01     ` Stefan Ring
