* A little RAID experiment
@ 2012-04-25 8:07 Stefan Ring
0 siblings, 4 replies; 41+ messages in thread
From: Stefan Ring @ 2012-04-25 8:07 UTC (permalink / raw)
To: Linux fs XFS
This grew out of the discussion in my other thread ("Abysmal write
performance because of excessive seeking (allocation groups to
blame?)") -- that should in fact have been called "Free space
fragmentation causes excessive seeks".
Could someone with a good hardware RAID (5 or 6, but also mirrored
setups would be interesting) please conduct a little experiment for
me?
I've put up a modified sysbench here:
<https://github.com/Ringdingcoder/sysbench>. This tries to simulate
the write pattern I've seen with XFS. It would be really interesting
to know how different RAID controllers cope with this.
- Check out (or download the tarball):
https://github.com/Ringdingcoder/sysbench/tarball/master
- ./configure --without-mysql && make
- fallocate -l 8g test_file.0
- ./sysbench/sysbench --test=fileio --max-time=15
--max-requests=10000000 --file-num=1 --file-extra-flags=direct
--file-total-size=8G --file-block-size=8192 --file-fsync-all=off
--file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
--file-test-mode=ag4 run
If you don't have fallocate, you can also create the file by running the
last line with "run" replaced by "prepare". Run the benchmark a few
times to check whether the numbers are reasonably stable. When doing
several runs in direct succession, the first one will likely be faster
because the cache has not been filled up yet. The interesting part of
the output is this:
Read 0b Written 64.516Mb Total transferred 64.516Mb (4.301Mb/sec)
550.53 Requests/sec executed
That's a measurement from my troubled RAID 6 volume (SmartArray P400,
6x 10k disks).
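The thread doesn't spell out what the ag4 test mode does beyond "simulates the write pattern I've seen with XFS"; the patched sysbench is the authority. As a rough reconstruction (the region-rotation scheme below is my assumption, not code from that repository), the pattern amounts to 8 KiB writes rotating round-robin across four allocation-group-like regions of the file, sequential within each region, so the disk head seeks between regions on every request. The real benchmark uses O_DIRECT; this sketch omits it for portability:

```python
import os
import tempfile

BLOCK = 8192    # matches --file-block-size=8192
REGIONS = 4     # four regions, like agcount=4

def ag4_offsets(file_size, nreqs):
    """Yield write offsets: round-robin across four equal regions,
    advancing sequentially within each region."""
    region_size = file_size // REGIONS
    for i in range(nreqs):
        region = i % REGIONS                  # rotate across "allocation groups"
        pos = ((i // REGIONS) * BLOCK) % region_size
        yield region * region_size + pos      # sequential within each region

# Demonstrate on a 1 MiB stand-in for the 8 GiB test file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    size = 1 << 20
    f.truncate(size)
    for off in ag4_offsets(size, 16):
        f.seek(off)
        f.write(b"\0" * BLOCK)
os.unlink(f.name)

print(list(ag4_offsets(1 << 20, 4)))  # one write at the start of each region
```

On rotating disks this forces a near-full-stroke seek per request unless the controller's cache can reorder the writes, which is exactly what the benchmark is probing.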
From the other controller in this machine (RAID 1, SmartArray P410i,
2x 15k disks), I get:
Read 0b Written 276.85Mb Total transferred 276.85Mb (18.447Mb/sec)
2361.21 Requests/sec executed
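As a sanity check on both numbers above (my own arithmetic, not part of the thread): the reported throughput should equal requests/sec times the 8 KiB block size, and it does once sysbench's "Mb" is read as mebibytes:

```python
# Cross-check sysbench's reported MiB/s against its requests/sec figure.
BLOCK = 8192  # bytes, from --file-block-size=8192

results = [
    ("P400 RAID 6",  550.53,  4.301),   # (name, req/s, reported MiB/s)
    ("P410i RAID 1", 2361.21, 18.447),
]
for name, reqs_per_sec, reported in results:
    mib_s = reqs_per_sec * BLOCK / 1024**2
    assert abs(mib_s - reported) < 0.001  # matches the quoted output
    print(f"{name}: {mib_s:.3f} MiB/s")
```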
The better result might be due to the better controller or to the RAID
1 layout, with the latter being the more likely explanation.
Regards,
Stefan
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: A little RAID experiment
@ 2012-04-25 14:17 Roger Willcocks
From: Roger Willcocks @ 2012-04-25 14:17 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS
I've tried this on a system with two 3ware (LSI) 9750-16i4e controllers,
each with 12x Hitachi Deskstar 1TB disks formatted as RAID 6, and the
two hardware RAIDs combined as a software RAID 0 [*]. The volume is XFS
formatted and mounted with 'noatime,noalign,nobarrier'.
Result (seems reasonably consistent):
Operations performed: 0 Read, 127458 Write, 0 Other = 127458 Total
Read 0b Written 995.77Mb Total transferred 995.77Mb (66.337Mb/sec)
8491.11 Requests/sec executed
This is with a CentOS 5.8 kernel. Note that the 17TB volume is 92% full.
# xfs_bmap test_file.0
test_file.0:
0: [0..4963295]: 21324666648..21329629943
1: [4963296..9919871]: 22779572824..22784529399
2: [9919872..14871367]: 22769382704..22774334199
3: [14871368..16777215]: 22767476856..22769382703
--
Roger
[*] Actually a variation of RAID 0 which distributes the blocks in a
pattern that compensates for the units being much faster at their edge
than at their center, to give a flatter performance curve.
On Wed, 2012-04-25 at 10:07 +0200, Stefan Ring wrote:
> Could someone with a good hardware RAID (5 or 6, but also mirrored
> setups would be interesting) please conduct a little experiment for
> me?
[...]
--
Roger Willcocks <roger@filmlight.ltd.uk>
* Re: A little RAID experiment
@ 2012-04-25 16:23 Stefan Ring
From: Stefan Ring @ 2012-04-25 16:23 UTC (permalink / raw)
To: Roger Willcocks; +Cc: Linux fs XFS
> Result (seems reasonably consistent):
>
> Operations performed: 0 Read, 127458 Write, 0 Other = 127458 Total
> Read 0b Written 995.77Mb Total transferred 995.77Mb (66.337Mb/sec)
> 8491.11 Requests/sec executed
Holy moly, this is an entirely different game you're playing here! I
suppose you're using a battery-backed write cache?
Thanks a lot for trying!
* Re: A little RAID experiment
@ 2012-04-27 14:03 Stan Hoeppner
From: Stan Hoeppner @ 2012-04-27 14:03 UTC (permalink / raw)
To: Stefan Ring; +Cc: Roger Willcocks, Linux fs XFS
On 4/25/2012 11:23 AM, Stefan Ring wrote:
>> Read 0b Written 995.77Mb Total transferred 995.77Mb (66.337Mb/sec)
>> 8491.11 Requests/sec executed
>
> Holy moly, this is an entirely different game you're playing here! I
> suppose that you're using a battery backed write cache?
He's running a 20-data-spindle RAID 60 across two decent hardware RAID
cards, each with 512MB of write cache, so of course it's going to be
much faster than your 4-data-spindle RAID 6, even with slightly slower
spindles.
Note that 8x 15k drives in RAID 10 on your P410i should slightly surpass
Roger's RAID 60 performance, ~70MB/s vs. 66MB/s: 3x fewer drives for
roughly equal performance, but obviously less capacity.
--
Stan
* Re: A little RAID experiment
@ 2012-04-26 8:53 Stefan Ring
From: Stefan Ring @ 2012-04-26 8:53 UTC (permalink / raw)
To: Roger Willcocks; +Cc: Linux fs XFS
> Read 0b Written 995.77Mb Total transferred 995.77Mb (66.337Mb/sec)
> 8491.11 Requests/sec executed
I was a bit sceptical of your measurement at first, especially since
your xfs_bmap shows that the file is split into four regions, which
(almost) nicely aligns with the agcount=4 setup that the benchmark
emulates, but this seems to be just a coincidence.
Meanwhile, I've found a customer's system where we have a MegaRAID SAS
1078 with a 6-disk RAID 6 volume, and this one delivers 54 MB/sec, which
really puts the SmartArray controller to shame at its measly 4 MB/sec.
I just want to stress that our machine with the SmartArray controller is
not a cheap old dusty leftover, but a recently bought (December 2011),
not exactly cheap blade server, and that's all you get from HP.
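A quick computation (mine, not from the thread) of the extent sizes from Roger's xfs_bmap output earlier in the thread, whose offsets are in 512-byte basic blocks, shows how close the four extents come to the four equal 2 GiB regions an agcount=4 layout would suggest:

```python
# Extent ranges copied from Roger's xfs_bmap output (512-byte basic blocks).
extents = [(0, 4963295), (4963296, 9919871),
           (9919872, 14871367), (14871368, 16777215)]

sizes_gib = [(end - start + 1) * 512 / 2**30 for start, end in extents]
assert abs(sum(sizes_gib) - 8.0) < 1e-9  # the whole 8 GiB test file
print([f"{s:.2f} GiB" for s in sizes_gib])
```

Three extents of roughly 2.36 GiB and one of about 0.91 GiB: close to, but not exactly, four equal quarters, consistent with the split being coincidental.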
* Re: A little RAID experiment
@ 2012-04-27 15:10 Stan Hoeppner
From: Stan Hoeppner @ 2012-04-27 15:10 UTC (permalink / raw)
To: Stefan Ring; +Cc: Roger Willcocks, Linux fs XFS
On 4/26/2012 3:53 AM, Stefan Ring wrote:
> Meanwhile, I've found a customer's system, where we have a MegaRAID
> SAS 1078 with a 6-disk RAID 6 volume, and this one delivers 54MB/sec,
> which really puts the SmartArray controller to shame at its measly
> 4MB/sec.
That's interesting considering the MegaRAID 8708/8880, which I assume is
the 1078-based card above, and the P400 are of roughly the same IC
generation. Both use PowerPC cores, the P400 at 440MHz and the 1078 at
500MHz, both with DDR2 DRAM, the P400 @533 and the 8880 @667. On paper
they're very similar. I'd guess the cause of the big performance
difference is that the 1078 has dedicated parity circuitry, while the
P400 likely calculates parity in software on the PPC core.
FYI, the parity engines on the 2208 dual-core ASIC are apparently
lightning fast compared to any previous generation. This chip is found
on the MegaRAID 9265, 9266 and 9285, each board having 1GB of DDR3-1333
cache.
> I just want to stress that our machine with the SmartArray controller
> is not a cheap old dusty leftover, but a recently-bought (December
> 2011) not exactly cheap Blade server, and that's all you get from HP.
The fact that they still sell a product doesn't mean it's recent
technology. On the contrary, HP, IBM, and to a lesser extent Dell tend
to keep some models on the shelf for a very long time, often 4 years or
more. The P400 has likely been around even longer, given its DDR2
memory and SATA 3Gb interfaces.
--
Stan
* Re: A little RAID experiment
@ 2012-04-27 15:28 Joe Landman
From: Joe Landman @ 2012-04-27 15:28 UTC (permalink / raw)
To: xfs
On 04/26/2012 04:53 AM, Stefan Ring wrote:
> I just want to stress that our machine with the SmartArray controller
> is not a cheap old dusty leftover, but a recently-bought (December
> 2011) not exactly cheap Blade server, and that's all you get from HP.
We have an anecdote about something akin to this which happened two
years ago. A potential customer was testing a <insert large multi-letter
acronym brand name here> machine to run a specific set of software which
was tightly coupled to its disks. Performance was terrible. Our partner
(the software vendor) contacted us and asked us to help. We suggested
that the partner loan them the machine they had bought from us two years
earlier (at the time) and try that.
Our two-year-old machine (actually two generations back at the time of
the test, now five generations behind our current kit) wound up being
more than an order of magnitude faster than the (then) latest and
greatest kit from <insert large multi-letter acronym brand name here>.
The lesson is this: latest and greatest doesn't mean fastest. Design and
implementation matter. Brand names don't.
To this day, we still see machines being pushed out with PCI-X
technology for networking, or disk, or ...
... and customers buy it up, for reasons that have little to do with
performance, suitability to the task, etc.
If you need performance, it's important to focus some effort on locating
systems/vendors capable of performing where you need them to perform.
Otherwise you may wind up with a warmed-over web server with some random
card and a few "fast" disks tossed in. I don't mean to be blunt, but
this is basically what you were sold.
Note also, I see this in cluster file system bits all the time. We get
calls from people who describe a design and ask us for help making it go
fast. We discover that they've made some deep fundamental design
decisions very poorly (usually on the basis of what <insert large
multi-letter acronym brand name here> told them their options were), and
there was no way to get from point A (their per-unit performance) to
point B (what they were hoping for as aggregate system performance).
At the most basic level, your performance will be modulated by your
slowest performing part. You can put infinitely fast disks on a slow
controller, and your performance will be terrible. You can put slow
disks on a very fast controller, and you will likely have better luck.
/Hoping this lesson is not lost ...
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
* Re: A little RAID experiment
@ 2012-04-28 4:42 Stan Hoeppner
From: Stan Hoeppner @ 2012-04-28 4:42 UTC (permalink / raw)
To: landman; +Cc: xfs
On 4/27/2012 10:28 AM, Joe Landman wrote:
> The lesson is this. Latest and greatest doesn't mean fastest. Design,
> and implementation matter. Brand names don't.
>
> To this day, we still see machines being pushed out with PCIx
> technology for networking, or disk, or ...
I've seen this as well. A vendor gets comfortable and confident with a
particular main board, RAID card, NIC, etc. that demonstrates uber
reliability in the field and is easy to work on/with. They continue
selling it as long as they can still get their hands on it, even though
much better technology has long been available. It's the "stick with
what we know works" mentality. Sometimes this is a good strategy. If a
customer consistently needs maximum performance, obviously not.
> At the most basic level, your performance will be modulated by your
> slowest performing part. You can put infinitely fast disks on a slow
> controller, and your performance will be terrible. You can put slow
> disks on a very fast controller, and you will likely have better luck.
I generally agree with this last statement, but I think it's most
relevant to parity arrays. In general, RAID 1/10 performance tends to be
less impacted by controller speed. But yes, a really poor, slow
controller is going to limit anything you try to do with any disks.
--
Stan
* Re: A little RAID experiment
@ 2012-04-27 13:50 Stan Hoeppner
From: Stan Hoeppner @ 2012-04-27 13:50 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS
On 4/25/2012 3:07 AM, Stefan Ring wrote:
> [...]
> That's a measurement from my troubled RAID 6 volume (SmartArray P400,
> 6x 10k disks).
>
> From the other controller in this machine (RAID 1, SmartArray P410i,
> 2x 15k disks), I get:
>
> Read 0b Written 276.85Mb Total transferred 276.85Mb (18.447Mb/sec)
> 2361.21 Requests/sec executed
Stefan, you should be able to simply clear the P410i configuration in
the BIOS, power down, connect the 6-drive backplane cable to the P410i,
load the config from the disks, and go. This allows a head-to-head
RAID 6 comparison between the P400 and the P410i. No doubt the P410i
will be quicker; this procedure will tell you how much quicker.
--
Stan
* Re: A little RAID experiment
@ 2012-05-01 10:46 Stefan Ring
From: Stefan Ring @ 2012-05-01 10:46 UTC (permalink / raw)
To: stan; +Cc: Linux fs XFS
> Stefan, you should be able to simply clear the P410i configuration in
> the BIOS, power down, then simply connect the 6 drive backplane cable
> to the 410i, load the config from the disks, and go. This allows head
> to head RAID6 comparison between the P400 and P410i. No doubt the 410i
> will be quicker. This procedure will tell you how much quicker.
Unfortunately, the server is located at a hosting facility at the
opposite end of town, and I'd spend an entire day just traveling to and
fro, so that's not currently an option. I might get lucky though,
because we should soon get another server with an external P410i.
* Re: A little RAID experiment
@ 2012-05-30 11:07 Stefan Ring
From: Stefan Ring @ 2012-05-30 11:07 UTC (permalink / raw)
To: stan; +Cc: Linux fs XFS
On Tue, May 1, 2012 at 12:46 PM, Stefan Ring <stefanrin@gmail.com> wrote:
> Unfortunately, the server is located at a hosting facility at the
> opposite end of town, and I'd spend an entire day just traveling to
> and fro, so that's not currently an option. I might get lucky though,
> because we should soon get another server with an external P410i.
The new storage blade has only been upgraded to the P410i controller,
and even though there is a new setting called "elevatorsort", which is
enabled, the performance is just as bad. The new one has a flash-backed
write cache and may be faster by a few percent, but that's it. It
doesn't even make sense to compare the two in depth, as they perform
almost identically.
* Re: A little RAID experiment
@ 2012-05-31 1:30 Stan Hoeppner
From: Stan Hoeppner @ 2012-05-31 1:30 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS
On 5/30/2012 6:07 AM, Stefan Ring wrote:
> The new storage blade has only been upgraded to the P410i controller,
> and even though there is a new setting called "elevatorsort", which is
> enabled, the performance is just as bad. The new one has a
> flash-writeback cache and may be faster by a few percent ticks, but
> that's it. It doesn't even make sense to compare the two in-depth, as
> they perform almost identically.
You now have a persistent write cache. Did you test with XFS barriers
disabled? If not, you should. You'll likely see a decent, possibly
outstanding, performance improvement with your metadata-heavy
modification workload, as XFS will no longer flush the cache frequently
when writing to the journal log.
--
Stan
* Re: A little RAID experiment
@ 2012-05-31 6:44 Stefan Ring
From: Stefan Ring @ 2012-05-31 6:44 UTC (permalink / raw)
To: stan; +Cc: Linux fs XFS
> You now have persistent write cache. Did you test with XFS barriers
> disabled? If not you should. You'll likely see a decent, possibly
> outstanding, performance improvement with your huge metadata
> modification workload, as XFS will no longer flush the cache
> frequently when writing to the journal log.
I've already done that test with the previous controller. It had a
BBWC, so it was persistent as well, and it was easy to enable or
disable. Yes, everything was always done with barrier=0, and yes, the
cache made a big difference (about 3x).
* Re: A little RAID experiment 2012-04-25 8:07 A little RAID experiment Stefan Ring 2012-04-25 14:17 ` Roger Willcocks 2012-04-27 13:50 ` Stan Hoeppner @ 2012-07-16 19:57 ` Stefan Ring 2012-07-16 20:03 ` Stefan Ring 2012-07-16 21:27 ` Stan Hoeppner 2012-10-10 14:57 ` Stefan Ring 3 siblings, 2 replies; 41+ messages in thread From: Stefan Ring @ 2012-07-16 19:57 UTC (permalink / raw) To: Linux fs XFS On Wed, Apr 25, 2012 at 10:07 AM, Stefan Ring <stefanrin@gmail.com> wrote: > This grew out of the discussion in my other thread ("Abysmal write > performance because of excessive seeking (allocation groups to > blame?)") -- that should in fact have been called "Free space > fragmentation causes excessive seeks". > > Could someone with a good hardware RAID (5 or 6, but also mirrored > setups would be interesting) please conduct a little experiment for > me? > > I've put up a modified sysbench here: > <https://github.com/Ringdingcoder/sysbench>. This tries to simulate > the write pattern I've seen with XFS. It would be really interesting > to know how different RAID controllers cope with this. > > - Checkout (or download tarball): > https://github.com/Ringdingcoder/sysbench/tarball/master > - ./configure --without-mysql && make > - fallocate -l 8g test_file.0 > - ./sysbench/sysbench --test=fileio --max-time=15 > --max-requests=10000000 --file-num=1 --file-extra-flags=direct > --file-total-size=8G --file-block-size=8192 --file-fsync-all=off > --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1 > --file-test-mode=ag4 run > > If you don't have fallocate, you can also use the last line with "run" > replaced by "prepare" to create the file. Run the benchmark a few > times to check if the numbers are somewhat stable. When doing a few > runs in direct succession, the first one will likely be faster because > the cache has not been loaded up yet. 
The interesting part of the > output is this: > > Read 0b Written 64.516Mb Total transferred 64.516Mb (4.301Mb/sec) > 550.53 Requests/sec executed > > That's a measurement from my troubled RAID 6 volume (SmartArray P400, > 6x 10k disks). > > From the other controller in this machine (RAID 1, SmartArray P410i, > 2x 15k disks), I get: > > Read 0b Written 276.85Mb Total transferred 276.85Mb (18.447Mb/sec) > 2361.21 Requests/sec executed > > The better result might be caused by the better controller or the RAID > 1, with the latter reason being more likely. In the meantime, the very useful --report-interval switch has been added to development versions of sysbench, and I've had access to one additional system. If I thought that the internal RAID was bad, that's only because I have not yet experienced an external enclosure from HP attached via FibreChannel (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA). Unfortunately, I don't have detailed information about the configuration of this enclosure, except that it's a RAID6 volume, with 10 or 12 disks, I believe. 
Witness this horrendous tanking of write throughput: [ 2s] reads: 0.00 MB/s writes: 0.07 MB/s fsyncs: 0.00/s response time: 0.616ms (95%) [ 4s] reads: 0.00 MB/s writes: 14.10 MB/s fsyncs: 0.00/s response time: 0.481ms (95%) [ 6s] reads: 0.00 MB/s writes: 15.28 MB/s fsyncs: 0.00/s response time: 0.458ms (95%) [ 8s] reads: 0.00 MB/s writes: 14.65 MB/s fsyncs: 0.00/s response time: 0.464ms (95%) [ 10s] reads: 0.00 MB/s writes: 15.32 MB/s fsyncs: 0.00/s response time: 0.447ms (95%) [ 12s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response time: 0.460ms (95%) [ 14s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response time: 0.471ms (95%) [ 16s] reads: 0.00 MB/s writes: 14.06 MB/s fsyncs: 0.00/s response time: 0.468ms (95%) [ 18s] reads: 0.00 MB/s writes: 0.43 MB/s fsyncs: 0.00/s response time: 3.933ms (95%) [ 20s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 985.122ms (95%) [ 22s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 1435.164ms (95%) [ 24s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1194.568ms (95%) [ 26s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1112.091ms (95%) [ 28s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 1443.350ms (95%) [ 30s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1078.972ms (95%) Operations performed: 0 reads, 53413 writes, 0 Other = 53413 Total Read 0b Written 208.64Mb Total transferred 208.64Mb (6.8007Mb/sec) 1740.98 Requests/sec executed For comparison, this is the SmartArray P400 RAID6 that I initially complained about: [ 2s] reads: 0.00 MB/s writes: 6.34 MB/s fsyncs: 0.00/s response time: 0.219ms (95%) [ 4s] reads: 0.00 MB/s writes: 5.35 MB/s fsyncs: 0.00/s response time: 0.217ms (95%) [ 6s] reads: 0.00 MB/s writes: 5.48 MB/s fsyncs: 0.00/s response time: 0.208ms (95%) [ 8s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.228ms (95%) [ 10s] reads: 0.00 MB/s writes: 5.81 MB/s 
fsyncs: 0.00/s response time: 0.226ms (95%) [ 12s] reads: 0.00 MB/s writes: 6.01 MB/s fsyncs: 0.00/s response time: 0.223ms (95%) [ 14s] reads: 0.00 MB/s writes: 5.39 MB/s fsyncs: 0.00/s response time: 0.212ms (95%) [ 16s] reads: 0.00 MB/s writes: 5.21 MB/s fsyncs: 0.00/s response time: 0.225ms (95%) [ 18s] reads: 0.00 MB/s writes: 5.16 MB/s fsyncs: 0.00/s response time: 0.224ms (95%) [ 20s] reads: 0.00 MB/s writes: 5.97 MB/s fsyncs: 0.00/s response time: 0.217ms (95%) [ 22s] reads: 0.00 MB/s writes: 4.28 MB/s fsyncs: 0.00/s response time: 0.228ms (95%) [ 24s] reads: 0.00 MB/s writes: 7.44 MB/s fsyncs: 0.00/s response time: 0.191ms (95%) [ 26s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.250ms (95%) [ 28s] reads: 0.00 MB/s writes: 5.45 MB/s fsyncs: 0.00/s response time: 0.258ms (95%) [ 30s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response time: 0.254ms (95%) Operations performed: 0 reads, 42890 writes, 0 Other = 42890 Total Read 0b Written 167.54Mb Total transferred 167.54Mb (5.5773Mb/sec) 1427.80 Requests/sec executed Slow, but at least it's consistent. 
And that's what I would expect, and what a decent RAID controller manages to provide (LSI Logic / Symbios Logic MegaRAID SAS 1078):

[ 2s] reads: 0.00 MB/s writes: 56.65 MB/s fsyncs: 0.00/s response time: 0.117ms (95%)
[ 4s] reads: 0.00 MB/s writes: 37.15 MB/s fsyncs: 0.00/s response time: 0.221ms (95%)
[ 6s] reads: 0.00 MB/s writes: 35.92 MB/s fsyncs: 0.00/s response time: 0.225ms (95%)
[ 8s] reads: 0.00 MB/s writes: 34.15 MB/s fsyncs: 0.00/s response time: 0.239ms (95%)
[ 10s] reads: 0.00 MB/s writes: 33.19 MB/s fsyncs: 0.00/s response time: 0.221ms (95%)
[ 12s] reads: 0.00 MB/s writes: 34.02 MB/s fsyncs: 0.00/s response time: 0.229ms (95%)
[ 14s] reads: 0.00 MB/s writes: 36.61 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
[ 16s] reads: 0.00 MB/s writes: 37.62 MB/s fsyncs: 0.00/s response time: 0.232ms (95%)
[ 18s] reads: 0.00 MB/s writes: 35.75 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
[ 20s] reads: 0.00 MB/s writes: 35.42 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
[ 22s] reads: 0.00 MB/s writes: 34.63 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
[ 24s] reads: 0.00 MB/s writes: 34.83 MB/s fsyncs: 0.00/s response time: 0.230ms (95%)
[ 26s] reads: 0.00 MB/s writes: 36.84 MB/s fsyncs: 0.00/s response time: 0.229ms (95%)
[ 28s] reads: 0.00 MB/s writes: 36.15 MB/s fsyncs: 0.00/s response time: 0.232ms (95%)
Operations performed: 0 reads, 284087 writes, 0 Other = 284087 Total
Read 0b Written 1.0837Gb Total transferred 1.0837Gb (36.99Mb/sec)
9469.55 Requests/sec executed

The command line used was:

sysbench --test=fileio --max-time=30 --max-requests=10000000 --file-num=1 --file-extra-flags=direct --file-total-size=8G --file-block-size=4k --file-fsync-all=off --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1 --file-test-mode=ag4 --report-interval=2 run

I have not yet uploaded my patched version of the development sysbench, but I'm planning to do so, and I'd be really interested if someone could run it on a really high-end storage system.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
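The fill-then-stall shape of the P2000 trace above is consistent with a write cache that acknowledges requests almost instantly until it fills, then throttles new requests to the rate at which dirty data drains to disk. A toy model reproduces that shape; note the cache size and drain rate used here are invented round numbers for illustration, not the P2000's actual parameters:

```python
# Toy model of a RAID controller write cache: incoming writes complete
# instantly while the cache has room, then stall at the drain rate.
def simulate(seconds, write_mb_s, cache_mb, drain_mb_s):
    """Return per-second achieved write throughput (MB/s)."""
    cached = 0.0
    achieved = []
    for _ in range(seconds):
        cached = max(0.0, cached - drain_mb_s)         # disks drain the cache
        accepted = min(write_mb_s, cache_mb - cached)  # cache absorbs the rest
        cached += accepted
        achieved.append(accepted)
    return achieved

# Invented numbers: 15 MB/s offered load, 120 MB usable cache,
# 0.5 MB/s random-write drain rate to the RAID6 disks.
trace = simulate(30, 15.0, 120.0, 0.5)
# Early seconds run at the full offered load; once the cache fills,
# throughput collapses to roughly the drain rate -- the same cliff
# visible around the 18s mark in the measured trace.
```

The point of the sketch is only that a cliff like the one measured needs no exotic explanation: any FIFO-ish cache with a slow drain produces it.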
* Re: A little RAID experiment
  2012-07-16 19:57 ` Stefan Ring
@ 2012-07-16 20:03   ` Stefan Ring
  2012-07-16 20:05     ` Stefan Ring
  2012-07-16 21:27   ` Stan Hoeppner
  1 sibling, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 20:03 UTC (permalink / raw)
To: Linux fs XFS

Damn, the formatting has been broken. For easier readability, I've uploaded the text here:
https://github.com/Ringdingcoder/sysbench/blob/0dd3e1797ee5b847f0877144a6e0cd9de60ae7c3/mail1.txt
* Re: A little RAID experiment
  2012-07-16 20:03 ` Stefan Ring
@ 2012-07-16 20:05   ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 20:05 UTC (permalink / raw)
To: Linux fs XFS

On Mon, Jul 16, 2012 at 10:03 PM, Stefan Ring <stefanrin@gmail.com> wrote:
> Damn, the formatting has been broken. For easier readability, I've
> uploaded the text here:

https://github.com/Ringdingcoder/sysbench/blob/master/mail1.txt
* Re: A little RAID experiment
  2012-07-16 19:57 ` Stefan Ring
  2012-07-16 20:03   ` Stefan Ring
@ 2012-07-16 21:27   ` Stan Hoeppner
  2012-07-16 21:58     ` Stefan Ring
  2012-07-16 22:16     ` Stefan Ring
  1 sibling, 2 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-16 21:27 UTC (permalink / raw)
To: xfs

On 7/16/2012 2:57 PM, Stefan Ring wrote:
> If I thought that the internal RAID was bad, that's only because I
> have not yet experienced an external enclosure from HP attached via
> FibreChannel (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
> Channel to PCI Express HBA). Unfortunately, I don't have detailed
> information about the configuration of this enclosure, except that
> it's a RAID6 volume, with 10 or 12 disks, I believe.

Without that information the numbers below may tend to be a bit meaningless.

> Witness this horrendous tanking of write throughput:
>
> [ 2s] reads: 0.00 MB/s writes: 0.07 MB/s fsyncs: 0.00/s response time: 0.616ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 14.10 MB/s fsyncs: 0.00/s response time: 0.481ms (95%)
> [ 6s] reads: 0.00 MB/s writes: 15.28 MB/s fsyncs: 0.00/s response time: 0.458ms (95%)
> [ 8s] reads: 0.00 MB/s writes: 14.65 MB/s fsyncs: 0.00/s response time: 0.464ms (95%)
> [ 10s] reads: 0.00 MB/s writes: 15.32 MB/s fsyncs: 0.00/s response time: 0.447ms (95%)
> [ 12s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response time: 0.460ms (95%)
> [ 14s] reads: 0.00 MB/s writes: 15.18 MB/s fsyncs: 0.00/s response time: 0.471ms (95%)
> [ 16s] reads: 0.00 MB/s writes: 14.06 MB/s fsyncs: 0.00/s response time: 0.468ms (95%)

Up to this point it appears the BBWC is acknowledging write completion, as the response times are less than 1ms, 16-40 times lower than a disk drive response. If this is the case, the transfer rates should be close to 800MB/s, the limit for 8Gb FC.
> [ 18s] reads: 0.00 MB/s writes: 0.43 MB/s fsyncs: 0.00/s response time: 3.933ms (95%)
> [ 20s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 985.122ms (95%)
> [ 22s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 1435.164ms (95%)
> [ 24s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1194.568ms (95%)
> [ 26s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1112.091ms (95%)
> [ 28s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 1443.350ms (95%)
> [ 30s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 1078.972ms (95%)

These writes appear to all be larger than the BBWC, according to the response times. It's odd that the data written is 0.00MB/s, meaning nothing was actually written. How does writing nothing take over 1 second? Either there is something wrong with your test, critical data omitted from these reports, it isn't reporting coherent data, or I'm simply not "trained" to read this output. The output doesn't make any sense.
> Operations performed: 0 reads, 53413 writes, 0 Other = 53413 Total
> Read 0b Written 208.64Mb Total transferred 208.64Mb (6.8007Mb/sec)
> 1740.98 Requests/sec executed
>
> For comparison, this is the SmartArray P400 RAID6 that I initially
> complained about:
>
> [ 2s] reads: 0.00 MB/s writes: 6.34 MB/s fsyncs: 0.00/s response time: 0.219ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 5.35 MB/s fsyncs: 0.00/s response time: 0.217ms (95%)
> [ 6s] reads: 0.00 MB/s writes: 5.48 MB/s fsyncs: 0.00/s response time: 0.208ms (95%)
> [ 8s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
> [ 10s] reads: 0.00 MB/s writes: 5.81 MB/s fsyncs: 0.00/s response time: 0.226ms (95%)
> [ 12s] reads: 0.00 MB/s writes: 6.01 MB/s fsyncs: 0.00/s response time: 0.223ms (95%)
> [ 14s] reads: 0.00 MB/s writes: 5.39 MB/s fsyncs: 0.00/s response time: 0.212ms (95%)
> [ 16s] reads: 0.00 MB/s writes: 5.21 MB/s fsyncs: 0.00/s response time: 0.225ms (95%)
> [ 18s] reads: 0.00 MB/s writes: 5.16 MB/s fsyncs: 0.00/s response time: 0.224ms (95%)
> [ 20s] reads: 0.00 MB/s writes: 5.97 MB/s fsyncs: 0.00/s response time: 0.217ms (95%)
> [ 22s] reads: 0.00 MB/s writes: 4.28 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
> [ 24s] reads: 0.00 MB/s writes: 7.44 MB/s fsyncs: 0.00/s response time: 0.191ms (95%)
> [ 26s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.250ms (95%)
> [ 28s] reads: 0.00 MB/s writes: 5.45 MB/s fsyncs: 0.00/s response time: 0.258ms (95%)
> [ 30s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response time: 0.254ms (95%)
> Operations performed: 0 reads, 42890 writes, 0 Other = 42890 Total
> Read 0b Written 167.54Mb Total transferred 167.54Mb (5.5773Mb/sec)
> 1427.80 Requests/sec executed

Again, the response times suggest all these writes are being acknowledged by BBWC. Given this is a PCIe RAID HBA, the throughput numbers to BBWC should be hundreds of megs per second.
> Slow, but at least it's consistent.
>
> And that's what I would expect, and what a decent RAID controller
> manages to provide (LSI Logic / Symbios Logic MegaRAID SAS 1078):
>
> [ 2s] reads: 0.00 MB/s writes: 56.65 MB/s fsyncs: 0.00/s response time: 0.117ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 37.15 MB/s fsyncs: 0.00/s response time: 0.221ms (95%)
> [ 6s] reads: 0.00 MB/s writes: 35.92 MB/s fsyncs: 0.00/s response time: 0.225ms (95%)
> [ 8s] reads: 0.00 MB/s writes: 34.15 MB/s fsyncs: 0.00/s response time: 0.239ms (95%)
> [ 10s] reads: 0.00 MB/s writes: 33.19 MB/s fsyncs: 0.00/s response time: 0.221ms (95%)
> [ 12s] reads: 0.00 MB/s writes: 34.02 MB/s fsyncs: 0.00/s response time: 0.229ms (95%)
> [ 14s] reads: 0.00 MB/s writes: 36.61 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
> [ 16s] reads: 0.00 MB/s writes: 37.62 MB/s fsyncs: 0.00/s response time: 0.232ms (95%)
> [ 18s] reads: 0.00 MB/s writes: 35.75 MB/s fsyncs: 0.00/s response time: 0.228ms (95%)
> [ 20s] reads: 0.00 MB/s writes: 35.42 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
> [ 22s] reads: 0.00 MB/s writes: 34.63 MB/s fsyncs: 0.00/s response time: 0.233ms (95%)
> [ 24s] reads: 0.00 MB/s writes: 34.83 MB/s fsyncs: 0.00/s response time: 0.230ms (95%)
> [ 26s] reads: 0.00 MB/s writes: 36.84 MB/s fsyncs: 0.00/s response time: 0.229ms (95%)
> [ 28s] reads: 0.00 MB/s writes: 36.15 MB/s fsyncs: 0.00/s response time: 0.232ms (95%)
> Operations performed: 0 reads, 284087 writes, 0 Other = 284087 Total
> Read 0b Written 1.0837Gb Total transferred 1.0837Gb (36.99Mb/sec)
> 9469.55 Requests/sec executed

Again, due to the response times, all the writes appear acknowledged by BBWC. While the LSI throughput is better, it is still far far lower than what it should be, i.e. hundreds of megs per second to BBWC.
> The command line used was: sysbench --test=fileio --max-time=30
> --max-requests=10000000 --file-num=1 --file-extra-flags=direct
> --file-total-size=8G --file-block-size=4k --file-fsync-all=off
> --file-fsync-freq=0 --file-fsync-mode=fdatasync --num-threads=1
> --file-test-mode=ag4 --report-interval=2 run

I'm not familiar with sysbench. That said, your command line seems to be specifying 8GB files. Your original issue reported here long ago was low performance with huge metadata, i.e. deleting kernel trees etc. What storage characteristics is the command above supposed to test?

> I have not yet uploaded my patched version of the development
> sysbench, but I'm planning to do so, and I'd be really interested if
> someone could run it on a really high-end storage system.

I'd like a pony. If anyone here were to give me a pony, that would satisfy one desire of one person. Ergo, if others performing your test will have a positive impact on the XFS code and user base, and not simply serve to satisfy the curiosity of one user, I'm sure others would be glad to run such tests. At this point though it seems such testing would only satisfy the former, and not the latter.

-- 
Stan
* Re: A little RAID experiment 2012-07-16 21:27 ` Stan Hoeppner @ 2012-07-16 21:58 ` Stefan Ring 2012-07-17 1:39 ` Stan Hoeppner 2012-07-16 22:16 ` Stefan Ring 1 sibling, 1 reply; 41+ messages in thread From: Stefan Ring @ 2012-07-16 21:58 UTC (permalink / raw) To: stan; +Cc: xfs > These writes appear to all be larger than the BBWC, according to the > response times. It's odd that the data written is 0.00MB/s, meaning > nothing was actually written. How does writing nothing takes over 1 second? The writes are 4KB all the time, but at this point the FBWC has been filled up. I guess it's not "nothing", but close to it, and the MB/s figure is rounded. If it takes > 1 sec for a single write to get through, not much gets written in a 2 second interval. > Either there is something wrong with your test, critical data omitted > from these reports, it isn't reporting coherent data, or I'm simply not > "trained" to read this output. The output doesn't make any sense. I'm pretty sure that the data is correct, and the test is not flawed. The only relevant omission is that I've run the test a few times in a row. That should explain the first "0.07MB/s" line, because the cache was already loaded. The output does make sense, it's just the controller that's behaving erratically. It seems to accept data into the cache up to a point, then it starts writing it out to disk and not doing much else during that time. >> [ 30s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response >> time: 0.254ms (95%) >> Operations performed: 0 reads, 42890 writes, 0 Other = 42890 Total >> Read 0b Written 167.54Mb Total transferred 167.54Mb (5.5773Mb/sec) >> 1427.80 Requests/sec executed > > Again, the response times suggest all these writes are being > acknowledged by BBWC. Given this is a PCIe RAID HBA, the throughput > numbers to BBWC should be hundreds of megs per second. 
It's semi-random, quite small writes -- actually not very random, but still not exactly linear -- so some performance degradation is expected.

>> [ 28s] reads: 0.00 MB/s writes: 36.15 MB/s fsyncs: 0.00/s response
>> time: 0.232ms (95%)
>> Operations performed: 0 reads, 284087 writes, 0 Other = 284087 Total
>> Read 0b Written 1.0837Gb Total transferred 1.0837Gb (36.99Mb/sec)
>> 9469.55 Requests/sec executed
>
> Again, due to the response times, all the writes appear acknowledged by
> BBWC. While the LSI throughput is better, it is still far far lower
> than what it should be, i.e. hundreds of megs per second to BBWC.

The cache gets filled up quickly in this case, so it can only accept as much data as it manages to write out to the disks.

> I'm not familiar with sysbench. That said, your command line seems to
> be specifying 8GB files. Your original issue reported here long ago was
> low performance with huge metadata, i.e. deleting kernel trees etc.
> What storage characteristics is the command above supposed to test?

You're right. When I had the issue with a metadata-intensive workload -- it was mostly free space fragmentation that caused trouble, apparently -- I ran seekwatcher and noticed a pattern that I tried to illustrate in <http://oss.sgi.com/pipermail/xfs/2012-April/018231.html>. The SmartArray controller was not able to make sense of this pattern, although in theory it would be very easy to optimize. I was familiar with sysbench, which offers a handy random write test with a selectable block size, and I modified it so that it would write out the blocks in the order suggested by the pattern.

> I'd like a pony. If anyone here were to give me a pony, that would
> satisfy one desire of one person. Ergo, if others performing your test
> will have a positive impact on the XFS code and user base, and not
> simply serve to satisfy the curiosity of one user, I'm sure others would
> be glad to run such tests. At this point though it seems such testing
> would only satisfy the former, and not the latter.

Maybe so, but it might also be worthwhile to point out flaws in current real hardware when it does not behave the way one would expect.
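The "ag4" pattern Stefan describes -- small writes that are neither random nor sequential, cycling across allocation-group-sized regions of the file -- can be sketched roughly as follows. This is a guess at the general shape from the descriptions in this thread, not the actual code of the patched sysbench; the round-robin scheme and constants (8 GiB file, 4 KiB blocks, four regions, matching the command lines quoted above) are illustrative assumptions:

```python
# Hypothetical sketch of an XFS-like "ag4" write pattern: 4 KiB writes
# advancing linearly within four allocation-group-sized regions of an
# 8 GiB file, visited round-robin.
FILE_SIZE = 8 * 1024**3   # 8 GiB, as in the sysbench command line
BLOCK = 4096              # 4 KiB writes
AGS = 4                   # four regions, one per allocation group

def ag4_offsets(n):
    """Yield the first n byte offsets of the interleaved pattern."""
    ag_size = FILE_SIZE // AGS
    for i in range(n):
        ag = i % AGS               # cycle through the regions...
        step = (i // AGS) * BLOCK  # ...advancing linearly within each
        yield ag * ag_size + step

offsets = list(ag4_offsets(8))
# Consecutive writes land ~2 GiB apart, so a controller that only
# detects a single sequential stream sees pure random IO, while one
# that tracks several streams could coalesce each region perfectly.
```

That multi-stream property is the point of the experiment: the pattern is trivially optimizable in theory, and the thread is about which controllers actually manage it.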
* Re: A little RAID experiment
  2012-07-16 21:58 ` Stefan Ring
@ 2012-07-17  1:39   ` Stan Hoeppner
  2012-07-17  5:26     ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-17 1:39 UTC (permalink / raw)
To: xfs

On 7/16/2012 4:58 PM, Stefan Ring wrote:
> I'm pretty sure that the data is correct, and the test is not flawed.

That may be true. Nonetheless, what you presented does not paint a coherent picture.

> The only relevant omission is that I've run the test a few times in a
> row.

You also omitted whether you had exclusive access to the P2000 array. The P2000 has 2GB write cache. The numbers you report are far below what this unit is capable of. Your data suggests:

1. You didn't have exclusive access during testing
2. A configuration issue

>> Again, the response times suggest all these writes are being
>> acknowledged by BBWC. Given this is a PCIe RAID HBA, the throughput
>> numbers to BBWC should be hundreds of megs per second.
>
> It's semi-random, quite small writes -- actually not very random, but
> still not exactly linear -- so some performance degradation is
> expected.

The data set I commented on here shows all responses were from BBWC. How can you "expect" degradation from cache?

>> Again, due to the response times, all the writes appear acknowledged by
>> BBWC. While the LSI throughput is better, it is still far far lower
>> than what it should be, i.e. hundreds of megs per second to BBWC.
>
> The cache gets filled up quickly in this case, so it can only accept
> as much data as it manages to write out to the disks.

This is not what the data I quoted shows, Stefan. The data shows all the writes were acked by cache, according to response times.

> Maybe so, but it might also be worthwhile to point out flaws in
> current real hardware when it does not behave the way one would
> expect.

The only "flaw" you've identified, long ago, is that low-end HP hardware-based RAID5/6 is not suitable for metadata-heavy workloads.
Everyone here told you RAID5/6, whether hardware or software, was not a good candidate for such workloads. You played with RAID10 and a concat setup, and received greatly enhanced performance.

It depends on the one, and what the one expects. Most people on this list would never expect parity RAID to perform well with the workloads you're throwing at it. Your expectations are clearly different than most on this list.

The kicker here is that most of the data you presented shows almost all writes being acked by cache, in which case RAID level should be irrelevant, but at the same time showing abysmal throughput. When all writes hit cache, throughput should be through the roof. So again, something is amiss here.

-- 
Stan
* Re: A little RAID experiment
  2012-07-17  1:39 ` Stan Hoeppner
@ 2012-07-17  5:26   ` Dave Chinner
  2012-07-18  2:18     ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2012-07-17 5:26 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs

On Mon, Jul 16, 2012 at 08:39:15PM -0500, Stan Hoeppner wrote:
> It depends on the one, and what the one expects. Most people on this
> list would never expect parity RAID to perform well with the workloads
> you're throwing at it. Your expectations are clearly different than
> most on this list.

Rule of thumb: don't use RAID5/6 for small random write workloads.

> The kicker here is that most of the data you presented shows almost all
> writes being acked by cache, in which case RAID level should be
> irrelevant, but at the same time showing abysmal throughput. When all
> writes hit cache, throughput should be through the roof.

I bet it's single threaded, which means it is:

	sysbench	kernel
	write(2)
			issue io
			wait for completion
	write(2)
			issue io
			wait for completion
	write(2)
	.....

Which means throughput is limited by IO latency, not bandwidth. If it takes 10us to do the write(2), issue and process the IO completion, and it takes 10us for the hardware to do the IO, you're limited to 50,000 IOPS, or 200MB/s. Given that the best being seen is around 35MB/s, you're looking at around 10,000 IOPS of 100us round trip time. At 5MB/s, it's 1200 IOPS or around 800us round trip.

That's why you get different performance from the different raid controllers - some process cache hits a lot faster than others.

As to the one that stalled - when the cache hits a certain level of dirtiness (say 50%), it will start flushing cached writes and depending on the algorithm may start behaving like a FIFO to new requests. i.e. each new request coming in needs to wait for one to drain. At that point, the write rate will tank to maybe 50 IOPS, which will barely register on the benchmark throughput.
(just look at what happens to the IO latency that is measured...)

IOWs, welcome to Understanding RAID Controller Caching Behaviours 101 :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
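Dave's round-trip arithmetic can be checked in a few lines: with a single outstanding request, throughput is just the block size divided by the round-trip latency. The 4 KiB block size comes from the sysbench command line quoted earlier in the thread:

```python
# Single-threaded O_DIRECT throughput is latency-bound: one write
# completes per (software + hardware) round trip.
def throughput_mb_s(block_bytes, round_trip_us):
    """Throughput in decimal MB/s at one outstanding request."""
    iops = 1_000_000 / round_trip_us
    return iops * block_bytes / 1_000_000

def round_trip_us(block_bytes, mb_s):
    """Round-trip latency implied by an observed throughput."""
    # MB = 10^6 bytes and s = 10^6 us, so the factors cancel exactly.
    return block_bytes / mb_s

# 10us software + 10us hardware = 20us round trip:
print(throughput_mb_s(4096, 20))   # 204.8 -- Dave's ~200MB/s ceiling
# The observed rates imply these round trips:
print(round_trip_us(4096, 35.0))   # ~117us (Dave's "around 100us")
print(round_trip_us(4096, 5.0))    # 819.2us (Dave's "around 800us")
```

The same arithmetic explains why the different controllers diverge so much at identical sub-millisecond response times: a difference of a few tens of microseconds per cache hit multiplies directly into the MB/s figure.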
* Re: A little RAID experiment
  2012-07-17  5:26 ` Dave Chinner
@ 2012-07-18  2:18   ` Stan Hoeppner
  2012-07-18  6:44     ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-18 2:18 UTC (permalink / raw)
To: xfs

On 7/17/2012 12:26 AM, Dave Chinner wrote:
...
> I bet it's single threaded, which means it is:

The data given seems to strongly suggest a single thread.

> Which means throughput is limited by IO latency, not bandwidth.
> If it takes 10us to do the write(2), issue and process the IO
> completion, and it takes 10us for the hardware to do the IO, you're
> limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
> is around 35MB/s, you're looking at around 10,000 IOPS of 100us
> round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
> trip.
>
> That's why you get different performance from the different raid
> controllers - some process cache hits a lot faster than others.
...
> IOWs, welcome to Understanding RAID Controller Caching Behaviours
> 101 :)

It would be somewhat interesting to see Stefan's latency and throughput numbers for 4/8/16 threads. Maybe the sysbench "--num-threads=" option is the ticket. The docs state this is for testing scheduler performance, and it's not clear whether this actually does threaded IO. If not, time for a new IO benchmark.

-- 
Stan
* Re: A little RAID experiment
  2012-07-18  2:18 ` Stan Hoeppner
@ 2012-07-18  6:44   ` Stefan Ring
  2012-07-18  7:09     ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18 6:44 UTC (permalink / raw)
To: stan; +Cc: xfs

On Wed, Jul 18, 2012 at 4:18 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 7/17/2012 12:26 AM, Dave Chinner wrote:
> ...
>> I bet it's single threaded, which means it is:
>
> The data given seems to strongly suggest a single thread.
>
>> Which means throughput is limited by IO latency, not bandwidth.
>> If it takes 10us to do the write(2), issue and process the IO
>> completion, and it takes 10us for the hardware to do the IO, you're
>> limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
>> is around 35MB/s, you're looking at around 10,000 IOPS of 100us
>> round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
>> trip.
>>
>> That's why you get different performance from the different raid
>> controllers - some process cache hits a lot faster than others.
> ...
>> IOWs, welcome to Understanding RAID Controller Caching Behaviours
>> 101 :)
>
> It would be somewhat interesting to see Stefan's latency and throughput
> numbers for 4/8/16 threads. Maybe the sysbench "--num-threads=" option
> is the ticket. The docs state this is for testing scheduler
> performance, and it's not clear whether this actually does threaded IO.
> If not, time for a new IO benchmark.

Yes, it is intentionally single-threaded and round-trip-bound, as that is exactly the kind of behavior that XFS chose to display.

I tested with more threads now. It is initially faster, which only serves to hasten the tanking, and the response time goes through the roof. I also needed to increase the --file-num. Apparently the filesystem (ext3) in this case cannot handle concurrent accesses to the same file.
4 threads:

[ 2s] reads: 0.00 MB/s writes: 23.55 MB/s fsyncs: 0.00/s response time: 1.171ms (95%)
[ 4s] reads: 0.00 MB/s writes: 24.35 MB/s fsyncs: 0.00/s response time: 1.129ms (95%)
[ 6s] reads: 0.00 MB/s writes: 24.55 MB/s fsyncs: 0.00/s response time: 1.141ms (95%)
[ 8s] reads: 0.00 MB/s writes: 25.73 MB/s fsyncs: 0.00/s response time: 1.088ms (95%)
[ 10s] reads: 0.00 MB/s writes: 6.14 MB/s fsyncs: 0.00/s response time: 0.994ms (95%)
[ 12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 2735.611ms (95%)
[ 14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3800.107ms (95%)
[ 16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 4404.397ms (95%)
[ 18s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 3153.588ms (95%)
[ 20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 4769.433ms (95%)

8 threads:

[ 2s] reads: 0.00 MB/s writes: 26.99 MB/s fsyncs: 0.00/s response time: 2.451ms (95%)
[ 4s] reads: 0.00 MB/s writes: 28.12 MB/s fsyncs: 0.00/s response time: 3.153ms (95%)
[ 6s] reads: 0.00 MB/s writes: 25.97 MB/s fsyncs: 0.00/s response time: 2.965ms (95%)
[ 8s] reads: 0.00 MB/s writes: 23.23 MB/s fsyncs: 0.00/s response time: 2.560ms (95%)
[ 10s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 791.041ms (95%)
[ 12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3458.162ms (95%)
[ 14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 5519.598ms (95%)
[ 16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3219.401ms (95%)
[ 18s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 10235.289ms (95%)
[ 20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3765.007ms (95%)

16 threads:

[ 2s] reads: 0.00 MB/s writes: 34.27 MB/s fsyncs: 0.00/s response time: 3.899ms (95%)
[ 4s] reads: 0.00 MB/s writes: 28.62 MB/s fsyncs: 0.00/s response time: 6.910ms (95%)
[ 6s] reads: 0.00 MB/s writes: 27.94 MB/s fsyncs: 0.00/s response time: 6.869ms (95%)
[ 8s] reads: 0.00 MB/s writes: 13.50 MB/s fsyncs: 0.00/s response time: 7.594ms (95%)
[ 10s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 2308.573ms (95%)
[ 12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 4811.016ms (95%)
[ 14s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 4635.714ms (95%)
[ 16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3200.185ms (95%)
[ 18s] reads: 0.00 MB/s writes: 0.03 MB/s fsyncs: 0.00/s response time: 9623.207ms (95%)
[ 20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 8053.211ms (95%)
* Re: A little RAID experiment
  2012-07-18  6:44 ` Stefan Ring
@ 2012-07-18  7:09   ` Stan Hoeppner
  2012-07-18  7:22     ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-18 7:09 UTC (permalink / raw)
To: xfs

On 7/18/2012 1:44 AM, Stefan Ring wrote:
> On Wed, Jul 18, 2012 at 4:18 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 7/17/2012 12:26 AM, Dave Chinner wrote:
>> ...
>>> I bet it's single threaded, which means it is:
>>
>> The data given seems to strongly suggest a single thread.
>>
>>> Which means throughput is limited by IO latency, not bandwidth.
>>> If it takes 10us to do the write(2), issue and process the IO
>>> completion, and it takes 10us for the hardware to do the IO, you're
>>> limited to 50,000 IOPS, or 200MB/s. Given that the best being seen
>>> is around 35MB/s, you're looking at around 10,000 IOPS of 100us
>>> round trip time. At 5MB/s, it's 1200 IOPS or around 800us round
>>> trip.
>>>
>>> That's why you get different performance from the different raid
>>> controllers - some process cache hits a lot faster than others.
>> ...
>>> IOWs, welcome to Understanding RAID Controller Caching Behaviours
>>> 101 :)
>>
>> It would be somewhat interesting to see Stefan's latency and throughput
>> numbers for 4/8/16 threads. Maybe the sysbench "--num-threads=" option
>> is the ticket. The docs state this is for testing scheduler
>> performance, and it's not clear whether this actually does threaded IO.
>> If not, time for a new IO benchmark.
>
> Yes, it is intentionally single-threaded and round-trip-bound, as that
> is exactly the kind of behavior that XFS chose to display.

You're referring to your original huge-metadata problem? IIRC your workload there was a single thread, wasn't it?

> I tested with more threads now. It is initially faster, which only
> serves to hasten the tanking, and the response time goes through the
> roof. I also needed to increase the --file-num. Apparently the
> filesystem (ext3) in this case cannot handle concurrent accesses to
> the same file.

*Gasp* EXT3? Not XFS? Why are you posting this thread on XFS? The two will likely have (significantly) different behavior.

Also, to make any meaningful comparison, we kinda need to know which controller was targeted by these 3 runs below. ;)

> 4 threads:
>
> [ 2s] reads: 0.00 MB/s writes: 23.55 MB/s fsyncs: 0.00/s response time: 1.171ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 24.35 MB/s fsyncs: 0.00/s response time: 1.129ms (95%)
> [ 6s] reads: 0.00 MB/s writes: 24.55 MB/s fsyncs: 0.00/s response time: 1.141ms (95%)
> [ 8s] reads: 0.00 MB/s writes: 25.73 MB/s fsyncs: 0.00/s response time: 1.088ms (95%)
> [ 10s] reads: 0.00 MB/s writes: 6.14 MB/s fsyncs: 0.00/s response time: 0.994ms (95%)
> [ 12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 2735.611ms (95%)
> [ 14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3800.107ms (95%)
> [ 16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 4404.397ms (95%)
> [ 18s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 3153.588ms (95%)
> [ 20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 4769.433ms (95%)
>
> 8 threads:
>
> [ 2s] reads: 0.00 MB/s writes: 26.99 MB/s fsyncs: 0.00/s response time: 2.451ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 28.12 MB/s fsyncs: 0.00/s response time: 3.153ms (95%)
> [ 6s] reads: 0.00 MB/s writes: 25.97 MB/s fsyncs: 0.00/s response time: 2.965ms (95%)
> [ 8s] reads: 0.00 MB/s writes: 23.23 MB/s fsyncs: 0.00/s response time: 2.560ms (95%)
> [ 10s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 791.041ms (95%)
> [ 12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3458.162ms (95%)
> [ 14s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 5519.598ms (95%)
> [ 16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3219.401ms (95%)
> [ 18s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 10235.289ms (95%)
> [ 20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3765.007ms (95%)
>
> 16 threads:
>
> [ 2s] reads: 0.00 MB/s writes: 34.27 MB/s fsyncs: 0.00/s response time: 3.899ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 28.62 MB/s fsyncs: 0.00/s response time: 6.910ms (95%)
> [ 6s] reads: 0.00 MB/s writes: 27.94 MB/s fsyncs: 0.00/s response time: 6.869ms (95%)
> [ 8s] reads: 0.00 MB/s writes: 13.50 MB/s fsyncs: 0.00/s response time: 7.594ms (95%)
> [ 10s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 2308.573ms (95%)
> [ 12s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 4811.016ms (95%)
> [ 14s] reads: 0.00 MB/s writes: 0.00 MB/s fsyncs: 0.00/s response time: 4635.714ms (95%)
> [ 16s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 3200.185ms (95%)
> [ 18s] reads: 0.00 MB/s writes: 0.03 MB/s fsyncs: 0.00/s response time: 9623.207ms (95%)
> [ 20s] reads: 0.00 MB/s writes: 0.01 MB/s fsyncs: 0.00/s response time: 8053.211ms (95%)

-- 
Stan
* Re: A little RAID experiment
  2012-07-18  7:09 ` Stan Hoeppner
@ 2012-07-18  7:22 ` Stefan Ring
  2012-07-18 10:24 ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18 7:22 UTC (permalink / raw)
To: stan; +Cc: xfs

> *Gasp* EXT3? Not XFS? Why are you posting this thread on XFS? The two
> will likely have (significantly) different behavior.

Because it was XFS originally which hammered the controller with this
disadvantageous pattern. Except for the concurrency, it doesn't matter
much on which filesystem sysbench operates. I've previously verified
this on another system.

> Also, to make any meaningful comparison, we kinda need to know which
> controller was targeted by these 3 runs below. ;)

It was the Fibre Channel controller, the one with the collapsing
throughput. (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
Channel to PCI Express HBA)
* Re: A little RAID experiment
  2012-07-18  7:22 ` Stefan Ring
@ 2012-07-18 10:24 ` Stan Hoeppner
  2012-07-18 12:32 ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-18 10:24 UTC (permalink / raw)
To: xfs

On 7/18/2012 2:22 AM, Stefan Ring wrote:

> Because it was XFS originally which hammered the controller with this
> disadvantageous pattern.

Do you feel you have researched and tested this theory thoroughly
enough to draw such a conclusion? Note the LSI numbers with a single
thread compared to the P400. It seems at this point the LSI has no
problem with the pattern. How about threaded results?

> Except for the concurrency, it doesn't matter much on which
> filesystem sysbench operates. I've previously verified this on
> another system.

It's hard to believe a 4 generation old (6-7 years) LSI ASIC with
256/512MB cache is able to sink this workload without ever stalling
when flushing to rust, where the HP P2000 FC SAN array shows pretty
sad performance. I'd really like to see the threaded results for the
LSI at this point. I think that would be informative.

> It was the Fibre Channel controller, the one with the collapsing
> throughput. (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
> Channel to PCI Express HBA)

Given the LSI 1078 based RAID card with 1 thread runs circles around
the P2000 with 4, 8, or 16 threads, and never stalls, with responses
less than 1ms, meaning all writes hit cache, it would seem other
workloads are hitting the P2000 simultaneously with your test,
limiting your performance. Either that or some kind of quotas have
been set on the LUNs to prevent one host from saturating the
controllers. Or both.

This is why I asked about exclusive access. Without it your results
for the P2000 are literally worthless. Lacking complete configuration
info puts you in the same boat. You simply can't draw any realistic
conclusions about the P2000 performance without having complete
control of the device for dedicated testing purposes.

You have such control of the P400 and LSI, do you not? Concentrate
your testing and comparisons on those.

-- 
Stan
* Re: A little RAID experiment
  2012-07-18 10:24 ` Stan Hoeppner
@ 2012-07-18 12:32 ` Stefan Ring
  2012-07-18 12:37 ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18 12:32 UTC (permalink / raw)
To: stan; +Cc: xfs

> Given the LSI 1078 based RAID card with 1 thread runs circles around the
> P2000 with 4, 8, or 16 threads, and never stalls, with responses less
> than 1ms, meaning all writes hit cache, it would seem other workloads
> are hitting the P2000 simultaneously with your test, limiting your
> performance. Either that or some kind of quotas have been set on the
> LUNs to prevent one host from saturating the controllers. Or both.

Maybe there exists a load quota of some kind as you are suggesting,
but from what I've seen from screenshots in the installation manuals,
I don't remember any of this.

> This is why I asked about exclusive access. Without it your results for
> the P2000 are literally worthless. Lacking complete configuration info
> puts you in the same boat. You simply can't draw any realistic
> conclusions about the P2000 performance without having complete control
> of the device for dedicated testing purposes.

That's a reasonable suggestion. Alas, I'm not expecting to get that
level of access to the device. I know for a fact though, that it is
only connected to a single machine, which is otherwise completely idle
and controlled by "us" (the company I work for). But even so, I cannot
set up XFS there on a whim because it's in preparation for production
use.

> You have such control of the P400 and LSI do you not? Concentrate your
> testing and comparisons on those.

The period of full control over the P400 is over, but at least I know
how it is configured. The LSI is in production (meaning: untouchable),
but seems reasonably configured.
At least I have some multi-threaded results from the other two machines:

LSI:

4 threads

[ 2s] reads: 0.00 MB/s writes: 63.08 MB/s fsyncs: 0.00/s response time: 0.452ms (95%)
[ 4s] reads: 0.00 MB/s writes: 34.26 MB/s fsyncs: 0.00/s response time: 1.660ms (95%)
[ 6s] reads: 0.00 MB/s writes: 33.92 MB/s fsyncs: 0.00/s response time: 1.478ms (95%)
[ 8s] reads: 0.00 MB/s writes: 36.34 MB/s fsyncs: 0.00/s response time: 1.589ms (95%)
[ 10s] reads: 0.00 MB/s writes: 34.99 MB/s fsyncs: 0.00/s response time: 1.621ms (95%)
[ 12s] reads: 0.00 MB/s writes: 36.41 MB/s fsyncs: 0.00/s response time: 1.639ms (95%)

8 threads

[ 2s] reads: 0.00 MB/s writes: 45.34 MB/s fsyncs: 0.00/s response time: 2.749ms (95%)
[ 4s] reads: 0.00 MB/s writes: 32.15 MB/s fsyncs: 0.00/s response time: 4.579ms (95%)
[ 6s] reads: 0.00 MB/s writes: 33.64 MB/s fsyncs: 0.00/s response time: 4.644ms (95%)
[ 8s] reads: 0.00 MB/s writes: 35.20 MB/s fsyncs: 0.00/s response time: 4.131ms (95%)
[ 10s] reads: 0.00 MB/s writes: 33.88 MB/s fsyncs: 0.00/s response time: 3.876ms (95%)
[ 12s] reads: 0.00 MB/s writes: 33.65 MB/s fsyncs: 0.00/s response time: 4.929ms (95%)

16 threads

[ 2s] reads: 0.00 MB/s writes: 36.90 MB/s fsyncs: 0.00/s response time: 3.510ms (95%)
[ 4s] reads: 0.00 MB/s writes: 35.36 MB/s fsyncs: 0.00/s response time: 8.629ms (95%)
[ 6s] reads: 0.00 MB/s writes: 32.27 MB/s fsyncs: 0.00/s response time: 10.091ms (95%)
[ 8s] reads: 0.00 MB/s writes: 34.79 MB/s fsyncs: 0.00/s response time: 9.499ms (95%)
[ 10s] reads: 0.00 MB/s writes: 35.62 MB/s fsyncs: 0.00/s response time: 8.801ms (95%)
[ 12s] reads: 0.00 MB/s writes: 34.64 MB/s fsyncs: 0.00/s response time: 9.488ms (95%)

... and so on. Nothing noteworthy after that. Response time is
higher, throughput stays the same.
P400:

4 threads

[ 2s] reads: 0.00 MB/s writes: 33.59 MB/s fsyncs: 0.00/s response time: 0.255ms (95%)
[ 4s] reads: 0.00 MB/s writes: 5.11 MB/s fsyncs: 0.00/s response time: 12.853ms (95%)
[ 6s] reads: 0.00 MB/s writes: 5.45 MB/s fsyncs: 0.00/s response time: 0.677ms (95%)
[ 8s] reads: 0.00 MB/s writes: 5.16 MB/s fsyncs: 0.00/s response time: 0.902ms (95%)
[ 10s] reads: 0.00 MB/s writes: 4.56 MB/s fsyncs: 0.00/s response time: 58.242ms (95%)
[ 12s] reads: 0.00 MB/s writes: 5.30 MB/s fsyncs: 0.00/s response time: 0.669ms (95%)
[ 14s] reads: 0.00 MB/s writes: 5.22 MB/s fsyncs: 0.00/s response time: 0.743ms (95%)
[ 16s] reads: 0.00 MB/s writes: 4.73 MB/s fsyncs: 0.00/s response time: 57.877ms (95%)
[ 18s] reads: 0.00 MB/s writes: 4.39 MB/s fsyncs: 0.00/s response time: 58.417ms (95%)
[ 20s] reads: 0.00 MB/s writes: 4.56 MB/s fsyncs: 0.00/s response time: 57.704ms (95%)
[ 22s] reads: 0.00 MB/s writes: 4.81 MB/s fsyncs: 0.00/s response time: 57.429ms (95%)
[ 24s] reads: 0.00 MB/s writes: 4.53 MB/s fsyncs: 0.00/s response time: 57.895ms (95%)

Some response time fluctuation at first, but it settles quickly.
8 threads

[ 2s] reads: 0.00 MB/s writes: 38.61 MB/s fsyncs: 0.00/s response time: 0.969ms (95%)
[ 4s] reads: 0.00 MB/s writes: 4.98 MB/s fsyncs: 0.00/s response time: 59.886ms (95%)
[ 6s] reads: 0.00 MB/s writes: 4.69 MB/s fsyncs: 0.00/s response time: 60.300ms (95%)
[ 8s] reads: 0.00 MB/s writes: 4.57 MB/s fsyncs: 0.00/s response time: 60.246ms (95%)
[ 10s] reads: 0.00 MB/s writes: 4.46 MB/s fsyncs: 0.00/s response time: 60.626ms (95%)
[ 12s] reads: 0.00 MB/s writes: 4.46 MB/s fsyncs: 0.00/s response time: 60.445ms (95%)
[ 14s] reads: 0.00 MB/s writes: 4.61 MB/s fsyncs: 0.00/s response time: 60.662ms (95%)
[ 16s] reads: 0.00 MB/s writes: 4.35 MB/s fsyncs: 0.00/s response time: 60.571ms (95%)
[ 18s] reads: 0.00 MB/s writes: 4.87 MB/s fsyncs: 0.00/s response time: 60.156ms (95%)
[ 20s] reads: 0.00 MB/s writes: 4.77 MB/s fsyncs: 0.00/s response time: 60.210ms (95%)
[ 22s] reads: 0.00 MB/s writes: 4.58 MB/s fsyncs: 0.00/s response time: 60.463ms (95%)
[ 24s] reads: 0.00 MB/s writes: 4.65 MB/s fsyncs: 0.00/s response time: 60.264ms (95%)

16 threads

[ 2s] reads: 0.00 MB/s writes: 17.35 MB/s fsyncs: 0.00/s response time: 7.764ms (95%)
[ 4s] reads: 0.00 MB/s writes: 5.17 MB/s fsyncs: 0.00/s response time: 62.655ms (95%)
[ 6s] reads: 0.00 MB/s writes: 5.15 MB/s fsyncs: 0.00/s response time: 62.749ms (95%)
[ 8s] reads: 0.00 MB/s writes: 4.89 MB/s fsyncs: 0.00/s response time: 63.258ms (95%)
[ 10s] reads: 0.00 MB/s writes: 4.98 MB/s fsyncs: 0.00/s response time: 62.862ms (95%)
[ 12s] reads: 0.00 MB/s writes: 5.26 MB/s fsyncs: 0.00/s response time: 63.032ms (95%)
[ 14s] reads: 0.00 MB/s writes: 5.27 MB/s fsyncs: 0.00/s response time: 62.599ms (95%)
[ 16s] reads: 0.00 MB/s writes: 4.80 MB/s fsyncs: 0.00/s response time: 63.088ms (95%)
[ 18s] reads: 0.00 MB/s writes: 4.84 MB/s fsyncs: 0.00/s response time: 63.239ms (95%)
[ 20s] reads: 0.00 MB/s writes: 5.24 MB/s fsyncs: 0.00/s response time: 62.712ms (95%)
[ 22s] reads: 0.00 MB/s writes: 4.25 MB/s fsyncs: 0.00/s response time: 63.619ms (95%)
[ 24s] reads: 0.00 MB/s writes: 4.90 MB/s fsyncs: 0.00/s response time: 63.202ms (95%)

Pretty boring.
* Re: A little RAID experiment
  2012-07-18 12:32 ` Stefan Ring
@ 2012-07-18 12:37 ` Stefan Ring
  2012-07-19  3:08 ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-18 12:37 UTC (permalink / raw)
To: xfs

> At least I have some multi-threaded results from the other two machines:
>
> LSI:
>
> 4 threads
>
> [ 2s] reads: 0.00 MB/s writes: 63.08 MB/s fsyncs: 0.00/s response time: 0.452ms (95%)
> [ 4s] reads: 0.00 MB/s writes: 34.26 MB/s fsyncs: 0.00/s response time: 1.660ms (95%)

And because of the bad formatting:
https://github.com/Ringdingcoder/sysbench/blob/master/mail2.txt
* Re: A little RAID experiment
  2012-07-18 12:37 ` Stefan Ring
@ 2012-07-19  3:08 ` Stan Hoeppner
  2012-07-25  9:29 ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-19 3:08 UTC (permalink / raw)
To: xfs

Sorry for any potential dups. Mail log shows this msg was accepted 3.5
hours ago but it hasn't spit back to me yet and no bounce. Resending.

On 7/18/2012 7:37 AM, Stefan Ring wrote:
>> At least I have some multi-threaded results from the other two machines:
>>
>> LSI:
>>
>> 4 threads
>>
>> [ 2s] reads: 0.00 MB/s writes: 63.08 MB/s fsyncs: 0.00/s response time: 0.452ms (95%)
>> [ 4s] reads: 0.00 MB/s writes: 34.26 MB/s fsyncs: 0.00/s response time: 1.660ms (95%)
>
> And because of the bad formatting:
> https://github.com/Ringdingcoder/sysbench/blob/master/mail2.txt

And this is why people publishing real, useable benchmark results
publish all specs of the hardware/software environment being tested.
I think I've mentioned once or twice how critical accurate/complete
information is. Looking at the table linked above, two things become
clear:

1. The array spindle config of the 3 systems is wildly different.
   a. P400  = 6x 10k SAS RAID6
   b. P2000 = 12x 7.2k SATA RAID6
   c. LSI   = unknown

2. The LSI outperforms the other two by a wide margin, yet we know
   nothing of the disks attached.

At first blush, and assuming the disk config is similar to the other
two systems, the controller firmware *appears* to perform magic. But
without knowing the spindle config of the LSI we simply can't draw any
conclusions yet.

This benchmark test seems to involve no or little metadata IO, so few
RMW cycles, and RAID6 doesn't kill us. So if the LSI has the common 24
bay 2.5" JBOD shelf attached, with 2 spares and 22x 15K SAS drives (20
stripe spindles) in RAID6, this alone may fully explain the
performance gap, due to 6.7x the seek performance against the 6x 10k
drives (4 spindles) in RAID6 on the P400. This would also equal 4x the
seek performance of the 12 disks (10 spindles) of the P2000.

Given the results for the P2000, it seems clear that the LUN you're
hitting is not striped across 10 spindles. It would seem that the 12
drives have been split up into two or more RAID arrays, probably 2x 6
drive RAID6s, and your test LUN sits on one of them, yielding 4x 7.2k
stripe spindles. If it spanned 10 of 12 drives in a RAID6, it
shouldn't stall as shown in your data. The "tell" here is that the
P2000 with 10 7.2k drives has 1.7x the seek performance of the 4
spindles in your P400, which outruns the P2000 once cache is full.

The P2000 controller has over 4x the write cache of the P400, which is
clearly demonstrated in your data: From 2s to 8s, the P2000 averages
~25MB/s throughput with sub 10ms latency. At 10s and up, latency jumps
to multiple *seconds* and throughput drops to "zero". This clearly
shows that when cache is full and must flush, the drives are simply
overwhelmed. 10x 7.2k striped SATA spindles would not perform this
badly. Thus it seems clear your LUN sits on only 4 of the 12 spindles.

The cached performance of the P2000 is about 50% of the LSI, and the
LSI has 4x less cache memory. This could be due to cache mirroring
between the two controllers eating 50% of the cache RAM bandwidth.

So in summary, it would be nice to know the disk config of the LSI.
Once we have complete hardware information, it may likely turn out
that the bulk of the performance differences simply come down to what
disks are attached to each controller.

BTW, you provided lspci output of the chip on the RAID card. Please
provide the actual model# of the LSI card. Dozens of LSI and OEM cards
on the market have used the SAS1078 ASIC. The card you have may not
even be an LSI card, or may even be embedded. We can't tell from the
info given.

The devil is always in the details, Stefan. ;)

-- 
Stan
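Stan's spindle arithmetic above can be made explicit with a toy
calculation. The per-drive random-seek rates below are rough assumed
figures chosen to match the ratios he cites, not measurements from any
of the systems under test:

```python
# Back-of-the-envelope seek-capacity comparison along the lines of the
# reasoning above. Spindle counts and per-drive seek rates are rough
# assumptions (typical random-IOPS figures per drive class), not data
# from the actual arrays.

SEEKS_PER_SEC = {"15k": 200, "10k": 150, "7.2k": 100}  # assumed per-drive IOPS

def array_seek_rate(stripe_spindles, drive_class):
    """Aggregate random-seek capacity of the data spindles in an array."""
    return stripe_spindles * SEEKS_PER_SEC[drive_class]

lsi   = array_seek_rate(20, "15k")   # hypothetical 22-drive RAID6, 20 data spindles
p400  = array_seek_rate(4, "10k")    # 6x 10k RAID6 -> 4 data spindles
p2000 = array_seek_rate(10, "7.2k")  # 12x 7.2k RAID6 -> 10 data spindles

print(lsi / p400, lsi / p2000)  # ~6.7 and 4.0, the ratios cited above
```

With these assumed figures the hypothetical 20-spindle 15k array has
about 6.7x the seek capacity of the P400's four 10k spindles and 4x
that of ten 7.2k spindles, matching the proportions in the text.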
* Re: A little RAID experiment
  2012-07-19  3:08 ` Stan Hoeppner
@ 2012-07-25  9:29 ` Stefan Ring
  2012-07-25 10:00 ` Stan Hoeppner
  2012-07-26  8:32 ` Dave Chinner
  0 siblings, 2 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-25 9:29 UTC (permalink / raw)
To: Linux fs XFS

There appears to be a bit of a tension in this thread, and I have the
suspicion that it's a case of mismatched presumed expectations. The
sole purpose of my activity here over the last months was to present
some findings which I thought would be interesting to XFS developers.
If I were working on XFS, I would be interested. From most of the
answers, though, I get the impression that I am perceived as looking
for help tuning my XFS setup, which is not the case at all. In fact,
I'm quite happy with it. Let me recap just to give this thread the
intended tone:

This episode of my journey with XFS started when I read that there had
been recent significant performance improvements to XFS' metadata
operations. Having tried XFS every couple of years or so before, and
always with the same verdict -- horribly slow -- I was curious if it
had finally become usable.

A new server machine arriving just at the right time would serve as
the perfect testbed. I threw some workloads at it, which I hoped would
resemble my typical workload, and I focussed especially on areas which
bothered me the most on our current development server running ext3.
Everything worked more or less satisfactorily, except for the case of
un-tarring a metadata-heavy tarball in the presence of considerable
free-space fragmentation.

In this particular case, performance was conspicuously poor, and after
some digging with blktrace and seekwatcher, I identified the cause of
this slowness to be a write pattern that looked like this (in block
numbers), where the step width (arbitrarily displayed as 10000 here
for illustration purposes) was 1/4 of the size of the volume, clearly
because the volume had 4 allocation groups (the default). Of course it
was not entirely regular, but overall it was very similar to this:

10001
20001
30001
40001
10002
20002
30002
40002
10003
20003
...

I tuned and tweaked everything I could think of -- elevator settings,
readahead, su/sw, barrier, RAID hardware cache -- but the behavior
would always be the same. It just so happens that the RAID controller
in this machine (HP SmartArray P400) doesn't cope very well with a
write pattern like this. To it, the sequence appears to be random, and
it performs even worse than it would if it were actually random.

Going by what I think I know about the topic, it struck me as odd that
blocks would be sent to disk in this very unfavorable order. To my
mind, three entities had failed at sanitizing the write sequence: the
filesystem, the block layer and the RAID controller. My opinion is
still unchanged regarding the latter two.

The strikingly bad performance on the RAID controller piqued my
interest, and I went on a different journey investigating this oddity
and created a minor sysbench modification that would just measure
performance for this particular pattern. Not many people helped with
my experiment, and I was accused of wanting ponies. If I'm the only
one who is curious about this, then so be it. I deemed it worthwhile
sharing my experience and pointing out that a sequence like the one
above is a death blow to all HP gear I've got my hands on so far.

It has been pointed out that XFS schedules the writes like this on
purpose so that they can be done in parallel, and that I should create
a concatenated volume with physical devices matching the allocation
groups. I actually went through this exercise, and yes, it was very
beneficial, but that's not the point. I don't want to (have to) do
that. And it's not always feasible, anyway. What about home usage with
a single SATA disk? Is it not worthwhile to perform well on low-end
devices?

You might ask then, why even bother using XFS instead of ext4?

I care about the multi-user case. The problem I have with ext is that
it is unbearably unresponsive when someone writes a semi-large amount
of data (a few gigs) at once -- like extracting a large-ish tarball.
Just using vim, even with :set nofsync, is almost impossible during
that time. I have adopted various disgusting hacks like extracting to
a ramdisk instead and rsyncing the lot over to the real disk with a
very low --bwlimit, but I'm thoroughly fed up with this kind of crap,
and in general, XFS works very well.

If no one cares about my findings, I will henceforth be quiet on this
topic.
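The cost of the strided pattern Stefan describes can be sketched
numerically. This toy Python model uses the same illustrative numbers
as his recap (4 allocation groups, step width 10000; it is not derived
from any real trace) and compares total head travel for the
round-robin AG order against the same blocks written sequentially:

```python
# Sketch of the write pattern from the recap: with 4 allocation
# groups, successive allocations stride across the volume in steps of
# ~1/4 of its size. Numbers are illustrative, as in the original post.

def ag_strided_order(num_ags, blocks_per_ag):
    """Yield block numbers in the round-robin AG order seen in the trace."""
    for i in range(blocks_per_ag):
        for ag in range(num_ags):
            yield ag * blocks_per_ag + i

def total_seek_distance(blocks):
    """Sum of absolute jumps between consecutive block numbers."""
    blocks = list(blocks)
    return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))

num_ags, per_ag = 4, 10_000
strided = total_seek_distance(ag_strided_order(num_ags, per_ag))
sequential = total_seek_distance(range(num_ags * per_ag))

print(strided / sequential)  # ~15000x more head travel for the same blocks
```

Even though each individual jump is "only" a quarter of the volume,
the accumulated travel ends up orders of magnitude larger than writing
the same blocks in order, which is consistent with a controller
treating the stream as effectively random.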
* Re: A little RAID experiment
  2012-07-25  9:29 ` Stefan Ring
@ 2012-07-25 10:00 ` Stan Hoeppner
  2012-07-25 10:08 ` Stefan Ring
  2012-07-26  8:32 ` Dave Chinner
  1 sibling, 1 reply; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-25 10:00 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS

Hi Stefan,

On 7/25/2012 4:29 AM, Stefan Ring wrote:
> There appears to be a bit of a tension in this thread, and I have the
> suspicion that it's a case of mismatched presumed expectations. The
> sole purpose of my activity here over the last months was to present
> some findings which I thought would be interesting to XFS developers.
> If I were working on XFS, I would be interested. From most of the
> answers, though, I get the impression that I am perceived as looking
> for help tuning my XFS setup, which is not the case at all. In fact,
> I'm quite happy with it. Let me recap just to give this thread the
> intended tone:

I don't want to top post, but I don't want to trim a bunch lest it
appear I'm ignoring significant points you make, so I'll start here,
and flow, but maybe not respond to each point.

I didn't intend to create tension, and I apologize for any sarcasm in
my last post. I think you may be on to something, and I do find your
research efforts worthwhile. However...

The single point I was attempting to make in my last post was that for
your data and conclusions to have any validity, you need to provide
all of the details of your testing environment. You made head-to-head
comparisons and performance conclusions of 3 RAID systems, but omitted
critical details that are needed to interpret and compare the
performance data. Some of this data you simply didn't have access to.
In a situation like that, you simply shouldn't include that system in
your presentation. WRT the LSI controller, you didn't mention RAID
level or number of disks.

You simply must present complete information. The omission of such is
likely why most ignored your post but for me. I'm the hardwarefreak
after all, so I'm always game for RAID discussions. ;)

If you can represent with complete specs and data, so that it paints a
coherent picture, you may see more willing participation.

> This episode of my journey with XFS started when I read that there had
> been recent significant performance improvements to XFS' metadata
> operations. Having tried XFS every couple of years or so before, and
> always with the same verdict -- horribly slow -- I was curious if it
> had finally become usable.
>
> A new server machine arriving just at the right time would serve as
> the perfect testbed. I threw some workloads at it, which I hoped would
> resemble my typical workload, and I focussed especially on areas which
> bothered me the most on our current development server running ext3.
> Everything worked more or less satisfactorily, except for the case of
> un-tarring a metadata-heavy tarball in the presence of considerable
> free-space fragmentation.
>
> In this particular case, performance was conspicuously poor, and after
> some digging with blktrace and seekwatcher, I identified the cause of
> this slowness to be a write pattern that looked like this (in block
> numbers), where the step width (arbitrarily displayed as 10000 here
> for illustration purposes) was 1/4 of the size of the volume, clearly
> because the volume had 4 allocation groups (the default). Of course it
> was not entirely regular, but overall it was very similar to this:
>
> 10001
> 20001
> 30001
> 40001
> 10002
> 20002
> 30002
> 40002
> 10003
> 20003
> ...
>
> I tuned and tweaked everything I could think of -- elevator settings,
> readahead, su/sw, barrier, RAID hardware cache -- but the behavior
> would always be the same. It just so happens that the RAID controller
> in this machine (HP SmartArray P400) doesn't cope very well with a
> write pattern like this. To it, the sequence appears to be random, and
> it performs even worse than it would if it were actually random.
>
> Going by what I think I know about the topic, it struck me as odd
> that blocks would be sent to disk in this very unfavorable order. To
> my mind, three entities had failed at sanitizing the write sequence:
> the filesystem, the block layer and the RAID controller. My opinion is
> still unchanged regarding the latter two.
>
> The strikingly bad performance on the RAID controller piqued my
> interest, and I went on a different journey investigating this oddity
> and created a minor sysbench modification that would just measure
> performance for this particular pattern. Not many people helped with
> my experiment, and I was accused of wanting ponies. If I'm the only
> one who is curious about this, then so be it. I deemed it worthwhile
> sharing my experience and pointing out that a sequence like the one
> above is a death blow to all HP gear I've got my hands on so far.
>
> It has been pointed out that XFS schedules the writes like this on
> purpose so that they can be done in parallel, and that I should create
> a concatenated volume with physical devices matching the allocation
> groups. I actually went through this exercise, and yes, it was very
> beneficial, but that's not the point. I don't want to (have to) do
> that. And it's not always feasible, anyway. What about home usage with
> a single SATA disk? Is it not worthwhile to perform well on low-end
> devices?
>
> You might ask then, why even bother using XFS instead of ext4?
>
> I care about the multi-user case. The problem I have with ext is that
> it is unbearably unresponsive when someone writes a semi-large amount
> of data (a few gigs) at once -- like extracting a large-ish tarball.
> Just using vim, even with :set nofsync, is almost impossible during
> that time. I have adopted various disgusting hacks like extracting to
> a ramdisk instead and rsyncing the lot over to the real disk with a
> very low --bwlimit, but I'm thoroughly fed up with this kind of crap,
> and in general, XFS works very well.
>
> If no one cares about my findings, I will henceforth be quiet on this
> topic.

Again, it's not that nobody cares. It's that your findings have no
weight, no merit, in absence of complete storage system and software
stack configuration specs.

-- 
Stan
* Re: A little RAID experiment
  2012-07-25 10:00 ` Stan Hoeppner
@ 2012-07-25 10:08 ` Stefan Ring
  2012-07-25 11:00 ` Stan Hoeppner
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-07-25 10:08 UTC (permalink / raw)
To: stan; +Cc: Linux fs XFS

> You simply must present complete information. The omission of such is
> likely why most ignored your post but for me. I'm the hardwarefreak
> after all, so I'm always game for RAID discussions. ;)
>
> If you can represent with complete specs and data, so that it paints a
> coherent picture, you may see more willing participation.

I agree, no offense taken. I will respond to your previous message
individually, although it won't get as complete as you'd like because
I simply don't know and cannot find out much about some of the
systems.
* Re: A little RAID experiment
  2012-07-25 10:08 ` Stefan Ring
@ 2012-07-25 11:00 ` Stan Hoeppner
  0 siblings, 0 replies; 41+ messages in thread
From: Stan Hoeppner @ 2012-07-25 11:00 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS

On 7/25/2012 5:08 AM, Stefan Ring wrote:
>> You simply must present complete information. The omission of such is
>> likely why most ignored your post but for me. I'm the hardwarefreak
>> after all, so I'm always game for RAID discussions. ;)
>>
>> If you can represent with complete specs and data, so that it paints a
>> coherent picture, you may see more willing participation.
>
> I agree, no offense taken. I will respond to your previous message
> individually, although it won't get as complete as you'd like because
> I simply don't know and cannot find out much about some of the
> systems.

No need for such a response, nobody would read it. Instead, just
present all the relevant hardware and config specs for the two HBA
RAID cards (as you have nothing on the P2000), and provide that link
again to the sysbench output. Short, sweet, simple. If you make posts
too long and rambling you get ignored. I know from experience. :(

Model:
Cache Size:
#/type of drives:
RAID level:
Cache mode:

That should be sufficient.

-- 
Stan
* Re: A little RAID experiment
  2012-07-25  9:29 ` Stefan Ring
  2012-07-25 10:00 ` Stan Hoeppner
@ 2012-07-26  8:32 ` Dave Chinner
  2012-09-11 16:37 ` Stefan Ring
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2012-07-26 8:32 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS

On Wed, Jul 25, 2012 at 11:29:58AM +0200, Stefan Ring wrote:
> In this particular case, performance was conspicuously poor, and after
> some digging with blktrace and seekwatcher, I identified the cause of
> this slowness to be a write pattern that looked like this (in block
> numbers), where the step width (arbitrarily displayed as 10000 here
> for illustration purposes) was 1/4 of the size of the volume, clearly
> because the volume had 4 allocation groups (the default). Of course it
> was not entirely regular, but overall it was very similar to this:
>
> 10001
> 20001
> 30001
> 40001
> 10002
> 20002
> 30002
> 40002
> 10003
> 20003
> ...

That's the problem you should have reported, not something artificial
from a benchmark. What you seemed to report was "random writes behave
differently on different RAID setups", not "writeback is not sorting
efficiently".

Indeed, if the above is metadata, then there's something really weird
going on, because metadata writeback is not sorted that way by XFS,
and nothing should cause writeback in that style. i.e. if it is
metadata, it should be:

10001 (queue)
10002 (merge)
10003 (merge)
....
20001 (queue)
20002 (merge)
20003 (merge)
....

and so on for any metadata dispatched in close temporal proximity.

If it is data writeback, then there's still something funny going on,
as it implies that the temporal data locality the allocator provides
is non-existent. i.e. inodes that are dirtied sequentially in the same
directory should be written in the same order, and allocation should
be to a similar region on disk. Hence you should get similar IO
patterns to the metadata, though not as well formed.

Using xfs_bmap will tell you where the files are located, and often
comparing c/mtime will tell you the order in which files were written.
That can tell you whether data allocation was jumping all over the
place or not...

> It has been pointed out that XFS schedules the writes like this on
> purpose so that they can be done in parallel,

XFS doesn't schedule writes like that - it only spreads the allocation
out. Writeback and the IO elevators are what do the IO scheduling, and
sometimes they don't play nicely with XFS. If you create files in this
manner:

/a/file1
/b/file1
/c/file1
/d/file1
/a/file2
/b/file2
....

then writeback is going to schedule them in the same order, and that
will result in IO being rotored across all AGs because writeback
retains the creation/dirtying order. There's only so much reordering
that can be done when writes are scheduled like this. If you create
files like this:

/a/file1
/a/file2
/a/file3
.....
/b/file1
/b/file2
/b/file3
.....

then writeback will issue them in that order, and data allocation will
be contiguous and hence writes much more sequential.

This is often a problem with naive multi-threaded applications - the
thought that more IO in flight will be faster than what a single
thread can do. If you cause IO to interleave like above, then it won't
go faster and could turn sequential workloads into random IO
workloads. OTOH, well designed applications can take advantage of
XFS's segregation and scale IO linearly by a combination of careful
placement and scalable block device design (e.g. a concat rather than
a flat stripe).

But I really don't know what your application is - all I know is that
you used sysbench to generate random IO that showed similar problems.
Posting the blktraces for us to analyse ourselves (I can tell an awful
lot from repeating patterns of block numbers and IO sizes) rather than
telling us what you saw is an example of what we need to see to
understand your problem.
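The xfs_bmap check suggested above can be partly automated. A small
sketch: it assumes the plain `xfs_bmap` output format of
`ext: [fileoff..fileoff]: startblock..endblock` lines (the sample text
below is made up for illustration), and extracts each extent's
starting disk block so that files sorted by mtime can be compared for
allocation jumps:

```python
import re

# Hypothetical helper for the kind of check described above: feed it
# the output of `xfs_bmap FILE` for files sorted by mtime and see
# whether their extents are roughly contiguous or jump across the
# volume. SAMPLE is invented for illustration, not real output.

SAMPLE = """\
0: [0..7]: 96..103
1: [8..23]: 2097152..2097167
"""

def extent_starts(bmap_output):
    """Return the starting disk block of each extent in xfs_bmap output."""
    starts = []
    for line in bmap_output.splitlines():
        m = re.match(r"\s*\d+:\s*\[\d+\.\.\d+\]:\s*(\d+)\.\.\d+", line)
        if m:
            starts.append(int(m.group(1)))
    return starts

print(extent_starts(SAMPLE))  # [96, 2097152]: a jump of ~2M blocks
```

Large jumps between the start blocks of files written back-to-back
would be the signature of allocation hopping between allocation
groups.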
This pretty much says it all:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> and that I should create
> a concatenated volume with physical devices matching the allocation
> groups. I actually went through this exercise, and yes, it was very
> beneficial, but that's not the point. I don't want to (have to) do
> that.

If you want to maximise storage performance, then that's what you do
for certain workloads. Saying "I want" followed by "I'm too lazy to do
that, but I still want" won't get you very far....

> And it's not always feasible, anyway. What about home usage with
> a single SATA disk? Is it not worthwhile to perform well on low-end
> devices?

Not really. XFS is mostly optimised for large scale HPC and enterprise
workloads and hardware. The only small scale system optimisations we
make are generally for your cheap 1-4 disk ARM/MIPS based NAS devices.
The workloads on those are effectively a server workload anyway, so
most of the optimisations we make benefit them as well. As for
desktops, well, it's fast enough for my workstation and laptop, so I
don't really care much more than that.. ;)

> You might ask then, why even bother using XFS instead of ext4?

No, I don't. If ext4 is better or XFS is too much trouble for you, then
it is better for you to use ext4. No-one here will argue against you
doing that - use what works for you. However, if you do use XFS and ask
for advice, then it pays to listen to the people who respond, because
they tend to be power users with lots of experience or subject matter
experts.....

> I care about the multi-user case. The problem I have with ext is that
> it is unbearably unresponsive when someone writes a semi-large amount
> of data (a few gigs) at once -- like extracting a large-ish tarball.
> Just using vim, even with :set nofsync, is almost impossible during
> that time.
> I have adopted various disgusting hacks like extracting to
> a ramdisk instead and rsyncing the lot over to the real disk with a
> very low --bwlimit, but I'm thoroughly fed up with this kind of crap,
> and in general, XFS works very well.
>
> If no-one cares about my findings, I will henceforth be quiet on this
> topic.

I care about the problems you are having, but I don't care about a
-simulation- of what you think is the problem. Report the real problem
(data allocation or writeback is not sequential when it should be) and
we might be able to get to the bottom of your issue. Report a
simulation of an issue, and we'll just tell you what is wrong with your
simulation (i.e. random IO and RAID5/6 don't mix. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: A little RAID experiment
  2012-07-26  8:32 ` Dave Chinner
@ 2012-09-11 16:37 ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-09-11 16:37 UTC (permalink / raw)
To: Linux fs XFS

On Thu, Jul 26, 2012, Dave Chinner <david@fromorbit.com> wrote:
>> 10001
>> 20001
>> 30001
>> 40001
>> 10002
>> 20002
>> 30002
>> 40002
>> 10003
>> 20003
>> ...
>
> That's the problem you should have reported.

I did, but then I got bashed for using RAID 5/6 and about the specifics
of hardware and everything, which shouldn't even matter, but I let
myself get dragged into this discussion.

Anyway, in the meantime I had a closer look at the actual block trace,
and it looks a bit different than the way I interpreted it at first. It
sends runs of 30-50 writes with holes in them, like so: 2, 4-5, 7,
10-12, 14, 16-17 and so on. These holes seem to be caused by the free
space fragmentation. Every once in a while -- somewhat frequently,
after 30 or so blocks, as mentioned -- it switches to another
allocation group.

If these blocks were contiguous, the elevator should be able to merge
them, but the tiny holes make this impossible. So I guess there's
nothing that can be substantially improved here. The frequent AG
switches are a bit difficult for the controller to handle, but
different controllers struggle under different workloads, and there's
nothing that can be done about that. I noticed just today that the HP
SmartArray controllers handle truly random writes better than the
MegaRAID variety that I praised so much in my postings.
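The cost of those tiny holes can be quantified with a toy count. The
numbers here are hypothetical: a 40-block span with every third block
missing, roughly mimicking the 2, 4-5, 7, 10-12 pattern from the trace;
only immediately adjacent blocks are assumed to merge.

```python
# Free-space fragmentation punches small holes into otherwise
# sequential writes, so the elevator is left with many short runs
# instead of one long mergeable one.

def merge_runs(blocks):
    """Number of contiguous runs = number of requests after merging."""
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)

contiguous = list(range(1, 41))                  # 40 blocks, no holes
holey = [b for b in range(1, 41) if b % 3 != 0]  # holes at 3, 6, 9, ...

print(merge_runs(contiguous))  # 1: the whole span merges into one request
print(merge_runs(holey))       # 14: mostly two-block requests
```

Dropping only a third of the blocks turns one large request into
fourteen small ones, which is why the trace looks so seek-heavy even
though the writes cover an almost sequential range.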
* Re: A little RAID experiment
  2012-07-16 21:27 ` Stan Hoeppner
  2012-07-16 21:58 ` Stefan Ring
@ 2012-07-16 22:16 ` Stefan Ring
  1 sibling, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-07-16 22:16 UTC (permalink / raw)
To: stan; +Cc: xfs

>> If I thought that the internal RAID was bad, that's only because I
>> have not yet experienced an external enclosure from HP attached via
>> FibreChannel (P2000 G3 MSA, QLogic Corp. ISP2532-based 8Gb Fibre
>> Channel to PCI Express HBA). Unfortunately, I don't have detailed
>> information about the configuration of this enclosure, except that
>> it's a RAID6 volume, with 10 or 12 disks, I believe.
>
> Without that information the numbers below may be a bit meaningless.

Yes, probably, but I likely won't get at the information, let alone
change or tweak anything. But even with the most naive setup, a good
storage stack "should" not exhibit this kind of behavior.
* Re: A little RAID experiment
  2012-04-25  8:07 A little RAID experiment Stefan Ring
  ` (2 preceding siblings ...)
  2012-07-16 19:57 ` Stefan Ring
@ 2012-10-10 14:57 ` Stefan Ring
  2012-10-10 21:27 ` Dave Chinner
  3 siblings, 1 reply; 41+ messages in thread
From: Stefan Ring @ 2012-10-10 14:57 UTC (permalink / raw)
To: Linux fs XFS

Btw, one of our customers recently acquired new gear with HP SmartArray
Gen8 controllers. Now they are something to get excited about! This is
the kind of write performance I would expect from an expensive server
product. Check this out (this is again my artificial benchmark as well
as random write of 4K blocks):

SmartArray P400, 6 300G disks (10k, SAS) RAID 6, 256M BBWC:

ag4
Read 0b  Written 161.56Mb  Total transferred 161.56Mb  (5.3853Mb/sec)
 1378.63 Requests/sec executed

random write
Read 0b  Written 97.578Mb  Total transferred 97.578Mb  (3.2526Mb/sec)
  832.66 Requests/sec executed

SmartArray Gen8, 8 300G disks (15k, SAS) RAID 5, 2GB FBWC:

ag4
Read 0b  Written 2.4575Gb  Total transferred 2.4575Gb  (83.883Mb/sec)
21474.03 Requests/sec executed

random write
Read 0b  Written 343.86Mb  Total transferred 343.86Mb  (11.462Mb/sec)
 2934.24 Requests/sec executed

So yeah, the disks are a bit faster. But what does that matter when
there is such a huge difference otherwise?

Unfortunately, while composing this text, I noticed that the new one is
configured as RAID 5, and I cannot change it because of HP's licensing
policy. That makes it not a meaningful comparison, although
extrapolation from previous SmartArray controllers would suggest that
the RAID5 and RAID6 performance is comparable. My subjective impression
is still a very good one!
* Re: A little RAID experiment
  2012-10-10 14:57 ` Stefan Ring
@ 2012-10-10 21:27 ` Dave Chinner
  2012-10-10 22:01 ` Stefan Ring
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2012-10-10 21:27 UTC (permalink / raw)
To: Stefan Ring; +Cc: Linux fs XFS

On Wed, Oct 10, 2012 at 04:57:47PM +0200, Stefan Ring wrote:
> Btw, one of our customers recently acquired new gear with HP SmartArray
> Gen8 controllers. Now they are something to get excited about! This is
> the kind of write performance I would expect from an expensive server
> product. Check this out (this is again my artificial benchmark as well
> as random write of 4K blocks):
>
> SmartArray P400, 6 300G disks (10k, SAS) RAID 6, 256M BBWC:
                                                   ^^^^
.....

> SmartArray Gen8, 8 300G disks (15k, SAS) RAID 5, 2GB FBWC:
                                                   ^^^

That's the reason for the difference in performance...

> So yeah, the disks are a bit faster. But what does that matter when
> there is such a huge difference otherwise?

Just indicates that the working set for your test is much more resident
in the controller cache - it has nothing to do with the disk speeds.
Run a larger set of files/workload and the results will end up a lot
closer to disk speed instead of cache speed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: A little RAID experiment
  2012-10-10 21:27 ` Dave Chinner
@ 2012-10-10 22:01 ` Stefan Ring
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Ring @ 2012-10-10 22:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: Linux fs XFS

> Just indicates that the working set for your test is much more
> resident in the controller cache - it has nothing to do with the disk
> speeds. Run a larger set of files/workload and the results will end
> up a lot closer to disk speed instead of cache speed...

That's indeed a valid objection, but I just verified that with the
working set size multiplied by the relative cache size difference (64GB
instead of 8GB), the performance stays exactly the same. The new
controller seems to run much better cache control algorithms.
* Re: A little RAID experiment
@ 2012-04-26 22:33 Richard Scobie
  2012-04-27 21:30 ` Emmanuel Florac
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Scobie @ 2012-04-26 22:33 UTC (permalink / raw)
To: stefanrin; +Cc: xfs

I know you were interested in hardware RAID controllers, but out of
curiosity, this is the result on a 16 x 1TB SATA Linux md software
RAID6 array.

Formatted xfs, with external journal on an independent SATA device,
mounted delaylog,inode64,logbsize=256k,logdev=/dev/md0,noatime,pquota.

Operations performed:  0 Read, 26065 Write, 0 Other = 26065 Total
Read 0b  Written 203.63Mb  Total transferred 203.63Mb  (13.575Mb/sec)
 1737.65 Requests/sec executed

Filesystem is 44% full, kernel 2.6.39.2.

xfs_bmap test_file.0
test_file.0:
	0: [0..8388607]: 9565100544..9573489151
	1: [8388608..16777215]: 9578354176..9586742783

Regards,

Richard
* Re: A little RAID experiment
  2012-04-26 22:33 Richard Scobie
@ 2012-04-27 21:30 ` Emmanuel Florac
  2012-04-28  4:15 ` Richard Scobie
  0 siblings, 1 reply; 41+ messages in thread
From: Emmanuel Florac @ 2012-04-27 21:30 UTC (permalink / raw)
To: Richard Scobie; +Cc: stefanrin, xfs

On Fri, 27 Apr 2012 10:33:39 +1200, you wrote:

> Formatted xfs, with external journal on an independent SATA device,
> mounted delaylog,inode64,logbsize=256k,logdev=/dev/md0,noatime,pquota.

Wouldn't it be preferable to use the RAID controller to host the log
device? This way you profit from the write cache, as the log easily
fits in.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: A little RAID experiment
  2012-04-27 21:30 ` Emmanuel Florac
@ 2012-04-28  4:15 ` Richard Scobie
  0 siblings, 0 replies; 41+ messages in thread
From: Richard Scobie @ 2012-04-28  4:15 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: stefanrin, xfs

Emmanuel Florac wrote:
> Wouldn't it be preferable to use the RAID controller to host the log
> device? This way you profit from the write cache, as the log easily
> fits in.

This setup is md software RAID ;). The controller is an LSI 1068 using
initiator-target firmware. There is no write cache that I am aware of.

Regards,

Richard