From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stan Hoeppner
Subject: Re: RAID performance - new kernel results - 5x SSD RAID5
Date: Mon, 04 Mar 2013 06:20:15 -0600
Message-ID: <5134917F.8060903@hardwarefreak.com>
References: <51134E43.7090508@websitemanagers.com.au>
 <51137FB8.6060003@websitemanagers.com.au>
 <5113A2D6.20104@websitemanagers.com.au>
 <51150475.2020803@websitemanagers.com.au>
 <5120A84E.4020702@websitemanagers.com.au>
 <51222A81.9080600@hardwarefreak.com>
 <51250377.509@websitemanagers.com.au>
 <5125B8E5.5000502@hardwarefreak.com>
 <5125C154.3090603@websitemanagers.com.au>
 <51272808.7070302@hardwarefreak.com>
 <512A79D3.9020502@hardwarefreak.com>
 <5130D206.4090302@websitemanagers.com.au>
 <5131C338.8010402@hardwarefreak.com>
Reply-To: stan@hardwarefreak.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: Adam Goryachev
Cc: Linux RAID
List-Id: linux-raid.ids

On 3/3/2013 11:32 AM, Adam Goryachev wrote:
> Stan Hoeppner wrote:
>>> 1) Make sure stripe_cache_size is at least 8192.  If not:
>>> ~$ echo 8192 > /sys/block/md0/md/stripe_cache_size
>>> Currently using default 256.
>
> Done now

I see below that this paid some dividend.  You could try increasing it
further and may get even better write throughput for this FIO test, but
keep in mind large stripe_cache_size values eat serious amounts of RAM:

Formula:  stripe_cache_size * 4096 bytes * drive_count = RAM usage

For your 5 drive array:

  8192 eats 160MB
 16384 eats 320MB
 32768 eats 640MB

Considering this is an iSCSI block IO server, dedicating 640MB of RAM to
md stripe cache isn't a bad idea at all if it seriously increases write
throughput (and without decreasing read throughput).  You don't need RAM
for buffer cache since you're not doing local file operations.  I'd even
go up to 131072 and eat 2.5GB of RAM if the performance is substantially
better than at lower values.

Whatever value you choose, make it permanent by adding this entry to
root's crontab:

@reboot /bin/echo 32768 > /sys/block/md0/md/stripe_cache_size

>>> top -b -n 60 -d 0.25|grep Cpu|sort -n > /some.dir/some.file
>
> Done now
>
> There seems to be only one row from the top output which is interesting:
> Cpu0 : 3.6%us, 71.4%sy, 0.0%ni, 0.0%id, 10.7%wa, 0.0%hi, 14.3%si, 0.0%st
>
> Every other line had a high value for %id, or %wa, with lower values
> for %sy.  This was during the second 'stage' of the fio run; earlier in
> the fio run there was no entry even close to showing the CPU as busy.

I expended the time/effort walking you through all of this because I
want to analyze the complete output myself.  Would you please pastebin
it or send me the file?  Thanks.

>   READ: io=131072MB, aggrb=2506MB/s, minb=2566MB/s, maxb=2566MB/s, mint=52303msec, maxt=52303msec
>  WRITE: io=131072MB, aggrb=1262MB/s, minb=1292MB/s, maxb=1292MB/s, mint=103882msec, maxt=103882msec

Even better than I anticipated.  Nice, very nice.  2.3x the write
throughput.

Your last AIO single threaded streaming run:

 READ:  2,200 MB/s
WRITE:    560 MB/s

Multi-threaded run with stripe cache optimization and compressible data:

 READ:  2,500 MB/s
WRITE: *1,300 MB/s*

> Is this what I should be expecting now?

No, because this FIO test, as with the streaming test, isn't an accurate
model of your real daily IO workload, which entails much smaller, mixed
read/write random IOs.  But it does give a more accurate picture of the
peak aggregate write bandwidth of your array.
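If you want to take the drudgery out of hunting for the sweet spot, a
throwaway loop along these lines will do the sweep for you.  This is
just a sketch: it assumes my job file is saved as /root/test.fio and
that the array is still md0, so adjust both names to your setup.

#!/bin/sh
# Sweep stripe_cache_size and capture the FIO read/write summary lines
# for each value.  Job file path and md device are assumptions.
JOB=/root/test.fio
SCS=/sys/block/md0/md/stripe_cache_size

for size in 8192 16384 32768 65536 131072; do
    echo $size > $SCS
    echo "=== stripe_cache_size=$size ==="
    fio $JOB | grep -E 'READ:|WRITE:'
done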
Once you have determined the optimal stripe_cache_size, you need to run
this FIO test again, multiple times: first with the LVM snapshot
enabled, and then with DRBD enabled.  The DRBD load on the array on san1
should be only reads at a maximum rate of ~120MB/s as you have a single
GbE link to the secondary.  This is only 1/20th of the peak random read
throughput of the array.  Your prior sequential FIO runs showed a huge
degradation in write performance when DRBD was running.  This makes no
sense, and should not be the case.  You need to determine why DRBD on
san1 is hammering write performance.

> To me, it looks like it is close enough, but if you think I should be
> able to get even faster, then I will certainly investigate further.

There may be some juice left on the table.  Experiment with
stripe_cache_size until you hit the sweet spot.  I'd use only power of 2
values.  If 32768 gives a decent gain, then try 65536, then 131072.  If
32768 doesn't gain, or decreases throughput, try 16384.  If 16384
doesn't yield decent gains or goes backward, stick with 8192.  Again,
you must manually set the value as it doesn't survive reboots.  Easiest
route is cron.

>>> 4) Try to connect the SSD's direct to the HBA, bypassing the hotswap
>>> device in case this is limiting to SATA II or similar.

Put palm to forehead.  FIO shows 2.5GB/s read speed.  2.5GB/s / 5 =
500MB/s per drive, ergo your link speed must be 6Gbps on each drive.  If
it were 3Gbps you'd be limited to 300MB/s per drive, 1.5GB/s total.

> I'll have to find some software to run benchmarks within the windows
> VM's

FIO runs on Windows:  http://www.bluestop.org/fio/

>   READ: io=32768MB, aggrb=237288KB/s, minb=237288KB/s, maxb=237288KB/s, mint=141408msec, maxt=141408msec
>  WRITE: io=32768MB, aggrb=203307KB/s, minb=203307KB/s, maxb=203307KB/s, mint=165043msec, maxt=165043msec
>
> So, 230MB/s read and 200MB/s write, using 2 x 1Gbps ethernet seems
> pretty good.

So little is left on the table that it's not worth optimization time.

> The TS VM's have dedicated 4 cores, and the physical machines have
> dedicated 2 cores.  I don't think the TS are CPU limited, this was for
> me to watch this on the DC to ensure its single CPU was not limiting
> performance

I understand the DC issue.  I was addressing a different issue here.  If
the TS VMs have 4 cores there's nothing more you can do.

> I only run a single windows VM on each xen box (unless the xen boxes
> started failing, then I'd run multiple VM's on one xen box until it was
> repaired).

From your previous msgs early in the thread I was under the impression
you were running two TS VMs per Xen box.  I must have misread something.

> As such, I have 2 CPU cores dedicated to the dom0 (xen, physical box)
> and 4 cores dedicated to the windows VM (domU).  I can drop the
> physical box back to a single core and add an additional core to
> windows, but I don't think that will make much of a difference.  I
> don't tend to see high CPU load on the TS boxes...

Munin's averages may show less than the actual peaks due to the capture
interval of munin-node, so you can still have momentary spikes that bog
users down.  I'm not saying this is the case, but it's probably worth
investigating further, via methods other than studying Munin graphs.
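For example, something as crude as this on each dom0 would show whether
there are short CPU spikes the Munin graphs smooth over.  Just a sketch:
it assumes the standard Xen tools are installed, and the log path and
duration are arbitrary.

# Log per-domain CPU usage once a second for ten minutes, then eyeball
# the CPU(%) column for the TS domU.
xentop -b -d 1 -i 600 > /tmp/xentop-peaks.log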
> Mostly I see high memory utilisation more than CPU, and that is one of
> the reasons to upgrade them to Win2008 64bit so I can allocate more
> RAM.  I'm hoping that even with the virtualisation overhead, the modern
> CPU's are faster than the previous physical machines which were about
> 5 years old.

MS stupidity:

                        x86     x64
W2k3 Server Standard    4GB     32GB
XP                      4GB     128GB

> So, overall, I haven't achieved anywhere near as much as I had hoped...

You doubled write throughput to 1.3GB/s, at least WRT FIO.  That's one
fairly significant achievement.

> I've changed the stripe_cache_size, and disabled HT on san1.

Remember to test other sizes and make the optimum value permanent.

> Seems to be faster than before, so will see how it goes today/this
> week.

The only optimization since your last FIO test was increasing
stripe_cache_size (the rest of the FIO throughput increase was simply
due to changing the workload profile and using non-random data buffers).
The buffer difference:

stripe_cache_size    buffer space    full stripes buffered
256 (default)             5 MB               16
8192                    160 MB              512

To find out how much of the 732MB/s write throughput increase is due to
buffering 512 stripes instead of 16, simply change it back to 256,
re-run my FIO job file, and subtract the write result from 1292MB/s.

-- 
Stan