From mboxrd@z Thu Jan 1 00:00:00 1970
From: "John Stoffel"
Subject: Re: best base / worst case RAID 5,6 write speeds
Date: Tue, 15 Dec 2015 16:54:49 -0500
Message-ID: <22128.35881.182823.556362@quad.stoffel.home>
References: <22122.64143.522908.45940@quad.stoffel.home>
 <22123.9525.433754.283927@quad.stoffel.home>
 <566B6C8F.7020201@turmel.org>
 <566BA6E5.6030008@turmel.org>
 <22128.11867.847781.946791@quad.stoffel.home>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: Dallas Clement
Cc: John Stoffel, Mark Knecht, Phil Turmel, Linux-RAID
List-Id: linux-raid.ids

>>>>> "Dallas" == Dallas Clement writes:

Dallas> Thanks guys for all the ideas and help.

Dallas> Phil,

>> Very interesting indeed. I wonder if the extra I/O in flight at high
>> depths is consuming all available stripe cache space, possibly not
>> consistently. I'd raise and lower that in various combinations with
>> various combinations of iodepth. Running out of stripe cache will cause
>> premature RMWs.

Dallas> Okay, I'll play with that today. I have to confess I'm not
Dallas> sure that I completely understand how the stripe cache works.
Dallas> I think the idea is to batch I/Os into a complete stripe if
Dallas> possible and write out to the disks all in one go to avoid
Dallas> RMWs. Other than alignment issues, I'm unclear on what
Dallas> triggers RMWs. It seems, as Robert mentioned, that if the
Dallas> I/O block size is stripe aligned, there should never be RMWs.

Remember, there's a bounding limit on both how large the stripe cache
is and how long (timewise) it will let the cache sit around waiting
for new blocks to come in. That's probably what you're hitting at
times with the high queue depth numbers. I assume the blktrace info
would tell you more, but I haven't really a clue how to interpret it.

Dallas> My stripe cache is 8192 btw.

Dallas> John,

>> I suspect you've hit a known problem-ish area with Linux disk io,
>> which is that big queue depths aren't optimal.

Dallas> Yes, certainly looks that way. But maybe as Phil indicated I might be
Dallas> exceeding my stripe cache. I am still surprised that there are so
Dallas> many RMWs even if the stripe cache has been exhausted.

>> As you can see, it peaks at a queue depth of 4, and then tends
>> downward before falling off a cliff. So now what I'd do is keep the
>> queue depth at 4, but vary the block size and other parameters and see
>> how things change there.

Dallas> Why do you think there is a gradual drop off after queue depth
Dallas> of 4 and before it falls off the cliff?

I think it's because the in-kernel queue sizes start getting bigger,
and so the kernel spends more time queueing and caching the data and
moving it around, instead of just shoveling it down to the disks as
quickly as it can.

Dallas> I wish this were for fun! ;) Although this has been a fun
Dallas> discussion. I've learned a ton. This effort is for work
Dallas> though. I'd be all over the SSDs and caching otherwise. I'm
Dallas> trying to characterize and then squeeze all of the performance
Dallas> I can out of a legacy NAS product. I am constrained by the
Dallas> existing hardware. Unfortunately I do not have the option of
Dallas> using SSDs or hardware RAID controllers. I have to rely
Dallas> completely on Linux RAID.

Ah... in that case, you need to do your testing from the NAS side;
don't bother going down to this level.
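Coming back to the stripe cache question for a second: if you want to
check whether you really are blowing through it, something like this
rough, untested Python 3 sketch run alongside the benchmark would
tell you. It assumes the array is md0 (adjust to taste) and only
reads the stock md sysfs attributes stripe_cache_size,
stripe_cache_active and raid_disks:

#!/usr/bin/env python3
# Rough sketch (untested): watch how full the raid5/6 stripe cache
# gets during a benchmark run.  Assumes the array is md0.
import os
import time

MD = "md0"                      # placeholder -- change to your array
SYS = "/sys/block/%s/md" % MD

def read_int(name):
    # Each md sysfs attribute is a small text file holding a number.
    with open(os.path.join(SYS, name)) as f:
        return int(f.read().split()[0])

size = read_int("stripe_cache_size")
disks = read_int("raid_disks")
page = os.sysconf("SC_PAGE_SIZE")
print("stripe_cache_size=%d raid_disks=%d (~%.0f MiB of cache)"
      % (size, disks, size * page * disks / 2.0**20))

# Sample stripe_cache_active once a second for a minute.  If it sits
# pinned at stripe_cache_size while the RMWs pile up, the cache
# really is being exhausted.
for _ in range(60):
    print("stripe_cache_active:", read_int("stripe_cache_active"))
    time.sleep(1)

If stripe_cache_active stays pegged at stripe_cache_size for the
whole run, that would line up with Phil's theory about premature
RMWs.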
I'd honestly now just set your queue depth to 4 and move on to
testing the NAS side of things, where you have one, two, four, eight,
or more test boxes hitting the NAS box.

Dallas> I also need to optimize for large sequential writes (streaming
Dallas> video, audio, large file transfers), iSCSI (mostly used for
Dallas> hosting VMs), and random I/O (small and big files) as you
Dallas> would expect with a NAS.

So you want to do everything all at once. Fun.

So really I'd move back to the network side, because unless your NAS
box has more than one GigE interface and supports bonding/trunking,
you've hit the performance wall. Also, even if you get a ton of
performance with large streaming writes, when you sprinkle in a small
set of random I/Os, you're going to hit the cliff much sooner.

And in that case... it's another set of optimizations. Are you going
to use NFSv3? TCP? UDP? 1500 MTU or 9000 MTU? How many clients? How
active? Can you give up disk space for IOPs? If so, get away from the
RAID6 and move to RAID1 mirrors with a stripe atop them, so that you
maximize how many IOPs you can get.
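Just to put rough numbers behind that last suggestion, here's a
back-of-the-envelope Python sketch. Every number in it (disk count,
per-spindle IOPs, capacity) is a made-up placeholder, and it assumes
every small RAID6 write pays the full read-modify-write penalty of
six disk I/Os versus two for the mirrors:

#!/usr/bin/env python3
# Back-of-the-envelope only: small random-write IOPs and usable space
# for RAID6 vs RAID1+0 on the same spindles.  All inputs are guesses.
DISKS = 12        # placeholder: number of spindles in the chassis
DISK_IOPS = 75    # placeholder: rough figure for a 7200rpm SATA drive
DISK_TB = 4       # placeholder: per-disk capacity in TB

# A small RAID6 write that misses the stripe cache costs ~6 disk I/Os
# (read data+P+Q, write data+P+Q); a RAID1+0 write costs 2 (one per
# mirror leg).
raid6_iops = DISKS * DISK_IOPS / 6.0
raid10_iops = DISKS * DISK_IOPS / 2.0

raid6_tb = (DISKS - 2) * DISK_TB
raid10_tb = DISKS // 2 * DISK_TB

print("RAID6  : ~%d small-write IOPs, %d TB usable" % (raid6_iops, raid6_tb))
print("RAID1+0: ~%d small-write IOPs, %d TB usable" % (raid10_iops, raid10_tb))

With those (made-up) numbers the mirrors give you roughly three times
the small random-write IOPs, and you pay for it in usable space,
which is exactly the trade-off I mean above.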