On 2012-08-22, at 12:00 AM, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu wrote:
>>
>> -#define NR_STRIPES 256
>> +#define NR_STRIPES 1024
>
> Changing one magic number into another magic number might help your case, but
> it's not really a general solution.

We've actually been carrying a patch in Lustre for a few years that increases
NR_STRIPES to 2048 and makes it a configurable module parameter. This made a
noticeable performance improvement on fast systems.

> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
>
> I think the problems with RAID5 are deeper than that. Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.

The other MD RAID-5/6 patches we have change the page submission order to
avoid so much page merging in the elevator, and add zero-copy IO submission
when the caller marks the page for direct IO (indicating it will not be
modified until after the IO completes). This avoids a lot of overhead on fast
systems.

This isn't really my area of expertise, but the patches against RHEL6 can be
seen at http://review.whamcloud.com/1142 if you want to take a look. I don't
know whether that code is still relevant to what is in 3.x today.

> I don't think the size of the cache is a big part of the solution. I think
> correct scheduling of IO is the real answer.

My experience is that on fast systems the IO scheduler just gets in the way.
Submitting larger contiguous IOs to each disk in the first place is far better
than trying to merge small IOs again at the back end.

Cheers, Andreas
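
P.S. For reference, a minimal sketch of the "make NR_STRIPES a module
parameter" change I mentioned above. The variable name, permissions, and
description here are illustrative, not our exact Lustre patch:

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Replace the hard-coded NR_STRIPES with a tunable default. */
    static int nr_stripes = 2048;
    module_param(nr_stripes, int, 0444);
    MODULE_PARM_DESC(nr_stripes, "initial number of stripe cache entries");

    /* ...then use nr_stripes instead of NR_STRIPES wherever the stripe
     * cache is sized at array setup time. */

The default can then be overridden at load time (e.g. "modprobe raid456
nr_stripes=4096") instead of requiring a rebuild to change the cache size.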