On 2012-08-22, at 12:00 AM, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu wrote:
>>
>> -#define NR_STRIPES 256
>> +#define NR_STRIPES 1024
>
> Changing one magic number into another magic number might help your case, but
> it's not really a general solution.

We've actually been carrying a patch in Lustre for a few years that increases
NR_STRIPES to 2048 and makes it a configurable module parameter. This made a
noticeable performance improvement on fast systems.

> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
>
> I think the problems with RAID5 are deeper than that. Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.

The other MD RAID-5/6 patches we have change the page submission order to
avoid so much page merging in the elevator, and add zero-copy IO submission
when the caller marks the page for direct IO (indicating it will not be
modified until after the IO completes). This avoids a lot of overhead on fast
systems.

This isn't really my area of expertise, but the patches against RHEL6 can be
seen at http://review.whamcloud.com/1142 if you want to take a look. I don't
know whether that code is still relevant to what is in 3.x today.

> I don't think the size of the cache is a big part of the solution. I think
> correct scheduling of IO is the real answer.

My experience is that on fast systems the IO scheduler just gets in the way.
Submitting larger contiguous IOs to each disk in the first place is far better
than trying to merge small IOs again at the back end.

Cheers, Andreas
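
P.S. For reference, a minimal sketch of the "make NR_STRIPES a module
parameter" change I mentioned above. The variable name, permissions, and
description here are illustrative, not our exact Lustre patch:

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Replace the hard-coded NR_STRIPES with a tunable default. */
    static int nr_stripes = 2048;
    module_param(nr_stripes, int, 0444);
    MODULE_PARM_DESC(nr_stripes, "initial number of stripe cache entries");

    /* ...then use nr_stripes instead of NR_STRIPES wherever the stripe
     * cache is sized at array setup time. */

The default can then be overridden at load time (e.g. "modprobe raid456
nr_stripes=4096") instead of requiring a rebuild to change the cache size.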