Re: Bigger stripe size

* Re: Bigger stripe size
       [not found] <12EF8D94C6F8734FB2FF37B9FBEDD1735863D351@EXCHANGE.collogia.de>
@ 2014-08-14  4:11 ` NeilBrown
  2014-08-14  6:33   ` AW: " Markus Stockhausen
  0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2014-08-14  4:11 UTC (permalink / raw)
  To: Markus Stockhausen; +Cc: shli, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2691 bytes --]

On Wed, 13 Aug 2014 07:21:20 +0000 Markus Stockhausen
<stockhausen@collogia.de> wrote:

> Hello you two,
> 
> I saw Shaohua's patches for making the stripe size in raid4/5/6 configurable.
> If I got it right Neil likes the idea but does not agree with the kind of the
> implementation.
> 
> The patch is quite big an intrusive so I guess that any other design will have
> the same complexitiy. Neils idea about linking stripe headers sounds reasonable
> but will make it neccessary to "look at the linked neighbours" for some operations.
> Whatever "look" means programmatically. So I would like to hear your feedback
> about the following desing.
> 
> Will it make sense to work with per-stripe sizes? E.g.
> 
> User reads/writes 4K -> Work on a 4K stripe.
> User reads/writes 16K -> Work on a 16K stripe.
> 
> Difficulties.
> 
> - avoid overlapping of "small" and "big" stripes
> - split stripe cache in different sizes
> - Can we allocate multi-page memory to have continous work-areas?
> - ...
> 
> Benefits.
> 
> - Stripe handling unchanged.
> - paritiy calculation more efficient
> -  ...
> 
> Other ideas?

I fear that we are chasing the wrong problem.

The scheduling of stripe handling is currently very poor.  If you do a large
sequential write which should map to multiple full-stripe writes, you still
get a lot of reads.  This is bad.
The reason is that limited information is available to the raid5 driver
concerning what is coming next and it often guesses wrongly.

I suspect that it can be made a lot cleverer but I'm not entirely sure how.
A first step would be to "watch" exactly what happens in terms of the way
that requests come down, the timing of 'unplug' events, and the actual
handling of stripes.  'blktrace' could provide most or all of the raw data.

Then determine what the trace "should" look like and come up with a way for
raid5 too figure that out and do it.
I suspect that might involve are more "clever" queuing algorithm, possibly
keeping all the stripe_heads sorted, possibly storing them in an RB-tree.

Once you have that queuing in place so that the pattern of write requests
submitted to the drives makes sense, then it is time to analyse CPU efficiency
and find out where double-handling is happening, or when "batching" or
re-ordering of operations can make a difference.
If the queuing algorithm collects contiguous sequences of stripe_heads
together, then processes a batch of them in succession make provide the same
improvements as processing fewer larger stripe_heads.

So: first step is to get the IO patterns optimal.  Then look for ways to
optimise for CPU time.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread