From: NeilBrown <neilb@suse.com>
To: Wol's lists <antlists@youngman.org.uk>,
	mdraid <linux-raid@vger.kernel.org>
Subject: Re: RFC - de-clustered raid 60 or 61 algorithm
Date: Thu, 08 Feb 2018 14:14:29 +1100	[thread overview]
Message-ID: <876078maui.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <9f9e737c-d6d1-5cce-8190-14d970320265@youngman.org.uk>

On Thu, Feb 08 2018, Wol's lists wrote:

> After the de-clustered thread, Neil said it would probably only take a 
> small amount of coding to do something like that. There was also some 
> discussion of spreading the load over the disks during a reconstruction 
> when there are a lot of disks in an array. I've been trying to get my 
> head round a simple algorithm to smear data over the disks along the 
> lines of raid-10.
>
> Basically, the idea is to define a logical stripe which is a multiple of 
> both the number of physical disks and the number of logical disks. 
> Within this logical stripe the blocks are shuffled using prime numbers 
> to make sure we don't get a pathological shuffle.
>
> At present, I've defined the logical stripe to be simply the product of 
> the number of logical disks, the number of mirrors, and the number of 
> physical disks. We could shrink this by removing common factors, but we 
> don't need to.
>
> Given a logical block within this stripe, its physical position is 
> calculated by the simple equation "logical block * prime mod logical 
> stripe size". So long as the "prime" does not share any factors with the 
> logical stripe size, then (with one exception) you're not going to get 
> hash collisions, and you're not going to get more than one block per 
> stripe stored on each drive. The exception, of course, is if the number 
> of physical disks is not greater than the number of logical disks. 
> Having the two identical is especially dangerous, as users will not 
> expect the pathological behaviour they will get - multiple blocks per 
> stripe stored on the same disk. mdadm will need to detect and reject 
> this layout. I think the best behaviour will be found where logical 
> disks, mirrors, and physical disks don't share any prime factors. (A 
> small sketch of this mapping appears just after this quoted passage.)
>
> I've been playing with a mirror setup, and if we have two mirrors, we 
> can rebuild any failed disk by copying from two other drives. I also 
> think (I haven't looked at it) that you could do a fast rebuild without 
> impacting other users of the system too much, provided you don't swamp 
> i/o bandwidth, as half of the requests for data on the three drives 
> being used for rebuilding could actually be satisfied from other drives.
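
As a rough, standalone illustration of the mapping described in the quoted
paragraphs above - a minimal sketch, not mdadm code, using an arbitrary
geometry and the arbitrary multiplier 7 standing in for the "prime" - the
following program checks that (block * prime) % stripe_size visits every
position exactly once whenever the prime shares no factor with the logical
stripe size:

#include <stdio.h>

/* Greatest common divisor, used to test the coprimality condition. */
static unsigned gcd(unsigned a, unsigned b)
{
	while (b) {
		unsigned t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* Example geometry only - these numbers are assumptions. */
	const unsigned logdisks = 6, mirrors = 2, disks = 11;
	const unsigned prime = 7;	/* must be coprime to the stripe size */
	const unsigned S = logdisks * mirrors * disks;	/* logical stripe size */
	unsigned seen[1024] = { 0 };	/* large enough for this example S */
	unsigned b;

	if (gcd(prime, S) != 1) {
		fprintf(stderr, "prime shares a factor with the stripe size\n");
		return 1;
	}

	/* Apply the shuffle and count how often each position is hit. */
	for (b = 0; b < S; b++)
		seen[(b * prime) % S]++;

	/* A permutation hits every position exactly once - no collisions. */
	for (b = 0; b < S; b++)
		if (seen[b] != 1) {
			printf("collision at position %u\n", b);
			return 1;
		}
	printf("all %u positions hit exactly once\n", S);
	return 0;
}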

I think that ends up being much the same result as a current raid10
where the number of copies doesn't divide the number of devices.
Reconstruction reads come from 2 different devices, and half the reads
that would go to them now go elsewhere.

I think that if you take your solution and a selection of different
"prime" numbers, and rotate through those primes from stripe to stripe,
you can expect a more even distribution of load.

You hint at this below when you suggest that adding the "*prime" doesn't
add much.  I think it adds a lot more when you start rotating the primes
across the stripes.
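
To make that concrete, here is a toy sketch (my own numbers, not anything
from the proposal): each logical stripe picks its own multiplier from a
small table of values coprime to the number of disks, so the same logical
column lands on a different physical disk from stripe to stripe, which is
what spreads the rebuild load around:

#include <stdio.h>

int main(void)
{
	const int disks = 10;			/* physical disks (assumed) */
	const int columns = 8;			/* blocks per logical stripe (assumed) */
	const int primes[] = { 3, 7, 11, 13 };	/* all coprime to disks */
	const int nprimes = sizeof(primes) / sizeof(primes[0]);
	int s, k;

	/* Print the disk used by each column for the first few stripes. */
	for (s = 0; s < 4; s++) {
		int m = primes[s % nprimes];
		printf("stripe %d:", s);
		for (k = 0; k < columns; k++) {
			/* Rotate by the stripe number too, so column 0 also moves. */
			int disk = (k * m + s) % disks;
			printf(" %d", disk);
		}
		printf("\n");
	}
	return 0;
}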

>
> Rebuilding a striped raid such as a raid-60 also looks like it would 
> spread the load.
>
> The one thing that bothers me is that I don't know whether the "* prime 
> mod" logic actually adds very much - whether we can just stripe stuff 
> across like we do with raid-10. Where it will score is in a storage 
> assembly that is a cluster of clusters of disks. Say you have a 
> controller with ten disks, and ten controllers in a rack: a suitable 
> choice of prime and logical stripe could ensure the rack would survive 
> losing a controller. And given that dealing with massive arrays is what 
> this algorithm is about, that seems worthwhile.
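
That controller-survival property is easy to test mechanically. Below is a
rough, standalone check - the geometry, the controller grouping (disk d
assumed to sit behind controller d / 5) and the prime are my own
assumptions, and it reuses the logstripe/logdisk labelling from the program
quoted at the end of this mail - which counts how many logical blocks would
have every copy behind a single controller:

#include <stdio.h>
#include <stdlib.h>

#define DISKS		10	/* physical disks (assumed) */
#define CTRLS		2	/* controllers, DISKS/CTRLS disks each (assumed) */
#define LOGDISKS	3	/* logical disks (assumed) */
#define MIRRORS		2	/* copies of each logical block */
#define PRIME		7	/* coprime to the logical stripe size */

int main(void)
{
	const int blocks = LOGDISKS * MIRRORS * DISKS;	/* positions per logical stripe */
	const int idents = blocks / MIRRORS;		/* distinct logical blocks */
	int *first_ctrl = malloc(sizeof(int) * idents);
	int same_ctrl = 0;
	int i;

	for (i = 0; i < idents; i++)
		first_ctrl[i] = -1;

	for (i = 0; i < blocks; i++) {
		int disk = i % DISKS;			/* consecutive positions go across disks */
		int ctrl = disk / (DISKS / CTRLS);	/* controller that disk sits behind */
		int v = (i * PRIME) % blocks;		/* the shuffled logical position */
		int id = (v / (MIRRORS * LOGDISKS)) * LOGDISKS + v % LOGDISKS;

		/* Remember the controller of the first copy; a second copy on
		   the same controller means losing it loses the whole block. */
		if (first_ctrl[id] == -1)
			first_ctrl[id] = ctrl;
		else if (first_ctrl[id] == ctrl)
			same_ctrl++;
	}
	printf("%d of %d logical blocks have all copies behind one controller\n",
	       same_ctrl, idents);
	free(first_ctrl);
	return 0;
}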

The case of multiple failures is my major concern with the whole idea.
If failures were truly independent, then losing 3 in 100 at the same
time is probably quite unlikely.  But we all know there are various
factors that can cause failures to correlate - controller failure being
just one of them.
Maybe if you have dual-attached fibre channel to each drive, and you get
the gold-plated drives that have actually been exhaustively tested....

What do large installations do?  I assumed they had lots of modest
raid5 arrays and then mirrored between pairs of those (for data they
actually care about).  Maybe I'm wrong.

NeilBrown


>
> Anyways, here's my simple code that demonstrates the algorithm and 
> prints out how the blocks will be laid out. Is it a good idea? I'd like 
> to think so ...
>
> Cheers,
> Wol
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
>      int disks, logdisks, mirrors, prime;
>
>      /* Geometry of the layout, plus the "prime" used for the shuffle
>         (it must not share a factor with the logical stripe size). */
>      printf( "%s", "Enter the number of disks ");
>      scanf( "%d", &disks);
>      printf( "%s", "Enter the number of logical disks ");
>      scanf( "%d", &logdisks);
>      printf( "%s", "Enter the number of mirrors ");
>      scanf( "%d", &mirrors);
>      printf( "%s", "Enter the prime ");
>      scanf( "%d", &prime);
>
>      int blocks, *array;
>
>      /* One logical stripe covers this many blocks. */
>      blocks = logdisks * mirrors * disks;
>      array = malloc( sizeof(int) * blocks );
>      memset( array, 0, sizeof(int) * blocks );
>
>      int i;
>
>      /* The shuffle: physical position i holds logical block
>         (i * prime) % blocks.  With prime coprime to blocks this is a
>         permutation, so no two positions get the same logical block. */
>      for (i=0; i < blocks; i++) {
>          array[i] = (i * prime) % blocks;
>      }
>
>      int logstripe, logdisk;
>
>      /* Print the layout, one row of "disks" physical positions per
>         line: each entry shows the logical stripe number and a letter
>         identifying the logical disk the block belongs to. */
>      for (i=0; i < blocks; i++) {
>          if ( (i % disks) == 0) {
>              printf( "\n");
>          }
>
>          logstripe = array[i] / (mirrors * logdisks);
>          logdisk = array[i] % logdisks;
>          printf( "  %02d%c", logstripe, 'A' + logdisk);
>
>      }
>      printf( "\n");
>
>      free( array);
>      return 0;
> }
