From: NeilBrown <neilb@suse.com>
To: Wol's lists <antlists@youngman.org.uk>,
	mdraid <linux-raid@vger.kernel.org>
Subject: Re: RFC - de-clustered raid 60 or 61 algorithm
Date: Thu, 08 Feb 2018 14:14:29 +1100	[thread overview]
Message-ID: <876078maui.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <9f9e737c-d6d1-5cce-8190-14d970320265@youngman.org.uk>

On Thu, Feb 08 2018, Wol's lists wrote:

> After the de-clustered thread, Neil said it would probably only take a 
> small amount of coding to do something like that. There was also some 
> discussion of spreading the load over the disks during a reconstruction 
> when there are a lot of disks in an array. I've been trying to get my 
> head round a simple algorithm to smear data over the disks along the 
> lines of raid-10.
>
> Basically, the idea is to define a logical stripe which is a multiple of 
> both the number of physical disks and the number of logical disks. 
> Within this logical stripe the blocks are shuffled using prime numbers 
> to make sure we don't get a pathological shuffle.
>
> At present, I've defined the logical stripe to be simply the product of 
> the number of logical disks, the number of mirrors, and the number of 
> physical disks. We could shrink this by removing common factors, but we 
> don't need to.
>
> Given a logical block within this stripe, its physical position is 
> calculated by the simple equation "logical block * prime mod logical 
> stripe size". So long as the "prime" does not share any factors with the 
> logical stripe size, then (with one exception) you're not going to get 
> hash collisions, and you're not going to get more than one block per 
> stripe stored on each drive. The exception, of course, is if the number 
> of physical disks is not greater than the number of logical disks. 
> Having the two identical is especially dangerous, as users will not 
> expect the pathological behaviour they will get - multiple blocks per 
> stripe stored on the same disk. mdadm will need to detect and reject 
> this layout. I think the best behaviour will be found where logical 
> disks, mirrors, and physical disks don't share any prime factors. (A 
> small sketch of this mapping appears just after this quoted passage.)
>
> I've been playing with a mirror setup, and if we have two mirrors, we 
> can rebuild any failed disk by copying from two other drives. I also 
> think (I haven't looked at it) that you could do a fast rebuild without 
> impacting other users of the system too much, provided you don't swamp 
> i/o bandwidth, as half of the requests for data on the three drives 
> being used for rebuilding could actually be satisfied from other drives.
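
As a rough, standalone illustration of the mapping described in the quoted
paragraphs above - a minimal sketch, not mdadm code, using an arbitrary
geometry and the arbitrary multiplier 7 standing in for the "prime" - the
following program checks that (block * prime) % stripe_size visits every
position exactly once whenever the prime shares no factor with the logical
stripe size:

#include <stdio.h>

/* Greatest common divisor, used to test the coprimality condition. */
static unsigned gcd(unsigned a, unsigned b)
{
	while (b) {
		unsigned t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* Example geometry only - these numbers are assumptions. */
	const unsigned logdisks = 6, mirrors = 2, disks = 11;
	const unsigned prime = 7;	/* must be coprime to the stripe size */
	const unsigned S = logdisks * mirrors * disks;	/* logical stripe size */
	unsigned seen[1024] = { 0 };	/* large enough for this example S */
	unsigned b;

	if (gcd(prime, S) != 1) {
		fprintf(stderr, "prime shares a factor with the stripe size\n");
		return 1;
	}

	/* Apply the shuffle and count how often each position is hit. */
	for (b = 0; b < S; b++)
		seen[(b * prime) % S]++;

	/* A permutation hits every position exactly once - no collisions. */
	for (b = 0; b < S; b++)
		if (seen[b] != 1) {
			printf("collision at position %u\n", b);
			return 1;
		}
	printf("all %u positions hit exactly once\n", S);
	return 0;
}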

I think that ends up being much the same result as a current raid10
where the number of copies doesn't divide the number of devices.
Reconstruction reads come from 2 different devices, and half the reads
that would go to them now go elsewhere.

I think that if you take your solution and a selection of different
"prime" numbers, and rotate through those primes from stripe to stripe,
you can expect a more even distribution of load.

You hint at this below when you suggest that adding the "*prime" doesn't
add much.  I think it adds a lot more when you start rotating the primes
across the stripes.
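
To make that concrete, here is a toy sketch (my own numbers, not anything
from the proposal): each logical stripe picks its own multiplier from a
small table of values coprime to the number of disks, so the same logical
column lands on a different physical disk from stripe to stripe, which is
what spreads the rebuild load around:

#include <stdio.h>

int main(void)
{
	const int disks = 10;			/* physical disks (assumed) */
	const int columns = 8;			/* blocks per logical stripe (assumed) */
	const int primes[] = { 3, 7, 11, 13 };	/* all coprime to disks */
	const int nprimes = sizeof(primes) / sizeof(primes[0]);
	int s, k;

	/* Print the disk used by each column for the first few stripes. */
	for (s = 0; s < 4; s++) {
		int m = primes[s % nprimes];
		printf("stripe %d:", s);
		for (k = 0; k < columns; k++) {
			/* Rotate by the stripe number too, so column 0 also moves. */
			int disk = (k * m + s) % disks;
			printf(" %d", disk);
		}
		printf("\n");
	}
	return 0;
}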

>
> Rebuilding a striped raid such as a raid-60 also looks like it would 
> spread the load.
>
> The one thing that bothers me is that I don't know whether the "* prime 
> mod" logic actually adds very much - whether we can just stripe stuff 
> across like we do with raid-10. Where it will score is in a storage 
> assembly that is a cluster of clusters of disks. Say you have a 
> controller with ten disks, and ten controllers in a rack: a suitable 
> choice of prime and logical stripe could ensure the rack would survive 
> losing a controller. And given that dealing with massive arrays is what 
> this algorithm is about, that seems worthwhile.
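
That controller-survival property is easy to test mechanically. Below is a
rough, standalone check - the geometry, the controller grouping (disk d
assumed to sit behind controller d / 5) and the prime are my own
assumptions, and it reuses the logstripe/logdisk labelling from the program
quoted at the end of this mail - which counts how many logical blocks would
have every copy behind a single controller:

#include <stdio.h>
#include <stdlib.h>

#define DISKS		10	/* physical disks (assumed) */
#define CTRLS		2	/* controllers, DISKS/CTRLS disks each (assumed) */
#define LOGDISKS	3	/* logical disks (assumed) */
#define MIRRORS		2	/* copies of each logical block */
#define PRIME		7	/* coprime to the logical stripe size */

int main(void)
{
	const int blocks = LOGDISKS * MIRRORS * DISKS;	/* positions per logical stripe */
	const int idents = blocks / MIRRORS;		/* distinct logical blocks */
	int *first_ctrl = malloc(sizeof(int) * idents);
	int same_ctrl = 0;
	int i;

	for (i = 0; i < idents; i++)
		first_ctrl[i] = -1;

	for (i = 0; i < blocks; i++) {
		int disk = i % DISKS;			/* consecutive positions go across disks */
		int ctrl = disk / (DISKS / CTRLS);	/* controller that disk sits behind */
		int v = (i * PRIME) % blocks;		/* the shuffled logical position */
		int id = (v / (MIRRORS * LOGDISKS)) * LOGDISKS + v % LOGDISKS;

		/* Remember the controller of the first copy; a second copy on
		   the same controller means losing it loses the whole block. */
		if (first_ctrl[id] == -1)
			first_ctrl[id] = ctrl;
		else if (first_ctrl[id] == ctrl)
			same_ctrl++;
	}
	printf("%d of %d logical blocks have all copies behind one controller\n",
	       same_ctrl, idents);
	free(first_ctrl);
	return 0;
}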

The case of multiple failures is my major concern with the whole idea.
If failures were truly independent, then losing 3 in 100 at the same
time is probably quite unlikely.  But we all know there are various
factors that can cause failures to correlate - controller failure being
just one of them.
Maybe if you have dual-attached fibre channel to each drive, and you get
the gold-plated drives that have actually been exhaustively tested....

What do large installations do?  I assumed they had lots of modest
raid5 arrays and then mirrored between pairs of those (for data they
actually care about).  Maybe I'm wrong.

NeilBrown


>
> Anyways, here's my simple code that demonstrates the algorithm and 
> prints out how the blocks will be laid out. Is it a good idea? I'd like 
> to think so ...
>
> Cheers,
> Wol
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
>      int disks, logdisks, mirrors, prime;
>
>      /* Geometry of the layout, plus the "prime" used for the shuffle
>         (it must not share a factor with the logical stripe size). */
>      printf( "%s", "Enter the number of disks ");
>      scanf( "%d", &disks);
>      printf( "%s", "Enter the number of logical disks ");
>      scanf( "%d", &logdisks);
>      printf( "%s", "Enter the number of mirrors ");
>      scanf( "%d", &mirrors);
>      printf( "%s", "Enter the prime ");
>      scanf( "%d", &prime);
>
>      int blocks, *array;
>
>      /* One logical stripe covers this many blocks. */
>      blocks = logdisks * mirrors * disks;
>      array = malloc( sizeof(int) * blocks );
>      memset( array, 0, sizeof(int) * blocks );
>
>      int i;
>
>      /* The shuffle: physical position i holds logical block
>         (i * prime) % blocks.  With prime coprime to blocks this is a
>         permutation, so no two positions get the same logical block. */
>      for (i=0; i < blocks; i++) {
>          array[i] = (i * prime) % blocks;
>      }
>
>      int logstripe, logdisk;
>
>      /* Print the layout, one row of "disks" physical positions per
>         line: each entry shows the logical stripe number and a letter
>         identifying the logical disk the block belongs to. */
>      for (i=0; i < blocks; i++) {
>          if ( (i % disks) == 0) {
>              printf( "\n");
>          }
>
>          logstripe = array[i] / (mirrors * logdisks);
>          logdisk = array[i] % logdisks;
>          printf( "  %02d%c", logstripe, 'A' + logdisk);
>
>      }
>      printf( "\n");
>
>      free( array);
>      return 0;
> }
