* RFC - de-clustered raid 60 or 61 algorithm
@ 2018-02-08  0:46 Wol's lists
  2018-02-08  3:14 ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Wol's lists @ 2018-02-08  0:46 UTC (permalink / raw)
  To: mdraid; +Cc: NeilBrown

After the de-clustered thread, Neil said it would probably only take a 
small amount of coding to do something like that. There was also some 
discussion of spreading the load over the disks during a reconstruction 
when there are a lot of disks in an array. I've been trying to get my 
head round a simple 
algorithm to smear data over the disks along the lines of raid-10.

Basically, the idea is to define a logical stripe which is a multiple of 
both the number of physical disks, and of the number of logical disks. 
Within this logical stripe the blocks are shuffled using prime numbers 
to make sure we don't get a pathological shuffle.

At present, I've defined the logical stripe to be simply the product of 
the number of logical disks times the number of mirrors times the number 
of physical disks. We could shrink this by removing common factors, but 
we don't need to.

Given a logical block within this stripe, its physical position is 
calculated by the simple equation "logical block * prime mod logical 
stripe size". So long as the "prime" does not share any factors with the 
logical stripe size, then (with one exception) you're not going to get 
hash collisions, and you're not going to get more than one block per 
stripe stored on each drive. The exception, of course, is when the number 
of physical disks is not greater than the number of logical disks. Having 
the two identical is 
especially dangerous as users will not expect the pathological behaviour 
they will get - multiple blocks per stripe stored on the same disk. 
mdadm will need to detect and reject this layout. I think the best 
behaviour will be found where logical disks, mirrors, and physical disks 
don't share any prime factors.
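
Just to illustrate, the check need be no more than something like this 
untested sketch (the names are invented here; this is not actual mdadm code):

/* Untested sketch of the sanity check described above; the names are
   invented for illustration and this is not actual mdadm code. */
int gcd(int a, int b)
{
     while (b) {
          int t = a % b;
          a = b;
          b = t;
     }
     return a;
}

int layout_is_sane(int disks, int logdisks, int mirrors, int prime)
{
     int stripesize = logdisks * mirrors * disks;  /* logical stripe, in blocks */

     if (disks <= logdisks)
          return 0;     /* pathological: several blocks of a stripe per disk */
     if (gcd(prime, stripesize) != 1)
          return 0;     /* multiplier shares a factor with the stripe size */
     return 1;
}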

I've been playing with a mirror setup, and if we have two mirrors, we 
can rebuild any failed disk by copying from two other drives. I think 
also (I haven't looked at it) that you could do a fast rebuild without 
impacting other users of the system too much provided you don't swamp 
i/o bandwidth, as half of the requests for data on the three drives 
being used for rebuilding could actually be satisfied from other drives.

Rebuilding a striped raid such as a raid-60 also looks like it would 
spread the load.

The one thing that bothers me is that I don't know whether the "* prime 
mod" logic actually adds very much - whether we can just stripe stuff 
across like we do with raid-10. Where it will score is in a storage 
assembly that is a cluster of clusters of disks. Say you have a 
controller with ten disks and ten controllers in a rack: a suitable 
choice of prime and logical stripe could ensure the rack would survive 
losing a controller. And given that dealing with massive arrays is what 
this algorithm is about, that seems worthwhile.

Anyways, here's my simple code that demonstrates the algorithm and 
prints out how the blocks will be laid out. Is it a good idea? I'd like 
to think so ...

Cheers,
Wol

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
     int disks, logdisks, mirrors, prime;

     printf( "%s", "Enter the number of disks ");
     scanf( "%d", &disks);
     printf( "%s", "Enter the number of logical disks ");
     scanf( "%d", &logdisks);
     printf( "%s", "Enter the number of mirrors ");
     scanf( "%d", &mirrors);
     printf( "%s", "Enter the prime ");
     scanf( "%d", &prime);

     int blocks, *array;

     /* one logical stripe = logical disks * mirrors * physical disks blocks */
     blocks = logdisks * mirrors * disks;
     array = malloc( sizeof(int) * blocks );
     memset( array, '\0', sizeof(int) * blocks );

     int i;

     /* physical position i holds logical block (i * prime) mod stripe size */
     for (i=0; i < blocks; i++) {
         array[i] = (i * prime) % blocks;
     }

     int logstripe, logdisk;

     /* one row per physical stripe; each position is shown as the
        logical stripe number followed by the logical disk letter */
     for (i=0; i < blocks; i++) {
         if ( (i % disks) == 0) {
             printf( "\n");
         }

         // printf( "%4d", array[i]);

         logstripe = array[i] / (mirrors * logdisks);
         logdisk = array[i] % logdisks;
         printf( "  %02d%c", logstripe, 'A' + logdisk);

     }
     printf( "\n");

     free(array);
     return 0;
}


* Re: RFC - de-clustered raid 60 or 61 algorithm
  2018-02-08  0:46 RFC - de-clustered raid 60 or 61 algorithm Wol's lists
@ 2018-02-08  3:14 ` NeilBrown
  2018-02-08 12:56   ` Phil Turmel
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: NeilBrown @ 2018-02-08  3:14 UTC (permalink / raw)
  To: Wol's lists, mdraid


On Thu, Feb 08 2018, Wol's lists wrote:

> After the de-clustered thread, Neil said it would probably only take a 
> small amount of coding to do something like that. There was also some 
> discussion of spreading the load over the disks during a reconstruction 
> when there are a lot of disks in an array. I've been trying to get my 
> head round a simple 
> algorithm to smear data over the disks along the lines of raid-10.
>
> Basically, the idea is to define a logical stripe which is a multiple of 
> both the number of physical disks, and of the number of logical disks. 
> Within this logical stripe the blocks are shuffled using prime numbers 
> to make sure we don't get a pathological shuffle.
>
> At present, I've defined the logical stripe to be simply the product of 
> the number of logical disks times the number of mirrors times the number 
> of physical disks. We could shrink this by removing common factors, but 
> we don't need to.
>
> Given a logical block within this stripe, its physical position is 
> calculated by the simple equation "logical block * prime mod logical 
> stripe size". So long as the "prime" does not share any factors with the 
> logical stripe size, then (with one exception) you're not going to get 
> hash collisions, and you're not going to get more than one block per 
> stripe stored on each drive. The exception, of course, is when the number 
> of physical disks is not greater than the number of logical disks. Having 
> the two identical is 
> especially dangerous as users will not expect the pathological behaviour 
> they will get - multiple blocks per stripe stored on the same disk. 
> mdadm will need to detect and reject this layout. I think the best 
> behaviour will be found where logical disks, mirrors, and physical disks 
> don't share any prime factors.
>
> I've been playing with a mirror setup, and if we have two mirrors, we 
> can rebuild any failed disk by copying from two other drives. I think 
> also (I haven't looked at it) that you could do a fast rebuild without 
> impacting other users of the system too much provided you don't swamp 
> i/o bandwidth, as half of the requests for data on the three drives 
> being used for rebuilding could actually be satisfied from other drives.

I think that ends up being much the same result as a current raid10
where the number of copies doesn't divide the number of devices.
Reconstruction reads come from 2 different devices, and half the reads
that would go to them now go elsewhere.

I think that if you take your solution and a selection of different
"prime" numbers and rotate through the primes from stripe to stripe, you
can expect a more even distribution of load.

You hint at this below when you suggest that adding the "*prime" doesn't
add much.  I think it adds a lot more when you start rotating the primes
across the stripes.
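
Roughly, as an untested sketch of the sort of thing I mean (the prime table
is invented here, and every prime in it would have to be coprime to the
number of disks):

/* Untested sketch: each physical stripe is permuted with its own prime
   from a small table, so the layout changes from stripe to stripe.  The
   table is invented here, and every prime in it has to be coprime to the
   number of disks or the permutation within a stripe collapses. */
int stripe_primes[] = { 5, 7, 11, 13, 17, 19, 23 };
int nstripe_primes = sizeof(stripe_primes) / sizeof(stripe_primes[0]);

/* logical block held at physical position pos (disks positions per stripe) */
int block_at(int pos, int disks)
{
     int stripe = pos / disks;
     int column = pos % disks;
     int p = stripe_primes[stripe % nstripe_primes];

     return stripe * disks + (column * p) % disks;
}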

>
> Rebuilding a striped raid such as a raid-60 also looks like it would 
> spread the load.
>
> The one thing that bothers me is that I don't know whether the "* prime 
> mod" logic actually adds very much - whether we can just stripe stuff 
> across like we do with raid-10. Where it will score is in a storage 
> assembly that is a cluster of clusters of disks. Say you have a 
> controller with ten disks and ten controllers in a rack: a suitable 
> choice of prime and logical stripe could ensure the rack would survive 
> losing a controller. And given that dealing with massive arrays is what 
> this algorithm is about, that seems worthwhile.

The case of multiple failures is my major concern with the whole idea.
If failures were truly independent, then losing 3 in 100 at the same
time is probably quite unlikely.  But we all know there are various
factors that can cause failures to correlate - controller failure being
just one of them.
Maybe if you have dual-attached fibre channel to each drive, and you get
the gold-plated drives that have actually been exhaustively tested....

What do large installations do?  I assumed they had lots of modest
raid5 arrays and then mirrored between pairs of those (for data they
actually care about).  Maybe I'm wrong.

NeilBrown


>
> Anyways, here's my simple code that demonstrates the algorithm and 
> prints out how the blocks will be laid out. Is it a good idea? I'd like 
> to think so ...
>
> Cheers,
> Wol
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
>      int disks, logdisks, mirrors, prime;
>
>      printf( "%s", "Enter the number of disks ");
>      scanf( "%d", &disks);
>      printf( "%s", "Enter the number of logical disks ");
>      scanf( "%d", &logdisks);
>      printf( "%s", "Enter the number of mirrors ");
>      scanf( "%d", &mirrors);
>      printf( "%s", "Enter the prime ");
>      scanf( "%d", &prime);
>
>      int blocks, *array;
>
>      /* one logical stripe = logical disks * mirrors * physical disks blocks */
>      blocks = logdisks * mirrors * disks;
>      array = malloc( sizeof(int) * blocks );
>      memset( array, '\0', sizeof(int) * blocks );
>
>      int i;
>
>      /* physical position i holds logical block (i * prime) mod stripe size */
>      for (i=0; i < blocks; i++) {
>          array[i] = (i * prime) % blocks;
>      }
>
>      int logstripe, logdisk;
>
>      /* one row per physical stripe; each position is shown as the
>         logical stripe number followed by the logical disk letter */
>      for (i=0; i < blocks; i++) {
>          if ( (i % disks) == 0) {
>              printf( "\n");
>          }
>
>          // printf( "%4d", array[i]);
>
>          logstripe = array[i] / (mirrors * logdisks);
>          logdisk = array[i] % logdisks;
>          printf( "  %02d%c", logstripe, 'A' + logdisk);
>
>      }
>      printf( "\n");
>
>      free(array);
>      return 0;
> }


* Re: RFC - de-clustered raid 60 or 61 algorithm
  2018-02-08  3:14 ` NeilBrown
@ 2018-02-08 12:56   ` Phil Turmel
  2018-02-08 23:10     ` Wol's lists
  2018-02-09 23:12   ` Wol's lists
  2018-02-10  3:02   ` John Stoffel
  2 siblings, 1 reply; 6+ messages in thread
From: Phil Turmel @ 2018-02-08 12:56 UTC (permalink / raw)
  To: NeilBrown, Wol's lists, mdraid

On 02/07/2018 10:14 PM, NeilBrown wrote:
> On Thu, Feb 08 2018, Wol's lists wrote:

>> I've been playing with a mirror setup, and if we have two mirrors, we 
>> can rebuild any failed disk by copying from two other drives. I think 
>> also (I haven't looked at it) that you could do a fast rebuild without 
>> impacting other users of the system too much provided you don't swamp 
>> i/o bandwidth, as half of the requests for data on the three drives 
>> being used for rebuilding could actually be satisfied from other drives.
> 
> I think that ends up being much the same result as a current raid10
> where the number of copies doesn't divide the number of devices.
> Reconstruction reads come from 2 different devices, and half the reads
> that would go to them now go elsewhere.

This begs the question:

Why not just use the raid10,near striping algorithm?  Say one wants
raid6 n=6 inside raid60 n=25.  Use the raid10,near6 n=25 striping
algorithm, but within each near6 inner stripe place data and P and Q
using the existing raid6 rotation.
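
Concretely, the placement I have in mind is roughly this (an illustrative
sketch only; "inner" is the width of the raid6 inner stripe, "devices" the
total number of drives):

/* Sketch of the placement being described (illustrative only): treat each
   raid6 inner stripe like one "near" group, so inner stripe s, member j
   (data, P or Q after the usual raid6 rotation) just wraps around the
   device list. */
int device_for(int s, int j, int inner, int devices)
{
     return (s * inner + j) % devices;
}

With inner = 6 and devices = 25, gcd(6, 25) = 1, so the starting device
6*s mod 25 works its way round all 25 devices and the inner-stripe
membership rotates across the whole array.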

What is the more complex placement algorithm providing?

Phil


* Re: RFC - de-clustered raid 60 or 61 algorithm
  2018-02-08 12:56   ` Phil Turmel
@ 2018-02-08 23:10     ` Wol's lists
  0 siblings, 0 replies; 6+ messages in thread
From: Wol's lists @ 2018-02-08 23:10 UTC (permalink / raw)
  To: Phil Turmel, NeilBrown, mdraid

On 08/02/18 12:56, Phil Turmel wrote:
> On 02/07/2018 10:14 PM, NeilBrown wrote:
>> On Thu, Feb 08 2018, Wol's lists wrote:
> 
>>> I've been playing with a mirror setup, and if we have two mirrors, we
>>> can rebuild any failed disk by copying from two other drives. I think
>>> also (I haven't looked at it) that you could do a fast rebuild without
>>> impacting other users of the system too much provided you don't swamp
>>> i/o bandwidth, as half of the requests for data on the three drives
>>> being used for rebuilding could actually be satisfied from other drives.
>>
>> I think that ends up being much the same result as a current raid10
>> where the number of copies doesn't divide the number of devices.
>> Reconstruction reads come from 2 different devices, and half the reads
>> that would go to them now go elsewhere.
> 
> This begs the question:
> 
> Why not just use the raid10,near striping algorithm?  Say one wants
> raid6 n=6 inside raid60 n=25.  Use the raid10,near6 n=25 striping
> algorithm, but within each near6 inner stripe place data and P and Q
> using the existing raid6 rotation.
> 
> What is the more complex placement algorithm providing?
> 
It came from the declustered thread.

Especially with raid-60, a rebuild will hammer a small subset of the 
drives in the array. The idea is that a more complex algorithm will 
spread the load across more drives. If your raid-6 in a raid-60 has say 
8 drives, a rebuild will stress 16 drives. If you've got 100 drives 
total, that's a lot of stress that could be avoided if the data could be 
more widely spread.

Thing is, you CAN gain a lot from a complex raid like raid-60 that you 
lose with a raid-6+0 - another thing that came up was that you have to 
scrub a raid-6+0 as a whole bunch of separate arrays.

Really, it's a case of the more we can spread the data, the more it (1) 
reduces the stress during a rebuild, thus reducing the risk of a second 
related failure, and (2) increases the chances of surviving a multiple 
drive failure, because if three logically related drives fail you've lost 
your raid-6-based array. Spreading the data reduces the logical linking 
between drives.

Cheers,
Wol


* Re: RFC - de-clustered raid 60 or 61 algorithm
  2018-02-08  3:14 ` NeilBrown
  2018-02-08 12:56   ` Phil Turmel
@ 2018-02-09 23:12   ` Wol's lists
  2018-02-10  3:02   ` John Stoffel
  2 siblings, 0 replies; 6+ messages in thread
From: Wol's lists @ 2018-02-09 23:12 UTC (permalink / raw)
  To: NeilBrown, mdraid

On 08/02/18 03:14, NeilBrown wrote:
> I think that ends up being much the same result as a current raid10
> where the number of copies doesn't divide the number of devices.
> Reconstruction reads come from 2 different devices, and half the reads
> that would go to them now go elsewhere.

Thing is, it's no worse ... :-)
> 
> I think that if you take your solution and a selection of different
> "prime" numbers and rotate through the primes from stripe to stripe, you
> can expect a more even distribution of load.
> 
Tried it. Doesn't seem to work :-(

> You hint at this below when you suggest that adding the "*prime" doesn't
> add much.  I think it adds a lot more when you start rotating the primes
> across the stripes.

Dunno how well this will work, but my big problem when I've been playing 
with this is pathological outcomes. It seems to work and then one disk 
is pathological. Typically it stores the two mirrors on the same disk - 
oops!

My new test program has three different algorithms. Algorithms 1 & 2 
seem near enough identical. And as has been commented, they offer 
nothing over the raid-10 algorithm. Thing is, the maths needed to prove 
the behaviour is not pathological is pretty easy.

Algorithm 3 does exactly what we want, BUT: if the number of drives is 
not greater than two logical stripes, then it's pretty much guaranteed 
to be pathological. However, provided that it's greater than that, I 
think we can prove that it's not.

The other thing to watch out for is that if our prime(s) are a factor of 
the number of disks, the algorithm is seriously pathological; but seeing 
as we should generate the primes on the fly, we can easily handle that.
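
Something like this untested sketch would do for generating them on the fly:

/* Untested sketch of generating the primes on the fly: the next prime
   above "after" that is not a factor of the number of disks. */
int next_usable_prime(int after, int disks)
{
     int p = after;

     for (;;) {
          int d, isprime;

          p++;
          isprime = (p > 1);
          for (d = 2; d * d <= p; d++) {
               if (p % d == 0) {
                    isprime = 0;
                    break;
               }
          }
          if (isprime && disks % p != 0)
               return p;
     }
}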

So I think we have to have a slightly complicated algorithm that says 
"if the number of drives is 2 times logical drives times mirrors, or 
less, then use the raid-10 algorithm and accept the poor mirror 
distribution, otherwise use algorithm 3".
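
In other words something like this (an untested sketch; the names are 
invented, and "algorithm 3" is the rotated-prime layout in the program below):

/* Untested sketch of the selection rule above; names invented here. */
enum layout { LAYOUT_RAID10_STYLE, LAYOUT_ALGORITHM_3 };

enum layout choose_layout(int disks, int logdisks, int mirrors)
{
     if (disks <= 2 * logdisks * mirrors)
          return LAYOUT_RAID10_STYLE;   /* accept the poorer mirror spread */
     return LAYOUT_ALGORITHM_3;
}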

So it won't be a particularly good algorithm for just a few (relatively 
speaking) physical drives, but if there are a lot of drives it'll work well.

I've been testing the algorithm with 4 logical drives by two mirrors, 
and using 12, 17 or 18 physical drives. Unfortunately it seems like a 
prime number of drives is best, but of course you won't get a big drive 
cabinet designed to take that number of drives :-(

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
     int disks, logdisks, mirrors, prime, algorithm;

     printf( "%s", "Enter the number of disks ");
     scanf( "%d", &disks);
     printf( "%s", "Enter the number of logical disks ");
     scanf( "%d", &logdisks);
     printf( "%s", "Enter the number of mirrors ");
     scanf( "%d", &mirrors);
     printf( "%s", "Enter the prime ");
     scanf( "%d", &prime);
     printf( "%s", "Enter the algorithm ");
     scanf( "%d", &algorithm);

     int blocks, *array;

     blocks = logdisks * mirrors * disks;
     array = malloc( sizeof(int) * blocks );
     memset( array, '\0', sizeof(int) * blocks );

     int i;

     int primes[] = { 5,11,19,23,31,37,41,43,47 };
     int nprimes = sizeof(primes) / sizeof(primes[0]);

     switch (algorithm) {
         case 1:
         /* initial simple version: physical position i holds
            logical block (i * prime) mod stripe size */
         for (i=0; i < blocks; i++) {
             array[i] = (i * prime) % blocks;
         }
         break;

         case 2:
         /* simple version 2: the inverse mapping - logical block i
            goes to physical position (i * prime) mod stripe size */
         for (i=0; i < blocks; i++) {
             int posn;
             posn = (i * prime) % blocks;
             array[posn] = i;
         }
         break;

         case 3:
         /* second attempt: rotate each physical stripe with its own
            prime (wrap the table index so a large number of logical
            stripes can't run off the end of the table) */
         for (i=0; i < blocks; i++) {
             int stripe, column;
             stripe = i / disks;
             column = ( stripe + i * primes[stripe % nprimes] ) % disks;
             array[ stripe * disks + column ] = i;
         }
         break;

         default:
         fprintf( stderr, "Unknown algorithm %d\n", algorithm);
         return 1;
     }

     int logstripe, logdisk;

     /* one row per physical stripe; each position is shown as the
        logical stripe number followed by the logical disk letter */
     for (i=0; i < blocks; i++) {
         if ( (i % disks) == 0) {
             printf( "\n");
         }

         // printf( "%4d", array[i]);

         logstripe = array[i] / (mirrors * logdisks);
         logdisk = array[i] % logdisks;
         printf( "  %02d%c", logstripe, 'A' + logdisk);

     }
     printf( "\n");

     free(array);
     return 0;
}


* Re: RFC - de-clustered raid 60 or 61 algorithm
  2018-02-08  3:14 ` NeilBrown
  2018-02-08 12:56   ` Phil Turmel
  2018-02-09 23:12   ` Wol's lists
@ 2018-02-10  3:02   ` John Stoffel
  2 siblings, 0 replies; 6+ messages in thread
From: John Stoffel @ 2018-02-10  3:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wol's lists, mdraid

>>>>> "NeilBrown" == NeilBrown  <neilb@suse.com> writes:

NeilBrown> On Thu, Feb 08 2018, Wol's lists wrote:
>> After the de-clustered thread, Neil said it would probably only take a 
>> small amount of coding to do something like that. There was also some 
>> discussion of spreading the load over the disks during a reconstruction 
>> when there are a lot of disks in an array. I've been trying to get my 
>> head round a simple 
>> algorithm to smear data over the disks along the lines of raid-10.
>> 
>> Basically, the idea is to define a logical stripe which is a multiple of 
>> both the number of physical disks, and of the number of logical disks. 
>> Within this logical stripe the blocks are shuffled using prime numbers 
>> to make sure we don't get a pathological shuffle.
>> 
>> At present, I've defined the logical stripe to be simply the product of 
>> the number of logical disks times the number of mirrors times the number 
>> of physical disks. We could shrink this by removing common factors, but 
>> we don't need to.
>> 
>> Given a logical block within this stripe, its physical position is 
>> calculated by the simple equation "logical block * prime mod logical 
>> stripe size". So long as the "prime" does not share any factors with the 
>> logical stripe size, then (with one exception) you're not going to get 
>> hash collisions, and you're not going to get more than one block per 
>> stripe stored on each drive. The exception, of course, is when the number 
>> of physical disks is not greater than the number of logical disks. Having 
>> the two identical is 
>> especially dangerous as users will not expect the pathological behaviour 
>> they will get - multiple blocks per stripe stored on the same disk. 
>> mdadm will need to detect and reject this layout. I think the best 
>> behaviour will be found where logical disks, mirrors, and physical disks 
>> don't share any prime factors.
>> 
>> I've been playing with a mirror setup, and if we have two mirrors, we 
>> can rebuild any failed disk by copying from two other drives. I think 
>> also (I haven't looked at it) that you could do a fast rebuild without 
>> impacting other users of the system too much provided you don't swamp 
>> i/o bandwidth, as half of the requests for data on the three drives 
>> being used for rebuilding could actually be satisfied from other drives.

NeilBrown> I think that ends up being much the same result as a current raid10
NeilBrown> where the number of copies doesn't divide the number of devices.
NeilBrown> Reconstruction reads come from 2 different devices, and half the reads
NeilBrown> that would go to them now go elsewhere.

NeilBrown> I think that if you take your solution and a selection of different
NeilBrown> "prime" numbers and rotate through the primes from stripe to stripe, you
NeilBrown> can expect a more even distribution of load.

NeilBrown> You hint at this below when you suggest that adding the "*prime" doesn't
NeilBrown> add much.  I think it adds a lot more when you start rotating the primes
NeilBrown> across the stripes.

>> 
>> Rebuilding a striped raid such as a raid-60 also looks like it would 
>> spread the load.
>> 
>> The one thing that bothers me is that I don't know whether the "* prime 
>> mod" logic actually adds very much - whether we can just stripe stuff 
>> across like we do with raid-10. Where it will score is in a storage 
>> assembly that is a cluster of clusters of disks. Say you have a 
>> controller with ten disks and ten controllers in a rack: a suitable 
>> choice of prime and logical stripe could ensure the rack would survive 
>> losing a controller. And given that dealing with massive arrays is what 
>> this algorithm is about, that seems worthwhile.

NeilBrown> The case of multiple failures is my major concern with the whole idea.
NeilBrown> If failures were truly independent, then losing 3 in 100 at the same
NeilBrown> time is probably quite unlikely.  But we all know there are various
NeilBrown> factors that can cause failures to correlate - controller failure being
NeilBrown> just one of them.
NeilBrown> Maybe if you have dual-attached fibre channel to each drive, and you get
NeilBrown> the gold-plated drives that have actually been exhaustively tested....

NeilBrown> What do large installations do?  I assumed they had lots of modest
NeilBrown> raid5 arrays and then mirrored between pairs of those (for data they
NeilBrown> actually care about).  Maybe I'm wrong.

Well, I can only really speak to how Netapp does it, but they have
what is called RAID-DP (Dual Parity), and they also build up data in
Aggregates, which are composed of one or more Raid-Groups.  Each raid
group has 2 parity and X data drives, usually around 14-16 drives.

The idea being that for large setups, you try to put each member of
the raid group on a separate shelf (shelves generally have 24 or 48
drives now, in older times they had... ummm 12 or 16?  I forget off
the top of my head).

Part of the idea would be that if you lost an entire shelf of disks,
you wouldn't lose all your raidgroups/aggregates.

And yes, if you lost a raidgroup, you lost the aggregate.

So paying attention to and planning for disk failures in large numbers
of disks is a big part of Netapp, even with dual ported SAS links to
shelves, and dual paths to SAS/SATA drives in those shelves.

I would be honestly scared to have 100 disks holding my data with only
two parity disks.   Losing three disks just kills you dead, and it's
going to happen.

For example, you can go to 45drives.com and buy a system with 60 drives
in it, using LSI controllers and (I think) expander cards.  Building a
reliable system to handle the failure of a single controller is key
here.  And I think you can add more controllers, so you reduce your
number of expander ports, which improves (hopefully!) reliability.

It's all a tradeoff.

What I would do is run some simulations where you set up 100 disks with
X of them parity, and then simulate failure rates.  Then include
the controllers in that simulation and see how likely you are to lose
all your data before you can rebuild.  It's not an easy problem space
at all.
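
As a very rough starting point, an untested sketch along these lines (every
number in it is an invented placeholder) would give ballpark figures before
adding the controllers to the model:

/* Untested Monte Carlo sketch of the simulation described above: DISKS
   drives, a made-up per-disk daily failure probability, a fixed rebuild
   time, and data loss declared whenever more than PARITY disks are down
   at the same time.  All numbers are invented placeholders. */
#include <stdio.h>
#include <stdlib.h>

#define DISKS    100     /* disks in the array */
#define PARITY   2       /* concurrent failures the layout survives */
#define DAYS     3650    /* ten simulated years */
#define REBUILD  3       /* days to rebuild one failed disk */
#define TRIALS   10000

int main(void)
{
     double p_fail = 0.05 / 365.0;     /* ~5% AFR per disk, made up */
     int lost = 0, run, t, d;

     srand(1);
     for (run = 0; run < TRIALS; run++) {
          int down_until[DISKS] = { 0 };

          for (t = 0; t < DAYS; t++) {
               int down = 0;

               for (d = 0; d < DISKS; d++) {
                    /* a working disk fails with probability p_fail per
                       day and stays down for REBUILD days */
                    if (down_until[d] <= t && rand() < p_fail * RAND_MAX)
                         down_until[d] = t + REBUILD;
                    if (down_until[d] > t)
                         down++;
               }
               if (down > PARITY) {     /* more failures than parity */
                    lost++;
                    break;
               }
          }
     }
     printf("lost data in %d of %d ten-year runs\n", lost, TRIALS);
     return 0;
}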

John
