* Re: [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  2:28 松本周平 / MATSUMOTO,SHUUHEI
  0 siblings, 0 replies; 7+ messages in thread
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2019-09-05  2:28 UTC (permalink / raw)
  To: spdk


Hi Paul,
 
It’s great to know you’re working on RAID!

Ziye proposed a patch for RAID1, before the RAID bdev module existed, maybe more than a year ago.

IMHO, we will need a disk replacement feature and a degraded mode first. It may be difficult to use RAID1 without them, because RAID1 is for RAS.

We need a clean abstraction for RAID levels, i.e., extracting common operations and creating a function pointer table.
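
To make the idea concrete, here is a minimal sketch of what such a per-level table could look like (the names below are illustrative assumptions, not the current raid bdev code):

/* Minimal sketch of a per-level operations table -- names are illustrative
 * assumptions, not the existing SPDK raid bdev code. */
#include <stdint.h>

struct raid_bdev;
struct raid_bdev_io;

struct raid_level_ops {
    /* Validate level-specific parameters, e.g. num_replicas for 1E. */
    int  (*start)(struct raid_bdev *raid_bdev);
    /* Map a logical request to member disks and submit it. */
    void (*submit_rw_request)(struct raid_bdev_io *raid_io);
    /* Hook for later features: degraded I/O, rebuild, member failure. */
    void (*member_failed)(struct raid_bdev *raid_bdev, uint8_t member_idx);
};

With something like this, RAID0, RAID1, and 1E could share the common bdev plumbing and differ only in their table entries.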

We need a copy feature between two bdevs.

Thanks
Shuhei

Sent from my iPhone

> On Sep 5, 2019, at 8:49, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
> 
> Hi Everyone,
> 
> I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:
> 
> 
> *       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter
> 
> *       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:
> 
> o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
> D0         D0
> D1         D1
> D2         D2
> 
> *       3 disk 1E with replication of 2:
> 
> D0         D0         D1
> 
> D1         D2         D2
> 
> D3         D3         D4
> 
> *       3 disk 1E with replication of 3:
> 
> D0         D0         D0
> 
> D1         D1         D1
> 
> D2         D2         D2
> 
> *       3 disk 1E with replication of 1 (RAID0)
> D0         D1         D2
> D3         D4         D5
> D6         D7         D8
> 
> *       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs
> 
> *       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open
> 
> *       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica
> 
> 
> 
> 
> Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
> 
> Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI
> 
> Thanks!
> Paul
> 
> PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
> 
> 
> 


* Re: [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  3:44 Luse, Paul E
  0 siblings, 0 replies; 7+ messages in thread
From: Luse, Paul E @ 2019-09-05  3:44 UTC (permalink / raw)
  To: spdk


Of course... not to worry, I crank out emails faster than code for sure :) Plus there’s no reason at all to rush for 19.10. We’ll do it right!

-from my iPhone 

> On Sep 4, 2019, at 8:25 PM, 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com> wrote:
> 
> Hi Paul,
> 
> Please give me time to read through the whole comment :-)
> 
> But about copy, my intention is that if we add a new disk, the new disk has to catch up with the existing disks. So copy between two bdevs may be a basic feature.
> 
> Thanks,
> Shuhei
> 
> Sent from my iPhone
> 
>> On Sep 5, 2019, at 10:59, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
>> 
>> Hi Shuhei,
>> 
>> Thanks for the reply! Yes, I remember Ziye's patch and as I recall there were two things about it that I didn't want to carry forward: (a) it was its own bdev module as opposed to enhancing what's there (I don't believe the RAID0 module was there yet at the time, though, don't remember), and (b) it relied heavily on common functions used by other basic bdev modules like gpt.
>> 
>> The first point is worth discussing though, I'm glad you brought it up.  But first, yes, I expected feedback on needing some sort of degradation/replacement feature(s) and I don't disagree.  By copy I assume you mean something like "transform an existing RAID0-->RAID1E"? We could make something like that as simple or as complex as we wanted, and that would be very cool. Anyway, on a separate bdev module vs. updating the RAID0 module: I chatted with Jim about this a bit as well, and I think given what we already have it's a much lighter lift (less code, less complex) to add it to the RAID0 module. The only pro I can think of for doing it as a separate module is setting the precedent of stacking vbdevs to create more complex RAID levels, but I think that's more RAID complexity than we want or need for SPDK; I'm certainly open to feedback on that point.
>> 
>> All of this stuff will be phased in over a series of patches of course but I guess we can decide at what point it's considered non-experimental based on feature set.  I'll send out a more complete list of proposed features and include at least some basic stuff in what I'd call the first "production" version and we can go from there.
>> 
>> Wrt implementation details like abstracting common operations and using a function table, I can appreciate that input as well.  For R/W it may not be necessary though, at least based on my POC, which has very few changes to how strip locations and physical disk identifiers are calculated. That can all be part of review feedback on the actual patches. I hope to start posting next week if not sooner.  Either way, at the RPC level I think it's clear enough that we'll have very distinct RAID levels.
>> 
>> Anyway, thanks again and I'll work on a more detailed feature set definition based on your feedback!
>> 
>> Thx
>> Paul
>> 
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
>> Sent: Wednesday, September 4, 2019 7:28 PM
>> To: Storage Performance Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] Replication for SPDK (RAID 1E)
>> 
>> Hi Paul,
>> 
>> It’s great to know you’re working on RAID!
>> 
>> Ziye proposed a patch for RAID1, before the RAID bdev module existed, maybe more than a year ago.
>> 
>> IMHO, we will need a disk replacement feature and a degraded mode first. It may be difficult to use RAID1 without them, because RAID1 is for RAS.
>> 
>> We need a clean abstraction for RAID levels, i.e., extracting common operations and creating a function pointer table.
>> 
>> We need a copy feature between two bdevs.
>> 
>> Thanks
>> Shuhei
>> 
>> Sent from my iPhone
>> 
>>> On Sep 5, 2019, at 8:49, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:
>>> 
>>> 
>>> *       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter
>>> 
>>> *       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:
>>> 
>>> o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
>>> D0         D0
>>> D1         D1
>>> D2         D2
>>> 
>>> *       3 disk 1E with replication of 2:
>>> 
>>> D0         D0         D1
>>> 
>>> D1         D2         D2
>>> 
>>> D3         D3         D4
>>> 
>>> *       3 disk 1E with replication of 3:
>>> 
>>> D0         D0         D0
>>> 
>>> D1         D1         D1
>>> 
>>> D2         D2         D2
>>> 
>>> *       3 disk 1E with replication of 1 (RAID0)
>>> D0         D1         D2
>>> D3         D4         D5
>>> D6         D7         D8
>>> 
>>> *       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs
>>> 
>>> *       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open
>>> 
>>> *       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica
>>> 
>>> 
>>> 
>>> 
>>> Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
>>> 
>>> Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI
>>> 
>>> Thanks!
>>> Paul
>>> 
>>> PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
>>> 
>>> 
>>> 


* Re: [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  3:16 松本周平 / MATSUMOTO,SHUUHEI
  0 siblings, 0 replies; 7+ messages in thread
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2019-09-05  3:16 UTC (permalink / raw)
  To: spdk


Hi Paul,

Please give me time to read through the whole comment :-)

But about copy, my intention is that if we add a new disk, the new disk has to catch up with the existing disks. So copy between two bdevs may be a basic feature.
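
To make the catch-up idea concrete, here is a rough sketch of a chunked copy loop; the callbacks and chunk size below are assumptions for illustration, not SPDK APIs:

/* Illustrative only: bring a new member up to date by copying the source in
 * fixed-size chunks.  read_blocks/write_blocks stand in for whatever bdev I/O
 * path is used; they are assumptions, not SPDK functions. */
#include <stdbool.h>
#include <stdint.h>

#define COPY_CHUNK_BLOCKS 1024 /* arbitrary chunk size for the example */

typedef bool (*io_fn)(void *dev, void *buf, uint64_t offset_blocks, uint64_t num_blocks);

static bool
catch_up_copy(void *src, void *dst, uint64_t total_blocks,
              io_fn read_blocks, io_fn write_blocks, void *chunk_buf)
{
    uint64_t off = 0;

    while (off < total_blocks) {
        uint64_t n = total_blocks - off;

        if (n > COPY_CHUNK_BLOCKS) {
            n = COPY_CHUNK_BLOCKS;
        }
        /* A real implementation would be asynchronous and would also
         * track writes that land in the region already copied. */
        if (!read_blocks(src, chunk_buf, off, n) ||
            !write_blocks(dst, chunk_buf, off, n)) {
            return false;
        }
        off += n;
    }
    return true;
}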

Thanks,
Shuhei

Sent from my iPhone

> On Sep 5, 2019, at 10:59, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
> 
> Hi Shuhei,
> 
> Thanks for the reply! Yes, I remember Ziye's patch and as I recall there were two things about it that I didn't want to carry forward: (a) it was its own bdev module as opposed to enhancing what's there (I don't believe the RAID0 module was there yet at the time, though, don't remember), and (b) it relied heavily on common functions used by other basic bdev modules like gpt.
> 
> The first point is worth discussing though, I'm glad you brought it up.  But first, yes, I expected feedback on needing some sort of degradation/replacement feature(s) and I don't disagree.  By copy I assume you mean something like "transform an existing RAID0-->RAID1E"? We could make something like that as simple or as complex as we wanted, and that would be very cool. Anyway, on a separate bdev module vs. updating the RAID0 module: I chatted with Jim about this a bit as well, and I think given what we already have it's a much lighter lift (less code, less complex) to add it to the RAID0 module. The only pro I can think of for doing it as a separate module is setting the precedent of stacking vbdevs to create more complex RAID levels, but I think that's more RAID complexity than we want or need for SPDK; I'm certainly open to feedback on that point.
> 
> All of this stuff will be phased in over a series of patches of course but I guess we can decide at what point it's considered non-experimental based on feature set.  I'll send out a more complete list of proposed features and include at least some basic stuff in what I'd call the first "production" version and we can go from there.
> 
> Wrt implementation details like abstracting common operations and using a function table, I can appreciate that input as well.  For R/W it may not be necessary though, at least based on my POC, which has very few changes to how strip locations and physical disk identifiers are calculated. That can all be part of review feedback on the actual patches. I hope to start posting next week if not sooner.  Either way, at the RPC level I think it's clear enough that we'll have very distinct RAID levels.
> 
> Anyway, thanks again and I'll work on a more detailed feature set definition based on your feedback!
> 
> Thx
> Paul
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
> Sent: Wednesday, September 4, 2019 7:28 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] Replication for SPDK (RAID 1E)
> 
> Hi Paul,
> 
> It’s great to know you’re working on RAID!
> 
> Ziye proposed a patch for RAID1, before the RAID bdev module existed, maybe more than a year ago.
> 
> IMHO, we will need a disk replacement feature and a degraded mode first. It may be difficult to use RAID1 without them, because RAID1 is for RAS.
> 
> We need a clean abstraction for RAID levels, i.e., extracting common operations and creating a function pointer table.
> 
> We need a copy feature between two bdevs.
> 
> Thanks
> Shuhei
> 
> Sent from my iPhone
> 
>> On Sep 5, 2019, at 8:49, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
>> 
>> Hi Everyone,
>> 
>> I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:
>> 
>> 
>> *       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter
>> 
>> *       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:
>> 
>> o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
>> D0         D0
>> D1         D1
>> D2         D2
>> 
>> *       3 disk 1E with replication of 2:
>> 
>> D0         D0         D1
>> 
>> D1         D2         D2
>> 
>> D3         D3         D4
>> 
>> *       3 disk 1E with replication of 3:
>> 
>> D0         D0         D0
>> 
>> D1         D1         D1
>> 
>> D2         D2         D2
>> 
>> *       3 disk 1E with replication of 1 (RAID0)
>> D0         D1         D2
>> D3         D4         D5
>> D6         D7         D8
>> 
>> *       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs
>> 
>> *       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open
>> 
>> *       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica
>> 
>> 
>> 
>> 
>> Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
>> 
>> Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI
>> 
>> Thanks!
>> Paul
>> 
>> PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
>> 
>> 
>> 


* Re: [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  3:08 Luse, Paul E
  0 siblings, 0 replies; 7+ messages in thread
From: Luse, Paul E @ 2019-09-05  3:08 UTC (permalink / raw)
  To: spdk


Thanks :) I’ll definitely do the function pointer table in the first rev. That’ll be super important for adding more levels later...

-from my iPhone 

> On Sep 4, 2019, at 7:59 PM, 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com> wrote:
> 
> But your idea is surely a good start!
> 
> Sent from my iPhone
> 
>> On Sep 5, 2019, at 10:36, 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com> wrote:
>> 
>> Hi Paul,
>> 
>> It’s great to know you’re working on RAID!
>> 
>> Ziye proposed a patch for RAID1, before the RAID bdev module existed, maybe more than a year ago.
>> 
>> IMHO, we will need a disk replacement feature and a degraded mode first. It may be difficult to use RAID1 without them, because RAID1 is for RAS.
>> 
>> We need a clean abstraction for RAID levels, i.e., extracting common operations and creating a function pointer table.
>> 
>> We need a copy feature between two bdevs.
>> 
>> Thanks
>> Shuhei
>> 
>> Sent from my iPhone
>> 
>>> On Sep 5, 2019, at 8:49, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:
>>> 
>>> 
>>> *       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter
>>> 
>>> *       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:
>>> 
>>> o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
>>> D0         D0
>>> D1         D1
>>> D2         D2
>>> 
>>> *       3 disk 1E with replication of 2:
>>> 
>>> D0         D0         D1
>>> 
>>> D1         D2         D2
>>> 
>>> D3         D3         D4
>>> 
>>> *       3 disk 1E with replication of 3:
>>> 
>>> D0         D0         D0
>>> 
>>> D1         D1         D1
>>> 
>>> D2         D2         D2
>>> 
>>> *       3 disk 1E with replication of 1 (RAID0)
>>> D0         D1         D2
>>> D3         D4         D5
>>> D6         D7         D8
>>> 
>>> *       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs
>>> 
>>> *       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open
>>> 
>>> *       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica
>>> 
>>> 
>>> 
>>> 
>>> Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
>>> 
>>> Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI
>>> 
>>> Thanks!
>>> Paul
>>> 
>>> PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
>>> 
>>> 
>>> 


* Re: [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  2:53 Luse, Paul E
  0 siblings, 0 replies; 7+ messages in thread
From: Luse, Paul E @ 2019-09-05  2:53 UTC (permalink / raw)
  To: spdk


Hi Shuhei,

Thanks for the reply! Yes, I remember Ziye's patch and as I recall there were two things about it that I didn't want to carry forward: (a) it was its own bdev module as opposed to enhancing what's there (I don't believe the RAID0 module was there yet at the time, though, don't remember), and (b) it relied heavily on common functions used by other basic bdev modules like gpt.

The first point is worth discussing though, I'm glad you brought it up.  But first, yes, I expected feedback on needing some sort of degradation/replacement feature(s) and I don't disagree.  By copy I assume you mean something like "transform an existing RAID0-->RAID1E"? We could make something like that as simple or as complex as we wanted, and that would be very cool. Anyway, on a separate bdev module vs. updating the RAID0 module: I chatted with Jim about this a bit as well, and I think given what we already have it's a much lighter lift (less code, less complex) to add it to the RAID0 module. The only pro I can think of for doing it as a separate module is setting the precedent of stacking vbdevs to create more complex RAID levels, but I think that's more RAID complexity than we want or need for SPDK; I'm certainly open to feedback on that point.

All of this stuff will be phased in over a series of patches of course but I guess we can decide at what point it's considered non-experimental based on feature set.  I'll send out a more complete list of proposed features and include at least some basic stuff in what I'd call the first "production" version and we can go from there.

Wrt implementation details like abstracting common operations and using a function table, I can appreciate that input as well.  For R/W it may not be necessary though, at least based on my POC, which has very few changes to how strip locations and physical disk identifiers are calculated. That can all be part of review feedback on the actual patches. I hope to start posting next week if not sooner.  Either way, at the RPC level I think it's clear enough that we'll have very distinct RAID levels.

Anyway, thanks again and I'll work on a more detailed feature set definition based on your feedback!

Thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Wednesday, September 4, 2019 7:28 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Replication for SPDK (RAID 1E)

Hi Paul,
 
It’s great to know you’re working on RAID!

Ziye proposed a patch for RAID1, before the RAID bdev module existed, maybe more than a year ago.

IMHO, we will need a disk replacement feature and a degraded mode first. It may be difficult to use RAID1 without them, because RAID1 is for RAS.

We need a clean abstraction for RAID levels, i.e., extracting common operations and creating a function pointer table.

We need a copy feature between two bdevs.

Thanks
Shuhei

Sent from my iPhone

> On Sep 5, 2019, at 8:49, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
> 
> Hi Everyone,
> 
> I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:
> 
> 
> *       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter
> 
> *       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:
> 
> o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
> D0         D0
> D1         D1
> D2         D2
> 
> *       3 disk 1E with replication of 2:
> 
> D0         D0         D1
> 
> D1         D2         D2
> 
> D3         D3         D4
> 
> *       3 disk 1E with replication of 3:
> 
> D0         D0         D0
> 
> D1         D1         D1
> 
> D2         D2         D2
> 
> *       3 disk 1E with replication of 1 (RAID0)
> D0         D1         D2
> D3         D4         D5
> D6         D7         D8
> 
> *       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs
> 
> *       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open
> 
> *       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica
> 
> 
> 
> 
> Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
> 
> Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI
> 
> Thanks!
> Paul
> 
> PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
> 
> 
> 


* Re: [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  2:50 松本周平 / MATSUMOTO,SHUUHEI
  0 siblings, 0 replies; 7+ messages in thread
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2019-09-05  2:50 UTC (permalink / raw)
  To: spdk


But your idea is surely a good start!

Sent from my iPhone

> On Sep 5, 2019, at 10:36, 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com> wrote:
> 
> Hi Paul,
> 
> It’s great to know you’re working on RAID!
> 
> Ziye proposed a patch for RAID1, before the RAID bdev module existed, maybe more than a year ago.
> 
> IMHO, we will need a disk replacement feature and a degraded mode first. It may be difficult to use RAID1 without them, because RAID1 is for RAS.
> 
> We need a clean abstraction for RAID levels, i.e., extracting common operations and creating a function pointer table.
> 
> We need a copy feature between two bdevs.
> 
> Thanks
> Shuhei
> 
> Sent from my iPhone
> 
>> On Sep 5, 2019, at 8:49, Luse, Paul E <paul.e.luse(a)intel.com> wrote:
>> 
>> Hi Everyone,
>> 
>> I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:
>> 
>> 
>> *       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter
>> 
>> *       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:
>> 
>> o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
>> D0         D0
>> D1         D1
>> D2         D2
>> 
>> *       3 disk 1E with replication of 2:
>> 
>> D0         D0         D1
>> 
>> D1         D2         D2
>> 
>> D3         D3         D4
>> 
>> *       3 disk 1E with replication of 3:
>> 
>> D0         D0         D0
>> 
>> D1         D1         D1
>> 
>> D2         D2         D2
>> 
>> *       3 disk 1E with replication of 1 (RAID0)
>> D0         D1         D2
>> D3         D4         D5
>> D6         D7         D8
>> 
>> *       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs
>> 
>> *       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open
>> 
>> *       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica
>> 
>> 
>> 
>> 
>> Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
>> 
>> Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI
>> 
>> Thanks!
>> Paul
>> 
>> PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
>> 
>> 
>> 


* [SPDK] Replication for SPDK (RAID 1E)
@ 2019-09-05  0:48 Luse, Paul E
  0 siblings, 0 replies; 7+ messages in thread
From: Luse, Paul E @ 2019-09-05  0:48 UTC (permalink / raw)
  To: spdk


Hi Everyone,

I've got a pretty simple POC working and wanted to solicit any high level input before I get too far.  The idea is to, of course, start very basic but leave room for adding features later.  Here are the broad strokes:


*       Add a new RAID level to the existing RAID module with level "1E" that requires a "number of replicas" parameter

*       The cool thing about 1E is that we can use any number of disks and also pick the number of times the data is replicated so for example:

o   2 disk 1E with replication of 2 would be your basic 2 disk RAID1. Mapped as follows (columns are physical disks; Dn identifies data copies):
D0         D0
D1         D1
D2         D2

*       3 disk 1E with replication of 2:

D0         D0         D1

D1         D2         D2

D3         D3         D4

*       3 disk 1E with replication of 3:

D0         D0         D0

D1         D1         D1

D2         D2         D2

*       3 disk 1E with replication of 1 (RAID0)
D0         D1         D2
D3         D4         D5
D6         D7         D8

*       This scheme is obviously very flexible and can provide basic RAID1 without disk restrictions and also provide for some super paranoid configs

*       At the same time we could consider limiting, at least at first, the combinations of disks and replicas to minimize complexity and testing, but IMHO we should leave it wide open

*       An even cooler part of this is how well the current implementation lends itself to this.  A RAID0, behind the scenes, becomes a RAID1E with 1 replica




Initially we can start with just the RAID level addition (no notification of member failure, no rebuilds, spares, etc.), as I don't believe there's really any existing framework to support these kinds of features.  This is the main question I have for interested parties.  Would this be useful without any of the recovery-type features, or should we at least have some sort of async notification on member disk failure when num_replicas > 1?
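
To sketch what such a notification hook could look like (all names below are made up for illustration, not existing SPDK APIs):

/* Sketch of an async member-failure notification hook -- names are made up
 * for illustration and are not existing SPDK APIs. */
#include <stddef.h>
#include <stdint.h>

typedef void (*raid_member_failed_cb)(void *ctx, const char *raid_name,
                                      uint8_t member_idx);

struct raid_failure_notifier {
    raid_member_failed_cb cb;
    void *ctx;
};

/* Called from the I/O error path when a base bdev is declared failed.
 * With num_replicas > 1 the array could keep serving I/O from the surviving
 * copies while the application decides what to do. */
static void
raid_notify_member_failed(struct raid_failure_notifier *n,
                          const char *raid_name, uint8_t member_idx)
{
    if (n != NULL && n->cb != NULL) {
        n->cb(n->ctx, raid_name, member_idx);
    }
}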

Trello link: whether it is feasible for 19.10 or not depends on feedback from everyone on features :) https://trello.com/c/FR4iHAnI

Thanks!
Paul

PS: My current POC is super raw.  I have hardcoded the number of replicas to 3 and have what I believe is the correct block mapping for any # of disks and any # of replicas, but have only tested 3 replicas with 2 and 3 member disks using bdevperf w/verify. After I get it in presentable shape and flesh out the design and UT a bit more I'll post something.
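
PPS: For anyone who wants to play with the mapping idea, here is a minimal sketch of one slot-filling scheme that reproduces the tables above (illustration only, not the POC code; all names are made up). With num_replicas = 1 it degenerates to the RAID0 layout described earlier.

/* Illustrative mapping only -- not the POC code.  Slots are filled row by row
 * across the member disks, with each logical strip repeated num_replicas times. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct strip_location {
    uint8_t  disk;          /* which member disk this copy lands on       */
    uint64_t strip_on_disk; /* strip offset (row) within that member disk */
};

static struct strip_location
raid1e_map(uint64_t logical_strip, uint8_t replica,
           uint8_t num_disks, uint8_t num_replicas)
{
    uint64_t slot = logical_strip * num_replicas + replica;

    return (struct strip_location){
        .disk = (uint8_t)(slot % num_disks),
        .strip_on_disk = slot / num_disks,
    };
}

int
main(void)
{
    /* Reproduces the "3 disk 1E with replication of 2" table from above. */
    for (uint64_t d = 0; d < 5; d++) {
        for (uint8_t r = 0; r < 2; r++) {
            struct strip_location loc = raid1e_map(d, r, 3, 2);

            printf("D%" PRIu64 " copy %u -> disk %u, row %" PRIu64 "\n",
                   d, (unsigned)r, (unsigned)loc.disk, loc.strip_on_disk);
        }
    }
    return 0;
}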





