* [SPDK] Re: SPDK RAID5 support
@ 2019-10-04 15:31 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-04 15:31 UTC (permalink / raw)
  To: spdk


Agreed. And to save on emails, Artur, regarding your response to my email - awesome! And regarding DDF, yes, I meant "strict adherence to" - I totally agree there's a lot to borrow there, if nothing else to make sure we're covering all of our bases. Looking forward to seeing this get started, and I'm definitely going to grab an item or two off the backlog.

Thx
Paul

On 10/4/19, 6:39 AM, "Artur Paszkiewicz" <artur.paszkiewicz(a)intel.com> wrote:

    On 10/4/19 12:49 AM, 松本周平 / MATSUMOTO,SHUUHEI wrote:
    > Recently SPDK has started to support the DIF feature.
    > Is there any possibility of including extended LBA formats (block size = 512 + 8, 4096 + 128, etc.) in SPDK RAID?
    
    Do you mean DIF passthrough? I think this should be included.
    
    Thanks,
    Artur
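For illustration, a minimal sketch of what enabling DIF passthrough could check, assuming the RAID bdev simply requires every base bdev to expose the same extended-LBA layout; the helper name is hypothetical and only the spdk_bdev getters are real SPDK API:

    #include <stdbool.h>
    #include "spdk/bdev.h"
    #include "spdk/dif.h"

    /* Accept a set of base bdevs only if they all expose the same extended-LBA
     * layout (block size + interleaved metadata) and the same DIF type, so DIF
     * can simply be passed through by the RAID bdev. */
    static bool
    raid_bases_share_ext_lba_format(struct spdk_bdev **bases, int num_bases)
    {
            uint32_t blocklen = spdk_bdev_get_block_size(bases[0]);
            uint32_t md_size = spdk_bdev_get_md_size(bases[0]);
            bool interleaved = spdk_bdev_is_md_interleaved(bases[0]);
            enum spdk_dif_type dif = spdk_bdev_get_dif_type(bases[0]);

            for (int i = 1; i < num_bases; i++) {
                    if (spdk_bdev_get_block_size(bases[i]) != blocklen ||
                        spdk_bdev_get_md_size(bases[i]) != md_size ||
                        spdk_bdev_is_md_interleaved(bases[i]) != interleaved ||
                        spdk_bdev_get_dif_type(bases[i]) != dif) {
                            return false;
                    }
            }
            /* e.g. 512 + 8 or 4096 + 128 when md_size != 0 and interleaved */
            return true;
    }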

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-21 15:22 Artur Paszkiewicz
  0 siblings, 0 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2019-10-21 15:22 UTC (permalink / raw)
  To: spdk


On 10/11/19 5:32 PM, Liu, Xiaodong wrote:
> Great work, Artur!
> By the way, do you have any draft or RFC code for your RAID5 support, so we can get a better overall understanding of where the refactoring is headed?

You can find it here: https://github.com/apaszkie/spdk/tree/raid5_poc/lib/bdev/collections

Please keep in mind that the RAID5 code was just one piece of a larger POC project; I had to remove all the unrelated parts and cut out a lot of code before uploading. I won't try to port this 1:1, but rather use it as a reference for the new implementation.

Regards,
Artur


* [SPDK] Re: SPDK RAID5 support
@ 2019-10-16 12:19 Sasha Kotchubievsky
  0 siblings, 0 replies; 22+ messages in thread
From: Sasha Kotchubievsky @ 2019-10-16 12:19 UTC (permalink / raw)
  To: spdk


Thanks Paul,

I understand your point.
In any case, I would suggest considering support for RAID6 along with RAID5.

There is demand for a RAID5/6 solution based on SPDK, and I think this feature will get great feedback from the field.
Last year we evaluated a RAID5/6 implementation in SPDK. In our POC we used the RAID0 vbdev with our extensions. One of the challenges was getting the maximum possible performance in the read-modify-write scenario.

I'll definitely keep an eye on the coming patches, and my group will help with the feature as we can.

Best regards
Sasha 

-----Original Message-----
From: Luse, Paul E <paul.e.luse(a)intel.com> 
Sent: Wednesday, October 16, 2019 12:47 AM
To: Sasha Kotchubievsky <sashakot(a)dev.mellanox.co.il>; Storage Performance Development Kit <spdk(a)lists.01.org>
Cc: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
Subject: Re: [SPDK] Re: SPDK RAID5 support

Good feedback Sasha.  We need to be cautious about how much we "bite off" though. EC is quite different from RAID5 in terms of mapping and rebuild - the parity calculations are the easiest part of it since we can rely on ISA-L.

It should definitely be kept in mind, though; I just want to make sure we don't try to do too much right away. I've only gotten halfway through the chain so far, and I think it looks right on the money.

Thx
Paul
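To illustrate why the parity math itself is the small part, a minimal sketch of generating RAID5 parity for one strip with ISA-L's xor_gen(); the wrapper name is hypothetical, buffer length/alignment constraints are documented in ISA-L's raid.h, and pq_gen() covers the RAID6 P+Q case the same way:

    #include <isa-l.h>

    /* XOR num_data source strips of chunk_len bytes into parity.
     * ISA-L expects the destination as the last element of the vector array. */
    static int
    gen_raid5_parity(void **data, int num_data, void *parity, int chunk_len)
    {
            void *vects[num_data + 1];
            int i;

            for (i = 0; i < num_data; i++) {
                    vects[i] = data[i];
            }
            vects[num_data] = parity;

            return xor_gen(num_data + 1, chunk_len, vects); /* 0 on success */
    }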

On 10/15/19, 2:21 PM, "Sasha Kotchubievsky" <sashakot(a)dev.mellanox.co.il> wrote:

    Hi,

    It looks like a massive refactoring.

    In the case of such a big refactoring, I'd suggest thinking about a more general
    solution like erasure coding. RAID5/6 are trivial cases of erasure coding.
    I believe the "management" parts - disk degradation and recovery - are
    similar for EC and for RAID. The actual calculation of the parity block can be
    done in ISA-L under an abstraction layer (which can be replaced by
    any other implementation).


    Best regards

    Sasha
    
    



* [SPDK] Re: SPDK RAID5 support
@ 2019-10-15 21:21 Sasha Kotchubievsky
  0 siblings, 0 replies; 22+ messages in thread
From: Sasha Kotchubievsky @ 2019-10-15 21:21 UTC (permalink / raw)
  To: spdk


Hi,

It looks like a massive refactoring.

In the case of such a big refactoring, I'd suggest thinking about a more general
solution like erasure coding. RAID5/6 are trivial cases of erasure coding.
I believe the "management" parts - disk degradation and recovery - are
similar for EC and for RAID. The actual calculation of the parity block can be
done in ISA-L under an abstraction layer (which can be replaced by
any other implementation).


Best regards

Sasha
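A rough sketch of the kind of abstraction layer suggested above - a small ops table that an ISA-L software backend, a platform-tuned implementation, or a HW accelerator could plug into. Every name here is hypothetical, not an existing SPDK interface:

    #include <stddef.h>

    struct parity_backend {
            const char *name;
            /* RAID5: XOR num_src buffers of len bytes into dst. */
            int (*gen_p)(void **src, int num_src, void *dst, size_t len);
            /* RAID6/EC: generate P and Q; may be NULL if the backend
             * only supports single parity. */
            int (*gen_pq)(void **src, int num_src, void *p, void *q, size_t len);
    };

    /* The RAID/EC layer calls only through the selected backend, so the
     * implementation can be swapped without touching the I/O path. */
    void parity_backend_register(const struct parity_backend *backend);
    const struct parity_backend *parity_backend_get(const char *name);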

On 13-Oct-19 9:18 PM, Luse, Paul E wrote:
> Note that there's a patch series up there now from Artur as well; I'll be reviewing it myself for the first time here shortly.
>
> https://review.gerrithub.io/c/spdk/spdk/+/471075/1
>


* [SPDK] Re: SPDK RAID5 support
@ 2019-10-14 17:43 Harris, James R
  0 siblings, 0 replies; 22+ messages in thread
From: Harris, James R @ 2019-10-14 17:43 UTC (permalink / raw)
  To: spdk




On 10/13/19, 2:28 AM, "Sasha Kotchubievsky" <sashakot(a)dev.mellanox.co.il> wrote:

    Hi,

    I'm very excited about the progress in RAID development in SPDK.

    Artur, will you focus on RAID5 only, or will RAID6 also be supported?

    I think it's important to keep the existing SPDK approach for configuration.
    It would be nice to see an abstraction for parity calculation. I believe the calculation can be optimized for a specific platform, or even offloaded to HW accelerators.

[Jim]  Agreed.  To start we may use something similar to CRC calculations (lib/util/crc32c.c), where we pick a parity calculation implementation at compile time.  Longer term, something more dynamic like the SPDK copy_engine (for memcopies) or DPDK framework (for crypto/compression) would be even nicer.
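A minimal sketch of the compile-time selection Jim describes, modeled loosely on lib/util/crc32c.c; the SPDK_CONFIG_ISAL switch mirrors what that file does, while the function name and scalar fallback are hypothetical:

    #include <stddef.h>
    #include <stdint.h>
    #ifdef SPDK_CONFIG_ISAL
    #include <isa-l.h>
    #endif

    /* XOR all source vectors into the last vector (the parity buffer). */
    static inline int
    raid_xor_gen(void **vects, int num_vects, size_t len)
    {
    #ifdef SPDK_CONFIG_ISAL
            return xor_gen(num_vects, (int)len, vects);
    #else
            /* Portable scalar fallback. */
            uint8_t *dst = vects[num_vects - 1];

            for (size_t i = 0; i < len; i++) {
                    uint8_t x = 0;

                    for (int v = 0; v < num_vects - 1; v++) {
                            x ^= ((uint8_t *)vects[v])[i];
                    }
                    dst[i] = x;
            }
            return 0;
    #endif
    }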

    In the case of distributed storage, even network cards already on the market can optimize RAID-related operations, and I believe the next, upcoming generation will take these capabilities to the next level. That needs some support from the bdev layer, like events about configuration changes or recovery/degradation.
    
    
    Best regards
    Sasha
    



* [SPDK] Re: SPDK RAID5 support
@ 2019-10-13 18:18 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-13 18:18 UTC (permalink / raw)
  To: spdk


Note that there's a patch series up there now from Artur as well; I'll be reviewing it myself for the first time here shortly.

https://review.gerrithub.io/c/spdk/spdk/+/471075/1 

On 10/13/19, 10:39 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:

    Awesome! Good feedback, keep an eye out for patches and check out trello https://trello.com/b/4HEkWVvF/raid for backlog items...
    
    Thx
    Paul
    



* [SPDK] Re: SPDK RAID5 support
@ 2019-10-13 17:39 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-13 17:39 UTC (permalink / raw)
  To: spdk


Awesome! Good feedback, keep an eye out for patches and check out trello https://trello.com/b/4HEkWVvF/raid for backlog items...

Thx
Paul

On 10/13/19, 2:28 AM, "Sasha Kotchubievsky" <sashakot(a)dev.mellanox.co.il> wrote:

    Hi,

    I'm very excited about the progress in RAID development in SPDK.

    Artur, will you focus on RAID5 only, or will RAID6 also be supported?

    I think it's important to keep the existing SPDK approach for configuration.
    It would be nice to see an abstraction for parity calculation. I believe the calculation can be optimized for a specific platform, or even offloaded to HW accelerators.
    In the case of distributed storage, even network cards already on the market can optimize RAID-related operations, and I believe the next, upcoming generation will take these capabilities to the next level. That needs some support from the bdev layer, like events about configuration changes or recovery/degradation.
    
    
    Best regards
    Sasha
    



* [SPDK] Re: SPDK RAID5 support
@ 2019-10-13  9:26 Sasha Kotchubievsky
  0 siblings, 0 replies; 22+ messages in thread
From: Sasha Kotchubievsky @ 2019-10-13  9:26 UTC (permalink / raw)
  To: spdk


Hi,

I'm very excited about the progress in RAID development in SPDK.

Artur, will you focus on RAID5 only, or will RAID6 also be supported?

I think it's important to keep the existing SPDK approach for configuration.
It would be nice to see an abstraction for parity calculation. I believe the calculation can be optimized for a specific platform, or even offloaded to HW accelerators.
In the case of distributed storage, even network cards already on the market can optimize RAID-related operations, and I believe the next, upcoming generation will take these capabilities to the next level. That needs some support from the bdev layer, like events about configuration changes or recovery/degradation.


Best regards
Sasha

-----Original Message-----
From: 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com> 
Sent: Friday, October 4, 2019 1:49 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Cc: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
Subject: [SPDK] Re: SPDK RAID5 support

Hi Artur, Paul, and All,

Thank you so much, I'm excited to know this.

Recently SPDK has started to support the DIF feature.
Is there any possibility of including extended LBA formats (block size = 512 + 8, 4096 + 128, etc.) in SPDK RAID?
Do you have any comment?

Thanks,
Shuhei

________________________________
From: Luse, Paul E <paul.e.luse(a)intel.com>
Sent: October 4, 2019, 4:20
To: Storage Performance Development Kit <spdk(a)lists.01.org>
CC: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
Subject: [SPDK] Re: SPDK RAID5 support

Hi Artur,

Thanks, I think this can be an awesome contribution.  A few other things to consider - you mention some of these already, so I just added some more color.  Moving forward, I think it would be good to put a Trello board up with a backlog of tasks so that others (like me) can jump in and help as well.  I'd still like to get RAID1E in there, but it makes little sense to do it before any major refactoring.

I think it makes sense, if you guys are ready, to start putting together the backlog and knocking out a large series of small patches to get the refactoring done.

* The current RAID0 has no config on disk. We'll need to come up with a scheme for handling an existing RAID0 configured out of band alongside RAID5 using COD (like not allowing a RAID0 to be built on a set with COD, etc.).
* The metadata layout, as you know from working on previous RAID projects, needs to be thought out carefully to consider not only extensibility but also version control for backward compatibility and handling of conflicting COD when it is found.  For example: you have a 3-disk RAID5, take one of the disks out and use it in another array somewhere, then bring it back later and fire up the original 3; the metadata has to have sufficient info to know who belongs to what, which volumes to create, and which to put in some sort of offline state. I don't think we need that kind of capability right up front (deciding how to deal with conflicts), but the metadata should have enough information in it up front (a rough sketch follows this list). DDF was brought up in the earlier RAID discussions, so just to make sure we're all on the same page: I see no value in complying with that spec. Open to other thoughts though.
* One thing to keep in mind: we probably want to retain common RPC code.
* We should keep migration in mind as well - not that it's something we will necessarily ever need or want, but lots of reserved space in the metadata for tracking state related to migrations and rebuilds is needed.
* Similar to the RAID1E discussions earlier, there are some features you mention as 'future' that most would consider a requirement for redundant RAID - like degraded operation and rebuild. Those don't all have to go in at the same time; however, without that minimum set we need to mark the feature experimental so nobody tries to use it thinking it has those basics.
* I can't remember if we talked about unit tests or not, but be sure the backlog includes getting solid UT coverage in there up front.  The existing UT code will likely need some refactoring as well to support the functional code refactoring.
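As a purely illustrative sketch of the metadata points above (not a proposed format and not DDF), a versioned superblock with array/member identity for conflict detection and reserved space for future migration/rebuild state might look roughly like this; all names and sizes are hypothetical:

    #include <stdint.h>

    #define RAID_SB_MAGIC 0x53504452414944ULL   /* hypothetical magic, "SPDRAID" */

    struct raid_member_sb {
            uint64_t magic;
            uint32_t version_major;       /* bump on incompatible layout changes */
            uint32_t version_minor;
            uint8_t  array_uuid[16];      /* which array this member belongs to */
            uint8_t  member_uuid[16];     /* identity of this particular member */
            uint32_t raid_level;
            uint32_t num_members;
            uint32_t member_index;
            uint32_t strip_size_kb;
            uint64_t seq_number;          /* newest copy wins when COD conflicts */
            uint64_t array_size_blocks;
            uint64_t state_flags;         /* online / offline / degraded / ... */
            uint8_t  reserved[3968];      /* room for migration/rebuild state */
            uint32_t crc;                 /* checksum of this structure */
    };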

-Paul

On 10/3/19, 3:00 AM, "Artur Paszkiewicz" <artur.paszkiewicz(a)intel.com> wrote:

    Hi all,

    We want to add RAID5 support to SPDK. My team has experience with other RAID
    projects, primarily with Linux MD RAID, which we actively develop and support
    for Intel VROC. We already have an initial SPDK RAID5 implementation created
    for an internal project. It has working read/write, including partial-stripe
    updates, parity calculation and reconstruct-reads.

    Currently in SPDK there exists a RAID bdev module, which has only RAID0
    functionality. This can be used as a basis for a more generic RAID stack. Here
    is our idea how to approach this:

    1. Refactor the bdev_raid module to separate RAID0-specific I/O handling code
    from more generic parts - configuration, bdev creation, etc. Move the RAID0
    code to a new file. Use RAID level-specific callbacks, similar to existing
    struct raid_fn_table. This architecture is also used in MD RAID drivers, where
    different RAID "personalities" work on top of a common layer.

    2. Add RAID5 support in another file, similarly to RAID0. Port our current
    RAID5 code to this new framework.

    3. Incrementally add new functionalities. At this point, probably the most
    important will be support for member drive failure and degraded operation, RAID
    rebuild and some form of on-disk metadata.

    Any comments or suggestions are welcome.

    Thanks,
    Artur
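A hypothetical sketch of the per-level callback table described in point 1 above; the member names are illustrative and do not reflect the actual bdev_raid definitions:

    struct raid_bdev;
    struct raid_bdev_io;

    enum raid_level {
            RAID_LEVEL_0 = 0,
            RAID_LEVEL_5 = 5,
    };

    /* One instance per supported RAID level, registered with the common layer,
     * much like MD "personalities" plug into the core MD driver. */
    struct raid_level_ops {
            enum raid_level level;
            /* Validate member count/sizes and prepare per-level state. */
            int  (*start)(struct raid_bdev *raid_bdev);
            void (*stop)(struct raid_bdev *raid_bdev);
            /* Map and submit a read/write to the right member(s); RAID5 adds
             * parity updates and reconstruct-reads here. */
            void (*submit_rw_request)(struct raid_bdev_io *raid_io);
    };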


* [SPDK] Re: SPDK RAID5 support
@ 2019-10-13  8:56 Sasha Kotchubievsky
  0 siblings, 0 replies; 22+ messages in thread
From: Sasha Kotchubievsky @ 2019-10-13  8:56 UTC (permalink / raw)
  To: spdk


Hi David,

I have a couple of questions regarding your proposal:

- As far as I can see, DRBD is under GPL v2, while most SPDK sources are
under the BSD 3-clause license. Under which license can the interfaces between
SPDK and DRBD be released?

- Is the DRBD implementation over RDMA (RoCEv2, IB) publicly available?

- Is it possible to run the latency test using fio rather than dd? The latency
reported in the table below looks too low to me. It's better to check
IOPS, BW and latency using a set of queue depth values like 1, 4, 16, etc.
Queue depth = 1 will show very clearly the latency added by the DRBD protocol.


I'd say that coupling SPDK with a kernel-based solution in general, and with
DRBD in particular, will prevent possible network-layer optimizations for
both the TCP and RDMA cases. It will also add complexity to deployment.


Best regards

Sasha


On 08-Oct-19 9:25 PM, David Butterfield wrote:
> On 10/3/19 2:44 PM, Marushak, Nathan wrote:
>> Do you happen to have any performance and efficiency details? While the port you did was done with minimal changes, great work by the way, we have typically seen that most existing SW architectures require real changes to provide the necessary performance and efficiency improvements required for today's NVM and Networking performance.
> Diagram: https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf
>
> Hi Nathan, Paul:
>
> After your replies to my earlier message, I did some looking at the DRBD code and took some very
> simple measurements; below is what I found.
>
> Regarding the idea of importing a kernel RAID implementation rather than writing a new one:
>
> I would consider it seriously -- that's an awful lot of logic that might not have to be reimple-
> mented, and then matured a couple of years before it's ready for production use with critical
> data.  And going forward, instead of two implementations to be maintained in parallel, a single
> common implementation that behaves consistently whether it's running in the kernel or in a
> usermode process.  That's a lot of potential value to be weighed along with other factors.
>
> Clearly it is essential that it must be able to perform very well within an SPDK process and
> operate smoothly within the datapath.  If it can't do that, then there's little choice but to
> redevelop it despite the cost.  But it seems worth looking for ways to shorten what is likely to
> be a fairly long grind.
>
> I'm not familiar with the other kernel RAID modules, so I'll refer here in terms of DRBD, which
> includes around 50,000 lines of kernel code.  But I see no reason to expect other kernel RAID
> modules to be any harder to port to usermode than DRBD was (or SCST at ~80,000 lines of code).
>
> What are the dimensions of architectural concern?  In writing interface shims between SPDK and
> DRBD I noticed three main areas of mismatch (elaborated further below):
>
>    (1) Mismatch between the SPDK bdev protocol and the bio protocol expected by DRBD
>    (2) Network interface between DRBD replication peers
>    (3) Mismatch between SPDK and kernel mechanisms for threading and work handoff
>
> Are there additional areas I should watch out for that have caused trouble in past efforts?
>
> Performance
> ===========
> This is a home project, and I don't have equipment to carry out robust performance testing; but
> I've started by taking a few very simple measurements to estimate the time DRBD takes to process
> a Read operation, which is assumed to be the same as the increase in Read operation response-
> time when DRBD is inserted into an SPDK bdev chain.
>
> My SPDK test machine is a laptop with a 4-threaded Core i5-2520M @ 2.5GHz with 4GiB of DDR3 1333
> (0.8 ns) RAM (barely enough RAM to run the SPDK iSCSI server).  The laptop is running Ubuntu
> 18.04.1 and kernel 5.0.0-29.  I ran the SPDK iSCSI server (compiled -O3 without DEBUG) with a
> single reactor thread on CPU0.
>
> I used dd(1) to do some simple 4KiB sequential Read tests from raw LUNs on the SPDK iSCSI
> server, configured with LUNs 0, 4, and 6 (all backed by bdev_malloc) as shown in the diagram.
> I used the same machine for the initiator and the server, connecting to the IP address of its
> own (Intel 82579V) Ethernet interface to avoid the 1 Gb Ethernet bottleneck.
>
> For these measurements the DRBD server was not connected to a peer, so replication over the
> network was not active.  This should be the fastest path through DRBD to its backing storage,
> without interference by replication or network considerations.  (Also, since this was a Read
> test only, nothing should be going over the peer-to-peer network anyway)
>
> Each LUN was given 16 trials (16 runs of dd), each time reading 130,000 4KiB blocks (~1/2 GB)
> through the iSCSI server, ultimately from an instance of bdev_malloc (the SPDK ramdisk).  Shown
> below for each LUN are the fastest trial out of its 16 trials, and also the average (mean) of
> the top five fastest trials out of the 16.  From the MBPS reported by dd(1) I have also
> calculated the microseconds per 4KiB Read operation:
>
>                  MBPS Reported by dd(1)                  Microseconds per OP
> LUN Config      Best_of_16_trials   Avg(Top_5_trials)   Best    5Avg   Delta
> --- ------      -----------------   -----------------   ----    ----   -----
>   0  Malloc             1435                1412         2.85    2.90
>   4  bio                1006                 989         4.07    4.14    1.24
>   6  bio+DRBD            893                 887         4.59    4.62    0.48
>     (DRBD without bio)                     (1212)               (3.38)
>
> LUN0 is a standard SPDK bdev_malloc instance, to compare with the other measurements.
>
> LUN4 adds bdev_bio and bio_spdk instances back-to-back ahead of the bdev_malloc instance,
> translating each request to kernel bio protocol and back before it gets to bdev_malloc (see
> diagram).  The timing difference between LUN4 and LUN0 therefore represents per-OP overhead
> contributed by those two modules taken together.
>
> Note that the overhead is not much in the bdev protocol translation; I assume it's mostly wakeup
> latency.  The bdev_bio and bio_spdk modules each do handoffs of requests and responses from one
> thread to another:  bdev_bio hands requests off from an SPDK thread to a DRBD/UMC thread through
> a queue_work() call; and bio_spdk hands requests off from a DRBD/UMC thread to an SPDK thread
> through a call to spdk_thread_send_msg().  And similarly in the reply direction.  So the per-OP
> timing difference of 1.24 microseconds between LUN4 and LUN0 includes four thread handoffs, two
> of them with wakeup latency.
>
> LUN6 adds DRBD into the configuration of LUN4, between the bdev_bio and bio_spdk instances.
> So the timing difference of 0.48 microseconds between LUN4 and LUN6 represents per-OP processing
> contributed by DRBD.
>
> (DRBD without bio) shows calculated hypothetical timing with DRBD alone, subtracting the
> bdev_bio and bio_spdk translations and wakeup latencies.  This is the expected timing if DRBD is
> modified to run on SPDK threads (discussed below), eliminating the thread context switches.
> [Calculated as (4.62 - 4.14) + 2.90 = 3.38 and hypothetical MBPS back-calculated from that.]
>
>  From these measurements, DRBD appears to be adding about 480 nanoseconds of processing time per
> Read operation -- about a 17% increase over the straight bdev_malloc device for a 4KiB Read.
> For larger reads the 480 ns should represent smaller percentages.
>
> Finally, I re-ran all of the above experiments twice more with substantially the same results.
> All the "best" and "average(best_5)" results for each LUN were within 3% of each other across
> re-runs of the experiments (all but one were within 2%).
>
> [Because the test times were fairly short, random scheduling events with heavy impact but low
> frequency can spoil any particular run with performance far below average.  Test time depends on
> the length of the volume, which is difficult to enlarge because I'm backing with a bdev_malloc
> instance on a machine with 4GiB RAM.  That is why I chose to average the top 5 out of 16 -- to
> drop spoiled runs, yet not rely completely on one "best" run.  A large spread between the "best"
> and the "average(best_5)" would indicate that the "best" time was unusually high.]
>
> ================================================================================================
>
> (1) Mismatch between the SPDK bdev protocol and the bio protocol expected by DRBD
>
> This one is pretty trivial.  Most usage in DRBD of the bio structure and protocol is
> concentrated in six places in the code:  two near the "top" where client requests arrive, two
> near the "bottom" where requests to backing storage are issued, and two near the "middle" where
> peer-to-peer communication occurs.  These areas together total around 500 lines of code.  I
> reckon a couple hundred lines of new code could be added under #ifdef to change these places to
> understand the SPDK bdev structures and calls instead of kernel bio structures and calls.
>
> (2) Network interface to DRBD replication peers
>
> In the demo prototype the iSCSI network I/O is done using the SPDK/DPDK networking facility; but
> DRBD continues to implement network I/O to replication peers using socket(7) calls.  This is
> mainly because I didn't need to change that to get the prototype running.
>
> The implementation of the peer transport service within DRBD is isolated behind a DRBD-internal
> transport ops vector, so it's already designed to be easily replaced with other transport
> implementations.  The implementation in drbd_transport_tcp.c would be replaced with one that is
> nonblocking and issues SPDK networking calls instead of socket(7) calls.
>
> Some changes are probably needed to the peer receive-side logic so that it can operate using
> non-blocking network I/O only.  There is already a dispatcher "drbdd()" that calls service
> functions based on the inter-peer command type in the incoming header; but those functions then
> know how much additional data they want, and call for it synchronously.  They may have to have
> their post-recv processing split out into callback functions to be called (on a reactor thread)
> when the amount of data they want is available from the network.  This will be a bit of a chore,
> because there are a good few of them; but it's straightforward and the rearranged code could
> work for both kernel and usermode and still be clean without needing #ifdefs in each place.
>
> (3) Mismatch between SPDK and kernel mechanisms for threading and work handoff
>
> DRBD already issues backing store I/O operations for asynchronous completion from a small set of
> threads (i.e. it does *not* use a large number of threads each doing a synchronous I/O call).
>
> I'll use replicated Write for discussion because it is the more complicated case.  Consider a
> set of DRBD servers acting as peers in a network serving some storage resource.  They maintain
> network connections with each other while replication of the resource is active.  DRBD has some
> service threads associated with each of these connections (which could go away under SPDK).
>
> One of the peer-connection service threads is a "drbd_sender" thread.  An incoming Write request
> on a resource is processed by
>      (1a) queueing a copy of the request to the work queue of the drbd_sender thread for each of
>           the connected peers for that resource;
>      (1b) waking up those drbd_sender threads;
>      (2a) attempting a "fast-track" submission of the I/O request to the local backing store;
>      (2b) if (2a) fails, queueing the I/O request to a "drbd_submitter" thread.
>           (I don't know the relative frequency of (2b) as compared with (2a))
>
> The queueing at (2b) is currently done using the kernel queue_work() interface, which does the
> wakeup and arrives (once for each call to queue_work) at the specified function on the work
> queue's service thread.  So I think it is already in the right model and can be simply #ifdef'd
> to use spdk_thread_send_msg() in place of queue_work(), and let a reactor thread do the I/O
> submission to the backing store (eliminating the "drbd-submitter" work_queue service thread).
>
> The queueing at (1a) is a little more involved, but I have sketched out the changes on paper and
> they are straightforward, maybe 100 lines of added code under #ifdef.  The top-level sender
> function with the loop that waits for work and then services it has to be split, with the
> "service" part getting called by the SPDK reactor thread in response to a spdk_thread_send_msg()
> call.  Steps (1a) and (1b) on the requesting thread are then replaced with a call to
> spdk_thread_send_msg().
>
> Or, the requesting thread may call the sender service function directly under some or all
> conditions (to be analyzed).  Either way the drbd_sender threads would also be eliminated,
> replaced by reactor threads running a nonblocking network transport implementation.
>
> In any case I believe the code changes needed to fit the DRBD datapath smoothly into the SPDK
> model, with commensurate performance, would be a tiny fraction of the 50,000 lines of relatively
> mature code that would then be available for use in the SPDK environment.
>
> Regards,
> David Butterfield
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-11 15:37 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-11 15:37 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 1351 bytes --]

And if it's not up there too, a Trello board w/ backlog items, if it makes sense, so everyone can join the party :)

On 10/11/19, 8:33 AM, "Liu, Xiaodong" <xiaodong.liu(a)intel.com> wrote:

    Great work! Artur.
    By the way, do you have some draft or RFC code about your RAID5 support? That way we can get a better overall understanding of where the refactoring is heading.
    
     --Thanks
    From Xiaodong
    
    
    -----Original Message-----
    From: Artur Paszkiewicz [mailto:artur.paszkiewicz(a)intel.com] 
    Sent: Friday, October 11, 2019 9:07 PM
    To: spdk(a)lists.01.org
    Cc: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
    Subject: [SPDK] Re: SPDK RAID5 support
    
    Hi all,
    
    I just sent a first series of patches with raid_bdev refactoring:
    https://review.gerrithub.io/c/spdk/spdk/+/471075
    
    Please let me know what you think.
    
    Thanks,
    Artur
    _______________________________________________
    SPDK mailing list -- spdk(a)lists.01.org
    To unsubscribe send an email to spdk-leave(a)lists.01.org
    _______________________________________________
    SPDK mailing list -- spdk(a)lists.01.org
    To unsubscribe send an email to spdk-leave(a)lists.01.org
    


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-11 15:32 Liu, Xiaodong
  0 siblings, 0 replies; 22+ messages in thread
From: Liu, Xiaodong @ 2019-10-11 15:32 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 883 bytes --]

Great work! Artur.
By the way, do you have some draft or RFC code about your RAID5 support? That way we can get a better overall understanding of where the refactoring is heading.

 --Thanks
From Xiaodong


-----Original Message-----
From: Artur Paszkiewicz [mailto:artur.paszkiewicz(a)intel.com] 
Sent: Friday, October 11, 2019 9:07 PM
To: spdk(a)lists.01.org
Cc: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
Subject: [SPDK] Re: SPDK RAID5 support

Hi all,

I just sent a first series of patches with raid_bdev refactoring:
https://review.gerrithub.io/c/spdk/spdk/+/471075

Please let me know what you think.

Thanks,
Artur
_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-11 13:08 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-11 13:08 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]

Great!! I'll check them out later today...

On 10/11/19, 6:07 AM, "Artur Paszkiewicz" <artur.paszkiewicz(a)intel.com> wrote:

    Hi all,
    
    I just sent a first series of patches with raid_bdev refactoring:
    https://review.gerrithub.io/c/spdk/spdk/+/471075
    
    Please let me know what you think.
    
    Thanks,
    Artur
    _______________________________________________
    SPDK mailing list -- spdk(a)lists.01.org
    To unsubscribe send an email to spdk-leave(a)lists.01.org
    


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-11 13:07 Artur Paszkiewicz
  0 siblings, 0 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2019-10-11 13:07 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 184 bytes --]

Hi all,

I just sent a first series of patches with raid_bdev refactoring:
https://review.gerrithub.io/c/spdk/spdk/+/471075

Please let me know what you think.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-08 20:21 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-08 20:21 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 15747 bytes --]


    Hi David,
    
    Thanks for the additional information. FYI the solution that was discussed in the last community meeting is actually mostly already written, so we're not exactly starting from scratch (and it was also written by folks with prior RAID experience). And, this is one of the most important points, it was written for SPDK.  The idea of pushing it for upstream consideration is what's new here. You have a great point though; many of us have lots of experience developing & supporting various RAID stacks from host-based RAID to embedded RAID. It ain't easy.
    
    Wrt architectural concerns, you sorta hit on something when you mention writing shims.  We never want SPDK to become a collection of other source modules stapled together in order to save development time.  We want everything in SPDK to be written to natively bolt into our threading model, to use our same RPC mechanisms for configuration, to be completely lockless (at least for the IO path), etc. Chances are pretty good that any kernel-based implementation won't be designed this way. 
    
    It's also important to consider that when code is merged into the SPDK project, the maintainers (and the community as a whole) are signing up to maintain it and extend it in perpetuity. Having any module that is of a different architecture, or even style, makes that support/maintenance process a nightmare as SPDK has many, many modules. 
    
    So although porting the kernel-based module very well may be a good solution for someone to bolt in on their own, it's just not likely a good fit for upstreaming into SPDK. And the community would support (and has on many occasions) an activity like that; it would just need to live in their repo. That's the beauty of the bdev layer :)
    
    Thx
    Paul
    
    On 10/8/19, 11:26 AM, "David Butterfield" <dab21774(a)gmail.com> wrote:
    
        On 10/3/19 2:44 PM, Marushak, Nathan wrote:
        > Do you happen to have any performance and efficiency details? While the port you did was done with minimal changes, great work by the way, we have typically seen that most existing SW architectures require real changes to provide the necessary performance and efficiency improvements required for today's NVM and Networking performance.
        
        Diagram: https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf
        
        Hi Nathan, Paul:
        
        After your replies to my earlier message, I did some looking at the DRBD code and took some very
        simple measurements; below is what I found.
        
        Regarding the idea of importing a kernel RAID implementation rather than writing a new one:
        
        I would consider it seriously -- that's an awful lot of logic that might not have to be reimple-
        mented, and then matured a couple of years before it's ready for production use with critical
        data.  And going forward, instead of two implementations to be maintained in parallel, a single
        common implementation that behaves consistently whether it's running in the kernel or in a
        usermode process.  That's a lot of potential value to be weighed along with other factors.
        
        Clearly it is essential that it be able to perform very well within an SPDK process and
        operate smoothly within the datapath.  If it can't do that, then there's little choice but to
        redevelop it despite the cost.  But it seems worth looking for ways to shorten what is likely to
        be a fairly long grind.
        
        I'm not familiar with the other kernel RAID modules, so I'll refer here in terms of DRBD, which
        includes around 50,000 lines of kernel code.  But I see no reason to expect other kernel RAID
        modules to be any harder to port to usermode than DRBD was (or SCST at ~80,000 lines of code).
        
        What are the dimensions of architectural concern?  In writing interface shims between SPDK and
        DRBD I noticed three main areas of mismatch (elaborated further below):
        
          (1) Mismatch between the SPDK bdev protocol and the bio protocol expected by DRBD
          (2) Network interface between DRBD replication peers
          (3) Mismatch between SPDK and kernel mechanisms for threading and work handoff
        
        Are there additional areas I should watch out for that have caused trouble in past efforts?
        
        Performance
        ===========
        This is a home project, and I don't have equipment to carry out robust performance testing; but
        I've started by taking a few very simple measurements to estimate the time DRBD takes to process
        a Read operation, which is assumed to be the same as the increase in Read operation response-
        time when DRBD is inserted into an SPDK bdev chain.
        
        My SPDK test machine is a laptop with a 4-threaded Core i5-2520M @ 2.5GHz with 4GiB of DDR3 1333
        (0.8 ns) RAM (barely enough RAM to run the SPDK iSCSI server).  The laptop is running Ubuntu
        18.04.1 and kernel 5.0.0-29.  I ran the SPDK iSCSI server (compiled -O3 without DEBUG) with a
        single reactor thread on CPU0.
        
        I used dd(1) to do some simple 4KiB sequential Read tests from raw LUNs on the SPDK iSCSI
        server, configured with LUNs 0, 4, and 6 (all backed by bdev_malloc) as shown in the diagram.
        I used the same machine for the initiator and the server, connecting to the IP address of its
        own (Intel 82579V) Ethernet interface to avoid the 1 Gb Ethernet bottleneck.
        
        For these measurements the DRBD server was not connected to a peer, so replication over the
        network was not active.  This should be the fastest path through DRBD to its backing storage,
        without interference by replication or network considerations.  (Also, since this was a Read
        test only, nothing should be going over the peer-to-peer network anyway)
        
        Each LUN was given 16 trials (16 runs of dd), each time reading 130,000 4KiB blocks (~1/2 GB)
        through the iSCSI server, ultimately from an instance of bdev_malloc (the SPDK ramdisk).  Shown
        below for each LUN are the fastest trial out of its 16 trials, and also the average (mean) of
        the top five fastest trials out of the 16.  From the MBPS reported by dd(1) I have also
        calculated the microseconds per 4KiB Read operation:
        
                        MBPS Reported by dd(1)                  Microseconds per OP
        LUN Config      Best_of_16_trials   Avg(Top_5_trials)   Best    5Avg   Delta
        --- ------      -----------------   -----------------   ----    ----   -----
         0  Malloc             1435                1412         2.85    2.90
         4  bio                1006                 989         4.07    4.14    1.24
         6  bio+DRBD            893                 887         4.59    4.62    0.48
           (DRBD without bio)                     (1212)               (3.38)
        
        LUN0 is a standard SPDK bdev_malloc instance, to compare with the other measurements.
        
        LUN4 adds bdev_bio and bio_spdk instances back-to-back ahead of the bdev_malloc instance,
        translating each request to kernel bio protocol and back before it gets to bdev_malloc (see
        diagram).  The timing difference between LUN4 and LUN0 therefore represents per-OP overhead
        contributed by those two modules taken together.
        
        Note that the overhead is not much in the bdev protocol translation; I assume it's mostly wakeup
        latency.  The bdev_bio and bio_spdk modules each do handoffs of requests and responses from one
        thread to another:  bdev_bio hands requests off from an SPDK thread to a DRBD/UMC thread through
        a queue_work() call; and bio_spdk hands requests off from a DRBD/UMC thread to an SPDK thread
        through a call to spdk_thread_send_msg().  And similarly in the reply direction.  So the per-OP
        timing difference of 1.24 microseconds between LUN4 and LUN0 includes four thread handoffs, two
        of them with wakeup latency.
        
        LUN6 adds DRBD into the configuration of LUN4, between the bdev_bio and bio_spdk instances.
        So the timing difference of 0.48 microseconds between LUN4 and LUN6 represents per-OP processing
        contributed by DRBD.
        
        (DRBD without bio) shows calculated hypothetical timing with DRBD alone, subtracting the
        bdev_bio and bio_spdk translations and wakeup latencies.  This is the expected timing if DRBD is
        modified to run on SPDK threads (discussed below), eliminating the thread context switches.
        [Calculated as (4.62 - 4.14) + 2.90 = 3.38 and hypothetical MBPS back-calculated from that.]
        
        From these measurements, DRBD appears to be adding about 480 nanoseconds of processing time per
        Read operation -- about a 17% increase over the straight bdev_malloc device for a 4KiB Read.
        For larger reads the 480 ns should represent smaller percentages.
        
        Finally, I re-ran all of the above experiments twice more with substantially the same results.
        All the "best" and "average(best_5)" results for each LUN were within 3% of each other across
        re-runs of the experiments (all but one were within 2%).
        
        [Because the test times were fairly short, random scheduling events with heavy impact but low
        frequency can spoil any particular run with performance far below average.  Test time depends on
        the length of the volume, which is difficult to enlarge because I'm backing with a bdev_malloc
        instance on a machine with 4GiB RAM.  That is why I chose to average the top 5 out of 16 -- to
        drop spoiled runs, yet not rely completely on one "best" run.  A large spread between the "best"
        and the "average(best_5)" would indicate that the "best" time was unusually high.]
        
        ================================================================================================
        
        (1) Mismatch between the SPDK bdev protocol and the bio protocol expected by DRBD
        
        This one is pretty trivial.  Most usage in DRBD of the bio structure and protocol is
        concentrated in six places in the code:  two near the "top" where client requests arrive, two
        near the "bottom" where requests to backing storage are issued, and two near the "middle" where
        peer-to-peer communication occurs.  These areas together total around 500 lines of code.  I
        reckon a couple hundred lines of new code could be added under #ifdef to change these places to
        understand the SPDK bdev structures and calls instead of kernel bio structures and calls.
        
        (2) Network interface to DRBD replication peers
        
        In the demo prototype the iSCSI network I/O is done using the SPDK/DPDK networking facility; but
        DRBD continues to implement network I/O to replication peers using socket(7) calls.  This is
        mainly because I didn't need to change that to get the prototype running.
        
        The implementation of the peer transport service within DRBD is isolated behind a DRBD-internal
        transport ops vector, so it's already designed to be easily replaced with other transport
        implementations.  The implementation in drbd_transport_tcp.c would be replaced with one that is
        nonblocking and issues SPDK networking calls instead of socket(7) calls.
        
        Some changes are probably needed to the peer receive-side logic so that it can operate using
        non-blocking network I/O only.  There is already a dispatcher "drbdd()" that calls service
        functions based on the inter-peer command type in the incoming header; but those functions then
        know how much additional data they want, and call for it synchronously.  They may have to have
        their post-recv processing split out into callback functions to be called (on a reactor thread)
        when the amount of data they want is available from the network.  This will be a bit of a chore,
        because there are a good few of them; but it's straightforward and the rearranged code could
        work for both kernel and usermode and still be clean without needing #ifdefs in each place.
        
        (3) Mismatch between SPDK and kernel mechanisms for threading and work handoff
        
        DRBD already issues backing store I/O operations for asynchronous completion from a small set of
        threads (i.e. it does *not* use a large number of threads each doing a synchronous I/O call).
        
        I'll use replicated Write for discussion because it is the more complicated case.  Consider a
        set of DRBD servers acting as peers in a network serving some storage resource.  They maintain
        network connections with each other while replication of the resource is active.  DRBD has some
        service threads associated with each of these connections (which could go away under SPDK).
        
        One of the peer-connection service threads is a "drbd_sender" thread.  An incoming Write request
        on a resource is processed by
            (1a) queueing a copy of the request to the work queue of the drbd_sender thread for each of
                 the connected peers for that resource;
            (1b) waking up those drbd_sender threads;
            (2a) attempting a "fast-track" submission of the I/O request to the local backing store;
            (2b) if (2a) fails, queueing the I/O request to a "drbd_submitter" thread.
                 (I don't know the relative frequency of (2b) as compared with (2a))
        
        The queueing at (2b) is currently done using the kernel queue_work() interface, which does the
        wakeup and arrives (once for each call to queue_work) at the specified function on the work
        queue's service thread.  So I think it is already in the right model and can be simply #ifdef'd
        to use spdk_thread_send_msg() in place of queue_work(), and let a reactor thread do the I/O
        submission to the backing store (eliminating the "drbd-submitter" work_queue service thread).
        
        The queueing at (1a) is a little more involved, but I have sketched out the changes on paper and
        they are straightforward, maybe 100 lines of added code under #ifdef.  The top-level sender
        function with the loop that waits for work and then services it has to be split, with the
        "service" part getting called by the SPDK reactor thread in response to a spdk_thread_send_msg()
        call.  Steps (1a) and (1b) on the requesting thread are then replaced with a call to
        spdk_thread_send_msg().
        
        Or, the requesting thread may call the sender service function directly under some or all
        conditions (to be analyzed).  Either way the drbd_sender threads would also be eliminated,
        replaced by reactor threads running a nonblocking network transport implementation.
        
        In any case I believe the code changes needed to fit the DRBD datapath smoothly into the SPDK
        model, with commensurate performance, would be a tiny fraction of the 50,000 lines of relatively
        mature code that would then be available for use in the SPDK environment.
        
        Regards,
        David Butterfield
        
    
    


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-08 18:25 David Butterfield
  0 siblings, 0 replies; 22+ messages in thread
From: David Butterfield @ 2019-10-08 18:25 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12353 bytes --]

On 10/3/19 2:44 PM, Marushak, Nathan wrote:
> Do you happen to have any performance and efficiency details? While the port you did was done with minimal changes, great work by the way, we have typically seen that most existing SW architectures require real changes to provide the necessary performance and efficiency improvements required for today's NVM and Networking performance.

Diagram: https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf

Hi Nathan, Paul:

After your replies to my earlier message, I did some looking at the DRBD code and took some very
simple measurements; below is what I found.

Regarding the idea of importing a kernel RAID implementation rather than writing a new one:

I would consider it seriously -- that's an awful lot of logic that might not have to be reimple-
mented, and then matured a couple of years before it's ready for production use with critical
data.  And going forward, instead of two implementations to be maintained in parallel, a single
common implementation that behaves consistently whether it's running in the kernel or in a
usermode process.  That's a lot of potential value to be weighed along with other factors.

Clearly it is essential that it be able to perform very well within an SPDK process and
operate smoothly within the datapath.  If it can't do that, then there's little choice but to
redevelop it despite the cost.  But it seems worth looking for ways to shorten what is likely to
be a fairly long grind.

I'm not familiar with the other kernel RAID modules, so I'll refer here in terms of DRBD, which
includes around 50,000 lines of kernel code.  But I see no reason to expect other kernel RAID
modules to be any harder to port to usermode than DRBD was (or SCST at ~80,000 lines of code).

What are the dimensions of architectural concern?  In writing interface shims between SPDK and
DRBD I noticed three main areas of mismatch (elaborated further below):

  (1) Mismatch between the SPDK bdev protocol and the bio protocol expected by DRBD
  (2) Network interface between DRBD replication peers
  (3) Mismatch between SPDK and kernel mechanisms for threading and work handoff

Are there additional areas I should watch out for that have caused trouble in past efforts?

Performance
===========
This is a home project, and I don't have equipment to carry out robust performance testing; but
I've started by taking a few very simple measurements to estimate the time DRBD takes to process
a Read operation, which is assumed to be the same as the increase in Read operation response-
time when DRBD is inserted into an SPDK bdev chain.

My SPDK test machine is a laptop with a 4-threaded Core i5-2520M @ 2.5GHz with 4GiB of DDR3 1333
(0.8 ns) RAM (barely enough RAM to run the SPDK iSCSI server).  The laptop is running Ubuntu
18.04.1 and kernel 5.0.0-29.  I ran the SPDK iSCSI server (compiled -O3 without DEBUG) with a
single reactor thread on CPU0.

I used dd(1) to do some simple 4KiB sequential Read tests from raw LUNs on the SPDK iSCSI
server, configured with LUNs 0, 4, and 6 (all backed by bdev_malloc) as shown in the diagram.
I used the same machine for the initiator and the server, connecting to the IP address of its
own (Intel 82579V) Ethernet interface to avoid the 1 Gb Ethernet bottleneck.

For these measurements the DRBD server was not connected to a peer, so replication over the
network was not active.  This should be the fastest path through DRBD to its backing storage,
without interference by replication or network considerations.  (Also, since this was a Read
test only, nothing should be going over the peer-to-peer network anyway)

Each LUN was given 16 trials (16 runs of dd), each time reading 130,000 4KiB blocks (~1/2 GB)
through the iSCSI server, ultimately from an instance of bdev_malloc (the SPDK ramdisk).  Shown
below for each LUN are the fastest trial out of its 16 trials, and also the average (mean) of
the top five fastest trials out of the 16.  From the MBPS reported by dd(1) I have also
calculated the microseconds per 4KiB Read operation:

                MBPS Reported by dd(1)                  Microseconds per OP
LUN Config      Best_of_16_trials   Avg(Top_5_trials)   Best    5Avg   Delta
--- ------      -----------------   -----------------   ----    ----   -----
 0  Malloc             1435                1412         2.85    2.90
 4  bio                1006                 989         4.07    4.14    1.24
 6  bio+DRBD            893                 887         4.59    4.62    0.48
   (DRBD without bio)                     (1212)               (3.38)

LUN0 is a standard SPDK bdev_malloc instance, to compare with the other measurements.

LUN4 adds bdev_bio and bio_spdk instances back-to-back ahead of the bdev_malloc instance,
translating each request to kernel bio protocol and back before it gets to bdev_malloc (see
diagram).  The timing difference between LUN4 and LUN0 therefore represents per-OP overhead
contributed by those two modules taken together.

Note that the overhead is not much in the bdev protocol translation; I assume it's mostly wakeup
latency.  The bdev_bio and bio_spdk modules each do handoffs of requests and responses from one
thread to another:  bdev_bio hands requests off from an SPDK thread to a DRBD/UMC thread through
a queue_work() call; and bio_spdk hands requests off from a DRBD/UMC thread to an SPDK thread
through a call to spdk_thread_send_msg().  And similarly in the reply direction.  So the per-OP
timing difference of 1.24 microseconds between LUN4 and LUN0 includes four thread handoffs, two
of them with wakeup latency.

LUN6 adds DRBD into the configuration of LUN4, between the bdev_bio and bio_spdk instances.
So the timing difference of 0.48 microseconds between LUN4 and LUN6 represents per-OP processing
contributed by DRBD.

(DRBD without bio) shows calculated hypothetical timing with DRBD alone, subtracting the
bdev_bio and bio_spdk translations and wakeup latencies.  This is the expected timing if DRBD is
modified to run on SPDK threads (discussed below), eliminating the thread context switches.
[Calculated as (4.62 - 4.14) + 2.90 = 3.38 and hypothetical MBPS back-calculated from that.]
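
(Spelling the arithmetic out: with dd's MB/s taken as 10^6 bytes/s, microseconds per OP = 4096 / MBPS.
For example 4096 / 1435 = 2.85 microseconds for LUN0's best run, and the hypothetical 3.38 microseconds
back-calculates to 4096 / 3.38 = ~1212 MBPS.)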

From these measurements, DRBD appears to be adding about 480 nanoseconds of processing time per
Read operation -- about a 17% increase over the straight bdev_malloc device for a 4KiB Read.
For larger reads the 480 ns should represent smaller percentages.

Finally, I re-ran all of the above experiments twice more with substantially the same results.
All the "best" and "average(best_5)" results for each LUN were within 3% of each other across
re-runs of the experiments (all but one were within 2%).

[Because the test times were fairly short, random scheduling events with heavy impact but low
frequency can spoil any particular run with performance far below average.  Test time depends on
the length of the volume, which is difficult to enlarge because I'm backing with a bdev_malloc
instance on a machine with 4GiB RAM.  That is why I chose to average the top 5 out of 16 -- to
drop spoiled runs, yet not rely completely on one "best" run.  A large spread between the "best"
and the "average(best_5)" would indicate that the "best" time was unusually high.]

================================================================================================

(1) Mismatch between the SPDK bdev protocol and the bio protocol expected by DRBD

This one is pretty trivial.  Most usage in DRBD of the bio structure and protocol is
concentrated in six places in the code:  two near the "top" where client requests arrive, two
near the "bottom" where requests to backing storage are issued, and two near the "middle" where
peer-to-peer communication occurs.  These areas together total around 500 lines of code.  I
reckon a couple hundred lines of new code could be added under #ifdef to change these places to
understand the SPDK bdev structures and calls instead of kernel bio structures and calls.
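
To illustrate, here is a rough sketch of what one of those #ifdef'd spots could look like -- none of
this is actual DRBD code; the *_sketch names and the DRBD_ON_SPDK flag are invented, and the SPDK side
assumes the public spdk_bdev_read_blocks() call:

    #include <errno.h>
    #include <stdint.h>
    #include "spdk/bdev.h"

    /* Invented container for the state an SPDK build would keep per backing device. */
    struct backing_dev_sketch {
            struct spdk_bdev_desc  *desc;   /* open descriptor for the backing bdev */
            struct spdk_io_channel *ch;     /* I/O channel of the submitting thread */
    };

    static int
    backing_read_sketch(struct backing_dev_sketch *dev, void *buf,
                        uint64_t offset_blocks, uint64_t num_blocks,
                        spdk_bdev_io_completion_cb cb, void *cb_arg)
    {
    #ifdef DRBD_ON_SPDK     /* hypothetical build flag */
            /* SPDK build: issue the request through the bdev layer. */
            return spdk_bdev_read_blocks(dev->desc, dev->ch, buf,
                                         offset_blocks, num_blocks, cb, cb_arg);
    #else
            /* Kernel build: the existing bio allocation + submit_bio() path stays here. */
            return -ENOTSUP;
    #endif
    }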

(2) Network interface to DRBD replication peers

In the demo prototype the iSCSI network I/O is done using the SPDK/DPDK networking facility; but
DRBD continues to implement network I/O to replication peers using socket(7) calls.  This is
mainly because I didn't need to change that to get the prototype running.

The implementation of the peer transport service within DRBD is isolated behind a DRBD-internal
transport ops vector, so it's already designed to be easily replaced with other transport
implementations.  The implementation in drbd_transport_tcp.c would be replaced with one that is
nonblocking and issues SPDK networking calls instead of socket(7) calls.
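
As a shape sketch only (the names below are invented and do not match DRBD's actual transport ops
structure), the swap amounts to providing a second ops instance whose receive path never blocks:

    #include <stddef.h>

    /* Invented, simplified ops vector; only meant to show where an SPDK-based
     * nonblocking transport would plug in alongside the sockets-based one. */
    struct transport_ops_sketch {
            int  (*connect)(void *transport);
            int  (*send)(void *transport, const void *buf, size_t len);
            /* Never sleeps; returns however many bytes are available right now. */
            int  (*recv_nonblock)(void *transport, void *buf, size_t len);
            void (*free)(void *transport);
    };

    extern const struct transport_ops_sketch tcp_socket_ops_sketch;  /* today: socket(7) based */
    extern const struct transport_ops_sketch spdk_net_ops_sketch;    /* replacement: SPDK networking */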

Some changes are probably needed to the peer receive-side logic so that it can operate using
non-blocking network I/O only.  There is already a dispatcher "drbdd()" that calls service
functions based on the inter-peer command type in the incoming header; but those functions then
know how much additional data they want, and call for it synchronously.  They may have to have
their post-recv processing split out into callback functions to be called (on a reactor thread)
when the amount of data they want is available from the network.  This will be a bit of a chore,
because there are a good few of them; but it's straightforward and the rearranged code could
work for both kernel and usermode and still be clean without needing #ifdefs in each place.

(3) Mismatch between SPDK and kernel mechanisms for threading and work handoff

DRBD already issues backing store I/O operations for asynchronous completion from a small set of
threads (i.e. it does *not* use a large number of threads each doing a synchronous I/O call).

I'll use replicated Write for discussion because it is the more complicated case.  Consider a
set of DRBD servers acting as peers in a network serving some storage resource.  They maintain
network connections with each other while replication of the resource is active.  DRBD has some
service threads associated with each of these connections (which could go away under SPDK).

One of the peer-connection service threads is a "drbd_sender" thread.  An incoming Write request
on a resource is processed by
    (1a) queueing a copy of the request to the work queue of the drbd_sender thread for each of
         the connected peers for that resource;
    (1b) waking up those drbd_sender threads;
    (2a) attempting a "fast-track" submission of the I/O request to the local backing store;
    (2b) if (2a) fails, queueing the I/O request to a "drbd_submitter" thread.
         (I don't know the relative frequency of (2b) as compared with (2a))

The queueing at (2b) is currently done using the kernel queue_work() interface, which does the
wakeup and arrives (once for each call to queue_work) at the specified function on the work
queue's service thread.  So I think it is already in the right model and can be simply #ifdef'd
to use spdk_thread_send_msg() in place of queue_work(), and let a reactor thread do the I/O
submission to the backing store (eliminating the "drbd-submitter" work_queue service thread).
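
A minimal sketch of that substitution (the *_sketch names and the DRBD_ON_SPDK flag are invented;
spdk_thread_send_msg() is the real SPDK call, which runs the given function with the given argument on
the target SPDK thread):

    #include "spdk/thread.h"

    struct submit_ctx_sketch {
            struct spdk_thread *submit_thread;  /* SPDK thread that owns backing-store I/O */
            void               *request;        /* the queued I/O request */
    };

    /* Runs on submit_thread, taking the place of the work_queue handler. */
    static void
    do_submit_sketch(void *arg)
    {
            struct submit_ctx_sketch *ctx = arg;

            /* Submit ctx->request to the backing store here (async completion). */
            (void)ctx;
    }

    /* Called where the kernel code calls queue_work(). */
    static void
    queue_submit_sketch(struct submit_ctx_sketch *ctx)
    {
    #ifdef DRBD_ON_SPDK
            spdk_thread_send_msg(ctx->submit_thread, do_submit_sketch, ctx);
    #else
            (void)ctx;      /* kernel build: the existing queue_work() call stays here */
    #endif
    }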

The queueing at (1a) is a little more involved, but I have sketched out the changes on paper and
they are straightforward, maybe 100 lines of added code under #ifdef.  The top-level sender
function with the loop that waits for work and then services it has to be split, with the
"service" part getting called by the SPDK reactor thread in response to a spdk_thread_send_msg()
call.  Steps (1a) and (1b) on the requesting thread are then replaced with a call to
spdk_thread_send_msg().

Or, the requesting thread may call the sender service function directly under some or all
conditions (to be analyzed).  Either way the drbd_sender threads would also be eliminated,
replaced by reactor threads running a nonblocking network transport implementation.

In any case I believe the code changes needed to fit the DRBD datapath smoothly into the SPDK
model, with commensurate performance, would be a tiny fraction of the 50,000 lines of relatively
mature code that would then be available for use in the SPDK environment.

Regards,
David Butterfield

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-04 13:38 Artur Paszkiewicz
  0 siblings, 0 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2019-10-04 13:38 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 319 bytes --]

On 10/4/19 12:49 AM, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> Recently SPDK has started to support the DIF feature.
> Is there any possibility of including extended LBA formats (block size = 512 + 8, 4096 + 128, etc.) in SPDK RAID?

Do you mean DIF passthrough? I think this should be included.
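
For passthrough, the raid module would mainly have to require that all base bdevs expose the same
block size and metadata format, and then advertise that same format on the raid bdev. A rough sketch
of such a check (the function name is made up; the getters are assumed from the public bdev API):

    #include <stdbool.h>
    #include "spdk/bdev.h"

    /* Sketch: verify two base bdevs expose the same extended-LBA/DIF format
     * before allowing them into the same raid bdev. */
    static bool
    raid_base_formats_match_sketch(struct spdk_bdev *a, struct spdk_bdev *b)
    {
            return spdk_bdev_get_block_size(a)    == spdk_bdev_get_block_size(b) &&
                   spdk_bdev_get_md_size(a)       == spdk_bdev_get_md_size(b) &&
                   spdk_bdev_is_md_interleaved(a) == spdk_bdev_is_md_interleaved(b) &&
                   spdk_bdev_get_dif_type(a)      == spdk_bdev_get_dif_type(b);
    }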

Thanks,
Artur

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-04 13:24 Artur Paszkiewicz
  0 siblings, 0 replies; 22+ messages in thread
From: Artur Paszkiewicz @ 2019-10-04 13:24 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3838 bytes --]

On 10/3/19 9:20 PM, Luse, Paul E wrote:
> Thanks, I think this can be an awesome contribution.  A few other things to consider; you mention some of these already, so I just added some more color.  It would be good, I think, moving forward to put a Trello board up with a backlog of tasks so that others (like me) can jump in and help also.  I'd still like to get the RAID1E in there, but it makes little sense to do it before any major refactoring. 
> 
> I think it makes sense, if you guys are ready, to start putting together the backlog and knocking out a large series of small patches to get the refactoring done.

I'll start working on the backlog and refactoring in the upcoming weeks.

> * the current RAID0 has no config on disk. We'll need to come up with a scheme for handling existing RAID0 configured out of band with RAID5 using COD. (like not allowing a RAID0 to be built on a set with COD, etc.)

Good point. My initial thought is that any bdev containing RAID metadata should
be claimed by the module as soon as it is discovered (module's examine
handler?).
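
Just to illustrate the idea -- this is only a sketch, not proposed code. raid_metadata_present() and
g_raid_if are made-up names; the examine_disk hook, spdk_bdev_module_claim_bdev() and
spdk_bdev_module_examine_done() are assumed from the bdev module API:

    #include <stdbool.h>
    #include "spdk/bdev_module.h"

    static struct spdk_bdev_module g_raid_if;   /* sketch stand-in for the raid module descriptor */

    /* Placeholder: would read the superblock from the bdev and validate it. */
    static bool
    raid_metadata_present(struct spdk_bdev *bdev)
    {
            (void)bdev;
            return false;
    }

    static void
    raid_examine_disk_sketch(struct spdk_bdev *bdev)
    {
            if (raid_metadata_present(bdev)) {
                    /* Claim the member so no other module or user can take ownership. */
                    spdk_bdev_module_claim_bdev(bdev, NULL, &g_raid_if);
            }
            spdk_bdev_module_examine_done(&g_raid_if);
    }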

> * the metadata layout, as you know from working on previous RAID projects, needs to be thought out carefully to consider not only extensibility but version control for backwards compatibility and issues with conflicting COD that are found.  For example, you have a 3 disk RAID5, take one of the disks out and use it in another array somewhere, then bring it back later and fire up the original 3; the metadata has to have sufficient info to know who belongs to what and which volumes to create and which to put in some sort of offline state. I don't think we need that kind of capability right up front (deciding how to deal with conflicts) but the metadata should have enough information in it up front. DDF was brought up before in the earlier RAID discussions so just to make sure we're all on the same page, I see no value in complying with that spec. Open to other thoughts though.

Using DDF would have big benefits: we wouldn't have to re-invent the wheel and
would be able to use the RAID arrays in other environments, with existing
tools, etc. I think we can consider adding support for this at some point. For
now, something less complex should be OK. We can use the mdraid native metadata as
a reference. It is simple and works well for stackable block-device software RAID,
similar to what we are going to be building. It has its quirks, so I'd prefer
not using the same exact format.
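
To make that more concrete, a strawman superblock layout -- every field name and size below is
invented, only to show the version field, identity/sequence info and generous reserved space for
future state:

    #include <stdint.h>

    #define RAID_SB_MAGIC_SKETCH 0x53504446u    /* arbitrary value, for the sketch only */

    /* Strawman on-disk superblock, written to every member drive. */
    struct raid_superblock_sketch {
            uint32_t magic;
            uint32_t version;             /* bump on incompatible layout changes */
            uint8_t  array_uuid[16];      /* which array this member belongs to */
            uint8_t  member_uuid[16];     /* identity of this member drive */
            uint64_t seq_number;          /* detects stale or conflicting members */
            uint32_t raid_level;
            uint32_t num_members;
            uint32_t member_index;
            uint32_t strip_size_kb;
            uint64_t array_size_blocks;
            uint64_t rebuild_checkpoint;  /* progress marker for rebuild */
            uint8_t  reserved[4016];      /* pads to 4096 bytes; room for future state, e.g. migrations */
    };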

> * one thing to keep in mind, we probably want to retain common RPC code

I agree.

> * we should keep migration in mind as well, not that it's something we may ever need/want but lots of reserved space in metadata for tracking state wrt migrations and rebuilds is needed.

For rebuild a simple checkpoint should be sufficient. Migration can require a
backup area, but maybe we don't have to reserve space for that on the data
drives? Let's say that a migration process will require providing separate
storage for the backup.

> * similar to the RAID1E discussions earlier, there are some features you mention as 'future' that most would consider a requirement for redundant RAID - like degraded operation and rebuild. Those don't all have to go in at the same time; however, without that minimum set we need to mark it experimental so nobody tries to use it thinking it has those basic things.

By 'future' I meant not included in the initial patchset. Of course, those
things will have to be included if this to be considered stable.

> * I can't remember if we talked about unit tests or not, but be sure the backlog includes getting solid UT coverage in there up front.  The existing UT code will likely need some refactoring as well to support the function code refactoring.

Yes, I assumed this will be necessary.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-03 22:49 松本周平 / MATSUMOTO,SHUUHEI
  0 siblings, 0 replies; 22+ messages in thread
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2019-10-03 22:49 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5035 bytes --]

Hi Artur, Paul, and All,

Thank you so much; I'm excited to hear this.

Recently SPDK has started to support the DIF feature.
Is there any possibility of including extended LBA formats (block size = 512 + 8, 4096 + 128, etc.) in SPDK RAID?
Do you have any comment?

Thanks,
Shuhei

________________________________
From: Luse, Paul E <paul.e.luse(a)intel.com>
Sent: October 4, 2019 4:20
To: Storage Performance Development Kit <spdk(a)lists.01.org>
CC: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
Subject: [SPDK] Re: SPDK RAID5 support

Hi Artur,

Thanks, I think this can be an awesome contribution.  A few other things to consider; you mention some of these already, so I just added some more color.  It would be good, I think, moving forward to put a Trello board up with a backlog of tasks so that others (like me) can jump in and help also.  I'd still like to get the RAID1E in there, but it makes little sense to do it before any major refactoring.

I think it makes sense, if you guys are ready, to start putting together the backlog and knocking out a large series of small patches to get the refactoring done.

* the current RAID0 has no config on disk. We'll need to come up with a scheme for handling existing RAID0 configured out of band with RAID5 using COD. (like not allowing a RAID0 to be built on a set with COD, etc.)
* the metadata layout, as you know from working on previous RAID projects, needs to be thought out carefully to consider not only extensibility but version control for backwards compatibility and issues with conflicting COD that are found.  For example, you have a 3 disk RAID5, take one of the disks out and use it in another array somewhere, then bring it back later and fire up the original 3; the metadata has to have sufficient info to know who belongs to what and which volumes to create and which to put in some sort of offline state. I don't think we need that kind of capability right up front (deciding how to deal with conflicts) but the metadata should have enough information in it up front. DDF was brought up before in the earlier RAID discussions so just to make sure we're all on the same page, I see no value in complying with that spec. Open to other thoughts though.
* one thing to keep in mind, we probably want to retain common RPC code
* we should keep migration in mind as well, not that it's something we may ever need/want but lots of reserved space in metadata for tracking state wrt migrations and rebuilds is needed.
* similar to the RAID1E discussions earlier, there are some features you mention as 'future' that most would consider a requirement for redundant RAID - like degraded operation and rebuild. Those don't all have to go in at the same time; however, without that minimum set we need to mark it experimental so nobody tries to use it thinking it has those basic things.
* I can't remember if we talked about unit tests or not, but be sure the backlog includes getting solid UT coverage in there up front.  The existing UT code will likely need some refactoring as well to support the function code refactoring.

-Paul

On 10/3/19, 3:00 AM, "Artur Paszkiewicz" <artur.paszkiewicz(a)intel.com> wrote:

    Hi all,

    We want to add RAID5 support to SPDK. My team has experience with other RAID
    projects, primarily with Linux MD RAID, which we actively develop and support
    for Intel VROC. We already have an initial SPDK RAID5 implementation created
    for an internal project. It has working read/write, including partial-stripe
    updates, parity calculation and reconstruct-reads.

    Currently in SPDK there exists a RAID bdev module, which has only RAID0
    functionality. This can be used as a basis for a more generic RAID stack. Here
    is our idea how to approach this:

    1. Refactor the bdev_raid module to separate RAID0-specific I/O handling code
    from more generic parts - configuration, bdev creation, etc. Move the RAID0
    code to a new file. Use RAID level-specific callbacks, similar to existing
    struct raid_fn_table. This architecture is also used in MD RAID drivers, where
    different RAID "personalities" work on top of a common layer.

    2. Add RAID5 support in another file, similarly to RAID0. Port our current
    RAID5 code to this new framework.

    3. Incrementally add new functionalities. At this point, probably the most
    important will be support for member drive failure and degraded operation, RAID
    rebuild and some form of on-disk metadata.

    Any comments or suggestions are welcome.

    Thanks,
    Artur
    _______________________________________________
    SPDK mailing list -- spdk(a)lists.01.org
    To unsubscribe send an email to spdk-leave(a)lists.01.org


_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-03 20:44 Marushak, Nathan
  0 siblings, 0 replies; 22+ messages in thread
From: Marushak, Nathan @ 2019-10-03 20:44 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5280 bytes --]

Hey David,

Do you happen to have any performance and efficiency details? While the port you did was done with minimal changes, great work by the way, we have typically seen that most existing SW architectures require real changes to provide the necessary performance and efficiency improvements required for today's NVM and Networking performance.

Thanks,
Nate

> -----Original Message-----
> From: Luse, Paul E [mailto:paul.e.luse(a)intel.com]
> Sent: Thursday, October 03, 2019 9:12 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>; Paszkiewicz,
> Artur <artur.paszkiewicz(a)intel.com>
> Cc: Karkra, Kapil <kapil.karkra(a)intel.com>; Baldysiak, Pawel
> <pawel.baldysiak(a)intel.com>; Ptak, Slawomir <slawomir.ptak(a)intel.com>
> Subject: [SPDK] Re: SPDK RAID5 support
> 
> Hi David,
> 
> Thanks for reminding me that you had similar feedback wrt RAID1E that I don't
> think I replied to; I just got distracted by another shiny thing :)
> 
> We talked about a few things wrt the recent RAID discussions in the last
> community meeting. In general, the primary driver behind choosing a path is
> architectural fit within SPDK. Then there's the implementation details that
> must take into account that we already have a RAID0 module in tree. So
> although we want to consider every option, any kind of port or drop in is
> unlikely to be a good fit for SPDK but we're not done talking about this
> stuff either... after some more email discussion on Artur's note, we'll
> likely put it on the agenda again for the next community meeting.  Would be
> great if you can keep an eye out and join us. https://spdk.io/community/
> 
> Artur, I'll reply to your email separately a bit later on today.
> 
> Thanks!
> Paul
> 
> On 10/3/19, 8:56 AM, "David Butterfield" <dab21774(a)gmail.com> wrote:
> 
>     I have DRBD 9.0 running in usermode under SPDK as shown in this diagram:
> 
>         https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf
> 
>     One possibility would be to port a kernel RAID module into the spot
> occupied by DRBD in the diagram.
> 
>     The "port" of DRBD to run in usermode changes fewer than a dozen lines
> of code from the original source in the LINBIT repository.  Rather than
> changing the source code from the DRBD kernel module, its expected
> environment is simulated around it.  (This is intended to make it easier to
> update the usermode port to newer versions of the application as they
> appear.)
> 
>     I did the same thing with SCST a couple of years ago.  I would expect
> the same to be possible for a kernel RAID module.  It won't just "drop in",
> because the set of emulated kernel functions has to be expanded to include
> whatever the RAID module uses that isn't already covered by the existing
> ports of SCST and DRBD.  I estimate it would take me one to two months of
> full-time work to get a kernel RAID module up and running well enough to be
> tested and used for experimentation.
> 
>     Regards,
>     David Butterfield
>     -----------------
> 
>     On 10/3/19 3:59 AM, Artur Paszkiewicz wrote:
>     > Hi all,
>     >
>     > We want to add RAID5 support to SPDK. My team has experience with other RAID
>     > projects, primarily with Linux MD RAID, which we actively develop and support
>     > for Intel VROC. We already have an initial SPDK RAID5 implementation created
>     > for an internal project. It has working read/write, including partial-stripe
>     > updates, parity calculation and reconstruct-reads.
>     >
>     > Currently in SPDK there exists a RAID bdev module, which has only RAID0
>     > functionality. This can be used as a basis for a more generic RAID stack. Here
>     > is our idea how to approach this:
>     >
>     > 1. Refactor the bdev_raid module to separate RAID0-specific I/O handling code
>     > from more generic parts - configuration, bdev creation, etc. Move the RAID0
>     > code to a new file. Use RAID level-specific callbacks, similar to existing
>     > struct raid_fn_table. This architecture is also used in MD RAID drivers, where
>     > different RAID "personalities" work on top of a common layer.
>     >
>     > 2. Add RAID5 support in another file, similarly to RAID0. Port our current
>     > RAID5 code to this new framework.
>     >
>     > 3. Incrementally add new functionalities. At this point, probably the most
>     > important will be support for member drive failure and degraded operation, RAID
>     > rebuild and some form of on-disk metadata.
>     >
>     > Any comments or suggestions are welcome.
>     >
>     > Thanks,
>     > Artur
>     > _______________________________________________
>     > SPDK mailing list -- spdk(a)lists.01.org
>     > To unsubscribe send an email to spdk-leave(a)lists.01.org
>     >
>     _______________________________________________
>     SPDK mailing list -- spdk(a)lists.01.org
>     To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
> 
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-03 19:20 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-03 19:20 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4259 bytes --]

Hi Artur,

Thanks, I think this can be an awesome contribution.  A few other things to consider; you mention some of these already, so I just added some more color.  It would be good, I think, moving forward to put a Trello board up with a backlog of tasks so that others (like me) can jump in and help also.  I'd still like to get the RAID1E in there, but it makes little sense to do it before any major refactoring. 

I think it makes sense, if you guys are ready, to start putting together the backlog and knocking out a large series of small patches to get the refactoring done.

* the current RAID0 has no config on disk (COD). We'll need to come up with a scheme for handling existing RAID0 arrays configured out of band alongside RAID5 arrays that use COD (like not allowing a RAID0 to be built on a set of drives that already carries COD, etc.)
* the metadata layout, as you know from working on previous RAID projects, needs to be thought out carefully to cover not only extensibility but version control for backwards compatibility, plus the handling of conflicting COD when it is found.  For example, you have a 3-disk RAID5, take one of the disks out and use it in another array somewhere, then bring it back later and fire up the original 3; the metadata has to carry sufficient info to know who belongs to what, which volumes to create, and which to put in some sort of offline state. I don't think we need that kind of capability (deciding how to deal with conflicts) right up front, but the metadata should have enough information in it from the start - a rough superblock sketch follows after this list. DDF was brought up in the earlier RAID discussions, so just to make sure we're all on the same page: I see no value in complying with that spec. Open to other thoughts though.
* one thing to keep in mind: we probably want to retain common RPC code
* we should keep migration in mind as well - not that it's something we will necessarily ever need or want, but plenty of reserved space in the metadata for tracking state wrt migrations and rebuilds is needed.
* similar to the earlier RAID1E discussions, there are some features you mention as 'future' that most would consider a requirement for redundant RAID - like degraded operation and rebuild. Those don't all have to go in at the same time, however without that minimum set we need to mark the module experimental so nobody tries to use it thinking it has those basic things.
* I can't remember if we talked about unit tests or not, but be sure the backlog includes getting solid UT coverage in there up front.  The existing UT code will likely need some refactoring as well to support the functional code refactoring.
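
To make the metadata bullets concrete, here is a minimal sketch of what a versioned on-disk superblock could look like. Everything below is an illustrative assumption (made-up names, sizes, and fields), not DDF and not an existing SPDK structure - the point is only the version field, the array/member UUIDs plus a sequence number for resolving conflicting COD, and the reserved space for future migration/rebuild state:

    /* Minimal sketch of a versioned on-disk superblock.  All names and
     * sizes are made up for illustration; this is not DDF and not an
     * existing SPDK structure. */
    #include <stdint.h>

    #define RAID_SB_MAGIC    0x52414944u  /* arbitrary magic for this sketch */
    #define RAID_SB_VERSION  1u           /* bump on incompatible layout changes */

    struct raid_member_entry {
        uint8_t  member_uuid[16];   /* identifies the base bdev */
        uint32_t slot;              /* position within the array */
        uint32_t state;             /* active, missing, rebuilding, ... */
    };

    struct raid_superblock {
        uint32_t magic;             /* RAID_SB_MAGIC */
        uint32_t version;           /* RAID_SB_VERSION of the writer */
        uint8_t  array_uuid[16];    /* ties all members of one array together */
        uint64_t seq_number;        /* bumped on every metadata write; the
                                     * highest copy wins when members disagree */
        uint32_t raid_level;
        uint32_t num_members;
        uint64_t strip_size_blocks;
        uint32_t array_state;       /* online, degraded, offline, ... */
        uint32_t sb_crc;            /* checksum over this structure */
        struct raid_member_entry members[32];
        uint8_t  reserved[1024];    /* room for future migration/rebuild state */
    };

With an array UUID and a sequence number like this, a drive that was pulled, used elsewhere, and later re-inserted can be recognized as stale or foreign and kept offline rather than silently reassembled, which is exactly the conflict case described above.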

-Paul

On 10/3/19, 3:00 AM, "Artur Paszkiewicz" <artur.paszkiewicz(a)intel.com> wrote:

    Hi all,
    
    We want to add RAID5 support to SPDK. My team has experience with other RAID
    projects, primarily with Linux MD RAID, which we actively develop and support
    for Intel VROC. We already have an initial SPDK RAID5 implementation created
    for an internal project. It has working read/write, including partial-stripe
    updates, parity calculation and reconstruct-reads.
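
(As a quick aside on the mechanics mentioned above: the following is only a sketch of the standard RAID5 parity arithmetic, not the actual POC code. It uses plain buffers and byte-wise XOR; a real implementation would presumably operate on iovecs and use an optimized XOR routine such as the ones in ISA-L.)

    /* Sketch of the standard RAID5 parity arithmetic -- illustrative only. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static void
    xor_into(uint8_t *dst, const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            dst[i] ^= src[i];
        }
    }

    /* Partial-stripe update (read-modify-write):
     * new_parity = old_parity ^ old_data ^ new_data,
     * where old_data and old_parity were read from the members first. */
    static void
    raid5_rmw_parity(uint8_t *parity, const uint8_t *old_data,
                     const uint8_t *new_data, size_t len)
    {
        xor_into(parity, old_data, len);
        xor_into(parity, new_data, len);
    }

    /* Reconstruct-read: the strip on a failed or missing member is the XOR
     * of all surviving data strips and the parity strip. */
    static void
    raid5_reconstruct(uint8_t *out, const uint8_t *const *surviving,
                      int num_surviving, size_t len)
    {
        memset(out, 0, len);
        for (int i = 0; i < num_surviving; i++) {
            xor_into(out, surviving[i], len);
        }
    }

The same reconstruct path is what a rebuild iterates over strip by strip, which is why degraded reads and rebuild can share most of their code.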
    
    Currently in SPDK there exists a RAID bdev module, which has only RAID0
    functionality. This can be used as a basis for a more generic RAID stack. Here
    is our idea how to approach this:
    
    1. Refactor the bdev_raid module to separate RAID0-specific I/O handling code
    from more generic parts - configuration, bdev creation, etc. Move the RAID0
    code to a new file. Use RAID level-specific callbacks, similar to existing
    struct raid_fn_table. This architecture is also used in MD RAID drivers, where
    different RAID "personalities" work on top of a common layer.
    
    2. Add RAID5 support in another file, similarly to RAID0. Port our current
    RAID5 code to this new framework.
    
    3. Incrementally add new functionalities. At this point, probably the most
    important will be support for member drive failure and degraded operation, RAID
    rebuild and some form of on-disk metadata.
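
(To make the step 1 idea concrete, here is a rough sketch of what such a per-level callback table could look like. The names are placeholders chosen for illustration - this is not the existing struct raid_fn_table nor a proposed final interface.)

    /* Rough sketch of a per-level callback table; all names are placeholders. */
    struct raid_bdev;       /* opaque here; the real types live in bdev_raid */
    struct raid_bdev_io;

    struct raid_level_ops {
        int  (*start)(struct raid_bdev *raid);               /* validate geometry, allocate state */
        void (*stop)(struct raid_bdev *raid);
        void (*submit_rw_request)(struct raid_bdev_io *io);  /* level-specific read/write path */
        void (*handle_member_failure)(struct raid_bdev *raid, int slot);
    };

    /* Each RAID "personality" (raid0, raid5, ...) would fill in its own
     * table and the common layer would dispatch through it.  Empty stubs
     * are shown only so the sketch stands alone. */
    static int  raid5_start(struct raid_bdev *raid) { (void)raid; return 0; }
    static void raid5_stop(struct raid_bdev *raid) { (void)raid; }
    static void raid5_submit_rw_request(struct raid_bdev_io *io) { (void)io; }
    static void raid5_handle_member_failure(struct raid_bdev *raid, int slot) { (void)raid; (void)slot; }

    static const struct raid_level_ops g_raid5_level_ops = {
        .start = raid5_start,
        .stop = raid5_stop,
        .submit_rw_request = raid5_submit_rw_request,
        .handle_member_failure = raid5_handle_member_failure,
    };

The generic layer would look up the right table from the configured level when the array is created and never branch on the level in the common code paths.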
    
    Any comments or suggestions are welcome.
    
    Thanks,
    Artur
    _______________________________________________
    SPDK mailing list -- spdk(a)lists.01.org
    To unsubscribe send an email to spdk-leave(a)lists.01.org
    


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-03 16:11 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2019-10-03 16:11 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4111 bytes --]

Hi David,

Thanks for reminding me - you had similar feedback wrt RAID1E that I don't think I replied to; I just got distracted by another shiny thing :)

We talked about a few things wrt the recent RAID discussions in the last community meeting. In general, the primary driver behind choosing a path is architectural fit within SPDK. Then there are the implementation details, which must take into account that we already have a RAID0 module in tree. So although we want to consider every option, any kind of port or drop-in is unlikely to be a good fit for SPDK - but we're not done talking about this stuff either. After some more email discussion on Artur's note, we'll likely put it on the agenda again for the next community meeting.  It would be great if you could keep an eye out and join us: https://spdk.io/community/ 

Artur, I'll reply to your email separately a bit later on today.

Thanks!
Paul

On 10/3/19, 8:56 AM, "David Butterfield" <dab21774(a)gmail.com> wrote:

    I have DRBD 9.0 running in usermode under SPDK as shown in this diagram:
    
        https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf
    
    One possibility would be to port a kernel RAID module into the spot occupied by DRBD in the diagram.
    
    The "port" of DRBD to run in usermode changes fewer than a dozen lines of code from the original source in the LINBIT repository.  Rather than changing the source code from the DRBD kernel module, its expected environment is simulated around it.  (This is intended to make it easier to update the usermode port to newer versions of the application as they appear.)
    
    I did the same thing with SCST a couple of years ago.  I would expect the same to be possible for a kernel RAID module.  It won't just "drop in", because the set of emulated kernel functions has to be expanded to include whatever the RAID module uses that isn't already covered by the existing ports of SCST and DRBD.  I estimate it would take me one to two months of full-time work to get a kernel RAID module up and running well enough to be tested and used for experimentation.
    
    Regards,
    David Butterfield
    -----------------
    
    On 10/3/19 3:59 AM, Artur Paszkiewicz wrote:
    > Hi all,
    > 
    > We want to add RAID5 support to SPDK. My team has experience with other RAID
    > projects, primarily with Linux MD RAID, which we actively develop and support
    > for Intel VROC. We already have an initial SPDK RAID5 implementation created
    > for an internal project. It has working read/write, including partial-stripe
    > updates, parity calculation and reconstruct-reads.
    > 
    > Currently in SPDK there exists a RAID bdev module, which has only RAID0
    > functionality. This can be used as a basis for a more generic RAID stack. Here
    > is our idea how to approach this:
    > 
    > 1. Refactor the bdev_raid module to separate RAID0-specific I/O handling code
    > from more generic parts - configuration, bdev creation, etc. Move the RAID0
    > code to a new file. Use RAID level-specific callbacks, similar to existing
    > struct raid_fn_table. This architecture is also used in MD RAID drivers, where
    > different RAID "personalities" work on top of a common layer.
    > 
    > 2. Add RAID5 support in another file, similarly to RAID0. Port our current
    > RAID5 code to this new framework.
    > 
    > 3. Incrementally add new functionalities. At this point, probably the most
    > important will be support for member drive failure and degraded operation, RAID
    > rebuild and some form of on-disk metadata.
    > 
    > Any comments or suggestions are welcome.
    > 
    > Thanks,
    > Artur
    > _______________________________________________
    > SPDK mailing list -- spdk(a)lists.01.org
    > To unsubscribe send an email to spdk-leave(a)lists.01.org
    > 
    _______________________________________________
    SPDK mailing list -- spdk(a)lists.01.org
    To unsubscribe send an email to spdk-leave(a)lists.01.org
    


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Re: SPDK RAID5 support
@ 2019-10-03 15:55 David Butterfield
  0 siblings, 0 replies; 22+ messages in thread
From: David Butterfield @ 2019-10-03 15:55 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

I have DRBD 9.0 running in usermode under SPDK as shown in this diagram:

    https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/spdk_drbd.pdf

One possibility would be to port a kernel RAID module into the spot occupied by DRBD in the diagram.

The "port" of DRBD to run in usermode changes fewer than a dozen lines of code from the original source in the LINBIT repository.  Rather than changing the source code from the DRBD kernel module, its expected environment is simulated around it.  (This is intended to make it easier to update the usermode port to newer versions of the application as they appear.)

I did the same thing with SCST a couple of years ago.  I would expect the same to be possible for a kernel RAID module.  It won't just "drop in", because the set of emulated kernel functions has to be expanded to include whatever the RAID module uses that isn't already covered by the existing ports of SCST and DRBD.  I estimate it would take me one to two months of full-time work to get a kernel RAID module up and running well enough to be tested and used for experimentation.
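
To illustrate what "simulated around it" means in practice, here is a minimal sketch of the kind of usermode compatibility layer involved. These particular definitions are assumptions for illustration only and are not taken from the actual DRBD/SCST ports:

    /* Minimal sketch of a usermode shim for a few kernel APIs --
     * illustrative assumptions, not code from the real ports. */
    #include <stdlib.h>
    #include <pthread.h>

    typedef unsigned int gfp_t;
    #define GFP_KERNEL 0u

    /* Kernel allocators backed by the C library allocator. */
    static inline void *kmalloc(size_t size, gfp_t flags) { (void)flags; return malloc(size); }
    static inline void *kzalloc(size_t size, gfp_t flags) { (void)flags; return calloc(1, size); }
    static inline void kfree(void *ptr) { free(ptr); }

    /* Kernel spinlocks mapped onto pthread mutexes in usermode. */
    typedef struct { pthread_mutex_t lock; } spinlock_t;
    static inline void spin_lock_init(spinlock_t *l) { pthread_mutex_init(&l->lock, NULL); }
    static inline void spin_lock(spinlock_t *l) { pthread_mutex_lock(&l->lock); }
    static inline void spin_unlock(spinlock_t *l) { pthread_mutex_unlock(&l->lock); }

The unchanged kernel module is compiled against headers like these, and expanding the shim to cover whatever an MD RAID module additionally calls is the bulk of the effort estimated above.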

Regards,
David Butterfield
-----------------

On 10/3/19 3:59 AM, Artur Paszkiewicz wrote:
> Hi all,
> 
> We want to add RAID5 support to SPDK. My team has experience with other RAID
> projects, primarily with Linux MD RAID, which we actively develop and support
> for Intel VROC. We already have an initial SPDK RAID5 implementation created
> for an internal project. It has working read/write, including partial-stripe
> updates, parity calculation and reconstruct-reads.
> 
> Currently in SPDK there exists a RAID bdev module, which has only RAID0
> functionality. This can be used as a basis for a more generic RAID stack. Here
> is our idea how to approach this:
> 
> 1. Refactor the bdev_raid module to separate RAID0-specific I/O handling code
> from more generic parts - configuration, bdev creation, etc. Move the RAID0
> code to a new file. Use RAID level-specific callbacks, similar to existing
> struct raid_fn_table. This architecture is also used in MD RAID drivers, where
> different RAID "personalities" work on top of a common layer.
> 
> 2. Add RAID5 support in another file, similarly to RAID0. Port our current
> RAID5 code to this new framework.
> 
> 3. Incrementally add new functionalities. At this point, probably the most
> important will be support for member drive failure and degraded operation, RAID
> rebuild and some form of on-disk metadata.
> 
> Any comments or suggestions are welcome.
> 
> Thanks,
> Artur
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-10-21 15:22 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-04 15:31 [SPDK] Re: SPDK RAID5 support Luse, Paul E
  -- strict thread matches above, loose matches on Subject: below --
2019-10-21 15:22 Artur Paszkiewicz
2019-10-16 12:19 Sasha Kotchubievsky
2019-10-15 21:21 Sasha Kotchubievsky
2019-10-14 17:43 Harris, James R
2019-10-13 18:18 Luse, Paul E
2019-10-13 17:39 Luse, Paul E
2019-10-13  9:26 Sasha Kotchubievsky
2019-10-13  8:56 Sasha Kotchubievsky
2019-10-11 15:37 Luse, Paul E
2019-10-11 15:32 Liu, Xiaodong
2019-10-11 13:08 Luse, Paul E
2019-10-11 13:07 Artur Paszkiewicz
2019-10-08 20:21 Luse, Paul E
2019-10-08 18:25 David Butterfield
2019-10-04 13:38 Artur Paszkiewicz
2019-10-04 13:24 Artur Paszkiewicz
2019-10-03 22:49 
2019-10-03 20:44 Marushak, Nathan
2019-10-03 19:20 Luse, Paul E
2019-10-03 16:11 Luse, Paul E
2019-10-03 15:55 David Butterfield
