* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Andrey Kuzmin @ 2020-07-13 16:29 UTC
  To: spdk


Hi Seth,

please see my comments below.

On Wed, Jul 8, 2020, 22:59 Howell, Seth <seth.howell(a)intel.com> wrote:

> Hi Andrey,
>
> Sorry for the radio silence on this. There have been a couple of
> developments recently that have changed the way that I think we want to
> handle this. The following is coming after a couple of conversations that I
> have had with Jim and Ben.
>
> Jim had a couple of qualms about calling the hotplug callback when the
> target side of our NVMe bdev suddenly disconnects. Chief among those is
> that when we call the hotplug remove callback for a given bdev, that means
> that the bdev is entirely gone and represents a pretty critical event for
> the application. Rather than calling the hotplug callback for fabrics
> controllers, he would prefer to handle these events with different
> asynchronous events and allow the user to choose what to do with them
> without destroying the bdev.
>
> I have recently been adding some multipath features to the initiator side
> of the connection: https://review.spdk.io/gerrit/c/spdk/spdk/+/2885/25.
> This means that now an NVMe bdev doesn't so much represent a specific
> controller as it does a set of namespaces that can be accessed through
> multiple controllers, each having different TRIDs. I have already added
> different functions to enable adding multiple TRIDs, and currently have an
> automatic failover response in place to update the TRID we are connected to
> without failing.
>
> There are also plans in the future to allow for load balancing (multiple
> connections through the same bdev to different controllers but the same set
> of namespaces). So a single path being disconnected will also not be a
> hotplug event.
>
> Do you think it would be reasonable to allow for hotplug removal and
> insertion through registering multiple TRIDs for the same controller? That
> way, the user has the flexibility to add multiple paths to the same bdev
> and do automatic recovery/failover in the bdev module. We could also
> register a separate asynchronous callback in the case that all registered
> paths are unreachable at once.
>

Integrating multipathing below the bdev layer looks pretty reasonable given
that nvme/nvmf are adopting those features. What it means for the bdev layer
is likely that it needs to evolve towards the NVMe model, with async event
notifications on state transitions of an opened bdev (a topic we actually
discussed about a year ago).

In particular, as you have proposed above, path state change notifications
can be delivered via the hotremove callback (or an extension of it), provided
those events are defined at the bdev module level (presumably following NVMe
ANA semantics). Since application needs regarding multipathing vary, it
likely makes sense to preserve the existing bdev_open/hotremove semantics and
add an extra bdev_open flavor with an event-aware callback.
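
To make this concrete, here is a minimal sketch of what such an
event-aware open flavor could look like. All names below are made up
for illustration; this is not an existing SPDK API, just the shape of
the idea:

/* Hypothetical sketch only -- none of these identifiers exist in SPDK.
 * The single hotremove callback is generalized to a typed event callback,
 * loosely following NVMe ANA path states. */
#include "spdk/stdinc.h"
#include "spdk/bdev.h"

enum bdev_path_event {
    BDEV_PATH_EVENT_REMOVED,    /* whole bdev gone: today's hotremove */
    BDEV_PATH_EVENT_PATH_DOWN,  /* one of several paths lost */
    BDEV_PATH_EVENT_PATH_UP,    /* a previously lost path restored */
};

typedef void (*bdev_path_event_cb_t)(enum bdev_path_event event,
                                     struct spdk_bdev *bdev, void *event_ctx);

/* Same contract as spdk_bdev_open(), plus per-path event notifications. */
int bdev_open_with_events(struct spdk_bdev *bdev, bool write,
                          bdev_path_event_cb_t event_cb, void *event_ctx,
                          struct spdk_bdev_desc **desc);

/* Example consumer: only full removal is fatal; path transitions feed HA logic. */
static void
my_bdev_event_cb(enum bdev_path_event event, struct spdk_bdev *bdev, void *ctx)
{
    switch (event) {
    case BDEV_PATH_EVENT_REMOVED:
        /* drain I/O and close the descriptor, as with hotremove today */
        break;
    case BDEV_PATH_EVENT_PATH_DOWN:
    case BDEV_PATH_EVENT_PATH_UP:
        /* path count changed; adjust failover/startup policy as needed */
        break;
    }
}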

From my experience, HA scenarios also require some tweaking of such an open
flavor with regard to path state and, in particular, the path count at open
time, since having a single path versus multiple paths available leads to
different startup scenarios in HA.

Regarding load balancing, it sounds like a natural feature to have if
multipathing is to be added.

Regards,
Andrey


> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> Sent: Thursday, May 28, 2020 12:41 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Hi Seth,
>
> On Thu, May 28, 2020, 01:11 Howell, Seth <seth.howell(a)intel.com> wrote:
>
> > Hi Andrey,
> >
> > Thanks for clarifying that you were talking specifically about the
> > bdev layer. I think you are right there (And Ben spoke to me today as
> > well about this). When I was initially working on the bdev_nvme
> > implementation of failover, we were trying to mimic what happens in
> > the kernel when you lose contact with a target side NVMe drive.
> > Essentially, it just continually attempts to reconnect until you
> disconnect the bdev.
> >
> > But it makes sense to expose that through the bdev hotplug functionality.
> >
>
> Good to know. I will be glad to review the associated patches or help with
> that in any other practical way.
>
>
> > As far as hotplug insertion goes, do you have a specific idea of how
> > to implement that in a generic enough way to be universally
> > applicable? I can think of a few different things people might want to
> do:
> > 1. Have a specific set of subsystems on a given TRID (set of TRIDs)
> > that you would want to poll.
> > 2. Attach to all subsystems on a given set of TRIDs to poll.
> > I think both of these would require some changes to both the
> > bdev_nvme_set_hotplug API and RPC methods since they currently only
> > allow probing of PCIe addresses. It might be best to simply allow
> > users to call bdev_nvme_set_hotplug multiple times specifying a single
> > TRID each time and then keep an internal list of all of the monitored
> targets.
> >
>
> People's mileage may vary, but I see it pretty reasonable and nicely
> aligned with existing set_hotplug API if polling is limited to subsystems
> added since or active when the user has enabled hotplug. Essentially,
> hotplug enable call is then treated as user's request to monitor known
> subsystems, similar to what we have now with PCIe. Removing subsystem
> controller via RPC when hotplug is enabled is then a request to stop
> polling for the controller's TRID. This looks both simple and practical
> enough for the first shot to me.
>
> Adding extra options to set_nvme_hotplug would provide a more flexible and
> powerful solution, such as capability to discover any subsystem on the
> specified TRID or stop monitoring a given TRID, enabling fine-grained
> control. Coming up with a practical and simple default behavior still seems
> key to me here.
>
>
> > I'm definitely in favor of both of these changes. I would be
> > interested to see what others in the community think as well though in
> > case I am missing anything. The removal side could be pretty readily
> > implemented, but there are the API considerations on the insertion side.
> >
>
> It certainly makes sense to get community input on the API changes.
>
> Thanks,
> Andrey
>
>
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> > Sent: Tuesday, May 26, 2020 2:11 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
> >
> > Thanks Seth, please find a few comments inline below.
> >
> > On Tue, May 26, 2020, 23:35 Howell, Seth <seth.howell(a)intel.com> wrote:
> >
> > > Hi Andrey,
> > >
> > > Typically when we refer to hotplug (removal)
> >
> >
> > That addition speaks for itself :). Since I'm interested in both hot
> > removal/plugging, that's just wording.
> >
> > in fabrics transports, we are talking about the target side of the
> > > connection suddenly disconnecting the admin and I/O qpairs by the
> > > target side of the connection. This definition of hotplug is already
> > > supported in the NVMe initiator. If your definition of hotplug is
> > > something different, please correct me so that I can better answer
> > > your
> > question.
> > >
> > > In RDMA for example, when we receive a disconnect event on the admin
> > > qpair for a given controller, we mark that controller as failed and
> > > fail up all I/O corresponding to I/O qpairs on that controller. Then
> > > subsequent calls to either submit I/O or process completions on any
> > > qpair associated with that controller return -ENXIO indicating to
> > > the initiator application that the drive has been failed by the target
> side.
> > > There are a couple of reasons that could happen:
> > > 1. The actual drive itself has been hotplugged from the target
> > > application (i.e. nvme pcie hotplug on the target side)
> > > 2. There was some network event that caused the target application to
> > > disconnect (NIC failure, RDMA error, etc)
> > >
> > > Because there are multiple reasons we could receive a "hotplug"
> > > event from the target application we leave it up to the initiator
> > > application to decide what they want to do with this. Either destroy
> > > the controller from the initiator side, try reconnecting to the
> > > controller from the same TRID or attempting to connect to the
> > > controller from a different TRID (something like target side port
> failover).
> > >
> >
> > What I'm concerned with right now is that the above decision is
> > seemingly at odds with the SPDK own bdev layer hotremove
> > functionality. When spdk_bdev_open is being called, the caller
> > provides a hotremove callback that is expected to be called when the
> associated bdev goes away.
> >
> > If, for instance, I model hotplug event by killing SPDK nvmf/tcp
> > target while running bdevperf against namespace bdevs it exposes, I'd
> > expect the bdevperf hotremove callback to be fired for each active target
> namespace.
> > What I'm witnessing instead is the target subsystem controller (on the
> > initiator side) going into failed state after a number of unsuccessful
> > resets, with bdevperf failing due to I/O errors rather than cleanly
> > handling the hotremove event.
> >
> > Is that by design so that I'm looking for something that's actually
> > not expected to work, or is bdev layer hot-remove functionality a bit
> > ahead of the nvme layer in this case?
> >
> >
> > > In terms of hotplug insertion, I assume that would mean you want the
> > > initiator to automatically connect to a target subsystem that can be
> > > presented at any point in time during the running of the application.
> > > There isn't a specific driver level implementation of this feature
> > > for fabrics controllers, I think mostly because it would be very
> > > easy to implement and customize this functionality at the application
> layer.
> > > For example, one could periodically call discover on the targets
> > > they want to connect to and when new controllers/subsystems appear,
> > > connect
> > to them at that time.
> > >
> >
> > Understood, though I'd expect such a feature to be pretty popular,
> > similar to PCIe hotplug (which currently works), so providing it
> > off-the-shelf rather than leaving the implementation to SPDK users would
> make sense to me.
> >
> > Thanks,
> > Andrey
> >
> >
> > > I hope that this answers your question. Please let me know if I am
> > > talking about a different definition of hotplug than the one you are
> > using.
> > >
> > > Thanks,
> > >
> > > Seth
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> > > Sent: Friday, May 22, 2020 1:47 AM
> > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> > >
> > > Hi team,
> > >
> > > is NVMe hotplug functionality as implemented limited to PCIe
> > > transport or does it also work for other transports? If it's
> > > currently PCIe only, are there any plans to extend the support to
> RDMA/TCP?
> > >
> > > Thanks,
> > > Andrey
> > > _______________________________________________
> > > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email
> > > to spdk-leave(a)lists.01.org
> > > _______________________________________________
> > > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email
> > > to spdk-leave(a)lists.01.org
> > >
> > _______________________________________________
> > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to
> > spdk-leave(a)lists.01.org
> > _______________________________________________
> > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to
> > spdk-leave(a)lists.01.org
> >
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
>


* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Howell, Seth @ 2020-07-08 19:59 UTC
  To: spdk


Hi Andrey,

Sorry for the radio silence on this. There have been a couple of developments recently that have changed the way that I think we want to handle this. The following is coming after a couple of conversations that I have had with Jim and Ben.

Jim had a couple of qualms about calling the hotplug callback when the target side of our NVMe bdev suddenly disconnects. Chief among those is that when we call the hotplug remove callback for a given bdev, that means that the bdev is entirely gone and represents a pretty critical event for the application. Rather than calling the hotplug callback for fabrics controllers, he would prefer to handle these events with different asynchronous events and allow the user to choose what to do with them without destroying the bdev.

I have recently been adding some multipath features to the initiator side of the connection: https://review.spdk.io/gerrit/c/spdk/spdk/+/2885/25. This means that now an NVMe bdev doesn't so much represent a specific controller as it does a set of namespaces that can be accessed through multiple controllers, each having different TRIDs. I have already added different functions to enable adding multiple TRIDs, and currently have an automatic failover response in place to update the TRID we are connected to without failing.

There are also plans in the future to allow for load balancing (multiple connections through the same bdev to different controllers but the same set of namespaces). So a single path being disconnected will also not be a hotplug event.

Do you think it would be reasonable to allow for hotplug removal and insertion through registering multiple TRIDs for the same controller? That way, the user has the flexibility to add multiple paths to the same bdev and do automatic recovery/failover in the bdev module. We could also register a separate asynchronous callback in the case that all registered paths are unreachable at once.
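
To sketch the idea in code (the structures and names below are purely illustrative and are not the code from the Gerrit change above):

/* Illustrative sketch only: one bdev-level controller keeping a list of
 * registered TRIDs (paths) and walking it on disconnect, so that losing a
 * single path does not surface as a hotremove of the bdev. */
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/queue.h"

struct nvme_bdev_path {
    struct spdk_nvme_transport_id trid;
    TAILQ_ENTRY(nvme_bdev_path)   link;
};

struct nvme_bdev_ctrlr {
    struct spdk_nvme_ctrlr        *ctrlr;   /* currently connected controller */
    struct nvme_bdev_path         *active;  /* path in use */
    TAILQ_HEAD(, nvme_bdev_path)  paths;    /* all registered TRIDs, TAILQ_INIT'ed at creation */
};

/* Register an additional path (TRID) for an already created bdev controller. */
static int
nvme_bdev_ctrlr_add_path(struct nvme_bdev_ctrlr *nbc,
                         const struct spdk_nvme_transport_id *trid)
{
    struct nvme_bdev_path *path = calloc(1, sizeof(*path));

    if (path == NULL) {
        return -ENOMEM;
    }
    path->trid = *trid;
    TAILQ_INSERT_TAIL(&nbc->paths, path, link);
    return 0;
}

/* On disconnect, pick the next registered path instead of reporting hotremove.
 * Only when every path has failed would an "all paths down" event be raised. */
static struct nvme_bdev_path *
nvme_bdev_ctrlr_next_path(struct nvme_bdev_ctrlr *nbc)
{
    struct nvme_bdev_path *next = TAILQ_NEXT(nbc->active, link);

    return next != NULL ? next : TAILQ_FIRST(&nbc->paths);
}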

Thanks,

Seth

-----Original Message-----
From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com> 
Sent: Thursday, May 28, 2020 12:41 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports

Hi Seth,

On Thu, May 28, 2020, 01:11 Howell, Seth <seth.howell(a)intel.com> wrote:

> Hi Andrey,
>
> Thanks for clarifying that you were talking specifically about the 
> bdev layer. I think you are right there (And Ben spoke to me today as 
> well about this). When I was initially working on the bdev_nvme 
> implementation of failover, we were trying to mimic what happens in 
> the kernel when you lose contact with a target side NVMe drive. 
> Essentially, it just continually attempts to reconnect until you disconnect the bdev.
>
> But it makes sense to expose that through the bdev hotplug functionality.
>

Good to know. I will be glad to review the associated patches or help with that in any other practical way.


> As far as hotplug insertion goes, do you have a specific idea of how 
> to implement that in a generic enough way to be universally 
> applicable? I can think of a few different things people might want to do:
> 1. Have a specific set of subsystems on a given TRID (set of TRIDs) 
> that you would want to poll.
> 2. Attach to all subsystems on a given set of TRIDs to poll.
> I think both of these would require some changes to both the 
> bdev_nvme_set_hotplug API and RPC methods since they currently only 
> allow probing of PCIe addresses. It might be best to simply allow 
> users to call bdev_nvme_set_hotplug multiple times specifying a single 
> TRID each time and then keep an internal list of all of the monitored targets.
>

People's mileage may vary, but I see it pretty reasonable and nicely aligned with existing set_hotplug API if polling is limited to subsystems added since or active when the user has enabled hotplug. Essentially, hotplug enable call is then treated as user's request to monitor known subsystems, similar to what we have now with PCIe. Removing subsystem controller via RPC when hotplug is enabled is then a request to stop polling for the controller's TRID. This looks both simple and practical enough for the first shot to me.

Adding extra options to set_nvme_hotplug would provide a more flexible and powerful solution, such as capability to discover any subsystem on the specified TRID or stop monitoring a given TRID, enabling fine-grained control. Coming up with a practical and simple default behavior still seems key to me here.


> I'm definitely in favor of both of these changes. I would be 
> interested to see what others in the community think as well though in 
> case I am missing anything. The removal side could be pretty readily 
> implemented, but there are the API considerations on the insertion side.
>

It certainly makes sense to get community input on the API changes.

Thanks,
Andrey


> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> Sent: Tuesday, May 26, 2020 2:11 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Thanks Seth, please find a few comments inline below.
>
> On Tue, May 26, 2020, 23:35 Howell, Seth <seth.howell(a)intel.com> wrote:
>
> > Hi Andrey,
> >
> > Typically when we refer to hotplug (removal)
>
>
> That addition speaks for itself :). Since I'm interested in both hot 
> removal/plugging, that's just wording.
>
> in fabrics transports, we are talking about the target side of the
> > connection suddenly disconnecting the admin and I/O qpairs by the 
> > target side of the connection. This definition of hotplug is already 
> > supported in the NVMe initiator. If your definition of hotplug is 
> > something different, please correct me so that I can better answer 
> > your
> question.
> >
> > In RDMA for example, when we receive a disconnect event on the admin 
> > qpair for a given controller, we mark that controller as failed and 
> > fail up all I/O corresponding to I/O qpairs on that controller. Then 
> > subsequent calls to either submit I/O or process completions on any 
> > qpair associated with that controller return -ENXIO indicating to 
> > the initiator application that the drive has been failed by the target side.
> > There are a couple of reasons that could happen:
> > 1. The actual drive itself has been hotplugged from the target 
> > application (i.e. nvme pcie hotplug on the target side)
> > 2. There was some network event that caused the target application to
> > disconnect (NIC failure, RDMA error, etc)
> >
> > Because there are multiple reasons we could receive a "hotplug" 
> > event from the target application we leave it up to the initiator
> > application to decide what they want to do with this. Either destroy 
> > the controller from the initiator side, try reconnecting to the 
> > controller from the same TRID or attempting to connect to the 
> > controller from a different TRID (something like target side port failover).
> >
>
> What I'm concerned with right now is that the above decision is 
> seemingly at odds with the SPDK own bdev layer hotremove 
> functionality. When spdk_bdev_open is being called, the caller 
> provides a hotremove callback that is expected to be called when the associated bdev goes away.
>
> If, for instance, I model hotplug event by killing SPDK nvmf/tcp 
> target while running bdevperf against namespace bdevs it exposes, I'd 
> expect the bdevperf hotremove callback to be fired for each active target namespace.
> What I'm witnessing instead is the target subsystem controller (on the 
> initiator side) going into failed state after a number of unsuccessful 
> resets, with bdevperf failing due to I/O errors rather than cleanly 
> handling the hotremove event.
>
> Is that by design so that I'm looking for something that's actually 
> not expected to work, or is bdev layer hot-remove functionality a bit 
> ahead of the nvme layer in this case?
>
>
> > In terms of hotplug insertion, I assume that would mean you want the 
> > initiator to automatically connect to a target subsystem that can be 
> > presented at any point in time during the running of the application.
> > There isn't a specific driver level implementation of this feature 
> > for fabrics controllers, I think mostly because it would be very 
> > easy to implement and customize this functionality at the application layer.
> > For example, one could periodically call discover on the targets 
> > they want to connect to and when new controllers/subsystems appear, 
> > connect
> to them at that time.
> >
>
> Understood, though I'd expect such a feature to be pretty popular, 
> similar to PCIe hotplug (which currently works), so providing it 
> off-the-shelf rather than leaving the implementation to SPDK users would make sense to me.
>
> Thanks,
> Andrey
>
>
> > I hope that this answers your question. Please let me know if I am 
> > talking about a different definition of hotplug than the one you are
> using.
> >
> > Thanks,
> >
> > Seth
> >
> >
> >
> > -----Original Message-----
> > From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> > Sent: Friday, May 22, 2020 1:47 AM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> >
> > Hi team,
> >
> > is NVMe hotplug functionality as implemented limited to PCIe 
> > transport or does it also work for other transports? If it's 
> > currently PCIe only, are there any plans to extend the support to RDMA/TCP?
> >
> > Thanks,
> > Andrey
> > _______________________________________________
> > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email 
> > to spdk-leave(a)lists.01.org 
> > _______________________________________________
> > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email 
> > to spdk-leave(a)lists.01.org
> >
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to 
> spdk-leave(a)lists.01.org 
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to 
> spdk-leave(a)lists.01.org
>
_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org


* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Andrey Kuzmin @ 2020-05-28 18:41 UTC
  To: spdk


Hi Seth,

On Thu, May 28, 2020, 01:11 Howell, Seth <seth.howell(a)intel.com> wrote:

> Hi Andrey,
>
> Thanks for clarifying that you were talking specifically about the bdev
> layer. I think you are right there (And Ben spoke to me today as well about
> this). When I was initially working on the bdev_nvme implementation of
> failover, we were trying to mimic what happens in the kernel when you lose
> contact with a target side NVMe drive. Essentially, it just continually
> attempts to reconnect until you disconnect the bdev.
>
> But it makes sense to expose that through the bdev hotplug functionality.
>

Good to know. I will be glad to review the associated patches or help with
that in any other practical way.


> As far as hotplug insertion goes, do you have a specific idea of how to
> implement that in a generic enough way to be universally applicable? I can
> think of a few different things people might want to do:
> 1. Have a specific set of subsystems on a given TRID (set of TRIDs) that
> you would want to poll.
> 2. Attach to all subsystems on a given set of TRIDs to poll.
> I think both of these would require some changes to both the
> bdev_nvme_set_hotplug API and RPC methods since they currently only allow
> probing of PCIe addresses. It might be best to simply allow users to call
> bdev_nvme_set_hotplug multiple times specifying a single TRID each time and
> then keep an internal list of all of the monitored targets.
>

People's mileage may vary, but I see it as pretty reasonable and nicely
aligned with the existing set_hotplug API if polling is limited to subsystems
added since, or active when, the user enabled hotplug. Essentially, the
hotplug enable call is then treated as the user's request to monitor known
subsystems, similar to what we have now with PCIe. Removing a subsystem
controller via RPC while hotplug is enabled is then a request to stop
polling for the controller's TRID. This looks both simple and practical
enough for a first shot to me.

Adding extra options to set_nvme_hotplug would provide a more flexible and
powerful solution, such as the capability to discover any subsystem on a
specified TRID or to stop monitoring a given TRID, enabling fine-grained
control. Coming up with a practical and simple default behavior still seems
key to me here.


> I'm definitely in favor of both of these changes. I would be interested to
> see what others in the community think as well though in case I am missing
> anything. The removal side could be pretty readily implemented, but there
> are the API considerations on the insertion side.
>

It certainly makes sense to get community input on the API changes.

Thanks,
Andrey


> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> Sent: Tuesday, May 26, 2020 2:11 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Thanks Seth, please find a few comments inline below.
>
> On Tue, May 26, 2020, 23:35 Howell, Seth <seth.howell(a)intel.com> wrote:
>
> > Hi Andrey,
> >
> > Typically when we refer to hotplug (removal)
>
>
> That addition speaks for itself :). Since I'm interested in both hot
> removal/plugging, that's just wording.
>
> in fabrics transports, we are talking about the target side of the
> > connection suddenly disconnecting the admin and I/O qpairs by the
> > target side of the connection. This definition of hotplug is already
> > supported in the NVMe initiator. If your definition of hotplug is
> > something different, please correct me so that I can better answer your
> question.
> >
> > In RDMA for example, when we receive a disconnect event on the admin
> > qpair for a given controller, we mark that controller as failed and
> > fail up all I/O corresponding to I/O qpairs on that controller. Then
> > subsequent calls to either submit I/O or process completions on any
> > qpair associated with that controller return -ENXIO indicating to the
> > initiator application that the drive has been failed by the target side.
> > There are a couple of reasons that could happen:
> > 1. The actual drive itself has been hotplugged from the target
> > application (i.e. nvme pcie hotplug on the target side)
> > 2. There was some network event that caused the target application to
> > disconnect (NIC failure, RDMA error, etc)
> >
> > Because there are multiple reasons we could receive a "hotplug" event
> > from the target application we leave it up to the initiator application
> > to decide what they want to do with this. Either destroy the
> > controller from the initiator side, try reconnecting to the controller
> > from the same TRID or attempting to connect to the controller from a
> > different TRID (something like target side port failover).
> >
>
> What I'm concerned with right now is that the above decision is seemingly
> at odds with the SPDK own bdev layer hotremove functionality. When
> spdk_bdev_open is being called, the caller provides a hotremove callback
> that is expected to be called when the associated bdev goes away.
>
> If, for instance, I model hotplug event by killing SPDK nvmf/tcp target
> while running bdevperf against namespace bdevs it exposes, I'd expect the
> bdevperf hotremove callback to be fired for each active target namespace.
> What I'm witnessing instead is the target subsystem controller (on the
> initiator side) going into failed state after a number of unsuccessful
> resets, with bdevperf failing due to I/O errors rather than cleanly
> handling the hotremove event.
>
> Is that by design so that I'm looking for something that's actually not
> expected to work, or is bdev layer hot-remove functionality a bit ahead of
> the nvme layer in this case?
>
>
> > In terms of hotplug insertion, I assume that would mean you want the
> > initiator to automatically connect to a target subsystem that can be
> > presented at any point in time during the running of the application.
> > There isn't a specific driver level implementation of this feature for
> > fabrics controllers, I think mostly because it would be very easy to
> > implement and customize this functionality at the application layer.
> > For example, one could periodically call discover on the targets they
> > want to connect to and when new controllers/subsystems appear, connect
> to them at that time.
> >
>
> Understood, though I'd expect such a feature to be pretty popular, similar
> to PCIe hotplug (which currently works), so providing it off-the-shelf
> rather than leaving the implementation to SPDK users would make sense to me.
>
> Thanks,
> Andrey
>
>
> > I hope that this answers your question. Please let me know if I am
> > talking about a different definition of hotplug than the one you are
> using.
> >
> > Thanks,
> >
> > Seth
> >
> >
> >
> > -----Original Message-----
> > From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> > Sent: Friday, May 22, 2020 1:47 AM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> >
> > Hi team,
> >
> > is NVMe hotplug functionality as implemented limited to PCIe transport
> > or does it also work for other transports? If it's currently PCIe
> > only, are there any plans to extend the support to RDMA/TCP?
> >
> > Thanks,
> > Andrey
> > _______________________________________________
> > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to
> > spdk-leave(a)lists.01.org
> > _______________________________________________
> > SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to
> > spdk-leave(a)lists.01.org
> >
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
>


* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Howell, Seth @ 2020-05-27 22:11 UTC
  To: spdk


Hi Andrey,

Thanks for clarifying that you were talking specifically about the bdev layer. I think you are right there (And Ben spoke to me today as well about this). When I was initially working on the bdev_nvme implementation of failover, we were trying to mimic what happens in the kernel when you lose contact with a target side NVMe drive. Essentially, it just continually attempts to reconnect until you disconnect the bdev.

But it makes sense to expose that through the bdev hotplug functionality.

As far as hotplug insertion goes, do you have a specific idea of how to implement that in a generic enough way to be universally applicable? I can think of a few different things people might want to do:
1. Have a specific set of subsystems on a given TRID (set of TRIDs) that you would want to poll.
2. Attach to all subsystems on a given set of TRIDs to poll.
I think both of these would require some changes to both the bdev_nvme_set_hotplug API and RPC methods since they currently only allow probing of PCIe addresses. It might be best to simply allow users to call bdev_nvme_set_hotplug multiple times specifying a single TRID each time and then keep an internal list of all of the monitored targets.
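
Roughly, that internal list plus a periodic probe could look like the sketch below (names are illustrative; today's hotplug poller only probes PCIe):

/* Hypothetical sketch of an extended hotplug monitor: remember each TRID the
 * user asked us to watch and probe them periodically. Illustrative only. */
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/queue.h"

struct monitored_trid {
    struct spdk_nvme_transport_id trid;
    TAILQ_ENTRY(monitored_trid)   link;
};

static TAILQ_HEAD(, monitored_trid) g_monitored = TAILQ_HEAD_INITIALIZER(g_monitored);

/* What an extended bdev_nvme_set_hotplug call could do: remember one TRID. */
static int
hotplug_monitor_add_trid(const struct spdk_nvme_transport_id *trid)
{
    struct monitored_trid *m = calloc(1, sizeof(*m));

    if (m == NULL) {
        return -ENOMEM;
    }
    m->trid = *trid;
    TAILQ_INSERT_TAIL(&g_monitored, m, link);
    return 0;
}

static bool
probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
    return true; /* attach to anything newly reported on a monitored TRID */
}

static void
attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
    /* a new controller appeared since the last poll: create its bdevs here */
}

/* Body of a periodic poller (e.g. registered via spdk_poller_register()).
 * Already-attached controllers are expected to be skipped; if not, probe_cb
 * would have to filter them out. */
static int
hotplug_poll(void *arg)
{
    struct monitored_trid *m;

    TAILQ_FOREACH(m, &g_monitored, link) {
        spdk_nvme_probe(&m->trid, NULL, probe_cb, attach_cb, NULL);
    }
    return 0;
}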

I'm definitely in favor of both of these changes. I would be interested to see what others in the community think as well though in case I am missing anything. The removal side could be pretty readily implemented, but there are the API considerations on the insertion side.

Thanks,

Seth

-----Original Message-----
From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com> 
Sent: Tuesday, May 26, 2020 2:11 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports

Thanks Seth, please find a few comments inline below.

On Tue, May 26, 2020, 23:35 Howell, Seth <seth.howell(a)intel.com> wrote:

> Hi Andrey,
>
> Typically when we refer to hotplug (removal)


That addition speaks for itself :). Since I'm interested in both hot removal/plugging, that's just wording.

in fabrics transports, we are talking about the target side of the
> connection suddenly disconnecting the admin and I/O qpairs by the 
> target side of the connection. This definition of hotplug is already 
> supported in the NVMe initiator. If your definition of hotplug is 
> something different, please correct me so that I can better answer your question.
>
> In RDMA for example, when we receive a disconnect event on the admin 
> qpair for a given controller, we mark that controller as failed and 
> fail up all I/O corresponding to I/O qpairs on that controller. Then 
> subsequent calls to either submit I/O or process completions on any 
> qpair associated with that controller return -ENXIO indicating to the 
> initiator application that the drive has been failed by the target side.
> There are a couple of reasons that could happen:
> 1. The actual drive itself has been hotplugged from the target 
> application (i.e. nvme pcie hotplug on the target side)
> 2. There was some network event that caused the target application to
> disconnect (NIC failure, RDMA error, etc)
>
> Because there are multiple reasons we could receive a "hotplug" event 
> from the target application we leave it up to the initiator application 
> to decide what they want to do with this. Either destroy the 
> controller from the initiator side, try reconnecting to the controller 
> from the same TRID or attempting to connect to the controller from a 
> different TRID (something like target side port failover).
>

What I'm concerned with right now is that the above decision is seemingly at odds with SPDK's own bdev layer hotremove functionality. When spdk_bdev_open is being called, the caller provides a hotremove callback that is expected to be called when the associated bdev goes away.

If, for instance, I model a hotplug event by killing an SPDK nvmf/tcp target while running bdevperf against the namespace bdevs it exposes, I'd expect the bdevperf hotremove callback to be fired for each active target namespace.
What I'm witnessing instead is the target subsystem controller (on the initiator side) going into failed state after a number of unsuccessful resets, with bdevperf failing due to I/O errors rather than cleanly handling the hotremove event.

Is that by design so that I'm looking for something that's actually not expected to work, or is bdev layer hot-remove functionality a bit ahead of the nvme layer in this case?


> In terms of hotplug insertion, I assume that would mean you want the 
> initiator to automatically connect to a target subsystem that can be 
> presented at any point in time during the running of the application. 
> There isn't a specific driver level implementation of this feature for 
> fabrics controllers, I think mostly because it would be very easy to 
> implement and customize this functionality at the application layer. 
> For example, one could periodically call discover on the targets they 
> want to connect to and when new controllers/subsystems appear, connect to them at that time.
>

Understood, though I'd expect such a feature to be pretty popular, similar to PCIe hotplug (which currently works), so providing it off-the-shelf rather than leaving the implementation to SPDK users would make sense to me.

Thanks,
Andrey


> I hope that this answers your question. Please let me know if I am 
> talking about a different definition of hotplug than the one you are using.
>
> Thanks,
>
> Seth
>
>
>
> -----Original Message-----
> From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> Sent: Friday, May 22, 2020 1:47 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
>
> Hi team,
>
> is NVMe hotplug functionality as implemented limited to PCIe transport 
> or does it also work for other transports? If it's currently PCIe 
> only, are there any plans to extend the support to RDMA/TCP?
>
> Thanks,
> Andrey
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to 
> spdk-leave(a)lists.01.org 
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org To unsubscribe send an email to 
> spdk-leave(a)lists.01.org
>
_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org


* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Andrey Kuzmin @ 2020-05-26 21:10 UTC
  To: spdk


Thanks Seth, please find a few comments inline below.

On Tue, May 26, 2020, 23:35 Howell, Seth <seth.howell(a)intel.com> wrote:

> Hi Andrey,
>
> Typically when we refer to hotplug (removal)


That addition speaks for itself :). Since I'm interested in both hot
removal/plugging, that's just wording.

in fabrics transports, we are talking about the target side of the
> connection suddenly disconnecting the admin and I/O qpairs by the target
> side of the connection. This definition of hotplug is already supported in
> the NVMe initiator. If your definition of hotplug is something different,
> please correct me so that I can better answer your question.
>
> In RDMA for example, when we receive a disconnect event on the admin qpair
> for a given controller, we mark that controller as failed and fail up all
> I/O corresponding to I/O qpairs on that controller. Then subsequent calls
> to either submit I/O or process completions on any qpair associated with
> that controller return -ENXIO indicating to the initiator application that
> the drive has been failed by the target side.
> There are a couple of reasons that could happen:
> 1. The actual drive itself has been hotplugged from the target application
> (i.e. nvme pcie hotplug on the target side)
> 2. There was some network event that caused the target application to
> disconnect (NIC failure, RDMA error, etc)
>
> Because there are multiple reasons we could receive a "hotplug" event from
> the target application we leave it up to the initiator application to decide
> what they want to do with this. Either destroy the controller from the
> initiator side, try reconnecting to the controller from the same TRID or
> attempting to connect to the controller from a different TRID (something
> like target side port failover).
>

What I'm concerned with right now is that the above decision is seemingly
at odds with SPDK's own bdev layer hotremove functionality. When
spdk_bdev_open is being called, the caller provides a hotremove callback
that is expected to be called when the associated bdev goes away.
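
For reference, the application-side contract looks roughly like this (a
minimal sketch against the spdk_bdev_open() signature current at the time
of this thread):

/* Minimal sketch of the bdev-layer contract discussed above, assuming the
 * spdk_bdev_open() signature current at the time of this thread. */
#include "spdk/stdinc.h"
#include "spdk/bdev.h"

struct my_bdev_ctx {
    struct spdk_bdev_desc *desc;
};

/* Called by the bdev layer when the opened bdev is hot-removed. The question
 * above is whether this should also fire when the remote target behind an
 * NVMe-oF bdev goes away for good. */
static void
my_hotremove_cb(void *remove_ctx)
{
    struct my_bdev_ctx *ctx = remove_ctx;

    printf("bdev hot-removed, closing descriptor\n");
    spdk_bdev_close(ctx->desc);
    ctx->desc = NULL;
}

static int
open_with_hotremove(const char *name, struct my_bdev_ctx *ctx)
{
    struct spdk_bdev *bdev = spdk_bdev_get_by_name(name);

    if (bdev == NULL) {
        return -ENODEV;
    }
    return spdk_bdev_open(bdev, true, my_hotremove_cb, ctx, &ctx->desc);
}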

If, for instance, I model a hotplug event by killing an SPDK nvmf/tcp target
while running bdevperf against the namespace bdevs it exposes, I'd expect the
bdevperf hotremove callback to be fired for each active target namespace.
What I'm witnessing instead is the target subsystem controller (on the
initiator side) going into failed state after a number of unsuccessful
resets, with bdevperf failing due to I/O errors rather than cleanly
handling the hotremove event.

Is that by design so that I'm looking for something that's actually not
expected to work, or is bdev layer hot-remove functionality a bit ahead of
the nvme layer in this case?


> In terms of hotplug insertion, I assume that would mean you want the
> initiator to automatically connect to a target subsystem that can be
> presented at any point in time during the running of the application. There
> isn't a specific driver level implementation of this feature for fabrics
> controllers, I think mostly because it would be very easy to implement and
> customize this functionality at the application layer. For example, one
> could periodically call discover on the targets they want to connect to and
> when new controllers/subsystems appear, connect to them at that time.
>

Understood, though I'd expect such a feature to be pretty popular, similar
to PCIe hotplug (which currently works), so providing it off-the-shelf
rather than leaving the implementation to SPDK users would make sense to me.

Thanks,
Andrey


> I hope that this answers your question. Please let me know if I am talking
> about a different definition of hotplug than the one you are using.
>
> Thanks,
>
> Seth
>
>
>
> -----Original Message-----
> From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
> Sent: Friday, May 22, 2020 1:47 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
>
> Hi team,
>
> is NVMe hotplug functionality as implemented limited to PCIe transport or
> does it also work for other transports? If it's currently PCIe only, are
> there any plans to extend the support to RDMA/TCP?
>
> Thanks,
> Andrey
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
>


* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Howell, Seth @ 2020-05-26 20:35 UTC
  To: spdk


Hi Andrey,

Typically when we refer to hotplug (removal) in fabrics transports, we are talking about the target side of the connection suddenly disconnecting the admin and I/O qpairs. This definition of hotplug is already supported in the NVMe initiator. If your definition of hotplug is something different, please correct me so that I can better answer your question.

In RDMA for example, when we receive a disconnect event on the admin qpair for a given controller, we mark that controller as failed and fail up all I/O corresponding to I/O qpairs on that controller. Then subsequent calls to either submit I/O or process completions on any qpair associated with that controller return -ENXIO indicating to the initiator application that the drive has been failed by the target side.
There are a couple of reasons that could happen:
1. The actual drive itself has been hotplugged from the target application (i.e. nvme pcie hotplug on the target side)
2. There was some network event that caused the target application to disconnect (NIC failure, RDMA error, etc)

Because there are multiple reasons we could receive a "hotplug" event from the target application, we leave it up to the initiator application to decide what they want to do with this: either destroy the controller from the initiator side, try reconnecting to the controller from the same TRID, or attempt to connect to the controller from a different TRID (something like target-side port failover).
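
In code, the initiator-side view is roughly the following (a sketch only; the recovery policy shown is just one option and error handling is trimmed):

/* Sketch of how an initiator notices the "fabric hotplug" condition described
 * above and applies one possible policy. Not production error handling. */
#include "spdk/stdinc.h"
#include "spdk/nvme.h"

/* Returns true if the controller should be treated as gone. */
static bool
poll_io_qpair(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_qpair *qpair)
{
    int32_t rc = spdk_nvme_qpair_process_completions(qpair, 0 /* no limit */);

    if (rc == -ENXIO) {
        /* The target disconnected us (remote hotplug, port down, NIC/RDMA
         * error, ...). Pick a policy: here, try a reset/reconnect against
         * the same TRID; alternatives are detaching the controller or
         * reconnecting via a different TRID (port failover). */
        if (spdk_nvme_ctrlr_reset(ctrlr) == 0) {
            return false; /* reconnected */
        }
        return true;
    }
    return false;
}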

In terms of hotplug insertion, I assume that would mean you want the initiator to automatically connect to a target subsystem that can be presented at any point in time during the running of the application. There isn't a specific driver-level implementation of this feature for fabrics controllers, I think mostly because it would be very easy to implement and customize this functionality at the application layer. For example, one could periodically call discover on the targets they want to connect to and, when new controllers/subsystems appear, connect to them at that time.
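
For example, such an application-level poller could periodically probe the well-known discovery NQN on the target (rough sketch; the address and port below are placeholders):

/* Rough sketch of the application-level approach described above: periodically
 * probe the discovery service on a known target and attach to whatever new
 * subsystems show up. traddr/trsvcid are placeholders. */
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/nvmf_spec.h"

static bool
disc_probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
              struct spdk_nvme_ctrlr_opts *opts)
{
    /* Returning true means "attach to this reported subsystem"; subsystems
     * we are already attached to can be filtered out here. */
    return true;
}

static void
disc_attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
               struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
    /* A new controller appeared since the last poll: create bdevs, etc. */
}

/* Call periodically, e.g. from an spdk_poller. */
static void
poll_discovery_service(void)
{
    struct spdk_nvme_transport_id trid;

    memset(&trid, 0, sizeof(trid));
    trid.trtype = SPDK_NVME_TRANSPORT_TCP;
    trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
    snprintf(trid.traddr, sizeof(trid.traddr), "%s", "192.168.0.10"); /* placeholder */
    snprintf(trid.trsvcid, sizeof(trid.trsvcid), "%s", "4420");       /* placeholder */
    snprintf(trid.subnqn, sizeof(trid.subnqn), "%s", SPDK_NVMF_DISCOVERY_NQN);

    /* Probing the discovery NQN walks the discovery log and invokes the
     * callbacks for each reported subsystem. */
    spdk_nvme_probe(&trid, NULL, disc_probe_cb, disc_attach_cb, NULL);
}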

I hope that this answers your question. Please let me know if I am talking about a different definition of hotplug than the one you are using.

Thanks,

Seth



-----Original Message-----
From: Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com> 
Sent: Friday, May 22, 2020 1:47 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] NVMe hotplug for RDMA and TCP transports

Hi team,

is NVMe hotplug functionality as implemented limited to PCIe transport or does it also work for other transports? If it's currently PCIe only, are there any plans to extend the support to RDMA/TCP?

Thanks,
Andrey
_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org


* [SPDK] Re: NVMe hotplug for RDMA and TCP transports
From: Revan biradar @ 2020-05-22 10:34 UTC
  To: spdk


Hi Team,

I am also interested in this feature:
NVMe hotplug for RDMA/TCP.

Thanks
Revan Biradar

On Fri, May 22, 2020 at 2:17 PM Andrey Kuzmin <andrey.v.kuzmin(a)gmail.com>
wrote:

> Hi team,
>
> is NVMe hotplug functionality as implemented limited to PCIe transport or
> does it also work for other transports? If it's currently PCIe only, are
> there any plans to extend the support to RDMA/TCP?
>
> Thanks,
> Andrey
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org
>


Thread overview: 7+ messages
2020-07-13 16:29 [SPDK] Re: NVMe hotplug for RDMA and TCP transports Andrey Kuzmin
2020-07-08 19:59 Howell, Seth
2020-05-28 18:41 Andrey Kuzmin
2020-05-27 22:11 Howell, Seth
2020-05-26 21:10 Andrey Kuzmin
2020-05-26 20:35 Howell, Seth
2020-05-22 10:34 Revan biradar
