From: Howell, Seth
To: spdk@lists.01.org
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
Date: Wed, 08 Jul 2020 19:59:44 +0000

Hi Andrey,

Sorry for the radio silence on this. There have been a couple of developments
recently that have changed the way I think we want to handle this. The
following comes after a couple of conversations that I have had with Jim and
Ben.

Jim had a couple of qualms about calling the hotplug callback when the target
side of our NVMe bdev suddenly disconnects. Chief among them is that when we
call the hotplug remove callback for a given bdev, it means the bdev is
entirely gone, which represents a pretty critical event for the application.
Rather than calling the hotplug callback for fabrics controllers, he would
prefer to handle these events with different asynchronous events and allow
the user to choose what to do with them without destroying the bdev.

I have recently been adding some multipath features to the initiator side of
the connection: https://review.spdk.io/gerrit/c/spdk/spdk/+/2885/25. This
means that an NVMe bdev no longer represents a specific controller so much as
a set of namespaces that can be accessed through multiple controllers, each
having a different TRID. I have already added functions to enable adding
multiple TRIDs, and currently have an automatic failover response in place to
update the TRID we are connected to without failing. There are also plans to
allow for load balancing in the future (multiple connections through the same
bdev to different controllers but the same set of namespaces). So a single
path being disconnected will also not be a hotplug event.

Do you think it would be reasonable to allow for hotplug removal and insertion
through registering multiple TRIDs for the same controller? That way, the user
has the flexibility to add multiple paths to the same bdev and do automatic
recovery/failover in the bdev module. We could also register a separate
asynchronous callback for the case where all registered paths become
unreachable at once.

Thanks,

Seth

-----Original Message-----
From: Andrey Kuzmin
Sent: Thursday, May 28, 2020 12:41 PM
To: Storage Performance Development Kit
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports

Hi Seth,

On Thu, May 28, 2020, 01:11 Howell, Seth wrote:

> Hi Andrey,
>
> Thanks for clarifying that you were talking specifically about the bdev
> layer. I think you are right there (and Ben spoke to me today about this as
> well). When I was initially working on the bdev_nvme implementation of
> failover, we were trying to mimic what happens in the kernel when you lose
> contact with a target-side NVMe drive. Essentially, it just continually
> attempts to reconnect until you disconnect the bdev.
>
> But it makes sense to expose that through the bdev hotplug functionality.
>

Good to know. I will be glad to review the associated patches or help with
that in any other practical way.
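For reference, this is roughly how an application opts in to bdev hot-remove
notification today, using the spdk_bdev_open() variant current at the time of
writing. It is only a minimal sketch: error handling and the draining of
outstanding I/O before closing the descriptor are left out, and bdev_nvme's
internal handling will of course differ in the details.

#include <errno.h>
#include "spdk/bdev.h"
#include "spdk/log.h"

static struct spdk_bdev_desc *g_desc;

static void
hot_remove_cb(void *remove_ctx)
{
    /* Invoked by the bdev layer when the underlying bdev is hot removed.
     * A real application would quiesce outstanding I/O before closing. */
    SPDK_NOTICELOG("bdev was hot removed\n");
    spdk_bdev_close(g_desc);
    g_desc = NULL;
}

static int
open_with_hotremove(const char *name)
{
    struct spdk_bdev *bdev = spdk_bdev_get_by_name(name);

    if (bdev == NULL) {
        return -ENODEV;
    }
    /* Registering a remove callback is what opts this descriptor in to
     * hot-remove notification. */
    return spdk_bdev_open(bdev, true, hot_remove_cb, NULL, &g_desc);
}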
> As far as hotplug insertion goes, do you have a specific idea of how to
> implement that in a generic enough way to be universally applicable? I can
> think of a few different things people might want to do:
> 1. Have a specific set of subsystems on a given TRID (set of TRIDs) that you
> would want to poll.
> 2. Attach to all subsystems on a given set of TRIDs to poll.
> I think both of these would require some changes to both the
> bdev_nvme_set_hotplug API and RPC methods, since they currently only allow
> probing of PCIe addresses. It might be best to simply allow users to call
> bdev_nvme_set_hotplug multiple times, specifying a single TRID each time,
> and then keep an internal list of all of the monitored targets.
>

People's mileage may vary, but that looks pretty reasonable to me, and nicely
aligned with the existing set_hotplug API, if polling is limited to subsystems
added since, or active when, the user enabled hotplug. Essentially, the
hotplug enable call is then treated as the user's request to monitor known
subsystems, similar to what we have now with PCIe. Removing a subsystem
controller via RPC while hotplug is enabled is then a request to stop polling
for that controller's TRID. This looks both simple and practical enough for a
first shot to me.

Adding extra options to bdev_nvme_set_hotplug would provide a more flexible
and powerful solution, such as the capability to discover any subsystem on a
specified TRID or to stop monitoring a given TRID, enabling fine-grained
control. Coming up with a practical and simple default behavior still seems
key to me here.
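To make the "internal list of monitored targets" idea above a bit more
concrete, the bookkeeping could be as small as a list of parsed transport IDs
that the hotplug poller walks. A rough sketch follows; the names
(hotplug_target, g_hotplug_targets, hotplug_monitor_add_trid) are made up for
illustration and are not existing SPDK symbols.

#include <errno.h>
#include <stdlib.h>
#include "spdk/nvme.h"
#include "spdk/queue.h"

struct hotplug_target {
    struct spdk_nvme_transport_id trid;
    TAILQ_ENTRY(hotplug_target) link;
};

static TAILQ_HEAD(, hotplug_target) g_hotplug_targets =
    TAILQ_HEAD_INITIALIZER(g_hotplug_targets);

/* One call per TRID, mirroring "call bdev_nvme_set_hotplug multiple times,
 * specifying a single TRID each time". */
static int
hotplug_monitor_add_trid(const char *trid_str)
{
    struct hotplug_target *target = calloc(1, sizeof(*target));

    if (target == NULL) {
        return -ENOMEM;
    }
    /* e.g. "trtype:TCP adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420
     * subnqn:nqn.2016-06.io.spdk:cnode1" */
    if (spdk_nvme_transport_id_parse(&target->trid, trid_str) != 0) {
        free(target);
        return -EINVAL;
    }
    TAILQ_INSERT_TAIL(&g_hotplug_targets, target, link);
    return 0;
}

Disabling monitoring for a TRID would then just remove and free the matching
entry, and the periodic hotplug poller would probe each remaining entry in
turn.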
> I'm definitely in favor of both of these changes. I would be interested to
> see what others in the community think as well, though, in case I am missing
> anything. The removal side could be implemented pretty readily, but there
> are the API considerations on the insertion side.
>

It certainly makes sense to get community input on the API changes.

Thanks,
Andrey

> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin
> Sent: Tuesday, May 26, 2020 2:11 PM
> To: Storage Performance Development Kit
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Thanks Seth, please find a few comments inline below.
>
> On Tue, May 26, 2020, 23:35 Howell, Seth wrote:
>
> > Hi Andrey,
> >
> > Typically when we refer to hotplug (removal)
>
> That addition speaks for itself :). Since I'm interested in both hot
> removal/plugging, that's just wording.
>
> > in fabrics transports, we are talking about the target side of the
> > connection suddenly disconnecting the admin and I/O qpairs. This
> > definition of hotplug is already supported in the NVMe initiator. If your
> > definition of hotplug is something different, please correct me so that I
> > can better answer your question.
> >
> > In RDMA, for example, when we receive a disconnect event on the admin
> > qpair for a given controller, we mark that controller as failed and fail
> > all I/O corresponding to I/O qpairs on that controller. Then subsequent
> > calls to either submit I/O or process completions on any qpair associated
> > with that controller return -ENXIO, indicating to the initiator
> > application that the drive has been failed by the target side. There are
> > a couple of reasons that could happen:
> > 1. The actual drive itself has been hotplugged from the target application
> > (i.e. NVMe PCIe hotplug on the target side).
> > 2. There was some network event that caused the target application to
> > disconnect (NIC failure, RDMA error, etc.).
> >
> > Because there are multiple reasons we could receive a "hotplug" event from
> > the target application, we leave it up to the initiator application to
> > decide what to do with it: either destroy the controller from the
> > initiator side, try reconnecting to the controller at the same TRID, or
> > attempt to connect to the controller at a different TRID (something like
> > target-side port failover).
> >
>
> What I'm concerned with right now is that the above decision is seemingly at
> odds with SPDK's own bdev layer hotremove functionality. When spdk_bdev_open
> is called, the caller provides a hotremove callback that is expected to be
> called when the associated bdev goes away.
>
> If, for instance, I model a hotplug event by killing an SPDK nvmf/tcp target
> while running bdevperf against the namespace bdevs it exposes, I'd expect
> the bdevperf hotremove callback to be fired for each active target
> namespace. What I'm witnessing instead is the target subsystem controller
> (on the initiator side) going into a failed state after a number of
> unsuccessful resets, with bdevperf failing due to I/O errors rather than
> cleanly handling the hotremove event.
>
> Is that by design, so that I'm looking for something that's actually not
> expected to work, or is the bdev layer hot-remove functionality a bit ahead
> of the nvme layer in this case?
>
> > In terms of hotplug insertion, I assume that would mean you want the
> > initiator to automatically connect to a target subsystem that can be
> > presented at any point in time during the running of the application.
> > There isn't a specific driver-level implementation of this feature for
> > fabrics controllers, I think mostly because it would be very easy to
> > implement and customize this functionality at the application layer. For
> > example, one could periodically call discover on the targets they want to
> > connect to and, when new controllers/subsystems appear, connect to them at
> > that time.
> >
>
> Understood, though I'd expect such a feature to be pretty popular, similar
> to PCIe hotplug (which currently works), so providing it off-the-shelf
> rather than leaving the implementation to SPDK users would make sense to me.
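To make the application-level polling approach quoted above concrete, here is
a rough sketch of periodically probing a fabrics target through its discovery
service using the existing public NVMe driver API. It is illustrative only:
error handling is minimal, and de-duplication of already-attached subsystems
is left to the probe callback.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "spdk/nvme.h"
#include "spdk/nvmf_spec.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
    /* Returning true attaches to the discovered subsystem; a real
     * application would return false for subsystems it already knows. */
    return true;
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
    /* A new subsystem appeared on the target: create bdevs / start I/O here. */
}

static void
poll_for_new_subsystems(const char *trid_str)
{
    struct spdk_nvme_transport_id trid;

    memset(&trid, 0, sizeof(trid));
    /* e.g. "trtype:RDMA adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420" */
    if (spdk_nvme_transport_id_parse(&trid, trid_str) != 0) {
        return;
    }
    /* Probing the discovery service NQN enumerates and attaches to every
     * subsystem the target currently exposes. Call this from a timed poller. */
    snprintf(trid.subnqn, sizeof(trid.subnqn), "%s", SPDK_NVMF_DISCOVERY_NQN);
    spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}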
> Thanks,
> Andrey
>
> > I hope that this answers your question. Please let me know if I am talking
> > about a different definition of hotplug than the one you are using.
> >
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: Andrey Kuzmin
> > Sent: Friday, May 22, 2020 1:47 AM
> > To: Storage Performance Development Kit
> > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> >
> > Hi team,
> >
> > is NVMe hotplug functionality as implemented limited to the PCIe
> > transport, or does it also work for other transports? If it's currently
> > PCIe only, are there any plans to extend the support to RDMA/TCP?
> >
> > Thanks,
> > Andrey

_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org