Hi Seth,

On Thu, May 28, 2020, 01:11 Howell, Seth wrote:

> Hi Andrey,
>
> Thanks for clarifying that you were talking specifically about the bdev
> layer. I think you are right there (and Ben spoke to me today as well
> about this). When I was initially working on the bdev_nvme implementation
> of failover, we were trying to mimic what happens in the kernel when you
> lose contact with a target-side NVMe drive. Essentially, it just
> continually attempts to reconnect until you disconnect the bdev.
>
> But it makes sense to expose that through the bdev hotplug functionality.

Good to know. I will be glad to review the associated patches or help with
that in any other practical way.

> As far as hotplug insertion goes, do you have a specific idea of how to
> implement that in a generic enough way to be universally applicable? I
> can think of a few different things people might want to do:
> 1. Have a specific set of subsystems on a given TRID (set of TRIDs) that
> you would want to poll.
> 2. Attach to all subsystems on a given set of TRIDs to poll.
> I think both of these would require some changes to both the
> bdev_nvme_set_hotplug API and RPC methods since they currently only allow
> probing of PCIe addresses. It might be best to simply allow users to call
> bdev_nvme_set_hotplug multiple times specifying a single TRID each time
> and then keep an internal list of all of the monitored targets.

People's mileage may vary, but I find this pretty reasonable and nicely
aligned with the existing set_hotplug API if polling is limited to
subsystems added since, or active when, the user enabled hotplug.
Essentially, the hotplug enable call is then treated as the user's request
to monitor known subsystems, similar to what we have now with PCIe.
Removing a subsystem's controller via RPC while hotplug is enabled is then
a request to stop polling for that controller's TRID. This looks both
simple and practical enough for a first shot to me (a rough sketch of such
a monitored-TRID list follows below). Adding extra options to
bdev_nvme_set_hotplug would provide a more flexible and powerful solution,
such as the ability to discover any subsystem on a specified TRID or to
stop monitoring a given TRID, enabling fine-grained control. Coming up
with a practical and simple default behavior still seems key to me here.

> I'm definitely in favor of both of these changes. I would be interested
> to see what others in the community think as well though, in case I am
> missing anything. The removal side could be pretty readily implemented,
> but there are the API considerations on the insertion side.

It certainly makes sense to get community input on the API changes.

Thanks,
Andrey

> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin
> Sent: Tuesday, May 26, 2020 2:11 PM
> To: Storage Performance Development Kit
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Thanks Seth, please find a few comments inline below.
>
> On Tue, May 26, 2020, 23:35 Howell, Seth wrote:
>
> > Hi Andrey,
> >
> > Typically when we refer to hotplug (removal)
>
> That addition speaks for itself :). Since I'm interested in both hot
> removal and hot plugging, that's just wording.
>
> > in fabrics transports, we are talking about the target side of the
> > connection suddenly disconnecting the admin and I/O qpairs. This
> > definition of hotplug is already supported in the NVMe initiator. If
> > your definition of hotplug is something different, please correct me so
> > that I can better answer your question.
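To make the per-TRID idea above a bit more concrete, here is a rough sketch
of what the internal list of monitored targets might look like inside the
nvme bdev module. This is purely hypothetical: none of these names exist in
SPDK today, and the only real pieces are struct spdk_nvme_transport_id,
spdk_nvme_transport_id_compare() and the TAILQ macros from spdk/queue.h.

/*
 * Hypothetical sketch only: nothing below exists in bdev_nvme today.
 */
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/queue.h"

struct monitored_trid {
        struct spdk_nvme_transport_id trid;
        TAILQ_ENTRY(monitored_trid) link;
};

static TAILQ_HEAD(, monitored_trid) g_monitored_trids =
        TAILQ_HEAD_INITIALIZER(g_monitored_trids);

/* One call per bdev_nvme_set_hotplug RPC carrying a single fabrics TRID. */
static int
hotplug_monitor_add_trid(const struct spdk_nvme_transport_id *trid)
{
        struct monitored_trid *entry;

        TAILQ_FOREACH(entry, &g_monitored_trids, link) {
                if (spdk_nvme_transport_id_compare(&entry->trid, trid) == 0) {
                        return -EEXIST; /* already being monitored */
                }
        }

        entry = calloc(1, sizeof(*entry));
        if (entry == NULL) {
                return -ENOMEM;
        }

        entry->trid = *trid;
        TAILQ_INSERT_TAIL(&g_monitored_trids, entry, link);
        return 0;
}

A hotplug poller could then walk this list periodically and probe each
fabrics TRID, much like the existing hotplug poller probes the PCIe bus
today, while removing a controller via RPC would drop its TRID from the
list.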
> > In RDMA, for example, when we receive a disconnect event on the admin
> > qpair for a given controller, we mark that controller as failed and
> > fail all I/O corresponding to I/O qpairs on that controller. Then
> > subsequent calls to either submit I/O or process completions on any
> > qpair associated with that controller return -ENXIO, indicating to the
> > initiator application that the drive has been failed by the target
> > side. There are a couple of reasons that could happen:
> > 1. The actual drive itself has been hotplugged from the target
> > application (i.e. NVMe PCIe hotplug on the target side)
> > 2. There was some network event that caused the target application to
> > disconnect (NIC failure, RDMA error, etc.)
> >
> > Because there are multiple reasons we could receive a "hotplug" event
> > from the target application, we leave it up to the initiator
> > application to decide what they want to do with this: either destroy
> > the controller from the initiator side, try reconnecting to the
> > controller at the same TRID, or attempt to connect to the controller
> > at a different TRID (something like target-side port failover).
>
> What I'm concerned with right now is that the above decision is seemingly
> at odds with SPDK's own bdev layer hot-remove functionality. When
> spdk_bdev_open is called, the caller provides a hot-remove callback that
> is expected to be invoked when the associated bdev goes away.
>
> If, for instance, I model a hotplug event by killing the SPDK nvmf/tcp
> target while running bdevperf against the namespace bdevs it exposes, I'd
> expect the bdevperf hot-remove callback to be fired for each active
> target namespace. What I'm witnessing instead is the target subsystem
> controller (on the initiator side) going into the failed state after a
> number of unsuccessful resets, with bdevperf failing due to I/O errors
> rather than cleanly handling the hot-remove event.
>
> Is that by design, so that I'm looking for something that's actually not
> expected to work, or is the bdev layer hot-remove functionality a bit
> ahead of the nvme layer in this case?
>
> > In terms of hotplug insertion, I assume that would mean you want the
> > initiator to automatically connect to a target subsystem that can be
> > presented at any point in time during the running of the application.
> > There isn't a specific driver-level implementation of this feature for
> > fabrics controllers, I think mostly because it would be very easy to
> > implement and customize this functionality at the application layer.
> > For example, one could periodically call discover on the targets they
> > want to connect to and, when new controllers/subsystems appear, connect
> > to them at that time.
>
> Understood, though I'd expect such a feature to be pretty popular,
> similar to PCIe hotplug (which currently works), so providing it off the
> shelf rather than leaving the implementation to SPDK users would make
> sense to me.
>
> Thanks,
> Andrey
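To make the discovery-polling approach concrete, below is a minimal sketch
built only on the existing public driver API, namely
spdk_nvme_transport_id_parse() and spdk_nvme_probe(). The discovery address
is a placeholder, duplicate tracking is kept in the application, and error
handling is largely omitted; treat it as an illustration rather than a
proposed implementation.

#include "spdk/stdinc.h"
#include "spdk/nvme.h"

#define MAX_TRACKED_SUBSYSTEMS 64

/* Subsystem NQNs we have already attached to, so that repeated polls of
 * the same discovery service only pick up newly appeared subsystems. */
static char g_attached_nqns[MAX_TRACKED_SUBSYSTEMS][256];
static int g_num_attached;

static bool
already_attached(const char *subnqn)
{
        int i;

        for (i = 0; i < g_num_attached; i++) {
                if (strcmp(g_attached_nqns[i], subnqn) == 0) {
                        return true;
                }
        }
        return false;
}

/* Only say "yes" to subsystems that were not present on a previous poll. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
        return !already_attached(trid->subnqn);
}

/* A newly appeared subsystem: remember its NQN. A real application would
 * create bdevs / start I/O against its namespaces here. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr,
          const struct spdk_nvme_ctrlr_opts *opts)
{
        if (g_num_attached < MAX_TRACKED_SUBSYSTEMS) {
                snprintf(g_attached_nqns[g_num_attached++],
                         sizeof(g_attached_nqns[0]), "%s", trid->subnqn);
        }
}

/* Call periodically, e.g. from an SPDK poller. With the discovery NQN in
 * the TRID, spdk_nvme_probe() reads the discovery log and invokes the
 * callbacks for each subsystem reported behind this address. */
static void
poll_discovery_service(void)
{
        struct spdk_nvme_transport_id trid = {};

        if (spdk_nvme_transport_id_parse(&trid,
                        "trtype:TCP adrfam:IPv4 traddr:192.168.1.10 "
                        "trsvcid:4420 "
                        "subnqn:nqn.2014-08.org.nvmexpress.discovery") != 0) {
                return;
        }

        spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}

If something along these lines moved into the driver or the nvme bdev
module, a per-TRID loop like this is roughly what a fabrics hotplug poller
could run.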
> > I hope that this answers your question. Please let me know if I am
> > talking about a different definition of hotplug than the one you are
> > using.
> >
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: Andrey Kuzmin
> > Sent: Friday, May 22, 2020 1:47 AM
> > To: Storage Performance Development Kit
> > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> >
> > Hi team,
> >
> > Is NVMe hotplug functionality, as implemented, limited to the PCIe
> > transport, or does it also work for other transports? If it's currently
> > PCIe only, are there any plans to extend the support to RDMA/TCP?
> >
> > Thanks,
> > Andrey
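P.S. For completeness, the bdev-layer contract I'm referring to above is
the remove callback passed to spdk_bdev_open(). A minimal sketch follows;
the bdev name and the printf are placeholders, and a real application would
quiesce I/O and close the descriptor from its own context rather than in
the callback.

#include "spdk/stdinc.h"
#include "spdk/bdev.h"

static struct spdk_bdev_desc *g_desc;
static bool g_bdev_removed;

/* Invoked by the bdev layer when the opened bdev is hot-removed; this is
 * the notification I'd expect to fire when the remote target goes away. */
static void
bdev_hotremove_cb(void *remove_ctx)
{
        const char *name = remove_ctx;

        printf("bdev %s was hot-removed\n", name);
        g_bdev_removed = true;
        /* Stop submitting I/O and eventually call spdk_bdev_close(g_desc). */
}

static int
open_with_hotremove(const char *bdev_name)
{
        struct spdk_bdev *bdev = spdk_bdev_get_by_name(bdev_name);

        if (bdev == NULL) {
                return -ENODEV;
        }

        /* remove_cb/remove_ctx make up the hot-remove notification hook. */
        return spdk_bdev_open(bdev, true, bdev_hotremove_cb,
                              (void *)bdev_name, &g_desc);
}

Whether that callback fires when a remote target disappears is exactly the
gap between the bdev and nvme layers I'm asking about.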