From: Howell, Seth
To: spdk@lists.01.org
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
Date: Wed, 08 Jul 2020 19:59:44 +0000

Hi Andrey,

Sorry for the radio silence on this. There have been a couple of developments
recently that have changed the way I think we want to handle this. The
following comes after a couple of conversations that I have had with Jim and
Ben.

Jim had a couple of qualms about calling the hotplug callback when the target
side of our NVMe bdev suddenly disconnects. Chief among them is that when we
call the hotplug remove callback for a given bdev, it means the bdev is
entirely gone, which represents a pretty critical event for the application.
Rather than calling the hotplug callback for fabrics controllers, he would
prefer to handle these events with different asynchronous events and allow
the user to choose what to do with them without destroying the bdev.

I have recently been adding some multipath features to the initiator side of
the connection: https://review.spdk.io/gerrit/c/spdk/spdk/+/2885/25. This
means that an NVMe bdev no longer represents a specific controller so much as
a set of namespaces that can be accessed through multiple controllers, each
having a different TRID. I have already added functions to enable adding
multiple TRIDs, and currently have an automatic failover response in place to
update the TRID we are connected to without failing. There are also plans to
allow for load balancing in the future (multiple connections through the same
bdev to different controllers but the same set of namespaces). So a single
path being disconnected will also not be a hotplug event.

Do you think it would be reasonable to allow for hotplug removal and insertion
through registering multiple TRIDs for the same controller? That way, the user
has the flexibility to add multiple paths to the same bdev and do automatic
recovery/failover in the bdev module. We could also register a separate
asynchronous callback for the case where all registered paths become
unreachable at once.

Thanks,

Seth

-----Original Message-----
From: Andrey Kuzmin
Sent: Thursday, May 28, 2020 12:41 PM
To: Storage Performance Development Kit
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports

Hi Seth,

On Thu, May 28, 2020, 01:11 Howell, Seth wrote:

> Hi Andrey,
>
> Thanks for clarifying that you were talking specifically about the bdev
> layer. I think you are right there (and Ben spoke to me today about this as
> well). When I was initially working on the bdev_nvme implementation of
> failover, we were trying to mimic what happens in the kernel when you lose
> contact with a target-side NVMe drive. Essentially, it just continually
> attempts to reconnect until you disconnect the bdev.
>
> But it makes sense to expose that through the bdev hotplug functionality.
>

Good to know. I will be glad to review the associated patches or help with
that in any other practical way.
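For reference, this is roughly how an application opts in to bdev hot-remove
notification today, using the spdk_bdev_open() variant current at the time of
writing. It is only a minimal sketch: error handling and the draining of
outstanding I/O before closing the descriptor are left out, and bdev_nvme's
internal handling will of course differ in the details.

#include <errno.h>
#include "spdk/bdev.h"
#include "spdk/log.h"

static struct spdk_bdev_desc *g_desc;

static void
hot_remove_cb(void *remove_ctx)
{
    /* Invoked by the bdev layer when the underlying bdev is hot removed.
     * A real application would quiesce outstanding I/O before closing. */
    SPDK_NOTICELOG("bdev was hot removed\n");
    spdk_bdev_close(g_desc);
    g_desc = NULL;
}

static int
open_with_hotremove(const char *name)
{
    struct spdk_bdev *bdev = spdk_bdev_get_by_name(name);

    if (bdev == NULL) {
        return -ENODEV;
    }
    /* Registering a remove callback is what opts this descriptor in to
     * hot-remove notification. */
    return spdk_bdev_open(bdev, true, hot_remove_cb, NULL, &g_desc);
}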
> As far as hotplug insertion goes, do you have a specific idea of how to
> implement that in a generic enough way to be universally applicable? I can
> think of a few different things people might want to do:
> 1. Have a specific set of subsystems on a given TRID (set of TRIDs) that you
> would want to poll.
> 2. Attach to all subsystems on a given set of TRIDs to poll.
> I think both of these would require some changes to both the
> bdev_nvme_set_hotplug API and RPC methods, since they currently only allow
> probing of PCIe addresses. It might be best to simply allow users to call
> bdev_nvme_set_hotplug multiple times, specifying a single TRID each time,
> and then keep an internal list of all of the monitored targets.
>

People's mileage may vary, but that looks pretty reasonable to me, and nicely
aligned with the existing set_hotplug API, if polling is limited to subsystems
added since, or active when, the user enabled hotplug. Essentially, the
hotplug enable call is then treated as the user's request to monitor known
subsystems, similar to what we have now with PCIe. Removing a subsystem
controller via RPC while hotplug is enabled is then a request to stop polling
for that controller's TRID. This looks both simple and practical enough for a
first shot to me.

Adding extra options to bdev_nvme_set_hotplug would provide a more flexible
and powerful solution, such as the capability to discover any subsystem on a
specified TRID or to stop monitoring a given TRID, enabling fine-grained
control. Coming up with a practical and simple default behavior still seems
key to me here.
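To make the "internal list of monitored targets" idea above a bit more
concrete, the bookkeeping could be as small as a list of parsed transport IDs
that the hotplug poller walks. A rough sketch follows; the names
(hotplug_target, g_hotplug_targets, hotplug_monitor_add_trid) are made up for
illustration and are not existing SPDK symbols.

#include <errno.h>
#include <stdlib.h>
#include "spdk/nvme.h"
#include "spdk/queue.h"

struct hotplug_target {
    struct spdk_nvme_transport_id trid;
    TAILQ_ENTRY(hotplug_target) link;
};

static TAILQ_HEAD(, hotplug_target) g_hotplug_targets =
    TAILQ_HEAD_INITIALIZER(g_hotplug_targets);

/* One call per TRID, mirroring "call bdev_nvme_set_hotplug multiple times,
 * specifying a single TRID each time". */
static int
hotplug_monitor_add_trid(const char *trid_str)
{
    struct hotplug_target *target = calloc(1, sizeof(*target));

    if (target == NULL) {
        return -ENOMEM;
    }
    /* e.g. "trtype:TCP adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420
     * subnqn:nqn.2016-06.io.spdk:cnode1" */
    if (spdk_nvme_transport_id_parse(&target->trid, trid_str) != 0) {
        free(target);
        return -EINVAL;
    }
    TAILQ_INSERT_TAIL(&g_hotplug_targets, target, link);
    return 0;
}

Disabling monitoring for a TRID would then just remove and free the matching
entry, and the periodic hotplug poller would probe each remaining entry in
turn.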
> I'm definitely in favor of both of these changes. I would be interested to
> see what others in the community think as well, though, in case I am missing
> anything. The removal side could be implemented pretty readily, but there
> are the API considerations on the insertion side.
>

It certainly makes sense to get community input on the API changes.

Thanks,
Andrey

> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin
> Sent: Tuesday, May 26, 2020 2:11 PM
> To: Storage Performance Development Kit
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Thanks Seth, please find a few comments inline below.
>
> On Tue, May 26, 2020, 23:35 Howell, Seth wrote:
>
> > Hi Andrey,
> >
> > Typically when we refer to hotplug (removal)
>
> That addition speaks for itself :). Since I'm interested in both hot
> removal/plugging, that's just wording.
>
> > in fabrics transports, we are talking about the target side of the
> > connection suddenly disconnecting the admin and I/O qpairs. This
> > definition of hotplug is already supported in the NVMe initiator. If your
> > definition of hotplug is something different, please correct me so that I
> > can better answer your question.
> >
> > In RDMA, for example, when we receive a disconnect event on the admin
> > qpair for a given controller, we mark that controller as failed and fail
> > all I/O corresponding to I/O qpairs on that controller. Then subsequent
> > calls to either submit I/O or process completions on any qpair associated
> > with that controller return -ENXIO, indicating to the initiator
> > application that the drive has been failed by the target side. There are
> > a couple of reasons that could happen:
> > 1. The actual drive itself has been hotplugged from the target application
> > (i.e. NVMe PCIe hotplug on the target side).
> > 2. There was some network event that caused the target application to
> > disconnect (NIC failure, RDMA error, etc.).
> >
> > Because there are multiple reasons we could receive a "hotplug" event from
> > the target application, we leave it up to the initiator application to
> > decide what to do with it: either destroy the controller from the
> > initiator side, try reconnecting to the controller at the same TRID, or
> > attempt to connect to the controller at a different TRID (something like
> > target-side port failover).
> >
>
> What I'm concerned with right now is that the above decision is seemingly at
> odds with SPDK's own bdev layer hotremove functionality. When spdk_bdev_open
> is called, the caller provides a hotremove callback that is expected to be
> called when the associated bdev goes away.
>
> If, for instance, I model a hotplug event by killing an SPDK nvmf/tcp target
> while running bdevperf against the namespace bdevs it exposes, I'd expect
> the bdevperf hotremove callback to be fired for each active target
> namespace. What I'm witnessing instead is the target subsystem controller
> (on the initiator side) going into a failed state after a number of
> unsuccessful resets, with bdevperf failing due to I/O errors rather than
> cleanly handling the hotremove event.
>
> Is that by design, so that I'm looking for something that's actually not
> expected to work, or is the bdev layer hot-remove functionality a bit ahead
> of the nvme layer in this case?
>
> > In terms of hotplug insertion, I assume that would mean you want the
> > initiator to automatically connect to a target subsystem that can be
> > presented at any point in time during the running of the application.
> > There isn't a specific driver-level implementation of this feature for
> > fabrics controllers, I think mostly because it would be very easy to
> > implement and customize this functionality at the application layer. For
> > example, one could periodically call discover on the targets they want to
> > connect to and, when new controllers/subsystems appear, connect to them at
> > that time.
> >
>
> Understood, though I'd expect such a feature to be pretty popular, similar
> to PCIe hotplug (which currently works), so providing it off-the-shelf
> rather than leaving the implementation to SPDK users would make sense to me.
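To make the application-level polling approach quoted above concrete, here is
a rough sketch of periodically probing a fabrics target through its discovery
service using the existing public NVMe driver API. It is illustrative only:
error handling is minimal, and de-duplication of already-attached subsystems
is left to the probe callback.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "spdk/nvme.h"
#include "spdk/nvmf_spec.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
    /* Returning true attaches to the discovered subsystem; a real
     * application would return false for subsystems it already knows. */
    return true;
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
    /* A new subsystem appeared on the target: create bdevs / start I/O here. */
}

static void
poll_for_new_subsystems(const char *trid_str)
{
    struct spdk_nvme_transport_id trid;

    memset(&trid, 0, sizeof(trid));
    /* e.g. "trtype:RDMA adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420" */
    if (spdk_nvme_transport_id_parse(&trid, trid_str) != 0) {
        return;
    }
    /* Probing the discovery service NQN enumerates and attaches to every
     * subsystem the target currently exposes. Call this from a timed poller. */
    snprintf(trid.subnqn, sizeof(trid.subnqn), "%s", SPDK_NVMF_DISCOVERY_NQN);
    spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}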
> Thanks,
> Andrey
>
> > I hope that this answers your question. Please let me know if I am talking
> > about a different definition of hotplug than the one you are using.
> >
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: Andrey Kuzmin
> > Sent: Friday, May 22, 2020 1:47 AM
> > To: Storage Performance Development Kit
> > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> >
> > Hi team,
> >
> > is NVMe hotplug functionality as implemented limited to the PCIe
> > transport, or does it also work for other transports? If it's currently
> > PCIe only, are there any plans to extend the support to RDMA/TCP?
> >
> > Thanks,
> > Andrey

_______________________________________________
SPDK mailing list -- spdk(a)lists.01.org
To unsubscribe send an email to spdk-leave(a)lists.01.org