From: Andrey Kuzmin
Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
Date: Mon, 13 Jul 2020 19:29:12 +0300
In-Reply-To: MN2PR11MB4256DB37734556C56E650672FE670@MN2PR11MB4256.namprd11.prod.outlook.com
To: spdk@lists.01.org

Hi Seth,

please see my comments below.

On Wed, Jul 8, 2020, 22:59 Howell, Seth wrote:

> Hi Andrey,
>
> Sorry for the radio silence on this. There have been a couple of
> developments recently that have changed the way that I think we want to
> handle this. The following is coming after a couple of conversations that I
> have had with Jim and Ben.
>
> Jim had a couple of qualms about calling the hotplug callback when the
> target side of our NVMe bdev suddenly disconnects. Chief among those is
> that when we call the hotplug remove callback for a given bdev, that means
> that the bdev is entirely gone and represents a pretty critical event for
> the application. Rather than calling the hotplug callback for fabrics
> controllers, he would prefer to handle these events with different
> asynchronous events and allow the user to choose what to do with them
> without destroying the bdev.
>
> I have recently been adding some multipath features to the initiator side
> of the connection: https://review.spdk.io/gerrit/c/spdk/spdk/+/2885/25.
> This means that now an NVMe bdev doesn't so much represent a specific
> controller as it does a set of namespaces that can be accessed through
> multiple controllers, each having different TRIDs. I have already added
> different functions to enable adding multiple TRIDs, and currently have an
> automatic failover response in place to update the TRID we are connected to
> without failing.
>
> There are also plans in the future to allow for load balancing (multiple
> connections through the same bdev to different controllers but the same set
> of namespaces). So a single path being disconnected will also not be a
> hotplug event.
>
> Do you think it would be reasonable to allow for hotplug removal and
> insertion through registering multiple TRIDs for the same controller? That
> way, the user has the flexibility to add multiple paths to the same bdev
> and do automatic recovery/failover in the bdev module. We could also
> register a separate asynchronous callback in the case that all registered
> paths are unreachable at once.
>

Integrating multipathing below the bdev layer looks pretty reasonable given
that nvme/nvmf are adopting those features. What it means for the bdev layer
is likely that it needs to evolve towards the NVMe model, with async event
notifications on the opened bdev's state transitions (a topic we had actually
discussed about a year ago). In particular, as you have proposed above, path
state change notifications can be delivered via the hotremove callback (or
its extension), provided those events are defined at the bdev module level
(presumably following the NVMe ANA syntax).

Since application needs regarding multipathing vary, it likely makes sense to
preserve the bdev_open/hotremove semantics and add an extra bdev_open flavor
with an event-aware callback. From my experience, HA scenarios also require
some tweaking of such an open flavor with regard to the path(s) state and,
particularly, the path count at open time, since single vs. multiple paths
being available are different startup scenarios in HA.
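To make the event-aware open idea concrete, here is a rough sketch of the
kind of open flavor and callback I have in mind. All names below are made up
for illustration (nothing like this exists in the bdev API today), and the
path events loosely mirror ANA state transitions:

#include "spdk/bdev.h"

/* Illustration only: hypothetical event set for an event-aware open flavor. */
enum bdev_path_event {
	BDEV_PATH_EVENT_REMOVE,		/* whole bdev gone: today's hotremove case */
	BDEV_PATH_EVENT_PATH_DOWN,	/* one path/TRID became unreachable        */
	BDEV_PATH_EVENT_PATH_UP,	/* a path/TRID was added or recovered      */
};

typedef void (*bdev_path_event_cb)(enum bdev_path_event event,
				   struct spdk_bdev *bdev, void *event_ctx);

/* Hypothetical open flavor that preserves the classic open/hotremove
 * semantics while also reporting per-path transitions. */
int bdev_open_with_events(const char *bdev_name, bool write,
			  bdev_path_event_cb event_cb, void *event_ctx,
			  struct spdk_bdev_desc **desc);

/* How an HA-minded application might consume the events. */
static void
ha_event_cb(enum bdev_path_event event, struct spdk_bdev *bdev, void *event_ctx)
{
	switch (event) {
	case BDEV_PATH_EVENT_REMOVE:
		/* All paths are gone: quiesce I/O and close the descriptor. */
		break;
	case BDEV_PATH_EVENT_PATH_DOWN:
	case BDEV_PATH_EVENT_PATH_UP:
		/* Path-level transition: adjust failover policy, keep the bdev open. */
		break;
	}
}

The point being that today's hotremove becomes just one event among several,
so existing bdev_open() callers keep working unchanged.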
Regarding load balancing, it sounds like a natural feature to have if
multipathing is to be added.

Regards,
Andrey

> Thanks,
>
> Seth
>
> -----Original Message-----
> From: Andrey Kuzmin
> Sent: Thursday, May 28, 2020 12:41 PM
> To: Storage Performance Development Kit
> Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
>
> Hi Seth,
>
> On Thu, May 28, 2020, 01:11 Howell, Seth wrote:
>
> > Hi Andrey,
> >
> > Thanks for clarifying that you were talking specifically about the
> > bdev layer. I think you are right there (and Ben spoke to me today as
> > well about this). When I was initially working on the bdev_nvme
> > implementation of failover, we were trying to mimic what happens in
> > the kernel when you lose contact with a target-side NVMe drive.
> > Essentially, it just continually attempts to reconnect until you
> > disconnect the bdev.
> >
> > But it makes sense to expose that through the bdev hotplug functionality.
> >
>
> Good to know. I will be glad to review the associated patches or help with
> that in any other practical way.
>
> > As far as hotplug insertion goes, do you have a specific idea of how
> > to implement that in a generic enough way to be universally
> > applicable? I can think of a few different things people might want to do:
> > 1. Have a specific set of subsystems on a given TRID (set of TRIDs)
> > that you would want to poll.
> > 2. Attach to all subsystems on a given set of TRIDs to poll.
> > I think both of these would require some changes to both the
> > bdev_nvme_set_hotplug API and RPC methods since they currently only
> > allow probing of PCIe addresses. It might be best to simply allow
> > users to call bdev_nvme_set_hotplug multiple times specifying a single
> > TRID each time and then keep an internal list of all of the monitored
> > targets.
> >
>
> People's mileage may vary, but I see it as pretty reasonable and nicely
> aligned with the existing set_hotplug API if polling is limited to subsystems
> added since, or active when, the user enabled hotplug. Essentially, the
> hotplug enable call is then treated as the user's request to monitor known
> subsystems, similar to what we have now with PCIe. Removing a subsystem's
> controller via RPC while hotplug is enabled is then a request to stop
> polling for that controller's TRID. This looks both simple and practical
> enough for a first shot to me.
>
> Adding extra options to set_nvme_hotplug would provide a more flexible and
> powerful solution, such as the capability to discover any subsystem on the
> specified TRID or to stop monitoring a given TRID, enabling fine-grained
> control. Coming up with a practical and simple default behavior still seems
> key to me here.
>
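To illustrate the "internal list of monitored targets" idea from the exchange
above, a rough sketch of the bookkeeping I have in mind. None of these names
exist in bdev_nvme; I'm also assuming spdk_nvme_probe() can be pointed at
fabrics TRIDs (discovery or subsystem), and the sketch skips de-duplication
of controllers that are already attached:

#include <stdbool.h>
#include <stdlib.h>

#include "spdk/nvme.h"
#include "spdk/queue.h"

/* One entry per TRID the user asked us to watch. */
struct monitored_trid {
	struct spdk_nvme_transport_id	trid;
	TAILQ_ENTRY(monitored_trid)	link;
};

static TAILQ_HEAD(, monitored_trid) g_monitored_trids =
	TAILQ_HEAD_INITIALIZER(g_monitored_trids);

static bool
hotplug_probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
		 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;	/* attach to every subsystem that shows up */
}

static void
hotplug_attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
		  struct spdk_nvme_ctrlr *ctrlr,
		  const struct spdk_nvme_ctrlr_opts *opts)
{
	/* A newly appeared controller: create the corresponding nvme bdevs here. */
}

/* Called when the user enables hotplug or adds a fabrics controller via RPC:
 * remember the TRID so the hotplug poller keeps an eye on it. */
static void
monitor_trid(const struct spdk_nvme_transport_id *trid)
{
	struct monitored_trid *entry = calloc(1, sizeof(*entry));

	if (entry != NULL) {
		entry->trid = *trid;
		TAILQ_INSERT_TAIL(&g_monitored_trids, entry, link);
	}
}

/* Periodic hotplug poller: re-probe each monitored TRID so that subsystems
 * appearing (or reappearing) after enable are picked up automatically. */
static int
fabrics_hotplug_poll(void *ctx)
{
	struct monitored_trid *entry;

	TAILQ_FOREACH(entry, &g_monitored_trids, link) {
		spdk_nvme_probe(&entry->trid, NULL, hotplug_probe_cb,
				hotplug_attach_cb, NULL);
	}
	return 1;	/* non-zero: the poller did some work */
}

Removing a controller via RPC would then simply drop the matching entry from
the list, which is the "stop polling for that TRID" behavior mentioned above.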
> > I'm definitely in favor of both of these changes. I would be
> > interested to see what others in the community think as well though in
> > case I am missing anything. The removal side could be pretty readily
> > implemented, but there are the API considerations on the insertion side.
> >
>
> It certainly makes sense to get community input on the API changes.
>
> Thanks,
> Andrey
>
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: Andrey Kuzmin
> > Sent: Tuesday, May 26, 2020 2:11 PM
> > To: Storage Performance Development Kit
> > Subject: [SPDK] Re: NVMe hotplug for RDMA and TCP transports
> >
> > Thanks Seth, please find a few comments inline below.
> >
> > On Tue, May 26, 2020, 23:35 Howell, Seth wrote:
> >
> > > Hi Andrey,
> > >
> > > Typically when we refer to hotplug (removal)
> >
> > That addition speaks for itself :). Since I'm interested in both hot
> > removal/plugging, that's just wording.
> >
> > > in fabrics transports, we are talking about the target side of the
> > > connection suddenly disconnecting the admin and I/O qpairs by the
> > > target side of the connection. This definition of hotplug is already
> > > supported in the NVMe initiator. If your definition of hotplug is
> > > something different, please correct me so that I can better answer
> > > your question.
> > >
> > > In RDMA for example, when we receive a disconnect event on the admin
> > > qpair for a given controller, we mark that controller as failed and
> > > fail up all I/O corresponding to I/O qpairs on that controller. Then
> > > subsequent calls to either submit I/O or process completions on any
> > > qpair associated with that controller return -ENXIO, indicating to
> > > the initiator application that the drive has been failed by the target
> > > side. There are a couple of reasons that could happen:
> > > 1. The actual drive itself has been hotplugged from the target
> > > application (i.e. NVMe PCIe hotplug on the target side)
> > > 2. There was some network event that caused the target application to
> > > disconnect (NIC failure, RDMA error, etc.)
> > >
> > > Because there are multiple reasons we could receive a "hotplug"
> > > event from the target application, we leave it up to the initiator
> > > application to decide what they want to do with this. Either destroy
> > > the controller from the initiator side, try reconnecting to the
> > > controller from the same TRID, or attempt to connect to the
> > > controller from a different TRID (something like target-side port
> > > failover).
> >
> > What I'm concerned with right now is that the above decision is
> > seemingly at odds with SPDK's own bdev layer hotremove
> > functionality. When spdk_bdev_open is being called, the caller
> > provides a hotremove callback that is expected to be called when the
> > associated bdev goes away.
> >
> > If, for instance, I model a hotplug event by killing the SPDK nvmf/tcp
> > target while running bdevperf against the namespace bdevs it exposes, I'd
> > expect the bdevperf hotremove callback to be fired for each active target
> > namespace. What I'm witnessing instead is the target subsystem controller
> > (on the initiator side) going into a failed state after a number of
> > unsuccessful resets, with bdevperf failing due to I/O errors rather than
> > cleanly handling the hotremove event.
> >
> > Is that by design, so that I'm looking for something that's actually
> > not expected to work, or is the bdev layer hot-remove functionality a bit
> > ahead of the nvme layer in this case?
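For reference, this is the bdev-layer contract I'm referring to above: a
minimal sketch of how a consumer registers the hotremove callback through
spdk_bdev_open() (error handling trimmed, and the callback body is only
illustrative; treat the exact details as approximate):

#include <errno.h>
#include <stdio.h>

#include "spdk/bdev.h"

/* Hotremove callback: the bdev layer invokes this when the opened bdev
 * goes away, e.g. on PCIe hot removal today. */
static void
bdev_hotremove_cb(void *remove_ctx)
{
	const char *name = remove_ctx;

	/* A real application would quiesce outstanding I/O here and close
	 * the descriptor from the thread that opened it. */
	printf("bdev %s was hot-removed\n", name);
}

static int
open_bdev_with_hotremove(const char *name, struct spdk_bdev_desc **desc)
{
	struct spdk_bdev *bdev = spdk_bdev_get_by_name(name);

	if (bdev == NULL) {
		return -ENODEV;
	}
	/* The third argument is the hotremove callback discussed above. */
	return spdk_bdev_open(bdev, true, bdev_hotremove_cb, (void *)name, desc);
}

The question above is essentially whether a fabrics-side disconnect should
eventually surface through this same callback, the way a PCIe removal does.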
> > > In terms of hotplug insertion, I assume that would mean you want the
> > > initiator to automatically connect to a target subsystem that can be
> > > presented at any point in time during the running of the application.
> > > There isn't a specific driver-level implementation of this feature
> > > for fabrics controllers, I think mostly because it would be very
> > > easy to implement and customize this functionality at the application
> > > layer. For example, one could periodically call discover on the targets
> > > they want to connect to and, when new controllers/subsystems appear,
> > > connect to them at that time.
> >
> > Understood, though I'd expect such a feature to be pretty popular,
> > similar to PCIe hotplug (which currently works), so providing it
> > off-the-shelf rather than leaving the implementation to SPDK users would
> > make sense to me.
> >
> > Thanks,
> > Andrey
> >
> > > I hope that this answers your question. Please let me know if I am
> > > talking about a different definition of hotplug than the one you are
> > > using.
> > >
> > > Thanks,
> > >
> > > Seth
> > >
> > > -----Original Message-----
> > > From: Andrey Kuzmin
> > > Sent: Friday, May 22, 2020 1:47 AM
> > > To: Storage Performance Development Kit
> > > Subject: [SPDK] NVMe hotplug for RDMA and TCP transports
> > >
> > > Hi team,
> > >
> > > is NVMe hotplug functionality as implemented limited to the PCIe
> > > transport, or does it also work for other transports? If it's
> > > currently PCIe only, are there any plans to extend the support to
> > > RDMA/TCP?
> > >
> > > Thanks,
> > > Andrey