All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] libibverbs IB device hotplug support
@ 2017-02-27 16:08 Alex Rosenbaum
       [not found] ` <CAFgAxU9eQYwBad2+SJw1x+cQMMqzgkz0fpeumG7Rx=o_eQqo2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-02-27 16:08 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

Introduction

--------------------------------------------------------------------------------

Hotplug [1] is the method in which new hardware becomes available in the system

or is being removed from it. User space applications would like to continue

operation while hardware is being changed in the system, without the need to

restart the process, lose its current state and lose open sessions (at least on

other available hardware).



The problem

--------------------------------------------------------------------------------

Today the IB device list is returned by ibv_get_device_list() from libibverbs.

This IB device list is created by scanning of the /sys/class/infiniband_verbs

[2]. The list is cached and never updated, no matter if there were hardware

changes in the system.



Detection of hotplug events is not part of libibverbs functionality and thus

not part of this RFC. User space applications should monitor respectful changes

in the system according to its specific logic to detect plugout or plugin of

hardware devices. This can be does by means of: netlink, udev or other inputs.



Suggested Solution

--------------------------------------------------------------------------------

In order for user space applications to support hotplug of IB device (PlugIn

and PlugOut), libibverbs must be able to provide the application access to new

ibv_device objects, according to the recent system hardware changes.



Here I suggest to modify the implementation of ibv_get_device_list() so that

consecutive calls will re-scan the sysfs in the same manner as done today in

order to create a fresh ibv_device list each time. We will remove caching of

devices that support plugout, while keeping the ibv_device cache for devices

which do not support plugout.



For this purpose, the ibv_get_device_list() device scanning logic should be

separated from the libibverbs singleton initialization step.

User can call ibv_open_device() while holding this list (see man pages) and

once ibv_free_device_list() is called libibverbs can release the unused

ibv_device objects. Later, on calls to ibv_close_device(), additional

ibv_device object should be released. Currently, on ibv_free_device_list(),

only the array is freed, while the ibv_device objects are never freed.

libibverbs will maintain a ref_count for each verbs_device object. Increase

verbs_device->ref_count for every ibv_get_device_list() or ibv_open_device().

Decrease it for every ibv_free_device_list() or ibv_close_device().

On decrease, if ref_count tested to be zero, libibverbs will call the provider

library to release the 'strcut verbs_device' which it allocated.

Each provider library should provide a function to release the verbs_device

object: 'uninit_device(struct verbs_device* device)'.

In order to prevent resource leak for provider libraries that do not support

plugout API, libibverbs will move the relevant ibv_device’s to a cached device

list which will never be refreshed (like today) and also remove the respectful

provider library (ibv_driver) from the registered driver list. Remove of the

ibv_driver will make sure future scans of the sysfs will not generate

additional copies of the same ibv_device.



Applications Behavior

--------------------------------------------------------------------------------

Applications use different logic to decide which ibv_device is the relevant

device they want to use. And each application has its own detection logic to

track such changes in device availability.

Few examples: librdmacm logic is based on GUID values. Socket acceleration

(libvma) maps an IB device to its corresponding net iface based on netlink and

sysfs. DPDK applications lookup the IB device PCI address. And most MPI

implementation want human specified IB device name in command line and will

probably not handle any hotplug (out or in) events.



It is the application's responsibility to check which ibv_device returned from

ibv_get_device_list() has changed from previous scan and which is of interest.



Verbs can issue an IBV_EVENT_DEVICE_FATAL async event on an open user space

ibv_context for device's which support the ib_device->disassociate_ucontext().

This event will indicate to the application that the device is no longer

operational. In addition, user space CQ channel fd’s blocking calls on recv(),

select(), poll() or epoll() will be released with EINTR errno.



Typical user space application will monitor hardware changes and/or call for

ibv_get_device_list() only from control path dedicated thread, and not from the

fast path threads.



Pitfall

--------------------------------------------------------------------------------

If a legacy user space application did not follow the ibv_get_device_list()

man page definition, and it saved a private copy of an ibv_device pointer and

used it after releasing the device list (call to ibv_free_device_list()), then

ibv_open_device() might seg-fault based on this new suggestion.

We can work around this by moving the IB device re-scan logic to a new API

'ibv_refresh_device_list()' so that only new application using this API will

have correct behavior as needed.



Reference

--------------------------------------------------------------------------------

[1] https://www.kernel.org/doc/pending/hotplug.txt

[2] https://github.com/linux-rdma/rdma-core/blob/master/Documentation/libibverbs.md



API changes

--------------------------------------------------------------------------------

Signed-off-by: Alex Rosenbaum <alexr@xxxxxxxxxxxx>

--- a/libibverbs/verbs.h

+++ b/libibverbs/verbs.h

@@ -1336,6 +1336,7 @@ struct verbs_device {

                                struct ibv_context *ctx, int cmd_fd);

        void    (*uninit_context)(struct verbs_device *device,

                                struct ibv_context *ctx);

+       void    (*uninit_device)(struct verbs_device *device);

+       atomic_t         ref_count;

        /* future fields added here */

};
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found] ` <CAFgAxU9eQYwBad2+SJw1x+cQMMqzgkz0fpeumG7Rx=o_eQqo2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-27 18:59   ` Jason Gunthorpe
       [not found]     ` <20170227185912.GM5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2017-02-28 15:48   ` Doug Ledford
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 18:59 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Mon, Feb 27, 2017 at 06:08:41PM +0200, Alex Rosenbaum wrote:
 
> Here I suggest to modify the implementation of ibv_get_device_list()
> so that consecutive calls will re-scan the sysfs in the same manner
> as done today in order to create a fresh ibv_device list each
> time. We will remove caching of devices that support plugout, while
> keeping the ibv_device cache for devices which do not support
> plugout.

I have no problem with this.

> In order to prevent resource leak for provider libraries that do not
> support plugout API, libibverbs will move the relevant ibv_device’s
> to a cached device

We no longer support configurations like this, add the new feature and
globally update all providers in rdma-core. No hacks to support
un-updated providers.

> Few examples: librdmacm logic is based on GUID values. Socket
> acceleration (libvma) maps an IB device to its corresponding net
> iface based on netlink and sysfs. DPDK applications lookup the IB
> device PCI address. And most MPI implementation want human specified
> IB device name in command line and will probably not handle any
> hotplug (out or in) events.

I have often thought we should have better entry points to get a
specific device/port in libibverbs.

> If a legacy user space application did not follow the
> ibv_get_device_list() man page definition, and it saved a private
> copy of an ibv_device pointer and used it after releasing the device
> list (call to ibv_free_device_list()), then

This is explicitly called out as broken in the man page, we do not
need to support applications using the API in such an obviously wrong
way.

NOTES

       Client code should open all the devices it intends to use with
       ibv_open_device() before calling ibv_free_device_list().  Once
       it frees the array with ibv_free_device_list(), it will be able
       to use only the open devices; pointers to unopened devices will no
       longer be valid.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]     ` <20170227185912.GM5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 22:11       ` Alex Rosenbaum
       [not found]         ` <CAFgAxU9XJTXkaeL_VE7zHASPBM+j=TZd2L+McKBcJRThJtUN5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-02-27 22:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Mike Marciniszyn, Steve Wise, Lijun Ou,
	M: Wei Hu(Xavier),
	Dennis Dalessandro, Tatyana Nikolova, Vladimir Sokolovsky,
	Yishai Hadas, Devesh Sharma, Ram Amrani, Ariel Elior, Moni Shoua,
	Adit Ranadive
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A

On Mon, Feb 27, 2017 at 8:59 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Feb 27, 2017 at 06:08:41PM +0200, Alex Rosenbaum wrote:
>
>> In order to prevent resource leak for provider libraries that do not
>> support plugout API, libibverbs will move the relevant ibv_device’s
>> to a cached device
>
> We no longer support configurations like this, add the new feature and
> globally update all providers in rdma-core. No hacks to support
> un-updated providers.

This is definitely one approach I was thinking about as well. It will
surely help clean some of the older, not extendible API's.
I'd like to call out to all provide library maintainer. if you have
started code porting from the old ibv_register_driver() to the new
extendible verbs_register_driver() then please move forward with
submitting it. Once using the extendible 'struct verbs_device' it will
be easy to add the new unint_device() functionality to all provided
library in order to support plug-in for all.

Here is the list of the maintainers and provider libraries still use
the old ibv_register_driver():
Steve Wise for CXBG3 & CXBG4
Mike Marciniszyn & Dennis Dalessandro for HF1 and IPATH/QIB
Lijun Ou & Wei Hu(Xavier) for HNS
Tatyana Nikolova for I40IW and NES
Vladimir Sokolovsky, Moni Shoua and Yishai Hadas for MTHCA and RXE
Devesh Sharma for OCRDMA
Ram Amrani & Ariel Elior for QEDR
Adit Ranadive for VMW_PVDRMA


PS: I'll present the verbs hotplug idea in tomorrow's OFVWG call.
See Liran's notice: http://marc.info/?l=linux-rdma&m=148821532209342&w=2

Cheers,
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]         ` <CAFgAxU9XJTXkaeL_VE7zHASPBM+j=TZd2L+McKBcJRThJtUN5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-27 22:36           ` Jason Gunthorpe
       [not found]             ` <20170227223600.GA1526-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 22:36 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: Mike Marciniszyn, Steve Wise, Lijun Ou, M: Wei Hu(Xavier),
	Dennis Dalessandro, Tatyana Nikolova, Vladimir Sokolovsky,
	Yishai Hadas, Devesh Sharma, Ram Amrani, Ariel Elior, Moni Shoua,
	Adit Ranadive, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Feb 28, 2017 at 12:11:58AM +0200, Alex Rosenbaum wrote:

> This is definitely one approach I was thinking about as well. It will
> surely help clean some of the older, not extendible API's.
> I'd like to call out to all provide library maintainer. if you have
> started code porting from the old ibv_register_driver() to the new
> extendible verbs_register_driver() then please move forward with
> submitting it. Once using the extendible 'struct verbs_device' it will
> be easy to add the new unint_device() functionality to all provided
> library in order to support plug-in for all.

Someone is going to have to take this on..

Conceptually it isn't hard to do, but I figured it was too complicated
to undertake without the ability to test.

Maybe you could help out by providing a bunch of information how to do
the conversion and show a patch for something like RXE.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]             ` <20170227223600.GA1526-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-28  7:56               ` Alex Rosenbaum
       [not found]                 ` <CAFgAxU8KkrOS6aib4ykf8vj1sQFGK_oMW=fX3vZ2X2_r3ryDVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-02-28  7:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mike Marciniszyn, Steve Wise, Lijun Ou, M: Wei Hu(Xavier),
	Dennis Dalessandro, Tatyana Nikolova, Vladimir Sokolovsky,
	Yishai Hadas, Devesh Sharma, Ram Amrani, Ariel Elior, Moni Shoua,
	Adit Ranadive, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Feb 28, 2017 at 12:36 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 12:11:58AM +0200, Alex Rosenbaum wrote:
>
>> This is definitely one approach I was thinking about as well. It will
>> surely help clean some of the older, not extendible API's.
>> I'd like to call out to all provide library maintainer. if you have
>> started code porting from the old ibv_register_driver() to the new
>> extendible verbs_register_driver() then please move forward with
>> submitting it. Once using the extendible 'struct verbs_device' it will
>> be easy to add the new unint_device() functionality to all provided
>> library in order to support plug-in for all.
>
> Someone is going to have to take this on..
>
> Conceptually it isn't hard to do, but I figured it was too complicated
> to undertake without the ability to test.
>
> Maybe you could help out by providing a bunch of information how to do
> the conversion and show a patch for something like RXE.
>
> Jason

We can definitely portal all providers (some extra work) but as you
wrote, i'm also not sure how to proceed with code review and
testing...? we'll surely need the maintainers ACK.
If not, we can proceed with my initial suggestion where libraries that
don't want this new HotPlug support will not change there code, and
libibverbs will manage the old and new drivers list.

Anyway, I'll take it with MoshS for RXE and we'll provide an example
of porting a provider library to the new extendible
verbs_register_driver().

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found] ` <CAFgAxU9eQYwBad2+SJw1x+cQMMqzgkz0fpeumG7Rx=o_eQqo2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-27 18:59   ` Jason Gunthorpe
@ 2017-02-28 15:48   ` Doug Ledford
       [not found]     ` <1488296882.86943.214.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-03-01  0:10   ` ira.weiny
  2017-03-02 16:08   ` Hefty, Sean
  3 siblings, 1 reply; 32+ messages in thread
From: Doug Ledford @ 2017-02-28 15:48 UTC (permalink / raw)
  To: Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

[-- Attachment #1: Type: text/plain, Size: 7064 bytes --]

On Mon, 2017-02-27 at 18:08 +0200, Alex Rosenbaum wrote:
> 
> 
> Today the IB device list is returned by ibv_get_device_list() from
> libibverbs.
> 
> This IB device list is created by scanning of the
> /sys/class/infiniband_verbs
> 
> [2]. The list is cached and never updated, no matter if there were
> hardware
> 
> changes in the system.

Correct.  This is something I've felt we've needed to update/fix for a
while now.

> Detection of hotplug events is not part of libibverbs functionality
> and thus
> 
> not part of this RFC. User space applications should monitor
> respectful changes
> 
> in the system according to its specific logic to detect plugout or
> plugin of
> 
> hardware devices. This can be does by means of: netlink, udev or
> other inputs.

On this point I disagree.  I see no reason that we should implement
hotplug detection in every application when the library can get it
right just once and make it available to everyone.

> Here I suggest to modify the implementation of ibv_get_device_list()
> so that
> 
> consecutive calls will re-scan the sysfs in the same manner as done
> today in
> 
> order to create a fresh ibv_device list each time. We will remove
> caching of
> 
> devices that support plugout, while keeping the ibv_device cache for
> devices
> 
> which do not support plugout.

I would argue that a better solution is to, on first call into
ibv_get_device_list() (which doubles as the library's init() entry), we
scan the system for whatever devices there are, build a permanent list
of these devices, then register the necessary hooks ourselves to get
updates on both plug out and plug in, and on those events we update the
list as appropriate (on plug out, notify the application via async
event, after the application has processed the event, if the device is
unused, remove the device entirely, if the device is still in use, then
move it to a defunct list and wait for the application to be able to
release all of its references to it, and only then remove it; on plug
in, add a new device to the device list and notify the application via
async event at which point the application can decide if it wants to
use the device).  This might actually involve a new async notifier
added to the libibverbs API that is not tied to a device and which an
application would need to be modified to register with the library.
 But since any applicaiton wanting to support this is going to need to
be modified anyway, a new entry point isn't the end of the world.

> 
> Applications Behavior
> 
> -------------------------------------------------------------------
> -------------
> 
> Applications use different logic to decide which ibv_device is the
> relevant
> 
> device they want to use. And each application has its own detection
> logic to
> 
> track such changes in device availability.
> 
> Few examples: librdmacm logic is based on GUID values. Socket
> acceleration
> 
> (libvma) maps an IB device to its corresponding net iface based on
> netlink and
> 
> sysfs. DPDK applications lookup the IB device PCI address. And most
> MPI
> 
> implementation want human specified IB device name in command line
> and will
> 
> probably not handle any hotplug (out or in) events.
> 
> 
> 
> It is the application's responsibility to check which ibv_device
> returned from
> 
> ibv_get_device_list() has changed from previous scan and which is of
> interest.

With an async event notifier, and with a static list of devices, it
would be possible to simply pass the new device in with the notifier.
 Makes things much simpler for the app IMO.

> 
> 
> Verbs can issue an IBV_EVENT_DEVICE_FATAL async event on an open user
> space
> 
> ibv_context for device's which support the ib_device-
> >disassociate_ucontext().
> 
> This event will indicate to the application that the device is no
> longer
> 
> operational. In addition, user space CQ channel fd’s blocking calls
> on recv(),
> 
> select(), poll() or epoll() will be released with EINTR errno.
> 
> 
> 
> Typical user space application will monitor hardware changes and/or
> call for
> 
> ibv_get_device_list() only from control path dedicated thread, and
> not from the
> 
> fast path threads.
> 
> 
> 
> Pitfall
> 
> -------------------------------------------------------------------
> -------------
> 
> If a legacy user space application did not follow the
> ibv_get_device_list()
> 
> man page definition, and it saved a private copy of an ibv_device
> pointer and
> 
> used it after releasing the device list (call to
> ibv_free_device_list()), then
> 
> ibv_open_device() might seg-fault based on this new suggestion.
> 
> We can work around this by moving the IB device re-scan logic to a
> new API
> 
> 'ibv_refresh_device_list()' so that only new application using this
> API will
> 
> have correct behavior as needed.

Following what I wrote above, this would no longer be a problem.  The
call to ibv_get_device_list can simply return the existing list (on
first call we init this list), and we can make ibv_free_device_list a
no-op.  The async events can notify of adds/removes, user can update
their own list accordingly, but we would also protect ourselves in all
of the ibv_ calls such that if the user space code calls into our code
with a defunct device we don't crash.  We might want to protect all of
our calls, even hot path calls, from invalid ibv_context pointers now
as a result of this support.


> 
> 
> Reference
> 
> -------------------------------------------------------------------
> -------------
> 
> [1] https://www.kernel.org/doc/pending/hotplug.txt
> 
> [2] https://github.com/linux-rdma/rdma-core/blob/master/Documentation
> /libibverbs.md
> 
> 
> 
> API changes
> 
> -------------------------------------------------------------------
> -------------
> 
> Signed-off-by: Alex Rosenbaum <alexr@xxxxxxxxxxxx>
> 
> --- a/libibverbs/verbs.h
> 
> +++ b/libibverbs/verbs.h
> 
> @@ -1336,6 +1336,7 @@ struct verbs_device {
> 
>                                 struct ibv_context *ctx, int cmd_fd);
> 
>         void    (*uninit_context)(struct verbs_device *device,
> 
>                                 struct ibv_context *ctx);
> 
> +       void    (*uninit_device)(struct verbs_device *device);
> 
> +       atomic_t         ref_count;
> 
>         /* future fields added here */
> 
> };
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" 
> in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC] libibverbs IB device hotplug support
       [not found]     ` <1488296882.86943.214.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-02-28 16:32       ` Liran Liss
       [not found]         ` <HE1PR0501MB28123E8F342607E2C2CA4DD9B1560-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liran Liss @ 2017-02-28 16:32 UTC (permalink / raw)
  To: Doug Ledford, Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Doug Ledford

> > not part of this RFC. User space applications should monitor
> > respectful changes
> >
> > in the system according to its specific logic to detect plugout or
> > plugin of
> >
> > hardware devices. This can be does by means of: netlink, udev or other
> > inputs.
> 
> On this point I disagree.  I see no reason that we should implement hotplug
> detection in every application when the library can get it right just once and
> make it available to everyone.
> 

This is definitely something we should consider doing, but orthogonal to
obtaining an up-to-date device list.

> > Here I suggest to modify the implementation of ibv_get_device_list()
> > so that
> >
> > consecutive calls will re-scan the sysfs in the same manner as done
> > today in
> >
> > order to create a fresh ibv_device list each time. We will remove
> > caching of
> >
> > devices that support plugout, while keeping the ibv_device cache for
> > devices
> >
> > which do not support plugout.
> 
> I would argue that a better solution is to, on first call into
> ibv_get_device_list() (which doubles as the library's init() entry), we scan the
> system for whatever devices there are, build a permanent list of these devices,
> then register the necessary hooks ourselves to get updates on both plug out and
> plug in, and on those events we update the list as appropriate (on plug out,
> notify the application via async event, after the application has processed the
> event, if the device is unused, remove the device entirely, if the device is still in
> use, then move it to a defunct list and wait for the application to be able to
> release all of its references to it, and only then remove it; on plug in, add a new
> device to the device list and notify the application via async event at which point
> the application can decide if it wants to use the device).  This might actually
> involve a new async notifier added to the libibverbs API that is not tied to a
> device and which an application would need to be modified to register with the
> library.
>  But since any applicaiton wanting to support this is going to need to be modified
> anyway, a new entry point isn't the end of the world.
> 

I don't think that we should introduce an asych context into libibverbs.
Also, not all apps would like to process such asynch events - some merely want to
use the RDMA subsystem at some time and get an up-to-date device list.

In the future, I would introduce some ibv_create_notifier_channel(), which
apps can ibv_get_notifier_event() from it that informs that the device list has changed...

This is a simpler API that maintains backward compatibility with ibv_get_device_list().
--Liran

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                 ` <CAFgAxU8KkrOS6aib4ykf8vj1sQFGK_oMW=fX3vZ2X2_r3ryDVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 21:26                   ` Jason Gunthorpe
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2017-02-28 21:26 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: Mike Marciniszyn, Steve Wise, Lijun Ou, M: Wei Hu(Xavier),
	Dennis Dalessandro, Tatyana Nikolova, Vladimir Sokolovsky,
	Yishai Hadas, Devesh Sharma, Ram Amrani, Ariel Elior, Moni Shoua,
	Adit Ranadive, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Feb 28, 2017 at 09:56:57AM +0200, Alex Rosenbaum wrote:
> We can definitely portal all providers (some extra work) but as you
> wrote, i'm also not sure how to proceed with code review and

It turns out this is super simple and quite mechanical as long as you
don't try and convert to the new context protocol too.

https://github.com/linux-rdma/rdma-core/pull/87

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]         ` <HE1PR0501MB28123E8F342607E2C2CA4DD9B1560-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-02-28 21:47           ` Doug Ledford
       [not found]             ` <3389d831-c135-b326-4b96-5f2a746de5ac-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Doug Ledford @ 2017-02-28 21:47 UTC (permalink / raw)
  To: Liran Liss, Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas


[-- Attachment #1.1: Type: text/plain, Size: 4207 bytes --]

On 2/28/2017 11:32 AM, Liran Liss wrote:
>> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
>> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Doug Ledford
> 
>>> not part of this RFC. User space applications should monitor
>>> respectful changes
>>>
>>> in the system according to its specific logic to detect plugout or
>>> plugin of
>>>
>>> hardware devices. This can be does by means of: netlink, udev or other
>>> inputs.
>>
>> On this point I disagree.  I see no reason that we should implement hotplug
>> detection in every application when the library can get it right just once and
>> make it available to everyone.
>>
> 
> This is definitely something we should consider doing, but orthogonal to
> obtaining an up-to-date device list.

Not really.  Half of the logic in the original RFC can be considered
hacks to work around not doing this properly in the first place.

>>> Here I suggest to modify the implementation of ibv_get_device_list()
>>> so that
>>>
>>> consecutive calls will re-scan the sysfs in the same manner as done
>>> today in
>>>
>>> order to create a fresh ibv_device list each time. We will remove
>>> caching of
>>>
>>> devices that support plugout, while keeping the ibv_device cache for
>>> devices
>>>
>>> which do not support plugout.
>>
>> I would argue that a better solution is to, on first call into
>> ibv_get_device_list() (which doubles as the library's init() entry), we scan the
>> system for whatever devices there are, build a permanent list of these devices,
>> then register the necessary hooks ourselves to get updates on both plug out and
>> plug in, and on those events we update the list as appropriate (on plug out,
>> notify the application via async event, after the application has processed the
>> event, if the device is unused, remove the device entirely, if the device is still in
>> use, then move it to a defunct list and wait for the application to be able to
>> release all of its references to it, and only then remove it; on plug in, add a new
>> device to the device list and notify the application via async event at which point
>> the application can decide if it wants to use the device).  This might actually
>> involve a new async notifier added to the libibverbs API that is not tied to a
>> device and which an application would need to be modified to register with the
>> library.
>>  But since any applicaiton wanting to support this is going to need to be modified
>> anyway, a new entry point isn't the end of the world.
>>
> 
> I don't think that we should introduce an asych context into libibverbs.

Why not?

> Also, not all apps would like to process such asynch events - some merely want to
> use the RDMA subsystem at some time and get an up-to-date device list.

They don't have to.  If they want to register an event handler and get
notified, then they can.  If they don't want to, then they don't need
to.  Behind the scenes, if libibverbs simply does the right thing, then
an app that gets async events will work fine, and an app that wants to
simply call ibv_get_device_list and scan for new devices also works fine.

> 
> In the future, I would introduce some ibv_create_notifier_channel(), which
> apps can ibv_get_notifier_event() from it that informs that the device list has changed...
> 
> This is a simpler API that maintains backward compatibility with ibv_get_device_list().

I couldn't care less that this is simpler.  If it means maintaining two
implementations, or implementation hacks instead of a single proper
implementation, then it's not worth it.  Just do it right the first
time, especially since we aren't talking about some huge implementation
that would take months to get correct.  Taking baby steps and going one
piece at a time is, IMO, reserved for trickier things to implement than
this.  This is straight forward enough to go from start to final
implementation in one shot.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found] ` <CAFgAxU9eQYwBad2+SJw1x+cQMMqzgkz0fpeumG7Rx=o_eQqo2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-27 18:59   ` Jason Gunthorpe
  2017-02-28 15:48   ` Doug Ledford
@ 2017-03-01  0:10   ` ira.weiny
       [not found]     ` <20170301001023.GA3001-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  2017-03-02 16:08   ` Hefty, Sean
  3 siblings, 1 reply; 32+ messages in thread
From: ira.weiny @ 2017-03-01  0:10 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Mon, Feb 27, 2017 at 06:08:41PM +0200, Alex Rosenbaum wrote:
> Introduction
> 
> --------------------------------------------------------------------------------
> 
> Hotplug [1] is the method in which new hardware becomes available in the system
> 
> or is being removed from it. User space applications would like to continue
> 
> operation while hardware is being changed in the system, without the need to
> 
> restart the process, lose its current state and lose open sessions (at least on
> 
> other available hardware).
> 
> 
> 
> The problem
> 
> --------------------------------------------------------------------------------
> 
> Today the IB device list is returned by ibv_get_device_list() from libibverbs.
> 
> This IB device list is created by scanning of the /sys/class/infiniband_verbs
> 
> [2]. The list is cached and never updated, no matter if there were hardware
> 
> changes in the system.


After the call it occurred to me that we may be missing a bigger picture with
this discussion.  In the past we have discussed in the past what exactly a
"struct ib_device" is within the context of the kernel.

Should we also consider "port hotplug"?

For example what happens if a Mellanox port is unplugged from an IB port and
plugged into an Ethernet port?  Is that considered a "device" hot plug?  What
happens if this is a dual port device and only one of the 2 IB ports are
changed?  So many things like AH, Path Records, etc are tied to the "device"
but really they should be tied to the port.

To be clear I am not opposed to this change at this time.  But I think we
should consider shifting the model away from devices and toward ports as we
continue.

Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]             ` <3389d831-c135-b326-4b96-5f2a746de5ac-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-01  0:44               ` Jason Gunthorpe
       [not found]                 ` <20170301004449.GA13114-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2017-03-01  0:44 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Liran Liss, Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

On Tue, Feb 28, 2017 at 04:47:05PM -0500, Doug Ledford wrote:

> > I don't think that we should introduce an asych context into libibverbs.
> 
> Why not?

Generally, I dislike the idea of running threads from libraries,
particularly libraries like ibverbs. So many apps get no benefit from
the thread, but it sits there connected to udev..

But aside from that, threads create races, and in this case if the
hidden thread has to run to get the up to date device list, then the
application has to synchronize with it to get the up to date
information.

This is practically problematic because one of the areas we need to
fix is rdma-cm which is already getting a kind of new-device
notification from kernel messages that are not synchronized with udev
or any other thread.

In this case I think the simple solution is the most correct one - run
readdir on every call to ibv_get_devices and do set-interesection with
the internal list before returning the result. Easy, simple locking,
no races to worry about.

We probably also need to add some retries for opening /dev/uverbsXX
based on sysfs to really solve all the races...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC] libibverbs IB device hotplug support
       [not found]                 ` <20170301004449.GA13114-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-01 18:00                   ` Hefty, Sean
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373AB0ECDC6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Hefty, Sean @ 2017-03-01 18:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Liran Liss, Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

> > > I don't think that we should introduce an asych context into
> libibverbs.
> >
> > Why not?
> 
> Generally, I dislike the idea of running threads from libraries,
> particularly libraries like ibverbs. So many apps get no benefit from
> the thread, but it sits there connected to udev..

I thought Doug was referring to reporting the device add/remove event through some event reporting interface, like what the librdmacm already does.  So no threads for that are needed.  IMO, this makes sense.

I didn't follow his idea for how the device list would be updated and kept in sync between the app and the library.

If you had device add/remove events, the get_event call could intercept that event and update the device list there.  But I don't know that you try to sync a shared list between the library and the app.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373AB0ECDC6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-03-01 18:07                       ` Jason Gunthorpe
  2017-03-01 21:47                       ` Alex Rosenbaum
  2017-03-01 21:56                       ` Doug Ledford
  2 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2017-03-01 18:07 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Doug Ledford, Liran Liss, Alex Rosenbaum,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Wed, Mar 01, 2017 at 06:00:09PM +0000, Hefty, Sean wrote:

> If you had device add/remove events, the get_event call could
> intercept that event and update the device list there.  But I don't
> know that you try to sync a shared list between the library and the
> app.

That is the same basic race I outlined, just without the thread.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]     ` <20170301001023.GA3001-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2017-03-01 21:21       ` Alex Rosenbaum
       [not found]         ` <CAFgAxU-Vf6RLEo8N=oya0X+WbneRHE1xa_-5GY+TEqo1wyzpow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01 21:57       ` Alex Rosenbaum
  1 sibling, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 21:21 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Wed, Mar 1, 2017 at 2:10 AM, ira.weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> After the call it occurred to me that we may be missing a bigger picture with
> this discussion.  In the past we have discussed in the past what exactly a
> "struct ib_device" is within the context of the kernel.
>
> Should we also consider "port hotplug"?
>
> For example what happens if a Mellanox port is unplugged from an IB port and
> plugged into an Ethernet port?  Is that considered a "device" hot plug?

Yes, this fis a 'device' hotplug.
Reconfiguring a ConnectX3 port from one link layer to another requires
full device plugout & plugin. And this needs to be done by the sys
admin.


> What happens if this is a dual port device and only one of the 2 IB ports are
> changed?  So many things like AH, Path Records, etc are tied to the "device"
> but really they should be tied to the port.

The newer ConnectX4 (and forward) devices expose a pci function per
port, so reconfiguring a port link layer will hotplug only that pci
function.

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373AB0ECDC6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2017-03-01 18:07                       ` Jason Gunthorpe
@ 2017-03-01 21:47                       ` Alex Rosenbaum
       [not found]                         ` <CAFgAxU9vObaW4O+byEJ5pV1Ofou4cd05HHWWPC7iTJshyk+LdA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01 21:56                       ` Doug Ledford
  2 siblings, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 21:47 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jason Gunthorpe, Doug Ledford, Liran Liss,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Wed, Mar 1, 2017 at 8:00 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> > > I don't think that we should introduce an asych context into
>> libibverbs.
>> >
>> > Why not?
>>
>> Generally, I dislike the idea of running threads from libraries,
>> particularly libraries like ibverbs. So many apps get no benefit from
>> the thread, but it sits there connected to udev..
>
> I thought Doug was referring to reporting the device add/remove event through some event reporting interface, like what the librdmacm already does.  So no threads for that are needed.  IMO, this makes sense.
>
> I didn't follow his idea for how the device list would be updated and kept in sync between the app and the library.
>
> If you had device add/remove events, the get_event call could intercept that event and update the device list there.  But I don't know that you try to sync a shared list between the library and the app.

In order to provide 'out-of-ibv_context' libibverbs events we'll need
a new interface, something like ibv_get_system_event(), to report
device changes.
Every time the application calls this API, libibverbs must process all
events in queue in order to have an up to date ibv_device list with
they way it appears in sysfs.
Then, do we force all these popped events on the user? or can we force
him to loop on ibv_get_system_event() until EAGAIN so we're sure the
device list is updated?
Is this hiding yet another fd for the application to block on?

Verbs is built around a cmd_fd per ibv_context. Reporting of new
devices hotplug needs a separate channel. It does not seem correct to
add yet another such cdev to verbs.

I agree the adding this hotplug report capabilities into librdmacm
sounds more appropriate.
But libibvers will still need to provide the latest ibv_device list
snapshot as it appears in sysfs.

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373AB0ECDC6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2017-03-01 18:07                       ` Jason Gunthorpe
  2017-03-01 21:47                       ` Alex Rosenbaum
@ 2017-03-01 21:56                       ` Doug Ledford
       [not found]                         ` <cad3d2c6-0c98-9f58-a3bb-81823103bd76-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 32+ messages in thread
From: Doug Ledford @ 2017-03-01 21:56 UTC (permalink / raw)
  To: Hefty, Sean, Jason Gunthorpe
  Cc: Liran Liss, Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas


[-- Attachment #1.1: Type: text/plain, Size: 2844 bytes --]

On 3/1/2017 1:00 PM, Hefty, Sean wrote:
>>>> I don't think that we should introduce an asych context into
>> libibverbs.
>>> 
>>> Why not?
>> 
>> Generally, I dislike the idea of running threads from libraries, 
>> particularly libraries like ibverbs. So many apps get no benefit
>> from the thread, but it sits there connected to udev..
> 
> I thought Doug was referring to reporting the device add/remove event
> through some event reporting interface, like what the librdmacm
> already does.  So no threads for that are needed.  IMO, this makes
> sense.

It can be done without a thread, that's not to say I would necessarily
do so myself.

> I didn't follow his idea for how the device list would be updated and
> kept in sync between the app and the library.

There is no requirement that the device list be kept in sync between the
library and the app, period.

There are three usages for what I suggested:

1)  The app gets a device list initially just like it does now.  If the
app never updates that list again, so be it.  This is 100% backward
compatible.

2)  The simplified version of hotplug/unplug that has been discussed:
the app calls ibv_get_device_list.  As proposed by Alex, the library
would rescan the sysfs directory on this call.  If the device is there,
great.  If it isn't, great too.  Under my proposed implementation, the
exact same thing is true.  If the device is there, great.  If there
isn't a new device in the list already, then great too.  Because the
initial implementation proposal left if up to the app to determine
how/when to check for a new device, there was never any guarantee that
whatever triggered the app to look for the new device meant that it was
ready to be used in sysfs and would be found by the re-reading of the
sysfs directory.  So whatever race might be present in my proposed
implementation would also be present in the original implementation.

3)  My proposed implementation where an entry point is added for async
events.  In this case, the library would not pass the event on up until
the device was ready to be used, so there is no race here.  And there is
minimal delay from detection to notification and ready for use.

> If you had device add/remove events, the get_event call could
> intercept that event and update the device list there.  But I don't
> know that you try to sync a shared list between the library and the
> app.

No, I wouldn't try that at all.  The inherently racy thing in all of
this is having the library be responsible for scanning a device, but not
responsible for detecting that there is a new device to be scanned.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]     ` <20170301001023.GA3001-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  2017-03-01 21:21       ` Alex Rosenbaum
@ 2017-03-01 21:57       ` Alex Rosenbaum
       [not found]         ` <CAFgAxU_i+5FtBkdf=t5vdQAvBqDt9aKsFJbAKGN5YWQQ9w-c2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 21:57 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Wed, Mar 1, 2017 at 2:10 AM, ira.weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Feb 27, 2017 at 06:08:41PM +0200, Alex Rosenbaum wrote:
>> Introduction
>>
>> --------------------------------------------------------------------------------
> After the call
>
> Ira
>

Ira,
On the call we discussed that today there is no 'incarnation' value
exposed for the uverbs%.
You suggested an optimization by using the cdev's file creation
timestamp as option for libibverbs to understand that the driver did
or did-not restart since mapping of that ibv_device.
Main pitfalls with this is if system time is being modified (admin or
ntp/ptp). In the rare case that we have a hoplug event and time is
rewinded we can accidentally reach a very hard debugging exercise.
What do you think?
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                         ` <CAFgAxU9vObaW4O+byEJ5pV1Ofou4cd05HHWWPC7iTJshyk+LdA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-01 22:07                           ` Doug Ledford
       [not found]                             ` <d92df6b7-207f-f3a3-8bf5-b12cffe20684-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-03-01 23:18                           ` Hefty, Sean
  1 sibling, 1 reply; 32+ messages in thread
From: Doug Ledford @ 2017-03-01 22:07 UTC (permalink / raw)
  To: Alex Rosenbaum, Hefty, Sean
  Cc: Jason Gunthorpe, Liran Liss, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas


[-- Attachment #1.1: Type: text/plain, Size: 3673 bytes --]

On 3/1/2017 4:47 PM, Alex Rosenbaum wrote:
> On Wed, Mar 1, 2017 at 8:00 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>>> I don't think that we should introduce an asych context into
>>> libibverbs.
>>>>
>>>> Why not?
>>>
>>> Generally, I dislike the idea of running threads from libraries,
>>> particularly libraries like ibverbs. So many apps get no benefit from
>>> the thread, but it sits there connected to udev..
>>
>> I thought Doug was referring to reporting the device add/remove event through some event reporting interface, like what the librdmacm already does.  So no threads for that are needed.  IMO, this makes sense.
>>
>> I didn't follow his idea for how the device list would be updated and kept in sync between the app and the library.
>>
>> If you had device add/remove events, the get_event call could intercept that event and update the device list there.  But I don't know that you try to sync a shared list between the library and the app.
> 
> In order to provide 'out-of-ibv_context' libibverbs events we'll need
> a new interface, something like ibv_get_system_event(), to report
> device changes.

Yep.

> Every time the application calls this API, libibverbs must process all
> events in queue in order to have an up to date ibv_device list with
> they way it appears in sysfs.

This is not necessarily true.  It is entirely possible that the library
has already processed the events themselves internally and the device
list is already consistent.  You're tying the act of the user calling
into the library with the library processing work that needs processed.
These two ought not be tied.

> Then, do we force all these popped events on the user? or can we force
> him to loop on ibv_get_system_event() until EAGAIN so we're sure the
> device list is updated?

If the user wants to drain that queue, then that would be just fine.  On
the other hand, as I wrote in my original implementation proposal, the
user could also simply call ibv_get_device_list and get the updated
list, which directly implies that the queue of events need not be popped
clean in order for the processing work to have already been done.

> Is this hiding yet another fd for the application to block on?

Personally I would use a thread in the library that would block on
inotify events from the sysfs directory.  Any time it got woke up, it
would process the remove/add event, update the device list, create a
notification if the app had registered a notification handler and queue
that to the app, then go back to sleep on the inotify events in the
sysfs dir.  The processing of the event and the notification would be
totally disjoint.  The app need not process events, even if it
registered and event handler, in order to get the new device list.

> Verbs is built around a cmd_fd per ibv_context. Reporting of new
> devices hotplug needs a separate channel. It does not seem correct to
> add yet another such cdev to verbs.

I would have no intention of adding a cdev or anything else kernel
related to this.  This can be done 100% in user space, and should be.
That also makes the change backward compatible with earlier kernels.

> I agree the adding this hotplug report capabilities into librdmacm
> sounds more appropriate.
> But libibvers will still need to provide the latest ibv_device list
> snapshot as it appears in sysfs.

Under my proposed implementation it would.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC] libibverbs IB device hotplug support
       [not found]         ` <CAFgAxU_i+5FtBkdf=t5vdQAvBqDt9aKsFJbAKGN5YWQQ9w-c2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-01 22:15           ` Weiny, Ira
       [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA68-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Weiny, Ira @ 2017-03-01 22:15 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas,
	'Jason Gunthorpe
	(jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org)'

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1126 bytes --]

> >>
> >> ---------------------------------------------------------------------
> >> -----------
> > After the call
> >
> > Ira
> >
> 
> Ira,
> On the call we discussed that today there is no 'incarnation' value exposed for
> the uverbs%.
> You suggested an optimization by using the cdev's file creation timestamp as
> option for libibverbs to understand that the driver did or did-not restart since
> mapping of that ibv_device.
> Main pitfalls with this is if system time is being modified (admin or ntp/ptp). In
> the rare case that we have a hoplug event and time is rewinded we can
> accidentally reach a very hard debugging exercise.
> What do you think?

It was actually Jason who suggested that.

Anyway I think the scenario you explain is probably a very rare case.  So I would not worry about it.

Going forward we could put an instance number in sysfs if we really are concerned about making this 100% robust.

But I'm not sure that is required.

Ira

N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±­ÙšŠ{ayº\x1dʇڙë,j\a­¢f£¢·hš‹»öì\x17/oSc¾™Ú³9˜uÀ¦æå‰È&jw¨®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þ–Šàþf£¢·hšˆ§~ˆmš

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC] libibverbs IB device hotplug support
       [not found]         ` <CAFgAxU-Vf6RLEo8N=oya0X+WbneRHE1xa_-5GY+TEqo1wyzpow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-01 22:17           ` Weiny, Ira
       [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Weiny, Ira @ 2017-03-01 22:17 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 598 bytes --]

> 
> > What happens if this is a dual port device and only one of the 2 IB
> > ports are changed?  So many things like AH, Path Records, etc are tied to the
> "device"
> > but really they should be tied to the port.
> 
> The newer ConnectX4 (and forward) devices expose a pci function per port, so
> reconfiguring a port link layer will hotplug only that pci function.
> 

Interesting...  Do you have a "struct ib_device" for each pci function?

Ira

N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±­ÙšŠ{ayº\x1dʇڙë,j\a­¢f£¢·hš‹»öì\x17/oSc¾™Ú³9˜uÀ¦æå‰È&jw¨®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þ–Šàþf£¢·hšˆ§~ˆmš

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                         ` <cad3d2c6-0c98-9f58-a3bb-81823103bd76-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-01 22:23                           ` Jason Gunthorpe
  2017-03-01 22:34                           ` Alex Rosenbaum
  1 sibling, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2017-03-01 22:23 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Hefty, Sean, Liran Liss, Alex Rosenbaum,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Wed, Mar 01, 2017 at 04:56:50PM -0500, Doug Ledford wrote:

> isn't a new device in the list already, then great too.  Because the
> initial implementation proposal left if up to the app to determine
> how/when to check for a new device, there was never any guarantee that
> whatever triggered the app to look for the new device meant that it was
> ready to be used in sysfs

No, that isn't true. The kernel guarentees that sysfs is fully setup
before it can deliver any sort of event, by any channel, to an
app. This includes rdma connection establish, uevents for udev,
netlink notifications/etc.

> and would be found by the re-reading of the sysfs directory.  So
> whatever race might be present in my proposed implementation would
> also be present in the original implementation.

Since the kernel strongly orders things, there is no race in Alex's
suggestion.

All apps using rdma-cm have to cope with hot plug transparently. An
app can 'all device' bind and get an incomming connection from the
kernel on a newly plugged device. This must work without having the
app do any new steps or introducing races between the rdma-cm channel
and some other channel hidden in verbs.

> Personally I would use a thread in the library that would block on
> inotify events from the sysfs directory.

sysfs does not support inotify.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                             ` <d92df6b7-207f-f3a3-8bf5-b12cffe20684-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-01 22:32                               ` Alex Rosenbaum
       [not found]                                 ` <CAFgAxU_-jbAoOR2XMYfSgbDMn7FnrthudZJLgeNNTzb9GEUXrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 22:32 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Hefty, Sean, Jason Gunthorpe, Liran Liss,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Thu, Mar 2, 2017 at 12:07 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 3/1/2017 4:47 PM, Alex Rosenbaum wrote:
>> On Wed, Mar 1, 2017 at 8:00 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> I don't think that we should introduce an asych context into
>>>> libibverbs.
>>>>>
>>>>> Why not?
>>>>
>>>> Generally, I dislike the idea of running threads from libraries,
>>>> particularly libraries like ibverbs. So many apps get no benefit from
>>>> the thread, but it sits there connected to udev..
>>>
>>> I thought Doug was referring to reporting the device add/remove event through some event reporting interface, like what the librdmacm already does.  So no threads for that are needed.  IMO, this makes sense.
>>>
>>> I didn't follow his idea for how the device list would be updated and kept in sync between the app and the library.
>>>
>>> If you had device add/remove events, the get_event call could intercept that event and update the device list there.  But I don't know that you try to sync a shared list between the library and the app.
>>
>> In order to provide 'out-of-ibv_context' libibverbs events we'll need
>> a new interface, something like ibv_get_system_event(), to report
>> device changes.
>
> Yep.
>
>> Every time the application calls this API, libibverbs must process all
>> events in queue in order to have an up to date ibv_device list with
>> they way it appears in sysfs.
>
> This is not necessarily true.  It is entirely possible that the library
> has already processed the events themselves internally and the device
> list is already consistent.  You're tying the act of the user calling
> into the library with the library processing work that needs processed.
> These two ought not be tied.
>
>> Then, do we force all these popped events on the user? or can we force
>> him to loop on ibv_get_system_event() until EAGAIN so we're sure the
>> device list is updated?
>
> If the user wants to drain that queue, then that would be just fine.  On
> the other hand, as I wrote in my original implementation proposal, the
> user could also simply call ibv_get_device_list and get the updated
> list, which directly implies that the queue of events need not be popped
> clean in order for the processing work to have already been done.
>
>> Is this hiding yet another fd for the application to block on?
>
> Personally I would use a thread in the library that would block on
> inotify events from the sysfs directory.  Any time it got woke up, it
> would process the remove/add event, update the device list, create a
> notification if the app had registered a notification handler and queue
> that to the app, then go back to sleep on the inotify events in the
> sysfs dir.  The processing of the event and the notification would be
> totally disjoint.  The app need not process events, even if it
> registered and event handler, in order to get the new device list.

I actually did look at inotify() for this problem and agree that this
is can be a good option.
There are some issues with it. You get a few more events then you
really need (some extra context switches). It does not map all sysfs
cleanly, there are some limitation. It will not wakeup on
/sys/class/infiniband_verbs/*, but will catch the
/dev/infiniband/uverbes% new instances.

Then there is a question of which context/thread? library vs applications

>
>> Verbs is built around a cmd_fd per ibv_context. Reporting of new
>> devices hotplug needs a separate channel. It does not seem correct to
>> add yet another such cdev to verbs.
>
> I would have no intention of adding a cdev or anything else kernel
> related to this.  This can be done 100% in user space, and should be.
> That also makes the change backward compatible with earlier kernels.
>
>> I agree the adding this hotplug report capabilities into librdmacm
>> sounds more appropriate.
>> But libibvers will still need to provide the latest ibv_device list
>> snapshot as it appears in sysfs.
>
> Under my proposed implementation it would.
>
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>     GPG Key ID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                         ` <cad3d2c6-0c98-9f58-a3bb-81823103bd76-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-03-01 22:23                           ` Jason Gunthorpe
@ 2017-03-01 22:34                           ` Alex Rosenbaum
  1 sibling, 0 replies; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 22:34 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Hefty, Sean, Jason Gunthorpe, Liran Liss,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Wed, Mar 1, 2017 at 11:56 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 3/1/2017 1:00 PM, Hefty, Sean wrote:
>>>>> I don't think that we should introduce an asych context into
>>> libibverbs.
>>>>
>>>> Why not?
>>>
>>> Generally, I dislike the idea of running threads from libraries,
>>> particularly libraries like ibverbs. So many apps get no benefit
>>> from the thread, but it sits there connected to udev..
>>
>> I thought Doug was referring to reporting the device add/remove event
>> through some event reporting interface, like what the librdmacm
>> already does.  So no threads for that are needed.  IMO, this makes
>> sense.
>
> It can be done without a thread, that's not to say I would necessarily
> do so myself.
>
>> I didn't follow his idea for how the device list would be updated and
>> kept in sync between the app and the library.
>
> There is no requirement that the device list be kept in sync between the
> library and the app, period.
>
> There are three usages for what I suggested:
>
> 1)  The app gets a device list initially just like it does now.  If the
> app never updates that list again, so be it.  This is 100% backward
> compatible.
>
> 2)  The simplified version of hotplug/unplug that has been discussed:
> the app calls ibv_get_device_list.  As proposed by Alex, the library
> would rescan the sysfs directory on this call.  If the device is there,
> great.  If it isn't, great too.  Under my proposed implementation, the
> exact same thing is true.  If the device is there, great.  If there
> isn't a new device in the list already, then great too.  Because the
> initial implementation proposal left if up to the app to determine
> how/when to check for a new device, there was never any guarantee that
> whatever triggered the app to look for the new device meant that it was
> ready to be used in sysfs and would be found by the re-reading of the
> sysfs directory.  So whatever race might be present in my proposed
> implementation would also be present in the original implementation.
>
> 3)  My proposed implementation where an entry point is added for async
> events.  In this case, the library would not pass the event on up until
> the device was ready to be used, so there is no race here.  And there is
> minimal delay from detection to notification and ready for use.
>
>> If you had device add/remove events, the get_event call could
>> intercept that event and update the device list there.  But I don't
>> know that you try to sync a shared list between the library and the
>> app.
>
> No, I wouldn't try that at all.  The inherently racy thing in all of
> this is having the library be responsible for scanning a device, but not
> responsible for detecting that there is a new device to be scanned.
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>     GPG Key ID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
>

Should we fully define the notification solution (#3 in your list)
before continuing with simple refresh (#2)?

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-03-01 22:38               ` Alex Rosenbaum
  0 siblings, 0 replies; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 22:38 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Thu, Mar 2, 2017 at 12:17 AM, Weiny, Ira <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>
>> > What happens if this is a dual port device and only one of the 2 IB
>> > ports are changed?  So many things like AH, Path Records, etc are tied to the
>> "device"
>> > but really they should be tied to the port.
>>
>> The newer ConnectX4 (and forward) devices expose a pci function per port, so
>> reconfiguring a port link layer will hotplug only that pci function.
>>
>
> Interesting...  Do you have a "struct ib_device" for each pci function?
>
> Ira
>

yes. that how mlx5 devices are mapped on the PCI bus. each will have a
separate function with single port.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA68-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-03-01 22:42               ` Alex Rosenbaum
  2017-03-01 23:07               ` Jason Gunthorpe
  1 sibling, 0 replies; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-01 22:42 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas,
	Jason Gunthorpe
	(jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org)

On Thu, Mar 2, 2017 at 12:15 AM, Weiny, Ira <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> >>
>> >> ---------------------------------------------------------------------
>> >> -----------
>> > After the call
>> >
>> > Ira
>> >
>>
>> Ira,
>> On the call we discussed that today there is no 'incarnation' value exposed for
>> the uverbs%.
>> You suggested an optimization by using the cdev's file creation timestamp as
>> option for libibverbs to understand that the driver did or did-not restart since
>> mapping of that ibv_device.
>> Main pitfalls with this is if system time is being modified (admin or ntp/ptp). In
>> the rare case that we have a hoplug event and time is rewinded we can
>> accidentally reach a very hard debugging exercise.
>> What do you think?
>
> It was actually Jason who suggested that.

redirecting the credit: "thanks Jason :)"
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA68-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2017-03-01 22:42               ` Alex Rosenbaum
@ 2017-03-01 23:07               ` Jason Gunthorpe
       [not found]                 ` <20170301230743.GC2820-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2017-03-01 23:07 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

On Wed, Mar 01, 2017 at 10:15:31PM +0000, Weiny, Ira wrote:
> > On the call we discussed that today there is no 'incarnation'
> > value exposed for the uverbs%.  You suggested an optimization by
> > using the cdev's file creation timestamp as option for libibverbs
> > to understand that the driver did or did-not restart since mapping
> > of that ibv_device.

Not the cdev, read the timestamp from /sys/class/infiniband/mlx4_0/hca_type

> > Main pitfalls with this is if system time is being modified (admin
> > or ntp/ptp). In the rare case that we have a hoplug event and time
> > is rewinded we can accidentally reach a very hard debugging
> > exercise.

You are only using it for direct equality comparision, so someone
would have to roll back the clock and hotplug the driver at the exact
same nano-second resolution timestamp. That is deeply unlikely to ever
happen by accident.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC] libibverbs IB device hotplug support
       [not found]                         ` <CAFgAxU9vObaW4O+byEJ5pV1Ofou4cd05HHWWPC7iTJshyk+LdA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01 22:07                           ` Doug Ledford
@ 2017-03-01 23:18                           ` Hefty, Sean
  1 sibling, 0 replies; 32+ messages in thread
From: Hefty, Sean @ 2017-03-01 23:18 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: Jason Gunthorpe, Doug Ledford, Liran Liss,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

> In order to provide 'out-of-ibv_context' libibverbs events we'll need
> a new interface, something like ibv_get_system_event(), to report
> device changes.
> Every time the application calls this API, libibverbs must process all
> events in queue in order to have an up to date ibv_device list with
> they way it appears in sysfs.

There is always a delay in the reporting of an event and the device list being updated, so I don't see a need to process all queued events.

In any case, the question is whether to report hot plug events and, if so, how.
 
> Then, do we force all these popped events on the user? or can we force
> him to loop on ibv_get_system_event() until EAGAIN so we're sure the
> device list is updated?
> Is this hiding yet another fd for the application to block on?

I would like the use of a single fd for all verbs/rdma cm events, rather than the multiplexing issue that exists today.  If we decouple the fd from the device, apps have the freedom to continue with the same usage model as today or use fewer system resources.

As a use case, it may be worth looking at how the librdmacm can make use of device insertion/removal.  It currently uses a statically built list of devices.  I don't believe new devices can be used, while software resources for removed devices are never completely released.

- Sean

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                                 ` <CAFgAxU_-jbAoOR2XMYfSgbDMn7FnrthudZJLgeNNTzb9GEUXrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-02 12:48                                   ` Doug Ledford
       [not found]                                     ` <467c0560-8de7-42e5-14b9-178c367874d2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Doug Ledford @ 2017-03-02 12:48 UTC (permalink / raw)
  To: Alex Rosenbaum
  Cc: Hefty, Sean, Jason Gunthorpe, Liran Liss,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas


[-- Attachment #1.1: Type: text/plain, Size: 2412 bytes --]

On 3/1/2017 5:32 PM, Alex Rosenbaum wrote:
> On Thu, Mar 2, 2017 at 12:07 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

>> Personally I would use a thread in the library that would block on
>> inotify events from the sysfs directory.  Any time it got woke up, it
>> would process the remove/add event, update the device list, create a
>> notification if the app had registered a notification handler and queue
>> that to the app, then go back to sleep on the inotify events in the
>> sysfs dir.  The processing of the event and the notification would be
>> totally disjoint.  The app need not process events, even if it
>> registered and event handler, in order to get the new device list.
> 
> I actually did look at inotify() for this problem and agree that this
> is can be a good option.
> There are some issues with it. You get a few more events then you
> really need (some extra context switches). It does not map all sysfs
> cleanly, there are some limitation. It will not wakeup on
> /sys/class/infiniband_verbs/*, but will catch the
> /dev/infiniband/uverbes% new instances.
> 
> Then there is a question of which context/thread? library vs applications

On further reflection, I think inotify is out of the question.  The
problem is that our initial setup is done via systemd on most modern
systems, and under systemd the rdma.init kernel setup script is a
oneshot script.  The udev rules for various rdma capable hardware all
call out for the rdma.init service to be run.  This is fine at bootup.
However, once we are up and running, I don't think a hotplug of any
module driver that services only the ethernet portion of a two part
driver (cxgb*/hns/bnxt) will trigger a load of the corresponding RDMA
portion of the driver.  I think we need to monitor udev events, we
specifically need to trap for all add events for class network, then
need to detect if the device added belongs to one of the known paired
ethernet/RDMA drivers, and if the RDMA module didn't get loaded, load
it.  That doesn't work if we just watch /dev/infiniband/uverbs* or
/sys/class/infiniband_verbs/* as these devices will never show up until
someone loads the proper module.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                 ` <20170301230743.GC2820-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-03-02 13:22                   ` Leon Romanovsky
  0 siblings, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2017-03-02 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Weiny, Ira, Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Yishai Hadas

[-- Attachment #1: Type: text/plain, Size: 1096 bytes --]

On Wed, Mar 01, 2017 at 04:07:43PM -0700, Jason Gunthorpe wrote:
> On Wed, Mar 01, 2017 at 10:15:31PM +0000, Weiny, Ira wrote:
> > > On the call we discussed that today there is no 'incarnation'
> > > value exposed for the uverbs%.  You suggested an optimization by
> > > using the cdev's file creation timestamp as option for libibverbs
> > > to understand that the driver did or did-not restart since mapping
> > > of that ibv_device.
>
> Not the cdev, read the timestamp from /sys/class/infiniband/mlx4_0/hca_type
>
> > > Main pitfalls with this is if system time is being modified (admin
> > > or ntp/ptp). In the rare case that we have a hoplug event and time
> > > is rewinded we can accidentally reach a very hard debugging
> > > exercise.
>
> You are only using it for direct equality comparision, so someone
> would have to roll back the clock and hotplug the driver at the exact
> same nano-second resolution timestamp. That is deeply unlikely to ever
> happen by accident.

Jason,

I think that we are over engineering here, but it can occur with NTP clock hopping.

Thanks

>
> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC] libibverbs IB device hotplug support
       [not found] ` <CAFgAxU9eQYwBad2+SJw1x+cQMMqzgkz0fpeumG7Rx=o_eQqo2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-03-01  0:10   ` ira.weiny
@ 2017-03-02 16:08   ` Hefty, Sean
       [not found]     ` <1828884A29C6694DAF28B7E6B8A82373AB0ED3C6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  3 siblings, 1 reply; 32+ messages in thread
From: Hefty, Sean @ 2017-03-02 16:08 UTC (permalink / raw)
  To: Alex Rosenbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: leon-DgEjT+Ai2ygdnm+yROfE0A, Yishai Hadas

> Here I suggest to modify the implementation of ibv_get_device_list() so
> that
> consecutive calls will re-scan the sysfs in the same manner as done
> today in
> order to create a fresh ibv_device list each time. We will remove
> caching of
> devices that support plugout, while keeping the ibv_device cache for
> devices
> which do not support plugout.
> 
> For this purpose, the ibv_get_device_list() device scanning logic
> should be
> separated from the libibverbs singleton initialization step.
> User can call ibv_open_device() while holding this list (see man pages)
> and
> once ibv_free_device_list() is called libibverbs can release the unused
> ibv_device objects. Later, on calls to ibv_close_device(), additional
> ibv_device object should be released. Currently, on
> ibv_free_device_list(),
> only the array is freed, while the ibv_device objects are never freed.
> libibverbs will maintain a ref_count for each verbs_device object.
> Increase
> verbs_device->ref_count for every ibv_get_device_list() or
> ibv_open_device().
> Decrease it for every ibv_free_device_list() or ibv_close_device().
> On decrease, if ref_count tested to be zero, libibverbs will call the
> provider
> library to release the 'strcut verbs_device' which it allocated.
> Each provider library should provide a function to release the
> verbs_device
> 
> object: 'uninit_device(struct verbs_device* device)'.

Based on the discussion so far, is there any disagreement with just the above functionality?  Ignoring the event reporting aspect, the above makes sense to me as a reasonable first step.

- Sean

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]                                     ` <467c0560-8de7-42e5-14b9-178c367874d2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-02 16:53                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2017-03-02 16:53 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Alex Rosenbaum, Hefty, Sean, Liran Liss,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Thu, Mar 02, 2017 at 07:48:55AM -0500, Doug Ledford wrote:

> oneshot script.  The udev rules for various rdma capable hardware all
> call out for the rdma.init service to be run.  This is fine at
> bootup.

Yep, but that could be fixed in a fairly straightfoward way:

SUBSYSTEM=="module", KERNEL=="cxgb*", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma_load_kmod@%k.service"

eg run a one shot instance of a systemd template that loads the extra
modules needed by the core module.

> However, once we are up and running, I don't think a hotplug of any
> module driver that services only the ethernet portion of a two part
> driver (cxgb*/hns/bnxt) will trigger a load of the corresponding RDMA
> portion of the driver.

I continue to belive we need to fix this in the kernel, the two part
drivers should auto-load their RDMA sides..

Not sure what this has to do with verbs though...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] libibverbs IB device hotplug support
       [not found]     ` <1828884A29C6694DAF28B7E6B8A82373AB0ED3C6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-03-03 12:31       ` Alex Rosenbaum
  0 siblings, 0 replies; 32+ messages in thread
From: Alex Rosenbaum @ 2017-03-03 12:31 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, leon-DgEjT+Ai2ygdnm+yROfE0A,
	Yishai Hadas

On Thu, Mar 2, 2017 at 6:08 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> Here I suggest to modify the implementation of ibv_get_device_list() so
>> that
>> consecutive calls will re-scan the sysfs in the same manner as done
>> today in
>> order to create a fresh ibv_device list each time. We will remove
>> caching of
>> devices that support plugout, while keeping the ibv_device cache for
>> devices
>> which do not support plugout.
>>
>> For this purpose, the ibv_get_device_list() device scanning logic
>> should be
>> separated from the libibverbs singleton initialization step.
>> User can call ibv_open_device() while holding this list (see man pages)
>> and
>> once ibv_free_device_list() is called libibverbs can release the unused
>> ibv_device objects. Later, on calls to ibv_close_device(), additional
>> ibv_device object should be released. Currently, on
>> ibv_free_device_list(),
>> only the array is freed, while the ibv_device objects are never freed.
>> libibverbs will maintain a ref_count for each verbs_device object.
>> Increase
>> verbs_device->ref_count for every ibv_get_device_list() or
>> ibv_open_device().
>> Decrease it for every ibv_free_device_list() or ibv_close_device().
>> On decrease, if ref_count tested to be zero, libibverbs will call the
>> provider
>> library to release the 'strcut verbs_device' which it allocated.
>> Each provider library should provide a function to release the
>> verbs_device
>>
>> object: 'uninit_device(struct verbs_device* device)'.
>
> Based on the discussion so far, is there any disagreement with just the above functionality?  Ignoring the event reporting aspect, the above makes sense to me as a reasonable first step.
>
> - Sean

Summarize the basic hotplug functionality part we discussed so far:
1. calling ibv_get_device_list() will rescan sysfs and provide an
updated list of ibv_device's
2. optimize by identifying same device incarnation with timestamp
creation of /sys/class/infiniband/%ibdev%/hca_type
3. release struct verbs_device according to usage ref_count
4. implement 'uninit_device(struct verbs_device* device)' for all
provider libraries to release plugout devices

NOTE: consolidating all providers libraries to use the extendible
struct verbs_device and move it into driver private area already
merged by https://github.com/linux-rdma/rdma-core/pull/87

We'll prepare a PR with the above functionality for all providers.
And come back with a separate suggestion for hotplug reporting mechanism.

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-03-03 12:31 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-27 16:08 [RFC] libibverbs IB device hotplug support Alex Rosenbaum
     [not found] ` <CAFgAxU9eQYwBad2+SJw1x+cQMMqzgkz0fpeumG7Rx=o_eQqo2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-27 18:59   ` Jason Gunthorpe
     [not found]     ` <20170227185912.GM5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-27 22:11       ` Alex Rosenbaum
     [not found]         ` <CAFgAxU9XJTXkaeL_VE7zHASPBM+j=TZd2L+McKBcJRThJtUN5w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-27 22:36           ` Jason Gunthorpe
     [not found]             ` <20170227223600.GA1526-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-28  7:56               ` Alex Rosenbaum
     [not found]                 ` <CAFgAxU8KkrOS6aib4ykf8vj1sQFGK_oMW=fX3vZ2X2_r3ryDVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 21:26                   ` Jason Gunthorpe
2017-02-28 15:48   ` Doug Ledford
     [not found]     ` <1488296882.86943.214.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-28 16:32       ` Liran Liss
     [not found]         ` <HE1PR0501MB28123E8F342607E2C2CA4DD9B1560-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-02-28 21:47           ` Doug Ledford
     [not found]             ` <3389d831-c135-b326-4b96-5f2a746de5ac-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-01  0:44               ` Jason Gunthorpe
     [not found]                 ` <20170301004449.GA13114-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-01 18:00                   ` Hefty, Sean
     [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373AB0ECDC6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-03-01 18:07                       ` Jason Gunthorpe
2017-03-01 21:47                       ` Alex Rosenbaum
     [not found]                         ` <CAFgAxU9vObaW4O+byEJ5pV1Ofou4cd05HHWWPC7iTJshyk+LdA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01 22:07                           ` Doug Ledford
     [not found]                             ` <d92df6b7-207f-f3a3-8bf5-b12cffe20684-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-01 22:32                               ` Alex Rosenbaum
     [not found]                                 ` <CAFgAxU_-jbAoOR2XMYfSgbDMn7FnrthudZJLgeNNTzb9GEUXrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-02 12:48                                   ` Doug Ledford
     [not found]                                     ` <467c0560-8de7-42e5-14b9-178c367874d2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-02 16:53                                       ` Jason Gunthorpe
2017-03-01 23:18                           ` Hefty, Sean
2017-03-01 21:56                       ` Doug Ledford
     [not found]                         ` <cad3d2c6-0c98-9f58-a3bb-81823103bd76-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-01 22:23                           ` Jason Gunthorpe
2017-03-01 22:34                           ` Alex Rosenbaum
2017-03-01  0:10   ` ira.weiny
     [not found]     ` <20170301001023.GA3001-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2017-03-01 21:21       ` Alex Rosenbaum
     [not found]         ` <CAFgAxU-Vf6RLEo8N=oya0X+WbneRHE1xa_-5GY+TEqo1wyzpow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01 22:17           ` Weiny, Ira
     [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA80-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-03-01 22:38               ` Alex Rosenbaum
2017-03-01 21:57       ` Alex Rosenbaum
     [not found]         ` <CAFgAxU_i+5FtBkdf=t5vdQAvBqDt9aKsFJbAKGN5YWQQ9w-c2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01 22:15           ` Weiny, Ira
     [not found]             ` <2807E5FD2F6FDA4886F6618EAC48510E67C5CA68-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-03-01 22:42               ` Alex Rosenbaum
2017-03-01 23:07               ` Jason Gunthorpe
     [not found]                 ` <20170301230743.GC2820-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-03-02 13:22                   ` Leon Romanovsky
2017-03-02 16:08   ` Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A82373AB0ED3C6-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-03-03 12:31       ` Alex Rosenbaum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.