All of lore.kernel.org
* [RFC] XRC upstream merge reboot
@ 2011-05-16 21:13 Hefty, Sean
       [not found] ` <1828884A29C6694DAF28B7E6B8A82373F7AB-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-05-16 21:13 UTC (permalink / raw)
  To: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

I've been working on a set of XRC patches aimed at upstream inclusion to the kernel, libibverbs, and librdmacm.  I'm using existing patches as the major starting point.  A goal is to maintain the user space ABI.  Before proceeding further, I wanted to get broader feedback.  Starting at the top and working down, these are the basic ideas:


librdmacm
---------
The API is basically unchanged.  XRC usage is indicated through the QP type.  The challenge is determining whether XRC maps to a specific rdma_port_space.


libibverbs
----------
We define a new device capability flag IBV_DEVICE_EXT_OPS, indicating that the library supports extended operations.  If set, the provider library returns an extended structure from ibv_open_device():

	struct ibv_context_ext {
		struct ibv_context context;
		int                version;
		struct ibv_ext_ops ext_ops;
	};

The ext_ops will allow for additional operations not provided by ibv_context_ops, for example:

	struct ibv_ext_ops {
		int	(*share_pd)(struct ibv_pd *pd, int fd, int oflags);
	};
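
For what it's worth, the layout trick this relies on can be sketched as follows.  All type contents below are illustrative stand-ins, not the real libibverbs definitions, and to_ext() is a hypothetical helper, not proposed API:

```c
/* Hedged sketch: because 'struct ibv_context context' is the first
 * member of the extended struct, a context returned by a provider that
 * advertised IBV_DEVICE_EXT_OPS can be cast back to the enclosing
 * ibv_context_ext to reach ext_ops. */
#include <assert.h>
#include <stddef.h>

struct ibv_context { void *device; };                 /* stand-in */
struct ibv_pd      { struct ibv_context *context; };  /* stand-in */

struct ibv_ext_ops {
	int (*share_pd)(struct ibv_pd *pd, int fd, int oflags);
};

struct ibv_context_ext {
	struct ibv_context context;  /* first member: enables the downcast */
	int                version;
	struct ibv_ext_ops ext_ops;
};

static struct ibv_context_ext *to_ext(struct ibv_context *ctx)
{
	/* Valid only when the provider returned an extended context. */
	return (struct ibv_context_ext *) ctx;
}
```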

In order for libibverbs to check for ext_ops support, it steals a byte from the device name:

	/*
	 * Support for extended operations is recorded at the end of
	 * the name character array.
	 */
	#define ext_ops_cap            name[IBV_SYSFS_NAME_MAX - 1]

(If strlen(name) indicates that this byte terminates the string, extended operation support is disabled for this device.)
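
A rough sketch of that check, under the assumption that the capability byte only counts when the name's terminator falls before it.  The struct and the has_ext_ops() helper are illustrative, not proposed API:

```c
#include <assert.h>
#include <string.h>

#define IBV_SYSFS_NAME_MAX 64

struct dev_sketch {
	char name[IBV_SYSFS_NAME_MAX];
};

static int has_ext_ops(const struct dev_sketch *dev)
{
	/* If the name string runs all the way to the last byte, that byte
	 * is the string's terminator and cannot carry the capability
	 * flag: treat extended operations as unsupported. */
	if (strlen(dev->name) >= IBV_SYSFS_NAME_MAX - 1)
		return 0;
	return dev->name[IBV_SYSFS_NAME_MAX - 1];
}
```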

Hopefully, this provides the framework needed for libibverbs to support both old and new provider libraries.

From an architecture viewpoint, XRC adds 4 new XRC-specific objects: domains, INI QPs, TGT QPs, and SRQs.  For the purposes of the libibverbs API only, I'm suggesting the following mappings:

XRC domains - Hidden under a PD, dynamically allocated when needed.  An extended ops call allows the xrcd to be shared between processes.  This minimizes changes to existing structures and APIs which only take a struct ibv_pd.

INI QPs - Exposed through a new IBV_QPT_XRC_SQ qp type.  This is a send-only QP with minimal differences from an RC QP from a user's perspective.

TGT QPs - Not exposed to user space.  XRC TGT QP creation and setup is handled by the kernel.

XRC SRQs - Exposed through a new IBV_QPT_XRC_RQ qp type.  This is an SRQ that is tracked using a struct ibv_qp.  This minimizes API changes to both libibverbs and librdmacm.

If ext_ops are supported and in active use, some calls may expect extended structures; for example, ibv_post_send() would require a struct ibv_xrc_send_wr for XRC QPs:

	struct ibv_xrc_send_wr {
		struct ibv_send_wr wr;
		uint32_t remote_qpn;
	};
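
Since struct ibv_send_wr is the first member, a provider handed the embedded wr pointer could recover the XRC field by casting back, roughly like this (the abridged wr contents and the helper are assumptions for illustration, not proposed API):

```c
#include <assert.h>
#include <stdint.h>

/* Abridged stand-in for the real struct ibv_send_wr. */
struct ibv_send_wr {
	uint64_t            wr_id;
	struct ibv_send_wr *next;
};

struct ibv_xrc_send_wr {
	struct ibv_send_wr wr;        /* first member: enables the upcast */
	uint32_t           remote_qpn;
};

/* A provider that already knows the QP type is XRC can recover the
 * destination TGT QPN from the embedded wr pointer: */
static uint32_t xrc_remote_qpn(const struct ibv_send_wr *wr)
{
	return ((const struct ibv_xrc_send_wr *) wr)->remote_qpn;
}
```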


uverbs
------
(Ideas for kernel changes are sketchier, but the existing patches cover most of the functionality except for IB CM interactions.)

Need new uverbs commands to support alloc/dealloc xrcd and create xrc srq.  Create QP must handle XRC INI QPs.  XRC TGT QPs are not exposed; ***all XRC INI->TGT QP setup is done in band***.

Somewhere, an xrc sub-module listens on a SID and accepts incoming XRC connection requests.  This requires associating the xrcd and SID, the details of which I'm not clear on.  The xrcd is most readily known to uverbs, but a SID is usually formed by the rdma_cm.  Even how the latter is done is unclear.

The usage model I envision is for a user to call listen on an XRC SRQ (IBV_QPT_XRC_RQ), which listens for a SIDR REQ to resolve the SRQN and a REQ to setup the INI->TGT QPs.  The issue is sync'ing the lifetime of any formed connections with the xrcd.


verbs
-----
The patch for this is basically available.  3 new calls are added: ib_create_xrc_srq, ib_alloc_xrcd, and ib_dealloc_xrcd.  The IB_QPT_XRC is split into 2 types: IB_QPT_INI_XRC and IB_QPT_TGT_XRC.  An INI QP has a pd, but no xrcd, while the TGT QP is the reverse.
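
The INI/TGT asymmetry can be stated as a small validity check.  This is a sketch with stand-in names and types; the real ib_qp_init_attr handling is not shown in this RFC:

```c
#include <assert.h>
#include <stddef.h>

enum qp_type_sketch { QPT_INI_XRC, QPT_TGT_XRC };

struct xrc_qp_init_sketch {
	enum qp_type_sketch type;
	void *pd;    /* stand-in for struct ib_pd *   */
	void *xrcd;  /* stand-in for struct ib_xrcd * */
};

/* An INI QP carries a PD but no XRCD; a TGT QP carries an XRCD but no PD. */
static int xrc_qp_init_valid(const struct xrc_qp_init_sketch *a)
{
	switch (a->type) {
	case QPT_INI_XRC:
		return a->pd != NULL && a->xrcd == NULL;
	case QPT_TGT_XRC:
		return a->pd == NULL && a->xrcd != NULL;
	}
	return 0;
}
```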


Existing patches to the mlx4 driver and library would be modified to handle these changes.  If anyone has any thoughts on these changes, I'd appreciate them before I have them implemented.  :)

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] XRC upstream merge reboot
       [not found] ` <1828884A29C6694DAF28B7E6B8A82373F7AB-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-05-18 14:54   ` Jack Morgenstein
       [not found]     ` <201105181754.33759.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-05-18 16:44   ` Roland Dreier
  1 sibling, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-05-18 14:54 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w

Sean,
Great that you are taking this on!  I will review this next week.

-Jack

On Tuesday 17 May 2011 00:13, Hefty, Sean wrote:
> [original RFC quoted in full; snipped]

* RE: [RFC] XRC upstream merge reboot
       [not found]     ` <201105181754.33759.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-05-18 15:27       ` Hefty, Sean
  2011-06-22  7:17       ` Jack Morgenstein
  1 sibling, 0 replies; 53+ messages in thread
From: Hefty, Sean @ 2011-05-18 15:27 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w

> Great that you are taking this on!  I will review this next week.

Hopefully I'll have some early patches sometime next week.  See below for my current thoughts based on how the implementation is progressing.  My thoughts change hourly.

> > From an architecture viewpoint, XRC adds 4 new XRC specific objects:
> domains, INI QPs, TGT QPs, and SRQs.  For the purposes of the libibverbs
> API only, I'm suggesting the following mappings:
> >
> > XRC domains - Hidden under a PD, dynamically allocated when needed.  An
> extended ops call allows the xrcd to be shared between processes.  This
> minimizes changes to existing structures and APIs which only take a struct
> ibv_pd.
> >
> > INI QPs - Exposed through a new IBV_QPT_XRC_SQ qp type.  This is a send-
> only QP with minimal differences from an RC QP from a user's perspective.
> >
> > TGT QPs - Not exposed to user space.  XRC TGT QP creation and setup is
> handled by the kernel.
> >
> > XRC SRQs - Exposed through a new IBV_QPT_XRC_RQ qp type.  This is an SRQ
> that is tracked using a struct ibv_qp.  This minimizes API changes to both
> libibverbs and librdmacm.

The librdmacm wants IBV_QPT_XRC_RQ to be defined, which it uses for SRQN resolution.  XRC_SQ results in using the IB CM connection protocol.  XRC_RQ results in using CM SIDR.

The actual XRC SRQ is allocated through a new call like the existing patches.  XRC_RQ is not used by ibv_create_qp().

> > Need new uverbs commands to support alloc/dealloc xrcd and create xrc
> srq.  Create QP must handle XRC INI QPs.  XRC TGT QPs are not exposed;
> ***all XRC INI->TGT QP setup is done in band***.
> >
> > Somewhere, an xrc sub-module listens on a SID and accepts incoming XRC
> connection requests.  This requires associating the xrcd and SID, the
> details of which I'm not clear on.  The xrcd is most readily known to
> uverbs, but a SID is usually formed by the rdma_cm.  Even how the latter
> is done is unclear.

I'm not sure if xrcd needs to be explicitly exposed to user space.  I think user space will end up needing to be involved to some degree when INI->TGT QPs are established from the perspective of the rdma_cm.  For example, initiating the listen and telling the kernel whether to accept or reject a connection.  I still don't know whether XRC TGT QPs need to be exposed through libibverbs the way the existing patches expose them.

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found] ` <1828884A29C6694DAF28B7E6B8A82373F7AB-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2011-05-18 14:54   ` Jack Morgenstein
@ 2011-05-18 16:44   ` Roland Dreier
       [not found]     ` <BANLkTimWMU9ohSQGYEEnFR0HbBaypFR51A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Roland Dreier @ 2011-05-18 16:44 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Mon, May 16, 2011 at 2:13 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> libibverbs
> ----------
> We define a new device capability flag IBV_DEVICE_EXT_OPS, indicating that the library supports extended operations.  If set, the provider library returns an extended structure from ibv_open_device():
>
>        struct ibv_context_ext {
>                struct ibv_context context;
>                int                version;
>                struct ibv_ext_ops ext_ops;
>        };
>
> The ext_ops will allow for additional operations not provided by ibv_context_ops, for example:
>
>        struct ibv_ext_ops {
>                int     (*share_pd)(struct ibv_pd *pd, int fd, int oflags);
>        };
>
> In order for libibverbs to check for ext_ops support, it steals a byte from the device name:
>
>        /*
>         * Support for extended operations is recorded at the end of
>         * the name character array.
>         */
>        #define ext_ops_cap            name[IBV_SYSFS_NAME_MAX - 1]

I like this idea of allowing for future extensibility... but if we could go even a bit further and have support for named extensions, I think that would be even better.  I.e., we could define a bunch of new XRC-related stuff and then have some interface to the driver where we ask for the "XRC" extension (by name, with a string); that would be very handy for the future.

I wonder if a "ummunotify" extension would make integrating ummunotify stuff into libibverbs easier...

This is inspired by my very limited knowledge of OpenGL, so maybe a more detailed look at the mechanism there would be helpful.

I think stealing a byte from the name to mark this support might make sense, but then I would probably go a bit further and change the second parameter of ibv_register_driver to be an ibv_driver_init_func_ext() so we can extend even before the open of the driver.

Anyway thanks for rebooting this work, Sean.

 - R.

* Re: [RFC] XRC upstream merge reboot
       [not found]     ` <BANLkTimWMU9ohSQGYEEnFR0HbBaypFR51A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-05-18 17:02       ` Jason Gunthorpe
       [not found]         ` <20110518170226.GA2595-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2011-05-18 17:02 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Hefty, Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 18, 2011 at 09:44:01AM -0700, Roland Dreier wrote:

> and have support for named extensions, I think that would be even
> better.  ie we could define a bunch of new XRC related stuff and
> then have some interface to the driver where we ask for the "XRC"
> extension (by name with a string) then that would be very handy for
> the future.

Considering the fairly small community I'm not sure this much
complexity has a payoff? uDAPL already has an API like that and I'm
not sure it has done much for usability.

As long as the version number in the ibv_context is increasing and not
branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 =
+XRC+ummunotify, etc. Drivers 0 out the function pointers they do not
support.

I think getting the XRC stuff sorted is pretty important; it would be nice to keep it tight.

Jason

* RE: [RFC] XRC upstream merge reboot
       [not found]         ` <20110518170226.GA2595-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2011-05-18 17:30           ` Hefty, Sean
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373FBC7-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-05-18 17:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Roland Dreier
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> As long as the version number in the ibv_context is increasing and not
> branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 =
> +XRC+ummunotify, etc. Drivers 0 out the function pointers they do not
> support.

I was thinking more along this line, but I can see how using a named extension could be useful for OFED or vendor specific extensions that aren't part of the upstream libibverbs.  (Whether _that_ is useful is another matter, but it seems to be the world that we're in anyway.)

I'm not familiar with OpenGL, so I'll take a look at it.  (The concept sounds similar to Windows COM interfaces.)

Beyond the interfaces, are there any thoughts on how to handle structure changes, such as:

	struct ibv_xrc_send_wr {
		struct ibv_send_wr wr;
		uint32_t remote_qpn;
	};

Do we want to use the existing ibv_post_send() call, or add a new ibv_post_xrc_send() routine specifically for this purpose (and simplify the above definition)?

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373FBC7-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-05-18 18:05               ` Jason Gunthorpe
       [not found]                 ` <20110518180519.GA11860-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2011-05-18 19:22               ` Roland Dreier
  1 sibling, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2011-05-18 18:05 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Roland Dreier,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 18, 2011 at 05:30:30PM +0000, Hefty, Sean wrote:
> > As long as the version number in the ibv_context is increasing and not
> > branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 =
> > +XRC+ummunotify, etc. Drivers 0 out the function pointers they do not
> > support.
> 
> I was thinking more along this line, but I can see how using a named extension could be useful for OFED or vendor specific extensions that aren't part of the upstream libibverbs.  (Whether _that_ is useful is another matter, but it seems to be the world that we're in anyway.)
> 
> I'm not familiar with OpenGL, so I'll take a look at it.  (The concept sounds similar to Window's COM interfaces.)
> 
> Beyond the interfaces, are there any thoughts on how to handle structure changes, such as:
> 
> 	struct ibv_xrc_send_wr {
> 		struct ibv_send_wr wr;
> 		uint32_t remote_qpn;
> 	};
> 
> ?  Do we want to use the existing ibv_post_send() call, or add a new
> ibv_post_xrc_send() routine specifically for this purpose (and
> simplify the above definition)?

I prefer to see type safety, so ibv_post_xrc_send is my vote.

Just looking at this, I think it can be stuffed into the existing wr..

        union {
                struct {
                        uint64_t        remote_addr;
                        uint32_t        rkey;
                } rdma;
                struct {
                        uint64_t        remote_addr;
                        uint64_t        compare_add;
                        uint64_t        swap;
                        uint32_t        rkey;
                } atomic;
                struct {
                        struct ibv_ah  *ah;
                        uint32_t        remote_qpn;
                        uint32_t        remote_qkey;
                } ud;
        } wr;

(ignoring 32 bit for now) This union must start 64 bit aligned.

The size is 3*64 + 1*32 so there is a 32 bit pad, thus we can rewrite
it as:

        union {
                struct {
                        uint64_t        remote_addr;
                        uint32_t        rkey;
			uint32_t        xrc_remote_qpn;
                } rdma;
                struct {
                        uint64_t        remote_addr;
                        uint64_t        compare_add;
                        uint64_t        swap;
                        uint32_t        rkey;
			uint32_t        xrc_remote_qpn;
                } atomic;
                struct {
                        struct ibv_ah  *ah;
                        uint32_t        remote_qpn;
                        uint32_t        remote_qkey;
                } ud;
        } wr;

Without changing the size at all.

32 bit users will grow, but that is still ABI acceptable because the
pass into ibverbs uses a linked list structure that clearly identifies
the start of each wr. So long as the extended stuff is not touched
if the WR is not related to XRC things will be OK.
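
Jason's padding argument can be checked mechanically.  On an LP64 platform (64-bit pointers, 8-byte uint64_t alignment, i.e. the "ignoring 32 bit" assumption above), both unions come out the same size:

```c
#include <assert.h>
#include <stdint.h>

struct ibv_ah;  /* opaque, only used through a pointer */

/* The union as it exists today (abridged to the quoted members). */
union wr_before {
	struct { uint64_t remote_addr; uint32_t rkey; } rdma;
	struct { uint64_t remote_addr; uint64_t compare_add;
		 uint64_t swap; uint32_t rkey; } atomic;
	struct { struct ibv_ah *ah; uint32_t remote_qpn;
		 uint32_t remote_qkey; } ud;
};

/* The same union with xrc_remote_qpn folded into the tail padding. */
union wr_after {
	struct { uint64_t remote_addr; uint32_t rkey;
		 uint32_t xrc_remote_qpn; } rdma;
	struct { uint64_t remote_addr; uint64_t compare_add;
		 uint64_t swap; uint32_t rkey;
		 uint32_t xrc_remote_qpn; } atomic;
	struct { struct ibv_ah *ah; uint32_t remote_qpn;
		 uint32_t remote_qkey; } ud;
};
```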

Jason

* RE: [RFC] XRC upstream merge reboot
       [not found]                 ` <20110518180519.GA11860-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2011-05-18 18:13                   ` Hefty, Sean
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373FC13-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-05-18 18:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Roland Dreier,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> The size is 3*64 + 1*32 so there is a 32 bit pad, thus we can rewrite
> it as:
> 
>         union {
>                 struct {
>                         uint64_t        remote_addr;
>                         uint32_t        rkey;
> 			uint32_t        xrc_remote_qpn;
>                 } rdma;
>                 struct {
>                         uint64_t        remote_addr;
>                         uint64_t        compare_add;
>                         uint64_t        swap;
>                         uint32_t        rkey;
> 			uint32_t        xrc_remote_qpn;
>                 } atomic;
>                 struct {
>                         struct ibv_ah  *ah;
>                         uint32_t        remote_qpn;
>                         uint32_t        remote_qkey;
>                 } ud;
>         } wr;
> 
> Without changing the size at all.

You need it in the normal send case as well, either outside of the union, or part of a new struct within the union.

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373FC13-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-05-18 18:22                       ` Jason Gunthorpe
  0 siblings, 0 replies; 53+ messages in thread
From: Jason Gunthorpe @ 2011-05-18 18:22 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Roland Dreier,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 18, 2011 at 06:13:54PM +0000, Hefty, Sean wrote:
> You need it in the normal send case as well, either outside of the
> union, or part of a new struct within the union.

Works for me..

union {
  [..]
  struct {
     uint64_t reserved1[3];
     uint32_t reserved2;
     uint32_t remote_qpn;
  } xrc;
};

Can't go outside the union because the pad should be placed inside the
union.

Jason

* Re: [RFC] XRC upstream merge reboot
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373FBC7-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2011-05-18 18:05               ` Jason Gunthorpe
@ 2011-05-18 19:22               ` Roland Dreier
       [not found]                 ` <BANLkTi=cLjErM3pKzihyFtGWZ0kSu9BiPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Roland Dreier @ 2011-05-18 19:22 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jason Gunthorpe,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Wed, May 18, 2011 at 10:30 AM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> As long as the version number in the ibv_context is increasing and not
>> branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 =
>> +XRC+ummunotify, etc. Drivers 0 out the function pointers they do not
>> support.
>
> I was thinking more along this line, but I can see how using a named extension could be useful for OFED or vendor specific extensions that aren't part of the upstream libibverbs.  (Whether _that_ is useful is another matter, but it seems to be the world that we're in anyway.)
>

Exactly... we seem to be in a world of branches, and I do think that
decoupling independent extensions is better for everyone.

 - R.

* RE: [RFC] XRC upstream merge reboot
       [not found]                 ` <BANLkTi=cLjErM3pKzihyFtGWZ0kSu9BiPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-05-19  5:29                   ` Hefty, Sean
  0 siblings, 0 replies; 53+ messages in thread
From: Hefty, Sean @ 2011-05-19  5:29 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jason Gunthorpe,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

What about something along the lines of the following?  This is 2 incomplete patches squashed together, lacking any serious documentation.  I believe this will support existing apps and driver libraries, either as binaries or by recompiling unmodified source code.

A driver library calls ibv_register_driver_ext() if it implements the extended operations.  (Roland, you mentioned changing the ibv_register_driver init_func parameter, but I didn't follow your thought.)  Extended ops are defined at both the ibv_device and ibv_context levels.  An app may also query a device to see if it supports a specific extension.

There is an assumption that extension names reflect the availability of the extension.  For example, there may be "ibv_foo", "ofed_foo", or "mlx_foo" APIs.  The "ibv" extensions would be those defined by the libibverbs package.


---
 include/infiniband/verbs.h |   32 ++++++++++++++++++++++++++++++++
 src/device.c               |   18 ++++++++++++++++++
 src/init.c                 |   12 +++++++++++-
 src/libibverbs.map         |    5 +++++
 src/verbs.c                |    9 +++++++++
 5 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 0f1cb2e..fad875f 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -617,12 +617,19 @@ struct ibv_device {
 	enum ibv_transport_type	transport_type;
 	/* Name of underlying kernel IB device, eg "mthca0" */
 	char			name[IBV_SYSFS_NAME_MAX];
+	/*
+	 * Support for extended operations is recorded at the end of
+	 * the name character array.
+	 */
+#define ext_ops_supported	name[IBV_SYSFS_NAME_MAX - 1]
 	/* Name of uverbs device, eg "uverbs0" */
 	char			dev_name[IBV_SYSFS_NAME_MAX];
 	/* Path to infiniband_verbs class device in sysfs */
 	char			dev_path[IBV_SYSFS_PATH_MAX];
 	/* Path to infiniband class device in sysfs */
 	char			ibdev_path[IBV_SYSFS_PATH_MAX];
+	int			(*have_ext_ops)(struct ibv_device *device, const char *ext_name);
+	void *			(*get_device_ext_ops)(struct ibv_device *device, const char *ext_name);
 };
 
 struct ibv_context_ops {
@@ -683,6 +690,14 @@ struct ibv_context_ops {
 	void			(*async_event)(struct ibv_async_event *event);
 };
 
+#define IBV_XRC_OPS "ibv_xrc"
+
+struct ibv_xrc_ops {
+	struct ibv_xrcd *	(*open_xrcd)();
+	int			(*close_xrcd)();
+	struct ibv_srq *	(*create_srq)();
+};
+
 struct ibv_context {
 	struct ibv_device      *device;
 	struct ibv_context_ops	ops;
@@ -691,6 +706,7 @@ struct ibv_context {
 	int			num_comp_vectors;
 	pthread_mutex_t		mutex;
 	void		       *abi_compat;
+	void *			(*get_ext_ops)(char *ext_name);
 };
 
 /**
@@ -724,6 +740,17 @@ const char *ibv_get_device_name(struct ibv_device *device);
 uint64_t ibv_get_device_guid(struct ibv_device *device);
 
 /**
+ * ibv_have_ext_ops - Return true if device supports the requested
+ * extended operations.
+ */
+int ibv_have_ext_ops(struct ibv_device *device, const char *name);
+
+/**
+ * ibv_get_device_ext_ops - Return extended operations.
+ */
+void *ibv_get_device_ext_ops(struct ibv_device *device, const char *name);
+
+/**
  * ibv_open_device - Initialize device for use
  */
 struct ibv_context *ibv_open_device(struct ibv_device *device);
@@ -734,6 +761,11 @@ struct ibv_context *ibv_open_device(struct ibv_device *device);
 int ibv_close_device(struct ibv_context *context);
 
 /**
+ * ibv_get_ext_ops - Return extended operations.
+ */
+void *ibv_get_ext_ops(struct ibv_context *context, const char *name);
+
+/**
  * ibv_get_async_event - Get next async event
  * @event: Pointer to use to return async event
  *
diff --git a/src/device.c b/src/device.c
index 185f4a6..da0eca4 100644
--- a/src/device.c
+++ b/src/device.c
@@ -181,6 +181,24 @@ int __ibv_close_device(struct ibv_context *context)
 }
 default_symver(__ibv_close_device, ibv_close_device);
 
+int __ibv_have_ext_ops(struct ibv_device *device, const char *name)
+{
+	if (!device->ext_ops_supported)
+		return 0;
+
+	return device->have_ext_ops(device, name);
+}
+default_symver(__ibv_have_ext_ops, ibv_have_ext_ops);
+
+void *__ibv_get_device_ext_ops(struct ibv_device *device, const char *name)
+{
+	if (!device->ext_ops_supported)
+		return NULL;
+
+	return device->get_device_ext_ops(device, name);
+}
+default_symver(__ibv_get_device_ext_ops, ibv_get_device_ext_ops);
+
 int __ibv_get_async_event(struct ibv_context *context,
 			  struct ibv_async_event *event)
 {
diff --git a/src/init.c b/src/init.c
index 4f0130e..0fdc89c 100644
--- a/src/init.c
+++ b/src/init.c
@@ -71,6 +71,7 @@ struct ibv_driver {
 	const char	       *name;
 	ibv_driver_init_func	init_func;
 	struct ibv_driver      *next;
+	int			ext_ops_supported;
 };
 
 static struct ibv_sysfs_dev *sysfs_dev_list;
@@ -153,7 +154,8 @@ static int find_sysfs_devs(void)
 	return ret;
 }
 
-void ibv_register_driver(const char *name, ibv_driver_init_func init_func)
+void ibv_register_driver_ext(const char *name, ibv_driver_init_func init_func,
+			     int ext_ops_supported)
 {
 	struct ibv_driver *driver;
 
@@ -174,6 +176,11 @@ void ibv_register_driver(const char *name, ibv_driver_init_func init_func)
 	tail_driver = driver;
 }
 
+void ibv_register_driver(const char *name, ibv_driver_init_func init_func)
+{
+	ibv_register_driver_ext(name, init_func, 0);
+}
+
 static void load_driver(const char *name)
 {
 	char *so_name;
@@ -368,6 +375,9 @@ static struct ibv_device *try_driver(struct ibv_driver *driver,
 	strcpy(dev->name,       sysfs_dev->ibdev_name);
 	strcpy(dev->ibdev_path, sysfs_dev->ibdev_path);
 
+	if (strlen(dev->name) < sizeof(dev->name) - 1)
+		dev->ext_ops_supported = driver->ext_ops_supported;
+
 	return dev;
 }
 
diff --git a/src/libibverbs.map b/src/libibverbs.map
index 1827da0..422e07f 100644
--- a/src/libibverbs.map
+++ b/src/libibverbs.map
@@ -96,4 +96,9 @@ IBVERBS_1.1 {
 		ibv_port_state_str;
 		ibv_event_type_str;
 		ibv_wc_status_str;
+
+		ibv_register_driver_ext;
+		ibv_have_ext_ops;
+		ibv_get_device_ext_ops;
+		ibv_get_ext_ops;
 } IBVERBS_1.0;
diff --git a/src/verbs.c b/src/verbs.c
index ba3c0a4..0be9ac2 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -76,6 +76,15 @@ enum ibv_rate mult_to_ibv_rate(int mult)
 	}
 }
 
+void *__ibv_get_ext_ops(struct ibv_context *context, const char *name)
+{
+	if (!context->device->ext_ops_supported)
+		return NULL;
+
+	return context->get_ext_ops(context, name);
+}
+default_symver(__ibv_get_ext_ops, ibv_get_ext_ops);
+
 int __ibv_query_device(struct ibv_context *context,
 		       struct ibv_device_attr *device_attr)
 {


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [RFC] XRC upstream merge reboot
       [not found]     ` <201105181754.33759.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-05-18 15:27       ` Hefty, Sean
@ 2011-06-22  7:17       ` Jack Morgenstein
       [not found]         ` <201106221017.06212.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-06-22  7:17 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

Hi Sean,

Some initial feature feedback.

I noticed (from the code in your git, xrc branch) that the XRC target QPs
stick around until the XRC domain is de-allocated.
There was a long thread about this in December, 2007, where the MPI community
found this approach unacceptable (leading to accumulation of "dead" XRC TGT qp's).
They needed to leave the XRC domain active, and just allocate/delete TGT QPs as needed,
without resource usage buildup.

See the thread starting at:
http://lists.openfabrics.org/pipermail/general/2007-December/044215.html
([ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process)

This discussion led to the addition of the XRC reg/unreg verbs, which let processes
"register" with XRC TGT QPs, and to reference counting for destroying
these QPs.

In addition, this approach also required propagating the XRC TGT QP events
to all processes registered with that QP, so that they could unregister
in the event of an error -- reducing the QP reference count and allowing it to be destroyed.

See also the threads starting at:
http://lists.openfabrics.org/pipermail/general/2008-January/045302.html
([ofa-general] [PATCH 0/ 8] XRC patch series (including xrc receive-only QPs))

I know that this looks ugly, but I did not see any method other than registration that
allows TGT QPs to persist without unnecessary "dead" resource buildup.

Any thoughts on this?

-Jack

On Wednesday 18 May 2011 17:54, Jack Morgenstein wrote:
> Sean,
> Great that you are taking this on!  I will review this next week.
> 
> -Jack
> 
> On Tuesday 17 May 2011 00:13, Hefty, Sean wrote:
> > I've been working on a set of XRC patches aimed at upstream inclusion to the kernel, libibverbs, and librdmacm.  I'm using existing patches as the major starting point.  A goal is to maintain the user space ABI.  Before proceeding further, I wanted to get broader feedback.  Starting at the top and working down, these are the basic ideas:
> > 
> > 
> > librdmacm
> > ---------
> > The API is basically unchanged.  XRC usage is indicated through the QP type.  The challenge is determining if XRC maps to a specific rdma_port_space.
> > 
> > 
> > libibverbs
> > ----------
> > We define a new device capability flag IBV_DEVICE_EXT_OPS, indicating that the library supports extended operations.  If set, the provider library returns an extended structure from ibv_open_device():
> > 
> > 	struct ibv_context_ext {
> > 		struct ibv_context context;
> > 		int                version;
> > 		struct ibv_ext_ops ext_ops;
> > 	};
> > 
> > The ext_ops will allow for additional operations not provided by ibv_context_ops, for example:
> > 
> > 	struct ibv_ext_ops {
> > 		int	(share_pd)(struct ibv_pd *pd, int fd, int oflags);
> > 	};
> > 
> > In order for libibverbs to check for ext_ops support, it steals a byte from the device name:
> > 
> > 	/*
> > 	 * Support for extended operations is recorded at the end of
> > 	 * the name character array.
> > 	 */
> > 	#define ext_ops_cap            name[IBV_SYSFS_NAME_MAX - 1]
> > 
> > (If strlen(name) indicates that this byte terminates the string, extended operation support is disabled for this device.)
> > 
> > Hopefully, this provides the framework needed for libibverbs to support both old and new provider libraries.
> > 
> > From an architecture viewpoint, XRC adds 4 new XRC specific objects: domains, INI QPs, TGT QPs, and SRQs.  For the purposes of the libibverbs API only, I'm suggesting the following mappings:
> > 
> > XRC domains - Hidden under a PD, dynamically allocated when needed.  An extended ops call allows the xrcd to be shared between processes.  This minimizes changes to existing structures and APIs which only take a struct ibv_pd.
> > 
> > INI QPs - Exposed through a new IBV_QPT_XRC_SQ qp type.  This is a send-only QP with minimal differences from an RC QP from a user's perspective.
> > 
> > TGT QPs - Not exposed to user space.  XRC TGT QP creation and setup is handled by the kernel.
> > 
> > XRC SRQs - Exposed through a new IBV_QPT_XRC_RQ qp type.  This is an SRQ that is tracked using a struct ibv_qp.  This minimizes API changes to both libibverbs and librdmacm.
> > 
> > If ext_ops are supported and in active use, extended structures may be expected with some calls, such as ibv_post_send() requiring a struct ibv_xrc_send_wr for XRC QPs.
> > 
> > 	struct ibv_xrc_send_wr {
> > 		struct ibv_send_wr wr;
> > 		uint32_t remote_qpn;
> > 	};
> > 
> > 
> > uverbs
> > ------
> > (Ideas for kernel changes are sketchier, but the existing patches cover most of the functionality except for IB CM interactions.)
> > 
> > Need new uverbs commands to support alloc/dealloc xrcd and create xrc srq.  Create QP must handle XRC INI QPs.  XRC TGT QPs are not exposed; ***all XRC INI->TGT QP setup is done in band***.
> > 
> > Somewhere, an xrc sub-module listens on a SID and accepts incoming XRC connection requests.  This requires associating the xrcd and SID, the details of which I'm not clear on.  The xrcd is most readily known to uverbs, but a SID is usually formed by the rdma_cm.  Even how the latter is done is unclear.
> > 
> > The usage model I envision is for a user to call listen on an XRC SRQ (IBV_QPT_XRC_RQ), which listens for a SIDR REQ to resolve the SRQN and a REQ to setup the INI->TGT QPs.  The issue is sync'ing the lifetime of any formed connections with the xrcd.
> > 
> > 
> > verbs
> > -----
> > The patch for this is basically available.  3 new calls are added: ib_create_xrc_srq, ib_alloc_xrcd, and ib_dealloc_xrcd.  The IB_QPT_XRC is split into 2 types: IB_QPT_INI_XRC and IB_QPT_TGT_XRC.  An INI QP has a pd, but no xrcd, while the TGT QP is the reverse.
> > 
> > 
> > Existing patches to the mlx4 driver and library would be modified to handle these changes.  If anyone has any thoughts on these changes, I'd appreciate them before I have them implemented.  :)
> > 
> > - Sean
> 

* RE: [RFC] XRC upstream merge reboot
       [not found]         ` <201106221017.06212.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-06-22 16:14           ` Hefty, Sean
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373029A95-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-06-22 16:14 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

> I noticed (from the code in your git, xrc branch) that the XRC target QPs
> stick around until the XRC domain is de-allocated.
> There was a long thread about this in December, 2007, where the MPI
> community
> found this approach unacceptable (leading to accumulation of "dead" XRC
> TGT qp's).
> They needed to leave the XRC domain active, and just allocate/delete TGT
> QPs as needed,
> without resource usage buildup.

This is partly true, and I haven't come up with a better way to handle this.  Note that the patches allow the original creator of the TGT QP to destroy it by simply calling ibv_destroy_qp().  This doesn't handle the process dying, but maybe that's not a real concern.

If the QP is tied into the CM protocol, it may also be possible to automatically destroy it when receiving a DREQ, provided that the creating process no longer owns it.  This would need to be a new patch.

Thanks for the pointers to the threads.  I'll re-read those.

> This discussion lead to the addition of the XRC reg/unreg verbs for
> processes to
> "register" with XRC TGT QPs, and reference counting for destroying
> these QPs.

After looking at the implementation more, what I didn't like about the reg/unreg calls is that they are independent of receiving data on an SRQ.  That is, a user can receive data on an SRQ through a TGT QP before registering and after unregistering.  From the perspective of a user space API, this is counter-intuitive.  The reg/unreg calls are basically for reference counting some kernel component.

I also liked the idea of having a single process own control of the TGT QP, for the purposes of modifying it or destroying it, separately from other processes that may be sharing the same xrcd.

> In addition, this approach also required propagating the XRC TGT QP events
> to all processes registered with that QP, so that they could unregister
> in the event of an error -- reducing the QP reference count and allowing
> it to be destroyed.

Ok - I was having a hard time figuring out what exactly all of the processes were supposed to do with the TGT QP events.  It seemed like only one of them could actually respond to any error.

- Sean


* Re: [RFC] XRC upstream merge reboot
       [not found]             ` <1828884A29C6694DAF28B7E6B8A82373029A95-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-06-22 17:03               ` Jack Morgenstein
       [not found]                 ` <201106222003.50214.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-06-22 17:03 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

On Wednesday 22 June 2011 19:14, Hefty, Sean wrote:
> This is partly true, and I haven't come up with a better way to handle this.  
> Note that the patches allow the original creator of the TGT QP to destroy it by simply calling ibv_destroy_qp().
> This doesn't handle the process dying, but maybe that's not a real concern.  
Correct, that was not the MPI community's main concern.
 
> 
> After looking at the implementation more, what I didn't like about the reg/unreg
> calls is that it is independent of receiving data on an SRQ.  That is, a user can
> receive data on an SRQ through a TGT QP before they have registered and after unregistering.
That is correct, but registering/unregistering only expresses "interest" in the TGT QP (i.e., it is
for reference counting only, as you indicated); it does nothing to enable the SRQ to receive traffic.
When the reference count on the TGT QP goes to zero, it is automatically destroyed.
(The TGT QP creator itself takes a reference, which is decremented when either the
process exits, or it calls the deref procedure.)

In my thread post
    http://lists.openfabrics.org/pipermail/general/2008-January/045302.html  ,
I describe the intended workflow for this interface.

-Jack


* RE: [RFC] XRC upstream merge reboot
       [not found]                 ` <201106222003.50214.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-06-22 18:19                   ` Hefty, Sean
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373029B3F-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-06-22 18:19 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

> > After looking at the implementation more, what I didn't like about the
> reg/unreg
> > calls is that it is independent of receiving data on an SRQ.  That is, a
> user can
> > receive data on an SRQ through a TGT QP before they have registered and
> after unregistering.
> That is correct, but the registering/unregistering is for expressing
> "interest" in the TGT QP, only (i.e., for
> reference counting only, as you indicated), and not at all for enabling
> the SRQ to receive traffic.
> When the reference count on the TGT QP goes to zero, it is automatically
> destroyed.
> (The TGT QP creator itself takes a reference, which is decremented when
> either the
> process exits, or it calls the deref procedure.)
> 
> In my thread post
>     http://lists.openfabrics.org/pipermail/general/2008-
> January/045302.html  ,
> I describe the intended workflow for this interface.

I read over the threads that you referenced.  I do understand what the reg/unreg calls were trying to do.  In short, I agree with your original approach of letting the tgt qp hang around while the xrcd exists, and I'm not convinced what HP MPI was trying to do should drive a more complicated implementation and usage model.

For MPI, I would expect an xrcd to be associated with a single job instance.  Trying to share an xrcd across jobs just seems like a bad idea.  (You asked about this here  http://lists.openfabrics.org/pipermail/general/2007-December/044282.html)  A tgt qp should be fairly minimal in its resource allocation.  Would it really be that bad to just let it hang around until the xrcd was destroyed?  It makes the usage model much simpler.  TGT QPs are handled using the existing API and kernel ABI.

Here are some other random thoughts.  We can destroy an unassociated tgt qp on an error, but is it likely a tgt qp will get an async error?  We can also destroy an unassociated tgt qp when an xrcd has no more associated srq's.  We can report the creation of a tgt qp on an xrcd as an async event.  Should there be a way for a user to query all tgt qp's that exist on an xrcd?  Should only one process 'own' the tgt qp, or should any process that can open the xrcd be allowed to modify any tgt qp?

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373029B3F-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-06-22 19:21                       ` Jack Morgenstein
       [not found]                         ` <201106222221.05993.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-06-22 19:21 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

> I read over the threads that you referenced.  I do understand what the 
> reg/unreg calls were trying to do. In short, I agree with your original approach
> of letting the tgt qp hang around while the xrcd exists, 
> and I'm not convinced what HP MPI was trying to do should drive a 
> more complicated implementation and usage model.
I believe that MPI is the major XRC user, and we wished to make XRC as easy
as possible for them to use in their environment.
 
> For MPI, I would expect an xrcd to be associated with a single job instance.
So did I, but they said that this was not the case, and they were very pleased
with the final (more complicated implementation-wise) interface.
We need to get them involved in this discussion ASAP. 

Tziporet, who should be the MPI contacts for this thread?

> Trying to share an xrcd across jobs just seems like a bad idea.
I agree, but this did not seem to be an issue for the MPI community.
I am aware of the dangers of Job A crossing over into Job B's SRQs,
because the XRC domains are not distinct. Evidently, though, they did not consider this
loophole dangerous.

> (You asked about this here  http://lists.openfabrics.org/pipermail/general/2007-December/044282.html)
> A tgt qp should be fairly minimal in its resource allocation.  Would it really be that bad to just let
> it hang around until the xrcd was destroyed?
This does constitute a resource leakage of sorts, since "dead" resources would accumulate in the xrc domain.

> It makes the usage model much simpler.  TGT QPs are handled using the existing API and kernel ABI.     
> 
> Here are some other random thoughts.  We can destroy an unassociated tgt qp on an error, but is it likely
> a tgt qp will get an async error?
I need to think about this.

> We can also destroy an unassociated tgt qp when an xrcd 
> has no more associated srq's.
This is a problem, because in the MPI usage model there are almost always active associated SRQs.
One job may still be finishing when another starts (creating more associated SRQs of its own),
so we might never have a window of inactivity here.

> We can report the creation of a tgt qp on an xrcd as an async event.
To whom?

> Should there be a way for a user to query all tgt qp's that exist on an xrcd?
There has been no request for such a feature as yet.  However, with the current OFED implementation,
when a job finishes, all its TGT QPs are destroyed because their reference counts go to zero.
(This mechanism actually works very well.)

> Should only one process 'own' the tgt qp, or should any process that can open the xrcd be allowed to modify any tgt qp?
Currently, any process that has access to the xrcd object and the tgt QP number can modify and query that QP.
(The (xrc domain handle + qpn) pair functions as the handle.)

-Jack 

* RE: [RFC] XRC upstream merge reboot
       [not found]                         ` <201106222221.05993.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-06-22 19:32                           ` Tziporet Koren
  2011-06-22 19:57                           ` Hefty, Sean
  1 sibling, 0 replies; 53+ messages in thread
From: Tziporet Koren @ 2011-06-22 19:32 UTC (permalink / raw)
  To: Jack Morgenstein, Hefty, Sean, Yevgeny Kliteynik,
	Sayantan Sur
	(surs-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP@public.gmane.org)
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dotan Barak, Dhabaleswar K. Panda


> > For MPI, I would expect an xrcd to be associated with a single job instance.
> > So did I, but they said that this was not the case, and they were very pleased
> > with the final (more complicated implementation-wise) interface.
> > We need to get them involved in this discussion ASAP. 

> Tziporet, who should be the MPI contacts for this thread?

Yevgeny K. can contact the Open MPI people
Sayantan can be the contact for MVAPICH

Tziporet
 

* RE: [RFC] XRC upstream merge reboot
       [not found]                         ` <201106222221.05993.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-06-22 19:32                           ` Tziporet Koren
@ 2011-06-22 19:57                           ` Hefty, Sean
       [not found]                             ` <1828884A29C6694DAF28B7E6B8A82373029BDE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-06-22 19:57 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

> > For MPI, I would expect an xrcd to be associated with a single job
> instance.
> So did I, but they said that this was not the case, and they were very
> pleased
> with the final (more complicated implementation-wise) interface.
> We need to get them involved in this discussion ASAP.

I agree.  But I've also heard MPI developers complain loud and long about how difficult it is for them to establish connections over IB.

Maybe we can come up with something that supports both usage models and let the user specify the lifetime of the tgt qp.

> > We can report the creation of a tgt qp on an xrcd as an async event.
> To whom?

to all users of the xrcd.  IMO, if we require undefined, out of band communication to use XRC, then we have an incomplete solution.  It's just too bad that we can't report additional data (like the tgt qpn) with an async event...
 
> > Should there be a way for a user to query all tgt qp's that exist on an
> xrcd?
> There has been no request for such a feature as yet.  However, with the
> current OFED implementation,
> when a job finished all its TGT qp's are destroyed because their reference
> counts go to zero.

Again, I don't think we should rely on undefined communication to make xrc work.  If we must rely on some sort of registration feature, then there should be some standard way for communicating the tgt qpn's.  If we can't define some standard way of doing that because it 'breaks' the apps, then we should rethink the registration approach.

Also, MPI ignores a lot of the IB standard for connections and SA communication.  I don't believe that what we push upstream should.  We need to handle XRC using the CM protocol, alternate paths, etc. and be able to route those events to the correct responding process.  Maybe we need some way to transfer ownership of a tgt qp from one process to another, rather than trying to share ownership.

Is there *any* way for a tgt qp to know if the remote ini qp is still active?

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                             ` <1828884A29C6694DAF28B7E6B8A82373029BDE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-06-23  6:11                               ` Jack Morgenstein
  2011-06-23  6:35                               ` Jack Morgenstein
  1 sibling, 0 replies; 53+ messages in thread
From: Jack Morgenstein @ 2011-06-23  6:11 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

On Wednesday 22 June 2011 22:57, Hefty, Sean wrote:
> Is there *any* way for a tgt qp to know if the remote ini qp is still active?
> 
Not that I am aware of.

-Jack

* Re: [RFC] XRC upstream merge reboot
       [not found]                             ` <1828884A29C6694DAF28B7E6B8A82373029BDE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2011-06-23  6:11                               ` Jack Morgenstein
@ 2011-06-23  6:35                               ` Jack Morgenstein
       [not found]                                 ` <201106230935.07425.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-06-23  6:35 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

On Wednesday 22 June 2011 22:57, Hefty, Sean wrote:
> > > We can report the creation of a tgt qp on an xrcd as an async event.
> > To whom?
> 
> to all users of the xrcd.  IMO, if we require undefined, out of band communication to use
> XRC, then we have an incomplete solution.  It's just too bad that we can't report additional
> data (like the tgt qpn) with an async event...  
> 
I do not think this will help much.  Just knowing that a TGT QP exists is not really relevant.  The sender (who is on a remote host)
needs to know about the TGT QP in order to use it to send to the XRC SRQs.  We cannot propagate an event over to the sender side!  The local users
of the TGT QP don't really care -- it is the responsibility of the application to get the INI-TGT linkage up and the XRC SRQs created before
sending from INI to SRQ.  At that point, the SRQ either receives packets or it does not.

I do agree, though, that the need to use out-of-band communication (rather than the CM) to establish the INI-TGT RC connection is not clean.

If you are thinking of using this for tracking, there is no way that I can see except reference counting to know when all users of the TGT have
finished with it.  Indicating usage by "tapping" the TGT QP whenever an SRQ receives a packet via the TGT is not a good idea (this is data
path!), not to mention that quiescence is not a good indicator of whether an app is done using the target (the amount of time that things
are quiet depends on the app).

We discussed using a "heartbeat" in the thread link I sent you, and ultimately discarded that idea.

If you have other ideas which you think might work, please send them along.

-Jack

* RE: [RFC] XRC upstream merge reboot
       [not found]                                 ` <201106230935.07425.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-06-23 18:03                                   ` Hefty, Sean
  2011-07-20 18:51                                   ` Hefty, Sean
  1 sibling, 0 replies; 53+ messages in thread
From: Hefty, Sean @ 2011-06-23 18:03 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w

> If you are thinking of using this for tracking, there is no way that I can
> see except reference counting to know when all users of the TGT have
> finished with it.  Indicating usage by "tapping" the TGT QP whenever an
> SRQ receives a packet via the TGT is not a good idea (this is data
> path!), not to mention that quiescence is not a good indicator of whether
> an app is done using the target (the amount of time that things
> are quiet depends on the app).

I mentioned the event as a standard way for communicating the tgt qp to other processes using the xrcd.  Currently, we have something like this:

Process 1:
create tgt qp
for each relevant process
	send out of band message with tgt qpn

Process 2..n:
receive tgt qpn
kernel call to register with tgt qp
kernel searches for tgt qpn, bumps ref count
<later>
kernel call to unregister with tgt qp
kernel searches for tgt qpn, decrements ref count

The use of an event is simply an (incomplete) idea to standardize what is currently done by sending/receiving the out of band messages with the tgt qpn.  On this note, I can't imagine that this usage model performs better than simply creating a new xrcd for each job...

I was thinking about this more last night, in more generic terms, what we want is a way for an application to allocate a QP in the kernel and share it with other apps.  I think it would be best to reuse the existing ibverbs API and kern-abi as much as possible.  (Eventually someone may find a use for this other than just TGT QPs.)  I just don't know how we handle sharing or transferring ownership of the CM data.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC] XRC upstream merge reboot
       [not found]                                 ` <201106230935.07425.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-06-23 18:03                                   ` Hefty, Sean
@ 2011-07-20 18:51                                   ` Hefty, Sean
       [not found]                                     ` <1828884A29C6694DAF28B7E6B8A82373136F63B9-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-07-20 18:51 UTC (permalink / raw)
  To: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)
  Cc: tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still:

1. Allow the creating process to destroy it at any time, and

2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the xrc domain
or
2b. The creating process specifies during the creation of the tgt qp whether the qp should be destroyed on exit.

The MPIs associate an xrc domain with a job, so this should work.  Everything else significantly complicates the usage model and implementation, both for verbs and the CM.  An application that wants to share the xrcd across jobs can maintain a reference count out of band with a persistent server and use explicit destruction.

Option 2a is the current implementation, but 2b should be a minor change.  I'd like to reach a consensus on the right approach here, since there don't appear to be issues elsewhere.

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                     ` <1828884A29C6694DAF28B7E6B8A82373136F63B9-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-07-21  7:38                                       ` Jack Morgenstein
       [not found]                                         ` <201107211038.23000.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-07-21  7:38 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Wednesday 20 July 2011 21:51, Hefty, Sean wrote:
> I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp,
> and I think the best approach is still: 
> 
> 1. Allow the creating process to destroy it at any time, and
> 
> 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the xrc domain
> or
> 2b. The creating process specifies during the creation of the tgt qp
> whether the qp should be destroyed on exit. 
> 
> The MPIs associate an xrc domain with a job, so this should work.
> Everything else significantly complicates the usage model and implementation,
> both for verbs and the CM.  An application can maintain a reference count
> out of band with a persistent server and use explicit destruction
> if they want to share the xrcd across jobs.
I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/
unreg_xrc_rcv_qp verbs.  Correct?
> 
> Option 2a is the current implementation, but 2b should be a minor change.
> I'd like to reach a consensus on the right approach here, since there doesn't
> appear to be issues elsewhere.  
> 
> - Sean

I have no opinion either way (with regard to tgt qp registration and reference counting).
The OFED xrc implementation was driven by the requirements of the MPI community.

If MPI can use a different XRC domain per job (and deallocate the domain
at the job's end), this would solve the tgt qp lifetime problem (-- by
destroying all the tgt qp's when the xrc domain is deallocated).

Regarding option 2b: do you mean that in this case the tgt qp is NOT bound to the
XRC domain lifetime?  Who destroys the tgt qp in this case, when the creator
indicates that the tgt qp should not be destroyed on exit?

I am concerned with backwards compatibility here.  It seems that XRC users
will need to change their source code, not just recompile.  I am assuming
that OFED will take the mainstream kernel implementation at some point.
Since this is **userspace** code, there could be a problem if OFED users
upgrade their OFED installation to one which supports the new interface.
This could be especially difficult if, for example, the customer is using
3rd-party packages which utilize the current OFED xrc interface.  We could
start seeing customers not take new OFED releases solely because of the XRC
incompatibility (or worse, customers upgrading and then finding out that
their 3rd-party XRC apps no longer work).

Having a new OFED support BOTH interfaces is a nightmare I don't even want to think about!

-Jack

* Re: [RFC] XRC upstream merge reboot
       [not found]                                         ` <201107211038.23000.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-07-21  7:57                                           ` Jack Morgenstein
  2011-07-21 11:58                                           ` Jeff Squyres
  2011-07-21 17:53                                           ` Hefty, Sean
  2 siblings, 0 replies; 53+ messages in thread
From: Jack Morgenstein @ 2011-07-21  7:57 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Thursday 21 July 2011 10:38, Jack Morgenstein wrote:
> Having a new OFED support BOTH interfaces is a nightmare I don't even want to think about!

I over-reacted here, sorry about that.  I know that it will be difficult
to support both the old and the new interface.  However, to support the
current OFED customer base, we will do so, easing the interface transition
for apps, which we expect will take place over time.

Sean, we will do our best to help you with getting XRC into the mainstream
kernel. 

-Jack

* Re: [RFC] XRC upstream merge reboot
       [not found]                                         ` <201107211038.23000.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-07-21  7:57                                           ` Jack Morgenstein
@ 2011-07-21 11:58                                           ` Jeff Squyres
       [not found]                                             ` <D8276D45-5FE8-464C-B3A4-14404DE8C760-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  2011-07-21 17:53                                           ` Hefty, Sean
  2 siblings, 1 reply; 53+ messages in thread
From: Jeff Squyres @ 2011-07-21 11:58 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Hefty, Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Jul 21, 2011, at 3:38 AM, Jack Morgenstein wrote:

> If MPI can use a different XRC domain per job (and deallocate the domain
> at the job's end), this would solve the tgt qp lifetime problem (-- by
> destroying all the tgt qp's when the xrc domain is deallocated).

What happens if the MPI job crashes and does not properly deallocate the XRC domain / tgt qp?

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: [RFC] XRC upstream merge reboot
       [not found]                                                 ` <201107211547.31850.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-07-21 12:46                                                   ` Jeff Squyres
  2011-07-21 16:06                                                   ` Hefty, Sean
  1 sibling, 0 replies; 53+ messages in thread
From: Jeff Squyres @ 2011-07-21 12:46 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Hefty, Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Jul 21, 2011, at 8:47 AM, Jack Morgenstein wrote:

> [snip]
> When the last user of an XRC domain exits cleanly (or crashes), the domain should be destroyed.
> In this case, with Sean's design, the tgt qp's for the XRC domain should also be destroyed.

Sounds perfect.

-- 
Jeff Squyres
jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


* Re: [RFC] XRC upstream merge reboot
       [not found]                                             ` <D8276D45-5FE8-464C-B3A4-14404DE8C760-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2011-07-21 12:47                                               ` Jack Morgenstein
       [not found]                                                 ` <201107211547.31850.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-07-21 12:47 UTC (permalink / raw)
  To: Jeff Squyres, Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Thursday 21 July 2011 14:58, Jeff Squyres wrote:
> > If MPI can use a different XRC domain per job (and deallocate the domain
> > at the job's end), this would solve the tgt qp lifetime problem (-- by
> > destroying all the tgt qp's when the xrc domain is deallocated).
> 
> What happens if the MPI job crashes and does not properly deallocate the XRC domain / tgt qp?
> 
I assume that you mean with Sean's new interface. Sean should verify what I write below.

If you use file descriptors for the XRC domain, then when the last user of the
domain exits, the domain gets destroyed (at least this is how OFED behaves;
Sean's code looks the same).

In this case, the kernel cleanup code for the process should close the XRC domains opened by that
process, so there is no leakage.

When the last user of an XRC domain exits cleanly (or crashes), the domain should be destroyed.
In this case, with Sean's design, the tgt qp's for the XRC domain should also be destroyed.

Sean, is this correct?

-Jack

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                 ` <201107211547.31850.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-07-21 12:46                                                   ` Jeff Squyres
@ 2011-07-21 16:06                                                   ` Hefty, Sean
  1 sibling, 0 replies; 53+ messages in thread
From: Hefty, Sean @ 2011-07-21 16:06 UTC (permalink / raw)
  To: Jack Morgenstein, Jeff Squyres
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> If you use file descriptors for the XRC domain, then when the last user of the
> domain exits, the domain
> gets destroyed (at least this is in OFED.  Sean's code looks the same).
> 
> In this case, the kernel cleanup code for the process should close the XRC
> domains opened by that
> process, so there is no leakage.
> 
> When the last user of an XRC domain exits cleanly (or crashes), the domain
> should be destroyed.
> In this case, with Sean's design, the tgt qp's for the XRC domain should also
> be destroyed.
> 
> Sean, is this correct?

This is correct.

* RE: [RFC] XRC upstream merge reboot
       [not found]                                         ` <201107211038.23000.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-07-21  7:57                                           ` Jack Morgenstein
  2011-07-21 11:58                                           ` Jeff Squyres
@ 2011-07-21 17:53                                           ` Hefty, Sean
       [not found]                                             ` <1828884A29C6694DAF28B7E6B8A82373136F6691-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-07-21 17:53 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shamis, Pavel (shamisp-1Heg1YXhbW8@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> > I've tried to come up with a clean way to determine the lifetime of an xrc
> tgt qp,
> > and I think the best approach is still:
> >
> > 1. Allow the creating process to destroy it at any time, and
> >
> > 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the
> xrc domain
> > or
> > 2b. The creating process specifies during the creation of the tgt qp
> > whether the qp should be destroyed on exit.
> >
> > The MPIs associate an xrc domain with a job, so this should work.
> > Everything else significantly complicates the usage model and
> implementation,
> > both for verbs and the CM.  An application can maintain a reference count
> > out of band with a persistent server and use explicit destruction
> > if they want to share the xrcd across jobs.
> I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/
> unreg_xrc_rcv_qp verbs.  Correct?

I'm suggesting that anyone who wants to share an xrcd across jobs can use out of band communication to maintain their own reference count, rather than pushing that feature into the mainline.  This requires a code change for apps that have coded to OFED and use this feature.

> I have no opinion either way (with regard to tgt qp registration and reference
> counting).
> The OFED xrc implementation was driven by the requirements of the MPI
> community.

From the email threads I followed, it was a request from HP MPI.  The other MPIs have used the same interface since it was what was defined, but do not appear to be sharing the xrcd across jobs.  HP has since canceled their MPI product.

> Regarding option 2b: do you mean that in this case the tgt qp is NOT bound to
> the
> XRC domain lifetime? who destroys the tgt qp in this case when the creator
> indicates
> that the tgt qp should not be destroyed on exit?

With option 2b, the tgt qp lifetime is either tied to the life of the creating process or to the xrcd.  The creating process specifies which on creation.  Basically, the choice allows the creating process to destroy the tgt qp when it exits, rather than waiting until the xrcd is closed.  Note that ibverbs only considers the life of the tgt qp, but we also need to consider the life of a corresponding connection maintained by the IB CM.
 
> I am concerned with backwards compatibility, here.  It seems that XRC users
> will need to
> change their source-code, not just recompile.  I am assuming that OFED will
> take the
> mainstream kernel implementation at some point.  Since this is **userspace**
> code, there could be a problem
> if OFED users upgrade their OFED installation to one which supports the new
> interface.
> This could be especially difficult if, for example, the customer is using 3rd-
> party packages
> which utilize the current OFED xrc interface.  We could start seeing customers
> not take
> new OFED releases solely because of the XRC incompatibility (or worse,
> customers upgrading
> and then finding out that their 3rd-party XRC apps no longer work).

Eventually, the xrc users should change their source code to move away from the OFED compatibility APIs.  An app needs to recompile regardless.  Existing apps will run into issues if they share the xrcd across jobs; in that case, they will leak tgt qps.  There are also issues if an app calls the OFED ibv_modify_xrc_rcv_qp() or ibv_query_xrc_rcv_qp() APIs from a process other than the one which created the qp.  These are the main risks that I see.

> Having a new OFED support BOTH interfaces is a nightmare I don't even want to
> think about!

We're already in a situation where there are multiple libibverbs interfaces.  The OFED compatibility patch to libibverbs was added specifically so that OFED could support both sets of APIs, while being binary compatible with the upstream ibverbs.  The proposed kernel patches do not support the functionality required for the OFED APIs, but it's not clear whether apps are really dependent on that functionality.  (I don't want to make MPI have to change their code right away either.)

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                             ` <1828884A29C6694DAF28B7E6B8A82373136F6691-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-07-26 20:04                                               ` Shamis, Pavel
       [not found]                                                 ` <26AE60A9-D055-4D40-A830-5AADDBA20ED8-1Heg1YXhbW8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-07-26 20:04 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

Please see my notes below.

>>> I've tried to come up with a clean way to determine the lifetime of an xrc
>> tgt qp,
>>> and I think the best approach is still:
>>> 
>>> 1. Allow the creating process to destroy it at any time, and
>>> 
>>> 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the
>> xrc domain
>>> or
>>> 2b. The creating process specifies during the creation of the tgt qp
>>> whether the qp should be destroyed on exit.
>>> 
>>> The MPIs associate an xrc domain with a job, so this should work.
>>> Everything else significantly complicates the usage model and
>> implementation,
>>> both for verbs and the CM.  An application can maintain a reference count
>>> out of band with a persistent server and use explicit destruction
>>> if they want to share the xrcd across jobs.
>> I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/
>> unreg_xrc_rcv_qp verbs.  Correct?
> 
> I'm suggesting that anyone who wants to share an xrcd across jobs can use out of band communication to maintain their own reference count, rather than pushing that feature into the mainline.  This requires a code change for apps that have coded to OFED and use this feature.


Actually, I think it is not such a good idea to manage the reference counter across OOB communication.

A few years ago we had a long discussion among the OFED and MPI communities (HP MPI, Intel MPI, Open MPI, MVAPICH) about the XRC interface definition in OFED.
All of us agreed on the interface that we have today, and so far we have not heard any complaints.
I don't claim that it is an ideal interface, but I would like to clarify the motivation behind the idea of XRC and the XRC API that we have today.

The purpose of XRC is to decrease the amount of resources (QPs) required for user-level communication between multicore nodes.  The primary customer of this protocol is middleware HPC software, and MPI specifically (but not only).  The original intent was to allow sharing a single receive QP between multiple independent processes on the same node.
In order to manage this single resource between multiple processes, a couple of options were discussed:

1. OOB synchronization on MPI level.
Pros:
- It makes life easier for verbs developer :-)

Cons:
- All MPIs will have to implement the same OOB synchronization mechanism.
Potentially it adds a lot of overhead and synchronization code to the MPI implementation, and to be honest, we already have more than
enough MPI code that tries to work around Open Fabrics API limitations.  It would also make MPI-2 dynamic process management much more complicated.

- By definition, the XRC QP is owned by a group of processes that share the same XRC domain; consequently, verbs
should provide a usable API that allows group management for the XRC QP.  The lack of such an API makes
XRC problematic to integrate into HPC communication libraries.

2. Reference counter at the verbs level.

Cons:
- It will probably make life more complicated for the verbs developer.
(Even so, this is no longer relevant, since the code already exists and no new development is required.)

Pros:
- This solution does not introduce any additional overhead for the MPI implementation.
We have an elegant increase/decrease call that manages the reference counter and allows efficient XRC QP management without any extra overhead.
It also does not require any special code for MPI-2 dynamic process management.


Obviously, we decided to go with option #2.  As a result, XRC support was easily adopted by multiple MPI implementations.
And as I mentioned earlier, we haven't heard any complaints.

IMHO, I don't see a good reason to redefine the existing API.
I'm afraid that such an API change will encourage MPI developers to abandon XRC support.

My 0.02$

Regards,

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory


* RE: [RFC] XRC upstream merge reboot
       [not found]                                                 ` <26AE60A9-D055-4D40-A830-5AADDBA20ED8-1Heg1YXhbW8@public.gmane.org>
@ 2011-08-01 15:03                                                   ` Hefty, Sean
       [not found]                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F9075-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-01 15:03 UTC (permalink / raw)
  To: Shamis, Pavel
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> Actually I think it is really not so good idea manage reference counter
> across OOB communication.

But this is exactly what the current API *requires* that users of XRC do!!!  And I agree, it's not a good idea. :)

> IMHO, I don't see a good reason to redefine existing API.
> I afraid, that such API change will encourage MPI developers to abandon XRC
> support.

I'm suggesting that most apps don't need to do anything special.  No OOB connections, no special message exchanges carrying XRC numbers, no reference counting, no new verbs for xrc recv qps that do the same thing as the existing verbs.  Using XRC becomes easier!  The lifetime of the XRC receive QP is simply tied to the lifetime of the XRC domain, unless the creating process wants to destroy it sooner.

The only reason reference counting was added at all was to support a questionable usage model of sharing an xrc domain across jobs.  I'm suggesting that *only those apps* that want that usage model implement it over the provided APIs.  They can continue to use OOB connections, pass XRC numbers, replace ibv_reg_xrc_rcv_qp() with atomic_inc(), and let the persistent server destroy the xrc recv qp when those processes are done.  For everyone else (and it sounds like this is really everyone at this point), there's simply no need for any of that.

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F9075-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-01 16:20                                                       ` Shamis, Pavel
       [not found]                                                         ` <AE625966-FD97-4DBF-A024-22B83B5F3E39-1Heg1YXhbW8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-01 16:20 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP


>> Actually I think it is really not so good idea manage reference counter
>> across OOB communication.
> 
> But this is exactly what the current API *requires* that users of XRC do!!!  And I agree, it's not a good idea. :)

We do have unregister on finalization.  But this code doesn't introduce any synchronization across processes on the same node, since the kernel manages the receive qp.  If the reference counter is moved to the app's responsibility, it will force the app to manage the reference counter at the app level; in other words, it will require some process to be responsible for the QP.  In the context of MPI-2 dynamics, such an approach will make the MPI community's life much more complicated.

> 
>> IMHO, I don't see a good reason to redefine existing API.
>> I afraid, that such API change will encourage MPI developers to abandon XRC
>> support.
> 
> 
> The only reason reference counting was added at all was to support a questionable usage model of sharing an xrc domain across jobs.  

Well, actually it is the primary reason why XRC was introduced in the first place.  XRC helps reduce the number of all-to-all connections per node from NP^2 (N - number of nodes, P - number of processes per node) to NP.
You may find more details here:
http://www.open-mpi.org/papers/euro-pvmmpi-2008-xrc/euro-pvmmpi-2008-xrc.pdf
and here:
http://nowlab.cse.ohio-state.edu/publications/conf-papers/2008/koop-cluster08.pdf



> I'm suggesting that *only those apps* that want that usage model can implement it over the provided APIs.  They can continue to use OOB connections, pass XRC numbers, replace ibv_reg_xrc_rcv_qp() with atomic_inc(), and let the persistent server destroy the xrc recv qp when those processes are done.  For everyone else (and it sounds like this is really everyone at this point), there's

Who will provide this persistent server?  If verbs or some other OFED library provides such a service, plus an API to add/remove/register QPs on that server, then I have no problem.

Regards,
Pasha.

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                         ` <AE625966-FD97-4DBF-A024-22B83B5F3E39-1Heg1YXhbW8@public.gmane.org>
@ 2011-08-01 18:28                                                           ` Hefty, Sean
       [not found]                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9194-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-01 18:28 UTC (permalink / raw)
  To: Shamis, Pavel
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> We do have unregister on finalization. But this code doesn't introduce any
> synchronization across processes on the same node, since the kernel manages the
> receive qp. If the reference counter is moved to app responsibility, it
> will force the app to manage the reference counter at the app level; in other
> words, it will require some process to be responsible for the QP. In the context
> of MPI-2 dynamics, such an approach will make the MPI community's life much more
> complicated.

I'm not sure we're fully communicating here.  I understand the usage model for xrc.

The basic problem is determining the lifetime of the xrc target qp.  One process creates the xrc target qp.  A process also owns connecting the target qp.  Any process can create the xrc domain.

The OFED patches introduce new user and kernel interfaces to create, modify, and query the xrc target qp.  The proposed patches use the existing interfaces.  The OFED patches did not address connection establishment, and the proposed patches do.  In both patch sets, the xrc target qp is shared across multiple processes based on the xrc domain.

The final question is when does an xrc target qp get destroyed.  The proposal is to destroy the qp when the xrc domain is destroyed.  No OOB communication between local processes or reg/unreg is necessary.  (The main reason for the OFED patches requiring OOB communication and reg/unreg is to support sharing an xrc domain across *jobs*.  The proposed patches would require code changes to apps currently written to the OFED patches that want to share an xrc domain across jobs.  Other apps are likely to work unchanged.)

The trade-off between the two patch sets is an easier usage model versus slightly more flexibility.  In the past, MPI developers have pushed hard for easier RDMA connection models.  And from recent email exchanges with the developers, the additional flexibility is not used.

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9194-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-02 10:44                                                               ` Jack Morgenstein
       [not found]                                                                 ` <201108021344.25284.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2011-08-02 19:08                                                               ` Shamis, Pavel
  1 sibling, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-08-02 10:44 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Shamis, Pavel,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Monday 01 August 2011 21:28, Hefty, Sean wrote:

From Pavel Shamis:
> > We do have unregister on finalization. But this code doesn't introduce any
> > synchronization across processes on the same node, since the kernel manages the
> > receive qp. If the reference counter is moved to app responsibility, it
> > will force the app to manage the reference counter at the app level; in other
> > words, it will require some process to be responsible for the QP. In the context
> > of MPI-2 dynamics, such an approach will make the MPI community's life much more
> > complicated.
> 
Why can't the server allocate a new domain per job?  Who creates the target QP? -- can't the
target QP creator create the domain (instead of the server), and provide the domain handle to the server? 
Once the calculation gets started (with other clients opening that domain and
creating XRC SRQs to receive messages via the TGT QP),
the TGT QP creator can dealloc the xrc domain and exit (without destroying the TGT QP).
The xrc domain will not actually be deallocated in the low-level driver until all XRC SRQ clients
also dealloc the domain (reducing its reference count to zero).
At that point, all the domain's TGT QPs will be destroyed as well.

We would, in fact, use opening/allocating the XRC domain, which is done anyway, instead of registering
"interest" in a specific TGT QP, as the means of controlling target QP lifetime.

Is this an option?
-Jack


* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                 ` <201108021344.25284.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-08-02 16:29                                                                   ` Shamis, Pavel
       [not found]                                                                     ` <32D25205-3E9C-4757-B0AB-7117BDF3F2F7-1Heg1YXhbW8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-02 16:29 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Hefty, Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

Hi Jack,
Please see my comments below

> From Pavel Shamis:
>>> We do have unregister on finalization. But this code doesn't introduce any
>>> synchronization across processes on the same node, since the kernel manages the
>>> receive qp. If the reference counter is moved to app responsibility, it
>>> will force the app to manage the reference counter at the app level; in other
>>> words, it will require some process to be responsible for the QP. In the context
>>> of MPI-2 dynamics, such an approach will make the MPI community's life much more
>>> complicated.
>> 
> Why can't the server allocate a new domain per job?  Who creates the target QP? -- can't the
> target QP creator create the domain (instead of the server), and provide the domain handle to the server? 

The XRC domain is created by the process that starts first.  All the other processes that belong to the same MPI session and reside on the same node join the domain.
The TGT QP is created by the process that receives an inbound connection first, and it is not necessarily the same process that created the domain. Even so, we assume that both processes belong to the same domain and to the same MPI session.

> Once the calculation gets started (with other clients opening that domain and
> creating XRC SRQs to receive messages via the TGT QP),
> the TGT QP creator can dealloc the xrc domain and exit (without destroying the TGT QP).
> The xrc domain will not actually be deallocated in the low-level driver until all XRC SRQ clients
> also dealloc the domain (reducing its reference count to zero).
> At that point, all the domain's TGT QPs will be destroyed as well.

Does that mean the TGT QP is likewise allocated in the low-level driver? If so, I have no problem with this approach.


> 
> We would, in fact, use opening/allocating the XRC domain, which is done anyway, instead of registering
> "interest" in a specific TGT QP, as the means of controlling target QP lifetime.
> 
> Is this an option?

Sure, sounds good.

Regards,

Pasha

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9194-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2011-08-02 10:44                                                               ` Jack Morgenstein
@ 2011-08-02 19:08                                                               ` Shamis, Pavel
       [not found]                                                                 ` <EABE213A-448A-45F8-B131-AE1EE3F9547F-1Heg1YXhbW8@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-02 19:08 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP


>> We do have unregister on finalization. But this code doesn't introduce any
>> synchronization across processes on the same node, since the kernel manages the
>> receive qp. If the reference counter is moved to app responsibility, it
>> will force the app to manage the reference counter at the app level; in other
>> words, it will require some process to be responsible for the QP. In the context
>> of MPI-2 dynamics, such an approach will make the MPI community's life much more
>> complicated.
> 
> I'm not sure we're fully communicating here.  I understand the usage model for xrc.

Ok.

> 
> The basic problem is determining the lifetime of the xrc target qp.  One process creates the xrc target qp.  A process also owns connecting the target qp.  Any process can create the xrc domain.

If the target QP is opened in the low-level driver, then it's owned by the group of processes that share the same XRC domain.
And, as I mentioned in the reply to Jack, I totally agree that the maximum lifetime of the target QP is bound to the XRC domain lifetime,
even so there should be a way to destroy the QP before XRC domain destruction. My main point is that the target qp should be maintained by the low-level driver and not by a specific MPI process (as a send qp is).

Regards,

Pasha

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                 ` <EABE213A-448A-45F8-B131-AE1EE3F9547F-1Heg1YXhbW8@public.gmane.org>
@ 2011-08-02 21:25                                                                   ` Hefty, Sean
       [not found]                                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F962C-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-02 21:25 UTC (permalink / raw)
  To: Shamis, Pavel
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> If the target QP is opened in low level driver, then it's owned by group of
> processes that share the same XRC domain.

Can you define what you mean by 'owned'?

With the latest patches, the target qp is created in the kernel.  Data received on the target qp can go to any process sharing the associated xrc domain.  However, only the creating process is permitted to modify the qp.

> And , as I mentioned in the reply to Jack, I totally agree that maximum life
> time of target QP is bound to XRC domain life time,

This is the main point I was trying to get agreement on from the MPI developers, and it appears that everyone agrees now.

> even so there should be a way to destroy the QP before XRC domain destruction.

This is also doable with the latest patches.  The process which creates the tgt qp has the ability to explicitly destroy it.

> main point is that the target qp should be maintained by low level driver and
> not specific MPI process (like send qp)

As with 'owned', can you clarify what you mean by 'maintained'?

The target qp can continue to exist even after the creating process exits.

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F962C-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-02 23:00                                                                       ` Shamis, Pavel
       [not found]                                                                         ` <DE779D97-F54F-45E4-B3D4-DBEB10F9302D-1Heg1YXhbW8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-02 23:00 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Aug 2, 2011, at 5:25 PM, Hefty, Sean wrote:

>> If the target QP is opened in low level driver, then it's owned by group of
>> processes that share the same XRC domain.
> 
> Can you define what you mean by 'owned'?
> 
> With the latest patches, the target qp is created in the kernel.  Data received on the target qp can go to any process sharing the associated xrc domain.  However, only the creating process is permitted to modify the qp.

Owned - the group of processes may use the tgt qp to receive data.
So it seems that we are on the same page here.

BTW, did we have the same limitation/feature (only the creating process is allowed to modify) in the original XRC driver?


> 
>> And , as I mentioned in the reply to Jack, I totally agree that maximum life
>> time of target QP is bound to XRC domain life time,
> 
> This is the main point I was trying to get agreement on from the MPI developers, and it appears that everyone agrees now.

Indeed.

> 
>> even so there should be a way to destroy the QP before XRC domain destruction.
> 
> This is also doable with the latest patches.  The process which creates the tgt qp has the ability to explicitly destroy it.

Hmm, is there a way to destroy the QP when the original process does not exist anymore?
Some MPIs implement network fault tolerance mechanisms over IB. It means that if a QP (or device) enters the error state, there should be a way to destroy the specific QP and open a new one.


> 
>> main point is that the target qp should be maintained by low level driver and
>> not specific MPI process (like send qp)
> 
> As with 'owned', can you clarify what you mean by 'maintained'?
> 
> The target qp can continue to exist even after the creating process exits.

This is what I was talking about.


Regards,
P

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                         ` <DE779D97-F54F-45E4-B3D4-DBEB10F9302D-1Heg1YXhbW8@public.gmane.org>
@ 2011-08-02 23:53                                                                           ` Hefty, Sean
       [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F967E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-02 23:53 UTC (permalink / raw)
  To: Shamis, Pavel, Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> BTW, did we have the same limitation/feature (only creating process is allowed
> to modify) in  original XRC driver ?

I'm not certain about the implementation, but the OFED APIs would allow any process within the xrc domain to modify the qp.

> Hmm, is there a way to destroy the QP when the original process does not exist
> anymore?

The only way to destroy the tgt qp after the creating process exits is for the xrc domain to be destroyed.  Jack and I have discussed the possibility of having the kernel destroy the qp on error or ib cm disconnect.  I'm not sure how likely it is that a tgt qp will enter the error state.  If it's possible under a fairly normal use case, I'll start on a separate patch to handle that case.

> Some MPIs implement network fault tolerance mechanisms over IB. It means that
> if a QP (or device) enters the error state, there should be a way to destroy the
> specific QP and open a new one.

Note that a tgt qp should consume minimal resources, so it may not be a big deal to just leave it around.  A new qp can be connected regardless. 
 
- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                     ` <32D25205-3E9C-4757-B0AB-7117BDF3F2F7-1Heg1YXhbW8@public.gmane.org>
@ 2011-08-03 10:37                                                                       ` Jack Morgenstein
       [not found]                                                                         ` <201108031337.24527.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-08-03 10:37 UTC (permalink / raw)
  To: Shamis, Pavel
  Cc: Hefty, Sean,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Tuesday 02 August 2011 19:29, Shamis, Pavel wrote:
> The XRC domain is created by the process that starts first.  All the other processes that belong
> to the same MPI session and reside on the same node join the domain.
> The TGT QP is created by the process that receives an inbound connection first, and it is not
> necessarily the same process that created the domain. Even so, we assume that both processes
> belong to the same domain and to the same MPI session.
> 
The only things that are important here are:
1. Before the TGT QP creator exits (de-allocating its domain), there is at least one other
   process active which has opened the same domain (so that the domain, and the TGT QP
   are not de-allocated when the creator exits, which would clobber the calculation).

   Note that this condition probably exists already in MPI -- if the creator had the only
   domain reference, then the domain would be de-allocated when the creator exited,
   and the calculation would not work anyway.

2. When the job is finished, all processes have de-allocated the XRC domain -- so that the
   domain gets de-allocated and all its TGT QPs destroyed. (i.e., the domain's lifetime is the job).

If these 2 conditions are met, there is absolutely no justification for TGT QP reference counting.
The domain reference count is good enough -- when the domain reference count goes to zero,
the domain is de-allocated and all its TGT QPs destroyed.

Things only get complicated when the domain-allocator process allocates a single domain and simply
uses that single domain for all jobs (i.e., the domain is never de-allocated for the lifetime of the
allocating process, and the allocating process is the server for all jobs).

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F967E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-03 13:47                                                                               ` Shamis, Pavel
       [not found]                                                                                 ` <5C691E518F345F4882FAB9E9839E60BA0BCA4622F4-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-03 13:47 UTC (permalink / raw)
  To: 'Hefty, Sean', Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> 
> > BTW, did we have the same limitation/feature (only the creating process is allowed
> > to modify) in the original XRC driver?
> 
> I'm not certain about the implementation, but the OFED APIs would allow
> any process within the xrc domain to modify the qp.
> 
> > Hmm, is there a way to destroy the QP when the original process does not exist
> > anymore?
> 
> The only way to destroy the tgt qp after the creating process exits is for the
> xrc domain to be destroyed.  Jack and I have discussed the possibility of
> having the kernel destroy the qp on error or ib cm disconnect.  I'm not sure
> how likely it is that a tgt qp will enter the error state.  If it's possible
> under a fairly normal use case, I'll start on a separate patch to handle that
> case.
> 

Well, actually I was thinking about APM. If the "creator" exits, we do not have a way to
upload an alternative path.
 

> > Some MPIs implement network fault tolerance mechanisms over IB. It means that
> > if a QP (or device) enters the error state, there should be a way to destroy the
> > specific QP and open a new one.
> 
> Note that a tgt qp should consume minimal resources, so it may not be a big
> deal to just leave it around.  A new qp can be connected regardless.
>

Agree.

Regards,
Pasha

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                 ` <5C691E518F345F4882FAB9E9839E60BA0BCA4622F4-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
@ 2011-08-03 15:52                                                                                   ` Hefty, Sean
       [not found]                                                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F97AC-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-03 15:52 UTC (permalink / raw)
  To: Shamis, Pavel, Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> Well, actually I was thinking about APM. If the "creator" exits, we do not
> have a way to upload an alternative path.

Correct - that would be a limitation.  You would need to move to a new tgt qp.  In a general solution, this involves not only allowing other processes to modify the QP, but also sharing at the ib cm level.  It may be possible to have the kernel respond to CM LAP messages without any user space intervention in such a case.

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F97AC-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-03 20:22                                                                                       ` Shamis, Pavel
       [not found]                                                                                         ` <5C691E518F345F4882FAB9E9839E60BA0BCA462300-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-03 20:22 UTC (permalink / raw)
  To: 'Hefty, Sean', Jack Morgenstein
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP


> > Well, actually I was thinking about APM. If the "creator" exits, we do not
> > have a way to upload an alternative path.
> 
> Correct - that would be a limitation.  You would need to move to a new tgt
> qp.

Well, in Open MPI we have XRC code that uses APM. 
If Mellanox cares about the feature, they would have to rework this part of the code in Open MPI.
I don't know about other apps.

Regards,
Pasha


* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                         ` <5C691E518F345F4882FAB9E9839E60BA0BCA462300-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
@ 2011-08-03 20:49                                                                                           ` Hefty, Sean
       [not found]                                                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9A13-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-03 20:49 UTC (permalink / raw)
  To: Shamis, Pavel, Jack Morgenstein,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org)
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> Well, in Open MPI we have XRC code that uses APM.
> If Mellanox cares about the feature, they would have to rework this part of
> code in Open MPI.
> I don't know about other apps.

But does the APM implementation expect some process other than the creator to be able to modify the QP?

APM would still be supported. 


* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9A13-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-03 21:16                                                                                               ` Shamis, Pavel
       [not found]                                                                                                 ` <5C691E518F345F4882FAB9E9839E60BA0BCA462301-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-03 21:16 UTC (permalink / raw)
  To: 'Hefty, Sean',
	Jack Morgenstein,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org)
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> 
> > Well, in Open MPI we have XRC code that uses APM.
> > If Mellanox cares about the feature, they would have to rework this part of
> > code in Open MPI.
> > I don't know about other apps.
> 
> But does the APM implementation expect some process other than
> the creator to be able to modify the QP?

It was a long time ago and I do not remember all the details. At first glance it seems that it is not bound to a specific process.
I would suggest that Mellanox review this part of the code.

Regards,
Pasha.

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                                                 ` <5C691E518F345F4882FAB9E9839E60BA0BCA462301-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
@ 2011-08-03 21:36                                                                                                   ` Jason Gunthorpe
       [not found]                                                                                                     ` <20110803213642.GE28465-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2011-08-03 21:36 UTC (permalink / raw)
  To: Shamis, Pavel
  Cc: 'Hefty, Sean',
	Jack Morgenstein,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Wed, Aug 03, 2011 at 05:16:17PM -0400, Shamis, Pavel wrote:
> > 
> > > Well, in Open MPI we have XRC code that uses APM.
> > > If Mellanox cares about the feature, they would have to rework this part of
> > > code in Open MPI.
> > > I don't know about other apps.
> > 
> > But does the APM implementation expect some process other than
> > the creator to be able to modify the QP?
> 
> It was a long time ago and I do not remember all the details. At
> first glance it seems that it is not bound to a specific process.  I
> would suggest that Mellanox review this part of the code.

I'd say a basic starting point would be this question:

Where does the ib_verbs async event for APM state change get routed for
XRC? Does the event have enough info to identify all the necessary
parts? Can the process that receives the event do something about it?

Jason

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                                     ` <20110803213642.GE28465-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2011-08-04  0:06                                                                                                       ` Hefty, Sean
       [not found]                                                                                                         ` <1828884A29C6694DAF28B7E6B8A82373136F9A7E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-04  0:06 UTC (permalink / raw)
  To: Jason Gunthorpe, Shamis, Pavel
  Cc: Jack Morgenstein,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> Where does the ib_verbs async event for APM state change get routed for
> XRC?

The OFED APIs route QP events to all processes which register for that qp number.

> Does the event have enough info to identify all the necessary
> parts?

The event carries the qp number only.

> Can the process that receives the event do something about it?

*shrugs*  MPI devs?

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                                                         ` <1828884A29C6694DAF28B7E6B8A82373136F9A7E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-04  4:05                                                                                                           ` Jason Gunthorpe
       [not found]                                                                                                             ` <20110804040503.GA13935-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Jason Gunthorpe @ 2011-08-04  4:05 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Shamis, Pavel, Jack Morgenstein,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Thu, Aug 04, 2011 at 12:06:24AM +0000, Hefty, Sean wrote:
> > Where does the ib_verbs async event for APM state change get routed for
> > XRC?
> 
> The OFED APIs route QP events to all processes which register for
> that qp number.
 
?? How do you register for an event? There is only
ibv_get_async_event(3) - I thought it returned all events relevant to
the associated verbs context.

Jason

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                                             ` <20110804040503.GA13935-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2011-08-04  5:53                                                                                                               ` Hefty, Sean
  0 siblings, 0 replies; 53+ messages in thread
From: Hefty, Sean @ 2011-08-04  5:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shamis, Pavel, Jack Morgenstein,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> ?? How do you register for an event? There is only
> ibv_get_async_event(3) - I thought it returned all events relevant to
> the associated verbs context.

The OFED APIs for managing XRC receive QPs are:

int (*create_xrc_rcv_qp)(struct ibv_qp_init_attr *init_attr,
                         uint32_t *xrc_qp_num);
int (*modify_xrc_rcv_qp)(struct ibv_xrc_domain *xrc_domain,
                         uint32_t xrc_qp_num,
                         struct ibv_qp_attr *attr,
                         int attr_mask);
int (*query_xrc_rcv_qp)(struct ibv_xrc_domain *xrc_domain,
                        uint32_t xrc_qp_num,
                        struct ibv_qp_attr *attr,
                        int attr_mask,
                        struct ibv_qp_init_attr *init_attr);
int (*reg_xrc_rcv_qp)(struct ibv_xrc_domain *xrc_domain,
                      uint32_t xrc_qp_num);
int (*unreg_xrc_rcv_qp)(struct ibv_xrc_domain *xrc_domain,
                        uint32_t xrc_qp_num);

The ibv_reg_xrc_rcv_qp() call bumped a reference count on the receive QP and registered for its events.  The event structure was modified to include the xrc_qp_num, along with a flag indicating that the event was for an XRC QP.

(To clarify: my patches do not include these calls, so event registration isn't possible and would need to be handled above verbs.)
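As a toy model of that fan-out behavior (all names and layouts below are illustrative stand-ins, not the actual OFED ABI), event delivery to every consumer registered for a given XRC QP number can be sketched as:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CONSUMERS 8

/* Stand-in for the modified async event: carries only the QP number
 * plus a flag marking it as an XRC TGT QP event. */
struct xrc_event {
    int      is_xrc_qp;
    uint32_t xrc_qp_num;
};

struct consumer {
    uint32_t qp_num;       /* QP number this consumer registered for */
    int      events_seen;
};

static struct consumer consumers[MAX_CONSUMERS];
static int nconsumers;

/* ibv_reg_xrc_rcv_qp() analogue: record who wants events for qp_num. */
static void reg_consumer(uint32_t qp_num)
{
    consumers[nconsumers].qp_num = qp_num;
    consumers[nconsumers].events_seen = 0;
    nconsumers++;
}

/* Deliver one event to every consumer registered for its QP number;
 * returns how many copies were delivered. */
static int dispatch(const struct xrc_event *ev)
{
    int delivered = 0;
    for (int i = 0; i < nconsumers; i++)
        if (consumers[i].qp_num == ev->xrc_qp_num) {
            consumers[i].events_seen++;
            delivered++;
        }
    return delivered;
}
```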

- Sean

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                         ` <201108031337.24527.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-08-10 22:20                                                                           ` Hefty, Sean
       [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A8237316E3E55C-Q3cL8pyY+6ukrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Hefty, Sean @ 2011-08-10 22:20 UTC (permalink / raw)
  To: Jack Morgenstein, Shamis, Pavel
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> Things only get complicated when the domain-allocator process allocates a
> single domain and simply
> uses that single domain for all jobs (i.e., the domain is never de-allocated
> for the lifetime of the
> allocating process, and the allocating process is the server for all jobs).

To help with OFED feature level compatibility, I'm in the process of adding a new call to ibverbs:

struct ib_qp_open_attr {
	void (*event_handler)(struct ib_event *, void *);
	void  *qp_context;
	u32    qp_num;
};

struct ib_qp *ib_open_qp(*xrcd, *qp_open_attr);

This provides functionality similar to ib_reg_xrc_recv_qp().  It allows any process within the xrcd to modify the tgt qp and receive events.  Its use is not required, so we can support both usage models.

I believe we can define this generically enough that it could eventually be used to share other QP types among multiple processes, though more infrastructure would be needed.
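As a rough sketch of how a second process might attach to an existing TGT QP with the proposed call -- only ib_qp_open_attr and the ib_open_qp() signature above come from the actual proposal; every type and function body below is a stub for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stubs only -- the real definitions would live in
 * include/rdma/ib_verbs.h. */
struct ib_event { int event_type; };
struct ib_xrcd  { int unused; };
struct ib_qp    { uint32_t qp_num; void *qp_context; };

struct ib_qp_open_attr {
    void (*event_handler)(struct ib_event *, void *);
    void     *qp_context;
    uint32_t  qp_num;
};

/* Stubbed ib_open_qp(): in the proposal this would look up the existing
 * TGT QP inside the xrcd by qp_num and attach the caller to it, so the
 * caller can modify it and receive its events. */
static struct ib_qp *ib_open_qp(struct ib_xrcd *xrcd,
                                struct ib_qp_open_attr *attr)
{
    static struct ib_qp qp;
    (void)xrcd;
    qp.qp_num     = attr->qp_num;
    qp.qp_context = attr->qp_context;
    return &qp;
}

static int apm_events;

/* Handler a process could install to react to, e.g., path migration. */
static void on_qp_event(struct ib_event *ev, void *ctx)
{
    (void)ev; (void)ctx;
    apm_events++;
}
```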

- Sean

* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A8237316E3E55C-Q3cL8pyY+6ukrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-08-11 14:12                                                                               ` Shamis, Pavel
  2011-08-21 14:42                                                                               ` Jack Morgenstein
  1 sibling, 0 replies; 53+ messages in thread
From: Shamis, Pavel @ 2011-08-11 14:12 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jack Morgenstein,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

I think it's a good idea to support both usage models.

Regards,

Pasha.


>> Things only get complicated when the domain-allocator process allocates a
>> single domain and simply
>> uses that single domain for all jobs (i.e., the domain is never de-allocated
>> for the lifetime of the
>> allocating process, and the allocating process is the server for all jobs).
> 
> To help with OFED feature level compatibility, I'm in the process of adding a new call to ibverbs:
> 
> struct ib_qp_open_attr {
> 	void (*event_handler)(struct ib_event *, void *);
> 	void  *qp_context;
> 	u32    qp_num;
> };
> 
> struct ib_qp *ib_open_qp(*xrcd, *qp_open_attr);
> 
> This provides functionality similar to ib_reg_xrc_recv_qp().  It allows any process within the xrcd to modify the tgt qp and receive events.  Its use is not required, so we can support both usage models.
> 
> I believe we can define this generically enough that it could eventually be used to share other QP types among multiple processes, though more infrastructure would be needed.
> 
> - Sean


* Re: [RFC] XRC upstream merge reboot
       [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A8237316E3E55C-Q3cL8pyY+6ukrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2011-08-11 14:12                                                                               ` Shamis, Pavel
@ 2011-08-21 14:42                                                                               ` Jack Morgenstein
       [not found]                                                                                 ` <201108211742.18803.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 53+ messages in thread
From: Jack Morgenstein @ 2011-08-21 14:42 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Shamis, Pavel,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

On Thursday 11 August 2011 01:20, Hefty, Sean wrote:
> To help with OFED feature level compatibility, I'm in the process of adding a new call to ibverbs:
> 
> struct ib_qp_open_attr {
>         void (*event_handler)(struct ib_event *, void *);
>         void  *qp_context;
>         u32    qp_num;
> };
> 
> struct ib_qp *ib_open_qp(*xrcd, *qp_open_attr);
> 
> This provides functionality similar to ib_reg_xrc_recv_qp().  It allows any process within
> the xrcd to modify the tgt qp and receive events. 
> Its use is not required, so we can support both usage models. 
> 
> I believe we can define this generically enough that it could eventually be used
> to share other QP types among multiple processes, though more infrastructure would be needed. 
> 
Hi Sean,
Sorry for not replying until now -- I was on vacation for two weeks, and got back to work only today.

I am a bit concerned here.  In the current usage model, target QPs are destroyed when their reference count goes to zero
(ib_reg_xrc_recv_qp and ibv_xrc_create_qp increment the reference count, while ib_unreg_xrc_recv_qp decrements it).
In this model, the TGT QP user/consumer does not need to know if it is the last user of the QP (and therefore should
destroy it).  The QP simply gets destroyed when no one is left using it.

In your proposed model, it looks like the last TGT QP user needs to know that it is the last user and must therefore destroy
the TGT QP (rather than the QP being destroyed automatically as the result of the ref count going to zero).

Am I correct?

(Or does every user -- both the creator and each caller of ib_open_qp -- do an ib_destroy_qp(), with the destroy
actually occurring when no users who did an open/create remain?)

-Jack

* RE: [RFC] XRC upstream merge reboot
       [not found]                                                                                 ` <201108211742.18803.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-08-22 15:46                                                                                   ` Hefty, Sean
  0 siblings, 0 replies; 53+ messages in thread
From: Hefty, Sean @ 2011-08-22 15:46 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Shamis, Pavel,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	tziporet-VPRAkNaXOzVS1MOuV/RT9w, dotanb-VPRAkNaXOzVS1MOuV/RT9w,
	Jeff Squyres (jsquyres-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org),
	Shumilin, Victor, Truschin, Vladimir, Devendar Bureddy,
	mvapich-core-wPOY3OvGL++pAIv7I8X2sze48wsgrGvP

> I am a bit concerned here.  In the current usage model, target QPs are destroyed when their reference
> count goes to zero
> (ib_reg_xrc_recv_qp and ibv_xrc_create_qp increment the reference count, while ib_unreg_xrc_recv_qp
> decrements it).
> In this model, the TGT QP user/consumer does not need to know if it is the last user of the QP (and
> therefore should
> destroy it).  The QP simply gets destroyed when no one is left using it.
> 
> In your proposed model, it looks like the last TGT QP user needs to know that it is the last user and
> must therefore destroy
> the TGT QP (rather than the QP being destroyed automatically as the result of the ref count going to
> zero).
> 
> Am I correct?
> 
> (Or does every user -- both the creator and each caller of ib_open_qp -- do an ib_destroy_qp(), with the
> destroy actually
> occurring when no users who did an open/create remain?)

The latter is correct.  The callers of ib_create_qp() and ib_open_qp() each call ib_destroy_qp() when they are done with the QP; the QP is actually destroyed once the reference count hits zero.
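A toy model of that lifetime rule -- function names mirror the verbs calls, but the bodies are illustrative only: both create and open take a reference, every user drops its reference via destroy, and the underlying QP is only freed on the last drop.

```c
#include <assert.h>

struct tgt_qp {
    int refcnt;
    int freed;    /* set once the underlying QP is really destroyed */
};

/* ib_create_qp() analogue: the first user takes the initial reference. */
static void create_qp(struct tgt_qp *qp)
{
    qp->refcnt = 1;
    qp->freed  = 0;
}

/* ib_open_qp() analogue: each additional user takes a reference. */
static void open_qp(struct tgt_qp *qp)
{
    qp->refcnt++;
}

/* ib_destroy_qp() analogue: every user (creator or opener) calls this;
 * actual destruction happens only when the last reference is dropped. */
static void destroy_qp(struct tgt_qp *qp)
{
    if (--qp->refcnt == 0)
        qp->freed = 1;
}
```

No user needs to know whether it is the last one; the count takes care of it, matching the behavior Jack described for the OFED reg/unreg model.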

end of thread, other threads:[~2011-08-22 15:46 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-16 21:13 [RFC] XRC upstream merge reboot Hefty, Sean
     [not found] ` <1828884A29C6694DAF28B7E6B8A82373F7AB-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-05-18 14:54   ` Jack Morgenstein
     [not found]     ` <201105181754.33759.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-05-18 15:27       ` Hefty, Sean
2011-06-22  7:17       ` Jack Morgenstein
     [not found]         ` <201106221017.06212.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-06-22 16:14           ` Hefty, Sean
     [not found]             ` <1828884A29C6694DAF28B7E6B8A82373029A95-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-06-22 17:03               ` Jack Morgenstein
     [not found]                 ` <201106222003.50214.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-06-22 18:19                   ` Hefty, Sean
     [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373029B3F-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-06-22 19:21                       ` Jack Morgenstein
     [not found]                         ` <201106222221.05993.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-06-22 19:32                           ` Tziporet Koren
2011-06-22 19:57                           ` Hefty, Sean
     [not found]                             ` <1828884A29C6694DAF28B7E6B8A82373029BDE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-06-23  6:11                               ` Jack Morgenstein
2011-06-23  6:35                               ` Jack Morgenstein
     [not found]                                 ` <201106230935.07425.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-06-23 18:03                                   ` Hefty, Sean
2011-07-20 18:51                                   ` Hefty, Sean
     [not found]                                     ` <1828884A29C6694DAF28B7E6B8A82373136F63B9-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-07-21  7:38                                       ` Jack Morgenstein
     [not found]                                         ` <201107211038.23000.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-07-21  7:57                                           ` Jack Morgenstein
2011-07-21 11:58                                           ` Jeff Squyres
     [not found]                                             ` <D8276D45-5FE8-464C-B3A4-14404DE8C760-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2011-07-21 12:47                                               ` Jack Morgenstein
     [not found]                                                 ` <201107211547.31850.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-07-21 12:46                                                   ` Jeff Squyres
2011-07-21 16:06                                                   ` Hefty, Sean
2011-07-21 17:53                                           ` Hefty, Sean
     [not found]                                             ` <1828884A29C6694DAF28B7E6B8A82373136F6691-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-07-26 20:04                                               ` Shamis, Pavel
     [not found]                                                 ` <26AE60A9-D055-4D40-A830-5AADDBA20ED8-1Heg1YXhbW8@public.gmane.org>
2011-08-01 15:03                                                   ` Hefty, Sean
     [not found]                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F9075-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-01 16:20                                                       ` Shamis, Pavel
     [not found]                                                         ` <AE625966-FD97-4DBF-A024-22B83B5F3E39-1Heg1YXhbW8@public.gmane.org>
2011-08-01 18:28                                                           ` Hefty, Sean
     [not found]                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9194-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-02 10:44                                                               ` Jack Morgenstein
     [not found]                                                                 ` <201108021344.25284.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-08-02 16:29                                                                   ` Shamis, Pavel
     [not found]                                                                     ` <32D25205-3E9C-4757-B0AB-7117BDF3F2F7-1Heg1YXhbW8@public.gmane.org>
2011-08-03 10:37                                                                       ` Jack Morgenstein
     [not found]                                                                         ` <201108031337.24527.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-08-10 22:20                                                                           ` Hefty, Sean
     [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A8237316E3E55C-Q3cL8pyY+6ukrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-11 14:12                                                                               ` Shamis, Pavel
2011-08-21 14:42                                                                               ` Jack Morgenstein
     [not found]                                                                                 ` <201108211742.18803.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-08-22 15:46                                                                                   ` Hefty, Sean
2011-08-02 19:08                                                               ` Shamis, Pavel
     [not found]                                                                 ` <EABE213A-448A-45F8-B131-AE1EE3F9547F-1Heg1YXhbW8@public.gmane.org>
2011-08-02 21:25                                                                   ` Hefty, Sean
     [not found]                                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F962C-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-02 23:00                                                                       ` Shamis, Pavel
     [not found]                                                                         ` <DE779D97-F54F-45E4-B3D4-DBEB10F9302D-1Heg1YXhbW8@public.gmane.org>
2011-08-02 23:53                                                                           ` Hefty, Sean
     [not found]                                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F967E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-03 13:47                                                                               ` Shamis, Pavel
     [not found]                                                                                 ` <5C691E518F345F4882FAB9E9839E60BA0BCA4622F4-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
2011-08-03 15:52                                                                                   ` Hefty, Sean
     [not found]                                                                                     ` <1828884A29C6694DAF28B7E6B8A82373136F97AC-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-03 20:22                                                                                       ` Shamis, Pavel
     [not found]                                                                                         ` <5C691E518F345F4882FAB9E9839E60BA0BCA462300-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
2011-08-03 20:49                                                                                           ` Hefty, Sean
     [not found]                                                                                             ` <1828884A29C6694DAF28B7E6B8A82373136F9A13-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-03 21:16                                                                                               ` Shamis, Pavel
     [not found]                                                                                                 ` <5C691E518F345F4882FAB9E9839E60BA0BCA462301-vxojlfkN5A++qDdrU24kdQ@public.gmane.org>
2011-08-03 21:36                                                                                                   ` Jason Gunthorpe
     [not found]                                                                                                     ` <20110803213642.GE28465-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2011-08-04  0:06                                                                                                       ` Hefty, Sean
     [not found]                                                                                                         ` <1828884A29C6694DAF28B7E6B8A82373136F9A7E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-08-04  4:05                                                                                                           ` Jason Gunthorpe
     [not found]                                                                                                             ` <20110804040503.GA13935-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2011-08-04  5:53                                                                                                               ` Hefty, Sean
2011-05-18 16:44   ` Roland Dreier
     [not found]     ` <BANLkTimWMU9ohSQGYEEnFR0HbBaypFR51A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-05-18 17:02       ` Jason Gunthorpe
     [not found]         ` <20110518170226.GA2595-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2011-05-18 17:30           ` Hefty, Sean
     [not found]             ` <1828884A29C6694DAF28B7E6B8A82373FBC7-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-05-18 18:05               ` Jason Gunthorpe
     [not found]                 ` <20110518180519.GA11860-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2011-05-18 18:13                   ` Hefty, Sean
     [not found]                     ` <1828884A29C6694DAF28B7E6B8A82373FC13-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-05-18 18:22                       ` Jason Gunthorpe
2011-05-18 19:22               ` Roland Dreier
     [not found]                 ` <BANLkTi=cLjErM3pKzihyFtGWZ0kSu9BiPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-05-19  5:29                   ` Hefty, Sean
