Linux-RDMA Archive on lore.kernel.org
* Finding the namespace of a struct ib_device
@ 2020-09-03 14:02 Ka-Cheong Poon
  2020-09-03 17:39 ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-03 14:02 UTC (permalink / raw)
  To: linux-rdma

When a struct ib_client's add() function is called, is there a
supported way to find out the namespace of the passed-in
struct ib_device?  There is rdma_dev_access_netns(), but it does
not return the namespace.  It seems that something like the
following is needed.

struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
{
        return read_pnet(&ib_dev->coredev.rdma_net);
}
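
For illustration, a client's add() callback could then record the namespace like this. This is a hypothetical sketch: rdma_dev_to_netns() is the helper proposed above, not an existing API, and my_client_alloc_state() is an invented name.

```c
/* Hypothetical sketch: look up which net namespace the device lives
 * in when the client is attached to it.  rdma_dev_to_netns() is the
 * helper proposed above, not an existing kernel API. */
static int my_client_add(struct ib_device *ib_dev)
{
	struct net *net = rdma_dev_to_netns(ib_dev);

	/* Allocate client state tied to both the device and its
	 * namespace, so it can be found again at cleanup time. */
	return my_client_alloc_state(ib_dev, net);	/* hypothetical */
}
```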

Comments?

Thanks.


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Finding the namespace of a struct ib_device
  2020-09-03 14:02 Finding the namespace of a struct ib_device Ka-Cheong Poon
@ 2020-09-03 17:39 ` Jason Gunthorpe
  2020-09-04  4:01   ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-09-03 17:39 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
> When a struct ib_client's add() function is called. is there a
> supported method to find out the namespace of the passed in
> struct ib_device?  There is rdma_dev_access_netns() but it does
> not return the namespace.  It seems that it needs to have
> something like the following.
> 
> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
> {
>        return read_pnet(&ib_dev->coredev.rdma_net);
> }
> 
> Comments?

I suppose, but why would something need this?

Jason


* Re: Finding the namespace of a struct ib_device
  2020-09-03 17:39 ` Jason Gunthorpe
@ 2020-09-04  4:01   ` Ka-Cheong Poon
  2020-09-04 11:32     ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-04  4:01 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
> On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
>> When a struct ib_client's add() function is called. is there a
>> supported method to find out the namespace of the passed in
>> struct ib_device?  There is rdma_dev_access_netns() but it does
>> not return the namespace.  It seems that it needs to have
>> something like the following.
>>
>> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
>> {
>>         return read_pnet(&ib_dev->coredev.rdma_net);
>> }
>>
>> Comments?
> 
> I suppose, but why would something need this?


If the client needs to allocate per-namespace state related to
that device, it needs to know the device's namespace.  Then,
when that namespace is deleted, the client can clean up that
state, since the client's namespace exit function can be called
before the remove() function is triggered in rdma_dev_exit_net().
Without knowing the namespace of the device, this coordination
cannot be done.
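
The coordination described here could look roughly like the following. This is a sketch under the assumption that the proposed rdma_dev_to_netns() helper exists; all my_* names are hypothetical, while net_generic() and the pernet exit hook are the standard kernel per-net mechanism.

```c
/* Sketch: per-namespace client state, reachable from both the
 * client's remove() callback and a pernet exit hook.  All my_*
 * names are hypothetical; rdma_dev_to_netns() is the proposed
 * helper, not an existing API. */
static unsigned int my_net_id __read_mostly;

struct my_net {
	struct list_head dev_states;	/* state for devices in this NS */
};

static void my_client_remove(struct ib_device *ib_dev, void *client_data)
{
	struct net *net = rdma_dev_to_netns(ib_dev);	/* proposed */
	struct my_net *mn = net_generic(net, my_net_id);

	/* Detach this device's state from the per-namespace list. */
	my_detach_dev_state(mn, client_data);		/* hypothetical */
}

static void __net_exit my_net_exit(struct net *net)
{
	struct my_net *mn = net_generic(net, my_net_id);

	/* May run before the client's remove() is triggered from
	 * rdma_dev_exit_net(), so the two paths must coordinate. */
	my_free_all_dev_states(mn);			/* hypothetical */
}
```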




-- 
K. Poon
ka-cheong.poon@oracle.com




* Re: Finding the namespace of a struct ib_device
  2020-09-04  4:01   ` Ka-Cheong Poon
@ 2020-09-04 11:32     ` Jason Gunthorpe
  2020-09-04 14:02       ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 11:32 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
> On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
> > On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
> > > When a struct ib_client's add() function is called. is there a
> > > supported method to find out the namespace of the passed in
> > > struct ib_device?  There is rdma_dev_access_netns() but it does
> > > not return the namespace.  It seems that it needs to have
> > > something like the following.
> > > 
> > > struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
> > > {
> > >         return read_pnet(&ib_dev->coredev.rdma_net);
> > > }
> > > 
> > > Comments?
> > 
> > I suppose, but why would something need this?
> 
> 
> If the client needs to allocate stuff for the namespace
> related to that device, it needs to know the namespace of
> that device.  Then when that namespace is deleted, the
> client can clean up those related stuff as the client's
> namespace exit function can be called before the remove()
> function is triggered in rdma_dev_exit_net().  Without
> knowing the namespace of that device, coordination cannot
> be done.

Since each device can only be in one namespace, why would a client
ever need to allocate at a level more granular than a device?

Jason


* Re: Finding the namespace of a struct ib_device
  2020-09-04 11:32     ` Jason Gunthorpe
@ 2020-09-04 14:02       ` Ka-Cheong Poon
  2020-09-06  7:44         ` Leon Romanovsky
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-04 14:02 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
> On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
>> On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
>>> On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
>>>> When a struct ib_client's add() function is called. is there a
>>>> supported method to find out the namespace of the passed in
>>>> struct ib_device?  There is rdma_dev_access_netns() but it does
>>>> not return the namespace.  It seems that it needs to have
>>>> something like the following.
>>>>
>>>> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
>>>> {
>>>>          return read_pnet(&ib_dev->coredev.rdma_net);
>>>> }
>>>>
>>>> Comments?
>>>
>>> I suppose, but why would something need this?
>>
>>
>> If the client needs to allocate stuff for the namespace
>> related to that device, it needs to know the namespace of
>> that device.  Then when that namespace is deleted, the
>> client can clean up those related stuff as the client's
>> namespace exit function can be called before the remove()
>> function is triggered in rdma_dev_exit_net().  Without
>> knowing the namespace of that device, coordination cannot
>> be done.
> 
> Since each device can only be in one namespace, why would a client
> ever need to allocate at a level more granular than a device?


A client wants to keep namespace-specific info.  If the
device belongs to a namespace, it wants to associate that
info with the device.  When the namespace is deleted, the
info needs to be deleted as well.  You can consider the info
as associated with both a namespace and a device.


-- 
K. Poon
ka-cheong.poon@oracle.com




* Re: Finding the namespace of a struct ib_device
  2020-09-04 14:02       ` Ka-Cheong Poon
@ 2020-09-06  7:44         ` Leon Romanovsky
  2020-09-07  3:33           ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2020-09-06  7:44 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Jason Gunthorpe, linux-rdma

On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
> On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
> > On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
> > > On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
> > > > On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
> > > > > When a struct ib_client's add() function is called. is there a
> > > > > supported method to find out the namespace of the passed in
> > > > > struct ib_device?  There is rdma_dev_access_netns() but it does
> > > > > not return the namespace.  It seems that it needs to have
> > > > > something like the following.
> > > > >
> > > > > struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
> > > > > {
> > > > >          return read_pnet(&ib_dev->coredev.rdma_net);
> > > > > }
> > > > >
> > > > > Comments?
> > > >
> > > > I suppose, but why would something need this?
> > >
> > >
> > > If the client needs to allocate stuff for the namespace
> > > related to that device, it needs to know the namespace of
> > > that device.  Then when that namespace is deleted, the
> > > client can clean up those related stuff as the client's
> > > namespace exit function can be called before the remove()
> > > function is triggered in rdma_dev_exit_net().  Without
> > > knowing the namespace of that device, coordination cannot
> > > be done.
> >
> > Since each device can only be in one namespace, why would a client
> > ever need to allocate at a level more granular than a device?
>
>
> A client wants to have namespace specific info.  If the
> device belongs to a namespace, it wants to associate those
> info with that device.  When a namespace is deleted, the
> info will need to be deleted.  You can consider the info
> as associated with both a namespace and a device.

Can you be more specific about which info you are talking about?
And what is the client that is net namespace-aware from one side,
but from another separate data between them "manually"?

Thanks

>
>
> --
> K. Poon
> ka-cheong.poon@oracle.com
>
>


* Re: Finding the namespace of a struct ib_device
  2020-09-06  7:44         ` Leon Romanovsky
@ 2020-09-07  3:33           ` Ka-Cheong Poon
  2020-09-07  7:18             ` Leon Romanovsky
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-07  3:33 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma

On 9/6/20 3:44 PM, Leon Romanovsky wrote:
> On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
>> On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
>>> On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
>>>> On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
>>>>> On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
>>>>>> When a struct ib_client's add() function is called. is there a
>>>>>> supported method to find out the namespace of the passed in
>>>>>> struct ib_device?  There is rdma_dev_access_netns() but it does
>>>>>> not return the namespace.  It seems that it needs to have
>>>>>> something like the following.
>>>>>>
>>>>>> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
>>>>>> {
>>>>>>           return read_pnet(&ib_dev->coredev.rdma_net);
>>>>>> }
>>>>>>
>>>>>> Comments?
>>>>>
>>>>> I suppose, but why would something need this?
>>>>
>>>>
>>>> If the client needs to allocate stuff for the namespace
>>>> related to that device, it needs to know the namespace of
>>>> that device.  Then when that namespace is deleted, the
>>>> client can clean up those related stuff as the client's
>>>> namespace exit function can be called before the remove()
>>>> function is triggered in rdma_dev_exit_net().  Without
>>>> knowing the namespace of that device, coordination cannot
>>>> be done.
>>>
>>> Since each device can only be in one namespace, why would a client
>>> ever need to allocate at a level more granular than a device?
>>
>>
>> A client wants to have namespace specific info.  If the
>> device belongs to a namespace, it wants to associate those
>> info with that device.  When a namespace is deleted, the
>> info will need to be deleted.  You can consider the info
>> as associated with both a namespace and a device.
> 
> Can you be more specific about which info you are talking about?


Actually, a lot of info can be both namespace- and device-specific.
For example, a client may want a different PD allocation policy
for a device when it is used in different namespaces.
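
As an illustration only, such a per-namespace PD policy might be sketched as below. rdma_dev_to_netns() is the proposed helper, not an existing API, and the specific policy choice is invented for the example; ib_alloc_pd(), net_eq(), and IB_PD_UNSAFE_GLOBAL_RKEY are real kernel interfaces.

```c
/* Hypothetical sketch of a per-namespace PD allocation policy.
 * rdma_dev_to_netns() is the proposed helper; the policy itself
 * (unsafe global rkey only in init_net) is just an example. */
static struct ib_pd *my_alloc_pd(struct ib_device *ib_dev)
{
	struct net *net = rdma_dev_to_netns(ib_dev);

	if (net_eq(net, &init_net))
		return ib_alloc_pd(ib_dev, IB_PD_UNSAFE_GLOBAL_RKEY);

	/* Devices in container namespaces get the stricter policy. */
	return ib_alloc_pd(ib_dev, 0);
}
```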


> And what is the client that is net namespace-aware from one side,
> but from another separate data between them "manually"?


Could you please elaborate what is meant by "namespace aware from
one side but from another separate data between them manually"?
I understand what namespace aware means.  But it is not clear what
is meant by "separating data manually".  Do you mean having different
behavior in different namespaces?  If this is the case, there is
nothing special here.  An admin may choose to have different behavior
in different namespaces.  There is nothing manual going on in the
client code.


-- 
K. Poon
ka-cheong.poon@oracle.com




* Re: Finding the namespace of a struct ib_device
  2020-09-07  3:33           ` Ka-Cheong Poon
@ 2020-09-07  7:18             ` Leon Romanovsky
  2020-09-07  8:24               ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2020-09-07  7:18 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Jason Gunthorpe, linux-rdma

On Mon, Sep 07, 2020 at 11:33:38AM +0800, Ka-Cheong Poon wrote:
> On 9/6/20 3:44 PM, Leon Romanovsky wrote:
> > On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
> > > On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
> > > > On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
> > > > > On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
> > > > > > On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
> > > > > > > When a struct ib_client's add() function is called. is there a
> > > > > > > supported method to find out the namespace of the passed in
> > > > > > > struct ib_device?  There is rdma_dev_access_netns() but it does
> > > > > > > not return the namespace.  It seems that it needs to have
> > > > > > > something like the following.
> > > > > > >
> > > > > > > struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
> > > > > > > {
> > > > > > >           return read_pnet(&ib_dev->coredev.rdma_net);
> > > > > > > }
> > > > > > >
> > > > > > > Comments?
> > > > > >
> > > > > > I suppose, but why would something need this?
> > > > >
> > > > >
> > > > > If the client needs to allocate stuff for the namespace
> > > > > related to that device, it needs to know the namespace of
> > > > > that device.  Then when that namespace is deleted, the
> > > > > client can clean up those related stuff as the client's
> > > > > namespace exit function can be called before the remove()
> > > > > function is triggered in rdma_dev_exit_net().  Without
> > > > > knowing the namespace of that device, coordination cannot
> > > > > be done.
> > > >
> > > > Since each device can only be in one namespace, why would a client
> > > > ever need to allocate at a level more granular than a device?
> > >
> > >
> > > A client wants to have namespace specific info.  If the
> > > device belongs to a namespace, it wants to associate those
> > > info with that device.  When a namespace is deleted, the
> > > info will need to be deleted.  You can consider the info
> > > as associated with both a namespace and a device.
> >
> > Can you be more specific about which info you are talking about?
>
>
> Actually, a lot of info can be both namespace and device specific.
> For example, a client wants to have a different PD allocation policy
> with a device when used in different namespaces.
>
>
> > And what is the client that is net namespace-aware from one side,
> > but from another separate data between them "manually"?
>
>
> Could you please elaborate what is meant by "namespace aware from
> one side but from another separate data between them manually"?
> I understand what namespace aware means.  But it is not clear what
> is meant by "separating data manually".  Do you mean having different
> behavior in different namespaces?  If this is the case, there is
> nothing special here.  An admin may choose to have different behavior
> in different namespaces.  There is nothing manual going on in the
> client code.

We are talking about net namespaces, and as we wrote above, an ib_device
that supports such namespaces can exist only in a single one.

The client that implemented such support can check its namespace while
"client->add" is called. It should be equal to be seen by ib_device.

See:
 rdma_dev_change_netns ->
 	enable_device_and_get ->
		add_client_context ->
			client->add(device)


"Manual" means that client will store results of first client->add call
(in init_net NS) and will use globally stored data for other NS, which
is not netdev way to work with namespaces. The expectation that they are
separated without shared data between.

Thanks

>
>
> --
> K. Poon
> ka-cheong.poon@oracle.com
>
>


* Re: Finding the namespace of a struct ib_device
  2020-09-07  7:18             ` Leon Romanovsky
@ 2020-09-07  8:24               ` Ka-Cheong Poon
  2020-09-07  9:04                 ` Leon Romanovsky
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-07  8:24 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma

On 9/7/20 3:18 PM, Leon Romanovsky wrote:
> On Mon, Sep 07, 2020 at 11:33:38AM +0800, Ka-Cheong Poon wrote:
>> On 9/6/20 3:44 PM, Leon Romanovsky wrote:
>>> On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
>>>> On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
>>>>> On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
>>>>>> On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
>>>>>>> On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
>>>>>>>> When a struct ib_client's add() function is called. is there a
>>>>>>>> supported method to find out the namespace of the passed in
>>>>>>>> struct ib_device?  There is rdma_dev_access_netns() but it does
>>>>>>>> not return the namespace.  It seems that it needs to have
>>>>>>>> something like the following.
>>>>>>>>
>>>>>>>> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
>>>>>>>> {
>>>>>>>>            return read_pnet(&ib_dev->coredev.rdma_net);
>>>>>>>> }
>>>>>>>>
>>>>>>>> Comments?
>>>>>>>
>>>>>>> I suppose, but why would something need this?
>>>>>>
>>>>>>
>>>>>> If the client needs to allocate stuff for the namespace
>>>>>> related to that device, it needs to know the namespace of
>>>>>> that device.  Then when that namespace is deleted, the
>>>>>> client can clean up those related stuff as the client's
>>>>>> namespace exit function can be called before the remove()
>>>>>> function is triggered in rdma_dev_exit_net().  Without
>>>>>> knowing the namespace of that device, coordination cannot
>>>>>> be done.
>>>>>
>>>>> Since each device can only be in one namespace, why would a client
>>>>> ever need to allocate at a level more granular than a device?
>>>>
>>>>
>>>> A client wants to have namespace specific info.  If the
>>>> device belongs to a namespace, it wants to associate those
>>>> info with that device.  When a namespace is deleted, the
>>>> info will need to be deleted.  You can consider the info
>>>> as associated with both a namespace and a device.
>>>
>>> Can you be more specific about which info you are talking about?
>>
>>
>> Actually, a lot of info can be both namespace and device specific.
>> For example, a client wants to have a different PD allocation policy
>> with a device when used in different namespaces.
>>
>>
>>> And what is the client that is net namespace-aware from one side,
>>> but from another separate data between them "manually"?
>>
>>
>> Could you please elaborate what is meant by "namespace aware from
>> one side but from another separate data between them manually"?
>> I understand what namespace aware means.  But it is not clear what
>> is meant by "separating data manually".  Do you mean having different
>> behavior in different namespaces?  If this is the case, there is
>> nothing special here.  An admin may choose to have different behavior
>> in different namespaces.  There is nothing manual going on in the
>> client code.
> 
> We are talking about net-namespaces, and as we wrote above, the ib_device
> that supports such namespace can exist only in a single one
> 
> The client that implemented such support can check its namespace while
> "client->add" is called. It should be equal to be seen by ib_device.
> 
> See:
>   rdma_dev_change_netns ->
>   	enable_device_and_get ->
> 		add_client_context ->
> 			client->add(device)


This is the original question.  How does the client's add() function
know the namespace of the device?  What is your suggestion for finding
the net namespace of the device at add() time?


> "Manual" means that client will store results of first client->add call
> (in init_net NS) and will use globally stored data for other NS, which
> is not netdev way to work with namespaces. The expectation that they are
> separated without shared data between.


It is not clear why the client needs to use globally stored data for
other net namespaces.  When an RDMA device is moved from init_net to
another net namespace, the client's remove() function is called first,
and then the client's add() function is called.  If a client can know
the net namespace of a device when add()/remove() is called, it can use
namespace-specific data storage.  It does not need to store
namespace-specific data in a global store.  The original question is
how to find out the net namespace of a device.



-- 
K. Poon
ka-cheong.poon@oracle.com




* Re: Finding the namespace of a struct ib_device
  2020-09-07  8:24               ` Ka-Cheong Poon
@ 2020-09-07  9:04                 ` Leon Romanovsky
  2020-09-07  9:28                   ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2020-09-07  9:04 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Jason Gunthorpe, linux-rdma

On Mon, Sep 07, 2020 at 04:24:26PM +0800, Ka-Cheong Poon wrote:
> On 9/7/20 3:18 PM, Leon Romanovsky wrote:
> > On Mon, Sep 07, 2020 at 11:33:38AM +0800, Ka-Cheong Poon wrote:
> > > On 9/6/20 3:44 PM, Leon Romanovsky wrote:
> > > > On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
> > > > > On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
> > > > > > On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
> > > > > > > On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
> > > > > > > > On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
> > > > > > > > > When a struct ib_client's add() function is called. is there a
> > > > > > > > > supported method to find out the namespace of the passed in
> > > > > > > > > struct ib_device?  There is rdma_dev_access_netns() but it does
> > > > > > > > > not return the namespace.  It seems that it needs to have
> > > > > > > > > something like the following.
> > > > > > > > >
> > > > > > > > > struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
> > > > > > > > > {
> > > > > > > > >            return read_pnet(&ib_dev->coredev.rdma_net);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > Comments?
> > > > > > > >
> > > > > > > > I suppose, but why would something need this?
> > > > > > >
> > > > > > >
> > > > > > > If the client needs to allocate stuff for the namespace
> > > > > > > related to that device, it needs to know the namespace of
> > > > > > > that device.  Then when that namespace is deleted, the
> > > > > > > client can clean up those related stuff as the client's
> > > > > > > namespace exit function can be called before the remove()
> > > > > > > function is triggered in rdma_dev_exit_net().  Without
> > > > > > > knowing the namespace of that device, coordination cannot
> > > > > > > be done.
> > > > > >
> > > > > > Since each device can only be in one namespace, why would a client
> > > > > > ever need to allocate at a level more granular than a device?
> > > > >
> > > > >
> > > > > A client wants to have namespace specific info.  If the
> > > > > device belongs to a namespace, it wants to associate those
> > > > > info with that device.  When a namespace is deleted, the
> > > > > info will need to be deleted.  You can consider the info
> > > > > as associated with both a namespace and a device.
> > > >
> > > > Can you be more specific about which info you are talking about?
> > >
> > >
> > > Actually, a lot of info can be both namespace and device specific.
> > > For example, a client wants to have a different PD allocation policy
> > > with a device when used in different namespaces.
> > >
> > >
> > > > And what is the client that is net namespace-aware from one side,
> > > > but from another separate data between them "manually"?
> > >
> > >
> > > Could you please elaborate what is meant by "namespace aware from
> > > one side but from another separate data between them manually"?
> > > I understand what namespace aware means.  But it is not clear what
> > > is meant by "separating data manually".  Do you mean having different
> > > behavior in different namespaces?  If this is the case, there is
> > > nothing special here.  An admin may choose to have different behavior
> > > in different namespaces.  There is nothing manual going on in the
> > > client code.
> >
> > We are talking about net-namespaces, and as we wrote above, the ib_device
> > that supports such namespace can exist only in a single one
> >
> > The client that implemented such support can check its namespace while
> > "client->add" is called. It should be equal to be seen by ib_device.
> >
> > See:
> >   rdma_dev_change_netns ->
> >   	enable_device_and_get ->
> > 		add_client_context ->
> > 			client->add(device)
>
>
> This is the original question.  How does the client's add() function
> know the namespace of device?  What is your suggestion in finding
> the net namespace of device at add() time?

As I wrote above, "It should be equal to be seen by ib_device."  Check
the net namespace of your client.

Thanks


* Re: Finding the namespace of a struct ib_device
  2020-09-07  9:04                 ` Leon Romanovsky
@ 2020-09-07  9:28                   ` Ka-Cheong Poon
  2020-09-07 10:22                     ` Leon Romanovsky
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-07  9:28 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma

On 9/7/20 5:04 PM, Leon Romanovsky wrote:
> On Mon, Sep 07, 2020 at 04:24:26PM +0800, Ka-Cheong Poon wrote:
>> On 9/7/20 3:18 PM, Leon Romanovsky wrote:
>>> On Mon, Sep 07, 2020 at 11:33:38AM +0800, Ka-Cheong Poon wrote:
>>>> On 9/6/20 3:44 PM, Leon Romanovsky wrote:
>>>>> On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
>>>>>> On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
>>>>>>> On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
>>>>>>>> On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
>>>>>>>>> On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
>>>>>>>>>> When a struct ib_client's add() function is called. is there a
>>>>>>>>>> supported method to find out the namespace of the passed in
>>>>>>>>>> struct ib_device?  There is rdma_dev_access_netns() but it does
>>>>>>>>>> not return the namespace.  It seems that it needs to have
>>>>>>>>>> something like the following.
>>>>>>>>>>
>>>>>>>>>> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
>>>>>>>>>> {
>>>>>>>>>>             return read_pnet(&ib_dev->coredev.rdma_net);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Comments?
>>>>>>>>>
>>>>>>>>> I suppose, but why would something need this?
>>>>>>>>
>>>>>>>>
>>>>>>>> If the client needs to allocate stuff for the namespace
>>>>>>>> related to that device, it needs to know the namespace of
>>>>>>>> that device.  Then when that namespace is deleted, the
>>>>>>>> client can clean up those related stuff as the client's
>>>>>>>> namespace exit function can be called before the remove()
>>>>>>>> function is triggered in rdma_dev_exit_net().  Without
>>>>>>>> knowing the namespace of that device, coordination cannot
>>>>>>>> be done.
>>>>>>>
>>>>>>> Since each device can only be in one namespace, why would a client
>>>>>>> ever need to allocate at a level more granular than a device?
>>>>>>
>>>>>>
>>>>>> A client wants to have namespace specific info.  If the
>>>>>> device belongs to a namespace, it wants to associate those
>>>>>> info with that device.  When a namespace is deleted, the
>>>>>> info will need to be deleted.  You can consider the info
>>>>>> as associated with both a namespace and a device.
>>>>>
>>>>> Can you be more specific about which info you are talking about?
>>>>
>>>>
>>>> Actually, a lot of info can be both namespace and device specific.
>>>> For example, a client wants to have a different PD allocation policy
>>>> with a device when used in different namespaces.
>>>>
>>>>
>>>>> And what is the client that is net namespace-aware from one side,
>>>>> but from another separate data between them "manually"?
>>>>
>>>>
>>>> Could you please elaborate what is meant by "namespace aware from
>>>> one side but from another separate data between them manually"?
>>>> I understand what namespace aware means.  But it is not clear what
>>>> is meant by "separating data manually".  Do you mean having different
>>>> behavior in different namespaces?  If this is the case, there is
>>>> nothing special here.  An admin may choose to have different behavior
>>>> in different namespaces.  There is nothing manual going on in the
>>>> client code.
>>>
>>> We are talking about net-namespaces, and as we wrote above, the ib_device
>>> that supports such namespace can exist only in a single one
>>>
>>> The client that implemented such support can check its namespace while
>>> "client->add" is called. It should be equal to be seen by ib_device.
>>>
>>> See:
>>>    rdma_dev_change_netns ->
>>>    	enable_device_and_get ->
>>> 		add_client_context ->
>>> 			client->add(device)
>>
>>
>> This is the original question.  How does the client's add() function
>> know the namespace of device?  What is your suggestion in finding
>> the net namespace of device at add() time?
> 
> As I wrote above, "It should be equal to be seen by ib_device.", check net
> namespace of your client.


Could you please be more specific?  A client calls ib_register_client() to
register with the RDMA framework.  Then, when a device is added, the client's
add() function is called with the struct ib_device.  How does the client
find out the namespace "seen by the ib_device"?  Do you mean that there is
a variant of ib_register_client() which can take a net namespace as a
parameter?  Or is there a variant of struct ib_client which has a net
namespace field?  Or?  Thanks.


-- 
K. Poon
ka-cheong.poon@oracle.com




* Re: Finding the namespace of a struct ib_device
  2020-09-07  9:28                   ` Ka-Cheong Poon
@ 2020-09-07 10:22                     ` Leon Romanovsky
  2020-09-07 13:48                       ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2020-09-07 10:22 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Jason Gunthorpe, linux-rdma

On Mon, Sep 07, 2020 at 05:28:23PM +0800, Ka-Cheong Poon wrote:
> On 9/7/20 5:04 PM, Leon Romanovsky wrote:
> > On Mon, Sep 07, 2020 at 04:24:26PM +0800, Ka-Cheong Poon wrote:
> > > On 9/7/20 3:18 PM, Leon Romanovsky wrote:
> > > > On Mon, Sep 07, 2020 at 11:33:38AM +0800, Ka-Cheong Poon wrote:
> > > > > On 9/6/20 3:44 PM, Leon Romanovsky wrote:
> > > > > > On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
> > > > > > > On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
> > > > > > > > On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
> > > > > > > > > On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
> > > > > > > > > > On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
> > > > > > > > > > > When a struct ib_client's add() function is called. is there a
> > > > > > > > > > > supported method to find out the namespace of the passed in
> > > > > > > > > > > struct ib_device?  There is rdma_dev_access_netns() but it does
> > > > > > > > > > > not return the namespace.  It seems that it needs to have
> > > > > > > > > > > something like the following.
> > > > > > > > > > >
> > > > > > > > > > > struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
> > > > > > > > > > > {
> > > > > > > > > > >             return read_pnet(&ib_dev->coredev.rdma_net);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > Comments?
> > > > > > > > > >
> > > > > > > > > > I suppose, but why would something need this?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If the client needs to allocate stuff for the namespace
> > > > > > > > > related to that device, it needs to know the namespace of
> > > > > > > > > that device.  Then when that namespace is deleted, the
> > > > > > > > > client can clean up those related stuff as the client's
> > > > > > > > > namespace exit function can be called before the remove()
> > > > > > > > > function is triggered in rdma_dev_exit_net().  Without
> > > > > > > > > knowing the namespace of that device, coordination cannot
> > > > > > > > > be done.
> > > > > > > >
> > > > > > > > Since each device can only be in one namespace, why would a client
> > > > > > > > ever need to allocate at a level more granular than a device?
> > > > > > >
> > > > > > >
> > > > > > > A client wants to have namespace specific info.  If the
> > > > > > > device belongs to a namespace, it wants to associate those
> > > > > > > info with that device.  When a namespace is deleted, the
> > > > > > > info will need to be deleted.  You can consider the info
> > > > > > > as associated with both a namespace and a device.
> > > > > >
> > > > > > Can you be more specific about which info you are talking about?
> > > > >
> > > > >
> > > > > Actually, a lot of info can be both namespace and device specific.
> > > > > For example, a client wants to have a different PD allocation policy
> > > > > with a device when used in different namespaces.
> > > > >
> > > > >
> > > > > > And what is the client that is net namespace-aware from one side,
> > > > > > but from another separate data between them "manually"?
> > > > >
> > > > >
> > > > > Could you please elaborate what is meant by "namespace aware from
> > > > > one side but from another separate data between them manually"?
> > > > > I understand what namespace aware means.  But it is not clear what
> > > > > is meant by "separating data manually".  Do you mean having different
> > > > > behavior in different namespaces?  If this is the case, there is
> > > > > nothing special here.  An admin may choose to have different behavior
> > > > > in different namespaces.  There is nothing manual going on in the
> > > > > client code.
> > > >
> > > > We are talking about net-namespaces, and as we wrote above, the ib_device
> > > > that supports such namespace can exist only in a single one
> > > >
> > > > The client that implemented such support can check its namespace while
> > > > "client->add" is called. It should be equal to be seen by ib_device.
> > > >
> > > > See:
> > > >    rdma_dev_change_netns ->
> > > >    	enable_device_and_get ->
> > > > 		add_client_context ->
> > > > 			client->add(device)
> > >
> > >
> > > This is the original question.  How does the client's add() function
> > > know the namespace of device?  What is your suggestion in finding
> > > the net namespace of device at add() time?
> >
> > As I wrote above, "It should be equal to be seen by ib_device.", check net
> > namespace of your client.
>
>
> Could you please be more specific?  A client calls ib_register_client() to
> register with the RDMA framework.  Then when a device is added, the client's
> add() function is called with the struct ib_device.  How does the client
> find out the namespace "seen by the ib_device"?  Do you mean that there is
> a variant of ib_register_client() which can take a net namespace as parameter?
> Or is there a variant of struct ib_client which has a net namespace field?
> Or?  Thanks.

"Do you mean that there is a variant of ib_register_client()
which can take a net namespace as parameter?"

No, it doesn't exist but it is easy to extend and IMHO the right
thing to do.

Thanks

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Finding the namespace of a struct ib_device
  2020-09-07 10:22                     ` Leon Romanovsky
@ 2020-09-07 13:48                       ` Ka-Cheong Poon
  2020-09-29 16:57                         ` RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device) Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-07 13:48 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma

On 9/7/20 6:22 PM, Leon Romanovsky wrote:
> On Mon, Sep 07, 2020 at 05:28:23PM +0800, Ka-Cheong Poon wrote:
>> On 9/7/20 5:04 PM, Leon Romanovsky wrote:
>>> On Mon, Sep 07, 2020 at 04:24:26PM +0800, Ka-Cheong Poon wrote:
>>>> On 9/7/20 3:18 PM, Leon Romanovsky wrote:
>>>>> On Mon, Sep 07, 2020 at 11:33:38AM +0800, Ka-Cheong Poon wrote:
>>>>>> On 9/6/20 3:44 PM, Leon Romanovsky wrote:
>>>>>>> On Fri, Sep 04, 2020 at 10:02:10PM +0800, Ka-Cheong Poon wrote:
>>>>>>>> On 9/4/20 7:32 PM, Jason Gunthorpe wrote:
>>>>>>>>> On Fri, Sep 04, 2020 at 12:01:12PM +0800, Ka-Cheong Poon wrote:
>>>>>>>>>> On 9/4/20 1:39 AM, Jason Gunthorpe wrote:
>>>>>>>>>>> On Thu, Sep 03, 2020 at 10:02:01PM +0800, Ka-Cheong Poon wrote:
>>>>>>>>>>>> When a struct ib_client's add() function is called. is there a
>>>>>>>>>>>> supported method to find out the namespace of the passed in
>>>>>>>>>>>> struct ib_device?  There is rdma_dev_access_netns() but it does
>>>>>>>>>>>> not return the namespace.  It seems that it needs to have
>>>>>>>>>>>> something like the following.
>>>>>>>>>>>>
>>>>>>>>>>>> struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
>>>>>>>>>>>> {
>>>>>>>>>>>>              return read_pnet(&ib_dev->coredev.rdma_net);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Comments?
>>>>>>>>>>>
>>>>>>>>>>> I suppose, but why would something need this?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If the client needs to allocate stuff for the namespace
>>>>>>>>>> related to that device, it needs to know the namespace of
>>>>>>>>>> that device.  Then when that namespace is deleted, the
>>>>>>>>>> client can clean up those related stuff as the client's
>>>>>>>>>> namespace exit function can be called before the remove()
>>>>>>>>>> function is triggered in rdma_dev_exit_net().  Without
>>>>>>>>>> knowing the namespace of that device, coordination cannot
>>>>>>>>>> be done.
>>>>>>>>>
>>>>>>>>> Since each device can only be in one namespace, why would a client
>>>>>>>>> ever need to allocate at a level more granular than a device?
>>>>>>>>
>>>>>>>>
>>>>>>>> A client wants to have namespace specific info.  If the
>>>>>>>> device belongs to a namespace, it wants to associate those
>>>>>>>> info with that device.  When a namespace is deleted, the
>>>>>>>> info will need to be deleted.  You can consider the info
>>>>>>>> as associated with both a namespace and a device.
>>>>>>>
>>>>>>> Can you be more specific about which info you are talking about?
>>>>>>
>>>>>>
>>>>>> Actually, a lot of info can be both namespace and device specific.
>>>>>> For example, a client wants to have a different PD allocation policy
>>>>>> with a device when used in different namespaces.
>>>>>>
>>>>>>
>>>>>>> And what is the client that is net namespace-aware from one side,
>>>>>>> but from another separate data between them "manually"?
>>>>>>
>>>>>>
>>>>>> Could you please elaborate what is meant by "namespace aware from
>>>>>> one side but from another separate data between them manually"?
>>>>>> I understand what namespace aware means.  But it is not clear what
>>>>>> is meant by "separating data manually".  Do you mean having different
>>>>>> behavior in different namespaces?  If this is the case, there is
>>>>>> nothing special here.  An admin may choose to have different behavior
>>>>>> in different namespaces.  There is nothing manual going on in the
>>>>>> client code.
>>>>>
>>>>> We are talking about net-namespaces, and as we wrote above, the ib_device
>>>>> that supports such namespace can exist only in a single one
>>>>>
>>>>> The client that implemented such support can check its namespace while
>>>>> "client->add" is called. It should be equal to be seen by ib_device.
>>>>>
>>>>> See:
>>>>>     rdma_dev_change_netns ->
>>>>>     	enable_device_and_get ->
>>>>> 		add_client_context ->
>>>>> 			client->add(device)
>>>>
>>>>
>>>> This is the original question.  How does the client's add() function
>>>> know the namespace of device?  What is your suggestion in finding
>>>> the net namespace of device at add() time?
>>>
>>> As I wrote above, "It should be equal to be seen by ib_device.", check net
>>> namespace of your client.
>>
>>
>> Could you please be more specific?  A client calls ib_register_client() to
>> register with the RDMA framework.  Then when a device is added, the client's
>> add() function is called with the struct ib_device.  How does the client
>> find out the namespace "seen by the ib_device"?  Do you mean that there is
>> a variant of ib_register_client() which can take a net namespace as parameter?
>> Or is there a variant of struct ib_client which has a net namespace field?
>> Or?  Thanks.
> 
> "Do you mean that there is a variant of ib_register_client()
> which can take a net namespace as parameter?"
> 
> No, it doesn't exist but it is easy to extend and IMHO the right
> thing to do.


This may require a number of changes to the way a client interacts with
the current RDMA framework.  For example, currently a client registers
once using one struct ib_client and gets device notifications for all
namespaces and devices.  Suppose there is rdma_[un]register_net_client();
it may require a client to use a different struct ib_client to register
for each net namespace.  And struct ib_client probably needs a field to
store the net namespace.  Probably all of the client interaction
functions will need to be modified.  Since the clients xarray is global,
more clients may also have performance implications, such as taking
longer to walk the whole clients xarray.

There are probably many other subtle changes required.  It may turn out
to be not so straightforward.  Is the community willing to take such
changes?  I can take a stab at it if the community really thinks that
this is preferred.
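
To make the shape concrete, one possible form of such an interface could
be the following.  This is an entirely hypothetical sketch: neither the
net field nor rdma_[un]register_net_client() exists in the current tree,
and the add() signature assumes the post-v5.9 int-returning variant.

```c
/*
 * Hypothetical sketch of a per-namespace registration interface.
 * None of the new fields or functions below exist in mainline.
 */
struct ib_client {
	const char *name;
	int  (*add)(struct ib_device *ibdev);
	void (*remove)(struct ib_device *ibdev, void *client_data);

	/* New: the net namespace this registration is scoped to. */
	possible_net_t net;

	/* Existing bookkeeping (client_id, uses, ...) elided. */
};

/* Register/unregister a client for events in one namespace only. */
int rdma_register_net_client(struct net *net, struct ib_client *client);
void rdma_unregister_net_client(struct net *net, struct ib_client *client);
```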

Thanks.


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-09-07 13:48                       ` Ka-Cheong Poon
@ 2020-09-29 16:57                         ` Ka-Cheong Poon
  2020-09-29 17:40                           ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-29 16:57 UTC (permalink / raw)
  To: linux-rdma


[-- Attachment #1: Type: text/plain, Size: 6137 bytes --]

On 9/7/20 9:48 PM, Ka-Cheong Poon wrote:

> This may require a number of changes to the way a client interacts with
> the current RDMA framework.  For example, currently a client registers
> once using one struct ib_client and gets device notifications for all
> namespaces and devices.  Suppose there is rdma_[un]register_net_client();
> it may require a client to use a different struct ib_client to register
> for each net namespace.  And struct ib_client probably needs a field to
> store the net namespace.  Probably all of the client interaction
> functions will need to be modified.  Since the clients xarray is global,
> more clients may also have performance implications, such as taking
> longer to walk the whole clients xarray.
> 
> There are probably many other subtle changes required.  It may turn out
> to be not so straightforward.  Is the community willing to take such
> changes?  I can take a stab at it if the community really thinks that
> this is preferred.


Attached is a diff of a prototype for the above.  This exercise is
to see what needs to be done to have a more network-namespace-aware
interface for RDMA client registration.

Currently, there are ib_[un]register_client().  Under the RDMA namespace
exclusive mode, all RDMA devices are assigned to the init_net namespace
initially.  A kernel module uses this interface to register with the RDMA
subsystem.  When a device is assigned to a namespace, the client's
registered remove upcall is called with the device as the parameter (this
is removing from the init_net namespace).  Then the client's add upcall
is called with the device as the parameter (this is assigning to the new
namespace).  When that namespace is removed (*), a similar sequence of
events happens: a remove upcall (removing from the namespace) is followed
by an add upcall (assigning back to the init_net namespace).  All the RDMA
clients are stored in a global struct xarray called clients (in device.c)
and each client is assigned a client ID.

This exercise adds the rdma_[un]register_net_client() for those clients
which want to have more separation between different namespaces.  This
interface takes a struct net parameter.  A kernel module uses this to
indicate that it is only interested in the RDMA events related to the
given network namespace.  Suppose a client uses init_net as the parameter.
In the above example when a device is assigned to a namespace, only the
client's remove upcall is called (removing from the init_net namespace).
No add upcall follows.  Then when the namespace is removed, the client's
add upcall is called (re-assigning the device back to the init_net namespace).
Suppose a client uses a specific namespace as the parameter.  When a device
is assigned to that specific namespace, the client's add upcall is called.
When the client unregisters with RDMA (or when the namespace is going away),
the client's remove upcall is called.  The RDMA clients are stored in each
namespace's struct rdma_dev_net and each client is assigned a client ID
in that namespace (this means that it is unique only in that namespace but
not unique globally among all namespaces).
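
A client of the prototype interface would then register once per
namespace, e.g. from pernet init/exit hooks.  This sketch uses the
rdma_[un]register_net_client() calls from the attached diff; the my_*
names are made up for illustration:

```c
/* Per-namespace client state; one ib_client registration per netns. */
struct my_pernet {
	struct ib_client client;
};

static unsigned int my_net_id;

static int my_add(struct ib_device *ibdev)
{
	/* Only devices in this client's namespace trigger this upcall. */
	return 0;
}

static void my_remove(struct ib_device *ibdev, void *client_data)
{
}

static int __net_init my_net_init(struct net *net)
{
	struct my_pernet *pn = net_generic(net, my_net_id);

	pn->client.name = "my_client";
	pn->client.add = my_add;
	pn->client.remove = my_remove;
	/* From the attached prototype diff. */
	return rdma_register_net_client(net, &pn->client);
}

static void __net_exit my_net_exit(struct net *net)
{
	struct my_pernet *pn = net_generic(net, my_net_id);

	rdma_unregister_net_client(net, &pn->client);
}

static struct pernet_operations my_net_ops = {
	.init = my_net_init,
	.exit = my_net_exit,
	.id   = &my_net_id,
	.size = sizeof(struct my_pernet),
};
```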

This seemingly simple exercise turned out to be not so simple because of
the need to keep the existing interface and its existing behavior.  Only
when a client uses the new interface does the behavior change to what is
described above; there should be no change of behavior for any existing
RDMA client.  There are several obstacles to overcome for this change.
One difficulty is the global client ID, since a lot of code relies on
this ID as an index into both the global clients xarray and each
device's client_data xarray.  Detailed changes are in the attached diff
if folks are interested.

Note that the new interface has one obvious issue: it does not make much
sense in RDMA shared network namespace mode.  In shared mode, all devices
are associated with init_net.  So if a client uses the new interface to
register for a specific namespace other than init_net, it will never get
any upcall.  This and the difficulties in adding a seemingly simple
interface make me wonder about the following questions.

Is the RDMA shared namespace mode the preferred mode to use, given that it
is the default?  Is it expected that a client knows the running mode before
interacting with the RDMA subsystem?  Is a client not supposed to
differentiate between namespaces?  Besides the client add upcall, another
example of this is event handling.  Suppose a client calls
rdma_create_id() to create listeners in different namespaces but with the
same event handler.  A new connection comes in and the event handler is
called with an RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious
namespace info attached to the event.  It seems that the only way to find
the namespace is through the context of struct rdma_cm_id: the client must
add the namespace info to the context itself, since the subsystem does not
provide any help.  Is this the assumed solution?  BTW, this exercise still
does not remove the need for rdma_dev_to_netns(), as the add upcall does
not provide any namespace info.  Given all these questions,
rdma_[un]register_net_client() unfortunately does not seem to fit the
current way of interacting with the RDMA subsystem.
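
The context-based workaround described above would look roughly like the
following (the my_* names are made up for illustration); it relies on
child cm_ids inheriting the listener's context pointer:

```c
/* Client-side workaround: carry the namespace in the cm_id context. */
struct my_listener {
	struct net *net;	/* namespace this listener serves */
	struct rdma_cm_id *id;
};

static int my_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	/* Child cm_ids inherit the listener's context pointer. */
	struct my_listener *l = id->context;

	if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
		pr_info("connect request for listener in netns %p\n", l->net);
	return 0;
}

static struct my_listener *my_listen(struct net *net)
{
	struct my_listener *l;

	l = kzalloc(sizeof(*l), GFP_KERNEL);
	if (!l)
		return ERR_PTR(-ENOMEM);
	/* Remembered by hand, since the event itself carries no netns. */
	l->net = net;
	l->id = rdma_create_id(net, my_cm_handler, l, RDMA_PS_TCP, IB_QPT_RC);
	if (IS_ERR(l->id)) {
		void *err = l->id;

		kfree(l);
		return err;
	}
	return l;
}
```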

Thanks.


(*) Note that __rdma_create_id() does a get_net(net) to take a
     reference on the namespace.  Suppose a kernel module calls rdma_create_id()
     in its namespace .init function to create an RDMA listener and calls
     rdma_destroy_id() in its namespace .exit function to destroy it.  Since
     __rdma_create_id() holds a reference on the namespace, when a sysadmin
     deletes the namespace (say with `ip netns del ...`), the namespace won't
     actually be destroyed because of this reference.  Meanwhile the module
     will not release the reference until its .exit function runs, which
     happens only when the namespace is destroyed, so neither side can make
     progress.  To resolve this in the diff, I did something in
     __rdma_create_id() similar to the kern check in sk_alloc().



-- 
K. Poon
ka-cheong.poon@oracle.com



[-- Attachment #2: rdma_register_client.diff --]
[-- Type: text/x-patch, Size: 23331 bytes --]

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 7f0e91e92968..15eb91eee200 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -873,7 +873,10 @@ struct rdma_cm_id *__rdma_create_id(struct net *net,
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-	id_priv->id.route.addr.dev_addr.net = get_net(net);
+	if (caller)
+		id_priv->id.route.addr.dev_addr.net = net;
+	else
+		id_priv->id.route.addr.dev_addr.net = get_net(net);
 	id_priv->seq_num &= 0x00ffffff;
 
 	return &id_priv->id;
@@ -1819,8 +1822,12 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 static void _destroy_id(struct rdma_id_private *id_priv,
 			enum rdma_cm_state state)
 {
+	bool rel_net = true;
+
 	cma_cancel_operation(id_priv, state);
 
+	if (id_priv->res.kern_name)
+		rel_net = false;
 	rdma_restrack_del(&id_priv->res);
 	if (id_priv->cma_dev) {
 		if (rdma_cap_ib_cm(id_priv->id.device, 1)) {
@@ -1846,7 +1853,8 @@ static void _destroy_id(struct rdma_id_private *id_priv,
 	if (id_priv->id.route.addr.dev_addr.sgid_attr)
 		rdma_put_gid_attr(id_priv->id.route.addr.dev_addr.sgid_attr);
 
-	put_net(id_priv->id.route.addr.dev_addr.net);
+	if (rel_net)
+		put_net(id_priv->id.route.addr.dev_addr.net);
 	kfree(id_priv);
 }
 
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a1e6a67b2c4a..3c6c3cd516f3 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -66,6 +66,11 @@ struct rdma_dev_net {
 	struct sock *nl_sock;
 	possible_net_t net;
 	u32 id;
+
+	u32			rdn_highest_client_id;
+	struct xarray		rdn_clients;
+	struct rw_semaphore	rdn_clients_rwsem;
+
 };
 
 extern const struct attribute_group ib_dev_attr_group;
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c36b4d2b61e0..f113c9b2e547 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -93,10 +93,7 @@ static DEFINE_XARRAY_FLAGS(devices, XA_FLAGS_ALLOC);
 static DECLARE_RWSEM(devices_rwsem);
 #define DEVICE_REGISTERED XA_MARK_1
 
-static u32 highest_client_id;
 #define CLIENT_REGISTERED XA_MARK_1
-static DEFINE_XARRAY_FLAGS(clients, XA_FLAGS_ALLOC);
-static DECLARE_RWSEM(clients_rwsem);
 
 static void ib_client_put(struct ib_client *client)
 {
@@ -399,6 +396,7 @@ static int rename_compat_devs(struct ib_device *device)
 
 int ib_device_rename(struct ib_device *ibdev, const char *name)
 {
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 	void *client_data;
 	int ret;
@@ -424,10 +422,12 @@ int ib_device_rename(struct ib_device *ibdev, const char *name)
 	ret = rename_compat_devs(ibdev);
 
 	downgrade_write(&devices_rwsem);
+	rdn = rdma_net_to_dev_net(read_pnet(&ibdev->coredev.rdma_net));
+
 	down_read(&ibdev->client_data_rwsem);
 	xan_for_each_marked(&ibdev->client_data, index, client_data,
 			    CLIENT_DATA_REGISTERED) {
-		struct ib_client *client = xa_load(&clients, index);
+		struct ib_client *client = xa_load(&rdn->rdn_clients, index);
 
 		if (!client || !client->rename)
 			continue;
@@ -504,6 +504,7 @@ static void ib_device_release(struct device *device)
 
 	xa_destroy(&dev->compat_devs);
 	xa_destroy(&dev->client_data);
+	xa_destroy(&dev->net_client_data);
 	kfree_rcu(dev, rcu_head);
 }
 
@@ -594,6 +595,7 @@ struct ib_device *_ib_alloc_device(size_t size)
 	 * destroyed if the user stores NULL in the client data.
 	 */
 	xa_init_flags(&device->client_data, XA_FLAGS_ALLOC);
+	xa_init_flags(&device->net_client_data, XA_FLAGS_ALLOC);
 	init_rwsem(&device->client_data_rwsem);
 	xa_init_flags(&device->compat_devs, XA_FLAGS_ALLOC);
 	mutex_init(&device->compat_devs_mutex);
@@ -631,6 +633,7 @@ void ib_dealloc_device(struct ib_device *device)
 
 	WARN_ON(!xa_empty(&device->compat_devs));
 	WARN_ON(!xa_empty(&device->client_data));
+	WARN_ON(!xa_empty(&device->net_client_data));
 	WARN_ON(refcount_read(&device->refcount));
 	rdma_restrack_clean(device);
 	/* Balances with device_initialize */
@@ -647,8 +650,9 @@ EXPORT_SYMBOL(ib_dealloc_device);
  * or remove is fully completed.
  */
 static int add_client_context(struct ib_device *device,
-			      struct ib_client *client)
+			      struct ib_client *client, bool net_client)
 {
+	struct xarray *cl_data;
 	int ret = 0;
 
 	if (!device->kverbs_provider && !client->no_kverbs_req)
@@ -663,16 +667,20 @@ static int add_client_context(struct ib_device *device,
 		goto out_unlock;
 	refcount_inc(&device->refcount);
 
+	if (net_client)
+		cl_data = &device->net_client_data;
+	else
+		cl_data = &device->client_data;
+
 	/*
 	 * Another caller to add_client_context got here first and has already
 	 * completely initialized context.
 	 */
-	if (xa_get_mark(&device->client_data, client->client_id,
+	if (xa_get_mark(cl_data, client->client_id,
 		    CLIENT_DATA_REGISTERED))
 		goto out;
 
-	ret = xa_err(xa_store(&device->client_data, client->client_id, NULL,
-			      GFP_KERNEL));
+	ret = xa_err(xa_store(cl_data, client->client_id, NULL, GFP_KERNEL));
 	if (ret)
 		goto out;
 	downgrade_write(&device->client_data_rwsem);
@@ -692,8 +700,7 @@ static int add_client_context(struct ib_device *device,
 	}
 
 	/* Readers shall not see a client until add has been completed */
-	xa_set_mark(&device->client_data, client->client_id,
-		    CLIENT_DATA_REGISTERED);
+	xa_set_mark(cl_data, client->client_id, CLIENT_DATA_REGISTERED);
 	up_read(&device->client_data_rwsem);
 	return 0;
 
@@ -706,20 +713,26 @@ static int add_client_context(struct ib_device *device,
 }
 
 static void remove_client_context(struct ib_device *device,
-				  unsigned int client_id)
+				  unsigned int client_id,
+				  struct rdma_dev_net *rdn, bool net_client)
 {
 	struct ib_client *client;
+	struct xarray *cl_data;
 	void *client_data;
 
+	if (net_client)
+		cl_data = &device->net_client_data;
+	else
+		cl_data = &device->client_data;
+
 	down_write(&device->client_data_rwsem);
-	if (!xa_get_mark(&device->client_data, client_id,
-			 CLIENT_DATA_REGISTERED)) {
+	if (!xa_get_mark(cl_data, client_id, CLIENT_DATA_REGISTERED)) {
 		up_write(&device->client_data_rwsem);
 		return;
 	}
-	client_data = xa_load(&device->client_data, client_id);
-	xa_clear_mark(&device->client_data, client_id, CLIENT_DATA_REGISTERED);
-	client = xa_load(&clients, client_id);
+	client_data = xa_load(cl_data, client_id);
+	xa_clear_mark(cl_data, client_id, CLIENT_DATA_REGISTERED);
+	client = xa_load(&rdn->rdn_clients, client_id);
 	up_write(&device->client_data_rwsem);
 
 	/*
@@ -734,7 +747,10 @@ static void remove_client_context(struct ib_device *device,
 	if (client->remove)
 		client->remove(device, client_data);
 
-	xa_erase(&device->client_data, client_id);
+	if (client->net_client)
+		xa_erase(&device->net_client_data, client_id);
+	else
+		xa_erase(&device->client_data, client_id);
 	ib_device_put(device);
 	ib_client_put(client);
 }
@@ -924,6 +940,7 @@ static int add_one_compat_dev(struct ib_device *device,
 		goto insert_err;
 
 	mutex_unlock(&device->compat_devs_mutex);
+
 	return 0;
 
 insert_err:
@@ -1099,6 +1116,9 @@ static void rdma_dev_exit_net(struct net *net)
 
 	rdma_nl_net_exit(rnet);
 	xa_erase(&rdma_nets, rnet->id);
+
+	WARN_ON(!xa_empty(&rnet->rdn_clients));
+	xa_destroy(&rnet->rdn_clients);
 }
 
 static __net_init int rdma_dev_init_net(struct net *net)
@@ -1114,6 +1134,9 @@ static __net_init int rdma_dev_init_net(struct net *net)
 	if (ret)
 		return ret;
 
+	xa_init_flags(&rnet->rdn_clients, XA_FLAGS_ALLOC);
+	init_rwsem(&rnet->rdn_clients_rwsem);
+
 	/* No need to create any compat devices in default init_net. */
 	if (net_eq(net, &init_net))
 		return 0;
@@ -1263,9 +1286,14 @@ static int setup_device(struct ib_device *device)
 
 static void disable_device(struct ib_device *device)
 {
+	struct rdma_dev_net *init_rdn, *rdn;
+	struct net *net;
 	u32 cid;
 
 	WARN_ON(!refcount_read(&device->refcount));
+	init_rdn = rdma_net_to_dev_net(&init_net);
+	net = read_pnet(&device->coredev.rdma_net);
+	rdn = rdma_net_to_dev_net(net);
 
 	down_write(&devices_rwsem);
 	xa_clear_mark(&devices, device->index, DEVICE_REGISTERED);
@@ -1277,12 +1305,21 @@ static void disable_device(struct ib_device *device)
 	 * clients can be added to this ib_device past this point we only need
 	 * the maximum possible client_id value here.
 	 */
-	down_read(&clients_rwsem);
-	cid = highest_client_id;
-	up_read(&clients_rwsem);
+	down_read(&init_rdn->rdn_clients_rwsem);
+	cid = init_rdn->rdn_highest_client_id;
+	up_read(&init_rdn->rdn_clients_rwsem);
 	while (cid) {
 		cid--;
-		remove_client_context(device, cid);
+		remove_client_context(device, cid, init_rdn, false);
+	}
+
+	rdn = rdma_net_to_dev_net(net);
+	down_read(&rdn->rdn_clients_rwsem);
+	cid = rdn->rdn_highest_client_id;
+	up_read(&rdn->rdn_clients_rwsem);
+	while (cid) {
+		cid--;
+		remove_client_context(device, cid, rdn, true);
 	}
 
 	/* Pairs with refcount_set in enable_device */
@@ -1297,6 +1334,26 @@ static void disable_device(struct ib_device *device)
 	remove_compat_devs(device);
 }
 
+static int add_net_client_context(struct rdma_dev_net *rdn,
+				  struct ib_device *device, bool net_client)
+{
+	struct ib_client *client;
+	unsigned long index;
+	int ret = 0;
+
+	down_read(&rdn->rdn_clients_rwsem);
+	xa_for_each_marked(&rdn->rdn_clients, index, client,
+			   CLIENT_REGISTERED) {
+		if (client->net_client == net_client)
+			ret = add_client_context(device, client, net_client);
+		if (ret)
+			break;
+	}
+	up_read(&rdn->rdn_clients_rwsem);
+
+	return ret;
+}
+
 /*
  * An enabled device is visible to all clients and to all the public facing
  * APIs that return a device pointer. This always returns with a new get, even
@@ -1304,8 +1361,8 @@ static void disable_device(struct ib_device *device)
  */
 static int enable_device_and_get(struct ib_device *device)
 {
-	struct ib_client *client;
-	unsigned long index;
+	struct rdma_dev_net *rdn;
+	struct net *net;
 	int ret = 0;
 
 	/*
@@ -1321,20 +1378,27 @@ static int enable_device_and_get(struct ib_device *device)
 	 * DEVICE_REGISTERED while we are completing the client setup.
 	 */
 	downgrade_write(&devices_rwsem);
-
 	if (device->ops.enable_driver) {
 		ret = device->ops.enable_driver(device);
 		if (ret)
 			goto out;
 	}
 
-	down_read(&clients_rwsem);
-	xa_for_each_marked (&clients, index, client, CLIENT_REGISTERED) {
-		ret = add_client_context(device, client);
-		if (ret)
-			break;
-	}
-	up_read(&clients_rwsem);
+	/* For backward compatibility, always add client context for all "old"
+	 * registered clients using ib_register_client().
+	 */
+	rdn = rdma_net_to_dev_net(&init_net);
+	ret = add_net_client_context(rdn, device, false);
+	if (ret)
+		goto out;
+
+	/* Now add client context for clients registered using
+	 * rdma_register_net_client().
+	 */
+	net = read_pnet(&device->coredev.rdma_net);
+	rdn = rdma_net_to_dev_net(net);
+	ret = add_net_client_context(rdn, device, true);
+
 	if (!ret)
 		ret = add_compat_devs(device);
 out:
@@ -1711,37 +1775,49 @@ static struct pernet_operations rdma_dev_net_ops = {
 	.size = sizeof(struct rdma_dev_net),
 };
 
-static int assign_client_id(struct ib_client *client)
+static int assign_client_id(struct net *net, struct ib_client *client,
+			    bool net_client)
 {
+	struct rdma_dev_net *rdn;
 	int ret;
 
-	down_write(&clients_rwsem);
+	rdn = rdma_net_to_dev_net(net);
+
+	down_write(&rdn->rdn_clients_rwsem);
+
 	/*
 	 * The add/remove callbacks must be called in FIFO/LIFO order. To
 	 * achieve this we assign client_ids so they are sorted in
 	 * registration order.
 	 */
-	client->client_id = highest_client_id;
-	ret = xa_insert(&clients, client->client_id, client, GFP_KERNEL);
+	client->client_id = rdn->rdn_highest_client_id;
+	ret = xa_insert(&rdn->rdn_clients, client->client_id, client,
+			GFP_KERNEL);
 	if (ret)
 		goto out;
 
-	highest_client_id++;
-	xa_set_mark(&clients, client->client_id, CLIENT_REGISTERED);
+	rdn->rdn_highest_client_id++;
+	xa_set_mark(&rdn->rdn_clients, client->client_id, CLIENT_REGISTERED);
+	client->net_client = net_client;
 
 out:
-	up_write(&clients_rwsem);
+	up_write(&rdn->rdn_clients_rwsem);
 	return ret;
 }
 
-static void remove_client_id(struct ib_client *client)
+static void remove_client_id(struct net *net, struct ib_client *client)
 {
-	down_write(&clients_rwsem);
-	xa_erase(&clients, client->client_id);
-	for (; highest_client_id; highest_client_id--)
-		if (xa_load(&clients, highest_client_id - 1))
+	struct rdma_dev_net *rdn;
+	struct xarray *clients;
+
+	rdn = rdma_net_to_dev_net(net);
+	clients = &rdn->rdn_clients;
+	down_write(&rdn->rdn_clients_rwsem);
+	xa_erase(clients, client->client_id);
+	for (; rdn->rdn_highest_client_id; rdn->rdn_highest_client_id--)
+		if (xa_load(clients, rdn->rdn_highest_client_id - 1))
 			break;
-	up_write(&clients_rwsem);
+	up_write(&rdn->rdn_clients_rwsem);
 }
 
 /**
@@ -1765,13 +1841,13 @@ int ib_register_client(struct ib_client *client)
 
 	refcount_set(&client->uses, 1);
 	init_completion(&client->uses_zero);
-	ret = assign_client_id(client);
+	ret = assign_client_id(&init_net, client, false);
 	if (ret)
 		return ret;
 
 	down_read(&devices_rwsem);
 	xa_for_each_marked (&devices, index, device, DEVICE_REGISTERED) {
-		ret = add_client_context(device, client);
+		ret = add_client_context(device, client, false);
 		if (ret) {
 			up_read(&devices_rwsem);
 			ib_unregister_client(client);
@@ -1783,6 +1859,34 @@ int ib_register_client(struct ib_client *client)
 }
 EXPORT_SYMBOL(ib_register_client);
 
+int rdma_register_net_client(struct net *net, struct ib_client *client)
+{
+	struct ib_device *device;
+	unsigned long index;
+	int ret;
+
+	refcount_set(&client->uses, 1);
+	init_completion(&client->uses_zero);
+	ret = assign_client_id(net, client, true);
+	if (ret)
+		return ret;
+
+	down_read(&devices_rwsem);
+	xa_for_each_marked (&devices, index, device, DEVICE_REGISTERED) {
+		if (!net_eq(net, read_pnet(&device->coredev.rdma_net)))
+			continue;
+		ret = add_client_context(device, client, true);
+		if (ret) {
+			up_read(&devices_rwsem);
+			rdma_unregister_net_client(net, client);
+			return ret;
+		}
+	}
+	up_read(&devices_rwsem);
+	return 0;
+}
+EXPORT_SYMBOL(rdma_register_net_client);
+
 /**
  * ib_unregister_client - Unregister an IB client
  * @client:Client to unregister
@@ -1797,12 +1901,14 @@ EXPORT_SYMBOL(ib_register_client);
 void ib_unregister_client(struct ib_client *client)
 {
 	struct ib_device *device;
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 
-	down_write(&clients_rwsem);
+	rdn = rdma_net_to_dev_net(&init_net);
+	down_write(&rdn->rdn_clients_rwsem);
 	ib_client_put(client);
-	xa_clear_mark(&clients, client->client_id, CLIENT_REGISTERED);
-	up_write(&clients_rwsem);
+	xa_clear_mark(&rdn->rdn_clients, client->client_id, CLIENT_REGISTERED);
+	up_write(&rdn->rdn_clients_rwsem);
 
 	/* We do not want to have locks while calling client->remove() */
 	rcu_read_lock();
@@ -1811,7 +1917,7 @@ void ib_unregister_client(struct ib_client *client)
 			continue;
 		rcu_read_unlock();
 
-		remove_client_context(device, client->client_id);
+		remove_client_context(device, client->client_id, rdn, false);
 
 		ib_device_put(device);
 		rcu_read_lock();
@@ -1823,19 +1929,58 @@ void ib_unregister_client(struct ib_client *client)
 	 * removal is ongoing. Wait until all removals are completed.
 	 */
 	wait_for_completion(&client->uses_zero);
-	remove_client_id(client);
+	remove_client_id(&init_net, client);
 }
 EXPORT_SYMBOL(ib_unregister_client);
 
+void rdma_unregister_net_client(struct net *net, struct ib_client *client)
+{
+	struct ib_device *device;
+	struct rdma_dev_net *rdn;
+	unsigned long index;
+
+	rdn = rdma_net_to_dev_net(net);
+	down_write(&rdn->rdn_clients_rwsem);
+	ib_client_put(client);
+	xa_clear_mark(&rdn->rdn_clients, client->client_id, CLIENT_REGISTERED);
+	up_write(&rdn->rdn_clients_rwsem);
+
+	/* We do not want to have locks while calling client->remove() */
+	rcu_read_lock();
+	xa_for_each (&devices, index, device) {
+		if (!ib_device_try_get(device))
+			continue;
+		rcu_read_unlock();
+
+		remove_client_context(device, client->client_id, rdn, true);
+
+		ib_device_put(device);
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * remove_client_context() is not a fence, it can return even though a
+	 * removal is ongoing. Wait until all removals are completed.
+	 */
+	wait_for_completion(&client->uses_zero);
+	remove_client_id(net, client);
+}
+EXPORT_SYMBOL(rdma_unregister_net_client);
+
 static int __ib_get_global_client_nl_info(const char *client_name,
 					  struct ib_client_nl_info *res)
 {
 	struct ib_client *client;
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 	int ret = -ENOENT;
 
-	down_read(&clients_rwsem);
-	xa_for_each_marked (&clients, index, client, CLIENT_REGISTERED) {
+	/* No network namespace info available... */
+	rdn = rdma_net_to_dev_net(&init_net);
+	down_read(&rdn->rdn_clients_rwsem);
+	xa_for_each_marked (&rdn->rdn_clients, index, client,
+			    CLIENT_REGISTERED) {
 		if (strcmp(client->name, client_name) != 0)
 			continue;
 		if (!client->get_global_nl_info) {
@@ -1849,7 +1994,7 @@ static int __ib_get_global_client_nl_info(const char *client_name,
 			get_device(res->cdev);
 		break;
 	}
-	up_read(&clients_rwsem);
+	up_read(&rdn->rdn_clients_rwsem);
 	return ret;
 }
 
@@ -1857,14 +2002,24 @@ static int __ib_get_client_nl_info(struct ib_device *ibdev,
 				   const char *client_name,
 				   struct ib_client_nl_info *res)
 {
+	struct xarray *cl_data, *cls;
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 	void *client_data;
 	int ret = -ENOENT;
 
 	down_read(&ibdev->client_data_rwsem);
-	xan_for_each_marked (&ibdev->client_data, index, client_data,
+	if (ib_devices_shared_netns) {
+		rdn = rdma_net_to_dev_net(&init_net);
+		cl_data = &ibdev->client_data;
+	} else {
+		rdn = rdma_net_to_dev_net(read_pnet(&ibdev->coredev.rdma_net));
+		cl_data = &ibdev->net_client_data;
+	}
+	cls = &rdn->rdn_clients;
+	xan_for_each_marked (cl_data, index, client_data,
 			     CLIENT_DATA_REGISTERED) {
-		struct ib_client *client = xa_load(&clients, index);
+		struct ib_client *client = xa_load(cls, index);
 
 		if (!client || strcmp(client->name, client_name) != 0)
 			continue;
@@ -1939,13 +2094,17 @@ int ib_get_client_nl_info(struct ib_device *ibdev, const char *client_name,
 void ib_set_client_data(struct ib_device *device, struct ib_client *client,
 			void *data)
 {
+	struct xarray *cl_data;
 	void *rc;
 
 	if (WARN_ON(IS_ERR(data)))
 		data = NULL;
 
-	rc = xa_store(&device->client_data, client->client_id, data,
-		      GFP_KERNEL);
+	if (client->net_client)
+		cl_data = &device->net_client_data;
+	else
+		cl_data = &device->client_data;
+	rc = xa_store(cl_data, client->client_id, data, GFP_KERNEL);
 	WARN_ON(xa_is_err(rc));
 }
 EXPORT_SYMBOL(ib_set_client_data);
@@ -2523,20 +2682,27 @@ struct net_device *ib_get_net_dev_by_params(struct ib_device *dev,
 					    const struct sockaddr *addr)
 {
 	struct net_device *net_dev = NULL;
+	struct rdma_dev_net *init_rdn, *rdn;
 	unsigned long index;
 	void *client_data;
 
 	if (!rdma_protocol_ib(dev, port))
 		return NULL;
 
+	init_rdn = rdma_net_to_dev_net(&init_net);
+	rdn = rdma_net_to_dev_net(read_pnet(&dev->coredev.rdma_net));
 	/*
 	 * Holding the read side guarantees that the client will not become
 	 * unregistered while we are calling get_net_dev_by_params()
 	 */
 	down_read(&dev->client_data_rwsem);
+	/* First try all the non-net registered clients, and then the net
+	 * registered clients.
+	 */
 	xan_for_each_marked (&dev->client_data, index, client_data,
 			     CLIENT_DATA_REGISTERED) {
-		struct ib_client *client = xa_load(&clients, index);
+		struct ib_client *client = xa_load(&init_rdn->rdn_clients,
+						   index);
 
 		if (!client || !client->get_net_dev_by_params)
 			continue;
@@ -2546,6 +2712,22 @@ struct net_device *ib_get_net_dev_by_params(struct ib_device *dev,
 		if (net_dev)
 			break;
 	}
+	if (!net_dev) {
+		xan_for_each_marked(&dev->net_client_data, index, client_data,
+				     CLIENT_DATA_REGISTERED) {
+			struct ib_client *client = xa_load(&rdn->rdn_clients,
+							   index);
+
+			if (!client || !client->get_net_dev_by_params)
+				continue;
+
+			net_dev = client->get_net_dev_by_params(dev, port,
+								pkey, gid, addr,
+								client_data);
+			if (net_dev)
+				break;
+		}
+	}
 	up_read(&dev->client_data_rwsem);
 
 	return net_dev;
@@ -2749,6 +2931,12 @@ static int __init ib_core_init(void)
 
 	rdma_nl_init();
 
+	ret = register_pernet_device(&rdma_dev_net_ops);
+	if (ret) {
+		pr_warn("Couldn't init compat dev. ret %d\n", ret);
+		goto err_compat;
+	}
+
 	ret = addr_init();
 	if (ret) {
 		pr_warn("Couldn't init IB address resolution\n");
@@ -2773,12 +2961,6 @@ static int __init ib_core_init(void)
 		goto err_sa;
 	}
 
-	ret = register_pernet_device(&rdma_dev_net_ops);
-	if (ret) {
-		pr_warn("Couldn't init compat dev. ret %d\n", ret);
-		goto err_compat;
-	}
-
 	nldev_init();
 	rdma_nl_register(RDMA_NL_LS, ibnl_ls_cb_table);
 	roce_gid_mgmt_init();
@@ -2809,11 +2991,11 @@ static void __exit ib_core_cleanup(void)
 	roce_gid_mgmt_cleanup();
 	nldev_exit();
 	rdma_nl_unregister(RDMA_NL_LS);
-	unregister_pernet_device(&rdma_dev_net_ops);
 	unregister_blocking_lsm_notifier(&ibdev_lsm_nb);
 	ib_sa_cleanup();
 	ib_mad_cleanup();
 	addr_cleanup();
+	unregister_pernet_device(&rdma_dev_net_ops);
 	rdma_nl_exit();
 	class_unregister(&ib_class);
 	destroy_workqueue(ib_comp_unbound_wq);
@@ -2821,7 +3003,6 @@ static void __exit ib_core_cleanup(void)
 	/* Make sure that any pending umem accounting work is done. */
 	destroy_workqueue(ib_wq);
 	flush_workqueue(system_unbound_wq);
-	WARN_ON(!xa_empty(&clients));
 	WARN_ON(!xa_empty(&devices));
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c0b2fa7e9b95..1f3f497a870a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2729,6 +2729,9 @@ struct ib_device {
 	char iw_ifname[IFNAMSIZ];
 	u32 iw_driver_flags;
 	u32 lag_flags;
+
+	/* Also protected by client_data_rwsem */
+	struct xarray			net_client_data;
 };
 
 struct ib_client_nl_info;
@@ -2770,6 +2773,7 @@ struct ib_client {
 
 	/* kverbs are not required by the client */
 	u8 no_kverbs_req:1;
+	u8 net_client:1;
 };
 
 /*
@@ -2807,6 +2811,9 @@ void ib_unregister_device_queued(struct ib_device *ib_dev);
 int ib_register_client   (struct ib_client *client);
 void ib_unregister_client(struct ib_client *client);
 
+int rdma_register_net_client(struct net *net, struct ib_client *client);
+void rdma_unregister_net_client(struct net *net, struct ib_client *client);
+
 void __rdma_block_iter_start(struct ib_block_iter *biter,
 			     struct scatterlist *sglist,
 			     unsigned int nents,
@@ -2852,7 +2859,10 @@ rdma_block_iter_dma_address(struct ib_block_iter *biter)
 static inline void *ib_get_client_data(struct ib_device *device,
 				       struct ib_client *client)
 {
-	return xa_load(&device->client_data, client->client_id);
+	if (client->net_client) 
+		return xa_load(&device->net_client_data, client->client_id);
+	else
+		return xa_load(&device->client_data, client->client_id);
 }
 void  ib_set_client_data(struct ib_device *device, struct ib_client *client,
 			 void *data);

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-09-29 16:57                         ` RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device) Ka-Cheong Poon
@ 2020-09-29 17:40                           ` Jason Gunthorpe
  2020-09-30 10:32                             ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-09-29 17:40 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Wed, Sep 30, 2020 at 12:57:48AM +0800, Ka-Cheong Poon wrote:
> On 9/7/20 9:48 PM, Ka-Cheong Poon wrote:
> 
> > This may require a number of changes and the way a client interacts with
> > the current RDMA framework.  For example, currently a client registers
> > once using one struct ib_client and gets device notifications for all
> > namespaces and devices.  Suppose there is rdma_[un]register_net_client(),
> > it may need to require a client to use a different struct ib_client to
> > register for each net namespace.  And struct ib_client probably needs to
> > have a field to store the net namespace.  Probably all those client
> > interaction functions will need to be modified.  Since the clients xarray
> > is global, more clients may mean performance implication, such as it takes
> > longer to go through the whole clients xarray.
> > 
> > There are probably many other subtle changes required.  It may turn out to
> > be not so straightforward.  Is this community willing to take such changes?
> > I can take a stab at it if the community really thinks that this is preferred.
> 
> 
> Attached is a diff of a prototype for the above.  This exercise is
> to see what needs to be done to have a more network namespace aware
> interface for RDMA client registration.

An RDMA device is either in all namespaces or in a single
namespace. If a client has some interest in only some namespaces then
it should check the namespace during client registration and not
register if it isn't interested. No need to change anything in the
core code.

> Is the RDMA shared namespace mode the preferred mode to use as it is the
> default mode?  

Shared is the legacy mode, modern systems should switch to namespace
mode at early boot

> Is it expected that a client knows the running mode before
> interacting with the RDMA subsystem?  

Why would a client care?

> Is a client not supposed to differentiate different namespaces?

None do today.

> A new connection comes in and the event handler is called for an
> RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious namespace info regarding
> the event.  It seems that the only way to find out the namespace info is to
> use the context of struct rdma_cm_id.  

The rdma_cm_id has only a single namespace, the ULP knows what it is
because it created it. A listening ID can't spawn new IDs in different
namespaces.

> (*) Note that in __rdma_create_id(), it does a get_net(net) to put a
>     reference on a namespace.  Suppose a kernel module calls rdma_create_id()
>     in its namespace .init function to create an RDMA listener and calls
>     rdma_destroy_id() in its namespace .exit function to destroy it.  

Yes, namespaces remain until all objects touching them are deleted.

It seems like a ULP error to drive cm_id lifetime entirely from the
per-net stuff.

This would be similar to creating a socket in the kernel.

>     __rdma_create_id() adds a reference to a namespace, when a sys admin
>     deletes a namespace (say `ip netns del ...`), the namespace won't be
>     deleted because of this reference.  And the module will not release this
>     reference until its .exit function is called only when the namespace is
>     deleted.  To resolve this issue, in the diff (in __rdma_create_id()), I
>     did something similar to the kern check in sk_alloc().

What you are running into is that there is no kernel user of net
namespaces, all current ULPs exclusively use the init_net.

Without an example of what that is supposed to be like it is hard to
really have a discussion. You should reference other TCP users in the kernel
to see if someone has figured out how to make this work for TCP. It
should be basically the same.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-09-29 17:40                           ` Jason Gunthorpe
@ 2020-09-30 10:32                             ` Ka-Cheong Poon
  2020-10-02 14:04                               ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-09-30 10:32 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 9/30/20 1:40 AM, Jason Gunthorpe wrote:
> On Wed, Sep 30, 2020 at 12:57:48AM +0800, Ka-Cheong Poon wrote:
>> On 9/7/20 9:48 PM, Ka-Cheong Poon wrote:
>>
>>> This may require a number of changes and the way a client interacts with
>>> the current RDMA framework.  For example, currently a client registers
>>> once using one struct ib_client and gets device notifications for all
>>> namespaces and devices.  Suppose there is rdma_[un]register_net_client(),
>>> it may need to require a client to use a different struct ib_client to
>>> register for each net namespace.  And struct ib_client probably needs to
>>> have a field to store the net namespace.  Probably all those client
>>> interaction functions will need to be modified.  Since the clients xarray
>>> is global, more clients may mean performance implication, such as it takes
>>> longer to go through the whole clients xarray.
>>>
>>> There are probably many other subtle changes required.  It may turn out to
>>> be not so straightforward.  Is this community willing to take such changes?
>>> I can take a stab at it if the community really thinks that this is preferred.
>>
>>
>> Attached is a diff of a prototype for the above.  This exercise is
>> to see what needs to be done to have a more network namespace aware
>> interface for RDMA client registration.
> 
> An RDMA device is either in all namespaces or in a single
> namespace. If a client has some interest in only some namespaces then
> it should check the namespace during client registration and not
> register if it isn't interested. No need to change anything in the
> core code.


After the aforementioned check on a namespace, what can the client
do?  It still needs to use the existing ib_register_client() to
register with RDMA subsystem.  And after registration, it will get
notifications for all add/remove upcalls on devices not related
to the namespace it is interested in.  The client can work around
this if there is a supported way to find out the namespace of a
device, hence the original proposal of having rdma_dev_to_netns().
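
The filtering described above could be sketched as follows (a
hypothetical sketch only: rdma_dev_to_netns() is the proposed helper,
not an existing export, the add() callback signature varies between
kernel versions, and my_net plus the client name are made up for
illustration):

```c
/* Hypothetical sketch: a client declining devices outside its
 * namespace, assuming the proposed rdma_dev_to_netns() helper
 * were available.
 */
static struct net *my_net;	/* namespace this client cares about */

static int my_add_one(struct ib_device *device)
{
	/* Ignore devices that belong to some other namespace. */
	if (!net_eq(rdma_dev_to_netns(device), my_net))
		return -EOPNOTSUPP;

	/* ... normal per-device setup ... */
	return 0;
}

static struct ib_client my_client = {
	.name = "my_ns_client",
	.add  = my_add_one,
};
```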


>> Is the RDMA shared namespace mode the preferred mode to use as it is the
>> default mode?
> 
> Shared is the legacy mode, modern systems should switch to namespace
> mode at early boot


Thanks for the clarification.  I originally thought that the shared
mode was for supporting a large number of namespaces.  In the
exclusive mode, a device needs to be assigned to a namespace for
that namespace to use it.  If there are a large number of namespaces,
there won't be enough devices to assign to all of them (e.g. the
hardware I have access to only supports up to 24 VFs).  The shared
mode can be used in this case.  Could you please explain what needs
to be done to support a large number of namespaces in exclusive
mode?

BTW, if exclusive mode is the future, it may make sense to have
something like rdma_[un]register_net_client().


>> Is it expected that a client knows the running mode before
>> interacting with the RDMA subsystem?
> 
> Why would a client care?


Because it may want to behave differently.  For example, in shared
mode, it may want to create a shadow device structure to hold
per-namespace info for a device.  But in exclusive mode, a device can
only be in one namespace, there is no need to have such shadow
device structure.


>> Is a client not supposed to differentiate different namespaces?
> 
> None do today.


This is probably the case, as calling rdma_create_id() in the kernel can
prevent a namespace from being deleted.  There must be no client doing
that right now.  My code is using RDMA in a namespace, hence I'd
like to understand more about the RDMA subsystem's namespace support.
For example, what is the reason that the cma_wq is a global queue
shared by all namespaces instead of per namespace?  Is it expected
that the work load will be low enough for all namespaces such that
they will not interfere with each other?


>> A new connection comes in and the event handler is called for an
>> RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious namespace info regarding
>> the event.  It seems that the only way to find out the namespace info is to
>> use the context of struct rdma_cm_id.
> 
> The rdma_cm_id has only a single namespace, the ULP knows what it is
> because it created it. A listening ID can't spawn new IDs in different
> namespaces.


The problem is that the handler is not given the listener's
rdma_cm_id when it is called.  It is only given the new rdma_cm_id.
Do you mean that there is a way to find out the listener's rdma_cm_id
given the new rdma_cm_id?  But even if the listener's rdma_cm_id can
be found, what is the mechanism to find out that listener's
namespace in the handler?  The client may compare that
pointer with every listener it creates.  Is there a better way?


>> (*) Note that in __rdma_create_id(), it does a get_net(net) to put a
>>      reference on a namespace.  Suppose a kernel module calls rdma_create_id()
>>      in its namespace .init function to create an RDMA listener and calls
>>      rdma_destroy_id() in its namespace .exit function to destroy it.
> 
> Yes, namespaces remain until all objects touching them are deleted.
> 
> It seems like a ULP error to drive cm_id lifetime entirely from the
> per-net stuff.


It is not an ULP error.  While there are many reasons to delete
a listener, it is not necessary for the listener to die unless the
namespace is going away.


> This would be similar to creating a socket in the kernel.


Right, and a kernel socket does not prevent a namespace from being deleted.


>>      __rdma_create_id() adds a reference to a namespace, when a sys admin
>>      deletes a namespace (say `ip netns del ...`), the namespace won't be
>>      deleted because of this reference.  And the module will not release this
>>      reference until its .exit function is called only when the namespace is
>>      deleted.  To resolve this issue, in the diff (in __rdma_create_id()), I
>>      did something similar to the kern check in sk_alloc().
> 
> What you are running into is that there is no kernel user of net
> namespaces, all current ULPs exclusively use the init_net.
> 
> Without an example of what that is supposed to be like it is hard to
> really have a discussion. You should reference other TCP users in the kernel
> to see if someone has figured out how to make this work for TCP. It
> should be basically the same.


The kern check in sk_alloc() decides whether to hold a reference on
the namespace.  What is in the diff follows the same mechanism.
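
For context, the sk_alloc() mechanism being mirrored is roughly the
following (paraphrased from net/core/sock.c; the exact code varies by
kernel version):

```c
/* Paraphrased from sk_alloc(): a kernel socket (kern != 0) does not
 * pin the namespace, so it cannot block namespace deletion.
 */
sk->sk_net_refcnt = kern ? 0 : 1;
if (likely(sk->sk_net_refcnt))
	get_net(net);
```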


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-09-30 10:32                             ` Ka-Cheong Poon
@ 2020-10-02 14:04                               ` Jason Gunthorpe
  2020-10-05 10:27                                 ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-02 14:04 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Wed, Sep 30, 2020 at 06:32:28PM +0800, Ka-Cheong Poon wrote:
> After the aforementioned check on a namespace, what can the client
> do?  It still needs to use the existing ib_register_client() to
> register with RDMA subsystem.  And after registration, it will get
> notifications for all add/remove upcalls on devices not related
> to the namespace it is interested in.  The client can work around
> this if there is a supported way to find out the namespace of a
> device, hence the original proposal of having rdma_dev_to_netns().

Yes, the client would have to check the netns and abort client
registration.

Arguably many of our current clients are wrong in this area since they
only work on init_net anyhow.

It would make sense to introduce a rdma_dev_to_netns() and use it to
block clients on ULPs that use the CM outside init_net.

> that namespace to use it.  If there are a large number of namespaces,
> there won't be enough devices to assign to all of them (e.g. the
> hardware I have access to only supports up to 24 VFs).  The shared
> mode can be used in this case.  Could you please explain what needs
> to be done to support a large number of namespaces in exclusive
> mode?

Modern HW supports many more than 24 VFs, this is the expected
interface

> BTW, if exclusive mode is the future, it may make sense to have
> something like rdma_[un]register_net_client().

I don't think we need this

> > > A new connection comes in and the event handler is called for an
> > > RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious namespace info regarding
> > > the event.  It seems that the only way to find out the namespace info is to
> > > use the context of struct rdma_cm_id.
> > 
> > The rdma_cm_id has only a single namespace, the ULP knows what it is
> > because it created it. A listening ID can't spawn new IDs in different
> > namespaces.
> 
> The problem is that the handler is not given the listener's
> rdma_cm_id when it is called.  It is only given the new rdma_cm_id.

The new cm_id starts with the same ->context as the listener, the ULP should
use this to pass any data, such as the namespace.
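
That pattern might look like this (all names hypothetical; a sketch of
the suggestion, not code from any tree):

```c
/* Hypothetical sketch: the listener's ->context carries the namespace,
 * and the CONNECT_REQUEST handler reads it off the new cm_id, which
 * inherits ->context from the listening cm_id.
 */
struct my_listen_ctx {
	struct net *net;	/* namespace the listener was created in */
	void *ulp_data;		/* whatever else the ULP already stores */
};

static int my_cm_handler(struct rdma_cm_id *id,
			 struct rdma_cm_event *event)
{
	struct my_listen_ctx *ctx = id->context;

	if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) {
		struct net *net = ctx->net;
		/* ... namespace-specific accept logic ... */
	}
	return 0;
}
```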

> > It seems like a ULP error to drive cm_id lifetime entirely from the
> > per-net stuff.
>
> It is not an ULP error.  While there are many reasons to delete
> a listener, it is not necessary for the listener to die unless the
> namespace is going away.

It certainly currently is.

I'm skeptical ULPs should be doing per-ns stuff like that. A ns aware
ULP should fundamentally be linked to some FD and the ns to use should
derived from the process that FD is linked to. Keeping per-ns stuff
seems wrong.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-02 14:04                               ` Jason Gunthorpe
@ 2020-10-05 10:27                                 ` Ka-Cheong Poon
  2020-10-05 13:16                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-05 10:27 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 10/2/20 10:04 PM, Jason Gunthorpe wrote:
> On Wed, Sep 30, 2020 at 06:32:28PM +0800, Ka-Cheong Poon wrote:
>> After the aforementioned check on a namespace, what can the client
>> do?  It still needs to use the existing ib_register_client() to
>> register with RDMA subsystem.  And after registration, it will get
>> notifications for all add/remove upcalls on devices not related
>> to the namespace it is interested in.  The client can work around
>> this if there is a supported way to find out the namespace of a
>> device, hence the original proposal of having rdma_dev_to_netns().
> 
> Yes, the client would have to check the netns and abort client
> registration.
> 
> Arguably many of our current clients are wrong in this area since they
> only work on init_net anyhow.
> 
> It would make sense to introduce a rdma_dev_to_netns() and use it to
> block clients on ULPs that use the CM outside init_net.


Will send a simple patch for this.


>> that namespace to use it.  If there are a large number of namespaces,
>> there won't be enough devices to assign to all of them (e.g. the
>> hardware I have access to only supports up to 24 VFs).  The shared
>> mode can be used in this case.  Could you please explain what needs
>> to be done to support a large number of namespaces in exclusive
>> mode?
> 
> Modern HW supports many more than 24 VFs, this is the expected
> interface


Do you have a ballpark on how many VFs are supported?  Is it in
the range of many thousands?

BTW, while the shared mode is still here, can there be a simple
way for a client to find out which mode the RDMA subsystem is using?


>> BTW, if exclusive mode is the future, it may make sense to have
>> something like rdma_[un]register_net_client().
> 
> I don't think we need this
> 
>>>> A new connection comes in and the event handler is called for an
>>>> RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious namespace info regarding
>>>> the event.  It seems that the only way to find out the namespace info is to
>>>> use the context of struct rdma_cm_id.
>>>
>>> The rdma_cm_id has only a single namespace, the ULP knows what it is
>>> because it created it. A listening ID can't spawn new IDs in different
>>> namespaces.
>>
>> The problem is that the handler is not given the listener's
>> rdma_cm_id when it is called.  It is only given the new rdma_cm_id.
> 
> The new cm_id starts with the same ->context as the listener, the ULP should
> use this to pass any data, such as the namespace.


This is what I suspected as mentioned in the previous email.  But
this makes it inconvenient if the context is already used for
something else.


>>> It seems like a ULP error to drive cm_id lifetime entirely from the
>>> per-net stuff.
>>
>> It is not an ULP error.  While there are many reasons to delete
>> a listener, it is not necessary for the listener to die unless the
>> namespace is going away.
> 
> It certainly currently is.
> 
> I'm skeptical ULPs should be doing per-ns stuff like that. A ns aware
> ULP should fundamentally be linked to some FD and the ns to use should
> be derived from the process that FD is linked to. Keeping per-ns stuff
> seems wrong.


It is a kernel module.  Which FD are you referring to?  It is
unclear why a kernel module must associate itself with a user
space FD.  Is there a particular reason that rdma_create_id()
needs to behave differently than sock_create_kern() in this
regard?

While discussing per-namespace stuff, what is the reason
that the cma_wq is a global queue shared by all namespaces instead of
per namespace?  Is there a problem with having a per-namespace cma_wq?


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-05 10:27                                 ` Ka-Cheong Poon
@ 2020-10-05 13:16                                   ` Jason Gunthorpe
  2020-10-05 13:57                                     ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-05 13:16 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Mon, Oct 05, 2020 at 06:27:39PM +0800, Ka-Cheong Poon wrote:
> On 10/2/20 10:04 PM, Jason Gunthorpe wrote:
> > > that namespace to use it.  If there are a large number of namespaces,
> > > there won't be enough devices to assign to all of them (e.g. the
> > > hardware I have access to only supports up to 24 VFs).  The shared
> > > mode can be used in this case.  Could you please explain what needs
> > > to be done to support a large number of namespaces in exclusive
> > > mode?
> > 
> > Modern HW supports many more than 24 VFs, this is the expected
> > interface
> 
> Do you have a ballpark on how many VFs are supported?  Is it in
> the range of many thousands?

Yes

> BTW, while the shared mode is still here, can there be a simple
> way for a client to find out which mode the RDMA subsystem is using?

Return NULL for the namespace

> > The new cm_id starts with the same ->context as the listener, the ULP should
> > use this to pass any data, such as the namespace.
> 
> This is what I suspected as mentioned in the previous email.  But
> this makes it inconvenient if the context is already used for
> something else.

Don't see why. The context should be allocated memory, so the ULP can
put several things in there.

> > I'm skeptical ULPs should be doing per-ns stuff like that. A ns aware
> > ULP should fundamentally be linked to some FD and the ns to use should
> > be derived from the process that FD is linked to. Keeping per-ns stuff
> > seems wrong.
> 
> 
> It is a kernel module.  Which FD are you referring to?  It is
> unclear why a kernel module must associate itself with a user
> space FD.  Is there a particular reason that rdma_create_id()
> needs to behave differently than sock_create_kern() in this
> regard?

Somehow the kernel module has to be commanded to use this namespace,
and generally I expect that command to be connected to an FD.

We don't have many use cases where the kernel operates namespaces
independently...

> While discussing per-namespace stuff, what is the reason
> that the cma_wq is a global queue shared by all namespaces instead of
> per namespace?  Is there a problem with having a per-namespace cma_wq?

Why would we want to do that?

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-05 13:16                                   ` Jason Gunthorpe
@ 2020-10-05 13:57                                     ` Ka-Cheong Poon
  2020-10-05 14:25                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-05 13:57 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 10/5/20 9:16 PM, Jason Gunthorpe wrote:
> On Mon, Oct 05, 2020 at 06:27:39PM +0800, Ka-Cheong Poon wrote:
>> On 10/2/20 10:04 PM, Jason Gunthorpe wrote:
>>>> that namespace to use it.  If there are a large number of namespaces,
>>>> there won't be enough devices to assign to all of them (e.g. the
>>>> hardware I have access to only supports up to 24 VFs).  The shared
>>>> mode can be used in this case.  Could you please explain what needs
>>>> to be done to support a large number of namespaces in exclusive
>>>> mode?
>>>
>>> Modern HW supports many more than 24 VFs, this is the expected
>>> interface
>>
>> Do you have a ballpark on how many VFs are supported?  Is it in
>> the range of many thousands?
> 
> Yes
> 
>> BTW, while the shared mode is still here, can there be a simple
>> way for a client to find out which mode the RDMA subsystem is using?
> 
> Return NULL for the namespace


OK, will add that to rdma_dev_to_netns().
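
Combining the two points, the helper under discussion might look like
this (a sketch, not merged code; ib_devices_shared_netns is the
existing shared-mode flag in drivers/infiniband/core/device.c):

```c
/*
 * Sketch of the proposed helper: return the device's namespace, or
 * NULL when the RDMA subsystem runs in shared (legacy) mode.
 */
struct net *rdma_dev_to_netns(struct ib_device *ib_dev)
{
	if (ib_devices_shared_netns)
		return NULL;
	return read_pnet(&ib_dev->coredev.rdma_net);
}
```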


>>> The new cm_id starts with the same ->context as the listener, the ULP should
>>> use this to pass any data, such as the namespace.
>>
>> This is what I suspected as mentioned in the previous email.  But
>> this makes it inconvenient if the context is already used for
>> something else.
> 
> Don't see why. The context should be allocated memory, so the ULP can
> put several things in there.
> 
>>> I'm skeptical ULPs should be doing per-ns stuff like that. A ns aware
>>> ULP should fundamentally be linked to some FD and the ns to use should
>>> be derived from the process that FD is linked to. Keeping per-ns stuff
>>> seems wrong.
>>
>>
>> It is a kernel module.  Which FD are you referring to?  It is
>> unclear why a kernel module must associate itself with a user
>> space FD.  Is there a particular reason that rdma_create_id()
>> needs to behave differently than sock_create_kern() in this
>> regard?
> 
> Somehow the kernel module has to be commanded to use this namespace,
> and generally I expect that command to be connected to FD.


It is an unnecessary restriction on what a kernel module
can do.  Is it a problem if a kernel module initiates its
own RDMA connection for doing various stuff in a namespace?
Any kernel module can initiate a TCP connection to do various
stuff without worrying about the namespace deletion problem.
It does not cause a problem AFAICT.  If the module needs to
make sure that the namespace does not go away, it can add its
own reference.  Is there a particular reason that the RDMA
subsystem needs to behave differently?
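
A minimal sketch of the reference counting being described (the
my_ulp_* names are hypothetical, not an existing ULP):

```c
/* Hypothetical ULP context that pins the namespace it works in. */
struct my_ulp_ctx {
	struct net *net;	/* reference held via get_net() */
	struct rdma_cm_id *cm_id;
};

static void my_ulp_start(struct my_ulp_ctx *ctx, struct net *net)
{
	/* Take a reference so the namespace cannot be freed under us. */
	ctx->net = get_net(net);
}

static void my_ulp_stop(struct my_ulp_ctx *ctx)
{
	/* Drop the reference; the namespace may now be deleted. */
	put_net(ctx->net);
}
```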


> We don't have many use cases where the kernel operates namespaces
> independently..


FWIW, I am adding code to do that.  It works fine using a
TCP kernel socket.  It has the namespace deletion problem
with an RDMA connection.


>> While discussing about per namespace stuff, what is the reason
>> that the cma_wq is a global shared by all namespaces instead of
>> per namespace?  Is there a problem to have a per namespace cma_wq?
> 
> Why would we want to do that?


For scalability and namespace separation reasons, as cma_wq is
single-threaded.  For example, there can be a lot of work to be
done in one namespace, but this should not have an adverse effect
on other namespaces (as long as there are resources available).
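
For illustration, a per-namespace cma_wq could be allocated from
the existing cma_pernet init path roughly like this (a sketch; the
wq member does not exist in today's struct cma_pernet, and the
function names are invented):

```c
/* Sketch: give each namespace its own ordered CM event workqueue. */
static int cma_pernet_wq_init(struct net *net)
{
	struct cma_pernet *pernet = net_generic(net, cma_pernet_id);

	/* One single-threaded queue per namespace, named by ns inode. */
	pernet->wq = alloc_ordered_workqueue("rdma_cm_%u", 0, net->ns.inum);
	if (!pernet->wq)
		return -ENOMEM;
	return 0;
}

static void cma_pernet_wq_exit(struct net *net)
{
	struct cma_pernet *pernet = net_generic(net, cma_pernet_id);

	destroy_workqueue(pernet->wq);
}
```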


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-05 13:57                                     ` Ka-Cheong Poon
@ 2020-10-05 14:25                                       ` Jason Gunthorpe
  2020-10-05 15:02                                         ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-05 14:25 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Mon, Oct 05, 2020 at 09:57:47PM +0800, Ka-Cheong Poon wrote:
> > > It is a kernel module.  Which FD are you referring to?  It is
> > > unclear why a kernel module must associate itself with a user
> > > space FD.  Is there a particular reason that rdma_create_id()
> > > needs to behave differently than sock_create_kern() in this
> > > regard?
> > 
> > Somehow the kernel module has to be commanded to use this namespace,
> > and generally I expect that command to be connected to FD.
> 
> 
> It is an unnecessary restriction on what a kernel module
> can do.  Is it a problem if a kernel module initiates its
> own RDMA connection for doing various stuff in a namespace?

Yes, someone has to apply policy to authorize this. Kernel modules
randomly running around using security objects is not OK.

Kernel modules should not be doing networking unless commanded to by
userspace.

> Any kernel module can initiate a TCP connection to do various
> stuff without worrying about namespace deletion problem.  It
> does not cause a problem AFAICT.  If the module needs to make
> sure that the namespace does not go away, it can add its own
> reference.  Is there a particular reason that RDMA subsystem
> needs to behave differently?

We don't have those kinds of ULPs.

> For scalability and namespace separation reasons as cma_wq is
> single threaded.  For example, there can be many work to be done
> in one namespace.  But this should not have an adverse effect on
> other namespaces (as long as there are resources available).

This is a design issue of the cma_wq, it can be reworked to not
be single-threaded, nothing to do with namespaces

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-05 14:25                                       ` Jason Gunthorpe
@ 2020-10-05 15:02                                         ` Ka-Cheong Poon
  2020-10-05 15:45                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-05 15:02 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 10/5/20 10:25 PM, Jason Gunthorpe wrote:
> On Mon, Oct 05, 2020 at 09:57:47PM +0800, Ka-Cheong Poon wrote:
>>>> It is a kernel module.  Which FD are you referring to?  It is
>>>> unclear why a kernel module must associate itself with a user
>>>> space FD.  Is there a particular reason that rdma_create_id()
>>>> needs to behave differently than sock_create_kern() in this
>>>> regard?
>>>
>>> Somehow the kernel module has to be commanded to use this namespace,
>>> and generally I expect that command to be connected to FD.
>>
>>
>> It is an unnecessary restriction on what a kernel module
>> can do.  Is it a problem if a kernel module initiates its
>> own RDMA connection for doing various stuff in a namespace?
> 
> Yes, someone has to apply policy to authorize this. Kernel modules
> randomly running around using security objects is not OK.


The policy is to allow this.  It is not random stuff.
Can the RDMA subsystem support it?


> Kernel modules should not be doing networking unless commanded to by
> userspace.


It is still not clear why this is an issue with RDMA
connection, but not with general kernel socket.  It is
not random networking.  There is a purpose.


>> Any kernel module can initiate a TCP connection to do various
>> stuff without worrying about namespace deletion problem.  It
>> does not cause a problem AFAICT.  If the module needs to make
>> sure that the namespace does not go away, it can add its own
>> reference.  Is there a particular reason that RDMA subsystem
>> needs to behave differently?
> 
> We don't have those kinds of ULPs.


So if the reason for the current rdma_create_id() behavior
is that there is no such user, I am adding one.  It should
be clear that this difference between kernel socket and
rdma_create_id() causes a problem in namespace handling.


>> For scalability and namespace separation reasons as cma_wq is
>> single threaded.  For example, there can be many work to be done
>> in one namespace.  But this should not have an adverse effect on
>> other namespaces (as long as there are resources available).
> 
> This is a design issue of the cma_wq, it can be reworked to not need
> single threaded, nothing to do with namespaces


As mentioned, there are at least two parts.  The above is
on scalability.  There is also the namespace separation reason.
The goal is to make sure that the processing of one namespace
does not have an unwanted (positive or negative) effect on the
processing of other namespaces.  If the cma_wq is redesigned,
the number of namespaces should be one input parameter in
deciding how many threads to create and how other resources
are allocated and scheduled.  One cma_wq per namespace is the
simplest allocation.


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-05 15:02                                         ` Ka-Cheong Poon
@ 2020-10-05 15:45                                           ` Jason Gunthorpe
  2020-10-06  9:36                                             ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-05 15:45 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Mon, Oct 05, 2020 at 11:02:18PM +0800, Ka-Cheong Poon wrote:
> On 10/5/20 10:25 PM, Jason Gunthorpe wrote:
> > On Mon, Oct 05, 2020 at 09:57:47PM +0800, Ka-Cheong Poon wrote:
> > > > > It is a kernel module.  Which FD are you referring to?  It is
> > > > > unclear why a kernel module must associate itself with a user
> > > > > space FD.  Is there a particular reason that rdma_create_id()
> > > > > needs to behave differently than sock_create_kern() in this
> > > > > regard?
> > > > 
> > > > Somehow the kernel module has to be commanded to use this namespace,
> > > > and generally I expect that command to be connected to FD.
> > > 
> > > 
> > > It is an unnecessary restriction on what a kernel module
> > > can do.  Is it a problem if a kernel module initiates its
> > > own RDMA connection for doing various stuff in a namespace?
> > 
> > Yes, someone has to apply policy to authorize this. Kernel modules
> > randomly running around using security objects is not OK.
> 
> The policy is to allow this.  It is not random stuff.
> Can the RDMA subsystem support it?

allow everything is not a policy
 
> > Kernel modules should not be doing networking unless commanded to by
> > userspace.
> 
> It is still not clear why this is an issue with RDMA
> connection, but not with general kernel socket.  It is
> not random networking.  There is a purpose.

It is a problem with sockets too, how do the socket users trigger
their socket usage?  AFAIK all cases originate in userspace

> So if the reason of the current rdma_create_id() behavior
> is that there is no such user, I am adding one.  It should
> be clear that this difference between kernel socket and
> rdma_create_id() causes a problem in namespace handling.

It would be helpful to understand how that works, as I've said I don't
think a kernel module should open listening sockets/cm_ids on every
namespace without being told to do this.

> If the cma_wq is re-designed, number of namespaces should be one
> input parameter on creating how many threads and other resources
> allocation/scheduling.  One cma_wq per namespace is the simplest
> allocation.

no, it will just run all CM_IDs concurrently on all processors.

Namespaces are not cgroups, we don't guarantee anything about
resource consumption for namespaces.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-05 15:45                                           ` Jason Gunthorpe
@ 2020-10-06  9:36                                             ` Ka-Cheong Poon
  2020-10-06 12:46                                               ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-06  9:36 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 10/5/20 11:45 PM, Jason Gunthorpe wrote:
> On Mon, Oct 05, 2020 at 11:02:18PM +0800, Ka-Cheong Poon wrote:
>> On 10/5/20 10:25 PM, Jason Gunthorpe wrote:
>>> On Mon, Oct 05, 2020 at 09:57:47PM +0800, Ka-Cheong Poon wrote:
>>>>>> It is a kernel module.  Which FD are you referring to?  It is
>>>>>> unclear why a kernel module must associate itself with a user
>>>>>> space FD.  Is there a particular reason that rdma_create_id()
>>>>>> needs to behave differently than sock_create_kern() in this
>>>>>> regard?
>>>>>
>>>>> Somehow the kernel module has to be commanded to use this namespace,
>>>>> and generally I expect that command to be connected to FD.
>>>>
>>>>
>>>> It is an unnecessary restriction on what a kernel module
>>>> can do.  Is it a problem if a kernel module initiates its
>>>> own RDMA connection for doing various stuff in a namespace?
>>>
>>> Yes, someone has to apply policy to authorize this. Kernel modules
>>> randomly running around using security objects is not OK.
>>
>> The policy is to allow this.  It is not random stuff.
>> Can the RDMA subsystem support it?
> 
> allow everything is not a policy


It is not allowing everything.  It is the simple case of a
kernel module having a listener without the namespace issue.
A kernel socket does not have this problem.


>>> Kernel modules should not be doing networking unless commanded to by
>>> userspace.
>>
>> It is still not clear why this is an issue with RDMA
>> connection, but not with general kernel socket.  It is
>> not random networking.  There is a purpose.
> 
> It is a problem with sockets too, how do the socket users trigger
> their socket usages? AFAIK all cases originate with userspace


A user starts a namespace.  The module is loaded for servicing
requests.  The module starts a listener.  The user deletes
the namespace.  This scenario will have everything cleaned up
properly if the listener is a kernel socket.  This is not the
case with RDMA.
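
The socket flow described above is typically wired up with pernet
hooks; a rough sketch (the my_mod_* names are hypothetical, and
error handling is abbreviated):

```c
struct my_mod_pernet {
	struct socket *listener;
};

static unsigned int my_mod_net_id;

static int __net_init my_mod_net_init(struct net *net)
{
	struct my_mod_pernet *pn = net_generic(net, my_mod_net_id);
	int err;

	/* The listener is created when the namespace comes up... */
	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP,
			       &pn->listener);
	if (err)
		return err;
	err = kernel_listen(pn->listener, SOMAXCONN);
	if (err)
		sock_release(pn->listener);
	return err;
}

static void __net_exit my_mod_net_exit(struct net *net)
{
	struct my_mod_pernet *pn = net_generic(net, my_mod_net_id);

	/* ...and released automatically on namespace deletion. */
	sock_release(pn->listener);
}

static struct pernet_operations my_mod_net_ops = {
	.init = my_mod_net_init,
	.exit = my_mod_net_exit,
	.id   = &my_mod_net_id,
	.size = sizeof(struct my_mod_pernet),
};
/* register_pernet_subsys(&my_mod_net_ops) at module load time */
```

With rdma_create_id() there is no equivalent exit hook that tears
the listener down, which is the asymmetry being discussed.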


>> So if the reason of the current rdma_create_id() behavior
>> is that there is no such user, I am adding one.  It should
>> be clear that this difference between kernel socket and
>> rdma_create_id() causes a problem in namespace handling.
> 
> It would be helpful to understand how that works, as I've said I don't
> think a kernel module should open listening sockets/cm_ids on every
> namespace without being told to do this.


The issue is not about starting a listener.  The issue is on
namespace deletion.




-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-06  9:36                                             ` Ka-Cheong Poon
@ 2020-10-06 12:46                                               ` Jason Gunthorpe
  2020-10-07  8:38                                                 ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-06 12:46 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:

> > > > Kernel modules should not be doing networking unless commanded to by
> > > > userspace.
> > > 
> > > It is still not clear why this is an issue with RDMA
> > > connection, but not with general kernel socket.  It is
> > > not random networking.  There is a purpose.
> > 
> > It is a problem with sockets too, how do the socket users trigger
> > their socket usages? AFAIK all cases originate with userspace
> 
> A user starts a namespace.  The module is loaded for servicing
> requests.  The module starts a listener.  The user deletes
> the namespace.  This scenario will have everything cleaned up
> properly if the listener is a kernel socket.  This is not the
> case with RDMA.

Please point to reputable code in upstream doing this

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-06 12:46                                               ` Jason Gunthorpe
@ 2020-10-07  8:38                                                 ` Ka-Cheong Poon
  2020-10-07 11:16                                                   ` Leon Romanovsky
  2020-10-07 12:28                                                   ` Jason Gunthorpe
  0 siblings, 2 replies; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-07  8:38 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
> On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
> 
>>>>> Kernel modules should not be doing networking unless commanded to by
>>>>> userspace.
>>>>
>>>> It is still not clear why this is an issue with RDMA
>>>> connection, but not with general kernel socket.  It is
>>>> not random networking.  There is a purpose.
>>>
>>> It is a problem with sockets too, how do the socket users trigger
>>> their socket usages? AFAIK all cases originate with userspace
>>
>> A user starts a namespace.  The module is loaded for servicing
>> requests.  The module starts a listener.  The user deletes
>> the namespace.  This scenario will have everything cleaned up
>> properly if the listener is a kernel socket.  This is not the
>> case with RDMA.
> 
> Please point to reputable code in upstream doing this


It is not clear what "reputable" here really means.  If it just
means something in the kernel, then nearly all, if not all,
Internet protocol code in the kernel creates a control kernel
socket for every network namespace.  That socket is deleted in
the per-namespace exit function.  If it explicitly means a
listening socket, AFS and TIPC in the kernel do that for every
namespace.  That socket is deleted in the per-namespace exit
function.

It is very common for a network protocol to have something like
this for protocol processing.  It is not clear why RDMA subsystem
behaves differently and forbids this common practice.  Could you
please elaborate the issues this practice has such that the RDMA
subsystem cannot support it?



-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-07  8:38                                                 ` Ka-Cheong Poon
@ 2020-10-07 11:16                                                   ` Leon Romanovsky
  2020-10-08 10:22                                                     ` Ka-Cheong Poon
  2020-10-07 12:28                                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2020-10-07 11:16 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Jason Gunthorpe, linux-rdma

On Wed, Oct 07, 2020 at 04:38:45PM +0800, Ka-Cheong Poon wrote:
> On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
> > On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
> >
> > > > > > Kernel modules should not be doing networking unless commanded to by
> > > > > > userspace.
> > > > >
> > > > > It is still not clear why this is an issue with RDMA
> > > > > connection, but not with general kernel socket.  It is
> > > > > not random networking.  There is a purpose.
> > > >
> > > > It is a problem with sockets too, how do the socket users trigger
> > > > their socket usages? AFAIK all cases originate with userspace
> > >
> > > A user starts a namespace.  The module is loaded for servicing
> > > requests.  The module starts a listener.  The user deletes
> > > the namespace.  This scenario will have everything cleaned up
> > > properly if the listener is a kernel socket.  This is not the
> > > case with RDMA.
> >
> > Please point to reputable code in upstream doing this
>
>
> It is not clear what "reputable" here really means.  If it just
> means something in kernel, then nearly all, if not all, Internet
> protocols code in kernel create a control kernel socket for every
> network namespaces.  That socket is deleted in the per namespace
> exit function.  If it explicitly means listening socket, AFS and
> TIPC in kernel do that for every namespaces.  That socket is
> deleted in the per namespace exit function.
>
> It is very common for a network protocol to have something like
> this for protocol processing.  It is not clear why RDMA subsystem
> behaves differently and forbids this common practice.  Could you
> please elaborate the issues this practice has such that the RDMA
> subsystem cannot support it?

Just curious, are we talking about theoretical thing here or do you
have concrete and upstream ULP code to present?

Thanks

>
>
>
> --
> K. Poon
> ka-cheong.poon@oracle.com
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-07  8:38                                                 ` Ka-Cheong Poon
  2020-10-07 11:16                                                   ` Leon Romanovsky
@ 2020-10-07 12:28                                                   ` Jason Gunthorpe
  2020-10-08 10:49                                                     ` Ka-Cheong Poon
  1 sibling, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-07 12:28 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: linux-rdma

On Wed, Oct 07, 2020 at 04:38:45PM +0800, Ka-Cheong Poon wrote:
> On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
> > On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
> > 
> > > > > > Kernel modules should not be doing networking unless commanded to by
> > > > > > userspace.
> > > > > 
> > > > > It is still not clear why this is an issue with RDMA
> > > > > connection, but not with general kernel socket.  It is
> > > > > not random networking.  There is a purpose.
> > > > 
> > > > It is a problem with sockets too, how do the socket users trigger
> > > > their socket usages? AFAIK all cases originate with userspace
> > > 
> > > A user starts a namespace.  The module is loaded for servicing
> > > requests.  The module starts a listener.  The user deletes
> > > the namespace.  This scenario will have everything cleaned up
> > > properly if the listener is a kernel socket.  This is not the
> > > case with RDMA.
> > 
> > Please point to reputable code in upstream doing this
> 
> 
> It is not clear what "reputable" here really means.  If it just
> means something in kernel, then nearly all, if not all, Internet
> protocols code in kernel create a control kernel socket for every
> network namespaces.  That socket is deleted in the per namespace
> exit function.  If it explicitly means listening socket, AFS and
> TIPC in kernel do that for every namespaces.  That socket is
> deleted in the per namespace exit function.

AFS and TIPC are not exactly well reviewed mainstream areas.

> It is very common for a network protocol to have something like
> this for protocol processing.  It is not clear why RDMA subsystem
> behaves differently and forbids this common practice.  Could you
> please elaborate the issues this practice has such that the RDMA
> subsystem cannot support it?

The kernel should not have rogue listening sockets just because a
module is loaded.  Creation of listening kernel-side sockets should
be triggered by userspace.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-07 11:16                                                   ` Leon Romanovsky
@ 2020-10-08 10:22                                                     ` Ka-Cheong Poon
  2020-10-08 10:36                                                       ` Leon Romanovsky
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-08 10:22 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma

On 10/7/20 7:16 PM, Leon Romanovsky wrote:
> On Wed, Oct 07, 2020 at 04:38:45PM +0800, Ka-Cheong Poon wrote:
>> On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
>>> On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
>>>
>>>>>>> Kernel modules should not be doing networking unless commanded to by
>>>>>>> userspace.
>>>>>>
>>>>>> It is still not clear why this is an issue with RDMA
>>>>>> connection, but not with general kernel socket.  It is
>>>>>> not random networking.  There is a purpose.
>>>>>
>>>>> It is a problem with sockets too, how do the socket users trigger
>>>>> their socket usages? AFAIK all cases originate with userspace
>>>>
>>>> A user starts a namespace.  The module is loaded for servicing
>>>> requests.  The module starts a listener.  The user deletes
>>>> the namespace.  This scenario will have everything cleaned up
>>>> properly if the listener is a kernel socket.  This is not the
>>>> case with RDMA.
>>>
>>> Please point to reputable code in upstream doing this
>>
>>
>> It is not clear what "reputable" here really means.  If it just
>> means something in kernel, then nearly all, if not all, Internet
>> protocols code in kernel create a control kernel socket for every
>> network namespaces.  That socket is deleted in the per namespace
>> exit function.  If it explicitly means listening socket, AFS and
>> TIPC in kernel do that for every namespaces.  That socket is
>> deleted in the per namespace exit function.
>>
>> It is very common for a network protocol to have something like
>> this for protocol processing.  It is not clear why RDMA subsystem
>> behaves differently and forbids this common practice.  Could you
>> please elaborate the issues this practice has such that the RDMA
>> subsystem cannot support it?
> 
> Just curious, are we talking about theoretical thing here or do you
> have concrete and upstream ULP code to present?


As I mentioned in a previous email, I have running code.
Otherwise, why would I go to such great lengths to find
out what is missing in the RDMA subsystem in supporting
kernel namespace usage?



-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-08 10:22                                                     ` Ka-Cheong Poon
@ 2020-10-08 10:36                                                       ` Leon Romanovsky
  2020-10-08 11:08                                                         ` Ka-Cheong Poon
  0 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2020-10-08 10:36 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Jason Gunthorpe, linux-rdma

On Thu, Oct 08, 2020 at 06:22:03PM +0800, Ka-Cheong Poon wrote:
> On 10/7/20 7:16 PM, Leon Romanovsky wrote:
> > On Wed, Oct 07, 2020 at 04:38:45PM +0800, Ka-Cheong Poon wrote:
> > > On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
> > > > On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
> > > >
> > > > > > > > Kernel modules should not be doing networking unless commanded to by
> > > > > > > > userspace.
> > > > > > >
> > > > > > > It is still not clear why this is an issue with RDMA
> > > > > > > connection, but not with general kernel socket.  It is
> > > > > > > not random networking.  There is a purpose.
> > > > > >
> > > > > > It is a problem with sockets too, how do the socket users trigger
> > > > > > their socket usages? AFAIK all cases originate with userspace
> > > > >
> > > > > A user starts a namespace.  The module is loaded for servicing
> > > > > requests.  The module starts a listener.  The user deletes
> > > > > the namespace.  This scenario will have everything cleaned up
> > > > > properly if the listener is a kernel socket.  This is not the
> > > > > case with RDMA.
> > > >
> > > > Please point to reputable code in upstream doing this
> > >
> > >
> > > It is not clear what "reputable" here really means.  If it just
> > > means something in kernel, then nearly all, if not all, Internet
> > > protocols code in kernel create a control kernel socket for every
> > > network namespaces.  That socket is deleted in the per namespace
> > > exit function.  If it explicitly means listening socket, AFS and
> > > TIPC in kernel do that for every namespaces.  That socket is
> > > deleted in the per namespace exit function.
> > >
> > > It is very common for a network protocol to have something like
> > > this for protocol processing.  It is not clear why RDMA subsystem
> > > behaves differently and forbids this common practice.  Could you
> > > please elaborate the issues this practice has such that the RDMA
> > > subsystem cannot support it?
> >
> > Just curious, are we talking about theoretical thing here or do you
> > have concrete and upstream ULP code to present?
>
>
> As I mentioned in a previous email, I have running code.
> Otherwise, why would I go to such great length to find
> out what is missing in the RDMA subsystem in supporting
> kernel namespace usage.

So why don't you post this running code?

Thanks

>
>
>
> --
> K. Poon
> ka-cheong.poon@oracle.com
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-07 12:28                                                   ` Jason Gunthorpe
@ 2020-10-08 10:49                                                     ` Ka-Cheong Poon
  0 siblings, 0 replies; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-08 10:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

On 10/7/20 8:28 PM, Jason Gunthorpe wrote:
> On Wed, Oct 07, 2020 at 04:38:45PM +0800, Ka-Cheong Poon wrote:
>> On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
>>> On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
>>>
>>>>>>> Kernel modules should not be doing networking unless commanded to by
>>>>>>> userspace.
>>>>>>
>>>>>> It is still not clear why this is an issue with RDMA
>>>>>> connection, but not with general kernel socket.  It is
>>>>>> not random networking.  There is a purpose.
>>>>>
>>>>> It is a problem with sockets too, how do the socket users trigger
>>>>> their socket usages? AFAIK all cases originate with userspace
>>>>
>>>> A user starts a namespace.  The module is loaded for servicing
>>>> requests.  The module starts a listener.  The user deletes
>>>> the namespace.  This scenario will have everything cleaned up
>>>> properly if the listener is a kernel socket.  This is not the
>>>> case with RDMA.
>>>
>>> Please point to reputable code in upstream doing this
>>
>>
>> It is not clear what "reputable" here really means.  If it just
>> means something in kernel, then nearly all, if not all, Internet
>> protocols code in kernel create a control kernel socket for every
>> network namespaces.  That socket is deleted in the per namespace
>> exit function.  If it explicitly means listening socket, AFS and
>> TIPC in kernel do that for every namespaces.  That socket is
>> deleted in the per namespace exit function.
> 
> AFS and TIPC are not exactly well reviewed mainstream areas.


How about all the other Internet protocol code?  They all
create a kernel socket without user interaction.  If it is
using rdma_create_id(), it will prevent a namespace from
being deleted.


>> It is very common for a network protocol to have something like
>> this for protocol processing.  It is not clear why RDMA subsystem
>> behaves differently and forbids this common practice.  Could you
>> please elaborate the issues this practice has such that the RDMA
>> subsystem cannot support it?
> 
> The kernel should not have rogue listening sockets just because a
> module is loaded. Creation of listening kernel-side sockets should be
> triggered by userspace.


It is unclear why the socket is "rogue".  A sys admin loads a
kernel module for a reason.  It cannot be randomly loaded by
itself.  In this respect, it is not different from a user space
daemon.  No one will describe a listening socket started by a daemon
when it starts as "rogue".  Why is a listening socket started by a
kernel module "rogue"?  If a user is remote, without the listening
socket, how can anything work in the first place?



-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-08 10:36                                                       ` Leon Romanovsky
@ 2020-10-08 11:08                                                         ` Ka-Cheong Poon
  2020-10-08 16:08                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-08 11:08 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma

On 10/8/20 6:36 PM, Leon Romanovsky wrote:
> On Thu, Oct 08, 2020 at 06:22:03PM +0800, Ka-Cheong Poon wrote:
>> On 10/7/20 7:16 PM, Leon Romanovsky wrote:
>>> On Wed, Oct 07, 2020 at 04:38:45PM +0800, Ka-Cheong Poon wrote:
>>>> On 10/6/20 8:46 PM, Jason Gunthorpe wrote:
>>>>> On Tue, Oct 06, 2020 at 05:36:32PM +0800, Ka-Cheong Poon wrote:
>>>>>
>>>>>>>>> Kernel modules should not be doing networking unless commanded to by
>>>>>>>>> userspace.
>>>>>>>>
>>>>>>>> It is still not clear why this is an issue with RDMA
>>>>>>>> connection, but not with general kernel socket.  It is
>>>>>>>> not random networking.  There is a purpose.
>>>>>>>
>>>>>>> It is a problem with sockets too, how do the socket users trigger
>>>>>>> their socket usages? AFAIK all cases originate with userspace
>>>>>>
>>>>>> A user starts a namespace.  The module is loaded for servicing
>>>>>> requests.  The module starts a listener.  The user deletes
>>>>>> the namespace.  This scenario will have everything cleaned up
>>>>>> properly if the listener is a kernel socket.  This is not the
>>>>>> case with RDMA.
>>>>>
>>>>> Please point to reputable code in upstream doing this
>>>>
>>>>
>>>> It is not clear what "reputable" here really means.  If it just
>>>> means something in the kernel, then nearly all, if not all, of the
>>>> Internet protocol code in the kernel creates a control kernel
>>>> socket for every network namespace.  That socket is deleted in the
>>>> per-namespace exit function.  If it explicitly means a listening
>>>> socket, AFS and TIPC in the kernel do that for every namespace.
>>>> That socket is deleted in the per-namespace exit function.
>>>>
>>>> It is very common for a network protocol to have something like
>>>> this for protocol processing.  It is not clear why the RDMA
>>>> subsystem behaves differently and forbids this common practice.
>>>> Could you please elaborate on the issues this practice has such
>>>> that the RDMA subsystem cannot support it?
>>>
>>> Just curious, are we talking about theoretical thing here or do you
>>> have concrete and upstream ULP code to present?
>>
>>
>> As I mentioned in a previous email, I have running code.
>> Otherwise, why would I go to such great lengths to find
>> out what is missing in the RDMA subsystem in supporting
>> kernel namespace usage.
> 
> So why don't you post this running code?


Will it change the listening RDMA endpoint started by the module from
"rogue" to normal?  This is the fundamental problem, and it is why I
asked why the RDMA subsystem behaves like this in the first place.  If
the reason were just that there is no existing user, that would be fine.
Unexpectedly, the reason turns out to be that no kernel module is allowed
to create its own RDMA endpoint without a corresponding user space file
descriptor and/or some form of user space interaction.  This is a very
serious restriction on how the RDMA subsystem can be used by any kernel
module, and it has to be sorted out first.

Note that namespaces do not really play a role in this "rogue" reasoning.
The init_net is also a namespace.  The "rogue" reasoning means that no
kernel module should start a listening RDMA endpoint by itself, with or
without any extra namespaces.  In fact, to conform to this reasoning, the
"right" thing to do would be to change the code already upstream to get
rid of the listening RDMA endpoint in init_net!
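
For reference, the AFS/TIPC-style per-namespace pattern referred to above
hangs off struct pernet_operations.  A minimal sketch follows; the
myproto_* names are illustrative only and do not correspond to any actual
module:

```c
#include <linux/module.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
#include <net/sock.h>

/* Hypothetical per-namespace state. */
struct myproto_net {
	struct socket *listen_sock;
};

static unsigned int myproto_net_id;

static int __net_init myproto_net_init(struct net *net)
{
	struct myproto_net *mn = net_generic(net, myproto_net_id);

	/* sock_create_kern() does not take a reference on @net, so this
	 * socket does not pin the namespace; it is torn down in .exit
	 * when the namespace is deleted.  (bind/listen omitted.) */
	return sock_create_kern(net, AF_INET, SOCK_STREAM, IPPROTO_TCP,
				&mn->listen_sock);
}

static void __net_exit myproto_net_exit(struct net *net)
{
	struct myproto_net *mn = net_generic(net, myproto_net_id);

	/* Runs automatically as part of namespace cleanup. */
	sock_release(mn->listen_sock);
}

static struct pernet_operations myproto_net_ops = {
	.init = myproto_net_init,
	.exit = myproto_net_exit,
	.id   = &myproto_net_id,
	.size = sizeof(struct myproto_net),
};

static int __init myproto_init(void)
{
	return register_pernet_subsys(&myproto_net_ops);
}
```

The corresponding module exit would call unregister_pernet_subsys(),
which invokes .exit for every live namespace.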



-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-08 11:08                                                         ` Ka-Cheong Poon
@ 2020-10-08 16:08                                                           ` Jason Gunthorpe
  2020-10-08 16:21                                                             ` Chuck Lever
  2020-10-09  4:49                                                             ` Ka-Cheong Poon
  0 siblings, 2 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-08 16:08 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Leon Romanovsky, linux-rdma

On Thu, Oct 08, 2020 at 07:08:42PM +0800, Ka-Cheong Poon wrote:
> Note that namespace does not really play a role in this "rogue" reasoning.
> The init_net is also a namespace.  The "rogue" reasoning means that no
> kernel module should start a listening RDMA endpoint by itself with or
> without any extra namespaces.  In fact, to conform to this reasoning, the
> "right" thing to do would be to change the code already in upstream to get
> rid of the listening RDMA endpoint in init_net!

Actually I think they all already need user co-ordination?
 
- NFS, user has to setup and load exports
- Storage Targets, user has to setup the target
- IPoIB, user has to set the link up

etc.

Each of those could provide the anchor to learn the namespace.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-08 16:08                                                           ` Jason Gunthorpe
@ 2020-10-08 16:21                                                             ` Chuck Lever
  2020-10-08 16:46                                                               ` Jason Gunthorpe
  2020-10-09  4:49                                                             ` Ka-Cheong Poon
  1 sibling, 1 reply; 48+ messages in thread
From: Chuck Lever @ 2020-10-08 16:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Ka-Cheong Poon; +Cc: Leon Romanovsky, linux-rdma



> On Oct 8, 2020, at 12:08 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Oct 08, 2020 at 07:08:42PM +0800, Ka-Cheong Poon wrote:
>> Note that namespace does not really play a role in this "rogue" reasoning.
>> The init_net is also a namespace.  The "rogue" reasoning means that no
>> kernel module should start a listening RDMA endpoint by itself with or
>> without any extra namespaces.  In fact, to conform to this reasoning, the
>> "right" thing to do would be to change the code already in upstream to get
>> rid of the listening RDMA endpoint in init_net!
> 
> Actually I think they all already need user co-ordination?
> 
> - NFS, user has to setup and load exports
> - Storage Targets, user has to setup the target
> - IPoIB, user has to set the link up
> 
> etc.
> 
> Each of those could provide the anchor to learn the namespace.

My two cents, and worth every penny:

I think the NFSD listener is net namespace-aware. I vaguely recall
that a user administrative program (maybe rpc.nfsd?) requests an
NFS service listener in a particular namespace.

Should work the same for sockets and listener QPs. For RPC-over-RDMA,
a struct net argument is passed in from the generic code:

 66 static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
 67                                                  struct net *net);
 68 static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
 69                                         struct net *net,
 70                                         struct sockaddr *sa, int salen,
 71                                         int flags);

And that struct net is then passed on to rdma_create_id().
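
The hand-off described above can be illustrated roughly as follows.  This
is a simplified sketch of how a caller-supplied struct net reaches
rdma_create_id(), not the actual svc_rdma code; example_listen() and its
parameters are made up for illustration:

```c
#include <rdma/rdma_cm.h>

/* Simplified: create a listening cm_id bound to the namespace that the
 * caller (e.g. svc_rdma_create()) received from generic transport code. */
static struct rdma_cm_id *example_listen(struct net *net,
					 struct sockaddr *sa,
					 rdma_cm_event_handler handler,
					 void *ctx)
{
	struct rdma_cm_id *id;
	int ret;

	/* rdma_create_id() takes the target namespace explicitly. */
	id = rdma_create_id(net, handler, ctx, RDMA_PS_TCP, IB_QPT_RC);
	if (IS_ERR(id))
		return id;

	ret = rdma_bind_addr(id, sa);
	if (!ret)
		ret = rdma_listen(id, 10);
	if (ret) {
		rdma_destroy_id(id);
		return ERR_PTR(ret);
	}
	return id;
}
```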


--
Chuck Lever




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-08 16:21                                                             ` Chuck Lever
@ 2020-10-08 16:46                                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-08 16:46 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma

On Thu, Oct 08, 2020 at 12:21:10PM -0400, Chuck Lever wrote:
> 
> 
> > On Oct 8, 2020, at 12:08 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Thu, Oct 08, 2020 at 07:08:42PM +0800, Ka-Cheong Poon wrote:
> >> Note that namespace does not really play a role in this "rogue" reasoning.
> >> The init_net is also a namespace.  The "rogue" reasoning means that no
> >> kernel module should start a listening RDMA endpoint by itself with or
> >> without any extra namespaces.  In fact, to conform to this reasoning, the
> >> "right" thing to do would be to change the code already in upstream to get
> >> rid of the listening RDMA endpoint in init_net!
> > 
> > Actually I think they all already need user co-ordination?
> > 
> > - NFS, user has to setup and load exports
> > - Storage Targets, user has to setup the target
> > - IPoIB, user has to set the link up
> > 
> > etc.
> > 
> > Each of those could provide the anchor to learn the namespace.
> 
> My two cents, and worth every penny:
> 
> I think the NFSD listener is net namespace-aware. I vaguely recall
> that a user administrative program (maybe rpc.nfsd?) requests an
> NFS service listener in a particular namespace.
>
> Should work the same for sockets and listener QPs. For RPC-over-RDMA,
> a struct net argument is passed in from the generic code:
> 
>  66 static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
>  67                                                  struct net *net);
>  68 static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
>  69                                         struct net *net,
>  70                                         struct sockaddr *sa, int salen,
>  71                                         int flags);
> 
> And that struct net is then passed on to rdma_create_id().

Yes

It might help Ka-Cheong to explore how NFS should work

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-08 16:08                                                           ` Jason Gunthorpe
  2020-10-08 16:21                                                             ` Chuck Lever
@ 2020-10-09  4:49                                                             ` Ka-Cheong Poon
  2020-10-09 14:39                                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-09  4:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Leon Romanovsky, linux-rdma

On 10/9/20 12:08 AM, Jason Gunthorpe wrote:
> On Thu, Oct 08, 2020 at 07:08:42PM +0800, Ka-Cheong Poon wrote:
>> Note that namespace does not really play a role in this "rogue" reasoning.
>> The init_net is also a namespace.  The "rogue" reasoning means that no
>> kernel module should start a listening RDMA endpoint by itself with or
>> without any extra namespaces.  In fact, to conform to this reasoning, the
>> "right" thing to do would be to change the code already in upstream to get
>> rid of the listening RDMA endpoint in init_net!
> 
> Actually I think they all already need user co-ordination?
>   
> - NFS, user has to setup and load exports
> - Storage Targets, user has to setup the target
> - IPoIB, user has to set the link up
> 
> etc.
> 
> Each of those could provide the anchor to learn the namespace.


It is unclear how this is related to the question at hand.  It
is not about learning the namespace.  A kernel module knows
when a namespace is created; there is no need to learn it.  The
question is about creating a kernel RDMA endpoint in a namespace
without adding a reference to that namespace.  The analogy to the
daemon scenario is that a daemon starts a socket endpoint at startup.
No one calls that endpoint "rogue".  Why is it that a kernel module
should not start a socket endpoint at startup?  Why is that socket
endpoint "rogue"?  The reason has still not been given.

As I mentioned before, this is a very serious restriction on how
the RDMA subsystem can be used in a namespace environment by a kernel
module.  The reason given for this restriction is that any kernel
socket without a corresponding user space file descriptor is "rogue".
All Internet protocol code creates kernel sockets without user
interaction.  Are they all "rogue"?



-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09  4:49                                                             ` Ka-Cheong Poon
@ 2020-10-09 14:39                                                               ` Jason Gunthorpe
  2020-10-09 14:48                                                                 ` Chuck Lever
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-09 14:39 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Leon Romanovsky, linux-rdma

On Fri, Oct 09, 2020 at 12:49:30PM +0800, Ka-Cheong Poon wrote:
> As I mentioned before, this is a very serious restriction on how
> the RDMA subsystem can be used in a namespace environment by kernel
> module.  The reason given for this restriction is that any kernel
> socket without a corresponding user space file descriptor is "rogue".
> All Internet protocol code create a kernel socket without user
> interaction.  Are they all "rogue"?

You should work with Chuck to make NFS use namespaces properly and
then you can propose what changes might be needed with a proper
justification.

The rules for lifetime on IB clients are tricky, and the interaction
with namespaces makes it all a lot more murky.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 14:39                                                               ` Jason Gunthorpe
@ 2020-10-09 14:48                                                                 ` Chuck Lever
  2020-10-09 14:57                                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Chuck Lever @ 2020-10-09 14:48 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma

Hi Jason-

> On Oct 9, 2020, at 10:39 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Oct 09, 2020 at 12:49:30PM +0800, Ka-Cheong Poon wrote:
>> As I mentioned before, this is a very serious restriction on how
>> the RDMA subsystem can be used in a namespace environment by kernel
>> module.  The reason given for this restriction is that any kernel
>> socket without a corresponding user space file descriptor is "rogue".
>> All Internet protocol code create a kernel socket without user
>> interaction.  Are they all "rogue"?
> 
> You should work with Chuck to make NFS use namespaces properly and
> then you can propose what changes might be needed with a proper
> justification.

The NFS server code already uses namespaces for creating listener
endpoints, already has a user space component that drives the
creation of listeners, and already passes an appropriate struct
net to rdma_create_id. As far as I am aware, it is namespace-aware
and -friendly all the way down to rdma_create_id().

What more needs to be done?


> The rules for lifetime on IB clients are tricky, and the interaction
> with namespaces makes it all a lot more murky.

I think what Ka-cheong is asking is for a detailed explanation of
these lifetime rules so we can understand why rdma_create_id bumps
the namespace reference count.


--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 14:48                                                                 ` Chuck Lever
@ 2020-10-09 14:57                                                                   ` Jason Gunthorpe
  2020-10-09 15:00                                                                     ` Chuck Lever
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-09 14:57 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma

On Fri, Oct 09, 2020 at 10:48:55AM -0400, Chuck Lever wrote:
> Hi Jason-
> 
> > On Oct 9, 2020, at 10:39 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Fri, Oct 09, 2020 at 12:49:30PM +0800, Ka-Cheong Poon wrote:
> >> As I mentioned before, this is a very serious restriction on how
> >> the RDMA subsystem can be used in a namespace environment by kernel
> >> module.  The reason given for this restriction is that any kernel
> >> socket without a corresponding user space file descriptor is "rogue".
> >> All Internet protocol code create a kernel socket without user
> >> interaction.  Are they all "rogue"?
> > 
> > You should work with Chuck to make NFS use namespaces properly and
> > then you can propose what changes might be needed with a proper
> > justification.
> 
> The NFS server code already uses namespaces for creating listener
> endpoints, already has a user space component that drives the
> creation of listeners, and already passes an appropriate struct
> net to rdma_create_id. As far as I am aware, it is namespace-aware
> and -friendly all the way down to rdma_create_id().
> 
> What more needs to be done?

I have no idea, if you are able to pass a namespace all the way down
to the listening cm_id and everything works right (I'm skeptical) then
there is nothing more to worry about - why are we having this thread?

> > The rules for lifetime on IB clients are tricky, and the interaction
> > with namespaces makes it all a lot more murky.
> 
> I think what Ka-cheong is asking is for a detailed explanation of
> these lifetime rules so we can understand why rdma_create_id bumps
> the namespace reference count.

It is because the CM has no code to revoke a CM ID before the
namespace goes away and the pointer becomes invalid.
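
Concretely, the reference in question is taken when the cm_id is created
and is only dropped when the cm_id is destroyed.  Simplified from
drivers/infiniband/core/cma.c (the exact field paths and function names
vary across kernel versions):

```c
/* In __rdma_create_id(): the cm_id records and pins the namespace. */
id_priv->id.route.addr.dev_addr.net = get_net(net);

/* In the rdma_destroy_id() path: the reference is dropped only when the
 * owner explicitly destroys the cm_id. */
put_net(id_priv->id.route.addr.dev_addr.net);
```

Until the owner calls rdma_destroy_id(), that get_net() reference keeps
the namespace's refcount above zero, so namespace cleanup never runs.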

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 14:57                                                                   ` Jason Gunthorpe
@ 2020-10-09 15:00                                                                     ` Chuck Lever
  2020-10-09 15:07                                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Chuck Lever @ 2020-10-09 15:00 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma



> On Oct 9, 2020, at 10:57 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Oct 09, 2020 at 10:48:55AM -0400, Chuck Lever wrote:
>> Hi Jason-
>> 
>>> On Oct 9, 2020, at 10:39 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>> 
>>> On Fri, Oct 09, 2020 at 12:49:30PM +0800, Ka-Cheong Poon wrote:
>>>> As I mentioned before, this is a very serious restriction on how
>>>> the RDMA subsystem can be used in a namespace environment by kernel
>>>> module.  The reason given for this restriction is that any kernel
>>>> socket without a corresponding user space file descriptor is "rogue".
>>>> All Internet protocol code create a kernel socket without user
>>>> interaction.  Are they all "rogue"?
>>> 
>>> You should work with Chuck to make NFS use namespaces properly and
>>> then you can propose what changes might be needed with a proper
>>> justification.
>> 
>> The NFS server code already uses namespaces for creating listener
>> endpoints, already has a user space component that drives the
>> creation of listeners, and already passes an appropriate struct
>> net to rdma_create_id. As far as I am aware, it is namespace-aware
>> and -friendly all the way down to rdma_create_id().
>> 
>> What more needs to be done?
> 
> I have no idea, if you are able to pass a namespace all the way down
> to the listening cm_id and everything works right (I'm skeptical) then
> there is nothing more to worry about - why are we having this thread?

The thread is about RDS, not NFS. NFS has some useful examples to
crib, but it's not the main point.

I don't think NFS/RDMA namespacing works today, but it's not because
NFS isn't ready. I agree that is another thread.


>>> The rules for lifetime on IB clients are tricky, and the interaction
>>> with namespaces makes it all a lot more murky.
>> 
>> I think what Ka-cheong is asking is for a detailed explanation of
>> these lifetime rules so we can understand why rdma_create_id bumps
>> the namespace reference count.
> 
> It is because the CM has no code to revoke a CM ID before the
> namespace goes away and the pointer becomes invalid.

Is it just a question of "no-one has yet written this code" or is
there a deeper technical reason why this has not been done?


--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 15:00                                                                     ` Chuck Lever
@ 2020-10-09 15:07                                                                       ` Jason Gunthorpe
  2020-10-09 15:27                                                                         ` Chuck Lever
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-09 15:07 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma

On Fri, Oct 09, 2020 at 11:00:22AM -0400, Chuck Lever wrote:
> 
> 
> > On Oct 9, 2020, at 10:57 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Fri, Oct 09, 2020 at 10:48:55AM -0400, Chuck Lever wrote:
> >> Hi Jason-
> >> 
> >>> On Oct 9, 2020, at 10:39 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>> 
> >>> On Fri, Oct 09, 2020 at 12:49:30PM +0800, Ka-Cheong Poon wrote:
> >>>> As I mentioned before, this is a very serious restriction on how
> >>>> the RDMA subsystem can be used in a namespace environment by kernel
> >>>> module.  The reason given for this restriction is that any kernel
> >>>> socket without a corresponding user space file descriptor is "rogue".
> >>>> All Internet protocol code create a kernel socket without user
> >>>> interaction.  Are they all "rogue"?
> >>> 
> >>> You should work with Chuck to make NFS use namespaces properly and
> >>> then you can propose what changes might be needed with a proper
> >>> justification.
> >> 
> >> The NFS server code already uses namespaces for creating listener
> >> endpoints, already has a user space component that drives the
> >> creation of listeners, and already passes an appropriate struct
> >> net to rdma_create_id. As far as I am aware, it is namespace-aware
> >> and -friendly all the way down to rdma_create_id().
> >> 
> >> What more needs to be done?
> > 
> > I have no idea, if you are able to pass a namespace all the way down
> > to the listening cm_id and everything works right (I'm skeptical) then
> > there is nothing more to worry about - why are we having this thread?
> 
> The thread is about RDS, not NFS. NFS has some useful examples to
> crib, but it's not the main point.
> 
> I don't think NFS/RDMA namespacing works today, but it's not because
> NFS isn't ready. I agree that is another thread.

Exactly, so instead of talking about RDS stuff without any patches,
let's talk about NFS with patches - if you can make NFS work then I
assume RDS will be happy.

NFS has an established model for using namespaces that the other
transports use, so I'd rather focus on this.

> >>> The rules for lifetime on IB clients are tricky, and the interaction
> >>> with namespaces makes it all a lot more murky.
> >> 
> >> I think what Ka-cheong is asking is for a detailed explanation of
> >> these lifetime rules so we can understand why rdma_create_id bumps
> >> the namespace reference count.
> > 
> > It is because the CM has no code to revoke a CM ID before the
> > namespace goes away and the pointer becomes invalid.
> 
> Is it just a question of "no-one has yet written this code" or is
> there a deeper technical reason why this has not been done?

It is hard to know without taking a deep look at this
stuff.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 15:07                                                                       ` Jason Gunthorpe
@ 2020-10-09 15:27                                                                         ` Chuck Lever
  2020-10-09 15:34                                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Chuck Lever @ 2020-10-09 15:27 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma



> On Oct 9, 2020, at 11:07 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Oct 09, 2020 at 11:00:22AM -0400, Chuck Lever wrote:
>> 
>> 
>>> On Oct 9, 2020, at 10:57 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>> 
>>> On Fri, Oct 09, 2020 at 10:48:55AM -0400, Chuck Lever wrote:
>>>> Hi Jason-
>>>> 
>>>>> On Oct 9, 2020, at 10:39 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>>> 
>>>>> On Fri, Oct 09, 2020 at 12:49:30PM +0800, Ka-Cheong Poon wrote:
>>>>>> As I mentioned before, this is a very serious restriction on how
>>>>>> the RDMA subsystem can be used in a namespace environment by kernel
>>>>>> module.  The reason given for this restriction is that any kernel
>>>>>> socket without a corresponding user space file descriptor is "rogue".
>>>>>> All Internet protocol code create a kernel socket without user
>>>>>> interaction.  Are they all "rogue"?
>>>>> 
>>>>> You should work with Chuck to make NFS use namespaces properly and
>>>>> then you can propose what changes might be needed with a proper
>>>>> justification.
>>>> 
>>>> The NFS server code already uses namespaces for creating listener
>>>> endpoints, already has a user space component that drives the
>>>> creation of listeners, and already passes an appropriate struct
>>>> net to rdma_create_id. As far as I am aware, it is namespace-aware
>>>> and -friendly all the way down to rdma_create_id().
>>>> 
>>>> What more needs to be done?
>>> 
>>> I have no idea, if you are able to pass a namespace all the way down
>>> to the listening cm_id and everything works right (I'm skeptical) then
>>> there is nothing more to worry about - why are we having this thread?
>> 
>> The thread is about RDS, not NFS. NFS has some useful examples to
>> crib, but it's not the main point.
>> 
>> I don't think NFS/RDMA namespacing works today, but it's not because
>> NFS isn't ready. I agree that is another thread.
> 
> Exactly, so instead of talking about RDS stuff without any patches,

Roger that. Maybe Ka-Cheong and team can propose some patches to
help the discussion along.


> let's talk about NFS with patches - if you can make NFS work then I
> assume RDS will be happy.

Perhaps not a valid assumption :-)

NFS is a traditional client-server model, and has a user space tool
that drives the creation of endpoints, just as you expect.

With RDS, listener endpoints are not visible in user space. They
are a globally-managed shared resource, more like network interfaces
than listener sockets.

Therefore I think the approach is going to be "one RDS listener per
net namespace". The problem Ka-Cheong is trying to address is how to
manage the destruction of a listener-namespace pair. The extra
reference count on the cm_id is pinning the namespace so it cannot
be destroyed.


> NFS has an established model for using namespaces that the other
> transports uses, so I'd rather focus on this.

Understood, but it doesn't seem like there is enough useful overlap
between the NFS and RDS usage scenarios. With NFS, I would expect
an explicit listener shutdown from userland prior to namespace
destruction.
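
That NFS-style lifecycle, in which userland explicitly starts and stops
the listener so the cm_id's namespace reference is dropped before the
namespace is deleted, might look roughly like this.  All names are
hypothetical and locking is omitted:

```c
/* Hypothetical per-service state. */
static struct rdma_cm_id *example_listener;

/* Invoked by a userland "start" request (e.g. via a control file),
 * from inside the target namespace. */
static int example_start_listener(struct net *net, struct sockaddr *sa,
				  rdma_cm_event_handler handler)
{
	struct rdma_cm_id *id;
	int ret;

	id = rdma_create_id(net, handler, NULL, RDMA_PS_TCP, IB_QPT_RC);
	if (IS_ERR(id))
		return PTR_ERR(id);

	ret = rdma_bind_addr(id, sa);
	if (!ret)
		ret = rdma_listen(id, 10);
	if (ret) {
		rdma_destroy_id(id);
		return ret;
	}
	example_listener = id;
	return 0;
}

/* Invoked by a userland "stop" request before the namespace goes away;
 * this drops the namespace reference held by the cm_id. */
static void example_stop_listener(void)
{
	rdma_destroy_id(example_listener);
	example_listener = NULL;
}
```

Under this model there is no pernet .exit teardown race: the namespace
cannot disappear while the listener exists, and userland is responsible
for tearing the listener down first.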


>>>>> The rules for lifetime on IB clients are tricky, and the interaction
>>>>> with namespaces makes it all a lot more murky.
>>>> 
>>>> I think what Ka-cheong is asking is for a detailed explanation of
>>>> these lifetime rules so we can understand why rdma_create_id bumps
>>>> the namespace reference count.
>>> 
>>> It is because the CM has no code to revoke a CM ID before the
>>> namespace goes away and the pointer becomes invalid.
>> 
>> Is it just a question of "no-one has yet written this code" or is
>> there a deeper technical reason why this has not been done?
> 
> It is hard to know without spending a big deep look at this
> stuff.

Fair enough.

--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 15:27                                                                         ` Chuck Lever
@ 2020-10-09 15:34                                                                           ` Jason Gunthorpe
  2020-10-09 15:52                                                                             ` Chuck Lever
  2020-10-12  8:20                                                                             ` Ka-Cheong Poon
  0 siblings, 2 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-09 15:34 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma

On Fri, Oct 09, 2020 at 11:27:44AM -0400, Chuck Lever wrote:

> Therefore I think the approach is going to be "one RDS listener per
> net namespace". The problem Ka-Cheong is trying to address is how to
> manage the destruction of a listener-namespace pair. The extra
> reference count on the cm_id is pinning the namespace so it cannot
> be destroyed.

I really don't think this idea of just loading a kernel module and
having it immediately create a network-visible listening socket in every
namespace is very good.

> Understood, but it doesn't seem like there is enough useful overlap
> between the NFS and RDS usage scenarios. With NFS, I would expect
> an explicit listener shutdown from userland prior to namespace
> destruction.

Yes, because namespaces are fundamentally supposed to be anchored in
the processes inside the namespace.

Having the kernel jump in and start opening holes as soon as a
namespace is created is just wrong.

At a bare minimum the listener should not exist until something in the
namespace is willing to work with RDS.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 15:34                                                                           ` Jason Gunthorpe
@ 2020-10-09 15:52                                                                             ` Chuck Lever
  2020-10-12  8:20                                                                             ` Ka-Cheong Poon
  1 sibling, 0 replies; 48+ messages in thread
From: Chuck Lever @ 2020-10-09 15:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Ka-Cheong Poon; +Cc: Leon Romanovsky, linux-rdma



> On Oct 9, 2020, at 11:34 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Oct 09, 2020 at 11:27:44AM -0400, Chuck Lever wrote:
> 
>> Therefore I think the approach is going to be "one RDS listener per
>> net namespace". The problem Ka-Cheong is trying to address is how to
>> manage the destruction of a listener-namespace pair. The extra
>> reference count on the cm_id is pinning the namespace so it cannot
>> be destroyed.
> 
> I really don't think this idea of just loading a kernel module and it
> immediately creates a network visible listening socket in every
> namespace is very good.
> 
>> Understood, but it doesn't seem like there is enough useful overlap
>> between the NFS and RDS usage scenarios. With NFS, I would expect
>> an explicit listener shutdown from userland prior to namespace
>> destruction.
> 
> Yes, because namespaces are fundamentally supposed to be anchored in
> the processes inside the namespace.

Aye, the container model.


> Having the kernel jump in and start opening holes as soon as a
> namespace is created is just wrong.
> 
> At a bare minimum the listener should not exist until something in the
> namespace is willing to work with RDS.

I was thinking that too, but I'm not sure whether that change would have
ramifications for existing RDS applications. There's quite a bit of
legacy to deal with.

An alternative would be to add a user daemon to RDS to manage the
listener lifecycle, rather than having the endpoint created at
module load. That might help the listener-namespace destruction
issue, and should be entirely application-transparent.


--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-09 15:34                                                                           ` Jason Gunthorpe
  2020-10-09 15:52                                                                             ` Chuck Lever
@ 2020-10-12  8:20                                                                             ` Ka-Cheong Poon
  2020-10-16 18:54                                                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Ka-Cheong Poon @ 2020-10-12  8:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Chuck Lever; +Cc: Leon Romanovsky, linux-rdma

On 10/9/20 11:34 PM, Jason Gunthorpe wrote:

> Yes, because namespaces are fundamentally supposed to be anchored in
> the processes inside the namespace.
> 
> Having the kernel jump in and start opening holes as soon as a
> namespace is created is just wrong.
> 
> At a bare minimum the listener should not exist until something in the
> namespace is willing to work with RDS.


As I mentioned in a previous email, starting is not the problem; the
problem is deleting a namespace.  With what is suggested above, there
needs to be an explicit RDS shutdown in addition to the normal
teardown of a namespace, and it is not clear why that should be
necessary.  The additional reference taken by rdma_create_id() puts
an unnecessary restriction on what a kernel module can do.  Without
this reference, a kernel module that wants, say, a one-to-one or
one-to-many mapping model to user space sockets, as suggested above,
can implement it; and a kernel module that does not want this model
can choose otherwise.  It is not clear why such a restriction must be
enforced by the RDMA subsystem when there is no such restriction on
kernel sockets.


-- 
K. Poon
ka-cheong.poon@oracle.com



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-12  8:20                                                                             ` Ka-Cheong Poon
@ 2020-10-16 18:54                                                                               ` Jason Gunthorpe
  2020-10-16 20:49                                                                                 ` Chuck Lever
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-16 18:54 UTC (permalink / raw)
  To: Ka-Cheong Poon; +Cc: Chuck Lever, Leon Romanovsky, linux-rdma

On Mon, Oct 12, 2020 at 04:20:40PM +0800, Ka-Cheong Poon wrote:
> On 10/9/20 11:34 PM, Jason Gunthorpe wrote:
> 
> > Yes, because namespaces are fundamentally supposed to be anchored in
> > the processes inside the namespace.
> > 
> > Having the kernel jump in and start opening holes as soon as a
> > namespace is created is just wrong.
> > 
> > At a bare minimum the listener should not exist until something in the
> > namespace is willing to work with RDS.
> 
> 
> As I mentioned in a previous email, starting is not the problem.  It
> is the problem of deleting a namespace.

Starting and ending are symmetric. When the last thing inside the
namespace stops needing RDS then RDS should close down the cm_id's.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-16 18:54                                                                               ` Jason Gunthorpe
@ 2020-10-16 20:49                                                                                 ` Chuck Lever
  2020-10-19 18:31                                                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Chuck Lever @ 2020-10-16 20:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Ka-Cheong Poon; +Cc: Leon Romanovsky, linux-rdma



> On Oct 16, 2020, at 2:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Oct 12, 2020 at 04:20:40PM +0800, Ka-Cheong Poon wrote:
>> On 10/9/20 11:34 PM, Jason Gunthorpe wrote:
>> 
>>> Yes, because namespaces are fundamentally supposed to be anchored in
>>> the processes inside the namespace.
>>> 
>>> Having the kernel jump in and start opening holes as soon as a
>>> namespace is created is just wrong.
>>> 
>>> At a bare minimum the listener should not exist until something in the
>>> namespace is willing to work with RDS.
>> 
>> 
>> As I mentioned in a previous email, starting is not the problem.  It
>> is the problem of deleting a namespace.
> 
> Starting and ending are symmetric. When the last thing inside the
> namespace stops needing RDS then RDS should close down the cm_id's.

Unfortunately, cluster heartbeat requires the RDS listener endpoint
to continue after the last RDS user goes away, if the container
continues to exist.

IMO having an explicit RDS start-up and shutdown apart from namespace
creation and deletion is a cleaner approach. On a multi-tenant system
with many containers, some of those containers will want RDS listeners
and some will not. RDS should not assume that every net namespace
needs or wants to have a listener.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)
  2020-10-16 20:49                                                                                 ` Chuck Lever
@ 2020-10-19 18:31                                                                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2020-10-19 18:31 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Ka-Cheong Poon, Leon Romanovsky, linux-rdma

On Fri, Oct 16, 2020 at 04:49:41PM -0400, Chuck Lever wrote:
> 
> 
> > On Oct 16, 2020, at 2:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Mon, Oct 12, 2020 at 04:20:40PM +0800, Ka-Cheong Poon wrote:
> >> On 10/9/20 11:34 PM, Jason Gunthorpe wrote:
> >> 
> >>> Yes, because namespaces are fundamentally supposed to be anchored in
> >>> the processes inside the namespace.
> >>> 
> >>> Having the kernel jump in and start opening holes as soon as a
> >>> namespace is created is just wrong.
> >>> 
> >>> At a bare minimum the listener should not exist until something in the
> >>> namespace is willing to work with RDS.
> >> 
> >> 
> >> As I mentioned in a previous email, starting is not the problem.  It
> >> is the problem of deleting a namespace.
> > 
> > Starting and ending are symmetric. When the last thing inside the
> > namespace stops needing RDS then RDS should close down the cm_id's.
> 
> Unfortunately, cluster heartbeat requires the RDS listener endpoint
> to continue after the last RDS user goes away, if the container
> continues to exist.

What purpose is the heartbeat if nobody is listening for RDS stuff
inside the net namespace anyhow?

> IMO having an explicit RDS start-up and shutdown apart from namespace
> creation and deletion is a cleaner approach. On a multi-tenant system
> with many containers, some of those containers will want RDS listeners
> and some will not. RDS should not assume that every net namespace
> needs or wants to have a listener.

Right

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, back to index

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 14:02 Finding the namespace of a struct ib_device Ka-Cheong Poon
2020-09-03 17:39 ` Jason Gunthorpe
2020-09-04  4:01   ` Ka-Cheong Poon
2020-09-04 11:32     ` Jason Gunthorpe
2020-09-04 14:02       ` Ka-Cheong Poon
2020-09-06  7:44         ` Leon Romanovsky
2020-09-07  3:33           ` Ka-Cheong Poon
2020-09-07  7:18             ` Leon Romanovsky
2020-09-07  8:24               ` Ka-Cheong Poon
2020-09-07  9:04                 ` Leon Romanovsky
2020-09-07  9:28                   ` Ka-Cheong Poon
2020-09-07 10:22                     ` Leon Romanovsky
2020-09-07 13:48                       ` Ka-Cheong Poon
2020-09-29 16:57                         ` RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device) Ka-Cheong Poon
2020-09-29 17:40                           ` Jason Gunthorpe
2020-09-30 10:32                             ` Ka-Cheong Poon
2020-10-02 14:04                               ` Jason Gunthorpe
2020-10-05 10:27                                 ` Ka-Cheong Poon
2020-10-05 13:16                                   ` Jason Gunthorpe
2020-10-05 13:57                                     ` Ka-Cheong Poon
2020-10-05 14:25                                       ` Jason Gunthorpe
2020-10-05 15:02                                         ` Ka-Cheong Poon
2020-10-05 15:45                                           ` Jason Gunthorpe
2020-10-06  9:36                                             ` Ka-Cheong Poon
2020-10-06 12:46                                               ` Jason Gunthorpe
2020-10-07  8:38                                                 ` Ka-Cheong Poon
2020-10-07 11:16                                                   ` Leon Romanovsky
2020-10-08 10:22                                                     ` Ka-Cheong Poon
2020-10-08 10:36                                                       ` Leon Romanovsky
2020-10-08 11:08                                                         ` Ka-Cheong Poon
2020-10-08 16:08                                                           ` Jason Gunthorpe
2020-10-08 16:21                                                             ` Chuck Lever
2020-10-08 16:46                                                               ` Jason Gunthorpe
2020-10-09  4:49                                                             ` Ka-Cheong Poon
2020-10-09 14:39                                                               ` Jason Gunthorpe
2020-10-09 14:48                                                                 ` Chuck Lever
2020-10-09 14:57                                                                   ` Jason Gunthorpe
2020-10-09 15:00                                                                     ` Chuck Lever
2020-10-09 15:07                                                                       ` Jason Gunthorpe
2020-10-09 15:27                                                                         ` Chuck Lever
2020-10-09 15:34                                                                           ` Jason Gunthorpe
2020-10-09 15:52                                                                             ` Chuck Lever
2020-10-12  8:20                                                                             ` Ka-Cheong Poon
2020-10-16 18:54                                                                               ` Jason Gunthorpe
2020-10-16 20:49                                                                                 ` Chuck Lever
2020-10-19 18:31                                                                                   ` Jason Gunthorpe
2020-10-07 12:28                                                   ` Jason Gunthorpe
2020-10-08 10:49                                                     ` Ka-Cheong Poon
