On 9/7/20 9:48 PM, Ka-Cheong Poon wrote:

> This may require a number of changes to the way a client interacts with
> the current RDMA framework.  For example, currently a client registers
> once using one struct ib_client and gets device notifications for all
> namespaces and devices.  Suppose there is rdma_[un]register_net_client(),
> it may need to require a client to use a different struct ib_client to
> register for each net namespace.  And struct ib_client probably needs to
> have a field to store the net namespace.  Probably all those client
> interaction functions will need to be modified.  Since the clients xarray
> is global, more clients may mean performance implication, such as it takes
> longer to go through the whole clients xarray.
> 
> There are probably many other subtle changes required.  It may turn out to
> be not so straightforward.  Is this community willing to take such changes?
> I can take a stab at it if the community really thinks that this is preferred.


Attached is a diff of a prototype for the above.  This exercise is
to see what needs to be done to have a more network-namespace-aware
interface for RDMA client registration.

Currently, there are ib_[un]register_client().  In RDMA namespace
exclusive mode, all RDMA devices are initially assigned to the init_net
namespace.  A kernel module uses this interface to register with the RDMA
subsystem.  When a device is assigned to a namespace, the client's
registered remove upcall is called with the device as the parameter (this
removes the device from the init_net namespace).  Then the client's add
upcall is called with the device as the parameter (this assigns the device
to the new namespace).  When that namespace is removed (*), a similar
sequence of events happens: a remove upcall (removing the device from the
namespace) is followed by an add upcall (assigning it back to the init_net
namespace).  All the RDMA clients are stored in a global struct xarray
called clients (in device.c), and each client is assigned a client ID.
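A rough user-space model of this sequence, purely for illustration.
struct model_client, move_device() and the integer namespace IDs are
hypothetical and only mirror the upcall order described above; they are
not taken from device.c.

```c
#include <assert.h>

/* Hypothetical model: namespace ID 0 stands for init_net. */
#define INIT_NET 0

/* A client registered through ib_register_client(): it sees every device,
 * whatever namespace the device is in. */
struct model_client {
	int adds;     /* number of add upcalls received */
	int removes;  /* number of remove upcalls received */
	int last_net; /* namespace of the device at the last add upcall */
};

static void remove_upcall(struct model_client *c, int dev_net)
{
	(void)dev_net; /* the real upcall gets no namespace info either */
	c->removes++;
}

static void add_upcall(struct model_client *c, int dev_net)
{
	c->adds++;
	c->last_net = dev_net;
}

/* Moving a device between namespaces in exclusive mode: every client gets
 * a remove upcall for the old namespace, then an add upcall for the new one. */
static void move_device(struct model_client *clients, int n, int from, int to)
{
	assert(from != to);
	for (int i = 0; i < n; i++)
		remove_upcall(&clients[i], from);
	for (int i = 0; i < n; i++)
		add_upcall(&clients[i], to);
}
```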

This exercise adds rdma_[un]register_net_client() for those clients
that want more separation between different namespaces.  This
interface takes a struct net parameter.  A kernel module uses it to
indicate that it is only interested in the RDMA events related to the
given network namespace.  Suppose a client uses init_net as the parameter.
In the above example, when a device is assigned to a namespace, only the
client's remove upcall is called (removing the device from the init_net
namespace); no add upcall follows.  Then when the namespace is removed, the
client's add upcall is called (assigning the device back to the init_net
namespace).  Suppose instead a client uses a specific namespace as the
parameter.  When a device is assigned to that namespace, the client's add
upcall is called.  When the client unregisters from RDMA (or when the
namespace is going away), the client's remove upcall is called.  The RDMA
clients are stored in each namespace's struct rdma_dev_net, and each client
is assigned a client ID in that namespace (this means the ID is unique only
within that namespace, not globally among all namespaces).
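The filtering behavior of the new interface can be sketched in the same
hypothetical style.  struct net_client, net_client_move_device() and
net_client_unregister() are made-up names for illustration, not the code
in the diff, and the model tracks a single device only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: namespace ID 0 stands for init_net. */
#define INIT_NET 0

/* A client registered for one namespace only, in the spirit of
 * rdma_register_net_client(net). */
struct net_client {
	int net;      /* namespace the client registered for */
	int adds;     /* add upcalls received */
	int removes;  /* remove upcalls received */
	bool has_dev; /* true while the device is in the client's namespace */
};

/* The device moves from namespace 'from' to namespace 'to'. */
static void net_client_move_device(struct net_client *c, int from, int to)
{
	assert(from != to);
	if (c->net == from && c->has_dev) {
		c->removes++;       /* device leaves the client's namespace */
		c->has_dev = false;
	}
	if (c->net == to) {
		c->adds++;          /* device enters the client's namespace */
		c->has_dev = true;
	}
}

/* Client unregisters (or its namespace goes away): drop the device. */
static void net_client_unregister(struct net_client *c)
{
	if (c->has_dev) {
		c->removes++;
		c->has_dev = false;
	}
}
```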

This seemingly simple exercise turned out not to be so simple because of
the need to keep the existing interface with its existing behavior.  The
behavior described above applies only when a client uses the new interface;
there should be no change of behavior for any existing RDMA client.  There
are several obstacles to overcome for this change.  One difficulty is the
global client ID, since a lot of code relies on this ID as an index into
both the global clients xarray and each device's client_data xarray.  The
detailed changes are in the attached diff if folks are interested.

Note that the new interface has one obvious issue: it does not make much
sense in RDMA shared network namespace mode.  In shared mode, all devices
are associated with init_net.  So if a client uses the new interface to
register for a namespace other than init_net, it will never get any upcall.
This, together with the difficulty of adding a seemingly simple interface,
makes me wonder about the following questions.

Is RDMA shared namespace mode the preferred mode, given that it is the
default?  Is a client expected to know the running mode before
interacting with the RDMA subsystem?  Is a client not supposed to
distinguish between namespaces?  Besides the current add client upcall,
another example related to this is event handling.  Suppose a client calls
rdma_create_id() to create listeners in different namespaces, all using the
same event handler.  When a new connection comes in, the event handler is
called with an RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious
namespace info in the event.  It seems that the only way to find out the
namespace is through the context of struct rdma_cm_id.  The client must
somehow add the namespace info to the context itself, since the subsystem
does not provide any help.  Is this the assumed solution?  BTW, this
exercise still does not remove the need for rdma_dev_to_netns(), as the add
upcall does not provide any namespace info.  Given all these questions,
rdma_[un]register_net_client() unfortunately does not seem to fit the
current way of interacting with the RDMA subsystem.
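A minimal sketch of the context-based workaround, assuming the client
stores its own namespace tag in the per-listener context.  struct
model_cm_id only imitates the context field of struct rdma_cm_id, and
struct listener_ctx and shared_handler() are hypothetical; none of this is
RDMA core API.

```c
#include <assert.h>

/* Hypothetical stand-in: only the 'context' pointer of an rdma_cm_id. */
struct model_cm_id {
	void *context;
};

/* Hypothetical event type; the event itself carries no namespace. */
enum model_event { MODEL_CONNECT_REQUEST };

/* Per-listener context allocated by the client.  The net field is the
 * client's own bookkeeping, not something the RDMA core fills in. */
struct listener_ctx {
	int net;       /* namespace tag chosen by the client */
	int conn_reqs; /* connect requests seen on this listener */
};

/* One handler shared by listeners in different namespaces; the only way
 * it can tell them apart is through the context pointer. */
static int shared_handler(struct model_cm_id *id, enum model_event ev)
{
	struct listener_ctx *ctx = id->context;

	assert(ctx != 0);
	if (ev == MODEL_CONNECT_REQUEST)
		ctx->conn_reqs++;
	return ctx->net;
}
```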

Thanks.


(*) Note that __rdma_create_id() does a get_net(net) to take a
     reference on the namespace.  Suppose a kernel module calls
     rdma_create_id() in its namespace .init function to create an RDMA
     listener and calls rdma_destroy_id() in its namespace .exit function to
     destroy it.  Since __rdma_create_id() holds a reference on the
     namespace, when a sysadmin deletes the namespace (say with
     `ip netns del ...`), the namespace won't actually be destroyed because
     of this reference.  And the module will not release this reference
     until its .exit function is called, which happens only when the
     namespace is destroyed, so neither side can make progress.  To resolve
     this issue, in the diff (in __rdma_create_id()), I did something
     similar to the kern check in sk_alloc().
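The kern check referred to above can be modelled in user space roughly
as follows.  struct model_net, struct model_id, model_create_id() and
model_destroy_id() are hypothetical; the real code paths are sk_alloc() in
net/core/sock.c and __rdma_create_id(), and this is not the code from the
attached diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net: only a reference count. */
struct model_net {
	int refcount;
};

/* Hypothetical stand-in for an rdma_cm_id that remembers whether it
 * pinned its namespace, like sk->sk_net_refcnt does for sockets. */
struct model_id {
	struct model_net *net;
	bool net_refcnt;
};

/* Mirrors the sk_alloc() idea: kernel-internal users (kern != 0) do not
 * take a namespace reference, so they cannot keep the namespace alive. */
static struct model_id *model_create_id(struct model_net *net, int kern)
{
	struct model_id *id = malloc(sizeof(*id));

	if (!id)
		return NULL;
	id->net = net;
	id->net_refcnt = !kern;
	if (id->net_refcnt)
		net->refcount++; /* get_net() equivalent */
	return id;
}

static void model_destroy_id(struct model_id *id)
{
	assert(id != NULL);
	if (id->net_refcnt)
		id->net->refcount--; /* put_net() equivalent */
	free(id);
}
```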


-- 
K. Poon
ka-cheong.poon@oracle.com