All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-17  5:50 Haggai Eran
  2015-05-17  5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran
                   ` (9 more replies)
  0 siblings, 10 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:50 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
	Haggai Eran

Thanks again everyone for the review comments. I've updated the patch set
accordingly. The main changes are in the first patch to use a read-write
semaphore instead of an SRCU, and with the reference counting of shared
ib_cm_ids.
Please let me know if I missed anything, or if there are other issues with
the series.

Regards,
Haggai

Changes from v3:
- Patch 1 and 3: use read-write semaphore instead of an SRCU.
- Patch 5:
  * Use a direct reference count instead of a kref.
  * Instead of adding get/put pair for ib_cm_ids, just avoid destroying an
    id when it is still in use.
  * Squashes these two patches together, since the first one became too
    short:
      IB/cm: Reference count ib_cm_ids
      IB/cm: API to retrieve existing listening CM IDs
- Rebase to Doug's to-be-rebased/for-4.2 branch.

Changes from v2:
- Add patch 1 to change device_mutex to an RCU.
- Remove patch that fixed IPv4 connections to an IPv4/IPv6 listener.
- Limit namespace related changes to RDMA CM and InfiniBand only.
- Rebase on dledford/for-v4.2, with David Ahern's unaligned access patch.
  * Use Michael Wang's capability functions where needed.
- Move the struct net argument to be the first in all functions, to match the
  networking core scheme.
- Patch 2:
  * Remove unwanted braces.
- Patch 4: check the return value of ib_find_cached_pkey.
- Patch 8: verify the address family before calling cm_save_ib_info.
- Patch 10: use generic_net instead of a custom radix tree for having per
  network namespace data.
- Minor changes.

Changes from v1:
- Include patch 1 in this series.
- Rebase for v4.1.

Changes from v0:
- Fix code review comments by Yann
- Rebase on top of linux-3.19

RDMA-CM uses IP based addressing and routing to setup RDMA connections between
hosts. Currently, all of the IP interfaces and addresses used by the RDMA-CM
must reside in the init_net namespace. This restricts the usage of containers
with RDMA to only work with host network namespace (aka the kernel init_net NS
instance).

This patchset allows using network namespaces with the RDMA-CM.

Each RDMA-CM id keeps a reference to a network namespace.

This reference is based on the process network namespace at the time of the
creation of the object or inherited from the listener.

This network namespace is used to perform all IP and network related
operations. Specifically, the local device lookup, as well as the remote GID
address resolution are done in the context of the RDMA-CM object's namespace.
This allows outgoing connections to reach the right target, even if the same
IP address exists in multiple network namespaces. This can happen if each
network namespace resides on a different P_Key.

Additionally, the network namespace is used to split the listener service ID
table. From the user point of view, each network namespace has a unique,
completely independent table of service IDs. This allows running multiple
instances of a single service on the same machine, using containers. To
implement this, multiple RDMA CM IDs, belonging to different namespaces may
now share their CM ID. When a request on such a CM ID arrives, the RDMA CM
module finds out the correct namespaces and looks for the RDMA CM ID
matching the request's parameters.

The functionality introduced by this series would come into play when the
transport is InfiniBand and IPoIB interfaces are assigned to each namespace.
Multiple IPoIB interfaces can be created and assigned to different RDMA-CM
capable containers, for example using pipework [1].

Full support for RoCE will be introduced in a later stage.

The patches apply against Doug's tree for v4.2.

The patchset is structured as follows:

Patch 1 adds a read-write semaphore in addition to the device mutex in
ib_core to allow traversing the client list without a deadlock in Patch 3.

Patch 2 is a relatively trivial API extension, requiring the callers
of certain ib_addr functions to provide a network namespace, as needed.

Patches 3 and 4 adds the ability to lookup a network namespace according to
the IP address, device and P_Key. It finds the matching IPoIB interfaces, and
safely takes a reference on the network namespace before returning to the
caller.

Patches 5-6 make necessary changes to the CM layer, to allow sharing of a
single CM ID between multiple RDMA CM IDs. This includes adding a reference
count to ib_cm_id structs, add an API to either create a new CM ID or use
an existing one, and expose the service ID to ib_cm clients.

Patches 7-8 do some preliminary refactoring to the rdma_cm module. Patch 7
refactors the logic that extracts the IP address from a connect request to
allow reuse by the namespace lookup code further on.  Patch 8 changes the
way RDMA CM module creates CM IDs, to avoid relying on the compare_data
feature of ib_cm. This feature associate a single compare_data struct per
ib_cm_id, so it cannot be used when sharing CM IDs.

Patches 9-12 add proper namespace support to the RDMA-CM module. This
includes adding multiple port space tables, sharing ib_cm_ids between
rdma_cm_ids, adding a network namespace parameter, and finally retrieving
the namespace from the creating process.

[1] https://github.com/jpetazzo/pipework/pull/108

Guy Shapiro (4):
  IB/addr: Pass network namespace as a parameter
  IB/ipoib: Return IPoIB devices matching connection parameters
  IB/cma: Add support for network namespaces
  IB/ucma: Take the network namespace from the process

Haggai Eran (7):
  IB/core: Add rwsem to allow reading device list or client list
  IB/cm: Share listening CM IDs
  IB/cm: Expose service ID in request events
  IB/cma: Refactor RDMA IP CM private-data parsing code
  IB/cma: Add compare_data checks to the RDMA CM module
  IB/cma: Separate port allocation to network namespaces
  IB/cma: Share CM IDs between namespaces

Yotam Kenneth (1):
  IB/core: Find the network namespace matching connection parameters

 drivers/infiniband/core/addr.c                     |  18 +-
 drivers/infiniband/core/cm.c                       | 129 +++++-
 drivers/infiniband/core/cma.c                      | 461 +++++++++++++++------
 drivers/infiniband/core/device.c                   |  76 +++-
 drivers/infiniband/core/ucma.c                     |   4 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c          | 139 ++++++-
 drivers/infiniband/ulp/iser/iser_verbs.c           |   2 +-
 drivers/infiniband/ulp/isert/ib_isert.c            |   2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |   4 +-
 include/rdma/ib_addr.h                             |  16 +-
 include/rdma/ib_cm.h                               |   8 +
 include/rdma/ib_verbs.h                            |  33 ++
 include/rdma/rdma_cm.h                             |   6 +-
 net/9p/trans_rdma.c                                |   4 +-
 net/rds/ib.c                                       |   2 +-
 net/rds/ib_cm.c                                    |   2 +-
 net/rds/iw.c                                       |   2 +-
 net/rds/iw_cm.c                                    |   2 +-
 net/rds/rdma_transport.c                           |   4 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c           |   4 +-
 net/sunrpc/xprtrdma/verbs.c                        |   3 +-
 21 files changed, 767 insertions(+), 154 deletions(-)

-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-17  5:50   ` Haggai Eran
  2015-05-17  5:51   ` [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events Haggai Eran
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:50 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
	Haggai Eran, Matan Barak, Jason Gunthorpe

Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding a read-write semaphore to protect the lists. Readers
now don't need the mutex, and are safe just by read-locking the semaphore.

The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for write
during their respective list modification

This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.

[1] http://www.spinics.net/lists/linux-rdma/msg24733.html

Cc: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/device.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b360350a0b20..3a44723c6b9d 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -59,13 +59,17 @@ static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
 /*
- * device_mutex protects access to both device_list and client_list.
- * There's no real point to using multiple locks or something fancier
- * like an rwsem: we always access both lists, and we're always
- * modifying one list or the other list.  In any case this is not a
- * hot path so there's no point in trying to optimize.
+ * device_mutex and lists_rwsem protect access to both device_list and
+ * client_list.  device_mutex protects writer access by device and client
+ * registration / de-registration.  lists_rwsem protects reader access to
+ * these lists.  Iterators of these lists must lock it for read, while updates
+ * to the lists must be done with a write lock. A special case is when the
+ * device_mutex is locked. In this case locking the lists for read access as
+ * the device_mutex implies it.
  */
 static DEFINE_MUTEX(device_mutex);
+static DECLARE_RWSEM(lists_rwsem);
+
 
 static int ib_device_check_mandatory(struct ib_device *device)
 {
@@ -311,7 +315,9 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
+	down_write(&lists_rwsem);
 	list_add_tail(&device->core_list, &device_list);
+	up_write(&lists_rwsem);
 
 	device->reg_state = IB_DEV_REGISTERED;
 
@@ -347,7 +353,9 @@ void ib_unregister_device(struct ib_device *device)
 		if (client->remove)
 			client->remove(device);
 
+	down_write(&lists_rwsem);
 	list_del(&device->core_list);
+	up_write(&lists_rwsem);
 
 	kfree(device->gid_tbl_len);
 	kfree(device->pkey_tbl_len);
@@ -384,7 +392,10 @@ int ib_register_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
 	list_add_tail(&client->list, &client_list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry(device, &device_list, core_list)
 		if (client->add && !add_client_context(device, client))
 			client->add(device);
@@ -423,7 +434,10 @@ void ib_unregister_client(struct ib_client *client)
 			}
 		spin_unlock_irqrestore(&device->client_data_lock, flags);
 	}
+
+	down_write(&lists_rwsem);
 	list_del(&client->list);
+	up_write(&lists_rwsem);
 
 	mutex_unlock(&device_mutex);
 }
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
@ 2015-05-17  5:50 ` Haggai Eran
  2015-05-17  5:50 ` [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters Haggai Eran
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:50 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran

From: Guy Shapiro <guysh@mellanox.com>

Add network namespace support to the ib_addr module. For that, all the
address resolution and matching should be done using the appropriate
namespace instead of init_net.

This is achieved by:

1. Adding an explicit network namespace argument to exported function that
   require a namespace.
2. Saving the namespace in the rdma_addr_client structure.
3. Using it when calling networking functions.

In order to preserve the behavior of calling modules, &init_net is
passed as the parameter in calls from other modules. This is modified as
namespace support is added on more levels.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
---
 drivers/infiniband/core/addr.c | 18 ++++++++++--------
 drivers/infiniband/core/cma.c  |  1 +
 include/rdma/ib_addr.h         | 16 +++++++++++++++-
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 38339d220d7f..95c0bc4190a3 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 	int ret = -EADDRNOTAVAIL;
 
 	if (dev_addr->bound_dev_if) {
-		dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
+		dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if);
 		if (!dev)
 			return -ENODEV;
 		ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -138,7 +138,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 
 	switch (addr->sa_family) {
 	case AF_INET:
-		dev = ip_dev_find(&init_net,
+		dev = ip_dev_find(dev_addr->net,
 			((struct sockaddr_in *) addr)->sin_addr.s_addr);
 
 		if (!dev)
@@ -149,12 +149,11 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 			*vlan_id = rdma_vlan_dev_vlan_id(dev);
 		dev_put(dev);
 		break;
-
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
 		rcu_read_lock();
-		for_each_netdev_rcu(&init_net, dev) {
-			if (ipv6_chk_addr(&init_net,
+		for_each_netdev_rcu(dev_addr->net, dev) {
+			if (ipv6_chk_addr(dev_addr->net,
 					  &((struct sockaddr_in6 *) addr)->sin6_addr,
 					  dev, 1)) {
 				ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -236,7 +235,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
 	fl4.daddr = dst_ip;
 	fl4.saddr = src_ip;
 	fl4.flowi4_oif = addr->bound_dev_if;
-	rt = ip_route_output_key(&init_net, &fl4);
+	rt = ip_route_output_key(addr->net, &fl4);
 	if (IS_ERR(rt)) {
 		ret = PTR_ERR(rt);
 		goto out;
@@ -278,12 +277,13 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 	fl6.saddr = src_in->sin6_addr;
 	fl6.flowi6_oif = addr->bound_dev_if;
 
-	dst = ip6_route_output(&init_net, NULL, &fl6);
+	dst = ip6_route_output(addr->net, NULL, &fl6);
 	if ((ret = dst->error))
 		goto put;
 
 	if (ipv6_addr_any(&fl6.saddr)) {
-		ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
+		ret = ipv6_dev_get_saddr(addr->net,
+					 ip6_dst_idev(dst)->dev,
 					 &fl6.daddr, 0, &fl6.saddr);
 		if (ret)
 			goto put;
@@ -476,6 +476,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
 	rdma_gid2ip(&dgid_addr._sockaddr, dgid);
 
 	memset(&dev_addr, 0, sizeof(dev_addr));
+	dev_addr.net = &init_net;
 
 	ctx.addr = &dev_addr;
 	init_completion(&ctx.comp);
@@ -510,6 +511,7 @@ int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id)
 	rdma_gid2ip(&gid_addr._sockaddr, sgid);
 
 	memset(&dev_addr, 0, sizeof(dev_addr));
+	dev_addr.net = &init_net;
 	ret = rdma_translate_ip(&gid_addr._sockaddr, &dev_addr, vlan_id);
 	if (ret)
 		return ret;
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 1977f601a1ec..c957d53df666 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -524,6 +524,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
+	id_priv->id.route.addr.dev_addr.net = &init_net;
 
 	return &id_priv->id;
 }
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index ac54c27a2bfd..445a72e1320a 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -47,6 +47,7 @@
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_pack.h>
 #include <net/ipv6.h>
+#include <net/net_namespace.h>
 
 struct rdma_addr_client {
 	atomic_t refcount;
@@ -64,6 +65,16 @@ void rdma_addr_register_client(struct rdma_addr_client *client);
  */
 void rdma_addr_unregister_client(struct rdma_addr_client *client);
 
+/**
+ * struct rdma_dev_addr - Contains resolved RDMA hardware addresses
+ * @src_dev_addr:	Source MAC address.
+ * @dst_dev_addr:	Destination MAC address.
+ * @broadcast:		Broadcast address of the device.
+ * @dev_type:		The interface hardware type of the device.
+ * @bound_dev_if:	An optional device interface index.
+ * @transport:		The transport type used.
+ * @net:		Network namespace containing the bound_dev_if net_dev.
+ */
 struct rdma_dev_addr {
 	unsigned char src_dev_addr[MAX_ADDR_LEN];
 	unsigned char dst_dev_addr[MAX_ADDR_LEN];
@@ -71,11 +82,14 @@ struct rdma_dev_addr {
 	unsigned short dev_type;
 	int bound_dev_if;
 	enum rdma_transport_type transport;
+	struct net *net;
 };
 
 /**
  * rdma_translate_ip - Translate a local IP address to an RDMA hardware
  *   address.
+ *
+ * The dev_addr->net field must be initialized.
  */
 int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 		      u16 *vlan_id);
@@ -90,7 +104,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
  * @dst_addr: The destination address to resolve.
  * @addr: A reference to a data location that will receive the resolved
  *   addresses.  The data location must remain valid until the callback has
- *   been invoked.
+ *   been invoked. The net field of the addr struct must be valid.
  * @timeout_ms: Amount of time to wait for the address resolution to complete.
  * @callback: Call invoked once address resolution has completed, timed out,
  *   or been canceled.  A status of 0 indicates success.
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
  2015-05-17  5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran
@ 2015-05-17  5:50 ` Haggai Eran
       [not found]   ` <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-17  5:51 ` [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices " Haggai Eran
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:50 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran

From: Yotam Kenneth <yotamke@mellanox.com>

In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement
a callback that returns network device according to connection parameters.

The IB device and port, together with the P_Key and the IP address should
be enough to uniquely identify the ULP net device.

This function is passed to ib_core as part of the ib_client
registration.

Using this functionality, add a way to get the network namespace
corresponding to a work completion. This is needed so that responses to CM
requests can be sent from the same network namespace as the request.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
---
 drivers/infiniband/core/device.c | 52 ++++++++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h          | 33 +++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 3a44723c6b9d..be184f9feab2 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
+#include <linux/netdevice.h>
 #include <rdma/rdma_netlink.h>
 
 #include "core_priv.h"
@@ -748,6 +749,57 @@ int ib_find_pkey(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_find_pkey);
 
+static struct net_device *ib_get_net_dev_by_port_pkey_ip(struct ib_device *dev,
+							 u8 port,
+							 u16 pkey,
+							 struct sockaddr *addr)
+{
+	struct net_device *ret = NULL;
+	struct ib_client *client;
+
+	down_read(&lists_rwsem);
+
+	list_for_each_entry_rcu(client, &client_list, list)
+		if (client->get_net_device_by_port_pkey_ip) {
+			ret = client->get_net_device_by_port_pkey_ip(dev, port,
+								     pkey,
+								     addr);
+			if (ret)
+				break;
+		}
+
+	up_read(&lists_rwsem);
+
+	return ret;
+}
+
+struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev,
+					  u8 port,
+					  u16 pkey,
+					  struct sockaddr *addr)
+{
+	struct net_device *ndev = NULL;
+	struct net *ns;
+
+	if (rdma_protocol_ib(dev, port))
+		ndev = ib_get_net_dev_by_port_pkey_ip(dev, port, pkey, addr);
+
+	if (!ndev)
+		goto not_found;
+
+	rcu_read_lock();
+	ns = maybe_get_net(dev_net(ndev));
+	dev_put(ndev);
+	rcu_read_unlock();
+	if (!ns)
+		goto not_found;
+	return ns;
+
+not_found:
+	return get_net(&init_net);
+}
+EXPORT_SYMBOL(ib_get_net_ns_by_port_pkey_ip);
+
 static int __init ib_core_init(void)
 {
 	int ret;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8d59479eea4d..76093a33ed46 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -48,6 +48,7 @@
 #include <linux/rwsem.h>
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
+#include <linux/socket.h>
 #include <uapi/linux/if_ether.h>
 
 #include <linux/atomic.h>
@@ -1691,6 +1692,24 @@ struct ib_client {
 	void (*add)   (struct ib_device *);
 	void (*remove)(struct ib_device *);
 
+	/* Returns the net_dev belonging to this ib_client and matching the
+	 * given parameters.
+	 * @dev:	An RDMA device that the net_dev use for communication.
+	 * @port:	A physical port number on the RDMA device.
+	 * @pkey:	P_Key that the net_dev uses if applicable.
+	 * @addr:	An IP address the net_dev is configured with.
+	 *
+	 * An ib_client that implements a net_dev on top of RDMA devices
+	 * (such as IP over IB) should implement this callback, allowing the
+	 * rdma_cm module to find the right net_dev for a given request.
+	 *
+	 * The caller is responsible for calling dev_put on the returned
+	 * netdev. */
+	struct net_device *(*get_net_device_by_port_pkey_ip)(
+			struct ib_device *dev,
+			u8 port,
+			u16 pkey,
+			struct sockaddr *addr);
 	struct list_head list;
 };
 
@@ -2835,4 +2854,18 @@ static inline int ib_check_mr_access(int flags)
 int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
 		       struct ib_mr_status *mr_status);
 
+/**
+ * ib_get_net_ns_by_port_pkey_ip() - Return the appropriate net namespace
+ * for a received CM request
+ * @dev:	An RDMA device on which the request has been received.
+ * @port:	Port number on the RDMA device.
+ * @pkey:	The Pkey the request came on.
+ * @addr:	Contains the IP address that the request specified as its
+ *		destination.
+ */
+struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev,
+					  u8 port,
+					  u16 pkey,
+					  struct sockaddr *addr);
+
 #endif /* IB_VERBS_H */
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
  2015-05-17  5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran
  2015-05-17  5:50 ` [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters Haggai Eran
@ 2015-05-17  5:51 ` Haggai Eran
       [not found]   ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-17  5:51 ` [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs Haggai Eran
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran

From: Guy Shapiro <guysh@mellanox.com>

Implement the get_net_device_by_port_pkey_ip callback that returns network
device to ib_core according to connection parameters. Check the ipoib
device and iterate over all child devices to look for a match.

For each ipoib device we iterate through all upper devices when searching
for a matching IP, in order to support bonding.

Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 139 +++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 3421e42870c3..75def39a4271 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -48,6 +48,9 @@
 
 #include <linux/jhash.h>
 #include <net/arp.h>
+#include <net/addrconf.h>
+#include <linux/inetdevice.h>
+#include <rdma/ib_cache.h>
 
 #define DRV_VERSION "1.0.0"
 
@@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client;
 static void ipoib_add_one(struct ib_device *device);
 static void ipoib_remove_one(struct ib_device *device);
 static void ipoib_neigh_reclaim(struct rcu_head *rp);
+static struct net_device *ipoib_get_net_device_by_port_pkey_ip(
+		struct ib_device *dev, u8 port, u16 pkey,
+		struct sockaddr *addr);
 
 static struct ib_client ipoib_client = {
 	.name   = "ipoib",
 	.add    = ipoib_add_one,
-	.remove = ipoib_remove_one
+	.remove = ipoib_remove_one,
+	.get_net_device_by_port_pkey_ip = ipoib_get_net_device_by_port_pkey_ip,
 };
 
 int ipoib_open(struct net_device *dev)
@@ -222,6 +229,136 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* Called with an RCU read lock taken */
+static bool ipoib_is_dev_match_addr(struct sockaddr *addr,
+				    struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct in_device *in_dev;
+	struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
+#endif
+	__be32 ret_addr;
+
+	switch (addr->sa_family) {
+	case AF_INET:
+		in_dev = in_dev_get(dev);
+		if (!in_dev)
+			return false;
+
+		ret_addr = inet_confirm_addr(net, in_dev, 0,
+					     addr_in->sin_addr.s_addr,
+					     RT_SCOPE_HOST);
+		in_dev_put(in_dev);
+		if (ret_addr)
+			return true;
+
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
+			return true;
+
+		break;
+#endif
+	}
+	return false;
+}
+
+/**
+ * Find a net_device matching the given address, which is an upper device of
+ * the given net_device.
+ * @addr: IP address to look for.
+ * @dev: base IPoIB net_device
+ *
+ * If found, returns the net_device with a reference held. Otherwise return
+ * NULL.
+ */
+static struct net_device *ipoib_get_net_dev_match_addr(struct sockaddr *addr,
+						       struct net_device *dev)
+{
+	struct net_device *upper,
+			  *result = NULL;
+	struct list_head *iter;
+
+	rcu_read_lock();
+	if (ipoib_is_dev_match_addr(addr, dev)) {
+		dev_hold(dev);
+		result = dev;
+		goto out;
+	}
+
+	netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
+		if (ipoib_is_dev_match_addr(addr, upper)) {
+			dev_hold(upper);
+			result = upper;
+			break;
+		}
+	}
+out:
+	rcu_read_unlock();
+	return result;
+}
+
+/* returns an IPoIB netdev on top a given ipoib device matching a pkey_index
+ * and address, if one exists. */
+static struct net_device *ipoib_match_pkey_addr(struct ipoib_dev_priv *priv,
+						u16 pkey_index,
+						struct sockaddr *addr)
+{
+	struct ipoib_dev_priv *child_priv;
+	struct net_device *net_dev = NULL;
+
+	if (priv->pkey_index == pkey_index) {
+		net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev);
+		if (net_dev)
+			return net_dev;
+	}
+
+	/* Check child interfaces */
+	down_read(&priv->vlan_rwsem);
+	list_for_each_entry(child_priv, &priv->child_intfs, list) {
+		net_dev = ipoib_match_pkey_addr(child_priv, pkey_index, addr);
+		if (net_dev)
+			break;
+	}
+	up_read(&priv->vlan_rwsem);
+
+	return net_dev;
+}
+
+static struct net_device *ipoib_get_net_device_by_port_pkey_ip(
+		struct ib_device *dev, u8 port, u16 pkey, struct sockaddr *addr)
+{
+	struct ipoib_dev_priv *priv;
+	struct list_head *dev_list;
+	struct net_device *net_dev;
+	u16 pkey_index;
+	int ret;
+
+	ret = ib_find_cached_pkey(dev, port, pkey, &pkey_index);
+	if (ret)
+		return NULL;
+
+	if (!rdma_protocol_ib(dev, port))
+		return NULL;
+
+	dev_list = ib_get_client_data(dev, &ipoib_client);
+	if (!dev_list)
+		return NULL;
+
+	list_for_each_entry(priv, dev_list, list) {
+		if (priv->port != port)
+			continue;
+
+		net_dev = ipoib_match_pkey_addr(priv, pkey_index, addr);
+		if (net_dev)
+			return net_dev;
+	}
+	return NULL;
+}
+
 int ipoib_set_mode(struct net_device *dev, const char *buf)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
                   ` (2 preceding siblings ...)
  2015-05-17  5:51 ` [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices " Haggai Eran
@ 2015-05-17  5:51 ` Haggai Eran
  2015-05-19 18:35   ` Jason Gunthorpe
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran, Jason Gunthorpe

Enabling network namespaces for RDMA CM will allow processes on different
namespaces to listen on the same port. In order to leave namespace support
out of the CM layer, this requires that multiple RDMA CM IDs will be able
to share a single CM ID.

This patch adds infrastructure to retrieve an existing listening ib_cm_id,
based on its device and service ID, or create a new one if one does not
already exist. It also adds a reference count for such instances
(cm_id_private.sharecount), and prevents cm_destroy_id from destroying a CM
if it is still shared. See the relevant discussion [1].

[1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
    http://www.spinics.net/lists/netdev/msg328860.html

Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/infiniband/core/cm.c | 121 +++++++++++++++++++++++++++++++++++++++++--
 include/rdma/ib_cm.h         |   4 ++
 2 files changed, 120 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 14423c20c55b..4f936932dd54 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -212,6 +212,8 @@ struct cm_id_private {
 	spinlock_t lock;	/* Do not acquire inside cm.lock */
 	struct completion comp;
 	atomic_t refcount;
+	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
+	atomic_t sharecount;
 
 	struct ib_mad_send_buf *msg;
 	struct cm_timewait_info *timewait_info;
@@ -720,6 +722,7 @@ struct ib_cm_id *ib_create_cm_id(struct ib_device *device,
 	INIT_LIST_HEAD(&cm_id_priv->work_list);
 	atomic_set(&cm_id_priv->work_count, -1);
 	atomic_set(&cm_id_priv->refcount, 1);
+	atomic_set(&cm_id_priv->sharecount, 1);
 	return &cm_id_priv->id;
 
 error:
@@ -841,6 +844,12 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 	struct cm_work *work;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+
+	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
+		/* The id is still shared. */
+		return;
+	}
+
 retest:
 	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id->state) {
@@ -927,11 +936,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id)
 }
 EXPORT_SYMBOL(ib_destroy_cm_id);
 
-int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
-		 struct ib_cm_compare_data *compare_data)
+/**
+ * __ib_cm_listen - Initiates listening on the specified service ID for
+ *   connection and service ID resolution requests.
+ * @cm_id: Connection identifier associated with the listen request.
+ * @service_id: Service identifier matched against incoming connection
+ *   and service ID resolution requests.  The service ID should be specified
+ *   network-byte order.  If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
+ *   assign a service ID to the caller.
+ * @service_mask: Mask applied to service ID used to listen across a
+ *   range of service IDs.  If set to 0, the service ID is matched
+ *   exactly.  This parameter is ignored if %service_id is set to
+ *   IB_CM_ASSIGN_SERVICE_ID.
+ * @compare_data: This parameter is optional.  It specifies data that must
+ *   appear in the private data of a connection request for the specified
+ *   listen request.
+ * @lock: If set, lock the cm.lock spin-lock when adding the id to the
+ *   listener tree. When false, the caller must already hold the spin-lock,
+ *   and compare_data must be NULL.
+ */
+static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id,
+			  __be64 service_mask,
+			  struct ib_cm_compare_data *compare_data,
+			  bool lock)
 {
 	struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
-	unsigned long flags;
+	unsigned long flags = 0;
 	int ret = 0;
 
 	service_mask = service_mask ? service_mask : ~cpu_to_be64(0);
@@ -957,7 +987,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 
 	cm_id->state = IB_CM_LISTEN;
 
-	spin_lock_irqsave(&cm.lock, flags);
+	if (lock)
+		spin_lock_irqsave(&cm.lock, flags);
 	if (service_id == IB_CM_ASSIGN_SERVICE_ID) {
 		cm_id->service_id = cpu_to_be64(cm.listen_service_id++);
 		cm_id->service_mask = ~cpu_to_be64(0);
@@ -966,7 +997,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 		cm_id->service_mask = service_mask;
 	}
 	cur_cm_id_priv = cm_insert_listen(cm_id_priv);
-	spin_unlock_irqrestore(&cm.lock, flags);
+	if (lock)
+		spin_unlock_irqrestore(&cm.lock, flags);
 
 	if (cur_cm_id_priv) {
 		cm_id->state = IB_CM_IDLE;
@@ -976,8 +1008,87 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 	}
 	return ret;
 }
+
+int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
+		 struct ib_cm_compare_data *compare_data)
+{
+	return __ib_cm_listen(cm_id, service_id, service_mask, compare_data,
+			      true);
+}
 EXPORT_SYMBOL(ib_cm_listen);
 
+/**
+ * Create a new listening ib_cm_id and listen on the given service ID.
+ *
+ * If there's an existing ID listening on that same device and service ID,
+ * return it.
+ *
+ * @device: Device associated with the cm_id.  All related communication will
+ * be associated with the specified device.
+ * @cm_handler: Callback invoked to notify the user of CM events.
+ * @service_id: Service identifier matched against incoming connection
+ *   and service ID resolution requests.  The service ID should be specified
+ *   network-byte order.  If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
+ *   assign a service ID to the caller.
+ * @service_mask: Mask applied to service ID used to listen across a
+ *   range of service IDs.  If set to 0, the service ID is matched
+ *   exactly.  This parameter is ignored if %service_id is set to
+ *   IB_CM_ASSIGN_SERVICE_ID.
+ */
+struct ib_cm_id *ib_cm_id_create_and_listen(
+		struct ib_device *device,
+		ib_cm_handler cm_handler,
+		__be64 service_id,
+		__be64 service_mask)
+{
+	struct cm_id_private *cm_id_priv;
+	struct ib_cm_id *cm_id;
+	unsigned long flags;
+	int err = 0;
+
+	/* Create an ID in advance, since the creation may sleep */
+	cm_id = ib_create_cm_id(device, cm_handler, NULL);
+	if (IS_ERR(cm_id))
+		return cm_id;
+
+	spin_lock_irqsave(&cm.lock, flags);
+
+	if (service_id == IB_CM_ASSIGN_SERVICE_ID)
+		goto new_id;
+
+	/* Find an existing ID */
+	cm_id_priv = cm_find_listen(device, service_id, NULL);
+	if (cm_id_priv) {
+		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
+			/* This ID is already being destroyed */
+			atomic_dec(&cm_id_priv->sharecount);
+			goto new_id;
+		}
+
+		spin_unlock_irqrestore(&cm.lock, flags);
+		ib_destroy_cm_id(cm_id);
+		cm_id = &cm_id_priv->id;
+		if (cm_id->cm_handler != cm_handler || cm_id->context)
+			/* Sharing an ib_cm_id with different handlers is not
+			 * supported */
+			return ERR_PTR(-EINVAL);
+		return cm_id;
+	}
+
+new_id:
+	/* Use newly created ID */
+	err = __ib_cm_listen(cm_id, service_id, service_mask, NULL, false);
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+
+	if (err) {
+		ib_destroy_cm_id(cm_id);
+		return ERR_PTR(err);
+	}
+	return cm_id;
+}
+EXPORT_SYMBOL(ib_cm_id_create_and_listen);
+
 static __be64 cm_form_tid(struct cm_id_private *cm_id_priv,
 			  enum cm_msg_sequence msg_seq)
 {
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 39ed2d2fbd51..59f534b5f091 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -361,6 +361,10 @@ struct ib_cm_compare_data {
 int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 		 struct ib_cm_compare_data *compare_data);
 
+struct ib_cm_id *ib_cm_id_create_and_listen(
+		struct ib_device *device, ib_cm_handler cm_handler,
+		__be64 service_id, __be64 service_mask);
+
 struct ib_cm_req_param {
 	struct ib_sa_path_rec	*primary_path;
 	struct ib_sa_path_rec	*alternate_path;
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-17  5:50   ` [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list Haggai Eran
@ 2015-05-17  5:51   ` Haggai Eran
  2015-05-17  5:51   ` [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code Haggai Eran
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
	Haggai Eran

Expose the service ID on an incoming CM or SIDR request to the event
handler. This will allow the RDMA CM module to de-multiplex connection
requests based on the information encoded in the service ID.

Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cm.c | 3 +++
 include/rdma/ib_cm.h         | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 4f936932dd54..04cdc753b763 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1365,6 +1365,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 	primary_path->packet_life_time =
 		cm_req_get_primary_local_ack_timeout(req_msg);
 	primary_path->packet_life_time -= (primary_path->packet_life_time > 0);
+	primary_path->service_id = req_msg->service_id;
 
 	if (req_msg->alt_local_lid) {
 		memset(alt_path, 0, sizeof *alt_path);
@@ -1386,6 +1387,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 		alt_path->packet_life_time =
 			cm_req_get_alt_local_ack_timeout(req_msg);
 		alt_path->packet_life_time -= (alt_path->packet_life_time > 0);
+		alt_path->service_id = req_msg->service_id;
 	}
 }
 
@@ -3089,6 +3091,7 @@ static void cm_format_sidr_req_event(struct cm_work *work,
 	param = &work->cm_event.param.sidr_req_rcvd;
 	param->pkey = __be16_to_cpu(sidr_req_msg->pkey);
 	param->listen_id = listen_id;
+	param->service_id = sidr_req_msg->service_id;
 	param->port = work->port->port_num;
 	work->cm_event.private_data = &sidr_req_msg->private_data;
 }
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 59f534b5f091..6655016b3855 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -223,6 +223,7 @@ struct ib_cm_apr_event_param {
 
 struct ib_cm_sidr_req_event_param {
 	struct ib_cm_id		*listen_id;
+	__be64			service_id;
 	u8			port;
 	u16			pkey;
 };
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-17  5:50   ` [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list Haggai Eran
  2015-05-17  5:51   ` [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events Haggai Eran
@ 2015-05-17  5:51   ` Haggai Eran
  2015-05-17  5:51   ` [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module Haggai Eran
  2015-05-17  5:51   ` [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process Haggai Eran
  4 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
	Haggai Eran

When receiving a connection request, rdma_cm needs to associate the request
with a network namespace. To do this, it needs to know the request's
destination IP. For this the module needs to allow getting this information
from the private data in the request packet, instead of relying on the
information already being in the listening RDMA CM ID.

When creating a new incoming connection ID, the code in
cma_save_ip{4,6}_info can no longer rely on the listener's private data to
find the port number, so it reads it from the requested service ID.

Signed-off-by: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yotam Kenneth <yotamke-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cma.c | 150 ++++++++++++++++++++++++++----------------
 1 file changed, 92 insertions(+), 58 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c957d53df666..1828903e8388 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -843,97 +843,122 @@ static inline int cma_any_port(struct sockaddr *addr)
 	return !cma_port(addr);
 }
 
-static void cma_save_ib_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
+static void cma_save_ib_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
 			     struct ib_sa_path_rec *path)
 {
-	struct sockaddr_ib *listen_ib, *ib;
+	struct sockaddr_ib *ib;
 
-	listen_ib = (struct sockaddr_ib *) &listen_id->route.addr.src_addr;
-	ib = (struct sockaddr_ib *) &id->route.addr.src_addr;
-	ib->sib_family = listen_ib->sib_family;
-	ib->sib_pkey = path->pkey;
-	ib->sib_flowinfo = path->flow_label;
-	memcpy(&ib->sib_addr, &path->sgid, 16);
-	ib->sib_sid = listen_ib->sib_sid;
-	ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
-	ib->sib_scope_id = listen_ib->sib_scope_id;
-
-	ib = (struct sockaddr_ib *) &id->route.addr.dst_addr;
-	ib->sib_family = listen_ib->sib_family;
-	ib->sib_pkey = path->pkey;
-	ib->sib_flowinfo = path->flow_label;
-	memcpy(&ib->sib_addr, &path->dgid, 16);
-}
-
-static __be16 ss_get_port(const struct sockaddr_storage *ss)
-{
-	if (ss->ss_family == AF_INET)
-		return ((struct sockaddr_in *)ss)->sin_port;
-	else if (ss->ss_family == AF_INET6)
-		return ((struct sockaddr_in6 *)ss)->sin6_port;
-	BUG();
+	if (src_addr) {
+		ib = (struct sockaddr_ib *)src_addr;
+		ib->sib_family = AF_IB;
+		ib->sib_pkey = path->pkey;
+		ib->sib_flowinfo = path->flow_label;
+		memcpy(&ib->sib_addr, &path->sgid, 16);
+		ib->sib_sid = path->service_id;
+		ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+		ib->sib_scope_id = 0;
+	}
+	if (dst_addr) {
+		ib = (struct sockaddr_ib *)dst_addr;
+		ib->sib_family = AF_IB;
+		ib->sib_pkey = path->pkey;
+		ib->sib_flowinfo = path->flow_label;
+		memcpy(&ib->sib_addr, &path->dgid, 16);
+	}
 }
 
-static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
-			      struct cma_hdr *hdr)
+static void cma_save_ip4_info(struct sockaddr *src_addr,
+			      struct sockaddr *dst_addr,
+			      struct cma_hdr *hdr,
+			      __be16 local_port)
 {
 	struct sockaddr_in *ip4;
 
-	ip4 = (struct sockaddr_in *) &id->route.addr.src_addr;
-	ip4->sin_family = AF_INET;
-	ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
-	ip4->sin_port = ss_get_port(&listen_id->route.addr.src_addr);
+	if (src_addr) {
+		ip4 = (struct sockaddr_in *)src_addr;
+		ip4->sin_family = AF_INET;
+		ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
+		ip4->sin_port = local_port;
+	}
 
-	ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr;
-	ip4->sin_family = AF_INET;
-	ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr;
-	ip4->sin_port = hdr->port;
+	if (dst_addr) {
+		ip4 = (struct sockaddr_in *)dst_addr;
+		ip4->sin_family = AF_INET;
+		ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr;
+		ip4->sin_port = hdr->port;
+	}
 }
 
-static void cma_save_ip6_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
-			      struct cma_hdr *hdr)
+static void cma_save_ip6_info(struct sockaddr *src_addr,
+			      struct sockaddr *dst_addr,
+			      struct cma_hdr *hdr,
+			      __be16 local_port)
 {
 	struct sockaddr_in6 *ip6;
 
-	ip6 = (struct sockaddr_in6 *) &id->route.addr.src_addr;
-	ip6->sin6_family = AF_INET6;
-	ip6->sin6_addr = hdr->dst_addr.ip6;
-	ip6->sin6_port = ss_get_port(&listen_id->route.addr.src_addr);
+	if (src_addr) {
+		ip6 = (struct sockaddr_in6 *)src_addr;
+		ip6->sin6_family = AF_INET6;
+		ip6->sin6_addr = hdr->dst_addr.ip6;
+		ip6->sin6_port = local_port;
+	}
 
-	ip6 = (struct sockaddr_in6 *) &id->route.addr.dst_addr;
-	ip6->sin6_family = AF_INET6;
-	ip6->sin6_addr = hdr->src_addr.ip6;
-	ip6->sin6_port = hdr->port;
+	if (dst_addr) {
+		ip6 = (struct sockaddr_in6 *)dst_addr;
+		ip6->sin6_family = AF_INET6;
+		ip6->sin6_addr = hdr->src_addr.ip6;
+		ip6->sin6_port = hdr->port;
+	}
 }
 
-static int cma_save_net_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
-			     struct ib_cm_event *ib_event)
+static u16 cma_port_from_service_id(__be64 service_id)
 {
-	struct cma_hdr *hdr;
+	return be64_to_cpu(service_id);
+}
 
-	if ((listen_id->route.addr.src_addr.ss_family == AF_IB) &&
-	    (ib_event->event == IB_CM_REQ_RECEIVED)) {
-		cma_save_ib_info(id, listen_id, ib_event->param.req_rcvd.primary_path);
-		return 0;
-	}
+static int cma_save_ip_info(struct sockaddr *src_addr,
+			    struct sockaddr *dst_addr,
+			    struct ib_cm_event *ib_event,
+			    __be64 service_id)
+{
+	struct cma_hdr *hdr;
+	__be16 port;
 
 	hdr = ib_event->private_data;
 	if (hdr->cma_version != CMA_VERSION)
 		return -EINVAL;
 
+	port = htons(cma_port_from_service_id(service_id));
+
 	switch (cma_get_ip_ver(hdr)) {
 	case 4:
-		cma_save_ip4_info(id, listen_id, hdr);
+		cma_save_ip4_info(src_addr, dst_addr, hdr, port);
 		break;
 	case 6:
-		cma_save_ip6_info(id, listen_id, hdr);
+		cma_save_ip6_info(src_addr, dst_addr, hdr, port);
 		break;
 	default:
 		return -EINVAL;
 	}
+
 	return 0;
 }
 
+static int cma_save_net_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct ib_cm_event *ib_event,
+			     sa_family_t sa_family, __be64 service_id)
+{
+	if (sa_family == AF_IB && ib_event->event == IB_CM_REQ_RECEIVED) {
+		cma_save_ib_info(src_addr, dst_addr,
+				 ib_event->param.req_rcvd.primary_path);
+		return 0;
+	}
+
+	return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id);
+}
+
 static inline int cma_user_data_offset(struct rdma_id_private *id_priv)
 {
 	return cma_family(id_priv) == AF_IB ? 0 : sizeof(struct cma_hdr);
@@ -1184,6 +1209,9 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
 	struct rdma_route *rt;
+	const sa_family_t ss_family = listen_id->route.addr.src_addr.ss_family;
+	const __be64 service_id =
+		      ib_event->param.req_rcvd.primary_path->service_id;
 	int ret;
 
 	id = rdma_create_id(listen_id->event_handler, listen_id->context,
@@ -1192,7 +1220,9 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
 		return NULL;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	if (cma_save_net_info(id, listen_id, ib_event))
+	if (cma_save_net_info((struct sockaddr *)&id->route.addr.src_addr,
+			      (struct sockaddr *)&id->route.addr.dst_addr,
+			      ib_event, ss_family, service_id))
 		goto err;
 
 	rt = &id->route;
@@ -1238,7 +1268,11 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
 		return NULL;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
-	if (cma_save_net_info(id, listen_id, ib_event))
+	if (cma_save_net_info((struct sockaddr *)&id->route.addr.src_addr,
+			      (struct sockaddr *)&id->route.addr.dst_addr,
+			      ib_event,
+			      listen_id->route.addr.src_addr.ss_family,
+			      ib_event->param.sidr_req_rcvd.service_id))
 		goto err;
 
 	if (!cma_any_addr((struct sockaddr *) &id->route.addr.src_addr)) {
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-05-17  5:51   ` [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code Haggai Eran
@ 2015-05-17  5:51   ` Haggai Eran
  2015-05-17  5:51   ` [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process Haggai Eran
  4 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
	Haggai Eran

Previously RDMA CM relied on the CM module to check the incoming requests
against "compare data" struct to dispatch events for different RDMA CM IDs
based on the request parameters (IP address, address family, etc.). With
namespace support, multiple namespaces in RDMA CM will need to share a
single CM ID. Such an ID cannot be associated with a specific compare data,
because that could create conflicts with other namespaces.

The patch adds checks to verify that incoming requests match their RDMA CM
ID destination inside the RDMA CM module itself.

Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cm.c  |  5 +++--
 drivers/infiniband/core/cma.c | 12 +++++++++---
 include/rdma/ib_cm.h          |  3 +++
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 04cdc753b763..9bb1034c356f 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -461,8 +461,8 @@ static int cm_compare_data(struct ib_cm_compare_data *src_data,
 	return memcmp(src, dst, sizeof(src));
 }
 
-static int cm_compare_private_data(u32 *private_data,
-				   struct ib_cm_compare_data *dst_data)
+int cm_compare_private_data(u32 *private_data,
+			    struct ib_cm_compare_data *dst_data)
 {
 	u32 src[IB_CM_COMPARE_SIZE];
 
@@ -472,6 +472,7 @@ static int cm_compare_private_data(u32 *private_data,
 	cm_mask_copy(src, private_data, dst_data->mask);
 	return memcmp(src, dst_data->data, sizeof(src));
 }
+EXPORT_SYMBOL(cm_compare_private_data);
 
 /*
  * Trivial helpers to strip endian annotation and compare; the
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 1828903e8388..dced18975e2e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -146,6 +146,7 @@ struct rdma_id_private {
 	u8			tos;
 	u8			reuseaddr;
 	u8			afonly;
+	struct ib_cm_compare_data compare_data;
 };
 
 struct cma_multicast {
@@ -1319,6 +1320,10 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	int offset, ret;
 
 	listen_id = cm_id->context;
+	if (cm_compare_private_data(ib_event->private_data,
+				    &listen_id->compare_data))
+		return -EINVAL;
+
 	if (!cma_check_req_qp_type(&listen_id->id, ib_event))
 		return -EINVAL;
 
@@ -1586,7 +1591,6 @@ out:
 
 static int cma_ib_listen(struct rdma_id_private *id_priv)
 {
-	struct ib_cm_compare_data compare_data;
 	struct sockaddr *addr;
 	struct ib_cm_id	*id;
 	__be64 svc_id;
@@ -1603,8 +1607,10 @@ static int cma_ib_listen(struct rdma_id_private *id_priv)
 	if (cma_any_addr(addr) && !id_priv->afonly)
 		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL);
 	else {
-		cma_set_compare_data(id_priv->id.ps, addr, &compare_data);
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data);
+		cma_set_compare_data(id_priv->id.ps, addr,
+				     &id_priv->compare_data);
+		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0,
+				   &id_priv->compare_data);
 	}
 
 	if (ret) {
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 6655016b3855..1f0427310611 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -343,6 +343,9 @@ struct ib_cm_compare_data {
 	u32  mask[IB_CM_COMPARE_SIZE];
 };
 
+int cm_compare_private_data(u32 *private_data,
+			    struct ib_cm_compare_data *dst_data);
+
 /**
  * ib_cm_listen - Initiates listening on the specified service ID for
  *   connection and service ID resolution requests.
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 09/12] IB/cma: Separate port allocation to network namespaces
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
                   ` (4 preceding siblings ...)
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-17  5:51 ` Haggai Eran
  2015-05-17  5:51 ` [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces Haggai Eran
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran

Keep a struct for each network namespace containing the IDRs for the RDMA
CM port spaces. The struct is created dynamically using the generic_net
mechanism.

This patch is internal infrastructure work for the following patches. In
this patch, init_net is statically used as the network namespace for
the new port-space API.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
---
 drivers/infiniband/core/cma.c | 142 +++++++++++++++++++++++++++++++++---------
 1 file changed, 113 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index dced18975e2e..a4645a16c9f9 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -44,6 +44,8 @@
 #include <linux/module.h>
 #include <net/route.h>
 
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
 #include <net/tcp.h>
 #include <net/ipv6.h>
 
@@ -80,10 +82,37 @@ static LIST_HEAD(dev_list);
 static LIST_HEAD(listen_any_list);
 static DEFINE_MUTEX(lock);
 static struct workqueue_struct *cma_wq;
-static DEFINE_IDR(tcp_ps);
-static DEFINE_IDR(udp_ps);
-static DEFINE_IDR(ipoib_ps);
-static DEFINE_IDR(ib_ps);
+static int cma_pernet_id;
+
+struct cma_pernet {
+	struct idr tcp_ps;
+	struct idr udp_ps;
+	struct idr ipoib_ps;
+	struct idr ib_ps;
+};
+
+static struct cma_pernet *cma_pernet(struct net *net)
+{
+	return net_generic(net, cma_pernet_id);
+}
+
+static struct idr *cma_pernet_idr(struct net *net, enum rdma_port_space ps)
+{
+	struct cma_pernet *pernet = cma_pernet(net);
+
+	switch (ps) {
+	case RDMA_PS_TCP:
+		return &pernet->tcp_ps;
+	case RDMA_PS_UDP:
+		return &pernet->udp_ps;
+	case RDMA_PS_IPOIB:
+		return &pernet->ipoib_ps;
+	case RDMA_PS_IB:
+		return &pernet->ib_ps;
+	default:
+		return NULL;
+	}
+}
 
 struct cma_device {
 	struct list_head	list;
@@ -94,11 +123,34 @@ struct cma_device {
 };
 
 struct rdma_bind_list {
-	struct idr		*ps;
+	enum rdma_port_space	ps;
 	struct hlist_head	owners;
 	unsigned short		port;
 };
 
+static int cma_ps_alloc(struct net *net, enum rdma_port_space ps,
+			struct rdma_bind_list *bind_list, int snum)
+{
+	struct idr *idr = cma_pernet_idr(net, ps);
+
+	return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL);
+}
+
+static struct rdma_bind_list *cma_ps_find(struct net *net,
+					  enum rdma_port_space ps, int snum)
+{
+	struct idr *idr = cma_pernet_idr(net, ps);
+
+	return idr_find(idr, snum);
+}
+
+static void cma_ps_remove(struct net *net, enum rdma_port_space ps, int snum)
+{
+	struct idr *idr = cma_pernet_idr(net, ps);
+
+	idr_remove(idr, snum);
+}
+
 enum {
 	CMA_OPTION_AFONLY,
 };
@@ -1027,7 +1079,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
 	mutex_lock(&lock);
 	hlist_del(&id_priv->node);
 	if (hlist_empty(&bind_list->owners)) {
-		idr_remove(bind_list->ps, bind_list->port);
+		cma_ps_remove(&init_net, bind_list->ps, bind_list->port);
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
@@ -2327,8 +2379,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list,
 	hlist_add_head(&id_priv->node, &bind_list->owners);
 }
 
-static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
-			  unsigned short snum)
+static int cma_alloc_port(enum rdma_port_space ps,
+			  struct rdma_id_private *id_priv, unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
 	int ret;
@@ -2337,7 +2389,7 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
 	if (!bind_list)
 		return -ENOMEM;
 
-	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	ret = cma_ps_alloc(&init_net, ps, bind_list, snum);
 	if (ret < 0)
 		goto err;
 
@@ -2350,7 +2402,8 @@ err:
 	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
-static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
+static int cma_alloc_any_port(enum rdma_port_space ps,
+			      struct rdma_id_private *id_priv)
 {
 	static unsigned int last_used_port;
 	int low, high, remaining;
@@ -2361,7 +2414,7 @@ static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
 	rover = prandom_u32() % remaining + low;
 retry:
 	if (last_used_port != rover &&
-	    !idr_find(ps, (unsigned short) rover)) {
+	    !cma_ps_find(&init_net, ps, (unsigned short)rover)) {
 		int ret = cma_alloc_port(ps, id_priv, rover);
 		/*
 		 * Remember previously used port number in order to avoid
@@ -2416,7 +2469,8 @@ static int cma_check_port(struct rdma_bind_list *bind_list,
 	return 0;
 }
 
-static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
+static int cma_use_port(enum rdma_port_space ps,
+			struct rdma_id_private *id_priv)
 {
 	struct rdma_bind_list *bind_list;
 	unsigned short snum;
@@ -2426,7 +2480,7 @@ static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
 	if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
-	bind_list = idr_find(ps, snum);
+	bind_list = cma_ps_find(&init_net, ps, snum);
 	if (!bind_list) {
 		ret = cma_alloc_port(ps, id_priv, snum);
 	} else {
@@ -2449,25 +2503,24 @@ static int cma_bind_listen(struct rdma_id_private *id_priv)
 	return ret;
 }
 
-static struct idr *cma_select_inet_ps(struct rdma_id_private *id_priv)
+static enum rdma_port_space cma_select_inet_ps(
+		struct rdma_id_private *id_priv)
 {
 	switch (id_priv->id.ps) {
 	case RDMA_PS_TCP:
-		return &tcp_ps;
 	case RDMA_PS_UDP:
-		return &udp_ps;
 	case RDMA_PS_IPOIB:
-		return &ipoib_ps;
 	case RDMA_PS_IB:
-		return &ib_ps;
+		return id_priv->id.ps;
 	default:
-		return NULL;
+
+		return 0;
 	}
 }
 
-static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv)
+static enum rdma_port_space cma_select_ib_ps(struct rdma_id_private *id_priv)
 {
-	struct idr *ps = NULL;
+	enum rdma_port_space ps = 0;
 	struct sockaddr_ib *sib;
 	u64 sid_ps, mask, sid;
 
@@ -2477,15 +2530,15 @@ static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv)
 
 	if ((id_priv->id.ps == RDMA_PS_IB) && (sid == (RDMA_IB_IP_PS_IB & mask))) {
 		sid_ps = RDMA_IB_IP_PS_IB;
-		ps = &ib_ps;
+		ps = RDMA_PS_IB;
 	} else if (((id_priv->id.ps == RDMA_PS_IB) || (id_priv->id.ps == RDMA_PS_TCP)) &&
 		   (sid == (RDMA_IB_IP_PS_TCP & mask))) {
 		sid_ps = RDMA_IB_IP_PS_TCP;
-		ps = &tcp_ps;
+		ps = RDMA_PS_TCP;
 	} else if (((id_priv->id.ps == RDMA_PS_IB) || (id_priv->id.ps == RDMA_PS_UDP)) &&
 		   (sid == (RDMA_IB_IP_PS_UDP & mask))) {
 		sid_ps = RDMA_IB_IP_PS_UDP;
-		ps = &udp_ps;
+		ps = RDMA_PS_UDP;
 	}
 
 	if (ps) {
@@ -2498,7 +2551,7 @@ static struct idr *cma_select_ib_ps(struct rdma_id_private *id_priv)
 
 static int cma_get_port(struct rdma_id_private *id_priv)
 {
-	struct idr *ps;
+	enum rdma_port_space ps;
 	int ret;
 
 	if (cma_family(id_priv) != AF_IB)
@@ -3647,6 +3700,35 @@ static const struct ibnl_client_cbs cma_cb_table[] = {
 				       .module = THIS_MODULE },
 };
 
+static int cma_init_net(struct net *net)
+{
+	struct cma_pernet *pernet = cma_pernet(net);
+
+	idr_init(&pernet->tcp_ps);
+	idr_init(&pernet->udp_ps);
+	idr_init(&pernet->ipoib_ps);
+	idr_init(&pernet->ib_ps);
+
+	return 0;
+}
+
+static void cma_exit_net(struct net *net)
+{
+	struct cma_pernet *pernet = cma_pernet(net);
+
+	idr_destroy(&pernet->tcp_ps);
+	idr_destroy(&pernet->udp_ps);
+	idr_destroy(&pernet->ipoib_ps);
+	idr_destroy(&pernet->ib_ps);
+}
+
+static struct pernet_operations cma_pernet_operations = {
+	.init = cma_init_net,
+	.exit = cma_exit_net,
+	.id = &cma_pernet_id,
+	.size = sizeof(struct cma_pernet),
+};
+
 static int __init cma_init(void)
 {
 	int ret;
@@ -3655,6 +3737,10 @@ static int __init cma_init(void)
 	if (!cma_wq)
 		return -ENOMEM;
 
+	ret = register_pernet_subsys(&cma_pernet_operations);
+	if (ret)
+		goto err_wq;
+
 	ib_sa_register_client(&sa_client);
 	rdma_addr_register_client(&addr_client);
 	register_netdevice_notifier(&cma_nb);
@@ -3672,6 +3758,7 @@ err:
 	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
+err_wq:
 	destroy_workqueue(cma_wq);
 	return ret;
 }
@@ -3683,11 +3770,8 @@ static void __exit cma_cleanup(void)
 	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
+	unregister_pernet_subsys(&cma_pernet_operations);
 	destroy_workqueue(cma_wq);
-	idr_destroy(&tcp_ps);
-	idr_destroy(&udp_ps);
-	idr_destroy(&ipoib_ps);
-	idr_destroy(&ib_ps);
 }
 
 module_init(cma_init);
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
                   ` (5 preceding siblings ...)
  2015-05-17  5:51 ` [PATCH v4 for-next 09/12] IB/cma: Separate port allocation to network namespaces Haggai Eran
@ 2015-05-17  5:51 ` Haggai Eran
  2015-05-17  5:51 ` [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces Haggai Eran
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran

Use ib_cm_id_create_and_listen to create listening IB CM IDs or share
existing ones if needed. When given a request on a specific CM ID, the code
now needs to find the namespace matching the request, and find the RDMA CM
ID based on the namespace and the request parameters, instead of using the
context field of ib_cm_id as was previously done.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
---
 drivers/infiniband/core/cma.c | 136 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 115 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index a4645a16c9f9..e5d389ffa497 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1012,6 +1012,112 @@ static int cma_save_net_info(struct sockaddr *src_addr,
 	return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id);
 }
 
+struct cma_req_info {
+	struct ib_device *device;
+	int port;
+	__be64 service_id;
+	u16 pkey;
+};
+
+static int cma_save_req_info(struct ib_cm_event *ib_event,
+			     struct cma_req_info *req)
+{
+	struct ib_cm_req_event_param *req_param = &ib_event->param.req_rcvd;
+	struct ib_cm_sidr_req_event_param *sidr_param =
+		&ib_event->param.sidr_req_rcvd;
+
+	switch (ib_event->event) {
+	case IB_CM_REQ_RECEIVED:
+		req->device	= req_param->listen_id->device;
+		req->port	= req_param->port;
+		req->service_id	= req_param->primary_path->service_id;
+		req->pkey	= be16_to_cpu(req_param->primary_path->pkey);
+		break;
+	case IB_CM_SIDR_REQ_RECEIVED:
+		req->device	= sidr_param->listen_id->device;
+		req->port	= sidr_param->port;
+		req->service_id	= sidr_param->service_id;
+		req->pkey	= sidr_param->pkey;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct net *cma_get_net_ns(struct ib_cm_event *ib_event,
+				  struct cma_req_info *req)
+{
+	struct sockaddr_storage addr_storage;
+	struct sockaddr *listen_addr;
+	int err = 0;
+
+	listen_addr = (struct sockaddr *)&addr_storage;
+	err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id);
+	if (err)
+		return ERR_PTR(err);
+
+	return ib_get_net_ns_by_port_pkey_ip(req->device, req->port,
+					     req->pkey, listen_addr);
+}
+
+static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id)
+{
+	return (be64_to_cpu(service_id) >> 16) & 0xffff;
+}
+
+static struct rdma_id_private *cma_find_listener(
+		struct rdma_bind_list *bind_list,
+		struct ib_cm_id *cm_id,
+		struct ib_cm_event *ib_event)
+{
+	struct rdma_id_private *id_priv, *id_priv_dev;
+
+	if (!bind_list)
+		return ERR_PTR(-EINVAL);
+
+	hlist_for_each_entry(id_priv, &bind_list->owners, node) {
+		if (!cm_compare_private_data(ib_event->private_data,
+					     &id_priv->compare_data)) {
+			if (id_priv->id.device == cm_id->device)
+				return id_priv;
+			list_for_each_entry(id_priv_dev,
+					    &id_priv->listen_list,
+					    listen_list) {
+				if (id_priv_dev->id.device == cm_id->device)
+					return id_priv_dev;
+			}
+		}
+	}
+
+	return ERR_PTR(-EINVAL);
+}
+
+static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
+						 struct ib_cm_event *ib_event)
+{
+	struct cma_req_info req;
+	struct net *net;
+	struct rdma_bind_list *bind_list;
+	struct rdma_id_private *id_priv;
+	int err;
+
+	err = cma_save_req_info(ib_event, &req);
+	if (err)
+		return ERR_PTR(err);
+
+	net = cma_get_net_ns(ib_event, &req);
+	if (IS_ERR(net))
+		return ERR_PTR(PTR_ERR(net));
+
+	bind_list = cma_ps_find(net, rdma_ps_from_service_id(req.service_id),
+				cma_port_from_service_id(req.service_id));
+	id_priv = cma_find_listener(bind_list, cm_id, ib_event);
+	put_net(net);
+	return id_priv;
+}
+
 static inline int cma_user_data_offset(struct rdma_id_private *id_priv)
 {
 	return cma_family(id_priv) == AF_IB ? 0 : sizeof(struct cma_hdr);
@@ -1371,10 +1477,9 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	struct rdma_cm_event event;
 	int offset, ret;
 
-	listen_id = cm_id->context;
-	if (cm_compare_private_data(ib_event->private_data,
-				    &listen_id->compare_data))
-		return -EINVAL;
+	listen_id = cma_id_from_event(cm_id, ib_event);
+	if (IS_ERR(listen_id))
+		return PTR_ERR(listen_id);
 
 	if (!cma_check_req_qp_type(&listen_id->id, ib_event))
 		return -EINVAL;
@@ -1648,27 +1753,16 @@ static int cma_ib_listen(struct rdma_id_private *id_priv)
 	__be64 svc_id;
 	int ret;
 
-	id = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv);
-	if (IS_ERR(id))
-		return PTR_ERR(id);
-
-	id_priv->cm_id.ib = id;
-
 	addr = cma_src_addr(id_priv);
 	svc_id = rdma_get_service_id(&id_priv->id, addr);
-	if (cma_any_addr(addr) && !id_priv->afonly)
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL);
-	else {
+	if (!cma_any_addr(addr) || id_priv->afonly)
 		cma_set_compare_data(id_priv->id.ps, addr,
 				     &id_priv->compare_data);
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0,
-				   &id_priv->compare_data);
-	}
-
-	if (ret) {
-		ib_destroy_cm_id(id_priv->cm_id.ib);
-		id_priv->cm_id.ib = NULL;
-	}
+	id = ib_cm_id_create_and_listen(id_priv->id.device, cma_req_handler,
+					svc_id, 0);
+	if (IS_ERR(id))
+		return PTR_ERR(id);
+	id_priv->cm_id.ib = id;
 
 	return ret;
 }
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
                   ` (6 preceding siblings ...)
  2015-05-17  5:51 ` [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces Haggai Eran
@ 2015-05-17  5:51 ` Haggai Eran
  2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud
  2015-05-26 13:34 ` Doug Ledford
  9 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth, Haggai Eran

From: Guy Shapiro <guysh@mellanox.com>

Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding network namespace parameter for rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   of ib_cma, when listening on an ID and when looking for an ID for an
   incoming request.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling
from other modules.

Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
---
 drivers/infiniband/core/cma.c                      | 42 +++++++++++++---------
 drivers/infiniband/core/ucma.c                     |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c           |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c            |  2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  4 ++-
 include/rdma/rdma_cm.h                             |  6 +++-
 net/9p/trans_rdma.c                                |  4 +--
 net/rds/ib.c                                       |  2 +-
 net/rds/ib_cm.c                                    |  2 +-
 net/rds/iw.c                                       |  2 +-
 net/rds/iw_cm.c                                    |  2 +-
 net/rds/rdma_transport.c                           |  4 +--
 net/sunrpc/xprtrdma/svc_rdma_transport.c           |  4 +--
 net/sunrpc/xprtrdma/verbs.c                        |  3 +-
 14 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index e5d389ffa497..da4b48537111 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -553,7 +553,8 @@ static int cma_disable_callback(struct rdma_id_private *id_priv,
 	return 0;
 }
 
-struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
+struct rdma_cm_id *rdma_create_id(struct net *net,
+				  rdma_cm_event_handler event_handler,
 				  void *context, enum rdma_port_space ps,
 				  enum ib_qp_type qp_type)
 {
@@ -577,7 +578,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-	id_priv->id.route.addr.dev_addr.net = &init_net;
+	id_priv->id.route.addr.dev_addr.net = get_net(net);
 
 	return &id_priv->id;
 }
@@ -1178,6 +1179,7 @@ static void cma_cancel_operation(struct rdma_id_private *id_priv,
 static void cma_release_port(struct rdma_id_private *id_priv)
 {
 	struct rdma_bind_list *bind_list = id_priv->bind_list;
+	struct net *net = id_priv->id.route.addr.dev_addr.net;
 
 	if (!bind_list)
 		return;
@@ -1185,7 +1187,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
 	mutex_lock(&lock);
 	hlist_del(&id_priv->node);
 	if (hlist_empty(&bind_list->owners)) {
-		cma_ps_remove(&init_net, bind_list->ps, bind_list->port);
+		cma_ps_remove(net, bind_list->ps, bind_list->port);
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
@@ -1244,6 +1246,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 		cma_deref_id(id_priv->id.context);
 
 	kfree(id_priv->id.route.path_rec);
+	put_net(id_priv->id.route.addr.dev_addr.net);
 	kfree(id_priv);
 }
 EXPORT_SYMBOL(rdma_destroy_id);
@@ -1373,7 +1376,8 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
 		      ib_event->param.req_rcvd.primary_path->service_id;
 	int ret;
 
-	id = rdma_create_id(listen_id->event_handler, listen_id->context,
+	id = rdma_create_id(listen_id->route.addr.dev_addr.net,
+			    listen_id->event_handler, listen_id->context,
 			    listen_id->ps, ib_event->param.req_rcvd.qp_type);
 	if (IS_ERR(id))
 		return NULL;
@@ -1419,9 +1423,10 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
 {
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
+	struct net *net = listen_id->route.addr.dev_addr.net;
 	int ret;
 
-	id = rdma_create_id(listen_id->event_handler, listen_id->context,
+	id = rdma_create_id(net, listen_id->event_handler, listen_id->context,
 			    listen_id->ps, IB_QPT_UD);
 	if (IS_ERR(id))
 		return NULL;
@@ -1676,7 +1681,8 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 		return -ECONNABORTED;
 
 	/* Create a new RDMA id for the new IW CM ID */
-	new_cm_id = rdma_create_id(listen_id->id.event_handler,
+	new_cm_id = rdma_create_id(listen_id->id.route.addr.dev_addr.net,
+				   listen_id->id.event_handler,
 				   listen_id->id.context,
 				   RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(new_cm_id)) {
@@ -1808,12 +1814,13 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv,
 {
 	struct rdma_id_private *dev_id_priv;
 	struct rdma_cm_id *id;
+	struct net *net = id_priv->id.route.addr.dev_addr.net;
 	int ret;
 
 	if (cma_family(id_priv) == AF_IB && !rdma_cap_ib_cm(cma_dev->device, 1))
 		return;
 
-	id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps,
+	id = rdma_create_id(net, cma_listen_handler, id_priv, id_priv->id.ps,
 			    id_priv->id.qp_type);
 	if (IS_ERR(id))
 		return;
@@ -2483,7 +2490,8 @@ static int cma_alloc_port(enum rdma_port_space ps,
 	if (!bind_list)
 		return -ENOMEM;
 
-	ret = cma_ps_alloc(&init_net, ps, bind_list, snum);
+	ret = cma_ps_alloc(id_priv->id.route.addr.dev_addr.net, ps, bind_list,
+			   snum);
 	if (ret < 0)
 		goto err;
 
@@ -2502,13 +2510,14 @@ static int cma_alloc_any_port(enum rdma_port_space ps,
 	static unsigned int last_used_port;
 	int low, high, remaining;
 	unsigned int rover;
+	struct net *net = id_priv->id.route.addr.dev_addr.net;
 
-	inet_get_local_port_range(&init_net, &low, &high);
+	inet_get_local_port_range(net, &low, &high);
 	remaining = (high - low) + 1;
 	rover = prandom_u32() % remaining + low;
 retry:
 	if (last_used_port != rover &&
-	    !cma_ps_find(&init_net, ps, (unsigned short)rover)) {
+	    !cma_ps_find(net, ps, (unsigned short)rover)) {
 		int ret = cma_alloc_port(ps, id_priv, rover);
 		/*
 		 * Remember previously used port number in order to avoid
@@ -2574,7 +2583,7 @@ static int cma_use_port(enum rdma_port_space ps,
 	if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
-	bind_list = cma_ps_find(&init_net, ps, snum);
+	bind_list = cma_ps_find(id_priv->id.route.addr.dev_addr.net, ps, snum);
 	if (!bind_list) {
 		ret = cma_alloc_port(ps, id_priv, snum);
 	} else {
@@ -2766,8 +2775,11 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 		if (addr->sa_family == AF_INET)
 			id_priv->afonly = 1;
 #if IS_ENABLED(CONFIG_IPV6)
-		else if (addr->sa_family == AF_INET6)
-			id_priv->afonly = init_net.ipv6.sysctl.bindv6only;
+		else if (addr->sa_family == AF_INET6) {
+			struct net *net = id_priv->id.route.addr.dev_addr.net;
+
+			id_priv->afonly = net->ipv6.sysctl.bindv6only;
+		}
 #endif
 	}
 	ret = cma_get_port(id_priv);
@@ -3571,6 +3583,7 @@ static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private *id
 	dev_addr = &id_priv->id.route.addr.dev_addr;
 
 	if ((dev_addr->bound_dev_if == ndev->ifindex) &&
+	    (net_eq(dev_net(ndev), dev_addr->net)) &&
 	    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) {
 		printk(KERN_INFO "RDMA CM addr change for ndev %s used by id %p\n",
 		       ndev->name, &id_priv->id);
@@ -3596,9 +3609,6 @@ static int cma_netdev_callback(struct notifier_block *self, unsigned long event,
 	struct rdma_id_private *id_priv;
 	int ret = NOTIFY_DONE;
 
-	if (dev_net(ndev) != &init_net)
-		return NOTIFY_DONE;
-
 	if (event != NETDEV_BONDING_FAILOVER)
 		return NOTIFY_DONE;
 
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index d42b816c781f..44670ad5ba10 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -391,7 +391,8 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf,
 		return -ENOMEM;
 
 	ctx->uid = cmd.uid;
-	ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, cmd.ps, qp_type);
+	ctx->cm_id = rdma_create_id(&init_net, ucma_event_handler, ctx, cmd.ps,
+				    qp_type);
 	if (IS_ERR(ctx->cm_id)) {
 		ret = PTR_ERR(ctx->cm_id);
 		goto err1;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index cc2dd35ffbc0..5517333738a8 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -960,7 +960,7 @@ int iser_connect(struct iser_conn   *iser_conn,
 	ib_conn->beacon.wr_id = ISER_BEACON_WRID;
 	ib_conn->beacon.opcode = IB_WR_SEND;
 
-	ib_conn->cma_id = rdma_create_id(iser_cma_handler,
+	ib_conn->cma_id = rdma_create_id(&init_net, iser_cma_handler,
 					 (void *)iser_conn,
 					 RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(ib_conn->cma_id)) {
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 327529ee85eb..746d05ef2cea 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -3043,7 +3043,7 @@ isert_setup_id(struct isert_np *isert_np)
 	sa = (struct sockaddr *)&np->np_sockaddr;
 	isert_dbg("ksockaddr: %p, sa: %p\n", &np->np_sockaddr, sa);
 
-	id = rdma_create_id(isert_cma_handler, isert_np,
+	id = rdma_create_id(&init_net, isert_cma_handler, isert_np,
 			    RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(id)) {
 		isert_err("rdma_create_id() failed: %ld\n", PTR_ERR(id));
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index cd664d025f41..f4aab7485a84 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -125,7 +125,9 @@ extern kib_tunables_t  kiblnd_tunables;
 				     IBLND_CREDIT_HIGHWATER_V1 : \
 				     *kiblnd_tunables.kib_peercredits_hiw) /* when eagerly to return credits */
 
-#define kiblnd_rdma_create_id(cb, dev, ps, qpt) rdma_create_id(cb, dev, ps, qpt)
+#define kiblnd_rdma_create_id(cb, dev, ps, qpt) rdma_create_id(&init_net, \
+							       cb, dev, \
+							       ps, qpt)
 
 static inline int
 kiblnd_concurrent_sends_v1(void)
diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
index 1ed2088dc9f5..f45566f8dc1a 100644
--- a/include/rdma/rdma_cm.h
+++ b/include/rdma/rdma_cm.h
@@ -158,13 +158,17 @@ struct rdma_cm_id {
 /**
  * rdma_create_id - Create an RDMA identifier.
  *
+ * @net: The network namespace in which to create the new id.
  * @event_handler: User callback invoked to report events associated with the
  *   returned rdma_id.
  * @context: User specified context associated with the id.
  * @ps: RDMA port space.
  * @qp_type: type of queue pair associated with the id.
+ *
+ * The id holds a reference on the network namespace until it is destroyed.
  */
-struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
+struct rdma_cm_id *rdma_create_id(struct net *net,
+				  rdma_cm_event_handler event_handler,
 				  void *context, enum rdma_port_space ps,
 				  enum ib_qp_type qp_type);
 
diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 3533d2a53ab6..8bc3b5cbedf1 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -660,8 +660,8 @@ rdma_create_trans(struct p9_client *client, const char *addr, char *args)
 		return -ENOMEM;
 
 	/* Create the RDMA CM ID */
-	rdma->cm_id = rdma_create_id(p9_cm_event_handler, client, RDMA_PS_TCP,
-				     IB_QPT_RC);
+	rdma->cm_id = rdma_create_id(&init_net, p9_cm_event_handler, client,
+				     RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(rdma->cm_id))
 		goto error;
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index ba2dffeff608..c68e7b606975 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -326,7 +326,7 @@ static int rds_ib_laddr_check(__be32 addr)
 	/* Create a CMA ID and try to bind it. This catches both
 	 * IB and iWARP capable NICs.
 	 */
-	cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC);
+	cm_id = rdma_create_id(&init_net, NULL, NULL, RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(cm_id))
 		return PTR_ERR(cm_id);
 
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 31b74f5e61ad..0640f66d5d9e 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -583,7 +583,7 @@ int rds_ib_conn_connect(struct rds_connection *conn)
 
 	/* XXX I wonder what affect the port space has */
 	/* delegate cm event handler to rdma_transport */
-	ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn,
+	ic->i_cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, conn,
 				     RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(ic->i_cm_id)) {
 		ret = PTR_ERR(ic->i_cm_id);
diff --git a/net/rds/iw.c b/net/rds/iw.c
index 589935661d66..230da79e62a1 100644
--- a/net/rds/iw.c
+++ b/net/rds/iw.c
@@ -227,7 +227,7 @@ static int rds_iw_laddr_check(__be32 addr)
 	/* Create a CMA ID and try to bind it. This catches both
 	 * IB and iWARP capable NICs.
 	 */
-	cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC);
+	cm_id = rdma_create_id(&init_net, NULL, NULL, RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(cm_id))
 		return PTR_ERR(cm_id);
 
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
index a6c2bea9f8f9..5c18819f13fc 100644
--- a/net/rds/iw_cm.c
+++ b/net/rds/iw_cm.c
@@ -520,7 +520,7 @@ int rds_iw_conn_connect(struct rds_connection *conn)
 
 	/* XXX I wonder what affect the port space has */
 	/* delegate cm event handler to rdma_transport */
-	ic->i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn,
+	ic->i_cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, conn,
 				     RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(ic->i_cm_id)) {
 		ret = PTR_ERR(ic->i_cm_id);
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 6cd9d1deafc3..0c88f24aed85 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -159,8 +159,8 @@ static int rds_rdma_listen_init(void)
 	struct rdma_cm_id *cm_id;
 	int ret;
 
-	cm_id = rdma_create_id(rds_rdma_cm_event_handler, NULL, RDMA_PS_TCP,
-			       IB_QPT_RC);
+	cm_id = rdma_create_id(&init_net, rds_rdma_cm_event_handler, NULL,
+			       RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(cm_id)) {
 		ret = PTR_ERR(cm_id);
 		printk(KERN_ERR "RDS/RDMA: failed to setup listener, "
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 3df8320c6efe..6d5c61c2b381 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -704,8 +704,8 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
 	if (!cma_xprt)
 		return ERR_PTR(-ENOMEM);
 
-	listen_id = rdma_create_id(rdma_listen_handler, cma_xprt, RDMA_PS_TCP,
-				   IB_QPT_RC);
+	listen_id = rdma_create_id(&init_net, rdma_listen_handler, cma_xprt,
+				   RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(listen_id)) {
 		ret = PTR_ERR(listen_id);
 		dprintk("svcrdma: rdma_create_id failed = %d\n", ret);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4870d272e006..12d225359e8d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -509,7 +509,8 @@ rpcrdma_create_id(struct rpcrdma_xprt *xprt,
 
 	init_completion(&ia->ri_done);
 
-	id = rdma_create_id(rpcrdma_conn_upcall, xprt, RDMA_PS_TCP, IB_QPT_RC);
+	id = rdma_create_id(&init_net, rpcrdma_conn_upcall, xprt, RDMA_PS_TCP,
+			    IB_QPT_RC);
 	if (IS_ERR(id)) {
 		rc = PTR_ERR(id);
 		dprintk("RPC:       %s: rdma_create_id() failed %i\n",
-- 
1.7.11.2

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process
       [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-05-17  5:51   ` [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module Haggai Eran
@ 2015-05-17  5:51   ` Haggai Eran
  4 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-17  5:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth,
	Haggai Eran

From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add support for network namespaces from user space. This is done by passing
the network namespace of the process instead of init_net.

Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yotam Kenneth <yotamke-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/ucma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 44670ad5ba10..46b50c64933d 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -42,6 +42,7 @@
 #include <linux/slab.h>
 #include <linux/sysctl.h>
 #include <linux/module.h>
+#include <linux/nsproxy.h>
 
 #include <rdma/rdma_user_cm.h>
 #include <rdma/ib_marshall.h>
@@ -391,8 +392,8 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf,
 		return -ENOMEM;
 
 	ctx->uid = cmd.uid;
-	ctx->cm_id = rdma_create_id(&init_net, ucma_event_handler, ctx, cmd.ps,
-				    qp_type);
+	ctx->cm_id = rdma_create_id(current->nsproxy->net_ns,
+				    ucma_event_handler, ctx, cmd.ps, qp_type);
 	if (IS_ERR(ctx->cm_id)) {
 		ret = PTR_ERR(ctx->cm_id);
 		goto err1;
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
                   ` (7 preceding siblings ...)
  2015-05-17  5:51 ` [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces Haggai Eran
@ 2015-05-19 14:30 ` Yann Droneaud
  2015-05-19 14:54     ` Haggai Eran
  2015-05-26 13:34 ` Doug Ledford
  9 siblings, 1 reply; 81+ messages in thread
From: Yann Droneaud @ 2015-05-19 14:30 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

Hi,

Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit :
> Thanks again everyone for the review comments. I've updated the patch 
> set
> accordingly. The main changes are in the first patch to use a read
> -write
> semaphore instead of an SRCU, and with the reference counting of 
> shared
> ib_cm_ids.
> Please let me know if I missed anything, or if there are other issues 
> with
> the series.
> 
> Regards,
> Haggai
> 
> Changes from v3:
> - Patch 1 and 3: use read-write semaphore instead of an SRCU.
> - Patch 5:
>   * Use a direct reference count instead of a kref.
>   * Instead of adding get/put pair for ib_cm_ids, just avoid 
> destroying an
>     id when it is still in use.
>   * Squashes these two patches together, since the first one became 
> too
>     short:
>       IB/cm: Reference count ib_cm_ids
>       IB/cm: API to retrieve existing listening CM IDs
> - Rebase to Doug's to-be-rebased/for-4.2 branch.
> 
> Changes from v2:
> - Add patch 1 to change device_mutex to an RCU.
> - Remove patch that fixed IPv4 connections to an IPv4/IPv6 listener.
> - Limit namespace related changes to RDMA CM and InfiniBand only.
> - Rebase on dledford/for-v4.2, with David Ahern's unaligned access 
> patch.
>   * Use Michael Wang's capability functions where needed.
> - Move the struct net argument to be the first in all functions, to 
> match the
>   networking core scheme.
> - Patch 2:
>   * Remove unwanted braces.
> - Patch 4: check the return value of ib_find_cached_pkey.
> - Patch 8: verify the address family before calling cm_save_ib_info.
> - Patch 10: use generic_net instead of a custom radix tree for having 
> per
>   network namespace data.
> - Minor changes.
> 
> Changes from v1:
> - Include patch 1 in this series.
> - Rebase for v4.1.
> 
> Changes from v0:
> - Fix code review comments by Yann
> - Rebase on top of linux-3.19
> 
> RDMA-CM uses IP based addressing and routing to setup RDMA 
> connections between
> hosts. Currently, all of the IP interfaces and addresses used by the 
> RDMA-CM
> must reside in the init_net namespace. This restricts the usage of 
> containers
> with RDMA to only work with host network namespace (aka the kernel 
> init_net NS
> instance).
> 
> This patchset allows using network namespaces with the RDMA-CM.
> 
> Each RDMA-CM id keeps a reference to a network namespace.
> 
> This reference is based on the process network namespace at the time 
> of the
> creation of the object or inherited from the listener.
> 
> This network namespace is used to perform all IP and network related
> operations. Specifically, the local device lookup, as well as the 
> remote GID
> address resolution are done in the context of the RDMA-CM object's 
> namespace.
> This allows outgoing connections to reach the right target, even if 
> the same
> IP address exists in multiple network namespaces. This can happen if 
> each
> network namespace resides on a different P_Key.
> 
> Additionally, the network namespace is used to split the listener 
> service ID
> table. From the user point of view, each network namespace has a 
> unique,
> completely independent table of service IDs. This allows running 
> multiple
> instances of a single service on the same machine, using containers. 
> To
> implement this, multiple RDMA CM IDs, belonging to different 
> namespaces may
> now share their CM ID. When a request on such a CM ID arrives, the 
> RDMA CM
> module finds out the correct namespaces and looks for the RDMA CM ID
> matching the request's parameters.
> 
> The functionality introduced by this series would come into play when 
> the
> transport is InfiniBand and IPoIB interfaces are assigned to each 
> namespace.
> Multiple IPoIB interfaces can be created and assigned to different 
> RDMA-CM
> capable containers, for example using pipework [1].
> 
> Full support for RoCE will be introduced in a later stage.
> 

How does this play with iWarp: as iWarp HCA are aware of IP addresses /
UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or
will it be possible to use the iWarp HCA to access to address/port
resources tied to a different namespace ?


> The patches apply against Doug's tree for v4.2.
> 
> The patchset is structured as follows:
> 
> Patch 1 adds a read-write semaphore in addition to the device mutex 
> in
> ib_core to allow traversing the client list without a deadlock in 
> Patch 3.
> 
> Patch 2 is a relatively trivial API extension, requiring the callers
> of certain ib_addr functions to provide a network namespace, as 
> needed.
> 
> Patches 3 and 4 adds the ability to lookup a network namespace 
> according to
> the IP address, device and P_Key. It finds the matching IPoIB 
> interfaces, and
> safely takes a reference on the network namespace before returning to 
> the
> caller.
> 
> Patches 5-6 make necessary changes to the CM layer, to allow sharing 
> of a
> single CM ID between multiple RDMA CM IDs. This includes adding a 
> reference
> count to ib_cm_id structs, add an API to either create a new CM ID or 
> use
> an existing one, and expose the service ID to ib_cm clients.
> 
> Patches 7-8 do some preliminary refactoring to the rdma_cm module. 
> Patch 7
> refactors the logic that extracts the IP address from a connect 
> request to
> allow reuse by the namespace lookup code further on.  Patch 8 changes 
> the
> way RDMA CM module creates CM IDs, to avoid relying on the 
> compare_data
> feature of ib_cm. This feature associate a single compare_data struct 
> per
> ib_cm_id, so it cannot be used when sharing CM IDs.
> 
> Patches 9-12 add proper namespace support to the RDMA-CM module. This
> includes adding multiple port space tables, sharing ib_cm_ids between
> rdma_cm_ids, adding a network namespace parameter, and finally 
> retrieving
> the namespace from the creating process.
> 

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud
@ 2015-05-19 14:54     ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-19 14:54 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 17:30, Yann Droneaud wrote:
>Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit :
>> Full support for RoCE will be introduced in a later stage.
>>
> How does this play with iWarp: as iWarp HCA are aware of IP addresses /
> UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or
> will it be possible to use the iWarp HCA to access to address/port
> resources tied to a different namespace ?

At the moment the patchset doesn't add iWarp namespace support, as I
wrote in the discussion about v3 [1]. I suspect that the current iWarp
code lets users listen in one namespace and receive requests belonging
to the init namespace, or break the namespaces in similar ways. The
patchset doesn't attempt to fix this for iWarp or RoCE, only for InfiniBand.

Regards,
Haggai

[1] http://www.spinics.net/lists/linux-rdma/msg25156.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-19 14:54     ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-19 14:54 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 17:30, Yann Droneaud wrote:
>Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit :
>> Full support for RoCE will be introduced in a later stage.
>>
> How does this play with iWarp: as iWarp HCA are aware of IP addresses /
> UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or
> will it be possible to use the iWarp HCA to access to address/port
> resources tied to a different namespace ?

At the moment the patchset doesn't add iWarp namespace support, as I
wrote in the discussion about v3 [1]. I suspect that the current iWarp
code lets users listen in one namespace and receive requests belonging
to the init namespace, or break the namespaces in similar ways. The
patchset doesn't attempt to fix this for iWarp or RoCE, only for InfiniBand.

Regards,
Haggai

[1] http://www.spinics.net/lists/linux-rdma/msg25156.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]     ` <555B4EBE.7010900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-19 16:39       ` Parav Pandit
  2015-05-19 18:01         ` Haggai Eran
  2015-05-19 18:38         ` Jason Gunthorpe
  0 siblings, 2 replies; 81+ messages in thread
From: Parav Pandit @ 2015-05-19 16:39 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Tue, May 19, 2015 at 8:24 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 19/05/2015 17:30, Yann Droneaud wrote:
>>Le dimanche 17 mai 2015 à 08:50 +0300, Haggai Eran a écrit :
>>> Full support for RoCE will be introduced in a later stage.
>>>
>> How does this play with iWarp: as iWarp HCA are aware of IP addresses /
>> UDP/TCP ports, AFAIK, are those tied to namespace with this patchset or
>> will it be possible to use the iWarp HCA to access to address/port
>> resources tied to a different namespace ?
>
> At the moment the patchset doesn't add iWarp namespace support, as I
> wrote in the discussion about v3 [1]. I suspect that the current iWarp
> code lets users listen in one namespace and receive requests belonging
> to the init namespace, or break the namespaces in similar ways. The
> patchset doesn't attempt to fix this for iWarp or RoCE, only for InfiniBand.
>

This has come up during our first round of patch set for RoCEv2.
However due to last set of changes in this patch set, we decided to
move in for later submission.

Current high design approach to support this is:

A callback function registered to IP stack for IP address
assignment/removal. (it could be done possibly done for GID entry
update, instead of IP)
For each such namespace, for device supporting RoCEv2, one UDP socket
with RoCEv2 UDP port would be created and destroyed.



> Regards,
> Haggai
>
> [1] http://www.spinics.net/lists/linux-rdma/msg25156.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-19 16:39       ` Parav Pandit
@ 2015-05-19 18:01         ` Haggai Eran
       [not found]           ` <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-19 18:38         ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-05-19 18:01 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Tuesday, May 19, 2015 7:39 PM, Parav Pandit <parav.pandit-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org> wrote:
> This has come up during our first round of patch set for RoCEv2.
> However due to last set of changes in this patch set, we decided to
> move in for later submission.
> 
> Current high design approach to support this is:
> 
> A callback function registered to IP stack for IP address
> assignment/removal. (it could be done possibly done for GID entry
> update, instead of IP)
> For each such namespace, for device supporting RoCEv2, one UDP socket
> with RoCEv2 UDP port would be created and destroyed.

This needs to be done to take ownership of the RoCEv2 UDP port of course, but there is more to RoCE namespace support then that. You will need to de-multiplex rdma_cm_id requests from a shared ib_cm_id similarly to what is done in the InfiniBand patch-set here. Using the new RoCE GID cache should make it relatively easy. In addition, in order to support link-layer address resolution in the verbs layer as it is done today, we may need to add namespace support to the modify QP verb and the create AH verb.

Haggai--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters
       [not found]   ` <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-19 18:26     ` Jason Gunthorpe
       [not found]       ` <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 18:26 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

> +	list_for_each_entry_rcu(client, &client_list, list)
> +		if (client->get_net_device_by_port_pkey_ip) {
> +			ret = client->get_net_device_by_port_pkey_ip(dev, port,
> +								     pkey,
> +
> addr);

Considering the patch that introduced the rwsem, this doesn't look
right.

We can't call a client call back on a device that is unregistered, we
can't call a client call back on a client that is unregistering.

So the ordering of the list_del/add vs call back calls in device.c is
not right, please audit everything with an eye toward maintaing sane
invarients.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]   ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-19 18:28     ` Jason Gunthorpe
       [not found]       ` <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-19 23:55     ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 18:28 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:

> +#if IS_ENABLED(CONFIG_IPV6)
> +	struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
> +#endif
> +	__be32 ret_addr;
> +
> +	switch (addr->sa_family) {
> +	case AF_INET:
> +		in_dev = in_dev_get(dev);
> +		if (!in_dev)
> +			return false;
> +
> +		ret_addr = inet_confirm_addr(net, in_dev, 0,
> +					     addr_in->sin_addr.s_addr,
> +					     RT_SCOPE_HOST);
> +		in_dev_put(in_dev);
> +		if (ret_addr)
> +			return true;
> +
> +		break;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case AF_INET6:
> +		if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
> +			return true;
> +
> +		break;
> +#endif

Can you use

  if (IS_ENABLED(CONFIG_IPV6))

At the call site instead of the #if guards?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
  2015-05-17  5:51 ` [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs Haggai Eran
@ 2015-05-19 18:35   ` Jason Gunthorpe
       [not found]     ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 18:35 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
> @@ -212,6 +212,8 @@ struct cm_id_private {
>  	spinlock_t lock;	/* Do not acquire inside cm.lock */
>  	struct completion comp;
>  	atomic_t refcount;
> +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
> +	atomic_t sharecount;

No need for this atomic, hold the lock

The use of the atomic looks racy:

> +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
> +		/* The id is still shared. */
> +		return;
> +	}

Might race with this:

> +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
> +			/* This ID is already being destroyed */
> +			atomic_dec(&cm_id_priv->sharecount);
> +			goto new_id;
> +		}
> +

Resulting in use-after-free of cm_id_priv->sharecount

Don't try and be clever with atomics, it is almost always wrong.

The share count should be 'listen_sharecount' because it *only* works
for listen.

The above test in cm_destroy_id should only be in the listen branch of
the if.

> + * Create a new listening ib_cm_id and listen on the given service ID.
> + *
> + * If there's an existing ID listening on that same device and service ID,
> + * return it.
> + *

.. Callers should call cm_destroy_id when done with the listen ..

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-19 16:39       ` Parav Pandit
  2015-05-19 18:01         ` Haggai Eran
@ 2015-05-19 18:38         ` Jason Gunthorpe
       [not found]           ` <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 18:38 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Haggai Eran, Yann Droneaud, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Tue, May 19, 2015 at 10:09:14PM +0530, Parav Pandit wrote:

> A callback function registered to IP stack for IP address
> assignment/removal. (it could be done possibly done for GID entry
> update, instead of IP)
> For each such namespace, for device supporting RoCEv2, one UDP socket
> with RoCEv2 UDP port would be created and destroyed.

You need to consult netdev for whatever scheme you come up with.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]           ` <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-19 18:42             ` Parav Pandit
  0 siblings, 0 replies; 81+ messages in thread
From: Parav Pandit @ 2015-05-19 18:42 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Yann Droneaud, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Tue, May 19, 2015 at 11:31 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On Tuesday, May 19, 2015 7:39 PM, Parav Pandit <parav.pandit-1wcpHE2jlwO1Z/+hSey0Gg@public.gmane.org> wrote:
>> This has come up during our first round of patch set for RoCEv2.
>> However due to last set of changes in this patch set, we decided to
>> move in for later submission.
>>
>> Current high design approach to support this is:
>>
>> A callback function registered to IP stack for IP address
>> assignment/removal. (it could be done possibly done for GID entry
>> update, instead of IP)
>> For each such namespace, for device supporting RoCEv2, one UDP socket
>> with RoCEv2 UDP port would be created and destroyed.
>
> This needs to be done to take ownership of the RoCEv2 UDP port of course, but there is more to RoCE namespace support then that. You will need to de-multiplex rdma_cm_id requests from a shared ib_cm_id similarly to what is done in the InfiniBand patch-set here. Using the new RoCE GID cache should make it relatively easy. In addition, in order to support link-layer address resolution in the verbs layer as it is done today, we may need to add namespace support to the modify QP verb and the create AH verb.
>

Yes. We talked sometime back with Liran and Matan on same topic.

> Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]           ` <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-19 18:44             ` Parav Pandit
       [not found]               ` <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Parav Pandit @ 2015-05-19 18:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Yann Droneaud, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Wed, May 20, 2015 at 12:08 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Tue, May 19, 2015 at 10:09:14PM +0530, Parav Pandit wrote:
>
>> A callback function registered to IP stack for IP address
>> assignment/removal. (it could be done possibly done for GID entry
>> update, instead of IP)
>> For each such namespace, for device supporting RoCEv2, one UDP socket
>> with RoCEv2 UDP port would be created and destroyed.
>
> You need to consult netdev for whatever scheme you come up with.
>

Yes. However it won't be an issue for netdev as its going to use
register_inetaddr_notifier and friend functions which are standard set
of APIs by networking stack.
But yes, we will have netdev included in the review.
Will send out patch in sometime for review to get comments. We were
just waiting for v4 patches to get in first.

> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]               ` <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-19 19:20                 ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 19:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Haggai Eran, Yann Droneaud, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Wed, May 20, 2015 at 12:14:10AM +0530, Parav Pandit wrote:
> On Wed, May 20, 2015 at 12:08 AM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > On Tue, May 19, 2015 at 10:09:14PM +0530, Parav Pandit wrote:
> >
> >> A callback function registered to IP stack for IP address
> >> assignment/removal. (it could be done possibly done for GID entry
> >> update, instead of IP)
> >> For each such namespace, for device supporting RoCEv2, one UDP socket
> >> with RoCEv2 UDP port would be created and destroyed.
> >
> > You need to consult netdev for whatever scheme you come up with.
> >
> 
> Yes. However it won't be an issue for netdev as its going to use
> register_inetaddr_notifier and friend functions which are standard
> set

I was talking about your scheme to reserve a UDP port number. Make it
clear, and ask specifically about that.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
       [not found]     ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-19 22:35       ` Jason Gunthorpe
       [not found]         ` <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-21  7:07         ` Haggai Eran
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 22:35 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
> > @@ -212,6 +212,8 @@ struct cm_id_private {
> >  	spinlock_t lock;	/* Do not acquire inside cm.lock */
> >  	struct completion comp;
> >  	atomic_t refcount;
> > +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
> > +	atomic_t sharecount;
> 
> No need for this atomic, hold the lock
> 
> The use of the atomic looks racy:
> 
> > +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
> > +		/* The id is still shared. */
> > +		return;
> > +	}
> 
> Might race with this:
> 
> > +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
> > +			/* This ID is already being destroyed */
> > +			atomic_dec(&cm_id_priv->sharecount);
> > +			goto new_id;
> > +		}
> > +
> 
> Resulting in use-after-free of cm_id_priv->sharecount

Actually, there is something else odd here.. I mentioned the above
because there wasn't obvious ref'ing on the cm_id_priv. Looking closer
the cm.lock should prevent use-after-free, but there is still no ref.

The more I look at this, the more I think it is sketchy. Don't try and
merge sharecount and refcount together, after cm_find_listen is called
you have to increment the refcount before dropping cm.lock.

Decrement the refcount when destroying a shared listen.

I also don't see how the 'goto new_id' can work, if cm_find_listen
succeeds then __ib_cm_listen is guarenteed to fail.

Fix the locking to make that impossible, associate sharecount with the
cm.lock and, rework how cm_destroy_id grabs the cm_id_priv->lock spinlock:

	case IB_CM_LISTEN:
		spin_lock_irq(&cm.lock);
		if (cm_id_priv->sharecount != 0) {
		     cm_id_prv->sharecount--;
		     // paired with in in ib_cm_id_create_and_listen
		     atomic_dec(&cm_id_priv->refcount);
		     spin_unlock_irq(&cm.lock);
		     return;
		}
		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
		spin_unlock_irq(&cm.lock);
	
		spin_lock_irq(&cm_id_priv->lock);
		cm_id->state = IB_CM_IDLE;
		spin_unlock_irq(&cm_id_priv->lock);
		break;

Now that condition is eliminated, the unneeded atomic is gone, and
refcount still acts like a proper kref should.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]   ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-19 18:28     ` Jason Gunthorpe
@ 2015-05-19 23:55     ` Jason Gunthorpe
       [not found]       ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-19 23:55 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
> From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Implement the get_net_device_by_port_pkey_ip callback that returns network
> device to ib_core according to connection parameters. Check the ipoib
> device and iterate over all child devices to look for a match.

Can you give a run down on how to actually set this up? Like what
shell command do you execute?

What are the child devices in the netnamespace?

> For each ipoib device we iterate through all upper devices when searching
> for a matching IP, in order to support bonding.

Checking an IP address in a packet against a device without consulting
the routing table is a big red flag for me. Can you elaborate?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters
       [not found]       ` <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-20 14:48           ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-20 14:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 21:26, Jason Gunthorpe wrote:
>> +	list_for_each_entry_rcu(client, &client_list, list)
>> +		if (client->get_net_device_by_port_pkey_ip) {
>> +			ret = client->get_net_device_by_port_pkey_ip(dev, port,
>> +								     pkey,
>> +
>> addr);
> 
> Considering the patch that introduced the rwsem, this doesn't look
> right.
> 
> We can't call a client call back on a device that is unregistered, we
> can't call a client call back on a client that is unregistering.
> 
> So the ordering of the list_del/add vs call back calls in device.c is
> not right, please audit everything with an eye toward maintaing sane
> invarients.

Thanks for pointing that out. I'll do that.

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters
@ 2015-05-20 14:48           ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-20 14:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 21:26, Jason Gunthorpe wrote:
>> +	list_for_each_entry_rcu(client, &client_list, list)
>> +		if (client->get_net_device_by_port_pkey_ip) {
>> +			ret = client->get_net_device_by_port_pkey_ip(dev, port,
>> +								     pkey,
>> +
>> addr);
> 
> Considering the patch that introduced the rwsem, this doesn't look
> right.
> 
> We can't call a client call back on a device that is unregistered, we
> can't call a client call back on a client that is unregistering.
> 
> So the ordering of the list_del/add vs call back calls in device.c is
> not right, please audit everything with an eye toward maintaing sane
> invarients.

Thanks for pointing that out. I'll do that.

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]       ` <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-20 15:17           ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-20 15:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 21:28, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
> 
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
>> +#endif
>> +	__be32 ret_addr;
>> +
>> +	switch (addr->sa_family) {
>> +	case AF_INET:
>> +		in_dev = in_dev_get(dev);
>> +		if (!in_dev)
>> +			return false;
>> +
>> +		ret_addr = inet_confirm_addr(net, in_dev, 0,
>> +					     addr_in->sin_addr.s_addr,
>> +					     RT_SCOPE_HOST);
>> +		in_dev_put(in_dev);
>> +		if (ret_addr)
>> +			return true;
>> +
>> +		break;
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	case AF_INET6:
>> +		if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
>> +			return true;
>> +
>> +		break;
>> +#endif
> 
> Can you use
> 
>   if (IS_ENABLED(CONFIG_IPV6))
> 
> At the call site instead of the #if guards?

Sure, I'll do that in the next revision of the patch-set.

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
@ 2015-05-20 15:17           ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-20 15:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 21:28, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
> 
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
>> +#endif
>> +	__be32 ret_addr;
>> +
>> +	switch (addr->sa_family) {
>> +	case AF_INET:
>> +		in_dev = in_dev_get(dev);
>> +		if (!in_dev)
>> +			return false;
>> +
>> +		ret_addr = inet_confirm_addr(net, in_dev, 0,
>> +					     addr_in->sin_addr.s_addr,
>> +					     RT_SCOPE_HOST);
>> +		in_dev_put(in_dev);
>> +		if (ret_addr)
>> +			return true;
>> +
>> +		break;
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	case AF_INET6:
>> +		if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
>> +			return true;
>> +
>> +		break;
>> +#endif
> 
> Can you use
> 
>   if (IS_ENABLED(CONFIG_IPV6))
> 
> At the call site instead of the #if guards?

Sure, I'll do that in the next revision of the patch-set.

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]       ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-21  5:33           ` Haggai Eran
  2015-05-21  5:48           ` Haggai Eran
  1 sibling, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  5:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 20/05/2015 02:55, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
>> > From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> > 
>> > Implement the get_net_device_by_port_pkey_ip callback that returns network
>> > device to ib_core according to connection parameters. Check the ipoib
>> > device and iterate over all child devices to look for a match.
> Can you give a run down on how to actually set this up? Like what
> shell command do you execute?

Sure.

There are two methods to create new child interface for IPoIB.
For a specific P_Key, write the desired P_Key to the create_child sysfs
file:
# echo 0x8000 > /sys/class/net/ib0/create_child
This creates a new interface ib0.8000 operating with P_Key 0x8000.

To create a new child interface on the default P_Key, its possible to
use iproute:
# ip link add link ib0 name ib0.1 type ipoib

In order to create a new network namespace:
# ip netns add ns1

Then, you can assign the new netdev to the namespace:
# ip link set ib0.1 netns ns1

You can then set an IP address in the network namespace, and try some
RDMA CM applications:
# ip netns exec ns1 ip addr add dev ib0.1 192.168.0.1/24
# ip netns exec ns1 rdma_server

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
@ 2015-05-21  5:33           ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  5:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 20/05/2015 02:55, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
>> > From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> > 
>> > Implement the get_net_device_by_port_pkey_ip callback that returns network
>> > device to ib_core according to connection parameters. Check the ipoib
>> > device and iterate over all child devices to look for a match.
> Can you give a run down on how to actually set this up? Like what
> shell command do you execute?

Sure.

There are two methods to create new child interface for IPoIB.
For a specific P_Key, write the desired P_Key to the create_child sysfs
file:
# echo 0x8000 > /sys/class/net/ib0/create_child
This creates a new interface ib0.8000 operating with P_Key 0x8000.

To create a new child interface on the default P_Key, its possible to
use iproute:
# ip link add link ib0 name ib0.1 type ipoib

In order to create a new network namespace:
# ip netns add ns1

Then, you can assign the new netdev to the namespace:
# ip link set ib0.1 netns ns1

You can then set an IP address in the network namespace, and try some
RDMA CM applications:
# ip netns exec ns1 ip addr add dev ib0.1 192.168.0.1/24
# ip netns exec ns1 rdma_server

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]       ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-21  5:48           ` Haggai Eran
  2015-05-21  5:48           ` Haggai Eran
  1 sibling, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  5:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 20/05/2015 02:55, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
>> For each ipoib device we iterate through all upper devices when searching
>> for a matching IP, in order to support bonding.
> 
> Checking an IP address in a packet against a device without consulting
> the routing table is a big red flag for me. Can you elaborate?
We want to match the incoming CM request to a specific net_dev, and use
the network namespace of that net_dev to find the right rdma_cm_id.
Because there can be multiple network namespaces, there would also be
multiple routing tables, so I'm not sure how we could use them. The
solution we used is to walk all the upper net_devs of all the IPoIB
child devices of a given IB device and port, and find one that matches
the request's IB device, port number, P_Key and IP address.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
@ 2015-05-21  5:48           ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  5:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 20/05/2015 02:55, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
>> For each ipoib device we iterate through all upper devices when searching
>> for a matching IP, in order to support bonding.
> 
> Checking an IP address in a packet against a device without consulting
> the routing table is a big red flag for me. Can you elaborate?
We want to match the incoming CM request to a specific net_dev, and use
the network namespace of that net_dev to find the right rdma_cm_id.
Because there can be multiple network namespaces, there would also be
multiple routing tables, so I'm not sure how we could use them. The
solution we used is to walk all the upper net_devs of all the IPoIB
child devices of a given IB device and port, and find one that matches
the request's IB device, port number, P_Key and IP address.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]           ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-21  5:48             ` Or Gerlitz
       [not found]               ` <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-21 17:43             ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Or Gerlitz @ 2015-05-21  5:48 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Thu, May 21, 2015 at 8:33 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 20/05/2015 02:55, Jason Gunthorpe wrote:
>> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
>>> > From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> >
>>> > Implement the get_net_device_by_port_pkey_ip callback that returns network
>>> > device to ib_core according to connection parameters. Check the ipoib
>>> > device and iterate over all child devices to look for a match.
>> Can you give a run down on how to actually set this up? Like what
>> shell command do you execute?
>
> Sure.
>
> There are two methods to create new child interface for IPoIB.
> For a specific P_Key, write the desired P_Key to the create_child sysfs
> file:
> # echo 0x8000 > /sys/class/net/ib0/create_child
> This creates a new interface ib0.8000 operating with P_Key 0x8000.

0x8000 is practically zero (bits 15-0), right? not sure this is a valid pkey.


> To create a new child interface on the default P_Key, its possible to
> use iproute:
> # ip link add link ib0 name ib0.1 type ipoib

you can use non default pkeys as well here.

$ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]               ` <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-21  6:33                 ` Haggai Eran
       [not found]                   ` <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  6:33 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On 21/05/2015 08:48, Or Gerlitz wrote:
> On Thu, May 21, 2015 at 8:33 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> On 20/05/2015 02:55, Jason Gunthorpe wrote:
>>> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote:
>>>>> From: Guy Shapiro <guysh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>>
>>>>> Implement the get_net_device_by_port_pkey_ip callback that returns network
>>>>> device to ib_core according to connection parameters. Check the ipoib
>>>>> device and iterate over all child devices to look for a match.
>>> Can you give a run down on how to actually set this up? Like what
>>> shell command do you execute?
>>
>> Sure.
>>
>> There are two methods to create new child interface for IPoIB.
>> For a specific P_Key, write the desired P_Key to the create_child sysfs
>> file:
>> # echo 0x8000 > /sys/class/net/ib0/create_child
>> This creates a new interface ib0.8000 operating with P_Key 0x8000.
> 
> 0x8000 is practically zero (bits 15-0), right? not sure this is a valid pkey.
Right, bad example :)

> 
> 
>> To create a new child interface on the default P_Key, its possible to
>> use iproute:
>> # ip link add link ib0 name ib0.1 type ipoib
> 
> you can use non default pkeys as well here.
> 
> $ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001
> 

Right. I think when we started development of the namespaces patches
these child interfaces (rtnetlink with pkey) didn't work for us, but I
checked now on an updated kernel and they do.

Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
       [not found]     ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-21  7:07         ` Haggai Eran
  2015-05-21  7:07         ` Haggai Eran
  1 sibling, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  7:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 21:35, Jason Gunthorpe wrote:
...

> The share count should be 'listen_sharecount' because it *only* works
> for listen.
> 
> The above test in cm_destroy_id should only be in the listen branch of
> the if.

Okay.

> 
>> + * Create a new listening ib_cm_id and listen on the given service ID.
>> + *
>> + * If there's an existing ID listening on that same device and service ID,
>> + * return it.
>> + *
> 
> .. Callers should call cm_destroy_id when done with the listen ..

I'll add that (with ib_destroy_cm_id of course).

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
@ 2015-05-21  7:07         ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  7:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 19/05/2015 21:35, Jason Gunthorpe wrote:
...

> The share count should be 'listen_sharecount' because it *only* works
> for listen.
> 
> The above test in cm_destroy_id should only be in the listen branch of
> the if.

Okay.

> 
>> + * Create a new listening ib_cm_id and listen on the given service ID.
>> + *
>> + * If there's an existing ID listening on that same device and service ID,
>> + * return it.
>> + *
> 
> .. Callers should call cm_destroy_id when done with the listen ..

I'll add that (with ib_destroy_cm_id of course).

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
       [not found]         ` <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-21  8:08             ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  8:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 20/05/2015 01:35, Jason Gunthorpe wrote:
> On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote:
>> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
>>> @@ -212,6 +212,8 @@ struct cm_id_private {
>>>  	spinlock_t lock;	/* Do not acquire inside cm.lock */
>>>  	struct completion comp;
>>>  	atomic_t refcount;
>>> +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
>>> +	atomic_t sharecount;
>>
>> No need for this atomic, hold the lock
>>
>> The use of the atomic looks racy:
>>
>>> +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
>>> +		/* The id is still shared. */
>>> +		return;
>>> +	}
>>
>> Might race with this:
>>
>>> +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
>>> +			/* This ID is already being destroyed */
>>> +			atomic_dec(&cm_id_priv->sharecount);
>>> +			goto new_id;
>>> +		}
>>> +
>>
>> Resulting in use-after-free of cm_id_priv->sharecount
> 
> Actually, there is something else odd here.. I mentioned the above
> because there wasn't obvious ref'ing on the cm_id_priv. Looking closer
> the cm.lock should prevent use-after-free, but there is still no ref.
> 
> The more I look at this, the more I think it is sketchy. Don't try and
> merge sharecount and refcount together, 
I'm not sure what you mean here. The way I was thinking about it was
that sharecount = num of rdma_cm_ids sharing this listener, while
refcount = num of active internal uses of this ID (work items, timers,
etc.) Do you want refcount to also include the sharecount?

> after cm_find_listen is called
> you have to increment the refcount before dropping cm.lock.
> 
> Decrement the refcount when destroying a shared listen.
You mean to decrement event if listen_sharecount > 0, and the id isn't
destroyed, right? The code already calls cm_deref_id when
listen_sharecount = 0 of course.

> I also don't see how the 'goto new_id' can work, if cm_find_listen
> succeeds then __ib_cm_listen is guarenteed to fail.
> 
> Fix the locking to make that impossible, associate sharecount with the
> cm.lock and, rework how cm_destroy_id grabs the cm_id_priv->lock spinlock:
> 
> 	case IB_CM_LISTEN:
> 		spin_lock_irq(&cm.lock);
> 		if (cm_id_priv->sharecount != 0) {
> 		     cm_id_prv->sharecount--;
> 		     // paired with in in ib_cm_id_create_and_listen
> 		     atomic_dec(&cm_id_priv->refcount);
> 		     spin_unlock_irq(&cm.lock);
> 		     return;
> 		}
> 		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
> 		spin_unlock_irq(&cm.lock);
> 	
> 		spin_lock_irq(&cm_id_priv->lock);
> 		cm_id->state = IB_CM_IDLE;
> 		spin_unlock_irq(&cm_id_priv->lock);
> 		break;
> 
> Now that condition is eliminated, the unneeded atomic is gone, and
> refcount still acts like a proper kref should.
Thanks, that looks like a better solution.

Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
@ 2015-05-21  8:08             ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-21  8:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 20/05/2015 01:35, Jason Gunthorpe wrote:
> On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote:
>> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
>>> @@ -212,6 +212,8 @@ struct cm_id_private {
>>>  	spinlock_t lock;	/* Do not acquire inside cm.lock */
>>>  	struct completion comp;
>>>  	atomic_t refcount;
>>> +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
>>> +	atomic_t sharecount;
>>
>> No need for this atomic, hold the lock
>>
>> The use of the atomic looks racy:
>>
>>> +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
>>> +		/* The id is still shared. */
>>> +		return;
>>> +	}
>>
>> Might race with this:
>>
>>> +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
>>> +			/* This ID is already being destroyed */
>>> +			atomic_dec(&cm_id_priv->sharecount);
>>> +			goto new_id;
>>> +		}
>>> +
>>
>> Resulting in use-after-free of cm_id_priv->sharecount
> 
> Actually, there is something else odd here.. I mentioned the above
> because there wasn't obvious ref'ing on the cm_id_priv. Looking closer
> the cm.lock should prevent use-after-free, but there is still no ref.
> 
> The more I look at this, the more I think it is sketchy. Don't try and
> merge sharecount and refcount together, 
I'm not sure what you mean here. The way I was thinking about it was
that sharecount = num of rdma_cm_ids sharing this listener, while
refcount = num of active internal uses of this ID (work items, timers,
etc.) Do you want refcount to also include the sharecount?

> after cm_find_listen is called
> you have to increment the refcount before dropping cm.lock.
> 
> Decrement the refcount when destroying a shared listen.
You mean to decrement event if listen_sharecount > 0, and the id isn't
destroyed, right? The code already calls cm_deref_id when
listen_sharecount = 0 of course.

> I also don't see how the 'goto new_id' can work, if cm_find_listen
> succeeds then __ib_cm_listen is guarenteed to fail.
> 
> Fix the locking to make that impossible, associate sharecount with the
> cm.lock and, rework how cm_destroy_id grabs the cm_id_priv->lock spinlock:
> 
> 	case IB_CM_LISTEN:
> 		spin_lock_irq(&cm.lock);
> 		if (cm_id_priv->sharecount != 0) {
> 		     cm_id_prv->sharecount--;
> 		     // paired with in in ib_cm_id_create_and_listen
> 		     atomic_dec(&cm_id_priv->refcount);
> 		     spin_unlock_irq(&cm.lock);
> 		     return;
> 		}
> 		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
> 		spin_unlock_irq(&cm.lock);
> 	
> 		spin_lock_irq(&cm_id_priv->lock);
> 		cm_id->state = IB_CM_IDLE;
> 		spin_unlock_irq(&cm_id_priv->lock);
> 		break;
> 
> Now that condition is eliminated, the unneeded atomic is gone, and
> refcount still acts like a proper kref should.
Thanks, that looks like a better solution.

Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]                   ` <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-21 10:31                     ` Or Gerlitz
  0 siblings, 0 replies; 81+ messages in thread
From: Or Gerlitz @ 2015-05-21 10:31 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Thu, May 21, 2015 at 9:33 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

>> you can use non default pkeys as well here.
>> $ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001

> Right. I think when we started development of the namespaces patches
> these child interfaces (rtnetlink with pkey) didn't work for us, but I
> checked now on an updated kernel and they do.

The last change to drivers/infiniband/ulp/ipoib/ipoib_netlink.c took
place on 3.12, well before you started... could be that you provided
wrong command line and such, I haven't seen any bug report nor fix
from you group to that area of the kernel.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]           ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-21  5:48             ` Or Gerlitz
@ 2015-05-21 17:43             ` Jason Gunthorpe
       [not found]               ` <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-21 17:43 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth, Hefty, Sean

On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote:

> To create a new child interface on the default P_Key, its possible to
> use iproute:
> # ip link add link ib0 name ib0.1 type ipoib

Uh..

A key invariant of the IP stack is that is it possible to uniquely
identify the ingress device.

So the above scheme is fine for IPoIB, because it uses the interface
unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN)
is the unique ID tuple. The world is happy.

But RDMA CM doesn't provide the QPN. So when RDMA CM searches the
netdevs for an address it cannot *uniquely* map to a IPoIB interface.

This is bad, and *completely wrong*, but today, nobody is going to
really notice or care. The cases where it does something you don't
want are not very significant.

But with containers.. Think this through for a minute: 'In some cases
the RDMA CM selecs the wrong child' - that goes from being a minor
annoyance to a violation of containment! Worse the criteria for
'selects the wrong child' can be triggered by the contained users. Eg
the contained user adds a IP to their child that duplicates another
container. Now we've lost control.

The very idea of ib_get_net_dev_by_port_pkey_ip is broken.

So, I don't know what to say here.. Ideas?

1) Forbid creating more than one pkey per ipoib interface?
2) Somehow extend the RDMA CM to send the IPoIB qpn too?
3) ??

Right now the only case that comes to mind is duplicating IPs, that is
already going to cause an ARP collision, so maybe having the RDMA CM
randomly select an IP is not the end of the world... But with
containers and security, who knows? I'm not confident I've
exhaustively thought of all possibilities here.

----------

Anyhow.. looking again through this series and the existing code, the
flow is wrong, and really needs to be changed before this starts to
make sense to anyone, and is no doubt part of how we got here..

When a REQ arrives RDMA CM needs to run down these steps (this is identical
to what ip_input.c does)

 1) Locate the netdev associated with the ingress of the packet,
    in a sane world this is done by only checking the
    unique (Device,Port,Pkey,QPN) tuple.
    If we keep our brokeness, we'd do this based on
    (Device,Port,Pkey,IP) - if there are IP collisions then randomly
    select a netdev (similar to how ARP collision is handled).
 2) Then we do the ip_route_input_noref step, this will set skb_dst to
    the netdev that will handle the packet, or tell us to drop it.
    This is not always the same as the netdev that accepts the
    packet!!!

    NOTE: This route step is missing today, it does critical things
    like check that the node is actually listening on the dest IP!

 3) Now we can use skb_dst to iterate over the set of all RDMA CM listens:
     1) Bound to the skb_dst netdev
     2) Unbound in the same namespace as skb_dst netdev
    The first to match the dst IP + port is the listen that will accept the
    connection, now we go into the cma_new_conn_id path, and we don't
    need rdma_translate_ip because we already have the handling netdev.

The backwards operation of the current code is part of why this is all
so strange looking, and I think is strongly connected to the private
data compare issues Sean is talking about. It is very much the wrong
flow to look for the RDMA CM listen first, and then try to work
backwards to the netdev.

The above 3 steps hard require that the ib_cm and rdma_cm
maintain different listening lists, because we need the 2nd search in
#3. So this gets the ib_cm completely out of looking at the private
data. [And now we can think carefully about the best way to refcount
the listens in RDMA CM]

Once the above is cleaned up dropping in namespaces should just
happen naturally.

So.. I'm going to suggest you make a cleanup series to fix this:
 - Introduce the ib_get_net_dev_by_port_pkey_ip and document the
   breakage it represents
 - Rework rdma_cm to do steps 1,2,3 above, using
   ib_get_net_dev_by_port_pkey_ip to drive #1, your patch set is
   already doing about 50% of this change.
 - Fix the collateral damage

As I said to Matan, clean up series like this should not introduce
major functional changes, so stack the few remaining net namespace
things in another series after it.

** The same comments apply to RoCE too, but for RoCE step #1 works
   properly based on the (Device,Port,VLAN) unique tuple
   Ditto for iWarp **

I think this also moves to address Sean's concern about generality, at
least for listen. All three protocols will run down the same common
code to locate the netdev and find the correct RDMA CM listen. The
only difference is the 'ib_get_net_dev_by_port_pkey_ip' call varies.

.. I also wouldn't mind seeing the giant cma.c split, if it makes
   sense to have a cma_listen.c, for instance, wouldn't that be nice?

Sorry Haggi, this is a big change to your patchset :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
  2015-05-21  8:08             ` Haggai Eran
  (?)
@ 2015-05-21 17:54             ` Jason Gunthorpe
  -1 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-21 17:54 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Thu, May 21, 2015 at 11:08:31AM +0300, Haggai Eran wrote:
> > The more I look at this, the more I think it is sketchy. Don't try and
> > merge sharecount and refcount together,

> I'm not sure what you mean here. The way I was thinking about it was
> that sharecount = num of rdma_cm_ids sharing this listener, while
> refcount = num of active internal uses of this ID (work items, timers,
> etc.) Do you want refcount to also include the sharecount?

If you hold on to the pointer, then you increment the refcount. The
refcount is the 'number of holders of the pointer'. Basic invariant.

When the pointer left the lock region for the lockup, the ref must be
incremented.

What you had was functionally correct because the sharecount was
implying a refcount, but it doesn't follow the standard kernel
refcounting pattern.

> > after cm_find_listen is called
> > you have to increment the refcount before dropping cm.lock.
> > 
> > Decrement the refcount when destroying a shared listen.
> You mean to decrement event if listen_sharecount > 0, and the id isn't
> destroyed, right? The code already calls cm_deref_id when
> listen_sharecount = 0 of course.

Yes, because cm_destroy_id is the ref'd pair to cm_listen, after it
returns the caller must give up their pointer.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
                   ` (8 preceding siblings ...)
  2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud
@ 2015-05-26 13:34 ` Doug Ledford
       [not found]   ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  9 siblings, 1 reply; 81+ messages in thread
From: Doug Ledford @ 2015-05-26 13:34 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

[-- Attachment #1: Type: text/plain, Size: 2574 bytes --]

On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
> Thanks again everyone for the review comments. I've updated the patch set
> accordingly. The main changes are in the first patch to use a read-write
> semaphore instead of an SRCU, and with the reference counting of shared
> ib_cm_ids.
> Please let me know if I missed anything, or if there are other issues with
> the series.

Hi Haggai,

I know you are probably busy reworking this right now on the basis of
Jason's comments.  However, my biggest issue with this patch set right
now is not technical (well, it is, but it's only partially technical).

This is a core feature more than anything else.  Namespaces for RDMA
devices is not unique to IB or RoCE in any way.  Yet no thought has been
given to how this will work universally across all of the RDMA capable
devices (mainly I'm talking about iWARP here...I don't think this is an
issue for usNIC as if you want namespace support there, you just start
the user space app in a given namespace and you are probably 90% of the
way there since the user space application gets its own device and so
its own MAC/IP and all of the RDMA transfers are UDP, so the
application's namespace should get inherited by all the rest, but Cisco
would need to confirm that, hence why I say 90% of the way there, it
needs confirmed).

So, while you are reworking things right now, you would ideally contact
Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this.
I know there won't be a lot of overlap between IB and iWARP, but last
time you were asked you didn't even know if this setup could be extended
to iWARP.

For this next statement, I know I'm directing this to you Haggai, but
please don't take it that way.  I'm really using your patch set to make
a broader point to everyone on the list.

When I look at patches for support for a given feature, one of the
things I'm going to look at is whether or not that feature is specific
to a given hardware type, or if it's a generic feature.  If it's a
generic feature, then I'm going to want to know that the person
submitting it has designed it well.  A pre-requisite of designing a
generic feature well is that it considers all hardware types, not just
your specific hardware type.  So when you come back with the next
version of this patch set, please have an answer for how it should work
on each hardware type even if you don't have implementation patches for
each hardware type.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]   ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-26 16:59     ` Jason Gunthorpe
  2015-05-26 17:46       ` Doug Ledford
  2015-05-28 13:15         ` Haggai Eran
  2015-05-26 17:55     ` Christian Benvenuti (benve)
  2015-05-28 13:07       ` Haggai Eran
  2 siblings, 2 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-26 16:59 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Tue, May 26, 2015 at 09:34:40AM -0400, Doug Ledford wrote:

> This is a core feature more than anything else.  Namespaces for RDMA
> devices is not unique to IB or RoCE in any way.  Yet no thought has been
> given to how this will work universally across all of the RDMA
> capable

I think if Haggi is able to follow the perscription I gave then things
will be general.

 - All rdma cm ids are associated with a netdev
 - The output flow uses that netdev to restrict, configure and
   determine the output RDMA device QP
 - The input flow locates the netdev as step one and then uses the
   (netdev,ip,port) tuple to find the rdma listener, which is in turn
   tied to a netdev and is restricted/configured by it.

The technology specific part is the two maps: from (input
device,packet) to netdev, from netdev to (output device,packet)

After the above clean up is done, namespace enabling is basically
providing those two mapping functions for each technology in a way
that can locate delegatable netdevs.

The trivial case for all the ethernet techs is to provide the above
maps that can take the (input device,VLAN) and locate the correct
child VLAN specific netdev. The existing code to support VLAN should
pretty much immediately enable basic namespace support for all the
ethernet families.

The big open question for ethernet is how to work without relying on
VLAN to create delgated netdevs - typically one would use a bridge and
veth's, which do not seem very RDMA compatible. But that doesn't need
to be answered right now.

Remember, this isn't RDMA namespaces, this is netdev namespace support
for RDMA-CM -> very different things.

Basically, I'm happy with the generality story, if the clean up work I
outlined turns out..

> issue for usNIC as if you want namespace support there, you just start
> the user space app in a given namespace and you are probably 90% of
> the

usNIC has no kernel facing functionality, and no interaction with
RDMA-CM, so it is irrelevant to any discussion about RDMA-CM :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-26 16:59     ` Jason Gunthorpe
@ 2015-05-26 17:46       ` Doug Ledford
       [not found]         ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-05-28 13:15         ` Haggai Eran
  1 sibling, 1 reply; 81+ messages in thread
From: Doug Ledford @ 2015-05-26 17:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

[-- Attachment #1: Type: text/plain, Size: 2852 bytes --]

On Tue, 2015-05-26 at 10:59 -0600, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 09:34:40AM -0400, Doug Ledford wrote:
> 
> > This is a core feature more than anything else.  Namespaces for RDMA
> > devices is not unique to IB or RoCE in any way.  Yet no thought has been
> > given to how this will work universally across all of the RDMA
> > capable
> 
> I think if Haggi is able to follow the perscription I gave then things
> will be general.
> 
>  - All rdma cm ids are associated with a netdev
>  - The output flow uses that netdev to restrict, configure and
>    determine the output RDMA device QP
>  - The input flow locates the netdev as step one and then uses the
>    (netdev,ip,port) tuple to find the rdma listener, which is in turn
>    tied to a netdev and is restricted/configured by it.
> 
> The technology specific part is the two maps: from (input
> device,packet) to netdev, from netdev to (output device,packet)
> 
> After the above clean up is done, namespace enabling is basically
> providing those two mapping functions for each technology in a way
> that can locate delegatable netdevs.
> 
> The trivial case for all the ethernet techs is to provide the above
> maps that can take the (input device,VLAN) and locate the correct
> child VLAN specific netdev. The existing code to support VLAN should
> pretty much immediately enable basic namespace support for all the
> ethernet families.
> 
> The big open question for ethernet is how to work without relying on
> VLAN to create delgated netdevs - typically one would use a bridge and
> veth's, which do not seem very RDMA compatible. But that doesn't need
> to be answered right now.
> 
> Remember, this isn't RDMA namespaces, this is netdev namespace support
> for RDMA-CM -> very different things.

That was the point of my email.  This is a very myopic view of the
feature.  It *should* at least have an idea of these other things too.

> Basically, I'm happy with the generality story, if the clean up work I
> outlined turns out..
> 
> > issue for usNIC as if you want namespace support there, you just start
> > the user space app in a given namespace and you are probably 90% of
> > the
> 
> usNIC has no kernel facing functionality, and no interaction with
> RDMA-CM, so it is irrelevant to any discussion about RDMA-CM :(

Whether usNIC has a kernel facing functionality or not is irrelevant.
This feature isn't kernel only, it effects user space applications
launched in a namespace too.  And, again, my point was that this
discussion is about RDMA-CM and it should be broader (even if the
implementation isn't broader).  Due to the implementation of usNIC I
suspect it would "just work", but it would be better to know so.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]   ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-05-26 16:59     ` Jason Gunthorpe
@ 2015-05-26 17:55     ` Christian Benvenuti (benve)
  2015-05-28 13:07       ` Haggai Eran
  2 siblings, 0 replies; 81+ messages in thread
From: Christian Benvenuti (benve) @ 2015-05-26 17:55 UTC (permalink / raw)
  To: Doug Ledford, Haggai Eran
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1063 bytes --]

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> Behalf Of Doug Ledford
> Sent: Tuesday, May 26, 2015 6:35 AM
> To: Haggai Eran
> Cc: linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Liran Liss; Guy
> Shapiro; Shachar Raindel; Yotam Kenneth
> Subject: Re: [PATCH v4 for-next 00/12] Add network namespace support in the
> RDMA-CM

...

> I don't think this is an issue for usNIC as if you
> want namespace support there, you just start the user space app in a given
> namespace and you are probably 90% of the way there since the user space
> application gets its own device and so its own MAC/IP and all of the RDMA
> transfers are UDP, so the application's namespace should get inherited by all
> the rest, but Cisco would need to confirm that, hence why I say 90% of the way
> there, it needs confirmed).

This is correct. 

Thanks
/Chris

N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±­ÙšŠ{ayº\x1dʇڙë,j\a­¢f£¢·hš‹»öì\x17/oSc¾™Ú³9˜uÀ¦æå‰È&jw¨®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þ–Šàþf£¢·hšˆ§~ˆmš

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]         ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-26 18:47           ` Jason Gunthorpe
  2015-05-28 13:22             ` Haggai Eran
  1 sibling, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-26 18:47 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Tue, May 26, 2015 at 01:46:36PM -0400, Doug Ledford wrote:

> > Remember, this isn't RDMA namespaces, this is netdev namespace support
> > for RDMA-CM -> very different things.
> 
> That was the point of my email.  This is a very myopic view of the
> feature.  It *should* at least have an idea of these other things too.

Everything you talked about seems covered: iwarp/roce/ib now have a
fairly clear uniform story for CM. usNIC doesn't use core code.

I doubt a larger discussion about a 'rdma namespace' is going to
substantially change these patches, they are really netdev focused.
Anyhow, I've been saving that discussion for when the roce and
umad/uverbs namespace stuff is re-sent. It seems more appropriate at
that point.

I don't know about you, but I am exhausted looking at these huge patch
sets, and narrowing the focus is the only way I see to get through.
This series has hopefully narrowed to: 'fix the flow in netdev
handling for rdma-cm'.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
       [not found]               ` <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-28 11:51                   ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 11:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth, Hefty, Sean

On 21/05/2015 20:43, Jason Gunthorpe wrote:
> On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote:
> 
>> To create a new child interface on the default P_Key, its possible to
>> use iproute:
>> # ip link add link ib0 name ib0.1 type ipoib
> 
> Uh..
> 
> A key invariant of the IP stack is that is it possible to uniquely
> identify the ingress device.
> 
> So the above scheme is fine for IPoIB, because it uses the interface
> unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN)
> is the unique ID tuple. The world is happy.
> 
> But RDMA CM doesn't provide the QPN. So when RDMA CM searches the
> netdevs for an address it cannot *uniquely* map to a IPoIB interface.
This is technically true, but if someone configures their system that
way, they will also have ARP conflicts in addition. I don't see why we
should support such a configuration.

> This is bad, and *completely wrong*, but today, nobody is going to
> really notice or care. The cases where it does something you don't
> want are not very significant.
> 
> But with containers.. Think this through for a minute: 'In some cases
> the RDMA CM selecs the wrong child' - that goes from being a minor
> annoyance to a violation of containment! Worse the criteria for
> 'selects the wrong child' can be triggered by the contained users. Eg
> the contained user adds a IP to their child that duplicates another
> container. Now we've lost control.
No, this is exactly what would happen in the Ethernet world. If you
create a conflicting configuration between two containers on the same
Ethernet segment, then one of them could get the traffic that was
intended for the other.
I don't see this as a violation of containment, because these containers
are assigned net_devs that communicate on the same segment, so they
behave just as two different hosts would, with or without conflicts.

> 
> The very idea of ib_get_net_dev_by_port_pkey_ip is broken.
> 
> So, I don't know what to say here.. Ideas?
> 
> 1) Forbid creating more than one pkey per ipoib interface?
You probably mean more than one IP on the same pkey. The pkey is
actually part of the request, so its not an issue.

> 2) Somehow extend the RDMA CM to send the IPoIB qpn too?
> 3) ??
We can do something crazy in the future like moving all CM requests to
run over UDP as in RoCEv2. But both adding the QPN or moving to UDP
require a wire protocol change and won't be compatible with today's systems.

> 
> Right now the only case that comes to mind is duplicating IPs, that is
> already going to cause an ARP collision, so maybe having the RDMA CM
> randomly select an IP is not the end of the world... But with
> containers and security, who knows? I'm not confident I've
> exhaustively thought of all possibilities here.
> 
> ----------
> 
> Anyhow.. looking again through this series and the existing code, the
> flow is wrong, and really needs to be changed before this starts to
> make sense to anyone, and is no doubt part of how we got here..
> 
> When a REQ arrives RDMA CM needs to run down these steps (this is identical
> to what ip_input.c does)
> 
>  1) Locate the netdev associated with the ingress of the packet,
>     in a sane world this is done by only checking the
>     unique (Device,Port,Pkey,QPN) tuple.
>     If we keep our brokeness, we'd do this based on
>     (Device,Port,Pkey,IP) - if there are IP collisions then randomly
>     select a netdev (similar to how ARP collision is handled).
That's what ib_get_net_dev_by_port_pkey_ip intends to do.

>  2) Then we do the ip_route_input_noref step, this will set skb_dst to
>     the netdev that will handle the packet, or tell us to drop it.
>     This is not always the same as the netdev that accepts the
>     packet!!!
> 
>     NOTE: This route step is missing today, it does critical things
>     like check that the node is actually listening on the dest IP!

Isn't this a little over-engineered? If all you want is to make sure the
net dev is up, can't we use something like netif_running()?

Also, this sounds like a major change in behavior even for applications
that do not use containers. I think today RDMA CM will accept
connections even if the ipoib interface is down.

> 
>  3) Now we can use skb_dst to iterate over the set of all RDMA CM listens:
>      1) Bound to the skb_dst netdev
>      2) Unbound in the same namespace as skb_dst netdev
>     The first to match the dst IP + port is the listen that will accept the
>     connection, now we go into the cma_new_conn_id path, and we don't
>     need rdma_translate_ip because we already have the handling netdev.
You shouldn't be able to bind one listener to a netdev in a namespace
and also have a different listener listening for any netdev on that same
namespace. (That is what cma_check_port verifies, right?) So when
looking for a listener in a namespace there should be only one match.

It is true we no longer need the rdma_translate_ip call.

> 
> The backwards operation of the current code is part of why this is all
> so strange looking, and I think is strongly connected to the private
> data compare issues Sean is talking about. It is very much the wrong
> flow to look for the RDMA CM listen first, and then try to work
> backwards to the netdev.
That's not what the code does. It first finds the netdev and decides on
the namespace based on that. It then finds the RDMA CM listener in that
namespace.

> 
> The above 3 steps hard require that the ib_cm and rdma_cm
> maintain different listening lists, because we need the 2nd search in
> #3. So this gets the ib_cm completely out of looking at the private
> data. [And now we can think carefully about the best way to refcount
> the listens in RDMA CM]

The current patch-set already makes sure that rdma_cm doesn't rely on
ib_cm's private_data matching.

> 
> Once the above is cleaned up dropping in namespaces should just
> happen naturally.
> 
> So.. I'm going to suggest you make a cleanup series to fix this:
>  - Introduce the ib_get_net_dev_by_port_pkey_ip and document the
>    breakage it represents
>  - Rework rdma_cm to do steps 1,2,3 above, using
>    ib_get_net_dev_by_port_pkey_ip to drive #1, your patch set is
>    already doing about 50% of this change.
>  - Fix the collateral damage
> 
> As I said to Matan, clean up series like this should not introduce
> major functional changes, so stack the few remaining net namespace
> things in another series after it.

I think I can refactor the series this way, but I don't think we need
step 2. It seems like an overly complicated way of checking whether a
netdev is up or not. It doesn't seem to provide any new information over
what ib_get_net_dev_by_port_pkey_ip does.

As for fixing the collateral damage, I assume you mean cleaning the
interfaces of ib_cm now that private data matching is not used? I don't
see why this should be part of the series.

> 
> ** The same comments apply to RoCE too, but for RoCE step #1 works
>    properly based on the (Device,Port,VLAN) unique tuple

For RoCE, you could still have multiple macvlan interfaces using the
same VLAN, with different IP address. So the unique tuple is
(Device,Port,VLAN,IP). With Matan adding a netdev to each GID entry in
the RoCE GID table, I think it would be simpler to find the netdev in RoCE.

>    Ditto for iWarp **
Yes, if iWarp CM can provide the netdev on which a request arrives, I
think it would be simple to use it with this design.

> 
> I think this also moves to address Sean's concern about generality, at
> least for listen. All three protocols will run down the same common
> code to locate the netdev and find the correct RDMA CM listen. The
> only difference is the 'ib_get_net_dev_by_port_pkey_ip' call varies.

Right.

> 
> .. I also wouldn't mind seeing the giant cma.c split, if it makes
>    sense to have a cma_listen.c, for instance, wouldn't that be nice?

It would be nice, but this patchset is becoming large already, and I
don't want to add unnecessary noise.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
@ 2015-05-28 11:51                   ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 11:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth, Hefty, Sean

On 21/05/2015 20:43, Jason Gunthorpe wrote:
> On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote:
> 
>> To create a new child interface on the default P_Key, its possible to
>> use iproute:
>> # ip link add link ib0 name ib0.1 type ipoib
> 
> Uh..
> 
> A key invariant of the IP stack is that is it possible to uniquely
> identify the ingress device.
> 
> So the above scheme is fine for IPoIB, because it uses the interface
> unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN)
> is the unique ID tuple. The world is happy.
> 
> But RDMA CM doesn't provide the QPN. So when RDMA CM searches the
> netdevs for an address it cannot *uniquely* map to a IPoIB interface.
This is technically true, but if someone configures their system that
way, they will also have ARP conflicts in addition. I don't see why we
should support such a configuration.

> This is bad, and *completely wrong*, but today, nobody is going to
> really notice or care. The cases where it does something you don't
> want are not very significant.
> 
> But with containers.. Think this through for a minute: 'In some cases
> the RDMA CM selecs the wrong child' - that goes from being a minor
> annoyance to a violation of containment! Worse the criteria for
> 'selects the wrong child' can be triggered by the contained users. Eg
> the contained user adds a IP to their child that duplicates another
> container. Now we've lost control.
No, this is exactly what would happen in the Ethernet world. If you
create a conflicting configuration between two containers on the same
Ethernet segment, then one of them could get the traffic that was
intended for the other.
I don't see this as a violation of containment, because these containers
are assigned net_devs that communicate on the same segment, so they
behave just as two different hosts would, with or without conflicts.

> 
> The very idea of ib_get_net_dev_by_port_pkey_ip is broken.
> 
> So, I don't know what to say here.. Ideas?
> 
> 1) Forbid creating more than one pkey per ipoib interface?
You probably mean more than one IP on the same pkey. The pkey is
actually part of the request, so its not an issue.

> 2) Somehow extend the RDMA CM to send the IPoIB qpn too?
> 3) ??
We can do something crazy in the future like moving all CM requests to
run over UDP as in RoCEv2. But both adding the QPN or moving to UDP
require a wire protocol change and won't be compatible with today's systems.

> 
> Right now the only case that comes to mind is duplicating IPs, that is
> already going to cause an ARP collision, so maybe having the RDMA CM
> randomly select an IP is not the end of the world... But with
> containers and security, who knows? I'm not confident I've
> exhaustively thought of all possibilities here.
> 
> ----------
> 
> Anyhow.. looking again through this series and the existing code, the
> flow is wrong, and really needs to be changed before this starts to
> make sense to anyone, and is no doubt part of how we got here..
> 
> When a REQ arrives RDMA CM needs to run down these steps (this is identical
> to what ip_input.c does)
> 
>  1) Locate the netdev associated with the ingress of the packet,
>     in a sane world this is done by only checking the
>     unique (Device,Port,Pkey,QPN) tuple.
>     If we keep our brokeness, we'd do this based on
>     (Device,Port,Pkey,IP) - if there are IP collisions then randomly
>     select a netdev (similar to how ARP collision is handled).
That's what ib_get_net_dev_by_port_pkey_ip intends to do.

>  2) Then we do the ip_route_input_noref step, this will set skb_dst to
>     the netdev that will handle the packet, or tell us to drop it.
>     This is not always the same as the netdev that accepts the
>     packet!!!
> 
>     NOTE: This route step is missing today, it does critical things
>     like check that the node is actually listening on the dest IP!

Isn't this a little over-engineered? If all you want is to make sure the
net dev is up, can't we use something like netif_running()?

Also, this sounds like a major change in behavior even for applications
that do not use containers. I think today RDMA CM will accept
connections even if the ipoib interface is down.

> 
>  3) Now we can use skb_dst to iterate over the set of all RDMA CM listens:
>      1) Bound to the skb_dst netdev
>      2) Unbound in the same namespace as skb_dst netdev
>     The first to match the dst IP + port is the listen that will accept the
>     connection, now we go into the cma_new_conn_id path, and we don't
>     need rdma_translate_ip because we already have the handling netdev.
You shouldn't be able to bind one listener to a netdev in a namespace
and also have a different listener listening for any netdev on that same
namespace. (That is what cma_check_port verifies, right?) So when
looking for a listener in a namespace there should be only one match.

It is true we no longer need the rdma_translate_ip call.

> 
> The backwards operation of the current code is part of why this is all
> so strange looking, and I think is strongly connected to the private
> data compare issues Sean is talking about. It is very much the wrong
> flow to look for the RDMA CM listen first, and then try to work
> backwards to the netdev.
That's not what the code does. It first finds the netdev and decides on
the namespace based on that. It then finds the RDMA CM listener in that
namespace.

> 
> The above 3 steps hard require that the ib_cm and rdma_cm
> maintain different listening lists, because we need the 2nd search in
> #3. So this gets the ib_cm completely out of looking at the private
> data. [And now we can think carefully about the best way to refcount
> the listens in RDMA CM]

The current patch-set already makes sure that rdma_cm doesn't rely on
ib_cm's private_data matching.

> 
> Once the above is cleaned up dropping in namespaces should just
> happen naturally.
> 
> So.. I'm going to suggest you make a cleanup series to fix this:
>  - Introduce the ib_get_net_dev_by_port_pkey_ip and document the
>    breakage it represents
>  - Rework rdma_cm to do steps 1,2,3 above, using
>    ib_get_net_dev_by_port_pkey_ip to drive #1, your patch set is
>    already doing about 50% of this change.
>  - Fix the collateral damage
> 
> As I said to Matan, clean up series like this should not introduce
> major functional changes, so stack the few remaining net namespace
> things in another series after it.

I think I can refactor the series this way, but I don't think we need
step 2. It seems like an overly complicated way of checking whether a
netdev is up or not. It doesn't seem to provide any new information over
what ib_get_net_dev_by_port_pkey_ip does.

As for fixing the collateral damage, I assume you mean cleaning the
interfaces of ib_cm now that private data matching is not used? I don't
see why this should be part of the series.

> 
> ** The same comments apply to RoCE too, but for RoCE step #1 works
>    properly based on the (Device,Port,VLAN) unique tuple

For RoCE, you could still have multiple macvlan interfaces using the
same VLAN, with different IP address. So the unique tuple is
(Device,Port,VLAN,IP). With Matan adding a netdev to each GID entry in
the RoCE GID table, I think it would be simpler to find the netdev in RoCE.

>    Ditto for iWarp **
Yes, if iWarp CM can provide the netdev on which a request arrives, I
think it would be simple to use it with this design.

> 
> I think this also moves to address Sean's concern about generality, at
> least for listen. All three protocols will run down the same common
> code to locate the netdev and find the correct RDMA CM listen. The
> only difference is the 'ib_get_net_dev_by_port_pkey_ip' call varies.

Right.

> 
> .. I also wouldn't mind seeing the giant cma.c split, if it makes
>    sense to have a cma_listen.c, for instance, wouldn't that be nice?

It would be nice, but this patchset is becoming large already, and I
don't want to add unnecessary noise.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]   ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-28 13:07       ` Haggai Eran
  2015-05-26 17:55     ` Christian Benvenuti (benve)
  2015-05-28 13:07       ` Haggai Eran
  2 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 13:07 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On 26/05/2015 16:34, Doug Ledford wrote:
> On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
>> Thanks again everyone for the review comments. I've updated the patch set
>> accordingly. The main changes are in the first patch to use a read-write
>> semaphore instead of an SRCU, and with the reference counting of shared
>> ib_cm_ids.
>> Please let me know if I missed anything, or if there are other issues with
>> the series.
> 
> Hi Haggai,
> 
> I know you are probably busy reworking this right now on the basis of
> Jason's comments.  However, my biggest issue with this patch set right
> now is not technical (well, it is, but it's only partially technical).
Hi,

I'm sorry about the late reply. We had a holiday here, and then some
other tasks took precedence. I've only got back to working on this today.

> 
> This is a core feature more than anything else.  Namespaces for RDMA
> devices is not unique to IB or RoCE in any way.  Yet no thought has been
> given to how this will work universally across all of the RDMA capable
> devices (mainly I'm talking about iWARP here...
I don't agree. It is true we have are not planning to provide an iWarp
implementation for network namespaces, as we lack the capacity and the
expertise. However, I think that the changes we proposed to the rdma_cm
module will work with iWarp too. Perhaps with some of Jason's
suggestions it will be smoother, but even in the current design, I think
that if iWarp drivers can provide iw_cm with the network device on which
a request is received, then it should be simple to modify it for
namespace support without significant change to rdma_cm.

> I don't think this is an
> issue for usNIC as if you want namespace support there, you just start
> the user space app in a given namespace and you are probably 90% of the
> way there since the user space application gets its own device and so
> its own MAC/IP and all of the RDMA transfers are UDP, so the
> application's namespace should get inherited by all the rest, but Cisco
> would need to confirm that, hence why I say 90% of the way there, it
> needs confirmed).
> 
> So, while you are reworking things right now, you would ideally contact
> Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this.
> I know there won't be a lot of overlap between IB and iWARP, but last
> time you were asked you didn't even know if this setup could be extended
> to iWARP.
> 
> For this next statement, I know I'm directing this to you Haggai, but
> please don't take it that way.  I'm really using your patch set to make
> a broader point to everyone on the list.
> 
> When I look at patches for support for a given feature, one of the
> things I'm going to look at is whether or not that feature is specific
> to a given hardware type, or if it's a generic feature.  If it's a
> generic feature, then I'm going to want to know that the person
> submitting it has designed it well.  A pre-requisite of designing a
> generic feature well is that it considers all hardware types, not just
> your specific hardware type.  So when you come back with the next
> version of this patch set, please have an answer for how it should work
> on each hardware type even if you don't have implementation patches for
> each hardware type.

Well, because the RDMA subsystem supports a very diverse set of devices,
I think there are few people who know the details of all hardware types
well. If we are going to evolve the generic parts of the stack, we have
to cooperate. We have to rely on the knowledge of people on the mailing
list to say whether the feature is well designed for all hardware types,
or whether changes are warranted. In this specific case, the patches has
been on the list since February. I think it is enough time to allow
anyone who is interested in network namespace support to chime in.

Regards,
Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-28 13:07       ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 13:07 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On 26/05/2015 16:34, Doug Ledford wrote:
> On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
>> Thanks again everyone for the review comments. I've updated the patch set
>> accordingly. The main changes are in the first patch to use a read-write
>> semaphore instead of an SRCU, and with the reference counting of shared
>> ib_cm_ids.
>> Please let me know if I missed anything, or if there are other issues with
>> the series.
> 
> Hi Haggai,
> 
> I know you are probably busy reworking this right now on the basis of
> Jason's comments.  However, my biggest issue with this patch set right
> now is not technical (well, it is, but it's only partially technical).
Hi,

I'm sorry about the late reply. We had a holiday here, and then some
other tasks took precedence. I've only got back to working on this today.

> 
> This is a core feature more than anything else.  Namespaces for RDMA
> devices is not unique to IB or RoCE in any way.  Yet no thought has been
> given to how this will work universally across all of the RDMA capable
> devices (mainly I'm talking about iWARP here...
I don't agree. It is true we have are not planning to provide an iWarp
implementation for network namespaces, as we lack the capacity and the
expertise. However, I think that the changes we proposed to the rdma_cm
module will work with iWarp too. Perhaps with some of Jason's
suggestions it will be smoother, but even in the current design, I think
that if iWarp drivers can provide iw_cm with the network device on which
a request is received, then it should be simple to modify it for
namespace support without significant change to rdma_cm.

> I don't think this is an
> issue for usNIC as if you want namespace support there, you just start
> the user space app in a given namespace and you are probably 90% of the
> way there since the user space application gets its own device and so
> its own MAC/IP and all of the RDMA transfers are UDP, so the
> application's namespace should get inherited by all the rest, but Cisco
> would need to confirm that, hence why I say 90% of the way there, it
> needs confirmed).
> 
> So, while you are reworking things right now, you would ideally contact
> Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this.
> I know there won't be a lot of overlap between IB and iWARP, but last
> time you were asked you didn't even know if this setup could be extended
> to iWARP.
> 
> For this next statement, I know I'm directing this to you Haggai, but
> please don't take it that way.  I'm really using your patch set to make
> a broader point to everyone on the list.
> 
> When I look at patches for support for a given feature, one of the
> things I'm going to look at is whether or not that feature is specific
> to a given hardware type, or if it's a generic feature.  If it's a
> generic feature, then I'm going to want to know that the person
> submitting it has designed it well.  A pre-requisite of designing a
> generic feature well is that it considers all hardware types, not just
> your specific hardware type.  So when you come back with the next
> version of this patch set, please have an answer for how it should work
> on each hardware type even if you don't have implementation patches for
> each hardware type.

Well, because the RDMA subsystem supports a very diverse set of devices,
I think there are few people who know the details of all hardware types
well. If we are going to evolve the generic parts of the stack, we have
to cooperate. We have to rely on the knowledge of people on the mailing
list to say whether the feature is well designed for all hardware types,
or whether changes are warranted. In this specific case, the patches has
been on the list since February. I think it is enough time to allow
anyone who is interested in network namespace support to chime in.

Regards,
Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-26 16:59     ` Jason Gunthorpe
@ 2015-05-28 13:15         ` Haggai Eran
  2015-05-28 13:15         ` Haggai Eran
  1 sibling, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 13:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On 26/05/2015 19:59, Jason Gunthorpe wrote:
> The big open question for ethernet is how to work without relying on
> VLAN to create delgated netdevs - typically one would use a bridge and
> veth's, which do not seem very RDMA compatible. But that doesn't need
> to be answered right now.

I think in Ethernet the first step would be to support macvlan devices.
Like IPoIB child devices, they are directly attached to an RDMA device,
so they don't require handling a complex virtual bridging topology as
veths do.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-28 13:15         ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 13:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma, netdev, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On 26/05/2015 19:59, Jason Gunthorpe wrote:
> The big open question for ethernet is how to work without relying on
> VLAN to create delgated netdevs - typically one would use a bridge and
> veth's, which do not seem very RDMA compatible. But that doesn't need
> to be answered right now.

I think in Ethernet the first step would be to support macvlan devices.
Like IPoIB child devices, they are directly attached to an RDMA device,
so they don't require handling a complex virtual bridging topology as
veths do.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]         ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-28 13:22             ` Haggai Eran
  2015-05-28 13:22             ` Haggai Eran
  1 sibling, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 13:22 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On 26/05/2015 20:46, Doug Ledford wrote:
>> Remember, this isn't RDMA namespaces, this is netdev namespace support
>> > for RDMA-CM -> very different things.
> That was the point of my email.  This is a very myopic view of the
> feature.  It *should* at least have an idea of these other things too.

We did give some thought to the question of whether an RDMA namespace is
needed, and concluded that it isn't. RDMA resources such as QP numbers,
memory keys, etc. are allocated by the devices. So different containers
wouldn't care if they share the "QP number namespace", etc. RDMA CM
ports are different because they are chosen by the applications, but
they map directly to the network namespace, so they don't require their
own namespace.

Regards,
Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-28 13:22             ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-05-28 13:22 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On 26/05/2015 20:46, Doug Ledford wrote:
>> Remember, this isn't RDMA namespaces, this is netdev namespace support
>> > for RDMA-CM -> very different things.
> That was the point of my email.  This is a very myopic view of the
> feature.  It *should* at least have an idea of these other things too.

We did give some thought to the question of whether an RDMA namespace is
needed, and concluded that it isn't. RDMA resources such as QP numbers,
memory keys, etc. are allocated by the devices. So different containers
wouldn't care if they share the "QP number namespace", etc. RDMA CM
ports are different because they are chosen by the applications, but
they map directly to the network namespace, so they don't require their
own namespace.

Regards,
Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]       ` <55671309.6080303-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-28 14:07         ` Doug Ledford
       [not found]           ` <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Doug Ledford @ 2015-05-28 14:07 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

[-- Attachment #1: Type: text/plain, Size: 2581 bytes --]

On Thu, 2015-05-28 at 16:07 +0300, Haggai Eran wrote:
> On 26/05/2015 16:34, Doug Ledford wrote:
> > On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
> > This is a core feature more than anything else.  Namespaces for RDMA
> > devices is not unique to IB or RoCE in any way.  Yet no thought has been
> > given to how this will work universally across all of the RDMA capable
> > devices (mainly I'm talking about iWARP here...
> I don't agree. It is true we have are not planning to provide an iWarp
> implementation for network namespaces, as we lack the capacity and the
> expertise. However, I think that the changes we proposed to the rdma_cm
> module will work with iWarp too. Perhaps with some of Jason's
> suggestions it will be smoother, but even in the current design, I think
> that if iWarp drivers can provide iw_cm with the network device on which
> a request is received, then it should be simple to modify it for
> namespace support without significant change to rdma_cm.

My request wasn't for a functional implementation, just a statement that
you had in fact thought about it and, as you say here, would expect it
to work (and preferably why as well).

> Well, because the RDMA subsystem supports a very diverse set of devices,
> I think there are few people who know the details of all hardware types
> well. If we are going to evolve the generic parts of the stack, we have
> to cooperate. We have to rely on the knowledge of people on the mailing
> list to say whether the feature is well designed for all hardware types,
> or whether changes are warranted. In this specific case, the patches has
> been on the list since February. I think it is enough time to allow
> anyone who is interested in network namespace support to chime in.

You would think that, but sometimes important information comes from
totally different places.  See mine and Jason's comments back and forth
in the SRIOV thread started by Or.

Long story short:

ip link add dev ib0 name ib0.1 type ipoib

is totally broken on at least all Red Hat OSes.  It will require
reworking of the network scripts and NetworkManager assumptions to make
it work.  It will also break DHCP on the interface as pkey/guid are the
only items that uniquely identify DHCP clients.  The net result of our
talks was that it is likely that each interface on the same pkey will
require an alias GUID per child interface in order to keep things
workable.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
  2015-05-28 11:51                   ` Haggai Eran
  (?)
@ 2015-05-28 15:45                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-28 15:45 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth, Hefty, Sean

On Thu, May 28, 2015 at 02:51:51PM +0300, Haggai Eran wrote:

> > But RDMA CM doesn't provide the QPN. So when RDMA CM searches the
> > netdevs for an address it cannot *uniquely* map to a IPoIB interface.

> This is technically true, but if someone configures their system that
> way, they will also have ARP conflicts in addition. I don't see why we
> should support such a configuration.

Based on the dicussion in the other thread about this, the feeling is
we should not support same-GUID, same-PKey child interfaces at all. It
breaks too much stuff (DHCP,NetworkManager,IPv6,RDMA-CM)

> No, this is exactly what would happen in the Ethernet world. If you
> create a conflicting configuration between two containers on the same
> Ethernet segment, then one of them could get the traffic that was
> intended for the other.

It is not exactly the same. In Ethernet there is an ARP collision at
L3, but the traffic is properly addressed at L2 and unambiguously
directed into only one container. There are ways to deal with ARP
collisions, but those are only effective if the full LLADDR is
consistently used for routing to containers.

With RDMA-CM it is an L3 collsion, so our anti-ARP collision stuff
doesn't help.

Like I said, I don't from a security perspective what to make of this,
but it isn't exactly the same of ethernet.

> >  1) Locate the netdev associated with the ingress of the packet,
> >     in a sane world this is done by only checking the
> >     unique (Device,Port,Pkey,QPN) tuple.
> >     If we keep our brokeness, we'd do this based on
> >     (Device,Port,Pkey,IP) - if there are IP collisions then randomly
> >     select a netdev (similar to how ARP collision is handled).
> That's what ib_get_net_dev_by_port_pkey_ip intends to do.

Right, almost there.

ib_get_net_dev_by_port_pkey_ip needs to work in a very specific
way: If there is only one netdev with the (Device,Port,GUID,Pkey)
match then that is the answer.
(guid comes from the CM_REQ, if we add alias GUID support to IPoIB as
 Or suggested then it is needed)

The IP search should *only* be done if there are two children with
identical (Pkey,GUID), and as above, perhaps we should de-support that.

> >  2) Then we do the ip_route_input_noref step, this will set skb_dst to
> >     the netdev that will handle the packet, or tell us to drop it.
> >     This is not always the same as the netdev that accepts the
> >     packet!!!
> > 
> >     NOTE: This route step is missing today, it does critical things
> >     like check that the node is actually listening on the dest IP!
> 
> Isn't this a little over-engineered? If all you want is to make sure the
> net dev is up, can't we use something like netif_running()?

The routing check is not to see if the netdev is up, it is doing all
sorts of subtle userspace visible things. Like checking there is no
blackhole route configured for the packet, checking that the IP is
present in the system, netdevs are up, etc etc.

We don't get to pick and choose what netdev behaviors we implement
when doing this kind of stuff. Copy the netstack, don't make stuff up.

Understand the two layer separation, first with pick a netdev without
looking at L3 info, then we feed the netdev and L3 info into routing
to complete the process.

> Also, this sounds like a major change in behavior even for applications
> that do not use containers. I think today RDMA CM will accept
> connections even if the ipoib interface is down.

Yes, it is a change in behavior, things move closer to alignment with
how netdev works. We did a similar change to the output side years ago
as well. I guess the input side was missed.

Unless I'm missing something this isn't 'major', these are corner case
conditions/bugs that nobody sane should rely on.

> >  3) Now we can use skb_dst to iterate over the set of all RDMA CM listens:
> >      1) Bound to the skb_dst netdev
> >      2) Unbound in the same namespace as skb_dst netdev
> >     The first to match the dst IP + port is the listen that will accept the
> >     connection, now we go into the cma_new_conn_id path, and we don't
> >     need rdma_translate_ip because we already have the handling netdev.
> You shouldn't be able to bind one listener to a netdev in a namespace
> and also have a different listener listening for any netdev on that same
> namespace. (That is what cma_check_port verifies, right?) So when
> looking for a listener in a namespace there should be only one match.

You know, I don't remember off hand the exact semantics of sockets,
whatever sockets does :)

> > The backwards operation of the current code is part of why this is all
> > so strange looking, and I think is strongly connected to the private
> > data compare issues Sean is talking about. It is very much the wrong
> > flow to look for the RDMA CM listen first, and then try to work
> > backwards to the netdev.

> That's not what the code does. It first finds the netdev and decides on
> the namespace based on that. It then finds the RDMA CM listener in that
> namespace.

I was talking about the current code, but even with the patch set, the
behavior is still backwards:

 1) Find the netdev
 2) Get the namespace, throw away the netdev
 3) Find a CM_ID without a netdev
 4) Find a netdev from the CM_ID

Instead:
 1) Find the netdev
 2) Find a CM_ID compatible with that netdev

> I think I can refactor the series this way, but I don't think we need
> step 2. It seems like an overly complicated way of checking whether a
> netdev is up or not. It doesn't seem to provide any new information over
> what ib_get_net_dev_by_port_pkey_ip does.

If it is hard to do then leave a FIXME comment and go on, if it is
easy, it is the right step.

It does provide more information because
ib_get_net_dev_by_port_pkey_ip should not check the IP.

> For RoCE, you could still have multiple macvlan interfaces using the
> same VLAN, with different IP address. So the unique tuple is
> (Device,Port,VLAN,IP). With Matan adding a netdev to each GID entry in
> the RoCE GID table, I think it would be simpler to find the netdev in RoCE.

That isn't how I understand macvlan, the tuple will be
(Device,Port,MAC,VLAN)

You should never search for an IP, that is not how netdev works.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-05-28 13:22             ` Haggai Eran
  (?)
@ 2015-05-28 15:46             ` Jason Gunthorpe
       [not found]               ` <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  -1 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-28 15:46 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, linux-rdma, netdev, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote:
> wouldn't care if they share the "QP number namespace", etc. RDMA CM
> ports are different because they are chosen by the applications, but
> they map directly to the network namespace, so they don't require their
> own namespace.

Different containers should have restricted access to the PKey and GID
tables, and the presence device itself. Just like in the SRIOV
case.

That is what the 'RDMA Namespace' would control.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]           ` <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-28 16:21               ` Or Gerlitz
  0 siblings, 0 replies; 81+ messages in thread
From: Or Gerlitz @ 2015-05-28 16:21 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 5/28/2015 5:07 PM, Doug Ledford wrote:
> You would think that, but sometimes important information comes from
> totally different places.  See mine and Jason's comments back and forth
> in the SRIOV thread started by Or.
>
> Long story short:
>
> ip link add dev ib0 name ib0.1 type ipoib
>
> is totally broken on at least all Red Hat OSes.  It will require
> reworking of the network scripts and NetworkManager assumptions to make
> it work.  It will also break DHCP on the interface as pkey/guid are the
> only items that uniquely identify DHCP clients.  The net result of our
> talks was that it is likely that each interface on the same pkey will
> require an alias GUID per child interface in order to keep things workable.
>

Doug,

Just to make sure we're on the same page, you're saying that the IPoIB 
DHCP scheme (client + server) used on RH product uses Client-ID which is 
eight byte long or 20 byte long the four upper bytes masked out (which 
of them?) and hence is broken when multiple entities use the same ID.

Anything else except for that (you said "reworking of the network 
scripts and NetworkManager assumptions to make it work")??

OTOH we realized that the implementation for same PKEY IPoIB childs 
which exist for a while is broken with the RH DHCP scheme and should be 
enhanced.   OTOH these childs can serve as nice building blocks for 
IPoIB containers or virtio-IPoIB scheme.

Note that out of the eleven patches that make the series, only ONE 
relates directly to IPoIB, the rest are either applicable to all the 
transport supported by the RDMA stack, or to IPoIB + RoCE.

Under some assumptions and changes people can test it with DHCP scheme 
different from RH or with non-DHCP based IP address assignment scheme.

So we have a very nice effort and work done by developers, to bring RDMA 
into containers, accompanied by reviewers providing lots of their brain 
power to make it robust.

I don't see why we should stop the whole RDMA containers support train 
just b/c we found out the IPoIB DHCP bug which was there for few years 
before this effort started.

How about let this series to go after the rest of the reviewers comments 
are addressed, s.t under IPoIB it will work on small set of 
environments, while with macvlan based RoCE support to be introduced 
later it will work on wider set of environments.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-05-28 16:21               ` Or Gerlitz
  0 siblings, 0 replies; 81+ messages in thread
From: Or Gerlitz @ 2015-05-28 16:21 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 5/28/2015 5:07 PM, Doug Ledford wrote:
> You would think that, but sometimes important information comes from
> totally different places.  See mine and Jason's comments back and forth
> in the SRIOV thread started by Or.
>
> Long story short:
>
> ip link add dev ib0 name ib0.1 type ipoib
>
> is totally broken on at least all Red Hat OSes.  It will require
> reworking of the network scripts and NetworkManager assumptions to make
> it work.  It will also break DHCP on the interface as pkey/guid are the
> only items that uniquely identify DHCP clients.  The net result of our
> talks was that it is likely that each interface on the same pkey will
> require an alias GUID per child interface in order to keep things workable.
>

Doug,

Just to make sure we're on the same page, you're saying that the IPoIB 
DHCP scheme (client + server) used on RH product uses Client-ID which is 
eight byte long or 20 byte long the four upper bytes masked out (which 
of them?) and hence is broken when multiple entities use the same ID.

Anything else except for that (you said "reworking of the network 
scripts and NetworkManager assumptions to make it work")??

OTOH we realized that the implementation for same PKEY IPoIB childs 
which exist for a while is broken with the RH DHCP scheme and should be 
enhanced.   OTOH these childs can serve as nice building blocks for 
IPoIB containers or virtio-IPoIB scheme.

Note that out of the eleven patches that make the series, only ONE 
relates directly to IPoIB, the rest are either applicable to all the 
transport supported by the RDMA stack, or to IPoIB + RoCE.

Under some assumptions and changes people can test it with DHCP scheme 
different from RH or with non-DHCP based IP address assignment scheme.

So we have a very nice effort and work done by developers, to bring RDMA 
into containers, accompanied by reviewers providing lots of their brain 
power to make it robust.

I don't see why we should stop the whole RDMA containers support train 
just b/c we found out the IPoIB DHCP bug which was there for few years 
before this effort started.

How about let this series to go after the rest of the reviewers comments 
are addressed, s.t under IPoIB it will work on small set of 
environments, while with macvlan based RoCE support to be introduced 
later it will work on wider set of environments.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]               ` <55674077.5040707-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-28 17:43                 ` Jason Gunthorpe
       [not found]                   ` <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-05-28 17:43 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On Thu, May 28, 2015 at 07:21:11PM +0300, Or Gerlitz wrote:

> Anything else except for that (you said "reworking of the network scripts
> and NetworkManager assumptions to make it work")??

IPv6 becomes very broken, child interfaces will generate the same IPv6
addreses for radv and link local resulting in duplicate address
scenarios.

About the only thing that will work properly is statically assigned
IPv4 addresses.

> I don't see why we should stop the whole RDMA containers support train just
> b/c we found out the IPoIB DHCP bug which was there for few years before
> this effort started.

I don't think that is what Doug said.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                   ` <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-28 18:22                     ` Doug Ledford
       [not found]                       ` <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Doug Ledford @ 2015-05-28 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

[-- Attachment #1: Type: text/plain, Size: 1357 bytes --]

On Thu, 2015-05-28 at 11:43 -0600, Jason Gunthorpe wrote:
> On Thu, May 28, 2015 at 07:21:11PM +0300, Or Gerlitz wrote:
> 
> > Anything else except for that (you said "reworking of the network scripts
> > and NetworkManager assumptions to make it work")??
> 
> IPv6 becomes very broken, child interfaces will generate the same IPv6
> addreses for radv and link local resulting in duplicate address
> scenarios.
> 
> About the only thing that will work properly is statically assigned
> IPv4 addresses.
> 
> > I don't see why we should stop the whole RDMA containers support train just
> > b/c we found out the IPoIB DHCP bug which was there for few years before
> > this effort started.
> 
> I don't think that is what Doug said.

Indeed.  There is no need to scrap things, but if the design as it
stands, and the intended means of creating objects for use in
containers, is going to result in an unworkable network, then we have to
re-evaluate how the container constructs are created, and that then has
possible consequences for how we would get from an incoming packet to
the proper container.

I'm not trying to stop the "support train" here, but at the same time,
if the train is headed for a bridge that's out....

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                       ` <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-28 19:05                         ` Or Gerlitz
       [not found]                           ` <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Or Gerlitz @ 2015-05-28 19:05 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Or Gerlitz, Haggai Eran,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Thu, May 28, 2015 at 9:22 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

>> I don't think that is what Doug said.

> Indeed.  There is no need to scrap things, but if the design as it
> stands, and the intended means of creating objects for use in
> containers, is going to result in an unworkable network, then we have to
> re-evaluate how the container constructs are created, and that then has
> possible consequences for how we would get from an incoming packet to
> the proper container.

To be precise, do we agree that the issue here isn't "in the design as
it stands" but rather in a problem we found in the intended way of
assigning IP addresses through DHCP for the containers?

> I'm not trying to stop the "support train" here, but at the same time,
> if the train is headed for a bridge that's out....

So what's your concrete saying here? where should we go from here?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                           ` <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-28 21:55                             ` Doug Ledford
       [not found]                               ` <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Doug Ledford @ 2015-05-28 21:55 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jason Gunthorpe, Or Gerlitz, Haggai Eran,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]

On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote:
> On Thu, May 28, 2015 at 9:22 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> >> I don't think that is what Doug said.
> 
> > Indeed.  There is no need to scrap things, but if the design as it
> > stands, and the intended means of creating objects for use in
> > containers, is going to result in an unworkable network, then we have to
> > re-evaluate how the container constructs are created, and that then has
> > possible consequences for how we would get from an incoming packet to
> > the proper container.
> 
> To be precise, do we agree that the issue here isn't "in the design as
> it stands" but rather in a problem we found in the intended way of
> assigning IP addresses through DHCP for the containers?

No, I would say the problem *is* in the design.  But the problem is the
selected means of identifying the netdev to get to the namespace (and
the proposed means of creating non-default namespace devices to exist in
the container), not the namespace design itself.

> > I'm not trying to stop the "support train" here, but at the same time,
> > if the train is headed for a bridge that's out....
> 
> So what's your concrete saying here? where should we go from here?

This excerpt is from the commit log of patch 3/12:

The IB device and port, together with the P_Key and the IP address should
be enough to uniquely identify the ULP net device.

The problem here is that this is wrong.  If we allow more than one
device per pkey with the same GUID, then DHCP breaks, which is bad in
and of itself, but it also breaks ipv6 link local addressing.  Which
means that this hunk in patch 4/12:

+#if IS_ENABLED(CONFIG_IPV6)
+       case AF_INET6:
+               if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
+                       return true;
+
+               break;
+#endif

can now be tricked into returning true for incorrect devices.

Where do we go from here?

First, I'm inclined to say we should modify the add_child portion of
IPoIB to refuse to add links to a PKey if that GUID is already present
on that PKey.  You could then use different PKeys on the default GUID
for separate namespaces.  If you need separate namespaces on the same
PKey, then enable alias GUIDs for use on the local adapter and require
one GUID per namespace on the same PKey.

Then I'm inclined to say that we should map for namespaces using device,
port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
any given pkey maps to a unique dhcp identifier and a unique ipv6
lladdr, this becomes freely interchangeable with device, port, pkey,
address mappings that this patchset was built around.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                               ` <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-03 10:03                                 ` Haggai Eran
  2015-06-03 16:14                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-06-03 10:03 UTC (permalink / raw)
  To: Doug Ledford, Or Gerlitz
  Cc: Jason Gunthorpe, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On 29/05/2015 00:55, Doug Ledford wrote:
> On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote:
>> So what's your concrete saying here? where should we go from here?
> 
> This excerpt is from the commit log of patch 3/12:
> 
> The IB device and port, together with the P_Key and the IP address should
> be enough to uniquely identify the ULP net device.
> 
> The problem here is that this is wrong.  If we allow more than one
> device per pkey with the same GUID, then DHCP breaks, which is bad in
> and of itself, but it also breaks ipv6 link local addressing.  Which
> means that this hunk in patch 4/12:
> 
> +#if IS_ENABLED(CONFIG_IPV6)
> +       case AF_INET6:
> +               if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
> +                       return true;
> +
> +               break;
> +#endif
> 
> can now be tricked into returning true for incorrect devices.
> 
> Where do we go from here?
> 
> First, I'm inclined to say we should modify the add_child portion of
> IPoIB to refuse to add links to a PKey if that GUID is already present
> on that PKey.  You could then use different PKeys on the default GUID
> for separate namespaces.  If you need separate namespaces on the same
> PKey, then enable alias GUIDs for use on the local adapter and require
> one GUID per namespace on the same PKey.
I don't think blocking the current add_child implementation is needed. I
agree IPv6 SLAAC and DHCP currently don't work well, and adding alias
GUID for child interfaces is important, but the current implementation
can be used with static IPv4 addresses, so I don't think it must be
disabled.

> Then I'm inclined to say that we should map for namespaces using device,
> port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
> any given pkey maps to a unique dhcp identifier and a unique ipv6
> lladdr, this becomes freely interchangeable with device, port, pkey,
> address mappings that this patchset was built around.

What if we change the namespaces patches to map (device, port, GID,
P_Key, IP) to netdev / namespace? That is, to use both the GID and the
IP address. This would allow people to use namespaces with the current
implementation (provided they have a valid configuration with no
conflicting IP addresses), and once alias GUIDs are added, the GUIDs
will be used to uniquely resolve the namespace even with such
misconfigurations.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]               ` <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-03 10:07                   ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-06-03 10:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 28/05/2015 18:46, Jason Gunthorpe wrote:
> On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote:
>> wouldn't care if they share the "QP number namespace", etc. RDMA CM
>> ports are different because they are chosen by the applications, but
>> they map directly to the network namespace, so they don't require their
>> own namespace.
> 
> Different containers should have restricted access to the PKey and GID
> tables, and the presence device itself. Just like in the SRIOV
> case.
> 
> That is what the 'RDMA Namespace' would control.

We were thinking here that there is a room for an RDMA cgroup. It would
limit the amount of RDMA resources a container can use. It can also be
used for the restrictions you mentioned, but maybe they are more
suitable for a namespace. I'm not sure. In RoCE for instance, a
restricted access to the GID table can be derived from the network
namespace directly, but perhaps not in InfiniBand.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
@ 2015-06-03 10:07                   ` Haggai Eran
  0 siblings, 0 replies; 81+ messages in thread
From: Haggai Eran @ 2015-06-03 10:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Liran Liss, Guy Shapiro,
	Shachar Raindel, Yotam Kenneth

On 28/05/2015 18:46, Jason Gunthorpe wrote:
> On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote:
>> wouldn't care if they share the "QP number namespace", etc. RDMA CM
>> ports are different because they are chosen by the applications, but
>> they map directly to the network namespace, so they don't require their
>> own namespace.
> 
> Different containers should have restricted access to the PKey and GID
> tables, and the presence device itself. Just like in the SRIOV
> case.
> 
> That is what the 'RDMA Namespace' would control.

We were thinking here that there is a room for an RDMA cgroup. It would
limit the amount of RDMA resources a container can use. It can also be
used for the restrictions you mentioned, but maybe they are more
suitable for a namespace. I'm not sure. In RoCE for instance, a
restricted access to the GID table can be derived from the network
namespace directly, but perhaps not in InfiniBand.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-06-03 10:03                                 ` Haggai Eran
@ 2015-06-03 16:14                                   ` Jason Gunthorpe
       [not found]                                     ` <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-03 16:14 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Doug Ledford, Or Gerlitz, Or Gerlitz, linux-rdma,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Wed, Jun 03, 2015 at 01:03:01PM +0300, Haggai Eran wrote:
> > Then I'm inclined to say that we should map for namespaces using device,
> > port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
> > any given pkey maps to a unique dhcp identifier and a unique ipv6
> > lladdr, this becomes freely interchangeable with device, port, pkey,
> > address mappings that this patchset was built around.
> 
> What if we change the namespaces patches to map (device, port, GID,
> P_Key, IP) to netdev / namespace? That is, to use both the GID and the
> IP address.

As I keep saying, you are not supposed to use the IP address as a key
to find the netdev, that is the wrong way to use the Linux netdev
model.

Requiring unique GID/PKey allows the implementation to avoid this
wrongness, which would be simplifying and more correct.

That is the appeal to blocking this scenario when children are created.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                     ` <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-03 19:05                                       ` Or Gerlitz
  2015-06-03 19:53                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Or Gerlitz @ 2015-06-03 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Haggai Eran, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Wed, Jun 3, 2015 at 7:14 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, Jun 03, 2015 at 01:03:01PM +0300, Haggai Eran wrote:
>> > Then I'm inclined to say that we should map for namespaces using device,
>> > port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
>> > any given pkey maps to a unique dhcp identifier and a unique ipv6
>> > lladdr, this becomes freely interchangeable with device, port, pkey,
>> > address mappings that this patchset was built around.
>>
>> What if we change the namespaces patches to map (device, port, GID,
>> P_Key, IP) to netdev / namespace? That is, to use both the GID and the
>> IP address.
>
> As I keep saying, you are not supposed to use the IP address as a key
> to find the netdev, that is the wrong way to use the Linux netdev
> model.
>
> Requiring unique GID/PKey allows the implementation to avoid this
> wrongness, which would be simplifying and more correct.
>
> That is the appeal to blocking this scenario when children are created.

Jason,

The IPoIB RTNL childs were added around release 3.6/7 of the upstream
kernel and are part of the kernel UAPI. They are perfectly used in
bunch of schemes:

1.  when static IP address assignment is used

2. under PV scheme, when the guest has para-virtual Eth NIC and the
host does routing between the back-end (e.g tap or alike) and  the
IPoIB child. Or when the host does tunneling (vxlan) and alike and
sends down the encapsulated packet through a host IP address assigned
to the IPoIB child

3. etc few more

Indeed the DHCP story isn't working there and to get DHCP work
something has to be done. But this issue can't serve for blocking the
existing UAPI and introduce regression to working systems.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-06-03 19:05                                       ` Or Gerlitz
@ 2015-06-03 19:53                                         ` Jason Gunthorpe
       [not found]                                           ` <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-03 19:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Haggai Eran, Or Gerlitz, linux-rdma,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Wed, Jun 03, 2015 at 10:05:34PM +0300, Or Gerlitz wrote:

> Indeed the DHCP story isn't working there and to get DHCP work
> something has to be done. But this issue can't serve for blocking the
> existing UAPI and introduce regression to working systems.

It is not DHCP that concerns me, it is the fact we can't combine net
namespaces, RDMA-CM and duplicate GUID IPoIB children together without
adding hacks to the kernel. Searching netdevs by IP is a hack.

I'm mostly fine with it as an optional capability, similar to macvlan,
I just don't see how to cleanly integrate it with RDMA CM and
namespaces. And I don't see what RDMA CM is supposed to do when
it hits this case.

So, any ideas that don't involve the searching for IP hack??

[And yes, as discussed with Haggie, it is not the worst hack in the
 world, and maybe we can live with it, but lets understand the trade
 offs carefully]

Also, now that this has been brought up, I think you need to make a
patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
modify addrconf_ifid_infiniband to return error if the IPoIB child is
sharing a guid. It was not good at all to push the child patches
forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                           ` <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-03 20:07                                             ` Or Gerlitz
       [not found]                                               ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Or Gerlitz @ 2015-06-03 20:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Haggai Eran, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Wed, Jun 3, 2015 at 10:53 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, Jun 03, 2015 at 10:05:34PM +0300, Or Gerlitz wrote:
>
>> Indeed the DHCP story isn't working there and to get DHCP work
>> something has to be done. But this issue can't serve for blocking the
>> existing UAPI and introduce regression to working systems.
>
> It is not DHCP that concerns me, it is the fact we can't combine net
> namespaces, RDMA-CM and duplicate GUID IPoIB children together without
> adding hacks to the kernel. Searching netdevs by IP is a hack.
>
> I'm mostly fine with it as an optional capability, similar to macvlan,
> I just don't see how to cleanly integrate it with RDMA CM and
> namespaces. And I don't see what RDMA CM is supposed to do when
> it hits this case.
>
> So, any ideas that don't involve the searching for IP hack??
>
> [And yes, as discussed with Haggie, it is not the worst hack in the
>  world, and maybe we can live with it, but lets understand the trade
>  offs carefully]

As Haggai wrote, if we let the using IP address thing to fly up, we have
support for RDMA in containers using the RDMA-CM at IPoIB environments.
This will let people test, use, experiment, fix, interact (and even
production-it when static IP address assignment scheme is used).

Later, usage of alias GUIDs for IPoIB RTNL childs would allow to
remove the IP thing.

Later, the next stage/s in Matan's work on the RoCE GID table would
allow to support MACVLAN and hence RoCE too.

This is how the Linux kernel being evolved since the 2.5 failure to
come up with giant releases -- doing things in relativity small steps.


> Also, now that this has been brought up, I think you need to make a
> patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
> modify addrconf_ifid_infiniband to return error if the IPoIB child is
> sharing a guid. It was not good at all to push the child patches
> forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.

Till the alias GUID thing is introduced, maybe we can patch
addrconf_ifid_infiniband to use the QPN value from the device HW
address to come up with unique IPv6 link local address, agree? where
you think we can place the 24 bits QPN?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                               ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-03 21:45                                                 ` Jason Gunthorpe
  2015-06-04  9:41                                                   ` Haggai Eran
  2015-06-03 23:48                                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-03 21:45 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Haggai Eran, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:
> As Haggai wrote, if we let the using IP address thing to fly up, we have
> support for RDMA in containers using the RDMA-CM at IPoIB environments.
> This will let people test, use, experiment, fix, interact (and even
> production-it when static IP address assignment scheme is used).

Sure, I think we all understand the goal, and you've explained some
reasonable use cases for the child support.

> Later, usage of alias GUIDs for IPoIB RTNL childs would allow to
> remove the IP thing.

How do we remove it? Along with same-guid child support? What is your
idea here?

> > Also, now that this has been brought up, I think you need to make a
> > patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
> > modify addrconf_ifid_infiniband to return error if the IPoIB child is
> > sharing a guid. It was not good at all to push the child patches
> > forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.
> 
> Till the alias GUID thing is introduced, maybe we can patch
> addrconf_ifid_infiniband to use the QPN value from the device HW
> address to come up with unique IPv6 link local address, agree? where
> you think we can place the 24 bits QPN?

I don't know if that is a good idea, an unstable SLAAC is not in
spirit with the RFCs. The safest bet is to return error and disable
SLAAC completely.

But I'm just guessing here - I'm only feel strongly that something
should be done to address this issue in the existing kernel.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                               ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-06-03 21:45                                                 ` Jason Gunthorpe
@ 2015-06-03 23:48                                                 ` Jason Gunthorpe
       [not found]                                                   ` <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-03 23:48 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Haggai Eran, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:

> > I'm mostly fine with it as an optional capability, similar to macvlan,
> > I just don't see how to cleanly integrate it with RDMA CM and
> > namespaces. And I don't see what RDMA CM is supposed to do when
> > it hits this case.
> >
> > So, any ideas that don't involve the searching for IP hack??
> >
> > [And yes, as discussed with Haggie, it is not the worst hack in the
> >  world, and maybe we can live with it, but lets understand the trade
> >  offs carefully]
> 
> As Haggai wrote, if we let the using IP address thing to fly up, we have
> support for RDMA in containers using the RDMA-CM at IPoIB environments.
> This will let people test, use, experiment, fix, interact (and even
> production-it when static IP address assignment scheme is used).

I just noticed ipvlan got merged a few months ago.. That certainly
changed my view on this topic. It is basically a software
version of the same-guid ipoib children scheme. Similar issues: Same MAC
address as the parent, IPv6 SLAAC is disabled (?),  DHCP has similar
issue (solved with RFC4361, and broadcasting fallback, it seems)..

The l2/l3 distinction in ipvlan is also very interesting. The L3 mode
solves some of the security type issues. What do you think Haggi?

Is there any chance standard things like ipvlan and macvlan could be
used with rdma-cm if their master devices are IPoIB? Are we even on
the right path to do that someday? Is that the plan for roce?

Any thoughts on the idea we still need ipoib same-guid children if
ipvlan is available?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                                   ` <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-04  6:24                                                     ` Haggai Eran
       [not found]                                                       ` <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-06-04  6:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: Doug Ledford, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On 04/06/2015 02:48, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:
> 
>>> I'm mostly fine with it as an optional capability, similar to macvlan,
>>> I just don't see how to cleanly integrate it with RDMA CM and
>>> namespaces. And I don't see what RDMA CM is supposed to do when
>>> it hits this case.
>>>
>>> So, any ideas that don't involve the searching for IP hack??
>>>
>>> [And yes, as discussed with Haggie, it is not the worst hack in the
>>>  world, and maybe we can live with it, but lets understand the trade
>>>  offs carefully]
>>
>> As Haggai wrote, if we let the using IP address thing to fly up, we have
>> support for RDMA in containers using the RDMA-CM at IPoIB environments.
>> This will let people test, use, experiment, fix, interact (and even
>> production-it when static IP address assignment scheme is used).
> 
> I just noticed ipvlan got merged a few months ago.. That certainly
> changed my view on this topic. It is basically a software
> version of the same-guid ipoib children scheme. Similar issues: Same MAC
> address as the parent, IPv6 SLAAC is disabled (?),  DHCP has similar
> issue (solved with RFC4361, and broadcasting fallback, it seems)..
> 
> The l2/l3 distinction in ipvlan is also very interesting. The L3 mode
> solves some of the security type issues. What do you think Haggi?
I think some issues ipvlan is trying to solve would also affect us using
the alias GUIDs solution. ipvlan tries to solve among other the problem
of a limited MAC filter table in NICs, and avoid using promiscuous mode.
But the GID table is also limited, and we don't have something like
promiscuous mode for GIDs in InfiniBand. For large scale use of
containers we would need to also allow the current model.

As for L3 mode, it does seem more restrictive, as all routing decisions
are done in the controlling namespace. Our current ipoib child interface
implementation is more like the L2 version of ipvlan.

> 
> Is there any chance standard things like ipvlan and macvlan could be
> used with rdma-cm if their master devices are IPoIB? 
These standard interfaces seem very much connected with Ethernet (both
have an ARPHDR_ETHER-only check for their upper devices). I think
macvlan's functionality would be covered by adding alias GUIDs to ipoib,
and ipvlan L2 is covered by the current behavior. Perhaps it would be
beneficial to try and make ipvlan more generic so that it would work
over ipoib, giving us support for L3 mode.

As for rdma-cm support, the patch I had for ipoib attempts to scan each
child's upper devices in order to support such topologies. We only
tested it with bonding, but I think it would also work with such devices.

> Are we even on
> the right path to do that someday? Is that the plan for roce?
Yes, for RoCE our goal for the start was to support namespaces in RDMA
CM through macvlan devices. As long as we can update the RoCE gid table
correctly for macvlan and ipvlan devices, the RDMA CM implementation
shouldn't care where the details come from.

> Any thoughts on the idea we still need ipoib same-guid children if
> ipvlan is available?
If we port ipvlan to work over IPoIB interfaces and not just Ethernet,
then ipvlan L2 would provide exactly the same functionality. There onyl
difference I can think of is that ipvlan would use a single UD QP for
all devices (and in connected-mode, a single RC QP between a pair of
hosts), while ipoib would use a QP per child device, and multiple RC QPs
for such pairs.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-06-03 21:45                                                 ` Jason Gunthorpe
@ 2015-06-04  9:41                                                   ` Haggai Eran
  2015-06-04 16:06                                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-06-04  9:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: Doug Ledford, Or Gerlitz, linux-rdma, Linux Netdev List,
	Liran Liss, Guy Shapiro, Shachar Raindel, Yotam Kenneth

On 04/06/2015 00:45, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:
>> As Haggai wrote, if we let the using IP address thing to fly up, we have
>> support for RDMA in containers using the RDMA-CM at IPoIB environments.
>> This will let people test, use, experiment, fix, interact (and even
>> production-it when static IP address assignment scheme is used).
> 
> Sure, I think we all understand the goal, and you've explained some
> reasonable use cases for the child support.
> 
>> Later, usage of alias GUIDs for IPoIB RTNL childs would allow to
>> remove the IP thing.
> 
> How do we remove it? Along with same-guid child support? What is your
> idea here?
> 
>>> Also, now that this has been brought up, I think you need to make a
>>> patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
>>> modify addrconf_ifid_infiniband to return error if the IPoIB child is
>>> sharing a guid. It was not good at all to push the child patches
>>> forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.
>>
>> Till the alias GUID thing is introduced, maybe we can patch
>> addrconf_ifid_infiniband to use the QPN value from the device HW
>> address to come up with unique IPv6 link local address, agree? where
>> you think we can place the 24 bits QPN?
> 
> I don't know if that is a good idea, an unstable SLAAC is not in
> spirit with the RFCs. The safest bet is to return error and disable
> SLAAC completely.
Maybe this is a silly question, but doesn't DAD already disable SLAAC
addresses when there's a conflict?

Haggai

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-06-04  9:41                                                   ` Haggai Eran
@ 2015-06-04 16:06                                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-04 16:06 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Or Gerlitz, Doug Ledford, Or Gerlitz, linux-rdma,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Thu, Jun 04, 2015 at 12:41:33PM +0300, Haggai Eran wrote:
> On 04/06/2015 00:45, Jason Gunthorpe wrote:

> > I don't know if that is a good idea, an unstable SLAAC is not in
> > spirit with the RFCs. The safest bet is to return error and disable
> > SLAAC completely.

> Maybe this is a silly question, but doesn't DAD already disable SLAAC
> addresses when there's a conflict?

Yes, DAD should certainly trigger and disable the child, but the
kernel should not rely on DAD for correctness, it is a safety net, and
it isn't guarenteed 100% reliable.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                                       ` <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-04 16:40                                                         ` Jason Gunthorpe
       [not found]                                                           ` <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-04 16:40 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Or Gerlitz, Doug Ledford, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

On Thu, Jun 04, 2015 at 09:24:37AM +0300, Haggai Eran wrote:
> > The l2/l3 distinction in ipvlan is also very interesting. The L3 mode
> > solves some of the security type issues. What do you think Haggi?

> I think some issues ipvlan is trying to solve would also affect us using
> the alias GUIDs solution. ipvlan tries to solve among other the problem
> of a limited MAC filter table in NICs, and avoid using promiscuous mode.
> But the GID table is also limited, and we don't have something like
> promiscuous mode for GIDs in InfiniBand. For large scale use of
> containers we would need to also allow the current model.

Yes, that is certainly true.

> As for L3 mode, it does seem more restrictive, as all routing decisions
> are done in the controlling namespace. Our current ipoib child interface
> implementation is more like the L2 version of ipvlan.

The ipoib children are exactly like macvlan, because they all have
unique LLADDRs.

It doesn't start acting like ipvlan until we reach the rdma-cm patches,
and where we see the IP stack side act like macvlan and the rdma-cm
side try to act like ipvlan - that is why it is so ugly/hacky,

> > Is there any chance standard things like ipvlan and macvlan could be
> > used with rdma-cm if their master devices are IPoIB?

> These standard interfaces seem very much connected with Ethernet (both
> have an ARPHDR_ETHER-only check for their upper devices). I think
> macvlan's functionality would be covered by adding alias GUIDs to ipoib,
> and ipvlan L2 is covered by the current behavior. Perhaps it would be
> beneficial to try and make ipvlan more generic so that it would work
> over ipoib, giving us support for L3 mode.

Yes, macvlan seems very well covered already by IPoIB child
interfaces, and I don't see too many reasons to worry about changing
that.

ipvlan on the other hand, as you observe, is valuable for many reasons.

> As for rdma-cm support, the patch I had for ipoib attempts to scan each
> child's upper devices in order to support such topologies. We only
> tested it with bonding, but I think it would also work with such devices.

.. it is so sketchy :|

Firstly: I still think the prior discussion is right, and proceeding
along the reworking of the ingress side of rdma-cm and focusing on the
device,guid,pkey makes 100% sense and will progress things right
away. Every other variation seems to build on that.

But when we get into bonding and the various vlan things, we loose
encapsulation - snooping the children list to guess what the bonding
driver is doing seems very hacky.

Discussion idea: Can we actually use the netstack to process the
RDMA-CM packets? It looks like the netstack wants a skb to do this
mid-layer work, so rdma-cm would have to synthesize a skb for the CM
packets and pass it through netdev to apply all the transformations
and access the various internal states (eg from ipvlan, bonding,
etc). rdma-cm would have to 'catch' the skb once it is done traveling
and resume its normal processing. Very similar to your notion of using
UDP, but without any on-the-wire change.

This would fit in that same ingress spot I suggested adding the
routing lookup, instead of routing we want the full stack to have a go
at figuring out the final netdev.

This seems the most general because it will work for all the *vlan
type drivers, bonding, and all of the RDMA technologies. (each would
have a slightly different way to make the skb, but same basic idea)

Lots and lots of details to do that, but conceptually it seems pretty
solid?

> Yes, for RoCE our goal for the start was to support namespaces in RDMA
> CM through macvlan devices. As long as we can update the RoCE gid table
> correctly for macvlan and ipvlan devices, the RDMA CM implementation
> shouldn't care where the details come from.

Hurm, the gid index tagged on the QP1 packet should not be directly
used for much on ingress. rdma-cm will have to recover the mac address
and vlan to use that as a guide.

Synchronizing the gid table and all the internal state in macvlan,
ipvlan, bonding seems very hard, I do not envy your task :(

> > Any thoughts on the idea we still need ipoib same-guid children if
> > ipvlan is available?
> If we port ipvlan to work over IPoIB interfaces and not just Ethernet,
> then ipvlan L2 would provide exactly the same functionality. There onyl
> difference I can think of is that ipvlan would use a single UD QP for
> all devices (and in connected-mode, a single RC QP between a pair of
> hosts), while ipoib would use a QP per child device, and multiple RC QPs
> for such pairs.

Agree with this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
       [not found]                                                           ` <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-08  7:52                                                             ` Haggai Eran
  2015-06-08 16:53                                                               ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Haggai Eran @ 2015-06-08  7:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Doug Ledford, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Netdev List, Liran Liss,
	Guy Shapiro, Shachar Raindel, Yotam Kenneth

On 04/06/2015 19:40, Jason Gunthorpe wrote:
> Discussion idea: Can we actually use the netstack to process the
> RDMA-CM packets? It looks like the netstack wants a skb to do this
> mid-layer work, so rdma-cm would have to synthesize a skb for the CM
> packets and pass it through netdev to apply all the transformations
> and access the various internal states (eg from ipvlan, bonding,
> etc). rdma-cm would have to 'catch' the skb once it is done traveling
> and resume its normal processing. Very similar to your notion of using
> UDP, but without any on-the-wire change.
> 
> This would fit in that same ingress spot I suggested adding the
> routing lookup, instead of routing we want the full stack to have a go
> at figuring out the final netdev.
> 
> This seems the most general because it will work for all the *vlan
> type drivers, bonding, and all of the RDMA technologies. (each would
> have a slightly different way to make the skb, but same basic idea)
> 
> Lots and lots of details to do that, but conceptually it seems pretty
> solid?

The problem is that the network stack can do all sort of changes to the
packets (like NAT), and it may be the case that the hardware can't
reflect these changes later on when creating a QP. I think it would be
best to stick with resolving the net_dev using the request parameters,
and the simpler routing lookup. This way RDMA CM remains in control, and
if the user configures routing in an unexpected way, it can just block
the request.

Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
  2015-06-08  7:52                                                             ` Haggai Eran
@ 2015-06-08 16:53                                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2015-06-08 16:53 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Or Gerlitz, Doug Ledford, Or Gerlitz, linux-rdma,
	Linux Netdev List, Liran Liss, Guy Shapiro, Shachar Raindel,
	Yotam Kenneth

On Mon, Jun 08, 2015 at 10:52:34AM +0300, Haggai Eran wrote:
> On 04/06/2015 19:40, Jason Gunthorpe wrote:
> > Discussion idea: Can we actually use the netstack to process the
> > RDMA-CM packets? It looks like the netstack wants a skb to do this
> > mid-layer work, so rdma-cm would have to synthesize a skb for the CM
> > packets and pass it through netdev to apply all the transformations
> > and access the various internal states (eg from ipvlan, bonding,
> > etc). rdma-cm would have to 'catch' the skb once it is done traveling
> > and resume its normal processing. Very similar to your notion of using
> > UDP, but without any on-the-wire change.
> > 
> > This would fit in that same ingress spot I suggested adding the
> > routing lookup, instead of routing we want the full stack to have a go
> > at figuring out the final netdev.
> > 
> > This seems the most general because it will work for all the *vlan
> > type drivers, bonding, and all of the RDMA technologies. (each would
> > have a slightly different way to make the skb, but same basic idea)
> > 
> > Lots and lots of details to do that, but conceptually it seems pretty
> > solid?
> 
> The problem is that the network stack can do all sort of changes to the
> packets (like NAT), and it may be the case that the hardware can't
> reflect these changes later on when creating a QP.

Yes, I am aware of that, but there are also alot of things netdev can
do that we can realize, like netfilter rules to block packets, for
instance

Ignoring NAT is a bad choice as well, the best would be to drop on
NAT. It would be easy to detect if the netstack mangled the REQ skb
packet, for instance.

We can't track netdev after the QP is created, but totally ignoring
one thing and while re-implementing others seems like a bad idea, long
term...

> I think it would be best to stick with resolving the net_dev using
> the request parameters, and the simpler routing lookup. This way
> RDMA CM remains in control, and if the user configures routing in an
> unexpected way, it can just block the request.

As I said, I think that is fine for the immediate IB support, but when
you start talking about roce and emulating macvlan and ipvlan.. Then
it starts to look really bad. At least think it through carefully
before posting those series.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2015-06-08 16:53 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-17  5:50 [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Haggai Eran
2015-05-17  5:50 ` [PATCH v4 for-next 02/12] IB/addr: Pass network namespace as a parameter Haggai Eran
2015-05-17  5:50 ` [PATCH v4 for-next 03/12] IB/core: Find the network namespace matching connection parameters Haggai Eran
     [not found]   ` <1431841868-28063-4-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 18:26     ` Jason Gunthorpe
     [not found]       ` <20150519182616.GF18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-20 14:48         ` Haggai Eran
2015-05-20 14:48           ` Haggai Eran
2015-05-17  5:51 ` [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices " Haggai Eran
     [not found]   ` <1431841868-28063-5-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 18:28     ` Jason Gunthorpe
     [not found]       ` <20150519182810.GG18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-20 15:17         ` Haggai Eran
2015-05-20 15:17           ` Haggai Eran
2015-05-19 23:55     ` Jason Gunthorpe
     [not found]       ` <20150519235502.GB26634-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-21  5:33         ` Haggai Eran
2015-05-21  5:33           ` Haggai Eran
     [not found]           ` <555D6E41.10606-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21  5:48             ` Or Gerlitz
     [not found]               ` <CAJ3xEMjN+o=vC4abAeG5EuOo3Y1gSyh1qPDseA_aaYmoLWAunw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-21  6:33                 ` Haggai Eran
     [not found]                   ` <555D7C4A.2060708-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 10:31                     ` Or Gerlitz
2015-05-21 17:43             ` Jason Gunthorpe
     [not found]               ` <20150521174336.GA6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-28 11:51                 ` Haggai Eran
2015-05-28 11:51                   ` Haggai Eran
2015-05-28 15:45                   ` Jason Gunthorpe
2015-05-21  5:48         ` Haggai Eran
2015-05-21  5:48           ` Haggai Eran
2015-05-17  5:51 ` [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs Haggai Eran
2015-05-19 18:35   ` Jason Gunthorpe
     [not found]     ` <20150519183545.GH18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-19 22:35       ` Jason Gunthorpe
     [not found]         ` <20150519223502.GA26324-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-21  8:08           ` Haggai Eran
2015-05-21  8:08             ` Haggai Eran
2015-05-21 17:54             ` Jason Gunthorpe
2015-05-21  7:07       ` Haggai Eran
2015-05-21  7:07         ` Haggai Eran
     [not found] ` <1431841868-28063-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-17  5:50   ` [PATCH v4 for-next 01/12] IB/core: Add rwsem to allow reading device list or client list Haggai Eran
2015-05-17  5:51   ` [PATCH v4 for-next 06/12] IB/cm: Expose service ID in request events Haggai Eran
2015-05-17  5:51   ` [PATCH v4 for-next 07/12] IB/cma: Refactor RDMA IP CM private-data parsing code Haggai Eran
2015-05-17  5:51   ` [PATCH v4 for-next 08/12] IB/cma: Add compare_data checks to the RDMA CM module Haggai Eran
2015-05-17  5:51   ` [PATCH v4 for-next 12/12] IB/ucma: Take the network namespace from the process Haggai Eran
2015-05-17  5:51 ` [PATCH v4 for-next 09/12] IB/cma: Separate port allocation to network namespaces Haggai Eran
2015-05-17  5:51 ` [PATCH v4 for-next 10/12] IB/cma: Share CM IDs between namespaces Haggai Eran
2015-05-17  5:51 ` [PATCH v4 for-next 11/12] IB/cma: Add support for network namespaces Haggai Eran
2015-05-19 14:30 ` [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM Yann Droneaud
2015-05-19 14:54   ` Haggai Eran
2015-05-19 14:54     ` Haggai Eran
     [not found]     ` <555B4EBE.7010900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 16:39       ` Parav Pandit
2015-05-19 18:01         ` Haggai Eran
     [not found]           ` <1432058488417.98688-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-19 18:42             ` Parav Pandit
2015-05-19 18:38         ` Jason Gunthorpe
     [not found]           ` <20150519183843.GI18675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-19 18:44             ` Parav Pandit
     [not found]               ` <CAGgvQNTXAWkQWzBBrQfk39GaCQ2ck63AhgURpYFFBPTbkpx4kg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-19 19:20                 ` Jason Gunthorpe
2015-05-26 13:34 ` Doug Ledford
     [not found]   ` <1432647280.28905.107.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 16:59     ` Jason Gunthorpe
2015-05-26 17:46       ` Doug Ledford
     [not found]         ` <1432662396.28905.157.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 18:47           ` Jason Gunthorpe
2015-05-28 13:22           ` Haggai Eran
2015-05-28 13:22             ` Haggai Eran
2015-05-28 15:46             ` Jason Gunthorpe
     [not found]               ` <20150528154633.GB2962-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-03 10:07                 ` Haggai Eran
2015-06-03 10:07                   ` Haggai Eran
2015-05-28 13:15       ` Haggai Eran
2015-05-28 13:15         ` Haggai Eran
2015-05-26 17:55     ` Christian Benvenuti (benve)
2015-05-28 13:07     ` Haggai Eran
2015-05-28 13:07       ` Haggai Eran
     [not found]       ` <55671309.6080303-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-28 14:07         ` Doug Ledford
     [not found]           ` <1432822057.114391.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-28 16:21             ` Or Gerlitz
2015-05-28 16:21               ` Or Gerlitz
     [not found]               ` <55674077.5040707-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-28 17:43                 ` Jason Gunthorpe
     [not found]                   ` <20150528174337.GA10448-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-28 18:22                     ` Doug Ledford
     [not found]                       ` <1432837360.114391.35.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-28 19:05                         ` Or Gerlitz
     [not found]                           ` <CAJ3xEMh2T5-56rFxWVdct2uAZYW1ZrKivWfS45V-mvhAfwyGaA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-28 21:55                             ` Doug Ledford
     [not found]                               ` <1432850150.114391.56.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-03 10:03                                 ` Haggai Eran
2015-06-03 16:14                                   ` Jason Gunthorpe
     [not found]                                     ` <20150603161447.GC12073-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-03 19:05                                       ` Or Gerlitz
2015-06-03 19:53                                         ` Jason Gunthorpe
     [not found]                                           ` <20150603195325.GC7902-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-03 20:07                                             ` Or Gerlitz
     [not found]                                               ` <CAJ3xEMiO+hEzOJ2oJ5G-mmBeaX4ZHvUyhNSAzsrRDui6dFjvCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-03 21:45                                                 ` Jason Gunthorpe
2015-06-04  9:41                                                   ` Haggai Eran
2015-06-04 16:06                                                     ` Jason Gunthorpe
2015-06-03 23:48                                                 ` Jason Gunthorpe
     [not found]                                                   ` <20150603234811.GA15128-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-04  6:24                                                     ` Haggai Eran
     [not found]                                                       ` <556FEF25.80409-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-04 16:40                                                         ` Jason Gunthorpe
     [not found]                                                           ` <20150604164058.GB27699-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-08  7:52                                                             ` Haggai Eran
2015-06-08 16:53                                                               ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.