* [PATCH 0/4] RDMA device net namespace support for SMC
@ 2021-12-28 13:06 Tony Lu
  2021-12-28 13:06 ` [PATCH 1/4] net/smc: Introduce net namespace support for linkgroup Tony Lu
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Tony Lu @ 2021-12-28 13:06 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

This patch set introduces net namespace support for linkgroups.

Patch 1 is the main change that implements net namespace support.

Patches 2-4 are additional changes that expose the net namespace in
netlink, the kernel log and tracepoints.
I will also submit the corresponding smc-tools changes to GitHub later.

Currently, smc doesn't support net namespace isolation. The ibdevs
registered to smc are shared by all linkgroups and connections. When
running applications in different net namespaces, such as in a container
environment, applications should only use the ibdevs that belong to the
same net namespace.

This adds a new field, net, to the smc linkgroup struct. During first
contact, it looks for a linkgroup in the same net namespace; if none is
found, it creates one and initializes the net field with the net
namespace of the first link's ibdev. When looking for rdma devices, it
also checks the net namespaces of the sk's net device and of the ibdev.
After a net namespace is destroyed, the net device and ibdev move to the
root net namespace, so the linkgroups no longer match and wait to be
freed.

If rdma net namespace exclusive mode is not enabled, it behaves as
before.
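
The matching rule described above can be sketched in user space as
follows. This is only a model, assuming the access check behaves like
rdma_dev_access_netns(): shared mode admits every namespace, exclusive
mode admits only the namespace that owns the device. All model_* names
are hypothetical and are not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net and struct ib_device. */
struct model_net {
	unsigned long long cookie;
};

struct model_ibdev {
	const struct model_net *rdma_net; /* namespace owning the device */
};

/* Models "rdma system set netns shared|exclusive"; shared is the default. */
static bool model_shared_netns = true;

/* In shared mode every namespace may use the device; in exclusive mode
 * only the namespace that owns the device may use it.
 */
static bool model_dev_access_netns(const struct model_ibdev *dev,
				   const struct model_net *net)
{
	assert(dev != NULL && net != NULL);
	return model_shared_netns || dev->rdma_net == net;
}

/* A linkgroup is a reuse candidate only if the socket's namespace may
 * access the device behind its link; otherwise a new one is created.
 */
static bool model_lgr_reusable(const struct model_ibdev *link_dev,
			       const struct model_net *sock_net)
{
	return model_dev_access_netns(link_dev, sock_net);
}
```

With model_shared_netns left at true, every linkgroup stays a candidate,
which corresponds to the "behaves as before" case.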

Steps to enable and test net namespaces:

1. enable RDMA device net namespace exclusive support
	rdma system set netns exclusive # default is shared

2. create a new net namespace, then move the devices into it and initialize them
	ip netns add test1
	rdma dev set mlx5_1 netns test1
	ip link set dev eth2 netns test1
	ip netns exec test1 ip link set eth2 up
	ip netns exec test1 ip addr add ${HOST_IP}/26 dev eth2

3. setup server and client, connect N <-> M
	ip netns exec test1 smc_run sockperf server --tcp # server
	ip netns exec test1 smc_run sockperf pp --tcp -i ${SERVER_IP} # client

4. result: netns-isolated connections (2 * 2 mesh), each with its own linkgroup
  - server
LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
00000100 SERV     SINGLE      0       0
00000200 SERV     SINGLE      0       0
00000300 SERV     SINGLE      0       0
00000400 SERV     SINGLE      0       0

  - client
LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
00000100 CLNT     SINGLE      0       0
00000200 CLNT     SINGLE      0       0
00000300 CLNT     SINGLE      0       0
00000400 CLNT     SINGLE      0       0

Tony Lu (4):
  net/smc: Introduce net namespace support for linkgroup
  net/smc: Add netlink net namespace support
  net/smc: Print net namespace in log
  net/smc: Add net namespace for tracepoints

 include/uapi/linux/smc.h      |  2 ++
 include/uapi/linux/smc_diag.h | 11 ++++++-----
 net/smc/smc_core.c            | 31 ++++++++++++++++++++++---------
 net/smc/smc_core.h            |  2 ++
 net/smc/smc_diag.c            | 16 +++++++++-------
 net/smc/smc_ib.h              |  7 +++++++
 net/smc/smc_llc.c             | 19 ++++++++++++-------
 net/smc/smc_pnet.c            | 21 ++++++++++++++++-----
 net/smc/smc_tracepoint.h      | 23 ++++++++++++++++-------
 9 files changed, 92 insertions(+), 40 deletions(-)

-- 
2.32.0.3.g01195cf9f



* [PATCH 1/4] net/smc: Introduce net namespace support for linkgroup
  2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
@ 2021-12-28 13:06 ` Tony Lu
  2021-12-28 13:06 ` [PATCH 2/4] net/smc: Add netlink net namespace support Tony Lu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Tony Lu @ 2021-12-28 13:06 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

Currently, rdma devices support exclusive net namespace isolation, but
linkgroups are not aware of the ibdev's net namespace and do not support
it. Applications in containers don't want to share the nics when rdma
exclusive mode is enabled. Every net namespace should have its own
linkgroups.

This patch introduces a new field, net, in the linkgroup, which stands
for the net namespace of the ibdev in the linkgroup. The net in the
linkgroup is initialized with the net namespace of the link's ibdev.
Before choosing a linkgroup, it compares the linkgroup's net with that
of the sock or ibdev; if none matches, it creates a new one in the
current net namespace. If rdma net namespace exclusive mode is not
enabled, it behaves as before.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_core.c | 24 +++++++++++++++++-------
 net/smc/smc_core.h |  2 ++
 net/smc/smc_ib.h   |  7 +++++++
 net/smc/smc_pnet.c | 21 ++++++++++++++++-----
 4 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 85be94cabb01..05c11bbe4318 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -897,6 +897,7 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
 			smc_wr_free_lgr_mem(lgr);
 			goto free_wq;
 		}
+		lgr->net = smc_ib_net(lnk->smcibdev);
 		lgr_list = &smc_lgr_list.list;
 		lgr_lock = &smc_lgr_list.lock;
 		atomic_inc(&lgr_cnt);
@@ -1570,7 +1571,8 @@ void smcr_port_add(struct smc_ib_device *smcibdev, u8 ibport)
 		if (strncmp(smcibdev->pnetid[ibport - 1], lgr->pnet_id,
 			    SMC_MAX_PNETID_LEN) ||
 		    lgr->type == SMC_LGR_SYMMETRIC ||
-		    lgr->type == SMC_LGR_ASYMMETRIC_PEER)
+		    lgr->type == SMC_LGR_ASYMMETRIC_PEER ||
+		    !rdma_dev_access_netns(smcibdev->ibdev, lgr->net))
 			continue;
 
 		/* trigger local add link processing */
@@ -1729,8 +1731,10 @@ static bool smcr_lgr_match(struct smc_link_group *lgr, u8 smcr_version,
 			   u8 peer_systemid[],
 			   u8 peer_gid[],
 			   u8 peer_mac_v1[],
-			   enum smc_lgr_role role, u32 clcqpn)
+			   enum smc_lgr_role role, u32 clcqpn,
+			   struct net *net)
 {
+	struct smc_link *lnk;
 	int i;
 
 	if (memcmp(lgr->peer_systemid, peer_systemid, SMC_SYSTEMID_LEN) ||
@@ -1738,12 +1742,17 @@ static bool smcr_lgr_match(struct smc_link_group *lgr, u8 smcr_version,
 		return false;
 
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
-		if (!smc_link_active(&lgr->lnk[i]))
+		lnk = &lgr->lnk[i];
+
+		if (!smc_link_active(lnk))
 			continue;
-		if ((lgr->role == SMC_SERV || lgr->lnk[i].peer_qpn == clcqpn) &&
-		    !memcmp(lgr->lnk[i].peer_gid, peer_gid, SMC_GID_SIZE) &&
+		/* use verbs API to check netns, instead of lgr->net */
+		if (!rdma_dev_access_netns(lnk->smcibdev->ibdev, net))
+			return false;
+		if ((lgr->role == SMC_SERV || lnk->peer_qpn == clcqpn) &&
+		    !memcmp(lnk->peer_gid, peer_gid, SMC_GID_SIZE) &&
 		    (smcr_version == SMC_V2 ||
-		     !memcmp(lgr->lnk[i].peer_mac, peer_mac_v1, ETH_ALEN)))
+		     !memcmp(lnk->peer_mac, peer_mac_v1, ETH_ALEN)))
 			return true;
 	}
 	return false;
@@ -1759,6 +1768,7 @@ static bool smcd_lgr_match(struct smc_link_group *lgr,
 int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 {
 	struct smc_connection *conn = &smc->conn;
+	struct net *net = sock_net(&smc->sk);
 	struct list_head *lgr_list;
 	struct smc_link_group *lgr;
 	enum smc_lgr_role role;
@@ -1785,7 +1795,7 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		     smcr_lgr_match(lgr, ini->smcr_version,
 				    ini->peer_systemid,
 				    ini->peer_gid, ini->peer_mac, role,
-				    ini->ib_clcqpn)) &&
+				    ini->ib_clcqpn, net)) &&
 		    !lgr->sync_err &&
 		    (ini->smcd_version == SMC_V2 ||
 		     lgr->vlan_id == ini->vlan_id) &&
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 59cef3b830d8..69e11ce22725 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -306,6 +306,8 @@ struct smc_link_group {
 			u8			nexthop_mac[ETH_ALEN];
 			u8			uses_gateway;
 			__be32			saddr;
+						/* net namespace */
+			struct net		*net;
 		};
 		struct { /* SMC-D */
 			u64			peer_gid;
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 07585937370e..91a396120dee 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -69,6 +69,13 @@ static inline __be32 smc_ib_gid_to_ipv4(u8 gid[SMC_GID_SIZE])
 	return cpu_to_be32(INADDR_NONE);
 }
 
+static inline struct net *smc_ib_net(struct smc_ib_device *smcibdev)
+{
+	if (smcibdev && smcibdev->ibdev)
+		return read_pnet(&smcibdev->ibdev->coredev.rdma_net);
+	return NULL;
+}
+
 struct smc_init_info_smcrv2;
 struct smc_buf_desc;
 struct smc_link;
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index e171cc6483f8..db9825c01e0a 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -977,14 +977,16 @@ static int smc_pnet_determine_gid(struct smc_ib_device *ibdev, int i,
 /* find a roce device for the given pnetid */
 static void _smc_pnet_find_roce_by_pnetid(u8 *pnet_id,
 					  struct smc_init_info *ini,
-					  struct smc_ib_device *known_dev)
+					  struct smc_ib_device *known_dev,
+					  struct net *net)
 {
 	struct smc_ib_device *ibdev;
 	int i;
 
 	mutex_lock(&smc_ib_devices.mutex);
 	list_for_each_entry(ibdev, &smc_ib_devices.list, list) {
-		if (ibdev == known_dev)
+		if (ibdev == known_dev ||
+		    !rdma_dev_access_netns(ibdev->ibdev, net))
 			continue;
 		for (i = 1; i <= SMC_MAX_PORTS; i++) {
 			if (!rdma_is_port_valid(ibdev->ibdev, i))
@@ -1001,12 +1003,14 @@ static void _smc_pnet_find_roce_by_pnetid(u8 *pnet_id,
 	mutex_unlock(&smc_ib_devices.mutex);
 }
 
-/* find alternate roce device with same pnet_id and vlan_id */
+/* find alternate roce device with same pnet_id, vlan_id and net namespace */
 void smc_pnet_find_alt_roce(struct smc_link_group *lgr,
 			    struct smc_init_info *ini,
 			    struct smc_ib_device *known_dev)
 {
-	_smc_pnet_find_roce_by_pnetid(lgr->pnet_id, ini, known_dev);
+	struct net *net = lgr->net;
+
+	_smc_pnet_find_roce_by_pnetid(lgr->pnet_id, ini, known_dev, net);
 }
 
 /* if handshake network device belongs to a roce device, return its
@@ -1015,6 +1019,7 @@ void smc_pnet_find_alt_roce(struct smc_link_group *lgr,
 static void smc_pnet_find_rdma_dev(struct net_device *netdev,
 				   struct smc_init_info *ini)
 {
+	struct net *net = dev_net(netdev);
 	struct smc_ib_device *ibdev;
 
 	mutex_lock(&smc_ib_devices.mutex);
@@ -1022,6 +1027,10 @@ static void smc_pnet_find_rdma_dev(struct net_device *netdev,
 		struct net_device *ndev;
 		int i;
 
+		/* check rdma net namespace */
+		if (!rdma_dev_access_netns(ibdev->ibdev, net))
+			continue;
+
 		for (i = 1; i <= SMC_MAX_PORTS; i++) {
 			if (!rdma_is_port_valid(ibdev->ibdev, i))
 				continue;
@@ -1052,15 +1061,17 @@ static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
 					 struct smc_init_info *ini)
 {
 	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
+	struct net *net;
 
 	ndev = pnet_find_base_ndev(ndev);
+	net = dev_net(ndev);
 	if (smc_pnetid_by_dev_port(ndev->dev.parent, ndev->dev_port,
 				   ndev_pnetid) &&
 	    smc_pnet_find_ndev_pnetid_by_table(ndev, ndev_pnetid)) {
 		smc_pnet_find_rdma_dev(ndev, ini);
 		return; /* pnetid could not be determined */
 	}
-	_smc_pnet_find_roce_by_pnetid(ndev_pnetid, ini, NULL);
+	_smc_pnet_find_roce_by_pnetid(ndev_pnetid, ini, NULL, net);
 }
 
 static void smc_pnet_find_ism_by_pnetid(struct net_device *ndev,
-- 
2.32.0.3.g01195cf9f



* [PATCH 2/4] net/smc: Add netlink net namespace support
  2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
  2021-12-28 13:06 ` [PATCH 1/4] net/smc: Introduce net namespace support for linkgroup Tony Lu
@ 2021-12-28 13:06 ` Tony Lu
  2022-01-31  0:24   ` Dmitry V. Levin
  2021-12-28 13:06 ` [PATCH 3/4] net/smc: Print net namespace in log Tony Lu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Tony Lu @ 2021-12-28 13:06 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

This adds the net namespace ID to the linkgroup diag output, which helps
us distinguish different namespaces; net_cookie is unique across the
whole system.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 include/uapi/linux/smc.h      |  2 ++
 include/uapi/linux/smc_diag.h | 11 ++++++-----
 net/smc/smc_core.c            |  3 +++
 net/smc/smc_diag.c            | 16 +++++++++-------
 4 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h
index 20f33b27787f..6c2874fd2c00 100644
--- a/include/uapi/linux/smc.h
+++ b/include/uapi/linux/smc.h
@@ -119,6 +119,8 @@ enum {
 	SMC_NLA_LGR_R_CONNS_NUM,	/* u32 */
 	SMC_NLA_LGR_R_V2_COMMON,	/* nest */
 	SMC_NLA_LGR_R_V2,		/* nest */
+	SMC_NLA_LGR_R_NET_COOKIE,	/* u64 */
+	SMC_NLA_LGR_R_PAD,		/* flag */
 	__SMC_NLA_LGR_R_MAX,
 	SMC_NLA_LGR_R_MAX = __SMC_NLA_LGR_R_MAX - 1
 };
diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
index 8cb3a6fef553..c7008d87f1a4 100644
--- a/include/uapi/linux/smc_diag.h
+++ b/include/uapi/linux/smc_diag.h
@@ -84,11 +84,12 @@ struct smc_diag_conninfo {
 /* SMC_DIAG_LINKINFO */
 
 struct smc_diag_linkinfo {
-	__u8 link_id;			/* link identifier */
-	__u8 ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
-	__u8 ibport;			/* RDMA device port number */
-	__u8 gid[40];			/* local GID */
-	__u8 peer_gid[40];		/* peer GID */
+	__u8		link_id;		    /* link identifier */
+	__u8		ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
+	__u8		ibport;			    /* RDMA device port number */
+	__u8		gid[40];		    /* local GID */
+	__u8		peer_gid[40];		    /* peer GID */
+	__aligned_u64	net_cookie;                 /* RDMA device net namespace */
 };
 
 struct smc_diag_lgrinfo {
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 05c11bbe4318..b9d6148d1287 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -348,6 +348,9 @@ static int smc_nl_fill_lgr(struct smc_link_group *lgr,
 		goto errattr;
 	if (nla_put_u8(skb, SMC_NLA_LGR_R_VLAN_ID, lgr->vlan_id))
 		goto errattr;
+	if (nla_put_u64_64bit(skb, SMC_NLA_LGR_R_NET_COOKIE,
+			      lgr->net->net_cookie, SMC_NLA_LGR_R_PAD))
+		goto errattr;
 	memcpy(smc_target, lgr->pnet_id, SMC_MAX_PNETID_LEN);
 	smc_target[SMC_MAX_PNETID_LEN] = 0;
 	if (nla_put_string(skb, SMC_NLA_LGR_R_PNETID, smc_target))
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index c952986a6aca..7c8dad28c18d 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -145,19 +145,21 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
 	if (smc->conn.lgr && !smc->conn.lgr->is_smcd &&
 	    (req->diag_ext & (1 << (SMC_DIAG_LGRINFO - 1))) &&
 	    !list_empty(&smc->conn.lgr->list)) {
+		struct smc_link *link = smc->conn.lnk;
+		struct net *net = read_pnet(&link->smcibdev->ibdev->coredev.rdma_net);
+
 		struct smc_diag_lgrinfo linfo = {
 			.role = smc->conn.lgr->role,
-			.lnk[0].ibport = smc->conn.lnk->ibport,
-			.lnk[0].link_id = smc->conn.lnk->link_id,
+			.lnk[0].ibport = link->ibport,
+			.lnk[0].link_id = link->link_id,
+			.lnk[0].net_cookie = net->net_cookie,
 		};
 
 		memcpy(linfo.lnk[0].ibname,
 		       smc->conn.lgr->lnk[0].smcibdev->ibdev->name,
-		       sizeof(smc->conn.lnk->smcibdev->ibdev->name));
-		smc_gid_be16_convert(linfo.lnk[0].gid,
-				     smc->conn.lnk->gid);
-		smc_gid_be16_convert(linfo.lnk[0].peer_gid,
-				     smc->conn.lnk->peer_gid);
+		       sizeof(link->smcibdev->ibdev->name));
+		smc_gid_be16_convert(linfo.lnk[0].gid, link->gid);
+		smc_gid_be16_convert(linfo.lnk[0].peer_gid, link->peer_gid);
 
 		if (nla_put(skb, SMC_DIAG_LGRINFO, sizeof(linfo), &linfo) < 0)
 			goto errout;
-- 
2.32.0.3.g01195cf9f



* [PATCH 3/4] net/smc: Print net namespace in log
  2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
  2021-12-28 13:06 ` [PATCH 1/4] net/smc: Introduce net namespace support for linkgroup Tony Lu
  2021-12-28 13:06 ` [PATCH 2/4] net/smc: Add netlink net namespace support Tony Lu
@ 2021-12-28 13:06 ` Tony Lu
  2021-12-28 13:06 ` [PATCH 4/4] net/smc: Add net namespace for tracepoints Tony Lu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Tony Lu @ 2021-12-28 13:06 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

This adds the net namespace ID to the kernel log; net_cookie is unique
across the whole system. It is useful in container environments.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_core.c |  4 ++--
 net/smc/smc_llc.c  | 19 ++++++++++++-------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index b9d6148d1287..42be10d9c780 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1537,9 +1537,9 @@ void smcr_lgr_set_type(struct smc_link_group *lgr, enum smc_lgr_type new_type)
 		lgr_type = "ASYMMETRIC_LOCAL";
 		break;
 	}
-	pr_warn_ratelimited("smc: SMC-R lg %*phN state changed: "
+	pr_warn_ratelimited("smc: SMC-R lg %*phN net %llu state changed: "
 			    "%s, pnetid %.16s\n", SMC_LGR_ID_SIZE, &lgr->id,
-			    lgr_type, lgr->pnet_id);
+			    lgr->net->net_cookie, lgr_type, lgr->pnet_id);
 }
 
 /* set new lgr type and tag a link as asymmetric */
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index b102680296b8..991ace8e316b 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -242,9 +242,10 @@ static void smc_llc_flow_parallel(struct smc_link_group *lgr, u8 flow_type,
 	}
 	/* drop parallel or already-in-progress llc requests */
 	if (flow_type != msg_type)
-		pr_warn_once("smc: SMC-R lg %*phN dropped parallel "
+		pr_warn_once("smc: SMC-R lg %*phN net %llu dropped parallel "
 			     "LLC msg: msg %d flow %d role %d\n",
 			     SMC_LGR_ID_SIZE, &lgr->id,
+			     lgr->net->net_cookie,
 			     qentry->msg.raw.hdr.common.type,
 			     flow_type, lgr->role);
 	kfree(qentry);
@@ -359,9 +360,10 @@ struct smc_llc_qentry *smc_llc_wait(struct smc_link_group *lgr,
 					   smc_llc_flow_qentry_clr(flow));
 			return NULL;
 		}
-		pr_warn_once("smc: SMC-R lg %*phN dropped unexpected LLC msg: "
+		pr_warn_once("smc: SMC-R lg %*phN net %llu dropped unexpected LLC msg: "
 			     "msg %d exp %d flow %d role %d flags %x\n",
-			     SMC_LGR_ID_SIZE, &lgr->id, rcv_msg, exp_msg,
+			     SMC_LGR_ID_SIZE, &lgr->id, lgr->net->net_cookie,
+			     rcv_msg, exp_msg,
 			     flow->type, lgr->role,
 			     flow->qentry->msg.raw.hdr.flags);
 		smc_llc_flow_qentry_del(flow);
@@ -1816,8 +1818,9 @@ static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr)
 
 static void smc_llc_protocol_violation(struct smc_link_group *lgr, u8 type)
 {
-	pr_warn_ratelimited("smc: SMC-R lg %*phN LLC protocol violation: "
-			    "llc_type %d\n", SMC_LGR_ID_SIZE, &lgr->id, type);
+	pr_warn_ratelimited("smc: SMC-R lg %*phN net %llu LLC protocol violation: "
+			    "llc_type %d\n", SMC_LGR_ID_SIZE, &lgr->id,
+			    lgr->net->net_cookie, type);
 	smc_llc_set_termination_rsn(lgr, SMC_LLC_DEL_PROT_VIOL);
 	smc_lgr_terminate_sched(lgr);
 }
@@ -2146,9 +2149,10 @@ int smc_llc_link_init(struct smc_link *link)
 
 void smc_llc_link_active(struct smc_link *link)
 {
-	pr_warn_ratelimited("smc: SMC-R lg %*phN link added: id %*phN, "
+	pr_warn_ratelimited("smc: SMC-R lg %*phN net %llu link added: id %*phN, "
 			    "peerid %*phN, ibdev %s, ibport %d\n",
 			    SMC_LGR_ID_SIZE, &link->lgr->id,
+			    link->lgr->net->net_cookie,
 			    SMC_LGR_ID_SIZE, &link->link_uid,
 			    SMC_LGR_ID_SIZE, &link->peer_link_uid,
 			    link->smcibdev->ibdev->name, link->ibport);
@@ -2164,9 +2168,10 @@ void smc_llc_link_active(struct smc_link *link)
 void smc_llc_link_clear(struct smc_link *link, bool log)
 {
 	if (log)
-		pr_warn_ratelimited("smc: SMC-R lg %*phN link removed: id %*phN"
+		pr_warn_ratelimited("smc: SMC-R lg %*phN net %llu link removed: id %*phN"
 				    ", peerid %*phN, ibdev %s, ibport %d\n",
 				    SMC_LGR_ID_SIZE, &link->lgr->id,
+				    link->lgr->net->net_cookie,
 				    SMC_LGR_ID_SIZE, &link->link_uid,
 				    SMC_LGR_ID_SIZE, &link->peer_link_uid,
 				    link->smcibdev->ibdev->name, link->ibport);
-- 
2.32.0.3.g01195cf9f



* [PATCH 4/4] net/smc: Add net namespace for tracepoints
  2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
                   ` (2 preceding siblings ...)
  2021-12-28 13:06 ` [PATCH 3/4] net/smc: Print net namespace in log Tony Lu
@ 2021-12-28 13:06 ` Tony Lu
  2022-01-02 12:20 ` [PATCH 0/4] RDMA device net namespace support for SMC patchwork-bot+netdevbpf
  2022-02-17 11:33 ` Niklas Schnelle
  5 siblings, 0 replies; 15+ messages in thread
From: Tony Lu @ 2021-12-28 13:06 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

This prints the net namespace ID, which helps us distinguish different
net namespaces when using tracepoints.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_tracepoint.h | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/smc/smc_tracepoint.h b/net/smc/smc_tracepoint.h
index ec17f29646f5..9fc5e586d24a 100644
--- a/net/smc/smc_tracepoint.h
+++ b/net/smc/smc_tracepoint.h
@@ -22,6 +22,7 @@ TRACE_EVENT(smc_switch_to_fallback,
 	    TP_STRUCT__entry(
 			     __field(const void *, sk)
 			     __field(const void *, clcsk)
+			     __field(u64, net_cookie)
 			     __field(int, fallback_rsn)
 	    ),
 
@@ -31,11 +32,13 @@ TRACE_EVENT(smc_switch_to_fallback,
 
 			   __entry->sk = sk;
 			   __entry->clcsk = clcsk;
+			   __entry->net_cookie = sock_net(sk)->net_cookie;
 			   __entry->fallback_rsn = fallback_rsn;
 	    ),
 
-	    TP_printk("sk=%p clcsk=%p fallback_rsn=%d",
-		      __entry->sk, __entry->clcsk, __entry->fallback_rsn)
+	    TP_printk("sk=%p clcsk=%p net=%llu fallback_rsn=%d",
+		      __entry->sk, __entry->clcsk,
+		      __entry->net_cookie, __entry->fallback_rsn)
 );
 
 DECLARE_EVENT_CLASS(smc_msg_event,
@@ -46,19 +49,23 @@ DECLARE_EVENT_CLASS(smc_msg_event,
 
 		    TP_STRUCT__entry(
 				     __field(const void *, smc)
+				     __field(u64, net_cookie)
 				     __field(size_t, len)
 				     __string(name, smc->conn.lnk->ibname)
 		    ),
 
 		    TP_fast_assign(
+				   const struct sock *sk = &smc->sk;
+
 				   __entry->smc = smc;
+				   __entry->net_cookie = sock_net(sk)->net_cookie;
 				   __entry->len = len;
 				   __assign_str(name, smc->conn.lnk->ibname);
 		    ),
 
-		    TP_printk("smc=%p len=%zu dev=%s",
-			      __entry->smc, __entry->len,
-			      __get_str(name))
+		    TP_printk("smc=%p net=%llu len=%zu dev=%s",
+			      __entry->smc, __entry->net_cookie,
+			      __entry->len, __get_str(name))
 );
 
 DEFINE_EVENT(smc_msg_event, smc_tx_sendmsg,
@@ -84,6 +91,7 @@ TRACE_EVENT(smcr_link_down,
 	    TP_STRUCT__entry(
 			     __field(const void *, lnk)
 			     __field(const void *, lgr)
+			     __field(u64, net_cookie)
 			     __field(int, state)
 			     __string(name, lnk->ibname)
 			     __field(void *, location)
@@ -94,13 +102,14 @@ TRACE_EVENT(smcr_link_down,
 
 			   __entry->lnk = lnk;
 			   __entry->lgr = lgr;
+			   __entry->net_cookie = lgr->net->net_cookie;
 			   __entry->state = lnk->state;
 			   __assign_str(name, lnk->ibname);
 			   __entry->location = location;
 	    ),
 
-	    TP_printk("lnk=%p lgr=%p state=%d dev=%s location=%pS",
-		      __entry->lnk, __entry->lgr,
+	    TP_printk("lnk=%p lgr=%p net=%llu state=%d dev=%s location=%pS",
+		      __entry->lnk, __entry->lgr, __entry->net_cookie,
 		      __entry->state, __get_str(name),
 		      __entry->location)
 );
-- 
2.32.0.3.g01195cf9f



* Re: [PATCH 0/4] RDMA device net namespace support for SMC
  2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
                   ` (3 preceding siblings ...)
  2021-12-28 13:06 ` [PATCH 4/4] net/smc: Add net namespace for tracepoints Tony Lu
@ 2022-01-02 12:20 ` patchwork-bot+netdevbpf
  2022-02-17 11:33 ` Niklas Schnelle
  5 siblings, 0 replies; 15+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-01-02 12:20 UTC (permalink / raw)
  To: Tony Lu; +Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue, 28 Dec 2021 21:06:08 +0800 you wrote:
> This patch set introduces net namespace support for linkgroups.
> 
> Patch 1 is the main approach to implement net ns support.
> 
> Patches 2 - 4 are the additional modifications to let us know the netns.
> Also, I will submit changes of smc-tools to github later.
> 
> [...]

Here is the summary with links:
  - [1/4] net/smc: Introduce net namespace support for linkgroup
    https://git.kernel.org/netdev/net-next/c/0237a3a683e4
  - [2/4] net/smc: Add netlink net namespace support
    https://git.kernel.org/netdev/net-next/c/79d39fc503b4
  - [3/4] net/smc: Print net namespace in log
    https://git.kernel.org/netdev/net-next/c/de2fea7b39bf
  - [4/4] net/smc: Add net namespace for tracepoints
    https://git.kernel.org/netdev/net-next/c/a838f5084828

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH 2/4] net/smc: Add netlink net namespace support
  2021-12-28 13:06 ` [PATCH 2/4] net/smc: Add netlink net namespace support Tony Lu
@ 2022-01-31  0:24   ` Dmitry V. Levin
  2022-01-31 13:49     ` Karsten Graul
  0 siblings, 1 reply; 15+ messages in thread
From: Dmitry V. Levin @ 2022-01-31  0:24 UTC (permalink / raw)
  To: Tony Lu; +Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma, linux-api

On Tue, Dec 28, 2021 at 09:06:10PM +0800, Tony Lu wrote:
> This adds net namespace ID to diag of linkgroup, helps us to distinguish
> different namespaces, and net_cookie is unique in the whole system.
> 
> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
> ---
>  include/uapi/linux/smc.h      |  2 ++
>  include/uapi/linux/smc_diag.h | 11 ++++++-----
>  net/smc/smc_core.c            |  3 +++
>  net/smc/smc_diag.c            | 16 +++++++++-------
>  4 files changed, 20 insertions(+), 12 deletions(-)
> 
> diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h
> index 20f33b27787f..6c2874fd2c00 100644
> --- a/include/uapi/linux/smc.h
> +++ b/include/uapi/linux/smc.h
> @@ -119,6 +119,8 @@ enum {
>  	SMC_NLA_LGR_R_CONNS_NUM,	/* u32 */
>  	SMC_NLA_LGR_R_V2_COMMON,	/* nest */
>  	SMC_NLA_LGR_R_V2,		/* nest */
> +	SMC_NLA_LGR_R_NET_COOKIE,	/* u64 */
> +	SMC_NLA_LGR_R_PAD,		/* flag */
>  	__SMC_NLA_LGR_R_MAX,
>  	SMC_NLA_LGR_R_MAX = __SMC_NLA_LGR_R_MAX - 1
>  };
> diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
> index 8cb3a6fef553..c7008d87f1a4 100644
> --- a/include/uapi/linux/smc_diag.h
> +++ b/include/uapi/linux/smc_diag.h
> @@ -84,11 +84,12 @@ struct smc_diag_conninfo {
>  /* SMC_DIAG_LINKINFO */
>  
>  struct smc_diag_linkinfo {
> -	__u8 link_id;			/* link identifier */
> -	__u8 ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
> -	__u8 ibport;			/* RDMA device port number */
> -	__u8 gid[40];			/* local GID */
> -	__u8 peer_gid[40];		/* peer GID */
> +	__u8		link_id;		    /* link identifier */
> +	__u8		ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
> +	__u8		ibport;			    /* RDMA device port number */
> +	__u8		gid[40];		    /* local GID */
> +	__u8		peer_gid[40];		    /* peer GID */
> +	__aligned_u64	net_cookie;                 /* RDMA device net namespace */
>  };
>  
>  struct smc_diag_lgrinfo {

I'm sorry but this is an ABI regression.

Since struct smc_diag_lgrinfo contains an object of type "struct smc_diag_linkinfo",
the offset of all subsequent members of struct smc_diag_lgrinfo is changed by
this patch.

As a result, applications compiled with the old version of struct smc_diag_linkinfo
will receive garbage in struct smc_diag_lgrinfo.role if the kernel implements
this new version of struct smc_diag_linkinfo.
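
The offset shift can be reproduced with a small user-space model. This
is a sketch, not the uapi header: the model_* names are hypothetical,
IB_DEVICE_NAME_MAX is assumed to be 64, and struct smc_diag_lgrinfo is
assumed to hold lnk[1] followed by role.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_IB_DEVICE_NAME_MAX 64 /* assumed value of IB_DEVICE_NAME_MAX */

/* Layout before the patch: byte-sized members only. */
struct model_linkinfo_old {
	uint8_t link_id;
	uint8_t ibname[MODEL_IB_DEVICE_NAME_MAX];
	uint8_t ibport;
	uint8_t gid[40];
	uint8_t peer_gid[40];
};

/* Layout after the patch: an 8-byte aligned member is appended. */
struct model_linkinfo_new {
	uint8_t link_id;
	uint8_t ibname[MODEL_IB_DEVICE_NAME_MAX];
	uint8_t ibport;
	uint8_t gid[40];
	uint8_t peer_gid[40];
	uint64_t net_cookie __attribute__((aligned(8)));
};

/* The enclosing struct embeds the link info, so role moves with it. */
struct model_lgrinfo_old {
	struct model_linkinfo_old lnk[1];
	uint8_t role;
};

struct model_lgrinfo_new {
	struct model_linkinfo_new lnk[1];
	uint8_t role;
};

/* The old inner struct has no padding: 1 + 64 + 1 + 40 + 40 bytes. */
static_assert(sizeof(struct model_linkinfo_old) == 146,
	      "old link info is expected to be packed by construction");

/* Where a binary built against the old header reads role. */
static size_t model_role_offset_old(void)
{
	return offsetof(struct model_lgrinfo_old, role);
}

/* Where a kernel built with the new header writes role. */
static size_t model_role_offset_new(void)
{
	return offsetof(struct model_lgrinfo_new, role);
}
```

An old binary reads role at model_role_offset_old(), while the new
layout places it at model_role_offset_new(); in this model the old
offset falls inside the padding and net_cookie area of the new struct.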


-- 
ldv


* Re: [PATCH 2/4] net/smc: Add netlink net namespace support
  2022-01-31  0:24   ` Dmitry V. Levin
@ 2022-01-31 13:49     ` Karsten Graul
  2022-02-02  3:09       ` [PATCH] Partially revert "net/smc: Add netlink net namespace support" Dmitry V. Levin
  0 siblings, 1 reply; 15+ messages in thread
From: Karsten Graul @ 2022-01-31 13:49 UTC (permalink / raw)
  To: Dmitry V. Levin, Tony Lu
  Cc: kuba, davem, netdev, linux-s390, linux-rdma, linux-api

On 31/01/2022 01:24, Dmitry V. Levin wrote:
> On Tue, Dec 28, 2021 at 09:06:10PM +0800, Tony Lu wrote:
>> This adds net namespace ID to diag of linkgroup, helps us to distinguish
>> different namespaces, and net_cookie is unique in the whole system.
>>
> 
> I'm sorry but this is an ABI regression.
> 
> Since struct smc_diag_lgrinfo contains an object of type "struct smc_diag_linkinfo",
> offset of all subsequent members of struct smc_diag_lgrinfo is changed by
> this patch.
> 
> As result, applications compiled with the old version of struct smc_diag_linkinfo
> will receive garbage in struct smc_diag_lgrinfo.role if the kernel implements
> this new version of struct smc_diag_linkinfo.
> 

Good catch! This patch adds 2 ways to provide the net_cookie to user space: one is over the new
netlink interface, and the other is using the old smc_diag way.
IMHO using the new netlink interface is good enough; there is no need to touch the smc_diag ABI.
We have already started adding new fields to the netlink interface only; this flexibility is
the reason why we added this interface initially.
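
Why the netlink interface tolerates new fields can be sketched like
this: attributes are type-length-value records, so a reader that does
not know a type simply skips over it. The code below is a user-space
model only; the model_* names and the attribute number 42 are
hypothetical, and the header merely mirrors the shape of struct nlattr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute number; the real one comes from the uapi enum. */
#define MODEL_ATTR_NET_COOKIE 42

/* Same shape as struct nlattr: total length (header included) and type. */
struct model_attr_hdr {
	uint16_t len;
	uint16_t type;
};

/* Attributes start on 4-byte boundaries. */
#define MODEL_ALIGN(n) (((size_t)(n) + 3u) & ~(size_t)3u)

/* Append one attribute at offset off and return the next free offset.
 * The caller provides a zeroed buffer that is large enough.
 */
static size_t model_put_attr(uint8_t *buf, size_t off, uint16_t type,
			     const void *data, uint16_t data_len)
{
	struct model_attr_hdr hdr = {
		.len = (uint16_t)(sizeof(hdr) + data_len),
		.type = type,
	};

	memcpy(buf + off, &hdr, sizeof(hdr));
	memcpy(buf + off + sizeof(hdr), data, data_len);
	return off + MODEL_ALIGN(hdr.len);
}

/* Walk the attributes and copy out the u64 payload of the wanted type.
 * Unknown types are skipped. Returns 0 on success, -1 if the attribute
 * is absent or the buffer is malformed.
 */
static int model_find_u64(const uint8_t *buf, size_t len, uint16_t type,
			  uint64_t *out)
{
	size_t off = 0;

	while (off + sizeof(struct model_attr_hdr) <= len) {
		struct model_attr_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > len - off)
			return -1;
		if (hdr.type == type &&
		    hdr.len == sizeof(hdr) + sizeof(*out)) {
			memcpy(out, buf + off + sizeof(hdr), sizeof(*out));
			return 0;
		}
		off += MODEL_ALIGN(hdr.len);
	}
	return -1;
}
```

A reader written before the cookie attribute existed would walk the same
buffer and never look at type 42, which is the property a fixed struct
layout such as smc_diag does not have.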

So a patch that removes
	__aligned_u64	net_cookie;
and
	.lnk[0].net_cookie = net->net_cookie,
should solve the issue. 

Thoughts?


* [PATCH] Partially revert "net/smc: Add netlink net namespace support"
  2022-01-31 13:49     ` Karsten Graul
@ 2022-02-02  3:09       ` Dmitry V. Levin
  2022-02-02  7:26         ` Karsten Graul
  2022-02-09  9:43         ` Tony Lu
  0 siblings, 2 replies; 15+ messages in thread
From: Dmitry V. Levin @ 2022-02-02  3:09 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma, linux-api

The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc503b4
("net/smc: Add netlink net namespace support") introduced an ABI
regression: since struct smc_diag_lgrinfo contains an object of
type "struct smc_diag_linkinfo", offset of all subsequent members
of struct smc_diag_lgrinfo was changed by that change.

As a result, applications compiled with the old version
of struct smc_diag_linkinfo will receive garbage in
struct smc_diag_lgrinfo.role if the kernel implements
this new version of struct smc_diag_linkinfo.

Fix this regression by reverting the part of commit 79d39fc503b4 that
changes struct smc_diag_linkinfo.  After all, the SMC_GEN_NETLINK
interface is good enough, so there was probably no need to touch
the smc_diag ABI in the first place.

Fixes: 79d39fc503b4 ("net/smc: Add netlink net namespace support")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
---
 include/uapi/linux/smc_diag.h | 11 +++++------
 net/smc/smc_diag.c            |  2 --
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
index c7008d87f1a4..8cb3a6fef553 100644
--- a/include/uapi/linux/smc_diag.h
+++ b/include/uapi/linux/smc_diag.h
@@ -84,12 +84,11 @@ struct smc_diag_conninfo {
 /* SMC_DIAG_LINKINFO */
 
 struct smc_diag_linkinfo {
-	__u8		link_id;		    /* link identifier */
-	__u8		ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
-	__u8		ibport;			    /* RDMA device port number */
-	__u8		gid[40];		    /* local GID */
-	__u8		peer_gid[40];		    /* peer GID */
-	__aligned_u64	net_cookie;                 /* RDMA device net namespace */
+	__u8 link_id;			/* link identifier */
+	__u8 ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
+	__u8 ibport;			/* RDMA device port number */
+	__u8 gid[40];			/* local GID */
+	__u8 peer_gid[40];		/* peer GID */
 };
 
 struct smc_diag_lgrinfo {
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index b8898c787d23..1fca2f90a9c7 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -146,13 +146,11 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
 	    (req->diag_ext & (1 << (SMC_DIAG_LGRINFO - 1))) &&
 	    !list_empty(&smc->conn.lgr->list)) {
 		struct smc_link *link = smc->conn.lnk;
-		struct net *net = read_pnet(&link->smcibdev->ibdev->coredev.rdma_net);
 
 		struct smc_diag_lgrinfo linfo = {
 			.role = smc->conn.lgr->role,
 			.lnk[0].ibport = link->ibport,
 			.lnk[0].link_id = link->link_id,
-			.lnk[0].net_cookie = net->net_cookie,
 		};
 
 		memcpy(linfo.lnk[0].ibname,
-- 
ldv
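
The offset shift described in the commit message can be reproduced in
userspace. The following is a minimal sketch, not the kernel headers:
the struct names are stand-ins, the values of IB_DEVICE_NAME_MAX (64) and
SMC_MAX_PNETID_LEN (16) are assumptions, and struct smc_diag_lgrinfo is
assumed to consist of lnk[1], role and pnet_id. Under those assumptions
the offset of "role" moves from 146 to 160 once the 8-byte aligned
net_cookie member is appended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the real ones come from the kernel UAPI headers. */
#define IB_DEVICE_NAME_MAX 64
#define SMC_MAX_PNETID_LEN 16

/* Layout before commit 79d39fc503b4 (and after the partial revert). */
struct linkinfo_old {
	uint8_t link_id;
	uint8_t ibname[IB_DEVICE_NAME_MAX];
	uint8_t ibport;
	uint8_t gid[40];
	uint8_t peer_gid[40];
};

/* Layout with the additional 8-byte aligned net_cookie member. */
struct linkinfo_new {
	uint8_t link_id;
	uint8_t ibname[IB_DEVICE_NAME_MAX];
	uint8_t ibport;
	uint8_t gid[40];
	uint8_t peer_gid[40];
	_Alignas(8) uint64_t net_cookie;	/* forces padding before it */
};

/* Simplified stand-ins for struct smc_diag_lgrinfo (assumed layout). */
struct lgrinfo_old {
	struct linkinfo_old lnk[1];
	uint8_t role;
	uint8_t pnet_id[SMC_MAX_PNETID_LEN];
};

struct lgrinfo_new {
	struct linkinfo_new lnk[1];
	uint8_t role;
	uint8_t pnet_id[SMC_MAX_PNETID_LEN];
};

/* Offset at which an old binary expects "role". */
size_t role_offset_old(void)
{
	return offsetof(struct lgrinfo_old, role);
}

/* Offset at which a kernel using the new layout writes "role". */
size_t role_offset_new(void)
{
	return offsetof(struct lgrinfo_new, role);
}
```

An old binary that reads "role" at role_offset_old() from a buffer filled
with the new layout ends up reading padding or net_cookie bytes instead,
which is the garbage the commit message refers to.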

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] Partially revert "net/smc: Add netlink net namespace support"
  2022-02-02  3:09       ` [PATCH] Partially revert "net/smc: Add netlink net namespace support" Dmitry V. Levin
@ 2022-02-02  7:26         ` Karsten Graul
  2022-02-09  9:43         ` Tony Lu
  1 sibling, 0 replies; 15+ messages in thread
From: Karsten Graul @ 2022-02-02  7:26 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Tony Lu, kuba, davem, netdev, linux-s390, linux-rdma, linux-api

On 02/02/2022 04:09, Dmitry V. Levin wrote:
> The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc503b4
> ("net/smc: Add netlink net namespace support") introduced an ABI
> regression: since struct smc_diag_lgrinfo contains an object of
> type "struct smc_diag_linkinfo", offset of all subsequent members
> of struct smc_diag_lgrinfo was changed by that change.
> 
> As result, applications compiled with the old version
> of struct smc_diag_linkinfo will receive garbage in
> struct smc_diag_lgrinfo.role if the kernel implements
> this new version of struct smc_diag_linkinfo.
> 
> Fix this regression by reverting the part of commit 79d39fc503b4 that
> changes struct smc_diag_linkinfo.  After all, there is SMC_GEN_NETLINK
> interface which is good enough, so there is probably no need to touch
> the smc_diag ABI in the first place.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>

Thank you Dmitry.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Partially revert "net/smc: Add netlink net namespace support"
  2022-02-02  3:09       ` [PATCH] Partially revert "net/smc: Add netlink net namespace support" Dmitry V. Levin
  2022-02-02  7:26         ` Karsten Graul
@ 2022-02-09  9:43         ` Tony Lu
  1 sibling, 0 replies; 15+ messages in thread
From: Tony Lu @ 2022-02-09  9:43 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Karsten Graul, kuba, davem, netdev, linux-s390, linux-rdma, linux-api

On Wed, Feb 02, 2022 at 06:09:04AM +0300, Dmitry V. Levin wrote:
> The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc503b4
> ("net/smc: Add netlink net namespace support") introduced an ABI
> regression: since struct smc_diag_lgrinfo contains an object of
> type "struct smc_diag_linkinfo", offset of all subsequent members
> of struct smc_diag_lgrinfo was changed by that change.
> 
> As result, applications compiled with the old version
> of struct smc_diag_linkinfo will receive garbage in
> struct smc_diag_lgrinfo.role if the kernel implements
> this new version of struct smc_diag_linkinfo.
> 
> Fix this regression by reverting the part of commit 79d39fc503b4 that
> changes struct smc_diag_linkinfo.  After all, there is SMC_GEN_NETLINK
> interface which is good enough, so there is probably no need to touch
> the smc_diag ABI in the first place.
> 
> Fixes: 79d39fc503b4 ("net/smc: Add netlink net namespace support")
> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>

Thank you and Karsten.

It was my negligence that caused the ABI incompatibility issue.
I will look into fixing it completely. We are also starting to build
smc-tools and other userspace tests to catch potential ABI modifications.

Best regards,
Tony Lu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] RDMA device net namespace support for SMC
  2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
                   ` (4 preceding siblings ...)
  2022-01-02 12:20 ` [PATCH 0/4] RDMA device net namespace support for SMC patchwork-bot+netdevbpf
@ 2022-02-17 11:33 ` Niklas Schnelle
  2022-02-21  6:54   ` Tony Lu
  5 siblings, 1 reply; 15+ messages in thread
From: Niklas Schnelle @ 2022-02-17 11:33 UTC (permalink / raw)
  To: Tony Lu
  Cc: netdev, linux-s390, linux-rdma, kgraul, Wenjia Zhang, Stefan Raspl

On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> This patch set introduces net namespace support for linkgroups.
> 
> Path 1 is the main approach to implement net ns support.
> 
> Path 2 - 4 are the additional modifications to let us know the netns.
> Also, I will submit changes of smc-tools to github later.
> 
> Currently, smc doesn't support net namespace isolation. The ibdevs
> registered to smc are shared for all linkgroups and connections. When
> running applications in different net namespaces, such as container
> environment, applications should only use the ibdevs that belongs to the
> same net namespace.
> 
> This adds a new field, net, in smc linkgroup struct. During first
> contact, it checks and find the linkgroup has same net namespace, if
> not, it is going to create and initialized the net field with first
> link's ibdev net namespace. When finding the rdma devices, it also checks
> the sk net device's and ibdev's net namespaces. After net namespace
> destroyed, the net device and ibdev move to root net namespace,
> linkgroups won't be matched, and wait for lgr free.
> 
> If rdma net namespace exclusive mode is not enabled, it behaves as
> before.
> 
> Steps to enable and test net namespaces:
> 
> 1. enable RDMA device net namespace exclusive support
> 	rdma system set netns exclusive # default is shared
> 
> 2. create new net namespace, move and initialize them
> 	ip netns add test1 
> 	rdma dev set mlx5_1 netns test1
> 	ip link set dev eth2 netns test1
> 	ip netns exec test1 ip link set eth2 up
> 	ip netns exec test1 ip addr add ${HOST_IP}/26 dev eth2
> 
> 3. setup server and client, connect N <-> M
> 	ip netns exec test1 smc_run sockperf server --tcp # server
> 	ip netns exec test1 smc_run sockperf pp --tcp -i ${SERVER_IP} # client
> 
> 4. netns isolated linkgroups (2 * 2 mesh) with their own linkgroups
>   - server

Hi Tony,

I'm having a bit of trouble getting this to work for me and was
wondering if you could test my scenario or help me figure out what's
wrong.

I'm using network namespacing to be able to test traffic between two
VFs of the same card/port with a single Linux system. By having one VF
in each of a client and server namespace, traffic doesn't shortcut via
loopback. This works great for TCP and with "rdma system set netns
exclusive" I can also verify that RDMA with "qperf -cm1 ... rc_bw" only
works once the respective RDMA device is also added to each namespace.

When I try the same with SMC-R, I use:

  ip netns exec server smc_run qperf &
  ip netns exec client smc_run qperf <ip_server> tcp_bw

With that however I only see fallback TCP connections in "ip netns exec
client watch smc_dbg". It doesn't seem to be an "smc_dbg" problem
either since the performance with and without smc_run is the same. I
also do have the same PNET_ID set on the interfaces.

As an aside do you know how to gracefully put the RDMA devices back
into the default namespace? For network interfaces I can use "ip -n
<ns> link set dev <iface> netns 1" but the equivalent "ip netns exec
<ns> rdma dev set <rdmadev> netns 1" doesn't work because there is no
PID variant. Deleting the namespace and killing processes using the
RDMA device does seem to get it back but with some delay.

Thanks,
Niklas


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] RDMA device net namespace support for SMC
  2022-02-17 11:33 ` Niklas Schnelle
@ 2022-02-21  6:54   ` Tony Lu
  2022-02-21 15:30     ` Niklas Schnelle
  0 siblings, 1 reply; 15+ messages in thread
From: Tony Lu @ 2022-02-21  6:54 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: netdev, linux-s390, linux-rdma, kgraul, Wenjia Zhang, Stefan Raspl

On Thu, Feb 17, 2022 at 12:33:06PM +0100, Niklas Schnelle wrote:
> On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> > This patch set introduces net namespace support for linkgroups.
> > 
> > Path 1 is the main approach to implement net ns support.
> > 
> > Path 2 - 4 are the additional modifications to let us know the netns.
> > Also, I will submit changes of smc-tools to github later.
> > 
> > Currently, smc doesn't support net namespace isolation. The ibdevs
> > registered to smc are shared for all linkgroups and connections. When
> > running applications in different net namespaces, such as container
> > environment, applications should only use the ibdevs that belongs to the
> > same net namespace.
> > 
> > This adds a new field, net, in smc linkgroup struct. During first
> > contact, it checks and find the linkgroup has same net namespace, if
> > not, it is going to create and initialized the net field with first
> > link's ibdev net namespace. When finding the rdma devices, it also checks
> > the sk net device's and ibdev's net namespaces. After net namespace
> > destroyed, the net device and ibdev move to root net namespace,
> > linkgroups won't be matched, and wait for lgr free.
> > 
> > If rdma net namespace exclusive mode is not enabled, it behaves as
> > before.
> > 
> > Steps to enable and test net namespaces:
> > 
> > 1. enable RDMA device net namespace exclusive support
> > 	rdma system set netns exclusive # default is shared
> > 
> > 2. create new net namespace, move and initialize them
> > 	ip netns add test1 
> > 	rdma dev set mlx5_1 netns test1
> > 	ip link set dev eth2 netns test1
> > 	ip netns exec test1 ip link set eth2 up
> > 	ip netns exec test1 ip addr add ${HOST_IP}/26 dev eth2
> > 
> > 3. setup server and client, connect N <-> M
> > 	ip netns exec test1 smc_run sockperf server --tcp # server
> > 	ip netns exec test1 smc_run sockperf pp --tcp -i ${SERVER_IP} # client
> > 
> > 4. netns isolated linkgroups (2 * 2 mesh) with their own linkgroups
> >   - server
> 
> Hi Tony,
> 
> I'm having a bit of trouble getting this to work for me and was
> wondering if you could test my scenario or help me figure out what's
> wrong.
> 
> I'm using network namespacing to be able to test traffic between two
> VFs of the same card/port with a single Linux system. By having one VF
> in each of a client and server namespace, traffic doesn't shortcut via
> loopback. This works great for TCP and with "rdma system set netns
> exclusive" I can also verify that RDMA with "qperf -cm1 ... rc_bw" only
> works once the respective RDMA device is also added to each namespace.
> 
> When I try the same with SMC-R I tried:
> 
>   ip netns exec server smc_run qperf &
>   ip netns exec client smc_run qperf <ip_server> tcp_bw
> 
> With that however I only see fallback TCP connections in "ip netns exec
> client watch smc_dbg". It doesn't seem to be an "smc_dbg" problem
> either since the performance with and without smc_run is the same. I
> also do have the same PNET_ID set on the interfaces.

Hi Niklas,

I understand your problem: the connection falls back to TCP for an
unknown reason. If you can find the fallback reason of this connection,
it will help us find the root cause. For example, if SMC_CLC_DECL_MEM
(0x01010000) is reported for this connection, it means that there was
not enough memory (smc_init_info, sndbuf, RMB, proposal buf, clc msg).

Until you have the fallback reason, here are some possible causes, based
on your environment, that you can check:

- RDMA device availability in netns. Run "ip netns exec server rdma dev"
  to check the RDMA devices in both server and client. If exclusive mode
  is set, each netns should show different devices.
- SMC-R device availability in netns. Run "ip netns exec server smcr d"
  to check the list of available SMC devices. A device can be used by
  this netns only if its eth name is shown in the list. smc-tools matches
  each ethernet NIC to its RDMA device and can only find the names of eth
  NICs in this netns, so the name is empty if the eth NIC doesn't belong
  to this netns.

  Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
                  mlx5_0      1    ACTIVE  RoCE_Express2   No       0
  eth2            mlx5_1      1    ACTIVE  RoCE_Express2   No       0

  This output shows we have ONE available RDMA device in this netns.
- Misc checks, such as memory usage, loopback connections and so on.
  Also, you can check dmesg for device operations if you moved the RDMA
  device to another netns. Every device operation is logged in dmesg.

  # SMC module init, adds two RDMA devices.
  [  +0.000512] smc: adding ib device mlx5_0 with port count 1
  [  +0.000534] smc:    ib device mlx5_0 port 1 has pnetid
  [  +0.000516] smc: adding ib device mlx5_1 with port count 1
  [  +0.000525] smc:    ib device mlx5_1 port 1 has pnetid

  # Move one RDMA device to another netns.
  [Feb21 14:16] smc: removing ib device mlx5_1
  [  +0.015723] smc: adding ib device mlx5_1 with port count 1
  [  +0.000600] smc:    ib device mlx5_1 port 1 has pnetid

> As an aside do you know how to gracefully put the RDMA devices back
> into the default namespace? For network interfaces I can use "ip -n
> <ns> link set dev <iface> netns 1" but the equivalent "ip netns exec
> <ns> rdma dev set <rdmadev> netns 1" doesn't work because there is no
> PID variant. Deleting the namespace and killing processes using the
> RDMA device does seem to get it back but with some delay.

Yes, if you just remove the net namespace, we need to wait for all the
connections to shut down, because every sock holds a refcnt on this netns.

I haven't moved a device back gracefully before, because in our setup the
lifetime of the containers is as long as that of the RDMA device. But you
reminded me of this, and after reading the implementation in iproute2, I
believe it's because iproute2 doesn't implement this (based on nsid) for
RDMA devices.

RDMA core provides RDMA_NLDEV_NET_NS_FD in netlink, but iproute2 only
handles a name (string) in this function, that is, a named netns that
was created by the ip command beforehand.

// iproute2/rdma/dev.c
static int dev_set_netns(struct rd *rd)
{
	char *netns_path;
	uint32_t seq;
	int netns;
	int ret;

	if (rd_no_arg(rd)) {
		pr_err("Please provide device name.\n");
		return -EINVAL;
	}

	// netns_path refers to a file created earlier by the ip command.
	// It is located in /var/run/netns/{NS_NAME}, such as
	// /var/run/netns/server.
	if (asprintf(&netns_path, "%s/%s", NETNS_RUN_DIR, rd_argv(rd)) < 0)
		return -ENOMEM;

	netns = open(netns_path, O_RDONLY | O_CLOEXEC);
	if (netns < 0) {
		fprintf(stderr, "Cannot open network namespace \"%s\": %s\n",
			rd_argv(rd), strerror(errno));
		ret = -EINVAL;
		goto done;
	}

	rd_prepare_msg(rd, RDMA_NLDEV_CMD_SET,
		       &seq, (NLM_F_REQUEST | NLM_F_ACK));
	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);

	// The target netns is identified by the fd opened above.
	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_NET_NS_FD, netns);
	ret = rd_sendrecv_msg(rd, seq);
	close(netns);
done:
	free(netns_path);
	return ret;
}

I don't know if there are other tools that can do this for RDMA devices.
But we can do it by sending the netlink message ourselves and setting
RDMA_NLDEV_NET_NS_FD to an fd of the desired netns, such as
/proc/1/ns/net.
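
As a rough sketch of the missing piece, the path handed to open() could be
chosen from either a PID or a netns name. The helper below is
hypothetical and not part of iproute2; it only builds the path and treats
an all-digit argument as a PID, which is an assumption of this sketch (a
named netns consisting only of digits would be misread).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define NETNS_RUN_DIR "/var/run/netns"

/*
 * Hypothetical helper, not part of iproute2: build the path of the
 * namespace file for either a numeric PID or a named netns.
 * Returns 0 on success, -1 if arg is empty or buf is too small.
 */
int netns_arg_to_path(const char *arg, char *buf, size_t len)
{
	int is_pid = 1;
	int n;

	if (*arg == '\0')
		return -1;

	/* An argument made up only of digits is treated as a PID. */
	for (const char *p = arg; *p; p++) {
		if (!isdigit((unsigned char)*p)) {
			is_pid = 0;
			break;
		}
	}

	if (is_pid)
		n = snprintf(buf, len, "/proc/%s/ns/net", arg);
	else
		n = snprintf(buf, len, "%s/%s", NETNS_RUN_DIR, arg);

	/* snprintf reports truncation by returning >= len. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting path would then be opened with O_RDONLY | O_CLOEXEC and the
fd placed in RDMA_NLDEV_NET_NS_FD, in the same way dev_set_netns() above
does for named namespaces.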

Hope this information can help you.

Best regards,
Tony Lu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] RDMA device net namespace support for SMC
  2022-02-21  6:54   ` Tony Lu
@ 2022-02-21 15:30     ` Niklas Schnelle
  2022-02-25  6:49       ` Tony Lu
  0 siblings, 1 reply; 15+ messages in thread
From: Niklas Schnelle @ 2022-02-21 15:30 UTC (permalink / raw)
  To: Tony Lu
  Cc: netdev, linux-s390, linux-rdma, kgraul, Wenjia Zhang, Stefan Raspl

On Mon, 2022-02-21 at 14:54 +0800, Tony Lu wrote:
> On Thu, Feb 17, 2022 at 12:33:06PM +0100, Niklas Schnelle wrote:
> > On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> > 
---8<---
> > Hi Tony,
> > 
> > I'm having a bit of trouble getting this to work for me and was
> > wondering if you could test my scenario or help me figure out what's
> > wrong.
> > 
> > I'm using network namespacing to be able to test traffic between two
> > VFs of the same card/port with a single Linux system. By having one VF
> > in each of a client and server namespace, traffic doesn't shortcut via
> > loopback. This works great for TCP and with "rdma system set netns
> > exclusive" I can also verify that RDMA with "qperf -cm1 ... rc_bw" only
> > works once the respective RDMA device is also added to each namespace.
> > 
> > When I try the same with SMC-R I tried:
> > 
> >   ip netns exec server smc_run qperf &
> >   ip netns exec client smc_run qperf <ip_server> tcp_bw
> > 
> > With that however I only see fallback TCP connections in "ip netns exec
> > client watch smc_dbg". It doesn't seem to be an "smc_dbg" problem
> > either since the performance with and without smc_run is the same. I
> > also do have the same PNET_ID set on the interfaces.
> 
> Hi Niklas,
> 
> I understood your problem. This connection falls back to TCP for unknown
> reasons. You can find out the fallback reason of this connection. It can
> help us find out the root cause of fallbacks. For example,
> if SMC_CLC_DECL_MEM (0x01010000) is occurred in this connection, it
> means that there is no enough memory (smc_init_info, sndbuf, RMB,
> proposal buf, clc msg).

Regarding the fallback reason: it seems to be that the RDMA device is
not found (0x03030000). In smc_dbg I see the following lines:

Server:
State          UID   Inode   Local Address           Peer Address            Intf Mode Shutd Token    Sndbuf ..
LISTEN         00000 0103804 0.0.0.0:37373
ACTIVE         00000 0112895 ::ffff:10.10.93..:46093 ::ffff:10.10.93..:54474 0000 TCP 0x03030000
ACTIVE         00000 0112701 ::ffff:10.10.93..:19765 ::ffff:10.10.93..:51934 0000 TCP 0x03030000
LISTEN         00000 0112699 0.0.0.0:19765

Client:
State          UID   Inode   Local Address           Peer Address            Intf Mode Shutd Token    Sndbuf ...
ACTIVE         00000 0116203 10.10.93.11:54474       10.10.93.12:46093       0000 TCP 0x05000000/0x03030000
ACTIVE         00000 0116201 10.10.93.11:51934       10.10.93.12:19765       0000 TCP 0x05000000/0x03030000


However this doesn't match what I'm seeing in the other commands below

> 
> Before you giving out the fallback reason, based on your environment,
> this are some potential possibilities. You can check this list:
> 
> - RDMA device availability in netns. Run "ip netns exec server rdma dev"
>   to check RDMA device in both server/client. If exclusive mode is setted,
>   it should have different devices in different netns.

I get the following output that looks as expected to me:

Server:
2: roceP9p0s0: node_type ca fw 14.25.1020 node_guid 1d82:ff9b:1bfe:2c28 sys_image_guid 282c:001b:9b03:9803
Client:
4: roceP11p0s0: node_type ca fw 14.25.1020 node_guid 0982:ff9b:63fe:64e7 sys_image_guid e764:0063:9b03:9803


> - SMC-R device availability in netns. Run "ip netns exec server smcr d"
>   to check SMC device available list. Only if we have eth name in the
>   list, it can access by this netns. smc-tools matches ethernet NIC and
>   RDMA device, it can only find the name of eth nic in this netns, so
>   there is no name if this eth nic doesn't belong to this netns.
> 
>   Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
>                   mlx5_0      1    ACTIVE  RoCE_Express2   No       0
>   eth2            mlx5_1      1    ACTIVE  RoCE_Express2   No       0
> 
>   This output shows we have ONE available RDMA device in this netns.

Here too things look good to me:

Server:

Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
...
          roceP12p    1    ACTIVE  RoCE_Express2   No       0  NET26
          roceP11p    1    ACTIVE  RoCE_Express2   No       0  NET25
ens2076         roceP9p0    1    ACTIVE  RoCE_Express2   No       0  NET25

Client:

Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
...
          roceP12p    1    ACTIVE  RoCE_Express2   No       0  NET26
ens1296         roceP11p    1    ACTIVE  RoCE_Express2   No       0  NET25
          roceP9p0    1    ACTIVE  RoCE_Express2   No       0  NET25

And I again confirmed that a pure RDMA workload ("qperf -cm1 ... rc_bw")
works with the RDMA namespacing set to exclusive but only if I add the
RDMA devices to the namespaces. I do wonder why the other RDMA devices are still
visible in the above output though?

> - Misc checks, such as memory usage, loop back connection and so on.
>   Also, you can check dmesg for device operations if you moved netns of
>   RDMA device. Every device's operation will log in dmesg.
> 
>   # SMC module init, adds two RDMA device.
>   [  +0.000512] smc: adding ib device mlx5_0 with port count 1
>   [  +0.000534] smc:    ib device mlx5_0 port 1 has pnetid
>   [  +0.000516] smc: adding ib device mlx5_1 with port count 1
>   [  +0.000525] smc:    ib device mlx5_1 port 1 has pnetid
> 
>   # Move one RDMA device to another netns.
>   [Feb21 14:16] smc: removing ib device mlx5_1
>   [  +0.015723] smc: adding ib device mlx5_1 with port count 1
>   [  +0.000600] smc:    ib device mlx5_1 port 1 has pnetid

There is no memory pressure and SMC-R between two systems works.

I also see the smc add/remove messages in dmesg as you describe:

smc: removing ib device roceP11p0s0
smc: adding ib device roceP11p0s0 with port count 1
smc:    ib device roceP11p0s0 port 1 has pnetid NET25
smc: removing ib device roceP9p0s0
smc: adding ib device roceP9p0s0 with port count 1
smc:    ib device roceP9p0s0 port 1 has pnetid NET25
mlx5_core 000b:00:00.0 ens1296: Link up
mlx5_core 0009:00:00.0 ens2076: Link up
IPv6: ADDRCONF(NETDEV_CHANGE): ens2076: link becomes ready
smc: removing ib device roceP11p0s0
smc: adding ib device roceP11p0s0 with port count 1
smc:    ib device roceP11p0s0 port 1 has pnetid NET25
mlx5_core 000b:00:00.0 ens1296: Link up
mlx5_core 0009:00:00.0 ens2076: Link up
smc: removing ib device roceP9p0s0
smc: adding ib device roceP9p0s0 with port count 1
smc:    ib device roceP9p0s0 port 1 has pnetid NET25
IPv6: ADDRCONF(NETDEV_CHANGE): ens1296: link becomes ready

(The PCI addresses and resulting names are normal for s390)

One thing I notice is that you don't seem to have a pnetid set
in your output. Did you redact those or are you dealing differently
with PNETIDs? Maybe there is an issue with matching PNETIDs between
RDMA devices and network devices when namespaced?

I also tested with smc_chk instead of qperf to make sure it's not a
problem with LD_PRELOAD or anything like that. With that it simply
doesn't connect. 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] RDMA device net namespace support for SMC
  2022-02-21 15:30     ` Niklas Schnelle
@ 2022-02-25  6:49       ` Tony Lu
  0 siblings, 0 replies; 15+ messages in thread
From: Tony Lu @ 2022-02-25  6:49 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: netdev, linux-s390, linux-rdma, kgraul, Wenjia Zhang, Stefan Raspl

On Mon, Feb 21, 2022 at 04:30:32PM +0100, Niklas Schnelle wrote:
> On Mon, 2022-02-21 at 14:54 +0800, Tony Lu wrote:
> > On Thu, Feb 17, 2022 at 12:33:06PM +0100, Niklas Schnelle wrote:
> > > On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> > > 
> ---8<---
> > > Hi Tony,
> > > 
> > > I'm having a bit of trouble getting this to work for me and was
> > > wondering if you could test my scenario or help me figure out what's
> > > wrong.
> > > 
> > > I'm using network namespacing to be able to test traffic between two
> > > VFs of the same card/port with a single Linux system. By having one VF
> > > in each of a client and server namespace, traffic doesn't shortcut via
> > > loopback. This works great for TCP and with "rdma system set netns
> > > exclusive" I can also verify that RDMA with "qperf -cm1 ... rc_bw" only
> > > works once the respective RDMA device is also added to each namespace.
> > > 
> > > When I try the same with SMC-R I tried:
> > > 
> > >   ip netns exec server smc_run qperf &
> > >   ip netns exec client smc_run qperf <ip_server> tcp_bw
> > > 
> > > With that however I only see fallback TCP connections in "ip netns exec
> > > client watch smc_dbg". It doesn't seem to be an "smc_dbg" problem
> > > either since the performance with and without smc_run is the same. I
> > > also do have the same PNET_ID set on the interfaces.
> > 
> > Hi Niklas,
> > 
> > I understood your problem. This connection falls back to TCP for unknown
> > reasons. You can find out the fallback reason of this connection. It can
> > help us find out the root cause of fallbacks. For example,
> > if SMC_CLC_DECL_MEM (0x01010000) is occurred in this connection, it
> > means that there is no enough memory (smc_init_info, sndbuf, RMB,
> > proposal buf, clc msg).
> 
> Regarding fallback reason. It seems to be that the RDMA device is not
> found (0x03030000) in smd_dbg on I see the following lines:
> 
> Server:
> State          UID   Inode   Local Address           Peer Address            Intf Mode Shutd Token    Sndbuf ..
> LISTEN         00000 0103804 0.0.0.0:37373
> ACTIVE         00000 0112895 ::ffff:10.10.93..:46093 ::ffff:10.10.93..:54474 0000 TCP 0x03030000
> ACTIVE         00000 0112701 ::ffff:10.10.93..:19765 ::ffff:10.10.93..:51934 0000 TCP 0x03030000
> LISTEN         00000 0112699 0.0.0.0:19765
> 
> Client:
> State          UID   Inode   Local Address           Peer Address            Intf Mode Shutd Token    Sndbuf ...
> ACTIVE         00000 0116203 10.10.93.11:54474       10.10.93.12:46093       0000 TCP 0x05000000/0x03030000
> ACTIVE         00000 0116201 10.10.93.11:51934       10.10.93.12:19765       0000 TCP 0x05000000/0x03030000
> 
> 
> However this doesn't match what I'm seeing in the other commands below

Based on the fallback reason, the server did not find a suitable RDMA
device to start with, so it fell back.
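
For reading the codes in the quoted smc_dbg output, a small lookup like
the one below may help. It is only a sketch: SMC_CLC_DECL_MEM and the
value 0x03030000 come from this thread, while the macro names
SMC_CLC_DECL_NOSMCDEV and SMC_CLC_DECL_PEERDECL and the meaning of
0x05000000 are recalled from net/smc/smc_clc.h and should be checked
against the kernel tree in use. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values quoted in this thread or recalled from net/smc/smc_clc.h. */
#define SMC_CLC_DECL_MEM	0x01010000u /* insufficient memory resources */
#define SMC_CLC_DECL_NOSMCDEV	0x03030000u /* no SMC device found */
#define SMC_CLC_DECL_PEERDECL	0x05000000u /* peer declined during handshake */

/* Hypothetical helper: map a fallback reason code to a short text. */
const char *smc_decl_reason_str(uint32_t code)
{
	switch (code) {
	case SMC_CLC_DECL_MEM:
		return "insufficient memory resources";
	case SMC_CLC_DECL_NOSMCDEV:
		return "no SMC device found";
	case SMC_CLC_DECL_PEERDECL:
		return "peer declined during handshake";
	default:
		return "unknown (see net/smc/smc_clc.h)";
	}
}
```

Read this way, the client line "0x05000000/0x03030000" would mean that
the peer declined and reported that it found no SMC device, which matches
the 0x03030000 shown on the server side.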

> > 
> > Before you giving out the fallback reason, based on your environment,
> > this are some potential possibilities. You can check this list:
> > 
> > - RDMA device availability in netns. Run "ip netns exec server rdma dev"
> >   to check RDMA device in both server/client. If exclusive mode is setted,
> >   it should have different devices in different netns.
> 
> I get the following output that looks as expected to me:
> 
> Server:
> 2: roceP9p0s0: node_type ca fw 14.25.1020 node_guid 1d82:ff9b:1bfe:2c28 sys_image_guid 282c:001b:9b03:9803
> Client:
> 4: roceP11p0s0: node_type ca fw 14.25.1020 node_guid 0982:ff9b:63fe:64e7 sys_image_guid e764:0063:9b03:9803

It looks good for now.

> 
> > - SMC-R device availability in netns. Run "ip netns exec server smcr d"
> >   to check SMC device available list. Only if we have eth name in the
> >   list, it can access by this netns. smc-tools matches ethernet NIC and
> >   RDMA device, it can only find the name of eth nic in this netns, so
> >   there is no name if this eth nic doesn't belong to this netns.
> > 
> >   Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
> >                   mlx5_0      1    ACTIVE  RoCE_Express2   No       0
> >   eth2            mlx5_1      1    ACTIVE  RoCE_Express2   No       0
> > 
> >   This output shows we have ONE available RDMA device in this netns.
> 
> Here too things look good to me:
> 
> Server:
> 
> Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
> ...
>           roceP12p    1    ACTIVE  RoCE_Express2   No       0  NET26
>           roceP11p    1    ACTIVE  RoCE_Express2   No       0  NET25
> ens2076         roceP9p0    1    ACTIVE  RoCE_Express2   No       0  NET25
> 
> Client:
> 
> Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
> ...
>           roceP12p    1    ACTIVE  RoCE_Express2   No       0  NET26
> ens1296         roceP11p    1    ACTIVE  RoCE_Express2   No       0  NET25
>           roceP9p0    1    ACTIVE  RoCE_Express2   No       0  NET25
> 
> And I again confirmed that a pure RDMA workload ("qperf -cm1 ... rc_bw")
> works with the RDMA namespacing set to exclusive but only if I add the
> RDMA devices to the namespaces. I do wonder why the other RDMA devices are still
> visible in the above output though?

SMC maintains its own list of ib devices, which is separate from what the
rdma command shows. SMC registers handlers for ib devices; when an ib
device is removed or added, an event is triggered and SMC removes the
device from, or adds it to, its list. "smcr d" dumps the whole list and
does not filter it by netns.

> > - Misc checks, such as memory usage, loop back connection and so on.
> >   Also, you can check dmesg for device operations if you moved netns of
> >   RDMA device. Every device's operation will log in dmesg.
> > 
> >   # SMC module init, adds two RDMA device.
> >   [  +0.000512] smc: adding ib device mlx5_0 with port count 1
> >   [  +0.000534] smc:    ib device mlx5_0 port 1 has pnetid
> >   [  +0.000516] smc: adding ib device mlx5_1 with port count 1
> >   [  +0.000525] smc:    ib device mlx5_1 port 1 has pnetid
> > 
> >   # Move one RDMA device to another netns.
> >   [Feb21 14:16] smc: removing ib device mlx5_1
> >   [  +0.015723] smc: adding ib device mlx5_1 with port count 1
> >   [  +0.000600] smc:    ib device mlx5_1 port 1 has pnetid
> 
> There is no memory pressure and SMC-R between two systems works.
> 
> I also see the smc add/remove messages in dmesg as you describe:
> 
> smc: removing ib device roceP11p0s0
> smc: adding ib device roceP11p0s0 with port count 1
> smc:    ib device roceP11p0s0 port 1 has pnetid NET25

It looks like s390 provides the pnetid itself. Other systems don't
implement that, so the user has to set the pnetid. Your dmesg shows that
you get the pnetid directly without setting it.

> smc: removing ib device roceP9p0s0
> smc: adding ib device roceP9p0s0 with port count 1
> smc:    ib device roceP9p0s0 port 1 has pnetid NET25
> mlx5_core 000b:00:00.0 ens1296: Link up
> mlx5_core 0009:00:00.0 ens2076: Link up
> IPv6: ADDRCONF(NETDEV_CHANGE): ens2076: link becomes ready
> smc: removing ib device roceP11p0s0
> smc: adding ib device roceP11p0s0 with port count 1
> smc:    ib device roceP11p0s0 port 1 has pnetid NET25
> mlx5_core 000b:00:00.0 ens1296: Link up
> mlx5_core 0009:00:00.0 ens2076: Link up
> smc: removing ib device roceP9p0s0
> smc: adding ib device roceP9p0s0 with port count 1
> smc:    ib device roceP9p0s0 port 1 has pnetid NET25
> IPv6: ADDRCONF(NETDEV_CHANGE): ens1296: link becomes ready
> 
> (The PCI addresses and resulting names are normal for s390)
> 
> One thing I notice is that you don't seem to have a pnetid set
> in your output, did you redact those or are you dealing differently
> with PNETIDs? Maybe there is an issue with matching PNETIDs betwen
> RDMA devices and network devices when namespaced?

It works okay when I set pnetids in different netns; the pnet handling
logic is untouched in my test environment.

$ ip netns exec test1 smcr d # mlx5_1 with pnetid TEST1
Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
                mlx5_0      1    ACTIVE  RoCE_Express2   No       0  *TEST0
eth2            mlx5_1      1    ACTIVE  RoCE_Express2  Yes       1  *TEST1

$ ip netns exec test1 smcss # runs in mode SMCR
State          UID   Inode   Local Address           Peer Address            Intf Mode
ACTIVE         00993 0045755 11.213.45.7:8091        11.213.45.19:48884      0000 SMCR

Based on the dmesg output and the fallback reason, you can check whether
the eth and ib devices are added to the pnetlist correctly. SMC tries to
find a suitable RDMA device in the pnetlist, matched by pnetid.
Currently, the pnettable is per-netns, so the entries should be added in
the current netns.

If the arch doesn't enable CONFIG_HAVE_PNETID (s390 enables it), SMC
tries to use the handshake device when the pnetlist is empty. Otherwise
it tries to find a device in the pnetlist by pnetid; with an empty
pnetlist no RDMA device is found, and the connection falls back to TCP.
So the default behavior differs when the list is empty.
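
The behavior described in the paragraph above can be written down as a
small decision function. This is a simplified model of that description
only, not the kernel code: the enum, the function name and the three
boolean inputs are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative outcomes; these names do not exist in the kernel. */
enum smc_dev_choice {
	USE_HANDSHAKE_DEV,	/* use the device the TCP handshake ran on */
	USE_PNETID_MATCH,	/* use the device matched by pnetid */
	FALLBACK_TO_TCP,	/* no RDMA device found */
};

/*
 * arch_has_pnetid: CONFIG_HAVE_PNETID is enabled (as on s390)
 * pnetlist_empty:  the pnetlist of the current netns has no entries
 * pnetid_match:    a pnetlist entry matches the pnetid of the netdev
 */
enum smc_dev_choice choose_rdma_dev(bool arch_has_pnetid,
				    bool pnetlist_empty,
				    bool pnetid_match)
{
	/* Without arch support, an empty list means: handshake device. */
	if (!arch_has_pnetid && pnetlist_empty)
		return USE_HANDSHAKE_DEV;

	/* Otherwise the device has to be found in the list by pnetid. */
	if (!pnetlist_empty && pnetid_match)
		return USE_PNETID_MATCH;

	return FALLBACK_TO_TCP;
}
```

In this model, a kernel with CONFIG_HAVE_PNETID and an empty pnetlist in
the namespace ends up in FALLBACK_TO_TCP, while a kernel without it would
use the handshake device in the same situation.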

After investigating the pnet logic, I found some things that could be
improved in the original implementation and that are outside the scope
of this netns patch set, such as the restriction to init_net in
pnet_enter and remove. I will start a discussion if needed.

Thanks,
Tony Lu

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-02-25  6:49 UTC | newest]

Thread overview: 15+ messages
2021-12-28 13:06 [PATCH 0/4] RDMA device net namespace support for SMC Tony Lu
2021-12-28 13:06 ` [PATCH 1/4] net/smc: Introduce net namespace support for linkgroup Tony Lu
2021-12-28 13:06 ` [PATCH 2/4] net/smc: Add netlink net namespace support Tony Lu
2022-01-31  0:24   ` Dmitry V. Levin
2022-01-31 13:49     ` Karsten Graul
2022-02-02  3:09       ` [PATCH] Partially revert "net/smc: Add netlink net namespace support" Dmitry V. Levin
2022-02-02  7:26         ` Karsten Graul
2022-02-09  9:43         ` Tony Lu
2021-12-28 13:06 ` [PATCH 3/4] net/smc: Print net namespace in log Tony Lu
2021-12-28 13:06 ` [PATCH 4/4] net/smc: Add net namespace for tracepoints Tony Lu
2022-01-02 12:20 ` [PATCH 0/4] RDMA device net namespace support for SMC patchwork-bot+netdevbpf
2022-02-17 11:33 ` Niklas Schnelle
2022-02-21  6:54   ` Tony Lu
2022-02-21 15:30     ` Niklas Schnelle
2022-02-25  6:49       ` Tony Lu
