linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv2 0/6] Fix the problem that rxe can not work in net
@ 2022-10-06  8:59 yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 1/6] RDMA/rxe: Creating listening sock in newlink function yanjun.zhu
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@intel.com>

When run "ip link add" command to add a rxe rdma link in a net
namespace, normally this rxe rdma link can not work in a net
name space. 

The root cause is that a sock listening on udp port 4791 is created
in init_net when the rdma_rxe module is loaded into kernel. That is,
the sock listening on udp port 4791 is created in init_net. Other net
namespace is difficult to use this sock.

The following commits will solve this problem.

In the first commit, move the creating sock listening on udp port 4791
from module_init function to rdma link creating functions. That is,
after the module rdma_rxe is loaded, the sock will not be created.
When run "rdma link add ..." command, the sock will be created. So
when creating a rdma link in the net namespace, the sock will be
created in this net namespace.

In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
will check the sock exists in the net namespace or not. If yes, rdma
link will increase the reference count of this sock, then continue other
jobs instead of creating a new sock to listen on udp port 4791. Since the
network notifier is global, when the module rdma_rxe is loaded, this
notifier will be registered.

After the rdma link is created, the command "rdma link del" is to
delete rdma link at the same time the sock is checked. If the reference
count of this sock is greater than the sock reference count needed by
udp tunnel, the sock reference count is decreased by one. If equal, it
indicates that this rdma link is the last one. As such, the udp tunnel
is shut down and the sock is closed. The above work should be
implemented in linkdel function. But currently no dellink function in
rxe. So the 3rd commit addes dellink function pointer. And the 4th
commit implements the dellink function in rxe.

To now, it is not necessary to keep a global variable to store the sock
listening udp port 4791. This global variable can be replaced by the
functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
function udp6_lib_lookup is in the fast path, a member variable l_sk6
is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
to lookup the sock, then the sock is stored in l_sk6, in the future,it
can be used directly.

All the above work has been done in init_net. And it can also work in
the net namespace. So the init_net is replaced by the individual net
namespace. This is what the 6th commit does. Because rxe device is
dependent on the net device and the sock listening on udp port 4791,
every rxe device is in exclusive mode in the individual net namespace.
Other rdma netns operations will be considerred in the future.

Test steps:
1) Suppose that 2 NICs are in 2 different net namespaces.

 # ip netns exec net0 ip link
 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
    link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
    altname enp5s0

 # ip netns exec net1 ip link
 4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
    link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff

2) Add rdma link in the different net namespace
   net0:
   ip netns exec net0 rdma link add rxe0 type rxe netdev eno2

   net1:
   ip netns exec net1 rdma link add rxe1 type rxe netdev eno3

3) Run rping test.
   net0
   # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
   [1] 1737
   # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
   verbose
   count 1
   ...
   ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
   ...

4) Remove the rdma links from the net namespaces.
   net0:
   ip netns exec net0 rdma link del rxe0
   net1:
   ip netns exec net1 rdma link del rxe1

---
V1->V2: Add the explicit initialization of sk6.
---
Zhu Yanjun (6):
  RDMA/rxe: Creating listening sock in newlink function
  RDMA/rxe: Support more rdma links in init_net
  RDMA/nldev: Add dellink function pointer
  RDMA/rxe: Implement dellink in rxe
  RDMA/rxe: Replace global variable with sock lookup functions
  RDMA/rxe: add the support of net namespace

 drivers/infiniband/core/nldev.c       |   6 ++
 drivers/infiniband/sw/rxe/rxe.c       |  26 +++++-
 drivers/infiniband/sw/rxe/rxe_net.c   | 129 ++++++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_net.h   |   9 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |   1 +
 include/rdma/rdma_netlink.h           |   2 +
 6 files changed, 134 insertions(+), 39 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCHv2 1/6] RDMA/rxe: Creating listening sock in newlink function
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
@ 2022-10-06  8:59 ` yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 2/6] RDMA/rxe: Support more rdma links in init_net yanjun.zhu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally when the module rdma_rxe is loaded, the sock listening on udp
port 4791 is created. Currently moving the creating listening port to
newlink function.

So when running "rdma link add" command, the sock listening on udp port
4791 is created.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 51daac5c4feb..a22ff2207b42 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -192,6 +192,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
+	err = rxe_net_init();
+	if (err)
+		return err;
+
 	err = rxe_net_add(ibdev_name, ndev);
 	if (err) {
 		pr_err("failed to add %s\n", ndev->name);
@@ -210,10 +214,6 @@ static int __init rxe_module_init(void)
 {
 	int err;
 
-	err = rxe_net_init();
-	if (err)
-		return err;
-
 	rdma_link_register(&rxe_link_ops);
 	pr_info("loaded\n");
 	return 0;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv2 2/6] RDMA/rxe: Support more rdma links in init_net
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 1/6] RDMA/rxe: Creating listening sock in newlink function yanjun.zhu
@ 2022-10-06  8:59 ` yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 3/6] RDMA/nldev: Add dellink function pointer yanjun.zhu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem

From: Zhu Yanjun <yanjun.zhu@linux.dev>

In init_net, when several rdma links are created with the command "rdma
link add", newlink will check whether the udp port 4791 is listening or
not.
If not, creating a sock listening on udp port 4791. If yes, increasing the
reference count of the sock.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     |  9 ++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 55 +++++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_net.h |  1 +
 3 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index a22ff2207b42..84a07638f8df 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -8,6 +8,7 @@
 #include <net/addrconf.h>
 #include "rxe.h"
 #include "rxe_loc.h"
+#include "rxe_net.h"
 
 MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
 MODULE_DESCRIPTION("Soft RDMA transport");
@@ -205,7 +206,7 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
-static struct rdma_link_ops rxe_link_ops = {
+struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
 };
@@ -215,6 +216,12 @@ static int __init rxe_module_init(void)
 	int err;
 
 	rdma_link_register(&rxe_link_ops);
+	err = rxe_register_notifier();
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
 	pr_info("loaded\n");
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index c53f4529f098..4772ea19c6e2 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -620,13 +620,23 @@ static struct notifier_block rxe_net_notifier = {
 
 static int rxe_net_ipv4_init(void)
 {
-	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
-				htons(ROCE_V2_UDP_DPORT), false);
-	if (IS_ERR(recv_sockets.sk4)) {
-		recv_sockets.sk4 = NULL;
+	struct sock *sk;
+	struct socket *sock;
+
+	rcu_read_lock();
+	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
+			     htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (sk)
+		return 0;
+
+	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
+	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
+		recv_sockets.sk4 = NULL;
 		return -1;
 	}
+	recv_sockets.sk4 = sock;
 
 	return 0;
 }
@@ -634,24 +644,46 @@ static int rxe_net_ipv4_init(void)
 static int rxe_net_ipv6_init(void)
 {
 #if IS_ENABLED(CONFIG_IPV6)
+	struct sock *sk;
+	struct socket *sock;
+
+	rcu_read_lock();
+	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
+			     htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (sk)
+		return 0;
 
-	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
-						htons(ROCE_V2_UDP_DPORT), true);
-	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
+	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
+	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		recv_sockets.sk6 = NULL;
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
-	if (IS_ERR(recv_sockets.sk6)) {
+	if (IS_ERR(sock)) {
 		recv_sockets.sk6 = NULL;
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
+	recv_sockets.sk6 = sock;
 #endif
 	return 0;
 }
 
+int rxe_register_notifier(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&rxe_net_notifier);
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 void rxe_net_exit(void)
 {
 	rxe_release_udp_tunnel(recv_sockets.sk6);
@@ -663,19 +695,12 @@ int rxe_net_init(void)
 {
 	int err;
 
-	recv_sockets.sk6 = NULL;
-
 	err = rxe_net_ipv4_init();
 	if (err)
 		return err;
 	err = rxe_net_ipv6_init();
 	if (err)
 		goto err_out;
-	err = register_netdevice_notifier(&rxe_net_notifier);
-	if (err) {
-		pr_err("Failed to register netdev notifier\n");
-		goto err_out;
-	}
 	return 0;
 err_out:
 	rxe_net_exit();
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 45d80d00f86b..a222c3eeae12 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -18,6 +18,7 @@ struct rxe_recv_sockets {
 
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 
+int rxe_register_notifier(void);
 int rxe_net_init(void);
 void rxe_net_exit(void);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv2 3/6] RDMA/nldev: Add dellink function pointer
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 1/6] RDMA/rxe: Creating listening sock in newlink function yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 2/6] RDMA/rxe: Support more rdma links in init_net yanjun.zhu
@ 2022-10-06  8:59 ` yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 4/6] RDMA/rxe: Implement dellink in rxe yanjun.zhu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The newlink function pointer is added. And the sock listening on port 4791
is added in the newlink function. So the dellink function is needed to
remove the sock.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/core/nldev.c | 6 ++++++
 include/rdma/rdma_netlink.h     | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index b92358f606d0..ae0db4aced34 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -1744,6 +1744,12 @@ static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 	}
 
+	if (device->link_ops) {
+		err = device->link_ops->dellink(device);
+		if (err)
+			return err;
+	}
+
 	ib_unregister_device_and_put(device);
 	return 0;
 }
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index c2a79aeee113..bf9df004061f 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -5,6 +5,7 @@
 
 #include <linux/netlink.h>
 #include <uapi/rdma/rdma_netlink.h>
+#include <rdma/ib_verbs.h>
 
 enum {
 	RDMA_NLDEV_ATTR_EMPTY_STRING = 1,
@@ -114,6 +115,7 @@ struct rdma_link_ops {
 	struct list_head list;
 	const char *type;
 	int (*newlink)(const char *ibdev_name, struct net_device *ndev);
+	int (*dellink)(struct ib_device *dev);
 };
 
 void rdma_link_register(struct rdma_link_ops *ops);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv2 4/6] RDMA/rxe: Implement dellink in rxe
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
                   ` (2 preceding siblings ...)
  2022-10-06  8:59 ` [PATCHv2 3/6] RDMA/nldev: Add dellink function pointer yanjun.zhu
@ 2022-10-06  8:59 ` yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 5/6] RDMA/rxe: Replace global variable with sock lookup functions yanjun.zhu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem

From: Zhu Yanjun <yanjun.zhu@linux.dev>

When running "rdma link del" command, dellink function will be called.
If the sock refcnt is greater than the refcnt needed for udp tunnel,
the sock refcnt will be decreased by 1.

If equal, the last rdma link is deleted. The udp tunnel will be
destroyed.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     | 12 +++++++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 16 ++++++++++++++--
 drivers/infiniband/sw/rxe/rxe_net.h |  1 +
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 84a07638f8df..1b8b74ea84e9 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -166,10 +166,12 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
 /* called by ifc layer to create new rxe device.
  * The caller should allocate memory for rxe by calling ib_alloc_device.
  */
+static struct rdma_link_ops rxe_link_ops;
 int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name)
 {
 	rxe_init(rxe);
 	rxe_set_mtu(rxe, mtu);
+	rxe->ib_dev.link_ops = &rxe_link_ops;
 
 	return rxe_register_device(rxe, ibdev_name);
 }
@@ -206,9 +208,17 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
-struct rdma_link_ops rxe_link_ops = {
+static int rxe_dellink(struct ib_device *dev)
+{
+	rxe_net_del(dev);
+
+	return 0;
+}
+
+static struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
+	.dellink = rxe_dellink,
 };
 
 static int __init rxe_module_init(void)
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 4772ea19c6e2..6e35566e933b 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -527,6 +527,20 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 	return 0;
 }
 
+#define SK_REF_FOR_TUNNEL	2
+void rxe_net_del(struct ib_device *dev)
+{
+	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(recv_sockets.sk6->sk);
+	else
+		rxe_release_udp_tunnel(recv_sockets.sk6);
+
+	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(recv_sockets.sk4->sk);
+	else
+		rxe_release_udp_tunnel(recv_sockets.sk4);
+}
+
 static void rxe_port_event(struct rxe_dev *rxe,
 			   enum ib_event_type event)
 {
@@ -686,8 +700,6 @@ int rxe_register_notifier(void)
 
 void rxe_net_exit(void)
 {
-	rxe_release_udp_tunnel(recv_sockets.sk6);
-	rxe_release_udp_tunnel(recv_sockets.sk4);
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index a222c3eeae12..f48f22f3353b 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -17,6 +17,7 @@ struct rxe_recv_sockets {
 };
 
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
+void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
 int rxe_net_init(void);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv2 5/6] RDMA/rxe: Replace global variable with sock lookup functions
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
                   ` (3 preceding siblings ...)
  2022-10-06  8:59 ` [PATCHv2 4/6] RDMA/rxe: Implement dellink in rxe yanjun.zhu
@ 2022-10-06  8:59 ` yanjun.zhu
  2022-10-06  8:59 ` [PATCHv2 6/6] RDMA/rxe: add the support of net namespace yanjun.zhu
  2022-10-19 14:56 ` [PATCHv2 0/6] Fix the problem that rxe can not work in net Yanjun Zhu
  6 siblings, 0 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally a global variable is to keep the sock of udp listening
on port 4791. In fact, sock lookup functions can be used to get
the sock.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c       |  1 +
 drivers/infiniband/sw/rxe/rxe_net.c   | 58 ++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_net.h   |  5 ---
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
 4 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 1b8b74ea84e9..933d8e129c47 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -74,6 +74,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
+	rxe->l_sk6				= NULL;
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 6e35566e933b..64b11faccfb4 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -18,8 +18,6 @@
 #include "rxe_net.h"
 #include "rxe_loc.h"
 
-static struct rxe_recv_sockets recv_sockets;
-
 static struct dst_entry *rxe_find_route4(struct net_device *ndev,
 				  struct in_addr *saddr,
 				  struct in_addr *daddr)
@@ -49,6 +47,23 @@ static struct dst_entry *rxe_find_route6(struct net_device *ndev,
 {
 	struct dst_entry *ndst;
 	struct flowi6 fl6 = { { 0 } };
+	struct rxe_dev *rdev;
+
+	rdev = rxe_get_dev_from_net(ndev);
+	if (!rdev->l_sk6) {
+		struct sock *sk;
+
+		rcu_read_lock();
+		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+		rcu_read_unlock();
+		if (!sk) {
+			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
+			return (struct dst_entry *)sk;
+		}
+		__sock_put(sk);
+		rdev->l_sk6 = sk->sk_socket;
+	}
+
 
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = ndev->ifindex;
@@ -56,8 +71,8 @@ static struct dst_entry *rxe_find_route6(struct net_device *ndev,
 	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
 	fl6.flowi6_proto = IPPROTO_UDP;
 
-	ndst = ipv6_stub->ipv6_dst_lookup_flow(sock_net(recv_sockets.sk6->sk),
-					       recv_sockets.sk6->sk, &fl6,
+	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
+					       rdev->l_sk6->sk, &fl6,
 					       NULL);
 	if (IS_ERR(ndst)) {
 		pr_err_ratelimited("no route to %pI6\n", daddr);
@@ -530,15 +545,33 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 #define SK_REF_FOR_TUNNEL	2
 void rxe_net_del(struct ib_device *dev)
 {
-	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
-		__sock_put(recv_sockets.sk6->sk);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (!sk)
+		return;
+
+	__sock_put(sk);
+
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(sk);
 	else
-		rxe_release_udp_tunnel(recv_sockets.sk6);
+		rxe_release_udp_tunnel(sk->sk_socket);
+
+	rcu_read_lock();
+	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (!sk)
+		return;
+
+	__sock_put(sk);
 
-	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
-		__sock_put(recv_sockets.sk4->sk);
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(sk);
 	else
-		rxe_release_udp_tunnel(recv_sockets.sk4);
+		rxe_release_udp_tunnel(sk->sk_socket);
 }
 
 static void rxe_port_event(struct rxe_dev *rxe,
@@ -647,10 +680,8 @@ static int rxe_net_ipv4_init(void)
 	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
-		recv_sockets.sk4 = NULL;
 		return -1;
 	}
-	recv_sockets.sk4 = sock;
 
 	return 0;
 }
@@ -670,17 +701,14 @@ static int rxe_net_ipv6_init(void)
 
 	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
-		recv_sockets.sk6 = NULL;
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
 	if (IS_ERR(sock)) {
-		recv_sockets.sk6 = NULL;
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
-	recv_sockets.sk6 = sock;
 #endif
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index f48f22f3353b..027b20e1bab6 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -11,11 +11,6 @@
 #include <net/if_inet6.h>
 #include <linux/module.h>
 
-struct rxe_recv_sockets {
-	struct socket *sk4;
-	struct socket *sk6;
-};
-
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 void rxe_net_del(struct ib_device *dev);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 96af3e054f4d..13b12f02a52e 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -406,6 +406,7 @@ struct rxe_dev {
 
 	struct rxe_port		port;
 	struct crypto_shash	*tfm;
+	struct socket		*l_sk6;
 };
 
 static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCHv2 6/6] RDMA/rxe: add the support of net namespace
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
                   ` (4 preceding siblings ...)
  2022-10-06  8:59 ` [PATCHv2 5/6] RDMA/rxe: Replace global variable with sock lookup functions yanjun.zhu
@ 2022-10-06  8:59 ` yanjun.zhu
  2022-10-19 14:56 ` [PATCHv2 0/6] Fix the problem that rxe can not work in net Yanjun Zhu
  6 siblings, 0 replies; 14+ messages in thread
From: yanjun.zhu @ 2022-10-06  8:59 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, yanjun.zhu, netdev, davem

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally init_net is used to indicate the current net namespace.
Currently more net namespaces are supported.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     |  2 +-
 drivers/infiniband/sw/rxe/rxe_net.c | 32 +++++++++++++++++------------
 drivers/infiniband/sw/rxe/rxe_net.h |  2 +-
 3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 933d8e129c47..927c513ace81 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -196,7 +196,7 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
-	err = rxe_net_init();
+	err = rxe_net_init(ndev);
 	if (err)
 		return err;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 64b11faccfb4..b5955d0c284a 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -31,7 +31,7 @@ static struct dst_entry *rxe_find_route4(struct net_device *ndev,
 	memcpy(&fl.daddr, daddr, sizeof(*daddr));
 	fl.flowi4_proto = IPPROTO_UDP;
 
-	rt = ip_route_output_key(&init_net, &fl);
+	rt = ip_route_output_key(dev_net(ndev), &fl);
 	if (IS_ERR(rt)) {
 		pr_err_ratelimited("no route to %pI4\n", &daddr->s_addr);
 		return NULL;
@@ -54,7 +54,8 @@ static struct dst_entry *rxe_find_route6(struct net_device *ndev,
 		struct sock *sk;
 
 		rcu_read_lock();
-		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
+				     htons(ROCE_V2_UDP_DPORT), 0);
 		rcu_read_unlock();
 		if (!sk) {
 			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
@@ -546,9 +547,13 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 void rxe_net_del(struct ib_device *dev)
 {
 	struct sock *sk;
+	struct rxe_dev *rdev;
+
+	rdev = container_of(dev, struct rxe_dev, ib_dev);
 
 	rcu_read_lock();
-	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
+	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
+			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (!sk)
 		return;
@@ -561,7 +566,8 @@ void rxe_net_del(struct ib_device *dev)
 		rxe_release_udp_tunnel(sk->sk_socket);
 
 	rcu_read_lock();
-	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
+			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (!sk)
 		return;
@@ -665,19 +671,19 @@ static struct notifier_block rxe_net_notifier = {
 	.notifier_call = rxe_notify,
 };
 
-static int rxe_net_ipv4_init(void)
+static int rxe_net_ipv4_init(struct net_device *ndev)
 {
 	struct sock *sk;
 	struct socket *sock;
 
 	rcu_read_lock();
-	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
+	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
 			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (sk)
 		return 0;
 
-	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
 		return -1;
@@ -686,20 +692,20 @@ static int rxe_net_ipv4_init(void)
 	return 0;
 }
 
-static int rxe_net_ipv6_init(void)
+static int rxe_net_ipv6_init(struct net_device *ndev)
 {
 #if IS_ENABLED(CONFIG_IPV6)
 	struct sock *sk;
 	struct socket *sock;
 
 	rcu_read_lock();
-	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
+	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
 			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (sk)
 		return 0;
 
-	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
@@ -731,14 +737,14 @@ void rxe_net_exit(void)
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
-int rxe_net_init(void)
+int rxe_net_init(struct net_device *ndev)
 {
 	int err;
 
-	err = rxe_net_ipv4_init();
+	err = rxe_net_ipv4_init(ndev);
 	if (err)
 		return err;
-	err = rxe_net_ipv6_init();
+	err = rxe_net_ipv6_init(ndev);
 	if (err)
 		goto err_out;
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 027b20e1bab6..56249677d692 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -15,7 +15,7 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
-int rxe_net_init(void);
+int rxe_net_init(struct net_device *ndev);
 void rxe_net_exit(void);
 
 #endif /* RXE_NET_H */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
                   ` (5 preceding siblings ...)
  2022-10-06  8:59 ` [PATCHv2 6/6] RDMA/rxe: add the support of net namespace yanjun.zhu
@ 2022-10-19 14:56 ` Yanjun Zhu
  2022-11-11  2:36   ` Yanjun Zhu
  6 siblings, 1 reply; 14+ messages in thread
From: Yanjun Zhu @ 2022-10-19 14:56 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, zyjzyj2000, linux-rdma, netdev, davem; +Cc: Zhu Yanjun

在 2022/10/6 16:59, yanjun.zhu@linux.dev 写道:
> From: Zhu Yanjun <yanjun.zhu@intel.com>
> 
> When run "ip link add" command to add a rxe rdma link in a net
> namespace, normally this rxe rdma link can not work in a net
> name space.
> 
> The root cause is that a sock listening on udp port 4791 is created
> in init_net when the rdma_rxe module is loaded into kernel. That is,
> the sock listening on udp port 4791 is created in init_net. Other net
> namespace is difficult to use this sock.
> 
> The following commits will solve this problem.
> 
> In the first commit, move the creating sock listening on udp port 4791
> from module_init function to rdma link creating functions. That is,
> after the module rdma_rxe is loaded, the sock will not be created.
> When run "rdma link add ..." command, the sock will be created. So
> when creating a rdma link in the net namespace, the sock will be
> created in this net namespace.
> 
> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> will check the sock exists in the net namespace or not. If yes, rdma
> link will increase the reference count of this sock, then continue other
> jobs instead of creating a new sock to listen on udp port 4791. Since the
> network notifier is global, when the module rdma_rxe is loaded, this
> notifier will be registered.
> 
> After the rdma link is created, the command "rdma link del" is to
> delete rdma link at the same time the sock is checked. If the reference
> count of this sock is greater than the sock reference count needed by
> udp tunnel, the sock reference count is decreased by one. If equal, it
> indicates that this rdma link is the last one. As such, the udp tunnel
> is shut down and the sock is closed. The above work should be
> implemented in linkdel function. But currently no dellink function in
> rxe. So the 3rd commit addes dellink function pointer. And the 4th
> commit implements the dellink function in rxe.
> 
> To now, it is not necessary to keep a global variable to store the sock
> listening udp port 4791. This global variable can be replaced by the
> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
> function udp6_lib_lookup is in the fast path, a member variable l_sk6
> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
> to lookup the sock, then the sock is stored in l_sk6, in the future,it
> can be used directly.
> 
> All the above work has been done in init_net. And it can also work in
> the net namespace. So the init_net is replaced by the individual net
> namespace. This is what the 6th commit does. Because rxe device is
> dependent on the net device and the sock listening on udp port 4791,
> every rxe device is in exclusive mode in the individual net namespace.
> Other rdma netns operations will be considerred in the future.
> 
> Test steps:
> 1) Suppose that 2 NICs are in 2 different net namespaces.
> 
>   # ip netns exec net0 ip link
>   3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>      link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>      altname enp5s0
> 
>   # ip netns exec net1 ip link
>   4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>      link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
> 
> 2) Add rdma link in the different net namespace
>     net0:
>     ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
> 
>     net1:
>     ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
> 
> 3) Run rping test.
>     net0
>     # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>     [1] 1737
>     # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>     verbose
>     count 1
>     ...
>     ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>     ...
> 
> 4) Remove the rdma links from the net namespaces.
>     net0:
>     ip netns exec net0 rdma link del rxe0
>     net1:
>     ip netns exec net1 rdma link del rxe1
> 
> ---
> V1->V2: Add the explicit initialization of sk6.
> ---
> Zhu Yanjun (6):
>    RDMA/rxe: Creating listening sock in newlink function
>    RDMA/rxe: Support more rdma links in init_net
>    RDMA/nldev: Add dellink function pointer
>    RDMA/rxe: Implement dellink in rxe
>    RDMA/rxe: Replace global variable with sock lookup functions
>    RDMA/rxe: add the support of net namespace

Gently ping

Zhu Yanjun

> 
>   drivers/infiniband/core/nldev.c       |   6 ++
>   drivers/infiniband/sw/rxe/rxe.c       |  26 +++++-
>   drivers/infiniband/sw/rxe/rxe_net.c   | 129 ++++++++++++++++++++------
>   drivers/infiniband/sw/rxe/rxe_net.h   |   9 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.h |   1 +
>   include/rdma/rdma_netlink.h           |   2 +
>   6 files changed, 134 insertions(+), 39 deletions(-)
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-10-19 14:56 ` [PATCHv2 0/6] Fix the problem that rxe can not work in net Yanjun Zhu
@ 2022-11-11  2:36   ` Yanjun Zhu
  2022-11-11  3:35     ` Parav Pandit
  0 siblings, 1 reply; 14+ messages in thread
From: Yanjun Zhu @ 2022-11-11  2:36 UTC (permalink / raw)
  To: Yanjun Zhu, jgg, leon, zyjzyj2000, linux-rdma, netdev, davem,
	Parav Pandit
  Cc: Zhu Yanjun

在 2022/10/19 22:56, Yanjun Zhu 写道:
> 在 2022/10/6 16:59, yanjun.zhu@linux.dev 写道:
>> From: Zhu Yanjun <yanjun.zhu@intel.com>
>>
>> When run "ip link add" command to add a rxe rdma link in a net
>> namespace, normally this rxe rdma link can not work in a net
>> name space.
>>
>> The root cause is that a sock listening on udp port 4791 is created
>> in init_net when the rdma_rxe module is loaded into kernel. That is,
>> the sock listening on udp port 4791 is created in init_net. Other net
>> namespace is difficult to use this sock.
>>
>> The following commits will solve this problem.
>>
>> In the first commit, move the creating sock listening on udp port 4791
>> from module_init function to rdma link creating functions. That is,
>> after the module rdma_rxe is loaded, the sock will not be created.
>> When run "rdma link add ..." command, the sock will be created. So
>> when creating a rdma link in the net namespace, the sock will be
>> created in this net namespace.
>>
>> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
>> will check the sock exists in the net namespace or not. If yes, rdma
>> link will increase the reference count of this sock, then continue other
>> jobs instead of creating a new sock to listen on udp port 4791. Since the
>> network notifier is global, when the module rdma_rxe is loaded, this
>> notifier will be registered.
>>
>> After the rdma link is created, the command "rdma link del" is to
>> delete rdma link at the same time the sock is checked. If the reference
>> count of this sock is greater than the sock reference count needed by
>> udp tunnel, the sock reference count is decreased by one. If equal, it
>> indicates that this rdma link is the last one. As such, the udp tunnel
>> is shut down and the sock is closed. The above work should be
>> implemented in linkdel function. But currently no dellink function in
>> rxe. So the 3rd commit addes dellink function pointer. And the 4th
>> commit implements the dellink function in rxe.
>>
>> To now, it is not necessary to keep a global variable to store the sock
>> listening udp port 4791. This global variable can be replaced by the
>> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
>> function udp6_lib_lookup is in the fast path, a member variable l_sk6
>> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
>> to lookup the sock, then the sock is stored in l_sk6, in the future,it
>> can be used directly.
>>
>> All the above work has been done in init_net. And it can also work in
>> the net namespace. So the init_net is replaced by the individual net
>> namespace. This is what the 6th commit does. Because rxe device is
>> dependent on the net device and the sock listening on udp port 4791,
>> every rxe device is in exclusive mode in the individual net namespace.
>> Other rdma netns operations will be considerred in the future.
>>
>> Test steps:
>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>
>>   # ip netns exec net0 ip link
>>   3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>      link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>      altname enp5s0
>>
>>   # ip netns exec net1 ip link
>>   4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>      link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>
>> 2) Add rdma link in the different net namespace
>>     net0:
>>     ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>
>>     net1:
>>     ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>
>> 3) Run rping test.
>>     net0
>>     # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>     [1] 1737
>>     # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>     verbose
>>     count 1
>>     ...
>>     ping data: rdma-ping-0: 
>> ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>     ...
>>
>> 4) Remove the rdma links from the net namespaces.
>>     net0:
>>     ip netns exec net0 rdma link del rxe0
>>     net1:
>>     ip netns exec net1 rdma link del rxe1
>>
>> ---
>> V1->V2: Add the explicit initialization of sk6.
>> ---
>> Zhu Yanjun (6):
>>    RDMA/rxe: Creating listening sock in newlink function
>>    RDMA/rxe: Support more rdma links in init_net
>>    RDMA/nldev: Add dellink function pointer
>>    RDMA/rxe: Implement dellink in rxe
>>    RDMA/rxe: Replace global variable with sock lookup functions
>>    RDMA/rxe: add the support of net namespace

Hi, Parav Pandit

I think you are the expert of netns. Can you help to review these patches?

Thanks and Regards,
Zhu Yanjun

> 
> Gently ping
> 
> Zhu Yanjun
> 
>>
>>   drivers/infiniband/core/nldev.c       |   6 ++
>>   drivers/infiniband/sw/rxe/rxe.c       |  26 +++++-
>>   drivers/infiniband/sw/rxe/rxe_net.c   | 129 ++++++++++++++++++++------
>>   drivers/infiniband/sw/rxe/rxe_net.h   |   9 +-
>>   drivers/infiniband/sw/rxe/rxe_verbs.h |   1 +
>>   include/rdma/rdma_netlink.h           |   2 +
>>   6 files changed, 134 insertions(+), 39 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-11-11  2:36   ` Yanjun Zhu
@ 2022-11-11  3:35     ` Parav Pandit
  2022-11-11  3:38       ` Yanjun Zhu
  0 siblings, 1 reply; 14+ messages in thread
From: Parav Pandit @ 2022-11-11  3:35 UTC (permalink / raw)
  To: Yanjun Zhu, jgg, leon, zyjzyj2000, linux-rdma, netdev, davem; +Cc: Zhu Yanjun


> From: Yanjun Zhu <yanjun.zhu@linux.dev>
> Sent: Thursday, November 10, 2022 9:37 PM


> Can you help to review these patches?
I will try to review it before 13th.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-11-11  3:35     ` Parav Pandit
@ 2022-11-11  3:38       ` Yanjun Zhu
  2022-11-13  4:58         ` Parav Pandit
  0 siblings, 1 reply; 14+ messages in thread
From: Yanjun Zhu @ 2022-11-11  3:38 UTC (permalink / raw)
  To: Parav Pandit, jgg, leon, zyjzyj2000, linux-rdma, netdev, davem; +Cc: Zhu Yanjun


在 2022/11/11 11:35, Parav Pandit 写道:
>> From: Yanjun Zhu <yanjun.zhu@linux.dev>
>> Sent: Thursday, November 10, 2022 9:37 PM
>
>> Can you help to review these patches?
> I will try to review it before 13th.

Thank you very much.

Zhu Yanjun


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-11-11  3:38       ` Yanjun Zhu
@ 2022-11-13  4:58         ` Parav Pandit
  2022-11-13 10:25           ` Yanjun Zhu
  2023-02-13 12:00           ` Zhu Yanjun
  0 siblings, 2 replies; 14+ messages in thread
From: Parav Pandit @ 2022-11-13  4:58 UTC (permalink / raw)
  To: Yanjun Zhu, jgg, leon, zyjzyj2000, linux-rdma, netdev, davem; +Cc: Zhu Yanjun

Hi Yanjun,

> From: Yanjun Zhu <yanjun.zhu@linux.dev>
> Sent: Thursday, November 10, 2022 10:38 PM
> 
> 
> 在 2022/11/11 11:35, Parav Pandit 写道:
> >> From: Yanjun Zhu <yanjun.zhu@linux.dev>
> >> Sent: Thursday, November 10, 2022 9:37 PM
> >
> >> Can you help to review these patches?
> > I will try to review it before 13th.

I did a brief review of patch set.
I didn’t go line by line for each patch; hence I give lumped comments here for overall series.

1. Add example and test results in below test flow in exclusive mode in cover letter.
   # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
   # ip netns del net0
   Verify that rdma device rxe1 is deleted.

2. Usage of dev_net() in rxe_setup_udp_tunnel() is unsafe.
   This is because when rxe_setup_udp_tunnel() is executed, net ns of netdev can change. 
   This needs to be synchronized with per net notifier register_pernet_subsys() of exit or exit_batch.
   This notifiers callback should be added to rxe module.

3. You need to set bind_ifindex of udp config to the netdev given in newlink in rxe_setup_udp_tunnel.
   Should be a separate pre-patch to ensure that close and right relation to udp socket with netdev in a given netns.

4. Rearrange series to implement delete link as separate series from net ns securing series.
They are unrelated. Current delink series may have use after free accesses. Those needs to be guarded in likely larger series.

5. udp tunnel must shutdown synchronously when rdma link del is done.
   This means any new packet arriving after this point, will be dropped.
   Any existing packet handling present is flushed.
   From your cover letter description, it appears that sock deletion is refcount based and above semantics is not ensured.

6. In patch 5, rxe_get_dev_from_net() can return NULL, hence l_sk6 check can be unsafe. Please add check for rdev null before rdev->l_sk6 check.

7. In patch 5, I didn't fully inspect, but seems like call to rxe_find_route4() is not rcu safe. 
Hence, extension of dev_net() in rxe_find_route4() doesn't look secure.
Accessing sock_net() is more accurate, because at this layer, it is processing packets at socket layer.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RE: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-11-13  4:58         ` Parav Pandit
@ 2022-11-13 10:25           ` Yanjun Zhu
  2023-02-13 12:00           ` Zhu Yanjun
  1 sibling, 0 replies; 14+ messages in thread
From: Yanjun Zhu @ 2022-11-13 10:25 UTC (permalink / raw)
  To: Parav Pandit, Yanjun Zhu, jgg, leon, zyjzyj2000, linux-rdma,
	netdev, davem
  Cc: Zhu Yanjun

在 2022/11/13 12:58, Parav Pandit 写道:
> Hi Yanjun,
> 
>> From: Yanjun Zhu <yanjun.zhu@linux.dev>
>> Sent: Thursday, November 10, 2022 10:38 PM
>>
>>
>> 在 2022/11/11 11:35, Parav Pandit 写道:
>>>> From: Yanjun Zhu <yanjun.zhu@linux.dev>
>>>> Sent: Thursday, November 10, 2022 9:37 PM
>>>
>>>> Can you help to review these patches?
>>> I will try to review it before 13th.
> 
> I did a brief review of patch set.
> I didn’t go line by line for each patch; hence I give lumped comments here for overall series.
> 
> 1. Add example and test results in below test flow in exclusive mode in cover letter.
>     # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>     # ip netns del net0
>     Verify that rdma device rxe1 is deleted.

Hi, Parav

Thanks a lot. I will add this example to the cover letter.
And I confirm that rdma device rxe1 is deleted after the command "ip 
netns del net0".

I will delve into the following comments.

Thanks and Regards,
Zhu Yanjun

> 
> 2. Usage of dev_net() in rxe_setup_udp_tunnel() is unsafe.
>     This is because when rxe_setup_udp_tunnel() is executed, net ns of netdev can change.
>     This needs to be synchronized with per net notifier register_pernet_subsys() of exit or exit_batch.
>     This notifiers callback should be added to rxe module.
> 
> 3. You need to set bind_ifindex of udp config to the netdev given in newlink in rxe_setup_udp_tunnel.
>     Should be a separate pre-patch to ensure that close and right relation to udp socket with netdev in a given netns.
> 
> 4. Rearrange series to implement delete link as separate series from net ns securing series.
> They are unrelated. Current delink series may have use after free accesses. Those needs to be guarded in likely larger series.
> 
> 5. udp tunnel must shutdown synchronously when rdma link del is done.
>     This means any new packet arriving after this point, will be dropped.
>     Any existing packet handling present is flushed.
>     From your cover letter description, it appears that sock deletion is refcount based and above semantics is not ensured.
> 
> 6. In patch 5, rxe_get_dev_from_net() can return NULL, hence l_sk6 check can be unsafe. Please add check for rdev null before rdev->l_sk6 check.
> 
> 7. In patch 5, I didn't fully inspect, but seems like call to rxe_find_route4() is not rcu safe.
> Hence, extension of dev_net() in rxe_find_route4() doesn't look secure.
> Accessing sock_net() is more accurate, because at this layer, it is processing packets at socket layer.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RE: [PATCHv2 0/6] Fix the problem that rxe can not work in net
  2022-11-13  4:58         ` Parav Pandit
  2022-11-13 10:25           ` Yanjun Zhu
@ 2023-02-13 12:00           ` Zhu Yanjun
  1 sibling, 0 replies; 14+ messages in thread
From: Zhu Yanjun @ 2023-02-13 12:00 UTC (permalink / raw)
  To: Parav Pandit, Yanjun Zhu, jgg, leon, zyjzyj2000, linux-rdma,
	netdev, davem
  Cc: Zhu Yanjun

在 2022/11/13 12:58, Parav Pandit 写道:
> Hi Yanjun,
> 
>> From: Yanjun Zhu <yanjun.zhu@linux.dev>
>> Sent: Thursday, November 10, 2022 10:38 PM
>>
>>
>> 在 2022/11/11 11:35, Parav Pandit 写道:
>>>> From: Yanjun Zhu <yanjun.zhu@linux.dev>
>>>> Sent: Thursday, November 10, 2022 9:37 PM
>>>
>>>> Can you help to review these patches?
>>> I will try to review it before 13th.
> 
> I did a brief review of patch set.
> I didn’t go line by line for each patch; hence I give lumped comments here for overall series.
> 
> 1. Add example and test results in below test flow in exclusive mode in cover letter.
>     # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>     # ip netns del net0
>     Verify that rdma device rxe1 is deleted.

Sorry. It is late to reply.
Got it. I will add the above example in the cover letter.

> 
> 2. Usage of dev_net() in rxe_setup_udp_tunnel() is unsafe.
>     This is because when rxe_setup_udp_tunnel() is executed, net ns of netdev can change.
>     This needs to be synchronized with per net notifier register_pernet_subsys() of exit or exit_batch.
>     This notifiers callback should be added to rxe module.

No. The netdev device and rxe device are one-to-one correspondence. When 
the netdev device is removed from
the net namespace, the rxe device will be removed. So this will not 
happen that rxe_setup_udp_tunnel is executed
while net namespace of the netdev can change.
In the latest commits, I will add register_pernet_subsys() because the 
init and exit functions can help initialization
and cleanup.

> 
> 3. You need to set bind_ifindex of udp config to the netdev given in newlink in rxe_setup_udp_tunnel.
>     Should be a separate pre-patch to ensure that close and right relation to udp socket with netdev in a given netns.
> 

No. In the rxe, the sock listeing to the udp port 4791 does not bind 
with the netdev device. A sock can listen to the packets from several
netdev devices. That is, this port 4791 sock is shared by many rxe rdma 
links.

> 4. Rearrange series to implement delete link as separate series from net ns securing series.
> They are unrelated. Current delink series may have use after free accesses. Those needs to be guarded in likely larger series.
> 

Got it. I found the use-after-free problem with the sock. And now it is 
fixed in the latest commits.
I will send it out very soon.

> 5. udp tunnel must shutdown synchronously when rdma link del is done.
>     This means any new packet arriving after this point, will be dropped.
>     Any existing packet handling present is flushed.
>     From your cover letter description, it appears that sock deletion is refcount based and above semantics is not ensured.
> 

The port 4791 udp tunnel is shared by many rxe rdma links. If one rdma 
link exists in net namespace, this udp tunnel should not be
shutdown. Only if no rxe rdma link exist, this udp tunnel will be destroyed.

> 6. In patch 5, rxe_get_dev_from_net() can return NULL, hence l_sk6 check can be unsafe. Please add check for rdev null before rdev->l_sk6 check.
> 

Got it. the l_sk6 is replaced with the sk6 in net namespace notifier.

> 7. In patch 5, I didn't fully inspect, but seems like call to rxe_find_route4() is not rcu safe.
> Hence, extension of dev_net() in rxe_find_route4() doesn't look secure.
> Accessing sock_net() is more accurate, because at this layer, it is processing packets at socket layer.

No. As I mentioned in the above, because the netdev device and rxe 
device are one-to-one correspondence, if one net device is moved out of 
the net namespace,
the related rxe device is also removed. So dev_net seems safe.

I will send out the latest commits very soon.

Zhu Yanjun


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-02-13 12:00 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-06  8:59 [PATCHv2 0/6] Fix the problem that rxe can not work in net yanjun.zhu
2022-10-06  8:59 ` [PATCHv2 1/6] RDMA/rxe: Creating listening sock in newlink function yanjun.zhu
2022-10-06  8:59 ` [PATCHv2 2/6] RDMA/rxe: Support more rdma links in init_net yanjun.zhu
2022-10-06  8:59 ` [PATCHv2 3/6] RDMA/nldev: Add dellink function pointer yanjun.zhu
2022-10-06  8:59 ` [PATCHv2 4/6] RDMA/rxe: Implement dellink in rxe yanjun.zhu
2022-10-06  8:59 ` [PATCHv2 5/6] RDMA/rxe: Replace global variable with sock lookup functions yanjun.zhu
2022-10-06  8:59 ` [PATCHv2 6/6] RDMA/rxe: add the support of net namespace yanjun.zhu
2022-10-19 14:56 ` [PATCHv2 0/6] Fix the problem that rxe can not work in net Yanjun Zhu
2022-11-11  2:36   ` Yanjun Zhu
2022-11-11  3:35     ` Parav Pandit
2022-11-11  3:38       ` Yanjun Zhu
2022-11-13  4:58         ` Parav Pandit
2022-11-13 10:25           ` Yanjun Zhu
2023-02-13 12:00           ` Zhu Yanjun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).