* [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
@ 2023-02-14  6:06 Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
                   ` (9 more replies)
  0 siblings, 10 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

When running the "rdma link add" command to add an rxe rdma link in a net
namespace, the rxe rdma link normally can not work in that net
namespace.

The root cause is that the sock listening on udp port 4791 is created
in init_net when the rdma_rxe module is loaded into the kernel. Since
this sock lives in init_net, it is difficult for other net namespaces
to use it.

The following commits will solve this problem.

In the first commit, the creation of the sock listening on udp port 4791
is moved from the module_init function to the rdma link creation
function. That is, the sock is no longer created when the rdma_rxe
module is loaded; it is created when the "rdma link add ..." command is
run. So when an rdma link is created in a net namespace, the sock is
created in that net namespace.

In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
check whether the sock already exists in the net namespace. If it does,
the rdma link increases the reference count of this sock and continues
with the remaining work instead of creating a new sock listening on udp
port 4791. Since the network notifier is global, it is registered once
when the rdma_rxe module is loaded.
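
The ipv4 side of this check looks roughly like the following (a sketch
of what the second commit does; error handling trimmed):

  struct sock *sk;
  struct socket *sock;

  rcu_read_lock();
  sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
                       htons(ROCE_V2_UDP_DPORT), 0);
  rcu_read_unlock();
  if (sk)         /* a listener already exists; the lookup took a ref */
          return 0;

  /* no listener yet: create the udp tunnel sock in this namespace */
  sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);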

After the rdma link is created, the "rdma link del" command deletes the
rdma link and at the same time checks the sock. If the reference count
of this sock is greater than the sock reference count needed by the udp
tunnel, the sock reference count is decreased by one. If they are
equal, this rdma link is the last one, so the udp tunnel is shut down
and the sock is closed. This work should be implemented in a dellink
function, but currently rxe has no dellink function. So the 3rd commit
adds a dellink function pointer, and the 4th commit implements the
dellink function in rxe.
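
The teardown decision reduces to the following (a sketch of the
rxe_net_del() function added in the 4th commit):

  #define SK_REF_FOR_TUNNEL 2  /* refs needed by the udp tunnel itself */

  if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
          __sock_put(sk);                         /* other links remain */
  else
          rxe_release_udp_tunnel(sk->sk_socket);  /* last link: close it */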

At this point, it is no longer necessary to keep a global variable to
store the sock listening on udp port 4791. This global variable can be
replaced entirely by the functions udp4_lib_lookup and udp6_lib_lookup.
Because the function udp6_lib_lookup is in the fast path, a member
variable l_sk6 is added to cache the sock: if l_sk6 is NULL,
udp6_lib_lookup is called to look up the sock, and the sock is stored
in l_sk6 so that it can be used directly afterwards.
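
The caching in the ipv6 fast path looks roughly like this (a sketch of
the lookup added to rxe_find_route6() in the 5th commit):

  if (!rdev->l_sk6) {
          struct sock *sk;

          rcu_read_lock();
          sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
                               htons(ROCE_V2_UDP_DPORT), 0);
          rcu_read_unlock();
          if (!sk)
                  return NULL;            /* no listener: give up */
          __sock_put(sk);                 /* drop the lookup reference */
          rdev->l_sk6 = sk->sk_socket;    /* cache for the fast path */
  }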

All the above work has been done in init_net, and it can work in any
other net namespace in the same way. So init_net is replaced by the
individual net namespace; this is what the 6th commit does. Because an
rxe device depends on both the net device and the sock listening on udp
port 4791, every rxe device runs in exclusive mode in its own net
namespace. Other rdma netns operations will be considered in the
future.
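
The change is mostly mechanical: every place that passed init_net now
passes the namespace of the underlying net device, for example (taken
from the 6th commit):

  -	rt = ip_route_output_key(&init_net, &fl);
  +	rt = ip_route_output_key(dev_net(ndev), &fl);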

In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
calls are added. When a new net namespace is created, the init callback
initializes the sk4 and sk6 socks, and the 2 socks are released when
the net namespace is destroyed. The functions
rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
namespace, and rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 do the same for
sk6. sk4 and sk6 are then used by the previous commits.
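
The pernet wiring follows the standard pattern (a sketch of what the
7th commit adds in rxe_ns.c):

  static struct pernet_operations rxe_net_ops = {
          .init = rxe_ns_init,   /* runs for every new net namespace */
          .exit = rxe_ns_exit,   /* runs when a net namespace is removed */
          .id   = &rxe_pernet_id,
          .size = sizeof(struct rxe_ns_sock),
  };

  /* registered once from rxe_module_init() via rxe_namespace_init() */
  register_pernet_subsys(&rxe_net_ops);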

Since the sk4 and sk6 socks in the pernet data can now be accessed
directly, it is no longer necessary to keep the extra l_sk6 member. As
such, in the 8th commit, l_sk6 is replaced with the sk6 from the pernet
data.

Test steps:
1) Suppose that 2 NICs are in 2 different net namespaces.

  # ip netns exec net0 ip link
  3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
     link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
     altname enp5s0

  # ip netns exec net1 ip link
  4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
     link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff

2) Add rdma link in the different net namespace
    net0:
    # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2

    net1:
    # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3

3) Run rping test.
    net0
    # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
    [1] 1737
    # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
    verbose
    count 1
    ...
    ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
    ...

4) Remove the rdma links from the net namespaces.
    net0:
    # ip netns exec net0 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
    UNCONN    0         0         [::]:4791             [::]:*

    # ip netns exec net0 rdma link del rxe0
    
    # ip netns exec net0 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    
    net1:
    # ip netns exec net1 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
    UNCONN    0         0         [::]:4791             [::]:*
    
    # ip netns exec net1 rdma link del rxe1

    # ip netns exec net1 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process

V2->V3: 1) Add "rdma link del" examples in the cover letter, and use "ss -lu" to
           verify that the rdma link is removed.
        2) Add register_pernet_subsys/unregister_pernet_subsys for net
           namespace support.
        3) Replace l_sk6 with the sk6 in the pernet namespace.

V1->V2: Add the explicit initialization of sk6.

Zhu Yanjun (8):
  RDMA/rxe: Creating listening sock in newlink function
  RDMA/rxe: Support more rdma links in init_net
  RDMA/nldev: Add dellink function pointer
  RDMA/rxe: Implement dellink in rxe
  RDMA/rxe: Replace global variable with sock lookup functions
  RDMA/rxe: add the support of net namespace
  RDMA/rxe: Add the support of net namespace notifier
  RDMA/rxe: Replace l_sk6 with sk6 in net namespace

 drivers/infiniband/core/nldev.c     |   6 ++
 drivers/infiniband/sw/rxe/Makefile  |   3 +-
 drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
 drivers/infiniband/sw/rxe/rxe_ns.c  | 128 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_ns.h  |  11 +++
 include/rdma/rdma_netlink.h         |   2 +
 8 files changed, 267 insertions(+), 40 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h

-- 
2.34.1



* [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:10   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally the sock listening on udp port 4791 was created when the
rdma_rxe module was loaded. Move the creation of this listening sock to
the newlink function.

So the sock listening on udp port 4791 is created when the "rdma link
add" command is run.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 136c2efe3466..64644cb0bb38 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -192,6 +192,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
+	err = rxe_net_init();
+	if (err)
+		return err;
+
 	err = rxe_net_add(ibdev_name, ndev);
 	if (err) {
 		rxe_dbg(exists, "failed to add %s\n", ndev->name);
@@ -208,12 +212,6 @@ static struct rdma_link_ops rxe_link_ops = {
 
 static int __init rxe_module_init(void)
 {
-	int err;
-
-	err = rxe_net_init();
-	if (err)
-		return err;
-
 	rdma_link_register(&rxe_link_ops);
 	pr_info("loaded\n");
 	return 0;
-- 
2.34.1



* [PATCHv3 2/8] RDMA/rxe: Support more rdma links in init_net
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:10   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

In init_net, when several rdma links are created with the command "rdma
link add", newlink checks whether udp port 4791 already has a listener.
If not, a sock listening on udp port 4791 is created. If yes, the
reference count of the existing sock is increased.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     | 12 ++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 55 +++++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_net.h |  1 +
 3 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 64644cb0bb38..0ce6adb43cfc 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -8,6 +8,7 @@
 #include <net/addrconf.h>
 #include "rxe.h"
 #include "rxe_loc.h"
+#include "rxe_net.h"
 
 MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
 MODULE_DESCRIPTION("Soft RDMA transport");
@@ -205,14 +206,23 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
-static struct rdma_link_ops rxe_link_ops = {
+struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
 };
 
 static int __init rxe_module_init(void)
 {
+	int err;
+
 	rdma_link_register(&rxe_link_ops);
+
+	err = rxe_register_notifier();
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
 	pr_info("loaded\n");
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index e02e1624bcf4..3ca92e062800 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -623,13 +623,23 @@ static struct notifier_block rxe_net_notifier = {
 
 static int rxe_net_ipv4_init(void)
 {
-	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
-				htons(ROCE_V2_UDP_DPORT), false);
-	if (IS_ERR(recv_sockets.sk4)) {
-		recv_sockets.sk4 = NULL;
+	struct sock *sk;
+	struct socket *sock;
+
+	rcu_read_lock();
+	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
+			     htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (sk)
+		return 0;
+
+	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
+	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
+		recv_sockets.sk4 = NULL;
 		return -1;
 	}
+	recv_sockets.sk4 = sock;
 
 	return 0;
 }
@@ -637,24 +647,46 @@ static int rxe_net_ipv4_init(void)
 static int rxe_net_ipv6_init(void)
 {
 #if IS_ENABLED(CONFIG_IPV6)
+	struct sock *sk;
+	struct socket *sock;
+
+	rcu_read_lock();
+	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
+			     htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (sk)
+		return 0;
 
-	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
-						htons(ROCE_V2_UDP_DPORT), true);
-	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
+	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
+	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		recv_sockets.sk6 = NULL;
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
-	if (IS_ERR(recv_sockets.sk6)) {
+	if (IS_ERR(sock)) {
 		recv_sockets.sk6 = NULL;
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
+	recv_sockets.sk6 = sock;
 #endif
 	return 0;
 }
 
+int rxe_register_notifier(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&rxe_net_notifier);
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 void rxe_net_exit(void)
 {
 	rxe_release_udp_tunnel(recv_sockets.sk6);
@@ -666,19 +698,12 @@ int rxe_net_init(void)
 {
 	int err;
 
-	recv_sockets.sk6 = NULL;
-
 	err = rxe_net_ipv4_init();
 	if (err)
 		return err;
 	err = rxe_net_ipv6_init();
 	if (err)
 		goto err_out;
-	err = register_netdevice_notifier(&rxe_net_notifier);
-	if (err) {
-		pr_err("Failed to register netdev notifier\n");
-		goto err_out;
-	}
 	return 0;
 err_out:
 	rxe_net_exit();
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 45d80d00f86b..a222c3eeae12 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -18,6 +18,7 @@ struct rxe_recv_sockets {
 
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 
+int rxe_register_notifier(void);
 int rxe_net_init(void);
 void rxe_net_exit(void);
 
-- 
2.34.1



* [PATCHv3 3/8] RDMA/nldev: Add dellink function pointer
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:11   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The sock listening on port 4791 is now created in the newlink function,
so a matching dellink function is needed to remove the sock when the
link is deleted. Add a dellink function pointer to struct
rdma_link_ops.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/core/nldev.c | 6 ++++++
 include/rdma/rdma_netlink.h     | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index d5d3e4f0de77..97a62685ed5b 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -1758,6 +1758,12 @@ static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 	}
 
+	if (device->link_ops) {
+		err = device->link_ops->dellink(device);
+		if (err)
+			return err;
+	}
+
 	ib_unregister_device_and_put(device);
 	return 0;
 }
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index c2a79aeee113..bf9df004061f 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -5,6 +5,7 @@
 
 #include <linux/netlink.h>
 #include <uapi/rdma/rdma_netlink.h>
+#include <rdma/ib_verbs.h>
 
 enum {
 	RDMA_NLDEV_ATTR_EMPTY_STRING = 1,
@@ -114,6 +115,7 @@ struct rdma_link_ops {
 	struct list_head list;
 	const char *type;
 	int (*newlink)(const char *ibdev_name, struct net_device *ndev);
+	int (*dellink)(struct ib_device *dev);
 };
 
 void rdma_link_register(struct rdma_link_ops *ops);
-- 
2.34.1



* [PATCHv3 4/8] RDMA/rxe: Implement dellink in rxe
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (2 preceding siblings ...)
  2023-02-14  6:06 ` [PATCHv3 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:12   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

When running the "rdma link del" command, the dellink function is
called. If the sock refcnt is greater than the refcnt needed by the udp
tunnel, the sock refcnt is decreased by 1.

If they are equal, this is the last rdma link, so the udp tunnel is
destroyed.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     | 12 +++++++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 17 +++++++++++++++--
 drivers/infiniband/sw/rxe/rxe_net.h |  1 +
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 0ce6adb43cfc..ebfabc6d6b76 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -166,10 +166,12 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
 /* called by ifc layer to create new rxe device.
  * The caller should allocate memory for rxe by calling ib_alloc_device.
  */
+static struct rdma_link_ops rxe_link_ops;
 int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name)
 {
 	rxe_init(rxe);
 	rxe_set_mtu(rxe, mtu);
+	rxe->ib_dev.link_ops = &rxe_link_ops;
 
 	return rxe_register_device(rxe, ibdev_name);
 }
@@ -206,9 +208,17 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
-struct rdma_link_ops rxe_link_ops = {
+static int rxe_dellink(struct ib_device *dev)
+{
+	rxe_net_del(dev);
+
+	return 0;
+}
+
+static struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
+	.dellink = rxe_dellink,
 };
 
 static int __init rxe_module_init(void)
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 3ca92e062800..4cc7de7b115b 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -530,6 +530,21 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 	return 0;
 }
 
+#define SK_REF_FOR_TUNNEL	2
+void rxe_net_del(struct ib_device *dev)
+{
+	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(recv_sockets.sk6->sk);
+	else
+		rxe_release_udp_tunnel(recv_sockets.sk6);
+
+	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(recv_sockets.sk4->sk);
+	else
+		rxe_release_udp_tunnel(recv_sockets.sk4);
+}
+#undef SK_REF_FOR_TUNNEL
+
 static void rxe_port_event(struct rxe_dev *rxe,
 			   enum ib_event_type event)
 {
@@ -689,8 +704,6 @@ int rxe_register_notifier(void)
 
 void rxe_net_exit(void)
 {
-	rxe_release_udp_tunnel(recv_sockets.sk6);
-	rxe_release_udp_tunnel(recv_sockets.sk4);
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index a222c3eeae12..f48f22f3353b 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -17,6 +17,7 @@ struct rxe_recv_sockets {
 };
 
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
+void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
 int rxe_net_init(void);
-- 
2.34.1



* [PATCHv3 5/8] RDMA/rxe: Replace global variable with sock lookup functions
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (3 preceding siblings ...)
  2023-02-14  6:06 ` [PATCHv3 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:13   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally a global variable is used to keep the sock listening on udp
port 4791. The sock lookup functions can be used to get the sock
instead, so the global variable can be removed.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c       |  1 +
 drivers/infiniband/sw/rxe/rxe_net.c   | 58 ++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_net.h   |  5 ---
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
 4 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index ebfabc6d6b76..e81c2164d77f 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -74,6 +74,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
+	rxe->l_sk6				= NULL;
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 4cc7de7b115b..b56e2c32fbf7 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -18,8 +18,6 @@
 #include "rxe_net.h"
 #include "rxe_loc.h"
 
-static struct rxe_recv_sockets recv_sockets;
-
 static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 					 struct net_device *ndev,
 					 struct in_addr *saddr,
@@ -51,6 +49,23 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 {
 	struct dst_entry *ndst;
 	struct flowi6 fl6 = { { 0 } };
+	struct rxe_dev *rdev;
+
+	rdev = rxe_get_dev_from_net(ndev);
+	if (!rdev->l_sk6) {
+		struct sock *sk;
+
+		rcu_read_lock();
+		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+		rcu_read_unlock();
+		if (!sk) {
+			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
+			return (struct dst_entry *)sk;
+		}
+		__sock_put(sk);
+		rdev->l_sk6 = sk->sk_socket;
+	}
+
 
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = ndev->ifindex;
@@ -58,8 +73,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
 	fl6.flowi6_proto = IPPROTO_UDP;
 
-	ndst = ipv6_stub->ipv6_dst_lookup_flow(sock_net(recv_sockets.sk6->sk),
-					       recv_sockets.sk6->sk, &fl6,
+	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
+					       rdev->l_sk6->sk, &fl6,
 					       NULL);
 	if (IS_ERR(ndst)) {
 		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
@@ -533,15 +548,33 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 #define SK_REF_FOR_TUNNEL	2
 void rxe_net_del(struct ib_device *dev)
 {
-	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
-		__sock_put(recv_sockets.sk6->sk);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (!sk)
+		return;
+
+	__sock_put(sk);
+
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(sk);
 	else
-		rxe_release_udp_tunnel(recv_sockets.sk6);
+		rxe_release_udp_tunnel(sk->sk_socket);
+
+	rcu_read_lock();
+	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (!sk)
+		return;
+
+	__sock_put(sk);
 
-	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
-		__sock_put(recv_sockets.sk4->sk);
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(sk);
 	else
-		rxe_release_udp_tunnel(recv_sockets.sk4);
+		rxe_release_udp_tunnel(sk->sk_socket);
 }
 #undef SK_REF_FOR_TUNNEL
 
@@ -651,10 +684,8 @@ static int rxe_net_ipv4_init(void)
 	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
-		recv_sockets.sk4 = NULL;
 		return -1;
 	}
-	recv_sockets.sk4 = sock;
 
 	return 0;
 }
@@ -674,17 +705,14 @@ static int rxe_net_ipv6_init(void)
 
 	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
-		recv_sockets.sk6 = NULL;
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
 	if (IS_ERR(sock)) {
-		recv_sockets.sk6 = NULL;
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
-	recv_sockets.sk6 = sock;
 #endif
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index f48f22f3353b..027b20e1bab6 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -11,11 +11,6 @@
 #include <net/if_inet6.h>
 #include <linux/module.h>
 
-struct rxe_recv_sockets {
-	struct socket *sk4;
-	struct socket *sk6;
-};
-
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 void rxe_net_del(struct ib_device *dev);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 19ddfa890480..52c4ef4d0305 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -408,6 +408,7 @@ struct rxe_dev {
 
 	struct rxe_port		port;
 	struct crypto_shash	*tfm;
+	struct socket		*l_sk6;
 };
 
 static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)
-- 
2.34.1



* [PATCHv3 6/8] RDMA/rxe: add the support of net namespace
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (4 preceding siblings ...)
  2023-02-14  6:06 ` [PATCHv3 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:14   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally init_net is hard-coded as the net namespace. Replace it with
the net namespace of the underlying net device, so that more net
namespaces are supported.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     |  2 +-
 drivers/infiniband/sw/rxe/rxe_net.c | 33 +++++++++++++++++------------
 drivers/infiniband/sw/rxe/rxe_net.h |  2 +-
 3 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index e81c2164d77f..4a17e4a003f5 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -196,7 +196,7 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
-	err = rxe_net_init();
+	err = rxe_net_init(ndev);
 	if (err)
 		return err;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index b56e2c32fbf7..9af90587642a 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -32,7 +32,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 	memcpy(&fl.daddr, daddr, sizeof(*daddr));
 	fl.flowi4_proto = IPPROTO_UDP;
 
-	rt = ip_route_output_key(&init_net, &fl);
+	rt = ip_route_output_key(dev_net(ndev), &fl);
 	if (IS_ERR(rt)) {
 		rxe_dbg_qp(qp, "no route to %pI4\n", &daddr->s_addr);
 		return NULL;
@@ -56,7 +56,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 		struct sock *sk;
 
 		rcu_read_lock();
-		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
+				     htons(ROCE_V2_UDP_DPORT), 0);
 		rcu_read_unlock();
 		if (!sk) {
 			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
@@ -549,9 +550,13 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 void rxe_net_del(struct ib_device *dev)
 {
 	struct sock *sk;
+	struct rxe_dev *rdev;
+
+	rdev = container_of(dev, struct rxe_dev, ib_dev);
 
 	rcu_read_lock();
-	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
+	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
+			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (!sk)
 		return;
@@ -564,7 +569,8 @@ void rxe_net_del(struct ib_device *dev)
 		rxe_release_udp_tunnel(sk->sk_socket);
 
 	rcu_read_lock();
-	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
+			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (!sk)
 		return;
@@ -636,6 +642,7 @@ static int rxe_notify(struct notifier_block *not_blk,
 	switch (event) {
 	case NETDEV_UNREGISTER:
 		ib_unregister_device_queued(&rxe->ib_dev);
+		rxe_net_del(&rxe->ib_dev);
 		break;
 	case NETDEV_UP:
 		rxe_port_up(rxe);
@@ -669,19 +676,19 @@ static struct notifier_block rxe_net_notifier = {
 	.notifier_call = rxe_notify,
 };
 
-static int rxe_net_ipv4_init(void)
+static int rxe_net_ipv4_init(struct net_device *ndev)
 {
 	struct sock *sk;
 	struct socket *sock;
 
 	rcu_read_lock();
-	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
+	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
 			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (sk)
 		return 0;
 
-	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
 		return -1;
@@ -690,20 +697,20 @@ static int rxe_net_ipv4_init(void)
 	return 0;
 }
 
-static int rxe_net_ipv6_init(void)
+static int rxe_net_ipv6_init(struct net_device *ndev)
 {
 #if IS_ENABLED(CONFIG_IPV6)
 	struct sock *sk;
 	struct socket *sock;
 
 	rcu_read_lock();
-	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
+	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
 			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (sk)
 		return 0;
 
-	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
@@ -735,14 +742,14 @@ void rxe_net_exit(void)
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
-int rxe_net_init(void)
+int rxe_net_init(struct net_device *ndev)
 {
 	int err;
 
-	err = rxe_net_ipv4_init();
+	err = rxe_net_ipv4_init(ndev);
 	if (err)
 		return err;
-	err = rxe_net_ipv6_init();
+	err = rxe_net_ipv6_init(ndev);
 	if (err)
 		goto err_out;
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 027b20e1bab6..56249677d692 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -15,7 +15,7 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
-int rxe_net_init(void);
+int rxe_net_init(struct net_device *ndev);
 void rxe_net_exit(void);
 
 #endif /* RXE_NET_H */
-- 
2.34.1



* [PATCHv3 7/8] RDMA/rxe: Add the support of net namespace notifier
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (5 preceding siblings ...)
  2023-02-14  6:06 ` [PATCHv3 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:14   ` Zhu Yanjun
  2023-02-14  6:06 ` [PATCHv3 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The functions register_pernet_subsys/unregister_pernet_subsys register
per-net-namespace operations. When a new net namespace is created, the
init callback of rxe is called to initialize the sk4 and sk6 socks.
When a net namespace is destroyed, the exit callback is called to
release the sk4 and sk6 socks.

The functions rxe_ns_pernet_sk4 and rxe_ns_pernet_sk6 are used to get
sk4 and sk6 socks.

The functions rxe_ns_pernet_set_sk4 and rxe_ns_pernet_set_sk6 are used
to set sk4 and sk6 socks.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/Makefile  |   3 +-
 drivers/infiniband/sw/rxe/rxe.c     |   9 ++
 drivers/infiniband/sw/rxe/rxe_net.c |  50 +++++------
 drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
 5 files changed, 187 insertions(+), 26 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h

diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 5395a581f4bb..8380f97674cb 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -22,4 +22,5 @@ rdma_rxe-y := \
 	rxe_mcast.o \
 	rxe_task.o \
 	rxe_net.o \
-	rxe_hw_counters.o
+	rxe_hw_counters.o \
+	rxe_ns.o
diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 4a17e4a003f5..c297677bf06a 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -9,6 +9,7 @@
 #include "rxe.h"
 #include "rxe_loc.h"
 #include "rxe_net.h"
+#include "rxe_ns.h"
 
 MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
 MODULE_DESCRIPTION("Soft RDMA transport");
@@ -234,6 +235,12 @@ static int __init rxe_module_init(void)
 		return -1;
 	}
 
+	err = rxe_namespace_init();
+	if (err) {
+		pr_err("Failed to register net namespace notifier\n");
+		return -1;
+	}
+
 	pr_info("loaded\n");
 	return 0;
 }
@@ -244,6 +251,8 @@ static void __exit rxe_module_exit(void)
 	ib_unregister_driver(RDMA_DRIVER_RXE);
 	rxe_net_exit();
 
+	rxe_namespace_exit();
+
 	pr_info("unloaded\n");
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 9af90587642a..8135876b11f6 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -17,6 +17,7 @@
 #include "rxe.h"
 #include "rxe_net.h"
 #include "rxe_loc.h"
+#include "rxe_ns.h"
 
 static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 					 struct net_device *ndev,
@@ -554,33 +555,30 @@ void rxe_net_del(struct ib_device *dev)
 
 	rdev = container_of(dev, struct rxe_dev, ib_dev);
 
-	rcu_read_lock();
-	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
+	sk = rxe_ns_pernet_sk4(dev_net(rdev->ndev));
 	if (!sk)
 		return;
 
-	__sock_put(sk);
 
-	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
 		__sock_put(sk);
-	else
+	} else {
 		rxe_release_udp_tunnel(sk->sk_socket);
+		sk = NULL;
+		rxe_ns_pernet_set_sk4(dev_net(rdev->ndev), sk);
+	}
 
-	rcu_read_lock();
-	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
+	sk = rxe_ns_pernet_sk6(dev_net(rdev->ndev));
 	if (!sk)
 		return;
 
-	__sock_put(sk);
-
-	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
 		__sock_put(sk);
-	else
+	} else {
 		rxe_release_udp_tunnel(sk->sk_socket);
+		sk = NULL;
+		rxe_ns_pernet_set_sk6(dev_net(rdev->ndev), sk);
+	}
 }
 #undef SK_REF_FOR_TUNNEL
 
@@ -681,18 +679,18 @@ static int rxe_net_ipv4_init(struct net_device *ndev)
 	struct sock *sk;
 	struct socket *sock;
 
-	rcu_read_lock();
-	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
-	if (sk)
+	sk = rxe_ns_pernet_sk4(dev_net(ndev));
+	if (sk) {
+		sock_hold(sk);
 		return 0;
+	}
 
 	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
 		return -1;
 	}
+	rxe_ns_pernet_set_sk4(dev_net(ndev), sock->sk);
 
 	return 0;
 }
@@ -703,12 +701,11 @@ static int rxe_net_ipv6_init(struct net_device *ndev)
 	struct sock *sk;
 	struct socket *sock;
 
-	rcu_read_lock();
-	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
-	if (sk)
+	sk = rxe_ns_pernet_sk6(dev_net(ndev));
+	if (sk) {
+		sock_hold(sk);
 		return 0;
+	}
 
 	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
@@ -720,6 +717,9 @@ static int rxe_net_ipv6_init(struct net_device *ndev)
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
+
+	rxe_ns_pernet_set_sk6(dev_net(ndev), sock->sk);
+
 #endif
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
new file mode 100644
index 000000000000..29d08899dcda
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_ns.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#include <net/sock.h>
+#include <net/netns/generic.h>
+#include <net/net_namespace.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/pid_namespace.h>
+#include <net/udp_tunnel.h>
+
+#include "rxe_ns.h"
+
+/*
+ * Per network namespace data
+ */
+struct rxe_ns_sock {
+	struct sock __rcu *rxe_sk4;
+	struct sock __rcu *rxe_sk6;
+};
+
+/*
+ * Index to store custom data for each network namespace.
+ */
+static unsigned int rxe_pernet_id;
+
+/*
+ * Called for every existing and added network namespaces
+ */
+static int __net_init rxe_ns_init(struct net *net)
+{
+	/*
+	 * create (if not present) and access data item in network namespace
+	 * (net) using the id (net_id)
+	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk4, NULL); /* initialize sock 4 socket */
+	rcu_assign_pointer(ns_sk->rxe_sk6, NULL); /* initialize sock 6 socket */
+	synchronize_rcu();
+
+	return 0;
+}
+
+static void __net_exit rxe_ns_exit(struct net *net)
+{
+	/*
+	 * called when the network namespace is removed
+	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *rxe_sk4 = NULL;
+	struct sock *rxe_sk6 = NULL;
+
+	rcu_read_lock();
+	rxe_sk4 = rcu_dereference(ns_sk->rxe_sk4);
+	rxe_sk6 = rcu_dereference(ns_sk->rxe_sk6);
+	rcu_read_unlock();
+
+	/* close socket */
+	if (rxe_sk4 && rxe_sk4->sk_socket) {
+		udp_tunnel_sock_release(rxe_sk4->sk_socket);
+		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
+		synchronize_rcu();
+	}
+
+	if (rxe_sk6 && rxe_sk6->sk_socket) {
+		udp_tunnel_sock_release(rxe_sk6->sk_socket);
+		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
+		synchronize_rcu();
+	}
+}
+
+/*
+ * callback to make the module network namespace aware
+ */
+static struct pernet_operations rxe_net_ops __net_initdata = {
+	.init = rxe_ns_init,
+	.exit = rxe_ns_exit,
+	.id = &rxe_pernet_id,
+	.size = sizeof(struct rxe_ns_sock),
+};
+
+struct sock *rxe_ns_pernet_sk4(struct net *net)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = rcu_dereference(ns_sk->rxe_sk4);
+	rcu_read_unlock();
+
+	return sk;
+}
+
+void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
+	synchronize_rcu();
+}
+
+struct sock *rxe_ns_pernet_sk6(struct net *net)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = rcu_dereference(ns_sk->rxe_sk6);
+	rcu_read_unlock();
+
+	return sk;
+}
+
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
+	synchronize_rcu();
+}
+
+int __init rxe_namespace_init(void)
+{
+	return register_pernet_subsys(&rxe_net_ops);
+}
+
+void __exit rxe_namespace_exit(void)
+{
+	unregister_pernet_subsys(&rxe_net_ops);
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
new file mode 100644
index 000000000000..a3eac9558889
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_ns.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#ifndef RXE_NS_H
+#define RXE_NS_H
+
+struct sock *rxe_ns_pernet_sk4(struct net *net);
+struct sock *rxe_ns_pernet_sk6(struct net *net);
+void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk);
+int __init rxe_namespace_init(void);
+void __exit rxe_namespace_exit(void);
+
+#endif /* RXE_NS_H */
-- 
2.34.1



* [PATCHv3 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (6 preceding siblings ...)
  2023-02-14  6:06 ` [PATCHv3 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
@ 2023-02-14  6:06 ` Zhu Yanjun
  2023-02-23 13:15   ` Zhu Yanjun
  2023-02-23  0:31 ` [PATCHv3 0/8] Fix the problem that rxe can not work " Zhu Yanjun
  2023-04-12 17:22 ` Mark Lehrer
  9 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-14  6:06 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, linux-rdma, parav, yanjun.zhu; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The sk6 sock kept in the per-net-namespace data can be used directly.
As such, l_sk6 can be replaced with it.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c       |  1 -
 drivers/infiniband/sw/rxe/rxe_net.c   | 20 +-------------------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 -
 3 files changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index c297677bf06a..3260f598a7fb 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -75,7 +75,6 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
-	rxe->l_sk6				= NULL;
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 8135876b11f6..ebcb86fa1e5e 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -50,24 +50,6 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 {
 	struct dst_entry *ndst;
 	struct flowi6 fl6 = { { 0 } };
-	struct rxe_dev *rdev;
-
-	rdev = rxe_get_dev_from_net(ndev);
-	if (!rdev->l_sk6) {
-		struct sock *sk;
-
-		rcu_read_lock();
-		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
-				     htons(ROCE_V2_UDP_DPORT), 0);
-		rcu_read_unlock();
-		if (!sk) {
-			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
-			return (struct dst_entry *)sk;
-		}
-		__sock_put(sk);
-		rdev->l_sk6 = sk->sk_socket;
-	}
-
 
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = ndev->ifindex;
@@ -76,7 +58,7 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 	fl6.flowi6_proto = IPPROTO_UDP;
 
 	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
-					       rdev->l_sk6->sk, &fl6,
+					       rxe_ns_pernet_sk6(dev_net(ndev)), &fl6,
 					       NULL);
 	if (IS_ERR(ndst)) {
 		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 52c4ef4d0305..19ddfa890480 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -408,7 +408,6 @@ struct rxe_dev {
 
 	struct rxe_port		port;
 	struct crypto_shash	*tfm;
-	struct socket		*l_sk6;
 };
 
 static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)
-- 
2.34.1



* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (7 preceding siblings ...)
  2023-02-14  6:06 ` [PATCHv3 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
@ 2023-02-23  0:31 ` Zhu Yanjun
  2023-02-23  4:56   ` Jakub Kicinski
  2023-02-25  8:43   ` Rain River
  2023-04-12 17:22 ` Mark Lehrer
  9 siblings, 2 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23  0:31 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> [...]
>
> V1->V2: Add the explicit initialization of sk6.

Add netdev@vger.kernel.org.

Zhu Yanjun




* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-02-23  0:31 ` [PATCHv3 0/8] Fix the problem that rxe can not work " Zhu Yanjun
@ 2023-02-23  4:56   ` Jakub Kicinski
  2023-02-23 11:42     ` Zhu Yanjun
  2023-02-25  8:43   ` Rain River
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Kicinski @ 2023-02-23  4:56 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev

On Thu, 23 Feb 2023 08:31:49 +0800 Zhu Yanjun wrote:
> > V1->V2: Add the explicit initialization of sk6.  
> 
> Add netdev@vger.kernel.org.

On the commit letter? Thanks, but that's not how it works. 
Repost the patches if you want us to see them.


* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-02-23  4:56   ` Jakub Kicinski
@ 2023-02-23 11:42     ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 11:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev


On 2023/2/23 12:56, Jakub Kicinski wrote:
> On Thu, 23 Feb 2023 08:31:49 +0800 Zhu Yanjun wrote:
>>> V1->V2: Add the explicit initialization of sk6.
>> Add netdev@vger.kernel.org.
> On the commit letter? Thanks, but that's not how it works.
> Repost the patches if you want us to see them.

Got it. I will resend all the commits.

Zhu Yanjun



* Re: [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function
  2023-02-14  6:06 ` [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
@ 2023-02-23 13:10   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:10 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> Originally the sock listening on udp port 4791 was created when the
> rdma_rxe module was loaded. Move the creation of this listening sock to
> the newlink function.
> 
> So the sock listening on udp port 4791 is created when the "rdma link
> add" command is run.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun



* Re: [PATCHv3 2/8] RDMA/rxe: Support more rdma links in init_net
  2023-02-14  6:06 ` [PATCHv3 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
@ 2023-02-23 13:10   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:10 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> In init_net, when several rdma links are created with the command "rdma
> link add", newlink checks whether udp port 4791 already has a listener.
> If not, a sock listening on udp port 4791 is created. If yes, the
> reference count of the existing sock is increased.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun

>   		return err;
>   	err = rxe_net_ipv6_init();
>   	if (err)
>   		goto err_out;
> -	err = register_netdevice_notifier(&rxe_net_notifier);
> -	if (err) {
> -		pr_err("Failed to register netdev notifier\n");
> -		goto err_out;
> -	}
>   	return 0;
>   err_out:
>   	rxe_net_exit();
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index 45d80d00f86b..a222c3eeae12 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -18,6 +18,7 @@ struct rxe_recv_sockets {
>   
>   int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
>   
> +int rxe_register_notifier(void);
>   int rxe_net_init(void);
>   void rxe_net_exit(void);
>   
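
A note on the reference counting relied on above: udp4_lib_lookup() and
udp6_lib_lookup() return the sock with a reference already taken
(refcount_inc_not_zero() on sk->sk_refcnt), so a successful lookup is
itself the "increase the reference count of the sock" step. A minimal
sketch of that contract (kernel context assumed; not rxe code):

	/* Does a sock already listen on the RoCEv2 port in this netns?
	 * udp4_lib_lookup() holds a reference on success, so a caller
	 * that only peeks must drop it again with sock_put().
	 */
	static bool roce_port_in_use(struct net *net)
	{
		struct sock *sk;

		sk = udp4_lib_lookup(net, 0, 0, htonl(INADDR_ANY),
				     htons(ROCE_V2_UDP_DPORT), 0);
		if (!sk)
			return false;

		sock_put(sk);
		return true;
	}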


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 3/8] RDMA/nldev: Add dellink function pointer
  2023-02-14  6:06 ` [PATCHv3 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
@ 2023-02-23 13:11   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:11 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> The sock listening on port 4791 is now created in the newlink function.
> So a dellink function pointer is needed so that the sock can be removed
> when the link is deleted.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun

> ---
>   drivers/infiniband/core/nldev.c | 6 ++++++
>   include/rdma/rdma_netlink.h     | 2 ++
>   2 files changed, 8 insertions(+)
> 
> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
> index d5d3e4f0de77..97a62685ed5b 100644
> --- a/drivers/infiniband/core/nldev.c
> +++ b/drivers/infiniband/core/nldev.c
> @@ -1758,6 +1758,12 @@ static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
>   		return -EINVAL;
>   	}
>   
> +	if (device->link_ops) {
> +		err = device->link_ops->dellink(device);
> +		if (err)
> +			return err;
> +	}
> +
>   	ib_unregister_device_and_put(device);
>   	return 0;
>   }
> diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
> index c2a79aeee113..bf9df004061f 100644
> --- a/include/rdma/rdma_netlink.h
> +++ b/include/rdma/rdma_netlink.h
> @@ -5,6 +5,7 @@
>   
>   #include <linux/netlink.h>
>   #include <uapi/rdma/rdma_netlink.h>
> +#include <rdma/ib_verbs.h>
>   
>   enum {
>   	RDMA_NLDEV_ATTR_EMPTY_STRING = 1,
> @@ -114,6 +115,7 @@ struct rdma_link_ops {
>   	struct list_head list;
>   	const char *type;
>   	int (*newlink)(const char *ibdev_name, struct net_device *ndev);
> +	int (*dellink)(struct ib_device *dev);
>   };
>   
>   void rdma_link_register(struct rdma_link_ops *ops);


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 4/8] RDMA/rxe: Implement dellink in rxe
  2023-02-14  6:06 ` [PATCHv3 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
@ 2023-02-23 13:12   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:12 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> When running the "rdma link del" command, the dellink function is
> called. If the sock refcnt is greater than the refcnt needed by the udp
> tunnel, the sock refcnt is decreased by 1.
> 
> If they are equal, the last rdma link is being removed, so the udp
> tunnel is destroyed.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun

> ---
>   drivers/infiniband/sw/rxe/rxe.c     | 12 +++++++++++-
>   drivers/infiniband/sw/rxe/rxe_net.c | 17 +++++++++++++++--
>   drivers/infiniband/sw/rxe/rxe_net.h |  1 +
>   3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 0ce6adb43cfc..ebfabc6d6b76 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -166,10 +166,12 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
>   /* called by ifc layer to create new rxe device.
>    * The caller should allocate memory for rxe by calling ib_alloc_device.
>    */
> +static struct rdma_link_ops rxe_link_ops;
>   int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name)
>   {
>   	rxe_init(rxe);
>   	rxe_set_mtu(rxe, mtu);
> +	rxe->ib_dev.link_ops = &rxe_link_ops;
>   
>   	return rxe_register_device(rxe, ibdev_name);
>   }
> @@ -206,9 +208,17 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>   	return err;
>   }
>   
> -struct rdma_link_ops rxe_link_ops = {
> +static int rxe_dellink(struct ib_device *dev)
> +{
> +	rxe_net_del(dev);
> +
> +	return 0;
> +}
> +
> +static struct rdma_link_ops rxe_link_ops = {
>   	.type = "rxe",
>   	.newlink = rxe_newlink,
> +	.dellink = rxe_dellink,
>   };
>   
>   static int __init rxe_module_init(void)
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 3ca92e062800..4cc7de7b115b 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -530,6 +530,21 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
>   	return 0;
>   }
>   
> +#define SK_REF_FOR_TUNNEL	2
> +void rxe_net_del(struct ib_device *dev)
> +{
> +	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> +		__sock_put(recv_sockets.sk6->sk);
> +	else
> +		rxe_release_udp_tunnel(recv_sockets.sk6);
> +
> +	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> +		__sock_put(recv_sockets.sk4->sk);
> +	else
> +		rxe_release_udp_tunnel(recv_sockets.sk4);
> +}
> +#undef SK_REF_FOR_TUNNEL
> +
>   static void rxe_port_event(struct rxe_dev *rxe,
>   			   enum ib_event_type event)
>   {
> @@ -689,8 +704,6 @@ int rxe_register_notifier(void)
>   
>   void rxe_net_exit(void)
>   {
> -	rxe_release_udp_tunnel(recv_sockets.sk6);
> -	rxe_release_udp_tunnel(recv_sockets.sk4);
>   	unregister_netdevice_notifier(&rxe_net_notifier);
>   }
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index a222c3eeae12..f48f22f3353b 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -17,6 +17,7 @@ struct rxe_recv_sockets {
>   };
>   
>   int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
> +void rxe_net_del(struct ib_device *dev);
>   
>   int rxe_register_notifier(void);
>   int rxe_net_init(void);
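
To make the accounting concrete, a sketch of the intended counting
(assuming a freshly created udp tunnel sock sits at the baseline
SK_REF_FOR_TUNNEL == 2 and each additional rdma link holds one extra
reference):

	links left   sk_refcnt   action on "rdma link del"
	    2            3       __sock_put()              (drop one ref)
	    1            2       rxe_release_udp_tunnel()  (last link)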


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 5/8] RDMA/rxe: Replace global variable with sock lookup functions
  2023-02-14  6:06 ` [PATCHv3 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
@ 2023-02-23 13:13   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:13 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> Originally a global variable kept the sock listening on udp port 4791.
> In fact, the sock lookup functions can be used to get this sock
> instead.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun

> ---
>   drivers/infiniband/sw/rxe/rxe.c       |  1 +
>   drivers/infiniband/sw/rxe/rxe_net.c   | 58 ++++++++++++++++++++-------
>   drivers/infiniband/sw/rxe/rxe_net.h   |  5 ---
>   drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
>   4 files changed, 45 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index ebfabc6d6b76..e81c2164d77f 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -74,6 +74,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
>   			rxe->ndev->dev_addr);
>   
>   	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
> +	rxe->l_sk6				= NULL;
>   }
>   
>   /* initialize port attributes */
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 4cc7de7b115b..b56e2c32fbf7 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -18,8 +18,6 @@
>   #include "rxe_net.h"
>   #include "rxe_loc.h"
>   
> -static struct rxe_recv_sockets recv_sockets;
> -
>   static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
>   					 struct net_device *ndev,
>   					 struct in_addr *saddr,
> @@ -51,6 +49,23 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   {
>   	struct dst_entry *ndst;
>   	struct flowi6 fl6 = { { 0 } };
> +	struct rxe_dev *rdev;
> +
> +	rdev = rxe_get_dev_from_net(ndev);
> +	if (!rdev->l_sk6) {
> +		struct sock *sk;
> +
> +		rcu_read_lock();
> +		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
> +		rcu_read_unlock();
> +		if (!sk) {
> +			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
> +			return (struct dst_entry *)sk;
> +		}
> +		__sock_put(sk);
> +		rdev->l_sk6 = sk->sk_socket;
> +	}
> +
>   
>   	memset(&fl6, 0, sizeof(fl6));
>   	fl6.flowi6_oif = ndev->ifindex;
> @@ -58,8 +73,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
>   	fl6.flowi6_proto = IPPROTO_UDP;
>   
> -	ndst = ipv6_stub->ipv6_dst_lookup_flow(sock_net(recv_sockets.sk6->sk),
> -					       recv_sockets.sk6->sk, &fl6,
> +	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
> +					       rdev->l_sk6->sk, &fl6,
>   					       NULL);
>   	if (IS_ERR(ndst)) {
>   		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
> @@ -533,15 +548,33 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
>   #define SK_REF_FOR_TUNNEL	2
>   void rxe_net_del(struct ib_device *dev)
>   {
> -	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> -		__sock_put(recv_sockets.sk6->sk);
> +	struct sock *sk;
> +
> +	rcu_read_lock();
> +	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
> +	rcu_read_unlock();
> +	if (!sk)
> +		return;
> +
> +	__sock_put(sk);
> +
> +	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> +		__sock_put(sk);
>   	else
> -		rxe_release_udp_tunnel(recv_sockets.sk6);
> +		rxe_release_udp_tunnel(sk->sk_socket);
> +
> +	rcu_read_lock();
> +	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
> +	rcu_read_unlock();
> +	if (!sk)
> +		return;
> +
> +	__sock_put(sk);
>   
> -	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> -		__sock_put(recv_sockets.sk4->sk);
> +	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> +		__sock_put(sk);
>   	else
> -		rxe_release_udp_tunnel(recv_sockets.sk4);
> +		rxe_release_udp_tunnel(sk->sk_socket);
>   }
>   #undef SK_REF_FOR_TUNNEL
>   
> @@ -651,10 +684,8 @@ static int rxe_net_ipv4_init(void)
>   	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
>   	if (IS_ERR(sock)) {
>   		pr_err("Failed to create IPv4 UDP tunnel\n");
> -		recv_sockets.sk4 = NULL;
>   		return -1;
>   	}
> -	recv_sockets.sk4 = sock;
>   
>   	return 0;
>   }
> @@ -674,17 +705,14 @@ static int rxe_net_ipv6_init(void)
>   
>   	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
>   	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
> -		recv_sockets.sk6 = NULL;
>   		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
>   		return 0;
>   	}
>   
>   	if (IS_ERR(sock)) {
> -		recv_sockets.sk6 = NULL;
>   		pr_err("Failed to create IPv6 UDP tunnel\n");
>   		return -1;
>   	}
> -	recv_sockets.sk6 = sock;
>   #endif
>   	return 0;
>   }
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index f48f22f3353b..027b20e1bab6 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -11,11 +11,6 @@
>   #include <net/if_inet6.h>
>   #include <linux/module.h>
>   
> -struct rxe_recv_sockets {
> -	struct socket *sk4;
> -	struct socket *sk6;
> -};
> -
>   int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
>   void rxe_net_del(struct ib_device *dev);
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index 19ddfa890480..52c4ef4d0305 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -408,6 +408,7 @@ struct rxe_dev {
>   
>   	struct rxe_port		port;
>   	struct crypto_shash	*tfm;
> +	struct socket		*l_sk6;
>   };
>   
>   static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 6/8] RDMA/rxe: add the support of net namespace
  2023-02-14  6:06 ` [PATCHv3 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
@ 2023-02-23 13:14   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:14 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> Originally init_net was used unconditionally as the net namespace.
> Now the net namespace of the underlying net device is used instead, so
> more net namespaces are supported.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun

> ---
>   drivers/infiniband/sw/rxe/rxe.c     |  2 +-
>   drivers/infiniband/sw/rxe/rxe_net.c | 33 +++++++++++++++++------------
>   drivers/infiniband/sw/rxe/rxe_net.h |  2 +-
>   3 files changed, 22 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index e81c2164d77f..4a17e4a003f5 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -196,7 +196,7 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>   		goto err;
>   	}
>   
> -	err = rxe_net_init();
> +	err = rxe_net_init(ndev);
>   	if (err)
>   		return err;
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index b56e2c32fbf7..9af90587642a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -32,7 +32,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
>   	memcpy(&fl.daddr, daddr, sizeof(*daddr));
>   	fl.flowi4_proto = IPPROTO_UDP;
>   
> -	rt = ip_route_output_key(&init_net, &fl);
> +	rt = ip_route_output_key(dev_net(ndev), &fl);
>   	if (IS_ERR(rt)) {
>   		rxe_dbg_qp(qp, "no route to %pI4\n", &daddr->s_addr);
>   		return NULL;
> @@ -56,7 +56,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   		struct sock *sk;
>   
>   		rcu_read_lock();
> -		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
> +		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
> +				     htons(ROCE_V2_UDP_DPORT), 0);
>   		rcu_read_unlock();
>   		if (!sk) {
>   			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
> @@ -549,9 +550,13 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
>   void rxe_net_del(struct ib_device *dev)
>   {
>   	struct sock *sk;
> +	struct rxe_dev *rdev;
> +
> +	rdev = container_of(dev, struct rxe_dev, ib_dev);
>   
>   	rcu_read_lock();
> -	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
> +	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
> +			     htons(ROCE_V2_UDP_DPORT), 0);
>   	rcu_read_unlock();
>   	if (!sk)
>   		return;
> @@ -564,7 +569,8 @@ void rxe_net_del(struct ib_device *dev)
>   		rxe_release_udp_tunnel(sk->sk_socket);
>   
>   	rcu_read_lock();
> -	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
> +	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
> +			     htons(ROCE_V2_UDP_DPORT), 0);
>   	rcu_read_unlock();
>   	if (!sk)
>   		return;
> @@ -636,6 +642,7 @@ static int rxe_notify(struct notifier_block *not_blk,
>   	switch (event) {
>   	case NETDEV_UNREGISTER:
>   		ib_unregister_device_queued(&rxe->ib_dev);
> +		rxe_net_del(&rxe->ib_dev);
>   		break;
>   	case NETDEV_UP:
>   		rxe_port_up(rxe);
> @@ -669,19 +676,19 @@ static struct notifier_block rxe_net_notifier = {
>   	.notifier_call = rxe_notify,
>   };
>   
> -static int rxe_net_ipv4_init(void)
> +static int rxe_net_ipv4_init(struct net_device *ndev)
>   {
>   	struct sock *sk;
>   	struct socket *sock;
>   
>   	rcu_read_lock();
> -	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
> +	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
>   			     htons(ROCE_V2_UDP_DPORT), 0);
>   	rcu_read_unlock();
>   	if (sk)
>   		return 0;
>   
> -	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
> +	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
>   	if (IS_ERR(sock)) {
>   		pr_err("Failed to create IPv4 UDP tunnel\n");
>   		return -1;
> @@ -690,20 +697,20 @@ static int rxe_net_ipv4_init(void)
>   	return 0;
>   }
>   
> -static int rxe_net_ipv6_init(void)
> +static int rxe_net_ipv6_init(struct net_device *ndev)
>   {
>   #if IS_ENABLED(CONFIG_IPV6)
>   	struct sock *sk;
>   	struct socket *sock;
>   
>   	rcu_read_lock();
> -	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
> +	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
>   			     htons(ROCE_V2_UDP_DPORT), 0);
>   	rcu_read_unlock();
>   	if (sk)
>   		return 0;
>   
> -	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
> +	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
>   	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
>   		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
>   		return 0;
> @@ -735,14 +742,14 @@ void rxe_net_exit(void)
>   	unregister_netdevice_notifier(&rxe_net_notifier);
>   }
>   
> -int rxe_net_init(void)
> +int rxe_net_init(struct net_device *ndev)
>   {
>   	int err;
>   
> -	err = rxe_net_ipv4_init();
> +	err = rxe_net_ipv4_init(ndev);
>   	if (err)
>   		return err;
> -	err = rxe_net_ipv6_init();
> +	err = rxe_net_ipv6_init(ndev);
>   	if (err)
>   		goto err_out;
>   	return 0;
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index 027b20e1bab6..56249677d692 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -15,7 +15,7 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
>   void rxe_net_del(struct ib_device *dev);
>   
>   int rxe_register_notifier(void);
> -int rxe_net_init(void);
> +int rxe_net_init(struct net_device *ndev);
>   void rxe_net_exit(void);
>   
>   #endif /* RXE_NET_H */


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 7/8] RDMA/rxe: Add the support of net namespace notifier
  2023-02-14  6:06 ` [PATCHv3 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
@ 2023-02-23 13:14   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:14 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> The functions register_pernet_subsys/unregister_pernet_subsys register
> and unregister a net namespace notifier. When a new net namespace is
> created, the init function of rxe is called to initialize the sk4 and
> sk6 socks. When a net namespace is destroyed, the exit function is
> called to release the sk4 and sk6 socks.
> 
> The functions rxe_ns_pernet_sk4 and rxe_ns_pernet_sk6 are used to get
> sk4 and sk6 socks.
> 
> The functions rxe_ns_pernet_set_sk4 and rxe_ns_pernet_set_sk6 are used
> to set sk4 and sk6 socks.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun
> ---
>   drivers/infiniband/sw/rxe/Makefile  |   3 +-
>   drivers/infiniband/sw/rxe/rxe.c     |   9 ++
>   drivers/infiniband/sw/rxe/rxe_net.c |  50 +++++------
>   drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>   drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>   5 files changed, 187 insertions(+), 26 deletions(-)
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
> 
> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> index 5395a581f4bb..8380f97674cb 100644
> --- a/drivers/infiniband/sw/rxe/Makefile
> +++ b/drivers/infiniband/sw/rxe/Makefile
> @@ -22,4 +22,5 @@ rdma_rxe-y := \
>   	rxe_mcast.o \
>   	rxe_task.o \
>   	rxe_net.o \
> -	rxe_hw_counters.o
> +	rxe_hw_counters.o \
> +	rxe_ns.o
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 4a17e4a003f5..c297677bf06a 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -9,6 +9,7 @@
>   #include "rxe.h"
>   #include "rxe_loc.h"
>   #include "rxe_net.h"
> +#include "rxe_ns.h"
>   
>   MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
>   MODULE_DESCRIPTION("Soft RDMA transport");
> @@ -234,6 +235,12 @@ static int __init rxe_module_init(void)
>   		return -1;
>   	}
>   
> +	err = rxe_namespace_init();
> +	if (err) {
> +		pr_err("Failed to register net namespace notifier\n");
> +		return -1;
> +	}
> +
>   	pr_info("loaded\n");
>   	return 0;
>   }
> @@ -244,6 +251,8 @@ static void __exit rxe_module_exit(void)
>   	ib_unregister_driver(RDMA_DRIVER_RXE);
>   	rxe_net_exit();
>   
> +	rxe_namespace_exit();
> +
>   	pr_info("unloaded\n");
>   }
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 9af90587642a..8135876b11f6 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -17,6 +17,7 @@
>   #include "rxe.h"
>   #include "rxe_net.h"
>   #include "rxe_loc.h"
> +#include "rxe_ns.h"
>   
>   static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
>   					 struct net_device *ndev,
> @@ -554,33 +555,30 @@ void rxe_net_del(struct ib_device *dev)
>   
>   	rdev = container_of(dev, struct rxe_dev, ib_dev);
>   
> -	rcu_read_lock();
> -	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
> -			     htons(ROCE_V2_UDP_DPORT), 0);
> -	rcu_read_unlock();
> +	sk = rxe_ns_pernet_sk4(dev_net(rdev->ndev));
>   	if (!sk)
>   		return;
>   
> -	__sock_put(sk);
>   
> -	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> +	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
>   		__sock_put(sk);
> -	else
> +	} else {
>   		rxe_release_udp_tunnel(sk->sk_socket);
> +		sk = NULL;
> +		rxe_ns_pernet_set_sk4(dev_net(rdev->ndev), sk);
> +	}
>   
> -	rcu_read_lock();
> -	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
> -			     htons(ROCE_V2_UDP_DPORT), 0);
> -	rcu_read_unlock();
> +	sk = rxe_ns_pernet_sk6(dev_net(rdev->ndev));
>   	if (!sk)
>   		return;
>   
> -	__sock_put(sk);
> -
> -	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
> +	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
>   		__sock_put(sk);
> -	else
> +	} else {
>   		rxe_release_udp_tunnel(sk->sk_socket);
> +		sk = NULL;
> +		rxe_ns_pernet_set_sk6(dev_net(rdev->ndev), sk);
> +	}
>   }
>   #undef SK_REF_FOR_TUNNEL
>   
> @@ -681,18 +679,18 @@ static int rxe_net_ipv4_init(struct net_device *ndev)
>   	struct sock *sk;
>   	struct socket *sock;
>   
> -	rcu_read_lock();
> -	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
> -			     htons(ROCE_V2_UDP_DPORT), 0);
> -	rcu_read_unlock();
> -	if (sk)
> +	sk = rxe_ns_pernet_sk4(dev_net(ndev));
> +	if (sk) {
> +		sock_hold(sk);
>   		return 0;
> +	}
>   
>   	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
>   	if (IS_ERR(sock)) {
>   		pr_err("Failed to create IPv4 UDP tunnel\n");
>   		return -1;
>   	}
> +	rxe_ns_pernet_set_sk4(dev_net(ndev), sock->sk);
>   
>   	return 0;
>   }
> @@ -703,12 +701,11 @@ static int rxe_net_ipv6_init(struct net_device *ndev)
>   	struct sock *sk;
>   	struct socket *sock;
>   
> -	rcu_read_lock();
> -	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
> -			     htons(ROCE_V2_UDP_DPORT), 0);
> -	rcu_read_unlock();
> -	if (sk)
> +	sk = rxe_ns_pernet_sk6(dev_net(ndev));
> +	if (sk) {
> +		sock_hold(sk);
>   		return 0;
> +	}
>   
>   	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
>   	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
> @@ -720,6 +717,9 @@ static int rxe_net_ipv6_init(struct net_device *ndev)
>   		pr_err("Failed to create IPv6 UDP tunnel\n");
>   		return -1;
>   	}
> +
> +	rxe_ns_pernet_set_sk6(dev_net(ndev), sock->sk);
> +
>   #endif
>   	return 0;
>   }
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
> new file mode 100644
> index 000000000000..29d08899dcda
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.c
> @@ -0,0 +1,134 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#include <net/sock.h>
> +#include <net/netns/generic.h>
> +#include <net/net_namespace.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/pid_namespace.h>
> +#include <net/udp_tunnel.h>
> +
> +#include "rxe_ns.h"
> +
> +/*
> + * Per network namespace data
> + */
> +struct rxe_ns_sock {
> +	struct sock __rcu *rxe_sk4;
> +	struct sock __rcu *rxe_sk6;
> +};
> +
> +/*
> + * Index to store custom data for each network namespace.
> + */
> +static unsigned int rxe_pernet_id;
> +
> +/*
> + * Called for every existing and added network namespaces
> + */
> +static int __net_init rxe_ns_init(struct net *net)
> +{
> +	/*
> +	 * create (if not present) and access data item in network namespace
> +	 * (net) using the id (net_id)
> +	 */
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +
> +	rcu_assign_pointer(ns_sk->rxe_sk4, NULL); /* initialize sock 4 socket */
> +	rcu_assign_pointer(ns_sk->rxe_sk6, NULL); /* initialize sock 6 socket */
> +	synchronize_rcu();
> +
> +	return 0;
> +}
> +
> +static void __net_exit rxe_ns_exit(struct net *net)
> +{
> +	/*
> +	 * called when the network namespace is removed
> +	 */
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *rxe_sk4 = NULL;
> +	struct sock *rxe_sk6 = NULL;
> +
> +	rcu_read_lock();
> +	rxe_sk4 = rcu_dereference(ns_sk->rxe_sk4);
> +	rxe_sk6 = rcu_dereference(ns_sk->rxe_sk6);
> +	rcu_read_unlock();
> +
> +	/* close socket */
> +	if (rxe_sk4 && rxe_sk4->sk_socket) {
> +		udp_tunnel_sock_release(rxe_sk4->sk_socket);
> +		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
> +		synchronize_rcu();
> +	}
> +
> +	if (rxe_sk6 && rxe_sk6->sk_socket) {
> +		udp_tunnel_sock_release(rxe_sk6->sk_socket);
> +		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
> +		synchronize_rcu();
> +	}
> +}
> +
> +/*
> + * callback to make the module network namespace aware
> + */
> +static struct pernet_operations rxe_net_ops __net_initdata = {
> +	.init = rxe_ns_init,
> +	.exit = rxe_ns_exit,
> +	.id = &rxe_pernet_id,
> +	.size = sizeof(struct rxe_ns_sock),
> +};
> +
> +struct sock *rxe_ns_pernet_sk4(struct net *net)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *sk;
> +
> +	rcu_read_lock();
> +	sk = rcu_dereference(ns_sk->rxe_sk4);
> +	rcu_read_unlock();
> +
> +	return sk;
> +}
> +
> +void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +
> +	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
> +	synchronize_rcu();
> +}
> +
> +struct sock *rxe_ns_pernet_sk6(struct net *net)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *sk;
> +
> +	rcu_read_lock();
> +	sk = rcu_dereference(ns_sk->rxe_sk6);
> +	rcu_read_unlock();
> +
> +	return sk;
> +}
> +
> +void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +
> +	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
> +	synchronize_rcu();
> +}
> +
> +int __init rxe_namespace_init(void)
> +{
> +	return register_pernet_subsys(&rxe_net_ops);
> +}
> +
> +void __exit rxe_namespace_exit(void)
> +{
> +	unregister_pernet_subsys(&rxe_net_ops);
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
> new file mode 100644
> index 000000000000..a3eac9558889
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#ifndef RXE_NS_H
> +#define RXE_NS_H
> +
> +struct sock *rxe_ns_pernet_sk4(struct net *net);
> +struct sock *rxe_ns_pernet_sk6(struct net *net);
> +void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
> +void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk);
> +int __init rxe_namespace_init(void);
> +void __exit rxe_namespace_exit(void);
> +
> +#endif /* RXE_NS_H */
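
For context: because rxe_net_ops sets .id and .size,
register_pernet_subsys() makes the core allocate a zeroed struct
rxe_ns_sock for every net namespace, reachable from any code path that
can produce a struct net. A minimal sketch of a reader (hypothetical
helper; other names as in this patch):

	/* Is a RoCEv2 tunnel already set up in ndev's netns? */
	static bool rxe_ns_has_sk4(struct net_device *ndev)
	{
		struct rxe_ns_sock *ns_sk = net_generic(dev_net(ndev),
							rxe_pernet_id);
		bool ret;

		rcu_read_lock();
		ret = rcu_dereference(ns_sk->rxe_sk4) != NULL;
		rcu_read_unlock();

		return ret;
	}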


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace
  2023-02-14  6:06 ` [PATCHv3 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
@ 2023-02-23 13:15   ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-02-23 13:15 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev; +Cc: Zhu Yanjun

On 2023/2/14 14:06, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> The sk6 sock stored in the per-net namespace data can be used directly.
> As such, l_sk6 can be replaced with it.
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Add netdev@vger.kernel.org.

Zhu Yanjun

> ---
>   drivers/infiniband/sw/rxe/rxe.c       |  1 -
>   drivers/infiniband/sw/rxe/rxe_net.c   | 20 +-------------------
>   drivers/infiniband/sw/rxe/rxe_verbs.h |  1 -
>   3 files changed, 1 insertion(+), 21 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index c297677bf06a..3260f598a7fb 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -75,7 +75,6 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
>   			rxe->ndev->dev_addr);
>   
>   	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
> -	rxe->l_sk6				= NULL;
>   }
>   
>   /* initialize port attributes */
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 8135876b11f6..ebcb86fa1e5e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -50,24 +50,6 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   {
>   	struct dst_entry *ndst;
>   	struct flowi6 fl6 = { { 0 } };
> -	struct rxe_dev *rdev;
> -
> -	rdev = rxe_get_dev_from_net(ndev);
> -	if (!rdev->l_sk6) {
> -		struct sock *sk;
> -
> -		rcu_read_lock();
> -		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
> -				     htons(ROCE_V2_UDP_DPORT), 0);
> -		rcu_read_unlock();
> -		if (!sk) {
> -			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
> -			return (struct dst_entry *)sk;
> -		}
> -		__sock_put(sk);
> -		rdev->l_sk6 = sk->sk_socket;
> -	}
> -
>   
>   	memset(&fl6, 0, sizeof(fl6));
>   	fl6.flowi6_oif = ndev->ifindex;
> @@ -76,7 +58,7 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   	fl6.flowi6_proto = IPPROTO_UDP;
>   
>   	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
> -					       rdev->l_sk6->sk, &fl6,
> +					       rxe_ns_pernet_sk6(dev_net(ndev)), &fl6,
>   					       NULL);
>   	if (IS_ERR(ndst)) {
>   		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index 52c4ef4d0305..19ddfa890480 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -408,7 +408,6 @@ struct rxe_dev {
>   
>   	struct rxe_port		port;
>   	struct crypto_shash	*tfm;
> -	struct socket		*l_sk6;
>   };
>   
>   static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-02-23  0:31 ` [PATCHv3 0/8] Fix the problem that rxe can not work " Zhu Yanjun
  2023-02-23  4:56   ` Jakub Kicinski
@ 2023-02-25  8:43   ` Rain River
  1 sibling, 0 replies; 39+ messages in thread
From: Rain River @ 2023-02-25  8:43 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav, netdev

On Thu, Feb 23, 2023 at 8:37 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
> On 2023/2/14 14:06, Zhu Yanjun wrote:
> > From: Zhu Yanjun <yanjun.zhu@linux.dev>
> >
> > When run "ip link add" command to add a rxe rdma link in a net
> > namespace, normally this rxe rdma link can not work in a net
> > namespace.
> >
> > The root cause is that a sock listening on udp port 4791 is created
> > in init_net when the rdma_rxe module is loaded into kernel. That is,
> > the sock listening on udp port 4791 is created in init_net. Other net
> > namespace is difficult to use this sock.
> >
> > The following commits will solve this problem.
> >
> > In the first commit, move the creating sock listening on udp port 4791
> > from module_init function to rdma link creating functions. That is,
> > after the module rdma_rxe is loaded, the sock will not be created.
> > When run "rdma link add ..." command, the sock will be created. So
> > when creating a rdma link in the net namespace, the sock will be
> > created in this net namespace.
> >
> > In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> > will check the sock exists in the net namespace or not. If yes, rdma
> > link will increase the reference count of this sock, then continue other
> > jobs instead of creating a new sock to listen on udp port 4791. Since the
> > network notifier is global, when the module rdma_rxe is loaded, this
> > notifier will be registered.
> >
> > After the rdma link is created, the command "rdma link del" is to
> > delete rdma link at the same time the sock is checked. If the reference
> > count of this sock is greater than the sock reference count needed by
> > udp tunnel, the sock reference count is decreased by one. If equal, it
> > indicates that this rdma link is the last one. As such, the udp tunnel
> > is shut down and the sock is closed. The above work should be
> > implemented in linkdel function. But currently no dellink function in
> > rxe. So the 3rd commit adds the dellink function pointer. And the 4th
> > commit implements the dellink function in rxe.
> >
> > To now, it is not necessary to keep a global variable to store the sock
> > listening udp port 4791. This global variable can be replaced by the
> > functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
> > function udp6_lib_lookup is in the fast path, a member variable l_sk6
> > is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
> > to lookup the sock, then the sock is stored in l_sk6; in the future, it
> > can be used directly.
> >
> > All the above work has been done in init_net. And it can also work in
> > the net namespace. So the init_net is replaced by the individual net
> > namespace. This is what the 6th commit does. Because rxe device is
> > dependent on the net device and the sock listening on udp port 4791,
> > every rxe device is in exclusive mode in the individual net namespace.
> > Other rdma netns operations will be considered in the future.
> >
> > In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
> > functions are added. When a new net namespace is created, the init
> > function will initialize the sk4 and sk6 socks. Then the 2 socks will
> > be released when the net namespace is destroyed. The functions
> > rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
> > namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
> > handle sk6. Then sk4 and sk6 are used in the previous commits.
> >
> > As the sk4 and sk6 in pernet namespace can be accessed, it is not
> > necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
> > replaced with the sk6 in pernet namespace.
> >
> > Test steps:
> > 1) Suppose that 2 NICs are in 2 different net namespaces.
> >
> >    # ip netns exec net0 ip link
> >    3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> >       link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
> >       altname enp5s0
> >
> >    # ip netns exec net1 ip link
> >    4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
> >       link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
> >
> > 2) Add rdma link in the different net namespace
> >      net0:
> >      # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
> >
> >      net1:
> >      # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
> >
> > 3) Run rping test.
> >      net0
> >      # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
> >      [1] 1737
> >      # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
> >      verbose
> >      count 1
> >      ...
> >      ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
> >      ...
> >
> > 4) Remove the rdma links from the net namespaces.
> >      net0:
> >      # ip netns exec net0 ss -lu
> >      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> >      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
> >      UNCONN    0         0         [::]:4791             [::]:*
> >
> >      # ip netns exec net0 rdma link del rxe0
> >
> >      # ip netns exec net0 ss -lu
> >      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> >
> >      net1:
> >      # ip netns exec net0 ss -lu
> >      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> >      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
> >      UNCONN    0         0         [::]:4791             [::]:*
> >
> >      # ip netns exec net1 rdma link del rxe1
> >
> >      # ip netns exec net0 ss -lu
> >      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> >
> > V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
> >             verify rdma link is removed.
> >          2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
> >          3) Replace l_sk6 with sk6 of pernet_name_space

Thanks,

Tested-by: Rain River <rain.1986.08.12@gmail.com>

> >
> > V1->V2: Add the explicit initialization of sk6.
>
> Add netdev@vger.kernel.org.
>
> Zhu Yanjun
>
> >
> > Zhu Yanjun (8):
> >    RDMA/rxe: Creating listening sock in newlink function
> >    RDMA/rxe: Support more rdma links in init_net
> >    RDMA/nldev: Add dellink function pointer
> >    RDMA/rxe: Implement dellink in rxe
> >    RDMA/rxe: Replace global variable with sock lookup functions
> >    RDMA/rxe: add the support of net namespace
> >    RDMA/rxe: Add the support of net namespace notifier
> >    RDMA/rxe: Replace l_sk6 with sk6 in net namespace
> >
> >   drivers/infiniband/core/nldev.c     |   6 ++
> >   drivers/infiniband/sw/rxe/Makefile  |   3 +-
> >   drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
> >   drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++-------
> >   drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
> >   drivers/infiniband/sw/rxe/rxe_ns.c  | 128 ++++++++++++++++++++++++++++
> >   drivers/infiniband/sw/rxe/rxe_ns.h  |  11 +++
> >   include/rdma/rdma_netlink.h         |   2 +
> >   8 files changed, 267 insertions(+), 40 deletions(-)
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
> >
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (8 preceding siblings ...)
  2023-02-23  0:31 ` [PATCHv3 0/8] Fix the problem that rxe can not work " Zhu Yanjun
@ 2023-04-12 17:22 ` Mark Lehrer
  2023-04-12 21:01   ` Mark Lehrer
  2023-04-13  7:17   ` Zhu Yanjun
  9 siblings, 2 replies; 39+ messages in thread
From: Mark Lehrer @ 2023-04-12 17:22 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: jgg, leon, zyjzyj2000, linux-rdma, parav, Zhu Yanjun

> When run "ip link add" command to add a rxe rdma link in a net
> namespace, normally this rxe rdma link can not work in a net
> namespace.

Thank you for this patch, Yanjun!  It is very helpful for some
research I'm doing.  I just tested the patch and now I have success
with utilities like rping and ib_send_bw.  It looks like rdma_cm is at
least doing the basics with no problems.

However, I am still not able to "nvme discover" - this fails with
rdma_resolve_addr error -101.  It looks like this function is part of
rdma_cma.  Is this expected to work, or is more patching needed for
nvme-cli to have success?

It looks like the kernel nvme-fabrics driver is making the call to
rdma_resolve_addr here.  According to strace, nvme-cli is just opening
the fabrics device and writing the host NQN etc.  Is there an easy way
to prove that rdma_resolve_addr is working from userland?

Thanks,
Mark
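
For reference, -101 is -ENETUNREACH, and the failing step is of the
form below (address and service id are hypothetical, mirroring the
rping setup from the cover letter):

  # ip netns exec net1 nvme discover -t rdma -a 192.168.2.1 -s 4420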



On Mon, Feb 13, 2023 at 11:13 PM Zhu Yanjun <yanjun.zhu@intel.com> wrote:
>
> [snipped: full PATCHv3 cover letter and test steps, quoted unchanged
> from the original posting]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-12 17:22 ` Mark Lehrer
@ 2023-04-12 21:01   ` Mark Lehrer
  2023-04-13  7:22     ` Zhu Yanjun
  2023-04-13  7:17   ` Zhu Yanjun
  1 sibling, 1 reply; 39+ messages in thread
From: Mark Lehrer @ 2023-04-12 21:01 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: jgg, leon, zyjzyj2000, linux-rdma, parav, Zhu Yanjun

> the fabrics device and writing the host NQN etc.  Is there an easy way
> to prove that rdma_resolve_addr is working from userland?

Actually I meant "is there a way to prove that the kernel
rdma_resolve_addr() works with netns?"

It seems like this is the real problem.  If we run commands like nvme
discover & nvme connect within the netns context, the system will use
the non-netns IP & RDMA stacks to connect.  As an aside - this seems
like it would be a major security issue for container systems, doesn't
it?

I'll investigate to see if the fabrics module & nvme-cli have a way to
set and use the proper netns context.

Thanks,
Mark
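
One way to test exactly that from userland is to drive librdmacm's
rdma_resolve_addr() directly and run the probe under "ip netns exec".
A minimal sketch (file name, address handling and timeout are
arbitrary; this is the userspace analogue of the call the nvme-rdma
host driver makes in-kernel):

  /* addr_probe.c - build: gcc addr_probe.c -o addr_probe -lrdmacm
   * run:          ip netns exec net1 ./addr_probe 192.168.2.1
   */
  #include <stdio.h>
  #include <string.h>
  #include <arpa/inet.h>
  #include <rdma/rdma_cma.h>

  int main(int argc, char **argv)
  {
  	struct rdma_event_channel *ch;
  	struct rdma_cm_id *id;
  	struct rdma_cm_event *ev;
  	struct sockaddr_in dst;

  	memset(&dst, 0, sizeof(dst));
  	dst.sin_family = AF_INET;
  	if (argc != 2 || inet_pton(AF_INET, argv[1], &dst.sin_addr) != 1) {
  		fprintf(stderr, "usage: %s <dst-ip>\n", argv[0]);
  		return 1;
  	}

  	ch = rdma_create_event_channel();
  	if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
  		perror("rdma_create_id");
  		return 1;
  	}

  	/* The call under test; resolves dst to an RDMA device. */
  	if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000)) {
  		perror("rdma_resolve_addr");
  		return 1;
  	}

  	/* RDMA_CM_EVENT_ADDR_RESOLVED means resolution worked here. */
  	if (rdma_get_cm_event(ch, &ev) == 0) {
  		printf("event: %s\n", rdma_event_str(ev->event));
  		rdma_ack_cm_event(ev);
  	}

  	rdma_destroy_id(id);
  	rdma_destroy_event_channel(ch);
  	return 0;
  }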

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-12 17:22 ` Mark Lehrer
  2023-04-12 21:01   ` Mark Lehrer
@ 2023-04-13  7:17   ` Zhu Yanjun
  1 sibling, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-04-13  7:17 UTC (permalink / raw)
  To: Mark Lehrer, Zhu Yanjun; +Cc: jgg, leon, zyjzyj2000, linux-rdma, parav


On 2023/4/13 1:22, Mark Lehrer wrote:
>> When run "ip link add" command to add a rxe rdma link in a net
>> namespace, normally this rxe rdma link can not work in a net
>> namespace.
> Thank you for this patch, Yanjun!  It is very helpful for some
> research I'm doing.  I just tested the patch and now I have success
> with utilities like rping and ib_send_bw.  It looks like rdma_cm is at
> least doing the basics with no problems.
>
> However, I am still not able to "nvme discover" - this fails with
> rdma_resolve_addr error -101.  It looks like this function is part of
> rdma_cma.  Is this expected to work, or is more patching needed for
> nvme-cli to have success?

Thanks for your testing.

These commits make SoftRoCE work in different net namespaces. In
particular, on the same host, SoftRoCE devices in 2 or more different
net namespaces can connect to each other.

I only ran rping and perftest tests; I did not run NVMe tests.

If you let me know how to reproduce the problem that you ran into, it
will help me a lot to understand this problem and fix it.

Thanks,

Zhu Yanjun

>
> It looks like the kernel nvme-fabrics driver is making the call to
> rdma_resolve_addr here.  According to strace, nvme-cli is just opening
> the fabrics device and writing the host NQN etc.  Is there an easy way
> to prove that rdma_resolve_addr is working from userland?
>
> Thanks,
> Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-12 21:01   ` Mark Lehrer
@ 2023-04-13  7:22     ` Zhu Yanjun
  2023-04-13 13:00       ` Mark Lehrer
  0 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-04-13  7:22 UTC (permalink / raw)
  To: Mark Lehrer, Zhu Yanjun; +Cc: jgg, leon, zyjzyj2000, linux-rdma, parav


On 2023/4/13 5:01, Mark Lehrer wrote:
>> the fabrics device and writing the host NQN etc.  Is there an easy way
>> to prove that rdma_resolve_addr is working from userland?
> Actually I meant "is there a way to prove that the kernel
> rdma_resolve_addr() works with netns?"

I think rdma_resolve_addr can work with netns, because rdma on mlx5
works well with netns.

I have not delved into the source code, but IMO this function is used
by rdma on mlx5 as well.

>
> It seems like this is the real problem.  If we run commands like nvme
> discover & nvme connect within the netns context, the system will use
> the non-netns IP & RDMA stacks to connect.  As an aside - this seems
> like it would be a major security issue for container systems, doesn't
> it?

Have you tested nvme + mlx5 + net ns on your host? Does it work?

Thanks

Zhu Yanjun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13  7:22     ` Zhu Yanjun
@ 2023-04-13 13:00       ` Mark Lehrer
  2023-04-13 13:05         ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Lehrer @ 2023-04-13 13:00 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma, parav

> Have you tested nvme + mlx5 + net ns on your host? Does it work?

Sort of, but not really.  In our last test, we configured a virtual
function and put it in the netns context, but also configured a
physical function outside the netns context.  TCP NVMe connections
always used the correct interface.

However, the RoCEv2 NVMe connection always used the physical function,
regardless of the user space netns context of the nvme-cli process.
When we ran "ip link set <physical function> down" the RoCEv2 NVMe
connections stopped working, but TCP NVMe connections were fine.
We'll be doing more tests today to make sure we're not doing something
wrong.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 13:00       ` Mark Lehrer
@ 2023-04-13 13:05         ` Parav Pandit
  2023-04-13 15:38           ` Mark Lehrer
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2023-04-13 13:05 UTC (permalink / raw)
  To: Mark Lehrer, Zhu Yanjun; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma



> From: Mark Lehrer <lehrer@gmail.com>
> Sent: Thursday, April 13, 2023 9:01 AM
> 
> > Have you tested nvme + mlx5 + net ns on your host? Does it work?
> 
> Sort of, but not really.  In our last test, we configured a virtual function and put
> it in the netns context, but also configured a physical function outside the netns
> context.  TCP NVMe connections always used the correct interface.
> 
Didn’t get a chance to review the thread discussion.
The way to use VF is:

1. rdma system in exclusive mode
$ rdma system set netns exclusive

2. Move netdevice of the VF to the net ns
$ ip link set [ DEV ] netns NSNAME

3. Move RDMA device of the VF to the net ns
$ rdma dev set [ DEV ] netns NSNAME

You are probably missing configuration steps #1 and #3.
#1 should be done before creating any namespaces.

Man pages for #1 and #3:
[a] https://man7.org/linux/man-pages/man8/rdma-system.8.html
[b] https://man7.org/linux/man-pages/man8/rdma-dev.8.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 13:05         ` Parav Pandit
@ 2023-04-13 15:38           ` Mark Lehrer
  2023-04-13 16:20             ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Lehrer @ 2023-04-13 15:38 UTC (permalink / raw)
  To: Parav Pandit; +Cc: Zhu Yanjun, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

> Didn’t get a chance to review the thread discussion.
> The way to use VF is:

Virtual functions were just a debugging aid.  We really just want to
use a single physical function and put it into the netns.  However, we
will do additional VF tests as it still may be a viable workaround.

When using the physical function, we are still having no joy using
exclusive mode with mlx5:


# nvme discover -t rdma -a 192.168.42.11 -s 4420
Discovery Log Number of Records 2, Generation counter 2
=====Discovery Log Entry 0======
... (works as expected)

# rdma system set netns exclusive
# ip netns add netnstest
# ip link set eth1 netns netnstest
# rdma dev set mlx5_0 netns netnstest
# nsenter --net=/var/run/netns/netnstest /bin/bash
# ip link set eth1 up
# ip addr add 192.168.42.12/24 dev eth1
(tested ib_send_bw here, works perfectly)

# nvme discover -t rdma -a 192.168.42.11 -s 4420
Failed to write to /dev/nvme-fabrics: Connection reset by peer
failed to add controller, error Unknown error -1

# dmesg | tail -3
[  240.361647] mlx5_core 0000:05:00.0 eth1: Link up
[  240.371772] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[  259.964542] nvme nvme0: rdma connection establishment failed (-104)

Am I missing something here?

Thanks,
Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 15:38           ` Mark Lehrer
@ 2023-04-13 16:20             ` Parav Pandit
  2023-04-13 16:23               ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2023-04-13 16:20 UTC (permalink / raw)
  To: Mark Lehrer; +Cc: Zhu Yanjun, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma



> From: Mark Lehrer <lehrer@gmail.com>
> Sent: Thursday, April 13, 2023 11:39 AM
> 
> > Didn’t get a chance to review the thread discussion.
> > The way to use VF is:
> 
> Virtual functions were just a debugging aid.  We really just want to use a single
> physical function and put it into the netns.  However, we will do additional VF
> tests as it still may be a viable workaround.
> 
> When using the physical function, we are still having no joy using exclusive
> mode with mlx5:
> 

static int nvmet_rdma_enable_port(struct nvmet_rdma_port *port)
{
        struct sockaddr *addr = (struct sockaddr *)&port->addr;
        struct rdma_cm_id *cm_id;
        int ret;

        cm_id = rdma_create_id(&init_net, nvmet_rdma_cm_handler, port,
                                                     ^^^^^^^
The NVMe target is not net ns aware.

                        RDMA_PS_TCP, IB_QPT_RC);
        if (IS_ERR(cm_id)) {
                pr_err("CM ID creation failed\n");
                return PTR_ERR(cm_id);
        }
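
A netns-aware target would instead have to capture the namespace of the
process configuring the port. Very roughly (a sketch only; "port->net" is
a hypothetical new field, not something that exists today):

        /* at configfs port creation, in the configuring process's context */
        port->net = get_net(current->nsproxy->net_ns);

        /* ... then here, instead of &init_net ... */
        cm_id = rdma_create_id(port->net, nvmet_rdma_cm_handler, port,
                        RDMA_PS_TCP, IB_QPT_RC);

plus a matching put_net() when the port is removed.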


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 16:20             ` Parav Pandit
@ 2023-04-13 16:23               ` Parav Pandit
  2023-04-13 16:37                 ` Mark Lehrer
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2023-04-13 16:23 UTC (permalink / raw)
  To: Mark Lehrer; +Cc: Zhu Yanjun, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma


> From: Parav Pandit <parav@nvidia.com>
> Sent: Thursday, April 13, 2023 12:20 PM
> 
> > From: Mark Lehrer <lehrer@gmail.com>
> > Sent: Thursday, April 13, 2023 11:39 AM
> >
> > > Didn’t get a chance to review the thread discussion.
> > > The way to use VF is:
> >
> > Virtual functions were just a debugging aid.  We really just want to
> > use a single physical function and put it into the netns.  However, we
> > will do additional VF tests as it still may be a viable workaround.
> >
> > When using the physical function, we are still having no joy using
> > exclusive mode with mlx5:
> >
> 
> static int nvmet_rdma_enable_port(struct nvmet_rdma_port *port) {
>         struct sockaddr *addr = (struct sockaddr *)&port->addr;
>         struct rdma_cm_id *cm_id;
>         int ret;
> 
>         cm_id = rdma_create_id(&init_net, nvmet_rdma_cm_handler, port,
>                                                      ^^^^^^^ The NVMe target is not net ns aware.
> 
>                         RDMA_PS_TCP, IB_QPT_RC);
>         if (IS_ERR(cm_id)) {
>                 pr_err("CM ID creation failed\n");
>                 return PTR_ERR(cm_id);
>         }
> 
> >
Clicked send email too early.

static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
                int idx, size_t queue_size)
{
[..]
        queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
                        RDMA_PS_TCP, IB_QPT_RC);
        if (IS_ERR(queue->cm_id)) {

Initiator is not net ns aware.
Given that some of the work runs from a workqueue, it needs to hold a reference to the net ns and implement a net ns delete routine to terminate that work.
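
A minimal sketch of what that could look like (untested; "ctrl->net" is a
hypothetical new member, not what the driver does today):

        /* at controller creation: pin the caller's netns ... */
        ctrl->net = get_net(current->nsproxy->net_ns);

        /* ... use it instead of &init_net for every queue ... */
        queue->cm_id = rdma_create_id(ctrl->net, nvme_rdma_cm_handler, queue,
                        RDMA_PS_TCP, IB_QPT_RC);

        /* ... and drop it with put_net() at controller teardown. A pernet
         * ->exit hook would then force-remove any controller still bound
         * to a dying netns.
         */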

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 16:23               ` Parav Pandit
@ 2023-04-13 16:37                 ` Mark Lehrer
  2023-04-13 16:42                   ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Lehrer @ 2023-04-13 16:37 UTC (permalink / raw)
  To: Parav Pandit; +Cc: Zhu Yanjun, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

> Initiator is not net ns aware.

Am I correct in my assessment that this could be a container jailbreak
risk?  We aren't using containers, but we were shocked that RoCEv2
connections magically worked through the physical function which was
not in the netns context.


Thanks,
Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 16:37                 ` Mark Lehrer
@ 2023-04-13 16:42                   ` Parav Pandit
  2023-04-14 15:49                     ` Zhu Yanjun
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2023-04-13 16:42 UTC (permalink / raw)
  To: Mark Lehrer; +Cc: Zhu Yanjun, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma



> From: Mark Lehrer <lehrer@gmail.com>
> Sent: Thursday, April 13, 2023 12:38 PM
> 
> > Initiator is not net ns aware.
> 
> Am I correct in my assessment that this could be a container jailbreak risk?  We
> aren't using containers, 
Unlikely, because the container orchestration must give the container access to the nvme char/misc device.
And it should do so only when the nvme initiator/target are net ns aware.

> but we were shocked that RoCEv2 connections
> magically worked through the physical function which was not in the netns
> context.

I do not understand this part.
If you are in exclusive mode, rdma devices must be in their respective net ns.
It is unlikely to work; maybe there is some misconfiguration. Hard to say without the exact commands.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-13 16:42                   ` Parav Pandit
@ 2023-04-14 15:49                     ` Zhu Yanjun
       [not found]                       ` <CADvaNzWfS5TFQ3b5JyaKFft06ihazadSJ15V3aXvWZh1jp1cCA@mail.gmail.com>
  0 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-04-14 15:49 UTC (permalink / raw)
  To: Parav Pandit, Mark Lehrer; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma


On 2023/4/14 0:42, Parav Pandit wrote:
>
>> From: Mark Lehrer <lehrer@gmail.com>
>> Sent: Thursday, April 13, 2023 12:38 PM
>>
>>> Initiator is not net ns aware.
>> Am I correct in my assessment that this could be a container jailbreak risk?  We
>> aren't using containers,
> Unlikely, because the container orchestration must give the container access to the nvme char/misc device.
> And it should do so only when the nvme initiator/target are net ns aware.
>
>> but we were shocked that RoCEv2 connections
>> magically worked through the physical function which was not in the netns
>> context.
> I do not understand this part.
> If you are in exclusive mode, rdma devices must be in their respective net ns.

After applying these commits, rxe works in the exclusive mode.

Zhu Yanjun

> It is unlikely to work; maybe there is some misconfiguration. Hard to say without the exact commands.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
       [not found]                       ` <CADvaNzWfS5TFQ3b5JyaKFft06ihazadSJ15V3aXvWZh1jp1cCA@mail.gmail.com>
@ 2023-04-14 16:24                         ` Mark Lehrer
  2023-04-15 13:35                           ` Zhu Yanjun
  2023-04-19  0:43                           ` Parav Pandit
  0 siblings, 2 replies; 39+ messages in thread
From: Mark Lehrer @ 2023-04-14 16:24 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Parav Pandit, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

Apologies if you get this twice, lindbergh rejected my email for
admittedly legitimate reasons.

>> If you are in exclusive mode, rdma devices must be in their respective net ns.
>
> After applying these commits, rxe works in the exclusive mode.

Yanjun,

Thanks again for the original patch.  It is good for the soft roce
driver to be a "reference" for proper rdma functionality.  What is
still needed for this fix to make it to mainline?

As an aside - is rdma_rxe now good enough for Red Hat to build it by
default again in EL10, or is more work needed?

I'm going to try making the nvme-fabrics set of modules use the
network namespace properly with RoCEv2.  TCP seems to work properly
already, so this should be more of a "port" than real development.
Are you (or anyone else) interested in working on this too?  I'm more
familiar with the video frame buffer area of the kernel, so first I'm
familiarizing myself with how nvme-fabrics works with TCP & netns.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-14 16:24                         ` Mark Lehrer
@ 2023-04-15 13:35                           ` Zhu Yanjun
  2023-04-19  0:43                           ` Parav Pandit
  1 sibling, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-04-15 13:35 UTC (permalink / raw)
  To: Mark Lehrer; +Cc: Parav Pandit, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

On 2023/4/15 0:24, Mark Lehrer wrote:
> Apologies if you get this twice, lindbergh rejected my email for
> admittedly legitimate reasons.
> 
>>> If you are in exclusive mode, rdma devices must be in their respective net ns.
>>
>> After applying these commits, rxe works in the exclusive mode.
> 
> Yanjun,
> 
> Thanks again for the original patch.  It is good for the soft roce
> driver to be a "reference" for proper rdma functionality.  What is
> still needed for this fix to make it to mainline?

I am working hard to push these commits to mainline.

Zhu Yanjun

> 
> As an aside - is rdma_rxe now good enough for Red Hat to build it by
> default again in EL10, or is more work needed?
> 
> I'm going to try making the nvme-fabrics set of modules use the
> network namespace properly with RoCEv2.  TCP seems to work properly
> already, so this should be more of a "port" than real development.
> Are you (or anyone else) interested in working on this too?  I'm more
> familiar with the video frame buffer area of the kernel, so first I'm
> familiarizing myself with how nvme-fabrics works with TCP & netns.
> 
> Thanks,
> Mark


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-14 16:24                         ` Mark Lehrer
  2023-04-15 13:35                           ` Zhu Yanjun
@ 2023-04-19  0:43                           ` Parav Pandit
  2023-04-19  4:19                             ` Zhu Yanjun
  1 sibling, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2023-04-19  0:43 UTC (permalink / raw)
  To: Mark Lehrer, Zhu Yanjun; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma



> From: Mark Lehrer <lehrer@gmail.com>
> Sent: Friday, April 14, 2023 12:24 PM
 
> I'm going to try making the nvme-fabrics set of modules use the network
> namespace properly with RoCEv2.  TCP seems to work properly already, so this
> should be more of a "port" than real development.
TCP without a net ns notifier misses the net ns delete scenario, which results in a use-after-free bug; that should be fixed first as it is critical.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-19  0:43                           ` Parav Pandit
@ 2023-04-19  4:19                             ` Zhu Yanjun
  2023-04-19 18:01                               ` Mark Lehrer
  0 siblings, 1 reply; 39+ messages in thread
From: Zhu Yanjun @ 2023-04-19  4:19 UTC (permalink / raw)
  To: Parav Pandit, Mark Lehrer; +Cc: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma


On 2023/4/19 8:43, Parav Pandit wrote:
>
>> From: Mark Lehrer <lehrer@gmail.com>
>> Sent: Friday, April 14, 2023 12:24 PM
>   
>> I'm going to try making the nvme-fabrics set of modules use the network
>> namespace properly with RoCEv2.  TCP seems to work properly already, so this
>> should be more of a "port" than real development.
> TCP without a net ns notifier misses the net ns delete scenario, which results in a use-after-free bug; that should be fixed first as it is critical.

Sure. I ran into this problem too. If I remember correctly,
a net ns callback can fix it.
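
Roughly this kind of skeleton (the names are illustrative only, not from
an actual patch):

        static void foo_net_exit(struct net *net)
        {
                /* terminate pending work and close anything still
                 * bound to this netns before it is freed
                 */
        }

        static struct pernet_operations foo_net_ops = {
                .exit = foo_net_exit,
        };

        /* register_pernet_subsys(&foo_net_ops) at module init,
         * unregister_pernet_subsys(&foo_net_ops) at module exit
         */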

Zhu Yanjun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-19  4:19                             ` Zhu Yanjun
@ 2023-04-19 18:01                               ` Mark Lehrer
  2023-04-20 14:28                                 ` Zhu Yanjun
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Lehrer @ 2023-04-19 18:01 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Parav Pandit, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

> TCP without a net ns notifier misses the net ns delete scenario, which results in a use-after-free bug; that should be fixed first as it is critical.
>
> Sure. I ran into this problem too. If I remember correctly,
> a net ns callback can fix it.

I'm not sure if the bug fix will be this in depth, but I have a
related question.  What is the proper way for the kernel nvme
initiator code to know which netns context to use?  e.g. should we
take the pid of the process that opened /dev/nvme-fabrics and look it
up (presumably this will be nvme-cli), and will this method give us
enough details for both tcp & rdma?

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace
  2023-04-19 18:01                               ` Mark Lehrer
@ 2023-04-20 14:28                                 ` Zhu Yanjun
  0 siblings, 0 replies; 39+ messages in thread
From: Zhu Yanjun @ 2023-04-20 14:28 UTC (permalink / raw)
  To: Mark Lehrer; +Cc: Parav Pandit, Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma


On 2023/4/20 2:01, Mark Lehrer wrote:
>> TCP without a net ns notifier misses the net ns delete scenario, which results in a use-after-free bug; that should be fixed first as it is critical.
>>
>> Sure. I ran into this problem too. If I remember correctly,
>> a net ns callback can fix it.
> I'm not sure if the bug fix will be this in depth, but I have a
> related question.  What is the proper way for the kernel nvme
> initiator code to know which netns context to use?  e.g. should we
> take the pid of the process that opened /dev/nvme-fabrics and look it
> up (presumably this will be nvme-cli), and will this method give us
> enough details for both tcp & rdma?

Please check the netns callback functions. You will find all the answers 
to your questions.
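
For example (a sketch only): the write to /dev/nvme-fabrics happens in the
caller's context, so the netns can be taken from "current" right there,
instead of being looked up from a pid afterwards:

        static ssize_t nvmf_dev_write(struct file *file, const char __user *ubuf,
                        size_t count, loff_t *pos)
        {
                /* the caller's netns; the transport's create_ctrl() path
                 * could get_net() it for the controller's lifetime, for
                 * both tcp and rdma
                 */
                struct net *net = current->nsproxy->net_ns;

                /* ... existing option parsing and create_ctrl() call ... */
        }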

Zhu Yanjun

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-04-20 14:29 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-14  6:06 [PATCHv3 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
2023-02-23 13:10   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
2023-02-23 13:10   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
2023-02-23 13:11   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
2023-02-23 13:12   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
2023-02-23 13:13   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
2023-02-23 13:14   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
2023-02-23 13:14   ` Zhu Yanjun
2023-02-14  6:06 ` [PATCHv3 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
2023-02-23 13:15   ` Zhu Yanjun
2023-02-23  0:31 ` [PATCHv3 0/8] Fix the problem that rxe can not work " Zhu Yanjun
2023-02-23  4:56   ` Jakub Kicinski
2023-02-23 11:42     ` Zhu Yanjun
2023-02-25  8:43   ` Rain River
2023-04-12 17:22 ` Mark Lehrer
2023-04-12 21:01   ` Mark Lehrer
2023-04-13  7:22     ` Zhu Yanjun
2023-04-13 13:00       ` Mark Lehrer
2023-04-13 13:05         ` Parav Pandit
2023-04-13 15:38           ` Mark Lehrer
2023-04-13 16:20             ` Parav Pandit
2023-04-13 16:23               ` Parav Pandit
2023-04-13 16:37                 ` Mark Lehrer
2023-04-13 16:42                   ` Parav Pandit
2023-04-14 15:49                     ` Zhu Yanjun
     [not found]                       ` <CADvaNzWfS5TFQ3b5JyaKFft06ihazadSJ15V3aXvWZh1jp1cCA@mail.gmail.com>
2023-04-14 16:24                         ` Mark Lehrer
2023-04-15 13:35                           ` Zhu Yanjun
2023-04-19  0:43                           ` Parav Pandit
2023-04-19  4:19                             ` Zhu Yanjun
2023-04-19 18:01                               ` Mark Lehrer
2023-04-20 14:28                                 ` Zhu Yanjun
2023-04-13  7:17   ` Zhu Yanjun
