All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
@ 2018-03-16 12:38 Kirill Tkhai
  2018-03-16 13:00 ` Sowmini Varadhan
  0 siblings, 1 reply; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-16 12:38 UTC (permalink / raw)
  To: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel,
	edumazet, ktkhai, sowmini.varadhan

Hi,

rds_tcp_dev_event() is the last user of NETDEV_UNREGISTER_FINAL stage.
If we rework it, we'll be able to kill the stage, and this will decrease
the number of rtnl_lock() we take during generic net device unregistration.
This is very hot path for namespaces.

467fa15356acf by Sowmini Varadhan added NETDEV_UNREGISTER_FINAL dependence
with the commentary:

	/* rds-tcp registers as a pernet subys, so the ->exit will only
	 * get invoked after network acitivity has quiesced. We need to
	 * clean up all sockets  to quiesce network activity, and use
	 * the unregistration of the per-net loopback device as a trigger
	 * to start that cleanup.
	 */

It seems all the protocols pernet subsystems meet this situation, but they
solve it in generic way. What does rds so specific have?

This commit makes event handler to iterate rds_tcp_conn_list and
kill them. If we change the stage to NETDEV_UNREGISTER, what will change?
Can unregistered loopback device on dead net add new items to rds_tcp_conn_list?
How it's possible?

What the problem is to move rds_tcp_kill_sock() into pernet subsys exit?

Kirill
---
 net/rds/tcp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index eb04e7fa2467..4c6db9cb6261 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -567,7 +567,7 @@ static int rds_tcp_dev_event(struct notifier_block *this,
 	 * the unregistration of the per-net loopback device as a trigger
 	 * to start that cleanup.
 	 */
-	if (event == NETDEV_UNREGISTER_FINAL &&
+	if (event == NETDEV_UNREGISTER &&
 	    dev->ifindex == LOOPBACK_IFINDEX)
 		rds_tcp_kill_sock(dev_net(dev));
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 12:38 [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL) Kirill Tkhai
@ 2018-03-16 13:00 ` Sowmini Varadhan
  2018-03-16 13:17   ` Kirill Tkhai
  0 siblings, 1 reply; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-16 13:00 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On (03/16/18 15:38), Kirill Tkhai wrote:
> 
> 467fa15356acf by Sowmini Varadhan added NETDEV_UNREGISTER_FINAL dependence
> with the commentary:
> 
> 	/* rds-tcp registers as a pernet subys, so the ->exit will only
> 	 * get invoked after network acitivity has quiesced. We need to
> 	 * clean up all sockets  to quiesce network activity, and use
> 	 * the unregistration of the per-net loopback device as a trigger
> 	 * to start that cleanup.
> 	 */
> 
> It seems all the protocols pernet subsystems meet this situation, but they
> solve it in generic way. What does rds so specific have?

The difference with rds is this: most consumers of netns associate
a net_device with the netns, and cleanup of the netns state 
happens as part of the net_device teardown without the constraint
above. rds-tcp does has a netns tied to listening socket, not
to a specific network interface (net_device) so it registers
as a pernet-subsys. But this means that cleanup has to be
cone carefully (see comments in net_namespace.h before
register_pernet_subsys)

For rds-tcp, we need to be able to do the right thing in both of these
cases
1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
   every namespace, including init_net)
2. netns delete (rds_tcp.ko should remain loaded for other namespaces)

> This commit makes event handler to iterate rds_tcp_conn_list and
> kill them. If we change the stage to NETDEV_UNREGISTER, what will change?

The above two cases need to work correctly.

> Can unregistered loopback device on dead net add new items to rds_tcp_conn_list?
> How it's possible?

I dont understand the question- no unregistered loopback devices
cannot add items. 

fwiw, I had asked questions about this (netns per net_device
vs netns for module) on the netdev list a few years ago, I can
try to hunt down that thread for you later (nobody replied to 
it, but maybe it will help answer your questions).

--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 13:00 ` Sowmini Varadhan
@ 2018-03-16 13:17   ` Kirill Tkhai
  2018-03-16 13:53     ` Sowmini Varadhan
  2018-03-16 17:29     ` Sowmini Varadhan
  0 siblings, 2 replies; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-16 13:17 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On 16.03.2018 16:00, Sowmini Varadhan wrote:
> On (03/16/18 15:38), Kirill Tkhai wrote:
>>
>> 467fa15356acf by Sowmini Varadhan added NETDEV_UNREGISTER_FINAL dependence
>> with the commentary:
>>
>> 	/* rds-tcp registers as a pernet subys, so the ->exit will only
>> 	 * get invoked after network acitivity has quiesced. We need to
>> 	 * clean up all sockets  to quiesce network activity, and use
>> 	 * the unregistration of the per-net loopback device as a trigger
>> 	 * to start that cleanup.
>> 	 */
>>
>> It seems all the protocols pernet subsystems meet this situation, but they
>> solve it in generic way. What does rds so specific have?
> 
> The difference with rds is this: most consumers of netns associate
> a net_device with the netns, and cleanup of the netns state 
> happens as part of the net_device teardown without the constraint
> above. rds-tcp does has a netns tied to listening socket, not
> to a specific network interface (net_device) so it registers
> as a pernet-subsys. But this means that cleanup has to be
> cone carefully (see comments in net_namespace.h before
> register_pernet_subsys)

This is not a problem, and rds-tcp is not the only pernet_subsys registering
a socket. It's OK to close it from .exit method. There are many examples,
let me point you to icmp_sk_ops as one of them. But it's not the only.

> For rds-tcp, we need to be able to do the right thing in both of these
> cases
> 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
>    every namespace, including init_net)
> 2. netns delete (rds_tcp.ko should remain loaded for other namespaces)

The same as above, every pernet_subsys does this. It's not a problem.
exit and exit_batch methods are called in both of the cases.

Please, see __unregister_pernet_operations()->ops_exit_list for the details.

>> This commit makes event handler to iterate rds_tcp_conn_list and
>> kill them. If we change the stage to NETDEV_UNREGISTER, what will change?
> 
> The above two cases need to work correctly.

Yeah, but let's find another way to have the same.

>> Can unregistered loopback device on dead net add new items to rds_tcp_conn_list?
>> How it's possible?
> 
> I dont understand the question- no unregistered loopback devices
> cannot add items. 

If we replace NETDEV_UNREGISTER_FINAL with NETDEV_UNREGISTER, the only change
which happens is we call rds_tcp_kill_sock() earlier. So, it may be a reason
of problems only if someone changes the list during the time between
NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL are called for loopback.
But since this time noone related to this net can extend the list,
there is no a problem to do that.

> fwiw, I had asked questions about this (netns per net_device
> vs netns for module) on the netdev list a few years ago, I can
> try to hunt down that thread for you later (nobody replied to 
> it, but maybe it will help answer your questions).

After your words it looks like we may simply do all the things
in exit method.

Kirill

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 13:17   ` Kirill Tkhai
@ 2018-03-16 13:53     ` Sowmini Varadhan
  2018-03-16 14:36       ` Kirill Tkhai
  2018-03-16 17:29     ` Sowmini Varadhan
  1 sibling, 1 reply; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-16 13:53 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet


Found my previous question:

https://www.mail-archive.com/netdev@vger.kernel.org/msg72330.html

(see section about "Comments are specifically ivinted.."

> This is not a problem, and rds-tcp is not the only pernet_subsys registering
> a socket. It's OK to close it from .exit method. There are many examples,
> let me point you to icmp_sk_ops as one of them. But it's not the only.

I'm not averse to changing this to NETDEV_UNREGISTER
as long as it works for the 2 test cases below- you 
can test it by using rds-ping from rds-tools rpm, to
be used from/to init_net, from/to the netns  against
some external machine (i.e something not on the same
physical host)

> > For rds-tcp, we need to be able to do the right thing in both of these
> > cases
> > 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
> >    every namespace, including init_net)
> > 2. netns delete (rds_tcp.ko should remain loaded for other namespaces)
> 
> The same as above, every pernet_subsys does this. It's not a problem.
> exit and exit_batch methods are called in both of the cases.
> 
> Please, see __unregister_pernet_operations()->ops_exit_list for the details.

I am familiar with ops_exit_list, but this is the sequence:
- when the module is loaded (or netns is started) it starts a 
  kernel listen socket on *.16385
- when you start the rds-pings above, it will create kernel
  tcp connections from/to the 16385 in the netns. And it will
  start socket keepalives for those connections. Each tcp 
  connection is associated with a rds_connection

As I recall, when I wrote the initial patchset, my problem
was that in order to let the module unload make progress,
all these sockets had to be cleaned up. But to clean up these
sockets, net_device cleanup had to complete (should not have
any new incoming connections to the listen endpoint on a 
non-loopback socket) so I ended up with a circular dependancy.

> If we replace NETDEV_UNREGISTER_FINAL with NETDEV_UNREGISTER, the only change
> which happens is we call rds_tcp_kill_sock() earlier. So, it may be a reason
> of problems only if someone changes the list during the time between
> NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL are called for loopback.
> But since this time noone related to this net can extend the list,
> there is no a problem to do that.

Please share your patch, I can review it and maybe help to test
it..

As I was trying to say in my RFC, I am quite open to ways to make
this cleanup more obvious

--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 13:53     ` Sowmini Varadhan
@ 2018-03-16 14:36       ` Kirill Tkhai
  2018-03-16 14:41         ` Kirill Tkhai
  0 siblings, 1 reply; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-16 14:36 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On 16.03.2018 16:53, Sowmini Varadhan wrote:
> 
> Found my previous question:
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg72330.html
> 
> (see section about "Comments are specifically ivinted.."

I see, thanks.

>> This is not a problem, and rds-tcp is not the only pernet_subsys registering
>> a socket. It's OK to close it from .exit method. There are many examples,
>> let me point you to icmp_sk_ops as one of them. But it's not the only.
> 
> I'm not averse to changing this to NETDEV_UNREGISTER
> as long as it works for the 2 test cases below- you 
> can test it by using rds-ping from rds-tools rpm, to
> be used from/to init_net, from/to the netns  against
> some external machine (i.e something not on the same
> physical host)
> 
>>> For rds-tcp, we need to be able to do the right thing in both of these
>>> cases
>>> 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
>>>    every namespace, including init_net)
>>> 2. netns delete (rds_tcp.ko should remain loaded for other namespaces)
>>
>> The same as above, every pernet_subsys does this. It's not a problem.
>> exit and exit_batch methods are called in both of the cases.
>>
>> Please, see __unregister_pernet_operations()->ops_exit_list for the details.
> 
> I am familiar with ops_exit_list, but this is the sequence:
> - when the module is loaded (or netns is started) it starts a 
>   kernel listen socket on *.16385
> - when you start the rds-pings above, it will create kernel
>   tcp connections from/to the 16385 in the netns. And it will
>   start socket keepalives for those connections. Each tcp 
>   connection is associated with a rds_connection
> 
> As I recall, when I wrote the initial patchset, my problem
> was that in order to let the module unload make progress,
> all these sockets had to be cleaned up. But to clean up these
> sockets, net_device cleanup had to complete (should not have
> any new incoming connections to the listen endpoint on a 
> non-loopback socket) so I ended up with a circular dependancy.

Ah, I see the reasons. Please, see my proposition at the end of this letter.
 
>> If we replace NETDEV_UNREGISTER_FINAL with NETDEV_UNREGISTER, the only change
>> which happens is we call rds_tcp_kill_sock() earlier. So, it may be a reason
>> of problems only if someone changes the list during the time between
>> NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL are called for loopback.
>> But since this time noone related to this net can extend the list,
>> there is no a problem to do that.
> 
> Please share your patch, I can review it and maybe help to test
> it..
> 
> As I was trying to say in my RFC, I am quite open to ways to make
> this cleanup more obvious

How about something like this? Compile tested only.

[PATCH]rds: Use pernet device to kill RDS sockets

We register a new pernet device and use the fact,
that loopback device is last unregistered device.
So, on exit path, the new exit method will be called
before loopback_dev destruction.
    
---
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index eb04e7fa2467..ec37868bf2dd 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -493,28 +493,11 @@ static void __net_exit rds_tcp_exit_net(struct net *net)
 
 	if (net != &init_net && rtn->ctl_table)
 		kfree(rtn->ctl_table);
-
-	/* If rds_tcp_exit_net() is called as a result of netns deletion,
-	 * the rds_tcp_kill_sock() device notifier would already have cleaned
-	 * up the listen socket, thus there is no work to do in this function.
-	 *
-	 * If rds_tcp_exit_net() is called as a result of module unload,
-	 * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
-	 * we do need to clean up the listen socket here.
-	 */
-	if (rtn->rds_tcp_listen_sock) {
-		struct socket *lsock = rtn->rds_tcp_listen_sock;
-
-		rtn->rds_tcp_listen_sock = NULL;
-		rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
-	}
 }
 
 static struct pernet_operations rds_tcp_net_ops = {
 	.init = rds_tcp_init_net,
 	.exit = rds_tcp_exit_net,
-	.id = &rds_tcp_netid,
-	.size = sizeof(struct rds_tcp_net),
 	.async = true,
 };
 
@@ -545,40 +528,38 @@ static void rds_tcp_kill_sock(struct net *net)
 		rds_conn_destroy(tc->t_cpath->cp_conn);
 }
 
-void *rds_tcp_listen_sock_def_readable(struct net *net)
+static __net_init int rds_tcp_init_dev(struct net *net)
 {
 	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-	struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-	if (!lsock)
-		return NULL;
+	rtn->rds_tcp_listen_sock = NULL;
+	return 0;
+}
 
-	return lsock->sk->sk_user_data;
+static void __net_exit rds_tcp_exit_dev(struct net *net)
+{
+	rds_tcp_kill_sock(net);
 }
 
-static int rds_tcp_dev_event(struct notifier_block *this,
-			     unsigned long event, void *ptr)
+static struct pernet_operations rds_tcp_dev_ops = {
+	.init = rds_tcp_init_dev,
+	.exit = rds_tcp_exit_dev,
+	.id = &rds_tcp_netid,
+	.size = sizeof(struct rds_tcp_net),
+	.async = true,
+};
+
+void *rds_tcp_listen_sock_def_readable(struct net *net)
 {
-	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+	struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-	/* rds-tcp registers as a pernet subys, so the ->exit will only
-	 * get invoked after network acitivity has quiesced. We need to
-	 * clean up all sockets  to quiesce network activity, and use
-	 * the unregistration of the per-net loopback device as a trigger
-	 * to start that cleanup.
-	 */
-	if (event == NETDEV_UNREGISTER_FINAL &&
-	    dev->ifindex == LOOPBACK_IFINDEX)
-		rds_tcp_kill_sock(dev_net(dev));
+	if (!lsock)
+		return NULL;
 
-	return NOTIFY_DONE;
+	return lsock->sk->sk_user_data;
 }
 
-static struct notifier_block rds_tcp_dev_notifier = {
-	.notifier_call        = rds_tcp_dev_event,
-	.priority = -10, /* must be called after other network notifiers */
-};
-
 /* when sysctl is used to modify some kernel socket parameters,this
  * function  resets the RDS connections in that netns  so that we can
  * restart with new parameters.  The assumption is that such reset
@@ -624,9 +605,8 @@ static void rds_tcp_exit(void)
 	rds_tcp_set_unloading();
 	synchronize_rcu();
 	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
+	unregister_pernet_device(&rds_tcp_dev_ops);
 	unregister_pernet_subsys(&rds_tcp_net_ops);
-	if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
-		pr_warn("could not unregister rds_tcp_dev_notifier\n");
 	rds_tcp_destroy_conns();
 	rds_trans_unregister(&rds_tcp_transport);
 	rds_tcp_recv_exit();
@@ -650,15 +630,13 @@ static int rds_tcp_init(void)
 	if (ret)
 		goto out_slab;
 
-	ret = register_pernet_subsys(&rds_tcp_net_ops);
+	ret = register_pernet_device(&rds_tcp_dev_ops);
 	if (ret)
 		goto out_recv;
 
-	ret = register_netdevice_notifier(&rds_tcp_dev_notifier);
-	if (ret) {
-		pr_warn("could not register rds_tcp_dev_notifier\n");
+	ret = register_pernet_subsys(&rds_tcp_net_ops);
+	if (ret)
 		goto out_pernet;
-	}
 
 	rds_trans_register(&rds_tcp_transport);
 
@@ -667,7 +645,7 @@ static int rds_tcp_init(void)
 	goto out;
 
 out_pernet:
-	unregister_pernet_subsys(&rds_tcp_net_ops);
+	unregister_pernet_device(&rds_tcp_dev_ops);
 out_recv:
 	rds_tcp_recv_exit();
 out_slab:

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 14:36       ` Kirill Tkhai
@ 2018-03-16 14:41         ` Kirill Tkhai
  0 siblings, 0 replies; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-16 14:41 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On 16.03.2018 17:36, Kirill Tkhai wrote:
> On 16.03.2018 16:53, Sowmini Varadhan wrote:
>>
>> Found my previous question:
>>
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg72330.html
>>
>> (see section about "Comments are specifically ivinted.."
> 
> I see, thanks.
> 
>>> This is not a problem, and rds-tcp is not the only pernet_subsys registering
>>> a socket. It's OK to close it from .exit method. There are many examples,
>>> let me point you to icmp_sk_ops as one of them. But it's not the only.
>>
>> I'm not averse to changing this to NETDEV_UNREGISTER
>> as long as it works for the 2 test cases below- you 
>> can test it by using rds-ping from rds-tools rpm, to
>> be used from/to init_net, from/to the netns  against
>> some external machine (i.e something not on the same
>> physical host)
>>
>>>> For rds-tcp, we need to be able to do the right thing in both of these
>>>> cases
>>>> 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
>>>>    every namespace, including init_net)
>>>> 2. netns delete (rds_tcp.ko should remain loaded for other namespaces)
>>>
>>> The same as above, every pernet_subsys does this. It's not a problem.
>>> exit and exit_batch methods are called in both of the cases.
>>>
>>> Please, see __unregister_pernet_operations()->ops_exit_list for the details.
>>
>> I am familiar with ops_exit_list, but this is the sequence:
>> - when the module is loaded (or netns is started) it starts a 
>>   kernel listen socket on *.16385
>> - when you start the rds-pings above, it will create kernel
>>   tcp connections from/to the 16385 in the netns. And it will
>>   start socket keepalives for those connections. Each tcp 
>>   connection is associated with a rds_connection
>>
>> As I recall, when I wrote the initial patchset, my problem
>> was that in order to let the module unload make progress,
>> all these sockets had to be cleaned up. But to clean up these
>> sockets, net_device cleanup had to complete (should not have
>> any new incoming connections to the listen endpoint on a 
>> non-loopback socket) so I ended up with a circular dependancy.
> 
> Ah, I see the reasons. Please, see my proposition at the end of this letter.
>  
>>> If we replace NETDEV_UNREGISTER_FINAL with NETDEV_UNREGISTER, the only change
>>> which happens is we call rds_tcp_kill_sock() earlier. So, it may be a reason
>>> of problems only if someone changes the list during the time between
>>> NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL are called for loopback.
>>> But since this time noone related to this net can extend the list,
>>> there is no a problem to do that.
>>
>> Please share your patch, I can review it and maybe help to test
>> it..
>>
>> As I was trying to say in my RFC, I am quite open to ways to make
>> this cleanup more obvious
> 
> How about something like this? Compile tested only.
> 
> [PATCH]rds: Use pernet device to kill RDS sockets
> 
> We register a new pernet device and use the fact,
> that loopback device is last unregistered device.
> So, on exit path, the new exit method will be called
> before loopback_dev destruction.

$git diff --patience gives better diff

[PATCH]rds: Use pernet device to kill RDS sockets
    
We register a new pernet device and use the fact,
that loopback device is last unregistered device.
So, on exit path, the new exit method will be called
before loopback_dev destruction.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index eb04e7fa2467..ec37868bf2dd 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -493,28 +493,11 @@ static void __net_exit rds_tcp_exit_net(struct net *net)
 
 	if (net != &init_net && rtn->ctl_table)
 		kfree(rtn->ctl_table);
-
-	/* If rds_tcp_exit_net() is called as a result of netns deletion,
-	 * the rds_tcp_kill_sock() device notifier would already have cleaned
-	 * up the listen socket, thus there is no work to do in this function.
-	 *
-	 * If rds_tcp_exit_net() is called as a result of module unload,
-	 * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
-	 * we do need to clean up the listen socket here.
-	 */
-	if (rtn->rds_tcp_listen_sock) {
-		struct socket *lsock = rtn->rds_tcp_listen_sock;
-
-		rtn->rds_tcp_listen_sock = NULL;
-		rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
-	}
 }
 
 static struct pernet_operations rds_tcp_net_ops = {
 	.init = rds_tcp_init_net,
 	.exit = rds_tcp_exit_net,
-	.id = &rds_tcp_netid,
-	.size = sizeof(struct rds_tcp_net),
 	.async = true,
 };
 
@@ -545,6 +528,27 @@ static void rds_tcp_kill_sock(struct net *net)
 		rds_conn_destroy(tc->t_cpath->cp_conn);
 }
 
+static __net_init int rds_tcp_init_dev(struct net *net)
+{
+	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+
+	rtn->rds_tcp_listen_sock = NULL;
+	return 0;
+}
+
+static void __net_exit rds_tcp_exit_dev(struct net *net)
+{
+	rds_tcp_kill_sock(net);
+}
+
+static struct pernet_operations rds_tcp_dev_ops = {
+	.init = rds_tcp_init_dev,
+	.exit = rds_tcp_exit_dev,
+	.id = &rds_tcp_netid,
+	.size = sizeof(struct rds_tcp_net),
+	.async = true,
+};
+
 void *rds_tcp_listen_sock_def_readable(struct net *net)
 {
 	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
@@ -556,29 +560,6 @@ void *rds_tcp_listen_sock_def_readable(struct net *net)
 	return lsock->sk->sk_user_data;
 }
 
-static int rds_tcp_dev_event(struct notifier_block *this,
-			     unsigned long event, void *ptr)
-{
-	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-
-	/* rds-tcp registers as a pernet subys, so the ->exit will only
-	 * get invoked after network acitivity has quiesced. We need to
-	 * clean up all sockets  to quiesce network activity, and use
-	 * the unregistration of the per-net loopback device as a trigger
-	 * to start that cleanup.
-	 */
-	if (event == NETDEV_UNREGISTER_FINAL &&
-	    dev->ifindex == LOOPBACK_IFINDEX)
-		rds_tcp_kill_sock(dev_net(dev));
-
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block rds_tcp_dev_notifier = {
-	.notifier_call        = rds_tcp_dev_event,
-	.priority = -10, /* must be called after other network notifiers */
-};
-
 /* when sysctl is used to modify some kernel socket parameters,this
  * function  resets the RDS connections in that netns  so that we can
  * restart with new parameters.  The assumption is that such reset
@@ -624,9 +605,8 @@ static void rds_tcp_exit(void)
 	rds_tcp_set_unloading();
 	synchronize_rcu();
 	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
+	unregister_pernet_device(&rds_tcp_dev_ops);
 	unregister_pernet_subsys(&rds_tcp_net_ops);
-	if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
-		pr_warn("could not unregister rds_tcp_dev_notifier\n");
 	rds_tcp_destroy_conns();
 	rds_trans_unregister(&rds_tcp_transport);
 	rds_tcp_recv_exit();
@@ -650,15 +630,13 @@ static int rds_tcp_init(void)
 	if (ret)
 		goto out_slab;
 
-	ret = register_pernet_subsys(&rds_tcp_net_ops);
+	ret = register_pernet_device(&rds_tcp_dev_ops);
 	if (ret)
 		goto out_recv;
 
-	ret = register_netdevice_notifier(&rds_tcp_dev_notifier);
-	if (ret) {
-		pr_warn("could not register rds_tcp_dev_notifier\n");
+	ret = register_pernet_subsys(&rds_tcp_net_ops);
+	if (ret)
 		goto out_pernet;
-	}
 
 	rds_trans_register(&rds_tcp_transport);
 
@@ -667,7 +645,7 @@ static int rds_tcp_init(void)
 	goto out;
 
 out_pernet:
-	unregister_pernet_subsys(&rds_tcp_net_ops);
+	unregister_pernet_device(&rds_tcp_dev_ops);
 out_recv:
 	rds_tcp_recv_exit();
 out_slab:

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 13:17   ` Kirill Tkhai
  2018-03-16 13:53     ` Sowmini Varadhan
@ 2018-03-16 17:29     ` Sowmini Varadhan
  2018-03-16 18:14       ` Kirill Tkhai
  1 sibling, 1 reply; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-16 17:29 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet


I had taken some of this offline, but it occurs to me
that some of these notes should be saved to the netdev archives, 
in case this question pops up again in a few years.

When I run your patch, I get a repeatable panic by doing
  modprobe rds-tcp
  ip netns create blue
the panic is because we are finding a null trn in rds_tcp_init_net.

I think there's something very disturbed about calling
register_pernet_operations() twice, once via 
register_pernet_device() and again via register_pernet_subsys().

I suspect this has everything to do with the panic but I have
not had time to debug every little detail here.

In general, rds_tcp is not a network device, it is a kernel
module.  That is the fundamental problem here. 

To repeat the comments form net_namespace.h:
 * Network interfaces need to be removed from a dying netns _before_
 * subsys notifiers can be called, as most of the network code cleanup
 * (which is done from subsys notifiers) runs with the assumption that
 * dev_remove_pack has been called so no new packets will arrive during
 * and after the cleanup functions have been called.  dev_remove_pack
 * is not per namespace so instead the guarantee of no more packets
 * arriving in a network namespace is provided by ensuring that all
 * network devices and all sockets have left the network namespace
 * before the cleanup methods are called.

when the "blue" netns starts up, it creates at least one kernel listen
socket on *.16385. This socket, and any other child/client sockets 
created must be cleaned up before the cleanup_net can happen.

This is why I chose to call regster_pernet_subsys. Again, as per
comments in net_namespace.h:

 * Use these carefully.  If you implement a network device and it
 * needs per network namespace operations use device pernet operations,
 * otherwise use pernet subsys operations.

On (03/16/18 18:51), Kirill Tkhai wrote:
> > Let's find another approach. Could you tell what problem we have in 
> > case of rds_tcp_dev_ops is declared as pernet_device?

As above, rds-tcp is not a network device.

> One more question. Which time we take a reference of loopback device?
> Is it possible before we created a net completely?

We dont take a reference on the loopback device. 
We make sure none of the kernel sockets does a get_net() so
that we dont block the cleanup_net, and then, when all
the network interfaces have been taken down (loopback is
the last one) we know there are no more packets coming in
and out, so it is safe to dismantle all kernel sockets 
created by rds-tcp.

Hope that helps.

--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 17:29     ` Sowmini Varadhan
@ 2018-03-16 18:14       ` Kirill Tkhai
  2018-03-16 18:31         ` Sowmini Varadhan
  2018-03-17 14:15         ` Sowmini Varadhan
  0 siblings, 2 replies; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-16 18:14 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On 16.03.2018 20:29, Sowmini Varadhan wrote:
> 
> I had taken some of this offline, but it occurs to me
> that some of these notes should be saved to the netdev archives, 
> in case this question pops up again in a few years.
> 
> When I run your patch, I get a repeatable panic by doing
>   modprobe rds-tcp
>   ip netns create blue
> the panic is because we are finding a null trn in rds_tcp_init_net.

I did the second version and sent you. Have you tried it?

> I think there's something very disturbed about calling
> register_pernet_operations() twice, once via 
> register_pernet_device() and again via register_pernet_subsys().

Calling netdevice handler for every event is more disturbing,
as number of events is several times bigger, than one more
pernet exit method.

> I suspect this has everything to do with the panic but I have
> not had time to debug every little detail here.

You speak in the way I forced you to spend a lot of time debugging
my patches. But I sent you only one with a warn it's not tested.
(but twice -- since the second time I used --patience diff option
to make idea more visible for you). It's strange to hear this from
you.

> In general, rds_tcp is not a network device, it is a kernel
> module.  That is the fundamental problem here. 
> 
> To repeat the comments form net_namespace.h:
>  * Network interfaces need to be removed from a dying netns _before_
>  * subsys notifiers can be called, as most of the network code cleanup
>  * (which is done from subsys notifiers) runs with the assumption that
>  * dev_remove_pack has been called so no new packets will arrive during
>  * and after the cleanup functions have been called.  dev_remove_pack
>  * is not per namespace so instead the guarantee of no more packets
>  * arriving in a network namespace is provided by ensuring that all
>  * network devices and all sockets have left the network namespace
>  * before the cleanup methods are called.
> 
> when the "blue" netns starts up, it creates at least one kernel listen
> socket on *.16385. This socket, and any other child/client sockets 
> created must be cleaned up before the cleanup_net can happen.
> 
> This is why I chose to call regster_pernet_subsys. Again, as per
> comments in net_namespace.h:
> 
>  * Use these carefully.  If you implement a network device and it
>  * needs per network namespace operations use device pernet operations,
>  * otherwise use pernet subsys operations.

This is not a strict rule, and currently I'm working on pernet_operations
synchronization. This commentary is for generic code, while your rds
is differ from the rest of ordinary pernet_operations.

> On (03/16/18 18:51), Kirill Tkhai wrote:
>>> Let's find another approach. Could you tell what problem we have in 
>>> case of rds_tcp_dev_ops is declared as pernet_device?
> 
> As above, rds-tcp is not a network device.

It's not an agrument. It's not a rule, what you read in the comment.
We need to use rules of reasonableness, and the generic rules do not
fit to your driver. So, it's need to search a solution. Be the only
user NETDEV_UNREGISTER_FINAL in the kernel does not look a good one.

>> One more question. Which time we take a reference of loopback device?
>> Is it possible before we created a net completely?
> 
> We dont take a reference on the loopback device. 
> We make sure none of the kernel sockets does a get_net() so
> that we dont block the cleanup_net, and then, when all
> the network interfaces have been taken down (loopback is
> the last one) we know there are no more packets coming in
> and out, so it is safe to dismantle all kernel sockets 
> created by rds-tcp.

So that, if RDS packets does not act on netdev refcount,
which circular dependence did you mean in your second
message (quote below)?

>As I recall, when I wrote the initial patchset, my problem
>was that in order to let the module unload make progress,
>all these sockets had to be cleaned up. But to clean up these
>sockets, net_device cleanup had to complete (should not have
>any new incoming connections to the listen endpoint on a 
>non-loopback socket) so I ended up with a circular dependancy.

Kirill

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 18:14       ` Kirill Tkhai
@ 2018-03-16 18:31         ` Sowmini Varadhan
  2018-03-16 18:48           ` Kirill Tkhai
  2018-03-17 14:15         ` Sowmini Varadhan
  1 sibling, 1 reply; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-16 18:31 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On (03/16/18 21:14), Kirill Tkhai wrote:
> 
> I did the second version and sent you. Have you tried it?

I tried it briefly, and it works for the handful of testcases
that I tried, but I still think its very werid to register
as both a device and a subsys, esp in the light of the 
warning in net_namespace.h

Thus I have to spend some time reviewing your patch,
and I cannot give you an answer in the next 5 minutes.

> Calling netdevice handler for every event is more disturbing,
> as number of events is several times bigger, than one more
> pernet exit method.

So you are saying there are scaling constraints on subsystems
that register for netdevice handlers. The disturbing part
of that is that it does not scale.

Thanks.
--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 18:31         ` Sowmini Varadhan
@ 2018-03-16 18:48           ` Kirill Tkhai
  2018-03-16 18:53             ` Sowmini Varadhan
  0 siblings, 1 reply; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-16 18:48 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On 16.03.2018 21:31, Sowmini Varadhan wrote:
> On (03/16/18 21:14), Kirill Tkhai wrote:
>>
>> I did the second version and sent you. Have you tried it?
> 
> I tried it briefly, and it works for the handful of testcases
> that I tried, but I still think its very werid to register
> as both a device and a subsys, esp in the light of the 
> warning in net_namespace.h

I'll send a patch updating comment in net_namespace.h to clarify
the situation.

> Thus I have to spend some time reviewing your patch,
> and I cannot give you an answer in the next 5 minutes.

No problem, 5 minutes response never required. Thanks for
your review.

>> Calling netdevice handler for every event is more disturbing,
>> as number of events is several times bigger, than one more
>> pernet exit method.
> 
> So you are saying there are scaling constraints on subsystems
> that register for netdevice handlers. The disturbing part
> of that is that it does not scale.

It does not scale. But if we do not work on this, nothing will change.

My point about less numbers is that every namespace will call this
new pernet->exit once. So, we will have 1 new function call per one
namespace.

Let's count numbers, if we have the same functionality implemented as netdev event.
Every net namespace has loopback dev. This device will be registered
and unregistered. The event handler will be called for many events
during this actions: NETDEV_REGISTER, NETDEV_UNREGISTER, NETDEV_POST_INIT, etc, etc.

So, if we use pernet exit, it leads us to signify less number of function calls.
Just this.

Kirill

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 18:48           ` Kirill Tkhai
@ 2018-03-16 18:53             ` Sowmini Varadhan
  0 siblings, 0 replies; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-16 18:53 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On (03/16/18 21:48), Kirill Tkhai wrote:
> 
> > Thus I have to spend some time reviewing your patch,
> > and I cannot give you an answer in the next 5 minutes.
> 
> No problem, 5 minutes response never required. Thanks for
> your review.

thank you. I would like to take some time this weekend
to understand why v1 of your patch hit the null rtn,
but v2 did not (a superficial glance at the patch suggested
that you were registering twice in both cases, with just
a reordering, so I would like to understand the root-cause of
the null ptr deref with v1)

As for registering 2 times, that needs some comments to
provide guidance for other subsystems. e.g., I found the
large block comment in net-namespace.h very helpful, so lets
please clearly document what and why and when this should
be used.

--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-16 18:14       ` Kirill Tkhai
  2018-03-16 18:31         ` Sowmini Varadhan
@ 2018-03-17 14:15         ` Sowmini Varadhan
  2018-03-17 21:13           ` Kirill Tkhai
  2018-03-17 21:26           ` [rds-devel] " Sowmini Varadhan
  1 sibling, 2 replies; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-17 14:15 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet


I spent a long time staring at both v1 and v2 of your patch.

I understand the overall goal, but I am afraid to say that these
patches are complete hacks.

I was trying to understand why patchv1 blows with a null rtn in 
rds_tcp_init_net, but v2 does not, and the analysis is ugly.  

I'm going to put down the analysis here, and others can
decide if this sort of hack is a healthy solution for a scaling
issue (IMHO  it is not- we should get the real fix for the
scaling instead of using duck-tape-and-chewing-gum solutions)

What is happening in v1 is this:

1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the
   following in rds_tcp_init
       register_pernet_device(&rds_tcp_dev_ops);
       register_pernet_device(&rds_tcp_net_ops);
   Where rds_tcp_dev_ops has 
        .id = &rds_tcp_netid,
        .size = sizeof(struct rds_tcp_net),
   and rds_tcp_net_ops has 0 values for both of these.

2. So now pernet_list has &rds_tcp_net_ops as the first member of the
   pernet_list.

3. Now I do "ip netns create blue". As part of setup_net(), we walk
   the pernet_list and call the ->init of each member (through ops_init()). 
   So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
   skip the struct rds_tcp_net allocation, so rds_tcp_init_net would 
   find a null return from net_generic() and bomb.

The way I view it (and ymmv) the hack here is to call both
register_pernet_device and register_pernet_subsys: the kernel only
guarantees that calling *one* of register_pernet_* will ensure
that you can safely call net_generic() afterwards.

The v2 patch "works around" this by reordering the registration. 
So this time, init_net will set up the rds_tcp_net_ops as the second 
member, and the first memeber will be the pernet_operations struct 
that has non-zero id and size.

But then the unregistration (necessarily) works in the opposite order
you have to unregister_pernet_device first (so that interfaces are
quiesced) and then unregister_pernet_subsys() so that sockets can
be safely quiesced.

I dont think this type of hack makes the code cleaner, it just
make things much harder to understand, and completely brittle
for subsequent changes.

To solve the scaling problem why not just have a well-defined 
callback to modules when devices are quiesced, instead of 
overloading the pernet_device registration in this obscure way?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-17 14:15         ` Sowmini Varadhan
@ 2018-03-17 21:13           ` Kirill Tkhai
  2018-03-17 21:26           ` [rds-devel] " Sowmini Varadhan
  1 sibling, 0 replies; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-17 21:13 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel, edumazet

On 17.03.2018 17:15, Sowmini Varadhan wrote:
> 
> I spent a long time staring at both v1 and v2 of your patch.

Thanks for your time!

> I understand the overall goal, but I am afraid to say that these
> patches are complete hacks.

I'm not agree with you, see below the explanations.
 
> I was trying to understand why patchv1 blows with a null rtn in 
> rds_tcp_init_net, but v2 does not, and the analysis is ugly.  
> 
> I'm going to put down the analysis here, and others can
> decide if this sort of hack is a healthy solution for a scaling
> issue (IMHO  it is not- we should get the real fix for the
> scaling instead of using duck-tape-and-chewing-gum solutions)
> 
> What is happening in v1 is this:
> 
> 1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the
>    following in rds_tcp_init
>        register_pernet_device(&rds_tcp_dev_ops);
>        register_pernet_device(&rds_tcp_net_ops);
>    Where rds_tcp_dev_ops has 
>         .id = &rds_tcp_netid,
>         .size = sizeof(struct rds_tcp_net),
>    and rds_tcp_net_ops has 0 values for both of these.
> 
> 2. So now pernet_list has &rds_tcp_net_ops as the first member of the
>    pernet_list.
> 
> 3. Now I do "ip netns create blue". As part of setup_net(), we walk
>    the pernet_list and call the ->init of each member (through ops_init()). 
>    So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
>    skip the struct rds_tcp_net allocation, so rds_tcp_init_net would 
>    find a null return from net_generic() and bomb.
> 
> The way I view it (and ymmv) the hack here is to call both
> register_pernet_device and register_pernet_subsys: the kernel only
> guarantees that calling *one* of register_pernet_* will ensure
> that you can safely call net_generic() afterwards.
> 
> The v2 patch "works around" this by reordering the registration. 
> So this time, init_net will set up the rds_tcp_net_ops as the second 
> member, and the first memeber will be the pernet_operations struct 
> that has non-zero id and size.
> 
> But then the unregistration (necessarily) works in the opposite order
> you have to unregister_pernet_device first (so that interfaces are
> quiesced) and then unregister_pernet_subsys() so that sockets can
> be safely quiesced.
> 
> I dont think this type of hack makes the code cleaner, it just
> make things much harder to understand, and completely brittle
> for subsequent changes.

It's not a hack, it's just a way to fix the problem, like other
pernet_operations do. It's OK for pernet_operations to share
net_generic() id. The only thing you need is to request the id in
the pernet_operations, which go the first in pernet_list. There are
a lot of examples in kernel:

1)sunrpc_net_ops
  rpcsec_gss_net_ops

  these pernet_operations share sunrpc_net_id. It's requested in sunrpc_net_ops:

  static struct pernet_operations sunrpc_net_ops = {
        .init = sunrpc_init_net,
        .exit = sunrpc_exit_net,
        .id = &sunrpc_net_id,
        .size = sizeof(struct sunrpc_net),

  and it's also used by rpcsec_gss_net_ops:

  static struct pernet_operations rpcsec_gss_net_ops = {
        .init = rpcsec_gss_init_net,
        .exit = rpcsec_gss_exit_net,
  };

  rpcsec_gss_init_net()->gss_svc_init_net()->rsc_cache_create_net(), where:

  static int rsc_cache_create_net(struct net *net)
  {
        struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
        ...                                      ^^^here^^^

  The only thing is sunrpc_net_ops must be registered before rpcsec_gss_net_ops,
  and rpc code guarantees that.

2)ipvs_core_ops
  ipvs_core_dev_ops
  ip_vs_ftp_ops
  ip_vs_lblc_ops
  ip_vs_lblcr_ops

  these pernet_operations (5!) share ip_vs_net_id, which is requested in ipvs_core_ops:

  static struct pernet_operations ipvs_core_ops = {
        .init = __ip_vs_init,
        .exit = __ip_vs_cleanup,
        .id   = &ip_vs_net_id,
        .size = sizeof(struct netns_ipvs),
  };

  static int __net_init __ip_vs_init(struct net *net)
  {
        struct netns_ipvs *ipvs;
        ...
        ipvs = net_generic(net, ip_vs_net_id);
        net->ipvs = ipvs;
        ...
  }


  static struct pernet_operations ipvs_core_dev_ops = {
        .exit = __ip_vs_dev_cleanup,
  };

  static void __net_exit __ip_vs_dev_cleanup(struct net *net)
  {
        struct netns_ipvs *ipvs = net_ipvs(net);
                                  ^^^requested in ipvs_core_ops
        ...
  }
  
Look at the above example. They solve the same problem, rds has.
They need to do some actions at pernet_device exit time. And there
is ipvs_core_dev_ops added for this, since ipvs_core_ops are called
in pernet_subsys time. See ip_vs_init():

static int __init ip_vs_init(void)
{
        ...
        ret = register_pernet_subsys(&ipvs_core_ops);   /* Alloc ip_vs struct */
        if (ret < 0)
                goto cleanup_conn;
        ...
        ret = register_pernet_device(&ipvs_core_dev_ops);

That's all. They use pernet_device exit, which may be called in parallel
with anything, and which doesn't use rtnl_lock().

There is no reasons, rds_tcp_net_ops uses its own way different to all other
pernet_operations, and uses exclusive rtnl_lock().

> To solve the scaling problem why not just have a well-defined 
> callback to modules when devices are quiesced, instead of 
> overloading the pernet_device registration in this obscure way?

We already have them. These callbacks are called pernet_operations exit methods.
These methods can execute in parallel and scale nice.

Kirill

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-17 14:15         ` Sowmini Varadhan
  2018-03-17 21:13           ` Kirill Tkhai
@ 2018-03-17 21:26           ` Sowmini Varadhan
  2018-03-17 21:55             ` Kirill Tkhai
  1 sibling, 1 reply; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-17 21:26 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: rds-devel, linux-rdma, netdev, edumazet, davem

On (03/17/18 10:15), Sowmini Varadhan wrote:
> To solve the scaling problem why not just have a well-defined 
> callback to modules when devices are quiesced, instead of 
> overloading the pernet_device registration in this obscure way?

I thought about this a bit, and maybe I missed your original point-
today we are able to do all the needed cleanup for rds-tcp when
we unload the module, even though network activity has not quiesced,
and there is no reason we cannot use the same code for netns cleanup
as well. I think this is what you were trying to ask, when you 
said "why do you need to know that loopback is down?"
I'm sorry I  missed that, I will re-examine the code and get back to
you- it should be possible to just do one registration and 
cleanup rds-state and avoid the hack of registering twice

(saw your most recent long mail- sorry- both v1 and v2 are hacks)

I'm on the road at the moment, so I'll get back to you on this.

Thanks
--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-17 21:26           ` [rds-devel] " Sowmini Varadhan
@ 2018-03-17 21:55             ` Kirill Tkhai
  2018-03-18 20:45               ` Sowmini Varadhan
  0 siblings, 1 reply; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-17 21:55 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: rds-devel, linux-rdma, netdev, edumazet, davem

On 18.03.2018 00:26, Sowmini Varadhan wrote:
> On (03/17/18 10:15), Sowmini Varadhan wrote:
>> To solve the scaling problem why not just have a well-defined 
>> callback to modules when devices are quiesced, instead of 
>> overloading the pernet_device registration in this obscure way?
> 
> I thought about this a bit, and maybe I missed your original point-
> today we are able to do all the needed cleanup for rds-tcp when
> we unload the module, even though network activity has not quiesced,
> and there is no reason we cannot use the same code for netns cleanup
> as well. I think this is what you were trying to ask, when you 
> said "why do you need to know that loopback is down?"

I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
another solution to do that, I'm not again that.

> I'm sorry I  missed that, I will re-examine the code and get back to
> you- it should be possible to just do one registration and 
> cleanup rds-state and avoid the hack of registering twice

Sounds great, I'll wait for your response.
 
> (saw your most recent long mail- sorry- both v1 and v2 are hacks)
> 
> I'm on the road at the moment, so I'll get back to you on this.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-17 21:55             ` Kirill Tkhai
@ 2018-03-18 20:45               ` Sowmini Varadhan
  2018-03-19 10:08                 ` Kirill Tkhai
  2018-03-20 11:37                 ` Håkon Bugge
  0 siblings, 2 replies; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-18 20:45 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: rds-devel, linux-rdma, netdev, edumazet, davem

On (03/18/18 00:55), Kirill Tkhai wrote:
> 
> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
> another solution to do that, I'm not again that.

The patch below takes care of this. I've done some preliminary testing,
and I'll send it upstream after doing additional self-review/testing.
Please also take a look, if you can, to see if I missed something.

Thanks for the input,

--Sowmini
-------------------------------patch follows--------------------------------

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 08ea9cd..87c2643 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
 	return err;
 }
 
-static void __net_exit rds_tcp_exit_net(struct net *net)
-{
-	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-
-	if (rtn->rds_tcp_sysctl)
-		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
-
-	if (net != &init_net && rtn->ctl_table)
-		kfree(rtn->ctl_table);
-
-	/* If rds_tcp_exit_net() is called as a result of netns deletion,
-	 * the rds_tcp_kill_sock() device notifier would already have cleaned
-	 * up the listen socket, thus there is no work to do in this function.
-	 *
-	 * If rds_tcp_exit_net() is called as a result of module unload,
-	 * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
-	 * we do need to clean up the listen socket here.
-	 */
-	if (rtn->rds_tcp_listen_sock) {
-		struct socket *lsock = rtn->rds_tcp_listen_sock;
-
-		rtn->rds_tcp_listen_sock = NULL;
-		rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
-	}
-}
-
-static struct pernet_operations rds_tcp_net_ops = {
-	.init = rds_tcp_init_net,
-	.exit = rds_tcp_exit_net,
-	.id = &rds_tcp_netid,
-	.size = sizeof(struct rds_tcp_net),
-	.async = true,
-};
-
 static void rds_tcp_kill_sock(struct net *net)
 {
 	struct rds_tcp_connection *tc, *_tc;
@@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
 		rds_conn_destroy(tc->t_cpath->cp_conn);
 }
 
-void *rds_tcp_listen_sock_def_readable(struct net *net)
+static void __net_exit rds_tcp_exit_net(struct net *net)
 {
 	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-	struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-	if (!lsock)
-		return NULL;
+	rds_tcp_kill_sock(net);
 
-	return lsock->sk->sk_user_data;
+	if (rtn->rds_tcp_sysctl)
+		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
+
+	if (net != &init_net && rtn->ctl_table)
+		kfree(rtn->ctl_table);
 }
 
-static int rds_tcp_dev_event(struct notifier_block *this,
-			     unsigned long event, void *ptr)
+static struct pernet_operations rds_tcp_net_ops = {
+	.init = rds_tcp_init_net,
+	.exit = rds_tcp_exit_net,
+	.id = &rds_tcp_netid,
+	.size = sizeof(struct rds_tcp_net),
+	.async = true,
+};
+
+void *rds_tcp_listen_sock_def_readable(struct net *net)
 {
-	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+	struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-	/* rds-tcp registers as a pernet subys, so the ->exit will only
-	 * get invoked after network acitivity has quiesced. We need to
-	 * clean up all sockets  to quiesce network activity, and use
-	 * the unregistration of the per-net loopback device as a trigger
-	 * to start that cleanup.
-	 */
-	if (event == NETDEV_UNREGISTER_FINAL &&
-	    dev->ifindex == LOOPBACK_IFINDEX)
-		rds_tcp_kill_sock(dev_net(dev));
+	if (!lsock)
+		return NULL;
 
-	return NOTIFY_DONE;
+	return lsock->sk->sk_user_data;
 }
 
-static struct notifier_block rds_tcp_dev_notifier = {
-	.notifier_call        = rds_tcp_dev_event,
-	.priority = -10, /* must be called after other network notifiers */
-};
-
 /* when sysctl is used to modify some kernel socket parameters,this
  * function  resets the RDS connections in that netns  so that we can
  * restart with new parameters.  The assumption is that such reset
@@ -625,9 +589,7 @@ static void rds_tcp_exit(void)
 	rds_tcp_set_unloading();
 	synchronize_rcu();
 	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
-	unregister_pernet_subsys(&rds_tcp_net_ops);
-	if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
-		pr_warn("could not unregister rds_tcp_dev_notifier\n");
+	unregister_pernet_device(&rds_tcp_net_ops);
 	rds_tcp_destroy_conns();
 	rds_trans_unregister(&rds_tcp_transport);
 	rds_tcp_recv_exit();
@@ -651,24 +613,17 @@ static int rds_tcp_init(void)
 	if (ret)
 		goto out_slab;
 
-	ret = register_pernet_subsys(&rds_tcp_net_ops);
+	ret = register_pernet_device(&rds_tcp_net_ops);
 	if (ret)
 		goto out_recv;
 
-	ret = register_netdevice_notifier(&rds_tcp_dev_notifier);
-	if (ret) {
-		pr_warn("could not register rds_tcp_dev_notifier\n");
-		goto out_pernet;
-	}
-
 	rds_trans_register(&rds_tcp_transport);
 
 	rds_info_register_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
 
 	goto out;
 
-out_pernet:
-	unregister_pernet_subsys(&rds_tcp_net_ops);
+	unregister_pernet_device(&rds_tcp_net_ops);
 out_recv:
 	rds_tcp_recv_exit();
 out_slab:

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-18 20:45               ` Sowmini Varadhan
@ 2018-03-19 10:08                 ` Kirill Tkhai
  2018-03-20 11:37                 ` Håkon Bugge
  1 sibling, 0 replies; 19+ messages in thread
From: Kirill Tkhai @ 2018-03-19 10:08 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: rds-devel, linux-rdma, netdev, edumazet, davem

Hi, Sowmini,

thanks for looking into this.

On 18.03.2018 23:45, Sowmini Varadhan wrote:
> On (03/18/18 00:55), Kirill Tkhai wrote:
>>
>> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
>> another solution to do that, I'm not again that.
> 
> The patch below takes care of this. I've done some preliminary testing,
> and I'll send it upstream after doing additional self-review/testing.
> Please also take a look, if you can, to see if I missed something.
> 
> Thanks for the input,
> 
> --Sowmini
> -------------------------------patch follows--------------------------------
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 08ea9cd..87c2643 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
>  	return err;
>  }
>  
> -static void __net_exit rds_tcp_exit_net(struct net *net)
> -{
> -	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -
> -	if (rtn->rds_tcp_sysctl)
> -		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> -
> -	if (net != &init_net && rtn->ctl_table)
> -		kfree(rtn->ctl_table);
> -
> -	/* If rds_tcp_exit_net() is called as a result of netns deletion,
> -	 * the rds_tcp_kill_sock() device notifier would already have cleaned
> -	 * up the listen socket, thus there is no work to do in this function.
> -	 *
> -	 * If rds_tcp_exit_net() is called as a result of module unload,
> -	 * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
> -	 * we do need to clean up the listen socket here.
> -	 */
> -	if (rtn->rds_tcp_listen_sock) {
> -		struct socket *lsock = rtn->rds_tcp_listen_sock;
> -
> -		rtn->rds_tcp_listen_sock = NULL;
> -		rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
> -	}
> -}
> -
> -static struct pernet_operations rds_tcp_net_ops = {
> -	.init = rds_tcp_init_net,
> -	.exit = rds_tcp_exit_net,
> -	.id = &rds_tcp_netid,
> -	.size = sizeof(struct rds_tcp_net),
> -	.async = true,
> -};
> -
>  static void rds_tcp_kill_sock(struct net *net)
>  {
>  	struct rds_tcp_connection *tc, *_tc;
> @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
>  		rds_conn_destroy(tc->t_cpath->cp_conn);
>  }
>  
> -void *rds_tcp_listen_sock_def_readable(struct net *net)
> +static void __net_exit rds_tcp_exit_net(struct net *net)
>  {
>  	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -	struct socket *lsock = rtn->rds_tcp_listen_sock;
>  
> -	if (!lsock)
> -		return NULL;
> +	rds_tcp_kill_sock(net);

rds_tcp_listen_sock destruction looks nice and safe, since all
the places the sockets is dereferenced use sk_callback_lock.
So they don't miss rds_tcp_listen_sock = NULL, as rds_tcp_listen_stop()
takes the lock too.

rds_tcp_conn_list is populated from:

1)rds_tcp_accept_one(), which can't happen after we flushed the queue
in rds_tcp_listen_stop();

2)rds_sendmsg(), which is triggered by userspace, and that's impossible,
when net is dead;

3)rds_ib_cm_handle_connect(), which call rds_conn_create() with init_net
argument only. This may race with module unloading only, but this problem
is already solved in RDS by rds_destroy_pending() check, which care about
that:

static bool rds_tcp_is_unloading(struct rds_connection *conn)
{
        return atomic_read(&rds_tcp_unloading) != 0;
}

static void rds_tcp_exit(void)
{
        rds_tcp_set_unloading();
        synchronize_rcu();
 	...
}

static struct rds_connection * __rds_conn_create(...)
{
	...
        rcu_read_lock();
        if (rds_destroy_pending(conn))
                ret = -ENETDOWN;
        else    
                ret = trans->conn_alloc(conn, GFP_ATOMIC);
	...
}

So, everything looks good for me.

Kirill

>  
> -	return lsock->sk->sk_user_data;
> +	if (rtn->rds_tcp_sysctl)
> +		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> +
> +	if (net != &init_net && rtn->ctl_table)
> +		kfree(rtn->ctl_table);
>  }
>  
> -static int rds_tcp_dev_event(struct notifier_block *this,
> -			     unsigned long event, void *ptr)
> +static struct pernet_operations rds_tcp_net_ops = {
> +	.init = rds_tcp_init_net,
> +	.exit = rds_tcp_exit_net,
> +	.id = &rds_tcp_netid,
> +	.size = sizeof(struct rds_tcp_net),
> +	.async = true,
> +};
> +
> +void *rds_tcp_listen_sock_def_readable(struct net *net)
>  {
> -	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> +	struct socket *lsock = rtn->rds_tcp_listen_sock;
>  
> -	/* rds-tcp registers as a pernet subys, so the ->exit will only
> -	 * get invoked after network acitivity has quiesced. We need to
> -	 * clean up all sockets  to quiesce network activity, and use
> -	 * the unregistration of the per-net loopback device as a trigger
> -	 * to start that cleanup.
> -	 */
> -	if (event == NETDEV_UNREGISTER_FINAL &&
> -	    dev->ifindex == LOOPBACK_IFINDEX)
> -		rds_tcp_kill_sock(dev_net(dev));
> +	if (!lsock)
> +		return NULL;
>  
> -	return NOTIFY_DONE;
> +	return lsock->sk->sk_user_data;
>  }
>  
> -static struct notifier_block rds_tcp_dev_notifier = {
> -	.notifier_call        = rds_tcp_dev_event,
> -	.priority = -10, /* must be called after other network notifiers */
> -};
> -
>  /* when sysctl is used to modify some kernel socket parameters,this
>   * function  resets the RDS connections in that netns  so that we can
>   * restart with new parameters.  The assumption is that such reset
> @@ -625,9 +589,7 @@ static void rds_tcp_exit(void)
>  	rds_tcp_set_unloading();
>  	synchronize_rcu();
>  	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
> -	unregister_pernet_subsys(&rds_tcp_net_ops);
> -	if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
> -		pr_warn("could not unregister rds_tcp_dev_notifier\n");
> +	unregister_pernet_device(&rds_tcp_net_ops);
>  	rds_tcp_destroy_conns();
>  	rds_trans_unregister(&rds_tcp_transport);
>  	rds_tcp_recv_exit();
> @@ -651,24 +613,17 @@ static int rds_tcp_init(void)
>  	if (ret)
>  		goto out_slab;
>  
> -	ret = register_pernet_subsys(&rds_tcp_net_ops);
> +	ret = register_pernet_device(&rds_tcp_net_ops);
>  	if (ret)
>  		goto out_recv;
>  
> -	ret = register_netdevice_notifier(&rds_tcp_dev_notifier);
> -	if (ret) {
> -		pr_warn("could not register rds_tcp_dev_notifier\n");
> -		goto out_pernet;
> -	}
> -
>  	rds_trans_register(&rds_tcp_transport);
>  
>  	rds_info_register_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
>  
>  	goto out;
>  
> -out_pernet:
> -	unregister_pernet_subsys(&rds_tcp_net_ops);
> +	unregister_pernet_device(&rds_tcp_net_ops);
>  out_recv:
>  	rds_tcp_recv_exit();
>  out_slab:

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-18 20:45               ` Sowmini Varadhan
  2018-03-19 10:08                 ` Kirill Tkhai
@ 2018-03-20 11:37                 ` Håkon Bugge
  2018-03-20 13:29                   ` Sowmini Varadhan
  1 sibling, 1 reply; 19+ messages in thread
From: Håkon Bugge @ 2018-03-20 11:37 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Kirill Tkhai, rds-devel, OFED mailing list, netdev, edumazet, davem

Hi Sowmini,

A little nit below. And some spelling issues in existing commentary you can consider fixing, since you reshuffle this file considerable.


Thxs, Håkon


> On 18 Mar 2018, at 21:45, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote:
> 
> On (03/18/18 00:55), Kirill Tkhai wrote:
>> 
>> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
>> another solution to do that, I'm not again that.
> 
> The patch below takes care of this. I've done some preliminary testing,
> and I'll send it upstream after doing additional self-review/testing.
> Please also take a look, if you can, to see if I missed something.
> 
> Thanks for the input,
> 
> --Sowmini
> -------------------------------patch follows--------------------------------
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 08ea9cd..87c2643 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
> 	return err;
> }
> 
> -static void __net_exit rds_tcp_exit_net(struct net *net)
> -{
> -	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -
> -	if (rtn->rds_tcp_sysctl)
> -		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> -
> -	if (net != &init_net && rtn->ctl_table)
> -		kfree(rtn->ctl_table);
> -
> -	/* If rds_tcp_exit_net() is called as a result of netns deletion,
> -	 * the rds_tcp_kill_sock() device notifier would already have cleaned
> -	 * up the listen socket, thus there is no work to do in this function.
> -	 *
> -	 * If rds_tcp_exit_net() is called as a result of module unload,
> -	 * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
> -	 * we do need to clean up the listen socket here.
> -	 */
> -	if (rtn->rds_tcp_listen_sock) {
> -		struct socket *lsock = rtn->rds_tcp_listen_sock;
> -
> -		rtn->rds_tcp_listen_sock = NULL;
> -		rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
> -	}
> -}
> -
> -static struct pernet_operations rds_tcp_net_ops = {
> -	.init = rds_tcp_init_net,
> -	.exit = rds_tcp_exit_net,
> -	.id = &rds_tcp_netid,
> -	.size = sizeof(struct rds_tcp_net),
> -	.async = true,
> -};
> -
> static void rds_tcp_kill_sock(struct net *net)
> {
> 	struct rds_tcp_connection *tc, *_tc;
> @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
> 		rds_conn_destroy(tc->t_cpath->cp_conn);
> }
> 
> -void *rds_tcp_listen_sock_def_readable(struct net *net)
> +static void __net_exit rds_tcp_exit_net(struct net *net)
> {
> 	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -	struct socket *lsock = rtn->rds_tcp_listen_sock;
> 
> -	if (!lsock)
> -		return NULL;
> +	rds_tcp_kill_sock(net);
> 
> -	return lsock->sk->sk_user_data;
> +	if (rtn->rds_tcp_sysctl)
> +		unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> +
> +	if (net != &init_net && rtn->ctl_table)
> +		kfree(rtn->ctl_table);

Well, this comes from the existing code, but as pointed out by Linus recently, kfree() handles NULL pointers. So, if rtn->ctl_table is NULL most likely, the code is OK _if_ you add an unlikely() around the if-clause. Otherwise, the “ && rtn->ctl_table” can simply be removed.

> }
> 
> -static int rds_tcp_dev_event(struct notifier_block *this,
> -			     unsigned long event, void *ptr)
> +static struct pernet_operations rds_tcp_net_ops = {
> +	.init = rds_tcp_init_net,
> +	.exit = rds_tcp_exit_net,
> +	.id = &rds_tcp_netid,
> +	.size = sizeof(struct rds_tcp_net),
> +	.async = true,
> +};
> +
> +void *rds_tcp_listen_sock_def_readable(struct net *net)
> {
> -	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> +	struct socket *lsock = rtn->rds_tcp_listen_sock;
> 
> -	/* rds-tcp registers as a pernet subys, so the ->exit will only
> -	 * get invoked after network acitivity has quiesced. We need to
> -	 * clean up all sockets  to quiesce network activity, and use
> -	 * the unregistration of the per-net loopback device as a trigger
> -	 * to start that cleanup.
> -	 */
> -	if (event == NETDEV_UNREGISTER_FINAL &&
> -	    dev->ifindex == LOOPBACK_IFINDEX)
> -		rds_tcp_kill_sock(dev_net(dev));
> +	if (!lsock)
> +		return NULL;
> 
> -	return NOTIFY_DONE;
> +	return lsock->sk->sk_user_data;
> }
> 
> -static struct notifier_block rds_tcp_dev_notifier = {
> -	.notifier_call        = rds_tcp_dev_event,
> -	.priority = -10, /* must be called after other network notifiers */
> -};
> -
> /* when sysctl is used to modify some kernel socket parameters,this

s/when/When/
s/parameters,this/parameters, this/

Well, not part of your commit.


>  * function  resets the RDS connections in that netns  so that we can

Two double spaces incidents above

Not part of your commit


>  * restart with new parameters.  The assumption is that such reset
> @@ -625,9 +589,7 @@ static void rds_tcp_exit(void)
> 	rds_tcp_set_unloading();
> 	synchronize_rcu();
> 	rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
> -	unregister_pernet_subsys(&rds_tcp_net_ops);
> -	if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
> -		pr_warn("could not unregister rds_tcp_dev_notifier\n");
> +	unregister_pernet_device(&rds_tcp_net_ops);
> 	rds_tcp_destroy_conns();
> 	rds_trans_unregister(&rds_tcp_transport);
> 	rds_tcp_recv_exit();
> @@ -651,24 +613,17 @@ static int rds_tcp_init(void)
> 	if (ret)
> 		goto out_slab;
> 
> -	ret = register_pernet_subsys(&rds_tcp_net_ops);
> +	ret = register_pernet_device(&rds_tcp_net_ops);
> 	if (ret)
> 		goto out_recv;
> 
> -	ret = register_netdevice_notifier(&rds_tcp_dev_notifier);
> -	if (ret) {
> -		pr_warn("could not register rds_tcp_dev_notifier\n");
> -		goto out_pernet;
> -	}
> -
> 	rds_trans_register(&rds_tcp_transport);
> 
> 	rds_info_register_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
> 
> 	goto out;
> 
> -out_pernet:
> -	unregister_pernet_subsys(&rds_tcp_net_ops);
> +	unregister_pernet_device(&rds_tcp_net_ops);
> out_recv:
> 	rds_tcp_recv_exit();
> out_slab:
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
  2018-03-20 11:37                 ` Håkon Bugge
@ 2018-03-20 13:29                   ` Sowmini Varadhan
  0 siblings, 0 replies; 19+ messages in thread
From: Sowmini Varadhan @ 2018-03-20 13:29 UTC (permalink / raw)
  To: H??kon Bugge
  Cc: Kirill Tkhai, rds-devel, OFED mailing list, netdev, edumazet, davem

On (03/20/18 12:37), H??kon Bugge wrote:
> 
> A little nit below. And some spelling issues in existing commentary
> you can consider fixing, since you reshuffle this file considerable.
> > +	if (net != &init_net && rtn->ctl_table)
> > +		kfree(rtn->ctl_table);
> 
> Well, this comes from the existing code, but as pointed out by Linus
> recently, kfree() handles NULL pointers. So, if rtn->ctl_table is
> NULL most likely, the code is OK _if_ you add an unlikely() around the
> if-clause. Otherwise, the ??? && rtn->ctl_table??? can simply be removed.

As you observe correctly, this comes from the existing code,
and is unrelated to the contents of the commit comment.

So if we feel passionately about htis, let us please fix this
separately, with its own ceremony.


> s/when/When/
> s/parameters,this/parameters, this/
> 
> Well, not part of your commit.

As above.
> 
> 
> >  * function  resets the RDS connections in that netns  so that we can
> 
> Two double spaces incidents above
> 
> Not part of your commit

As above.

Thanks much.
--Sowmini

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-03-20 13:29 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-16 12:38 [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL) Kirill Tkhai
2018-03-16 13:00 ` Sowmini Varadhan
2018-03-16 13:17   ` Kirill Tkhai
2018-03-16 13:53     ` Sowmini Varadhan
2018-03-16 14:36       ` Kirill Tkhai
2018-03-16 14:41         ` Kirill Tkhai
2018-03-16 17:29     ` Sowmini Varadhan
2018-03-16 18:14       ` Kirill Tkhai
2018-03-16 18:31         ` Sowmini Varadhan
2018-03-16 18:48           ` Kirill Tkhai
2018-03-16 18:53             ` Sowmini Varadhan
2018-03-17 14:15         ` Sowmini Varadhan
2018-03-17 21:13           ` Kirill Tkhai
2018-03-17 21:26           ` [rds-devel] " Sowmini Varadhan
2018-03-17 21:55             ` Kirill Tkhai
2018-03-18 20:45               ` Sowmini Varadhan
2018-03-19 10:08                 ` Kirill Tkhai
2018-03-20 11:37                 ` Håkon Bugge
2018-03-20 13:29                   ` Sowmini Varadhan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.