All of lore.kernel.org
 help / color / mirror / Atom feed
* sock_create_kern() and (lack of) get_net()
@ 2017-05-03 11:39 David Laight
  2017-05-03 16:33 ` Cong Wang
  0 siblings, 1 reply; 3+ messages in thread
From: David Laight @ 2017-05-03 11:39 UTC (permalink / raw)
  To: netdev

sock_create_kern() passes 'kern=1' to __sock_create().
sock_create() passes 'kern=0' and uses current->nsproxy->ns_net.

The 'kern' parameter is passed to security_socket_create() and
security_socket_post_create() - I think this is just checking
whether the call is allowed.

The 'kern' parameter is also passed through to sk_alloc() and
controls whether the socket holds a reference count to the namespace.

The latter 'feature' is there because some sockets are used within
the protocol stack itself and the network namespace needs to be
deleteable while those sockets exits.
Prior to 4.2 get_get() was called when all sockets were created and
a 'dance' was done is a few places to drop the reference.
These sockets are still inside the namespace - so must be deleted
by the code that deletes the namespace.

I suspect that many of the sockets created with 'kern=1' are not 'special'
and should hold a reference to the namespace.

In particular code that calls sock_create_kern() and then uses the
kernel_xxx() socket functions at the bottom of net/socket.c probably
want to hold a reference to the network namespace.
I'm pretty sure the socket can still exist (eg draining data) after
sock_release() is called - so the driver can't hold the namespace
reference on behalf of the socket.

A quick audit shows calls to __sock_create(..., 1) at:
  ./fs/cifs/connect.c:3176
  ./net/wireless/nl80211.c:10022
  ./net/sunrpc/svcsock.c:1516
  ./net/sunrpc/clnt.c:1247
  ./net/sunrpc/xprtsock.c:1952
  ./net/sunrpc/xprtsock.c:2019
  ./net/9p/trans_fd.c:948
  ./net/9p/trans_fd.c:996
and calls to sock_create_kern() at:
  ./drivers/infiniband/sw/rxe/rxe_qp.c:233
  ./drivers/block/drbd/drbd_receiver.c:631
  ./drivers/block/drbd/drbd_receiver.c:726
  ./fs/dlm/lowcomms.c:732
  ./fs/dlm/lowcomms.c:1053
  ./fs/dlm/lowcomms.c:1134
  ./fs/dlm/lowcomms.c:1221
  ./fs/dlm/lowcomms.c:1303
  ./fs/afs/rxrpc.c:68
  ./net/ceph/messenger.c:480
  ./net/rds/tcp_connect.c
  ./net/rds/tcp_connect.c:108
  ./net/rds/tcp_listen.c:128
  ./net/rds/tcp_listen.c:247
  ./net/rxrpc/local_object.c:117
  ./net/smc/af_smc.c:1317
  ./net/l2tp/l2tp_core.c:1506
  ./net/l2tp/l2tp_core.c:1534
All of which look to me like code that is using IP connections and
would need to be shut down before any namespace could be deleted.

There are also calls to sock_create_kern() in:
  ./net/tipc/server.c:330
  ./net/ipv6/ip6_udp_tunnel.c:22
  ./net/ipv4/udp_tunnel.c:19
  ./net/ipv4/af_inet.c:1529
  ./net/bluetooth/rfcomm/core.c:203
  ./net/netfilter/ipvs/ip_vs_sync.c:1503
  ./net/netfilter/ipvs/ip_vs_sync.c:1560
These might all be internal to the protocol stack.

I suspect that the 'kern' parameter to __sock_create() needs changing
to 'flags' with:
  1 - traditional 'kernel' socket, pass '1' to security_socket_create().
  2 - 'protocol internal' socket, don't hold a net_ns reference count.
The call sites would then need auditing to see which value they should
pass.

As usual I've probably missed something obvious...

	David

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: sock_create_kern() and (lack of) get_net()
  2017-05-03 11:39 sock_create_kern() and (lack of) get_net() David Laight
@ 2017-05-03 16:33 ` Cong Wang
  2017-05-04  9:00   ` David Laight
  0 siblings, 1 reply; 3+ messages in thread
From: Cong Wang @ 2017-05-03 16:33 UTC (permalink / raw)
  To: David Laight; +Cc: netdev

On Wed, May 3, 2017 at 4:39 AM, David Laight <David.Laight@aculab.com> wrote:
> I suspect that many of the sockets created with 'kern=1' are not 'special'
> and should hold a reference to the namespace.

They are special if they are created in net init, which means they
have the same life-time with netns. They should NOT hold a refcnt,
otherwise who would release the last netns refcnt? net exit is called
when refcnt reaches 0, if you really held it, it is always at least 1.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: sock_create_kern() and (lack of) get_net()
  2017-05-03 16:33 ` Cong Wang
@ 2017-05-04  9:00   ` David Laight
  0 siblings, 0 replies; 3+ messages in thread
From: David Laight @ 2017-05-04  9:00 UTC (permalink / raw)
  To: 'Cong Wang'; +Cc: netdev

From: Cong Wang
> Sent: 03 May 2017 17:33
> On Wed, May 3, 2017 at 4:39 AM, David Laight <David.Laight@aculab.com> wrote:
> > I suspect that many of the sockets created with 'kern=1' are not 'special'
> > and should hold a reference to the namespace.
> 
> They are special if they are created in net init, which means they
> have the same life-time with netns. They should NOT hold a refcnt,
> otherwise who would release the last netns refcnt? net exit is called
> when refcnt reaches 0, if you really held it, it is always at least 1.

Right, so some are 'special' but some aren't?

At the moment no sockets created in the kernel hold a reference to the
namespace, deleting the namespace won't magically cause the kernel
module that created the socket to delete it.
If the socket is being used for a TCP connection then in can sit in
one of the FIN-WAIT states after sock_release() has been called
(I'm not sure exactly when the 'sock' and 'socket' get freed).

I guess it would be possible for the namespace deleting code to
traverse the list of sockets that reference the namespace and 'unhook'
then all from the relevant protocol stack so that any 'user' calls
error out.
It would need to allow for threads blocked inside the socket functions.

Does any of that happen?

	David


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-05-04  9:00 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-03 11:39 sock_create_kern() and (lack of) get_net() David Laight
2017-05-03 16:33 ` Cong Wang
2017-05-04  9:00   ` David Laight

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.