* [RFC][PATCH] Improve NFS use of network and mount namespaces
From: Matt Helsley @ 2009-05-12 21:51 UTC
  To: Containers, linux-nfs; +Cc: Eric Biederman


Sun RPC currently opens sockets from the initial network namespace, making it
impossible to restrict which NFS servers a container may interact with.

For example, the NFS server at 10.0.0.3 reachable from the initial namespace
will always be used even if an entirely different server with the address
10.0.0.3 is reachable from a container's network namespace. Hence network
namespaces cannot be used to restrict the network access of a container as
long as the RPC code opens sockets in the initial network namespace. This is
in stark contrast to protocols such as HTTP, where sockets are created in the
proper namespace because kernel threads are not used to open sockets for
client network I/O.
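
For reference, this is the 2.6.29 helper that pins every kernel-created
socket to the initial namespace (from net/socket.c; it also appears as
context in the patch below):

	/* All kernel sockets land in init_net, regardless of which
	 * network namespace the calling task belongs to. */
	int sock_create_kern(int family, int type, int protocol,
			     struct socket **res)
	{
		return __sock_create(&init_net, family, type, protocol,
				     res, 1);
	}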

We may plausibly end up with such namespaces in two ways:

I) The administrator may mount 10.0.0.3:/export_foo from init's
container, clone the mount namespace, and unmount from the original
mount namespace.

II) The administrator may start a task which clones the mount namespace
before mounting 10.0.0.3:/export_foo (see the sketch below).
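
A minimal user-space sketch of case II (illustrative only: the export path
and mount options reuse the example values above, error handling is
simplified, and the task also unshares its network namespace, which is the
situation this patch targets):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* Give this task its own mount (and network) namespace
		 * before mounting, as in case II. */
		if (unshare(CLONE_NEWNS | CLONE_NEWNET) < 0) {
			perror("unshare");
			return 1;
		}
		/* 10.0.0.3 is now looked up via the new namespace's
		 * interfaces and routes, but without this patch the RPC
		 * sockets would still be opened in init_net. */
		if (mount("10.0.0.3:/export_foo", "/mnt", "nfs", 0,
			  "addr=10.0.0.3") < 0) {
			perror("mount");
			return 1;
		}
		return 0;
	}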

Proposed Solution:

The network namespace of the task that performed the mount best defines which
server the "administrator", whether in a container or not, expects to work
with. When the mount is done inside a container, that container's network
namespace is the one to use. When the mount is done prior to creating the
container, the original namespace is the one that should be used.

This allows system administrators to isolate network traffic generated by NFS
clients by mounting after creating a container. If partial isolation is
desired, the administrator may instead mount before creating a container with
a new network namespace. In either case the RPC packets would originate from
a consistent namespace.

One way to ensure consistent namespace usage is to hold a reference to the
original network namespace for as long as the mount exists. This naturally
suggests storing the network namespace reference in the NFS superblock.
However, it may be better to store it with the RPC transport itself, since
the transport is directly responsible for (re)opening the sockets.

This patch adds a reference to the network namespace to the RPC transport.
When the NFS export is mounted, the network namespace of the current task
determines which namespace to reference. That reference is stored in the RPC
transport and used whenever a new socket must be opened.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
---
 fs/nfs/client.c             |    5 ++++-
 include/linux/net.h         |    2 ++
 include/linux/sunrpc/clnt.h |    1 +
 include/linux/sunrpc/xprt.h |    1 +
 net/socket.c                |    5 +++++
 net/sunrpc/clnt.c           |    1 +
 net/sunrpc/xprtsock.c       |   26 ++++++++++++++++++++++----
 7 files changed, 36 insertions(+), 5 deletions(-)

Index: linux-2.6.29/fs/nfs/client.c
===================================================================
--- linux-2.6.29.orig/fs/nfs/client.c
+++ linux-2.6.29/fs/nfs/client.c
@@ -10,11 +10,11 @@
  */
 
 
 #include <linux/module.h>
 #include <linux/init.h>
-#include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/time.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/string.h>
 #include <linux/stat.h>
@@ -564,10 +564,11 @@ static int nfs_create_rpc_client(struct 
 	struct rpc_clnt		*clnt = NULL;
 	struct rpc_create_args args = {
 		.protocol	= clp->cl_proto,
 		.address	= (struct sockaddr *)&clp->cl_addr,
 		.addrsize	= clp->cl_addrlen,
+		.net_ns		= current->nsproxy->net_ns,
 		.timeout	= timeparms,
 		.servername	= clp->cl_hostname,
 		.program	= &nfs_program,
 		.version	= clp->rpc_ops->version,
 		.authflavor	= flavor,
@@ -579,12 +580,14 @@ static int nfs_create_rpc_client(struct 
 		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
 
 	if (!IS_ERR(clp->cl_rpcclient))
 		return 0;
 
+	get_net(current->nsproxy->net_ns);
 	clnt = rpc_create(&args);
 	if (IS_ERR(clnt)) {
+		put_net(current->nsproxy->net_ns);
 		dprintk("%s: cannot create RPC client. Error = %ld\n",
 				__func__, PTR_ERR(clnt));
 		return PTR_ERR(clnt);
 	}
 
Index: linux-2.6.29/include/linux/net.h
===================================================================
--- linux-2.6.29.orig/include/linux/net.h
+++ linux-2.6.29/include/linux/net.h
@@ -210,10 +210,12 @@ extern int	     sock_register(const stru
 extern void	     sock_unregister(int family);
 extern int	     sock_create(int family, int type, int proto,
 				 struct socket **res);
 extern int	     sock_create_kern(int family, int type, int proto,
 				      struct socket **res);
+extern int	     net_sock_create_kern(struct net *net, int family, int type,
+					  int proto, struct socket **res);
 extern int	     sock_create_lite(int family, int type, int proto,
 				      struct socket **res); 
 extern void	     sock_release(struct socket *sock);
 extern int   	     sock_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len);
Index: linux-2.6.29/include/linux/sunrpc/clnt.h
===================================================================
--- linux-2.6.29.orig/include/linux/sunrpc/clnt.h
+++ linux-2.6.29/include/linux/sunrpc/clnt.h
@@ -100,10 +100,11 @@ struct rpc_procinfo {
 struct rpc_create_args {
 	int			protocol;
 	struct sockaddr		*address;
 	size_t			addrsize;
 	struct sockaddr		*saddress;
+	struct net 		*net_ns;
 	const struct rpc_timeout *timeout;
 	char			*servername;
 	struct rpc_program	*program;
 	u32			prognumber;	/* overrides program->number */
 	u32			version;
Index: linux-2.6.29/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6.29.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6.29/include/linux/sunrpc/xprt.h
@@ -194,10 +194,11 @@ struct rpc_xprt {
 
 struct xprt_create {
 	int			ident;		/* XPRT_TRANSPORT identifier */
 	struct sockaddr *	srcaddr;	/* optional local address */
 	struct sockaddr *	dstaddr;	/* remote peer address */
+	struct net *		net_ns;		/* net namespace */
 	size_t			addrlen;
 };
 
 struct xprt_class {
 	struct list_head	list;
Index: linux-2.6.29/net/socket.c
===================================================================
--- linux-2.6.29.orig/net/socket.c
+++ linux-2.6.29/net/socket.c
@@ -1212,10 +1212,15 @@ int sock_create(int family, int type, in
 int sock_create_kern(int family, int type, int protocol, struct socket **res)
 {
 	return __sock_create(&init_net, family, type, protocol, res, 1);
 }
 
+int net_sock_create_kern(struct net *net, int family, int type, int protocol, struct socket **res)
+{
+	return __sock_create(net, family, type, protocol, res, 1);
+}
+
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 {
 	int retval;
 	struct socket *sock;
 	int flags;
Index: linux-2.6.29/net/sunrpc/clnt.c
===================================================================
--- linux-2.6.29.orig/net/sunrpc/clnt.c
+++ linux-2.6.29/net/sunrpc/clnt.c
@@ -263,10 +263,11 @@ struct rpc_clnt *rpc_create(struct rpc_c
 	struct rpc_clnt *clnt;
 	struct xprt_create xprtargs = {
 		.ident = args->protocol,
 		.srcaddr = args->saddress,
 		.dstaddr = args->address,
+		.net_ns  = args->net_ns,
 		.addrlen = args->addrsize,
 	};
 	char servername[48];
 
 	/*
Index: linux-2.6.29/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6.29.orig/net/sunrpc/xprtsock.c
+++ linux-2.6.29/net/sunrpc/xprtsock.c
@@ -234,10 +234,11 @@ struct sock_xprt {
 	 * Connection of transports
 	 */
 	struct delayed_work	connect_worker;
 	struct sockaddr_storage	addr;
 	unsigned short		port;
+	struct net		*net_ns;
 
 	/*
 	 * UDP socket buffer size parameters
 	 */
 	size_t			rcvsize,
@@ -819,10 +820,11 @@ static void xs_destroy(struct rpc_xprt *
 	cancel_rearming_delayed_work(&transport->connect_worker);
 
 	xs_close(xprt);
 	xs_free_peer_addresses(xprt);
 	kfree(xprt->slot);
+ 	put_net(transport->net_ns);
 	kfree(xprt);
 	module_put(THIS_MODULE);
 }
 
 static inline struct rpc_xprt *xprt_from_sock(struct sock *sk)
@@ -1537,11 +1539,13 @@ static void xs_udp_connect_worker4(struc
 		goto out;
 
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
-	if ((err = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
+ 	err = net_sock_create_kern(transport->net_ns, PF_INET, SOCK_DGRAM,
+ 				   IPPROTO_UDP, &sock);
+	if (err < 0) {
 		dprintk("RPC:       can't create UDP transport socket (%d).\n", -err);
 		goto out;
 	}
 	xs_reclassify_socket4(sock);
 
@@ -1578,11 +1582,13 @@ static void xs_udp_connect_worker6(struc
 		goto out;
 
 	/* Start by resetting any existing state */
 	xs_close(xprt);
 
-	if ((err = sock_create_kern(PF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock)) < 0) {
+ 	err = net_sock_create_kern(transport->net_ns, PF_INET6, SOCK_DGRAM,
+ 				   IPPROTO_UDP, &sock);
+	if (err < 0) {
 		dprintk("RPC:       can't create UDP transport socket (%d).\n", -err);
 		goto out;
 	}
 	xs_reclassify_socket6(sock);
 
@@ -1684,11 +1690,13 @@ static void xs_tcp_connect_worker4(struc
 	if (xprt->shutdown)
 		goto out;
 
 	if (!sock) {
 		/* start from scratch */
-		if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
+		err = net_sock_create_kern(transport->net_ns, PF_INET,
+					   SOCK_STREAM, IPPROTO_TCP, &sock);
+		if (err < 0) {
 			dprintk("RPC:       can't create TCP transport socket (%d).\n", -err);
 			goto out;
 		}
 		xs_reclassify_socket4(sock);
 
@@ -1744,11 +1752,13 @@ static void xs_tcp_connect_worker6(struc
 	if (xprt->shutdown)
 		goto out;
 
 	if (!sock) {
 		/* start from scratch */
-		if ((err = sock_create_kern(PF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
+		err = net_sock_create_kern(transport->net_ns, PF_INET6,
+					   SOCK_STREAM, IPPROTO_TCP, &sock);
+		if (err < 0) {
 			dprintk("RPC:       can't create TCP transport socket (%d).\n", -err);
 			goto out;
 		}
 		xs_reclassify_socket6(sock);
 
@@ -1988,10 +1998,14 @@ static struct rpc_xprt *xs_setup_udp(str
 
 	xprt->ops = &xs_udp_ops;
 
 	xprt->timeout = &xs_udp_default_timeout;
 
+	if (args->net_ns)
+		transport->net_ns = args->net_ns;
+	else
+		transport->net_ns = &init_net;
 	switch (addr->sa_family) {
 	case AF_INET:
 		if (((struct sockaddr_in *)addr)->sin_port != htons(0))
 			xprt_set_bound(xprt);
 
@@ -2055,10 +2069,14 @@ static struct rpc_xprt *xs_setup_tcp(str
 	xprt->idle_timeout = XS_IDLE_DISC_TO;
 
 	xprt->ops = &xs_tcp_ops;
 	xprt->timeout = &xs_tcp_default_timeout;
 
+	if (args->net_ns)
+		transport->net_ns = args->net_ns;
+	else
+		transport->net_ns = &init_net;
 	switch (addr->sa_family) {
 	case AF_INET:
 		if (((struct sockaddr_in *)addr)->sin_port != htons(0))
 			xprt_set_bound(xprt);
 

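A minimal sketch of how a transport would use the new helper once the patch
is applied (the enclosing function and error unwinding are elided; the names
mirror the xprtsock.c hunks above):

	struct socket *sock;
	int err;

	/* Open the socket in the namespace captured at mount time
	 * rather than in init_net. */
	err = net_sock_create_kern(transport->net_ns, PF_INET, SOCK_DGRAM,
				   IPPROTO_UDP, &sock);
	if (err < 0)
		dprintk("RPC:       can't create UDP transport socket (%d).\n",
			-err);

	/* The transport owns the get_net() reference taken at mount
	 * time; xs_destroy() drops it with put_net(). */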

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
From: Chuck Lever @ 2009-05-12 22:18 UTC
  To: Matt Helsley; +Cc: Containers, linux-nfs, Eric Biederman

Hi Matt-

On May 12, 2009, at 5:51 PM, Matt Helsley wrote:
> Sun RPC currently opens sockets from the initial network namespace,
> making it impossible to restrict which NFS servers a container may
> interact with.
>
> For example, the NFS server at 10.0.0.3 reachable from the initial
> namespace will always be used even if an entirely different server
> with the address 10.0.0.3 is reachable from a container's network
> namespace. Hence network namespaces cannot be used to restrict the
> network access of a container as long as the RPC code opens sockets
> in the initial network namespace. This is in stark contrast to
> protocols such as HTTP, where sockets are created in the proper
> namespace because kernel threads are not used to open sockets for
> client network I/O.
>
> We may plausibly end up with such namespaces in two ways:
>
> I) The administrator may mount 10.0.0.3:/export_foo from init's
> container, clone the mount namespace, and unmount from the original
> mount namespace.
>
> II) The administrator may start a task which clones the mount
> namespace before mounting 10.0.0.3:/export_foo.

> Proposed Solution:
>
> The network namespace of the task that performed the mount best
> defines which server the "administrator", whether in a container or
> not, expects to work with. When the mount is done inside a container,
> that container's network namespace is the one to use. When the mount
> is done prior to creating the container, the original namespace is
> the one that should be used.
>
> This allows system administrators to isolate network traffic
> generated by NFS clients by mounting after creating a container. If
> partial isolation is desired, the administrator may instead mount
> before creating a container with a new network namespace. In either
> case the RPC packets would originate from a consistent namespace.
>
> One way to ensure consistent namespace usage is to hold a reference
> to the original network namespace for as long as the mount exists.
> This naturally suggests storing the network namespace reference in
> the NFS superblock. However, it may be better to store it with the
> RPC transport itself, since the transport is directly responsible
> for (re)opening the sockets.
>
> This patch adds a reference to the network namespace to the RPC
> transport. When the NFS export is mounted, the network namespace of
> the current task determines which namespace to reference. That
> reference is stored in the RPC transport and used whenever a new
> socket must be opened.

Some (perhaps random) thoughts.

NFS clients can also receive traffic. A server can post an NLM_GRANT
request to a client to tell it that a lock the client was waiting for has
now been granted. An NFSv4 server can post a delegation callback request
to a client. Servers can also send SM_NOTIFY requests to indicate they
have rebooted.

lockd needs to use the same network (and UTS) namespace as the NFS mount
point when handling locks for these mount points, as it sends requests to
a server on separate transports, and it sends a "caller_name" (today, a
UTS name; meant to be an FQDN) in NLM_LOCK requests that is designed to be
used by the server to call the client back. Some servers perform a DNS
lookup on this name; some merely swipe the source address of the incoming
request.

The client's lockd sends its caller_name to statd (in its user space) so
that statd can send this name to servers when the client reboots. Most
servers have a statd that performs a DNS lookup on this name to send an
SM_NOTIFY back to the client.

So, lockd's callback service, and the NFSv4 delegation service, both
started by an NFS mount on a client, will likely need to be sensitive to
the mount point's network and UTS namespace.

If we want to support NFSv2/v3 lock recovery when a container restarts,
to support NFSv4 delegation for files accessed in a container, and to
support asynchronous NLM_GRANT for files locked in a container, this is
probably the way it will have to be done.

lockd/statd are probably not ready at this point to support this kind of
thing because of the simple way they manage caller_name strings today. We
have text-based NFS mount options in the kernel now, so it might be
possible (or even easy) to have user space figure out the right
configuration and then pass some of this information down via mount
options. This would keep policy decisions in user space and reduce the
amount of heuristics needed in the kernel.
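
As a purely hypothetical sketch of that last idea (the "callername="
mount option below is invented for illustration; it does not exist):

	#include <sys/mount.h>

	int main(void)
	{
		/* User space works out the right caller_name for this
		 * container and hands it to the kernel as a text-based
		 * NFS mount option next to the usual addr= option. */
		return mount("10.0.0.3:/export_foo", "/mnt", "nfs", 0,
			     "addr=10.0.0.3,callername=client1.example.com");
	}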

So, yes, I think the RPC layer is going to have to be sensitive to  
network namespaces, but something has to be done about the upper  
layers too.

-- 
Chuck Lever
chuck.lever@oracle.com


* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
From: Trond Myklebust @ 2009-05-12 23:46 UTC
  To: Matt Helsley; +Cc: Containers, linux-nfs, Eric Biederman

On Tue, 2009-05-12 at 14:51 -0700, Matt Helsley wrote:
> Sun RPC currently opens sockets from the initial network namespace, making
> it impossible to restrict which NFS servers a container may interact with.
>
> For example, the NFS server at 10.0.0.3 reachable from the initial namespace
> will always be used even if an entirely different server with the address
> 10.0.0.3 is reachable from a container's network namespace. Hence network
> namespaces cannot be used to restrict the network access of a container as
> long as the RPC code opens sockets in the initial network namespace. This is
> in stark contrast to protocols such as HTTP, where sockets are created in
> the proper namespace because kernel threads are not used to open sockets
> for client network I/O.
>
> We may plausibly end up with such namespaces in two ways:
>
> I) The administrator may mount 10.0.0.3:/export_foo from init's
> container, clone the mount namespace, and unmount from the original
> mount namespace.
>
> II) The administrator may start a task which clones the mount namespace
> before mounting 10.0.0.3:/export_foo.
>
> Proposed Solution:
>
> The network namespace of the task that performed the mount best defines
> which server the "administrator", whether in a container or not, expects
> to work with. When the mount is done inside a container, that container's
> network namespace is the one to use. When the mount is done prior to
> creating the container, the original namespace is the one that should be
> used.
>
> This allows system administrators to isolate network traffic generated by
> NFS clients by mounting after creating a container. If partial isolation
> is desired, the administrator may instead mount before creating a container
> with a new network namespace. In either case the RPC packets would
> originate from a consistent namespace.
>
> One way to ensure consistent namespace usage is to hold a reference to the
> original network namespace for as long as the mount exists. This naturally
> suggests storing the network namespace reference in the NFS superblock.
> However, it may be better to store it with the RPC transport itself, since
> the transport is directly responsible for (re)opening the sockets.
>
> This patch adds a reference to the network namespace to the RPC transport.
> When the NFS export is mounted, the network namespace of the current task
> determines which namespace to reference. That reference is stored in the
> RPC transport and used whenever a new socket must be opened.

Ewwwwwwww

You ignore the fact that NFS super blocks that point to the same
filesystem are shared (including between containers). We don't want to
have separate page caches in cases where the filesystems are the same;
that causes unnecessary cache consistency problems. There is sharing at
other levels too. All NFSv4 super blocks that share a server IP address,
will also share a common lease. Ditto when it comes to NFSv2 and NFSv3
clients, and lock monitoring state.

You ignore the fact that NFS often depends on a whole slew of other RPC
services. Kernel services like NLM (a.k.a lockd), the portmap/rpcbind
client, and user space utilities like statd and the portmap/rpcbind
server. Are we supposed to add socket namespace crap to all those apis
too?

What happens to services like rpc.gssd, of which there is only one user
space instance, and which use the ip address of the server (as supplied
by the kernel) to figure out who they are talking to?

Finally, what happens if someone decides to set up a private socket
namespace, using CLONE_NEWNET, without also using CLONE_NEWNS to create
a private mount namespace? Would anyone have even the remotest chance in
hell of figuring out what filesystem is mounted where in the ensuing
chaos?

Trond


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
  2009-05-12 21:51 ` Matt Helsley
@ 2009-05-13  0:01     ` Eric W. Biederman
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2009-05-13  0:01 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers, linux-nfs-u79uwXL29TY76Z2rM5mHXA

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Sun RPC currently opens sockets from the initial network namespace making it
> impossible to restrict which NFS servers a container may interact with.
>
> For example, the NFS server at 10.0.0.3 reachable from the initial namespace
> will always be used even if an entirely different server with the address
> 10.0.0.3 is reachable from a container's network namespace. Hence network
> namespaces cannot be used to restrict the network access of a container as long
> as the RPC code opens sockets using the initial network namespace. This is
> in stark contrast to other protocols like HTTP where the sockets are created in
> their proper namespaces because kernel threads are not used to open sockets for
> client network IO.
>
> We may plausibly end up with namespaces created by:
> I) The administrator may mount 10.0.0.3:/export_foo from init's
> container, clone the mount namespace, and unmount from the original
> mount namespace.
>
> II) The administrator may start a task which clones the mount namespace
> before mounting 10.0.0.3:/export_foo.
>
> Proposed Solution:
>
> The network namespace of the task that did the mount best defines which server
> the "administrator", whether in a container or not, expects to work with.
> When the mount is done inside a container then that is the network namespace 
> to use. When the mount is done prior to creating the container then that's the 
> namespace that should be used.
>
> This allows system administrators to isolate network traffic generated by NFS
> clients by mounting after creating a container. If partial isolation is desired
> then the administrator may mount before creating a container with a new network
> namespace. In each case the RPC packets would originate from a consistent
> namespace.
>
> One way to ensure consistent namespace usage would be to hold a reference to
> the original network namespace as long as the mount exists. This naturally 
> suggests storing the network namespace reference in the NFS superblock. 
> However, it may be better to store it with the RPC transport itself since
> it is directly responsible for (re)opening the sockets.
>
> This patch adds a reference to the network namespace to the RPC
> transport. When the NFS export is mounted the network namespace of
> the current task establishes which namespace to reference. That
> reference is stored in the RPC transport and used to open sockets
> whenever a new socket is required.

Matt.  This may be the basis of something and the problem is real.
However it is clear you have missed a lot of details.

So could you first address this problem in nfs_get_sb by 
denying the mount if we are not in the initial network namespace.

I.e.

if (current->nsproxy->net_ns != &init_net)
	return -EINVAL;

That should be a lot simpler to get right and at least give reliable
and predictable semantics.


Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
@ 2009-05-13  0:01     ` Eric W. Biederman
  0 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2009-05-13  0:01 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers, linux-nfs

Matt Helsley <matthltc@us.ibm.com> writes:

> Sun RPC currently opens sockets from the initial network namespace making it
> impossible to restrict which NFS servers a container may interact with.
>
> For example, the NFS server at 10.0.0.3 reachable from the initial namespace
> will always be used even if an entirely different server with the address
> 10.0.0.3 is reachable from a container's network namespace. Hence network
> namespaces cannot be used to restrict the network access of a container as long
> as the RPC code opens sockets using the initial network namespace. This is
> in stark contrast to other protocols like HTTP where the sockets are created in
> their proper namespaces because kernel threads are not used to open sockets for
> client network IO.
>
> We may plausibly end up with namespaces created by:
> I) The administrator may mount 10.0.0.3:/export_foo from init's
> container, clone the mount namespace, and unmount from the original
> mount namespace.
>
> II) The administrator may start a task which clones the mount namespace
> before mounting 10.0.0.3:/export_foo.
>
> Proposed Solution:
>
> The network namespace of the task that did the mount best defines which server
> the "administrator", whether in a container or not, expects to work with.
> When the mount is done inside a container then that is the network namespace 
> to use. When the mount is done prior to creating the container then that's the 
> namespace that should be used.
>
> This allows system administrators to isolate network traffic generated by NFS
> clients by mounting after creating a container. If partial isolation is desired
> then the administrator may mount before creating a container with a new network
> namespace. In each case the RPC packets would originate from a consistent
> namespace.
>
> One way to ensure consistent namespace usage would be to hold a reference to
> the original network namespace as long as the mount exists. This naturally 
> suggests storing the network namespace reference in the NFS superblock. 
> However, it may be better to store it with the RPC transport itself since
> it is directly responsible for (re)opening the sockets.
>
> This patch adds a reference to the network namespace to the RPC
> transport. When the NFS export is mounted the network namespace of
> the current task establishes which namespace to reference. That
> reference is stored in the RPC transport and used to open sockets
> whenever a new socket is required.

Matt, this may be the basis of something, and the problem is real.
However, it is clear you have missed a lot of details.

So could you first address this problem in nfs_get_sb by
denying the mount if we are not in the initial network namespace?

I.e.

if (current->nsproxy->net_ns != &init_net)
	return -EINVAL;

That should be a lot simpler to get right and at least give reliable
and predictable semantics.


Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
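
A sketch of where Eric's suggested guard might sit, at the top of the
NFS get_sb path in fs/nfs/super.c; the surrounding signature assumes
2.6.29-era code and is illustrative, not taken from any patch:

/* fs/nfs/super.c: refuse NFS mounts outside the initial net namespace */
#include <linux/nsproxy.h>
#include <linux/sched.h>
#include <net/net_namespace.h>

static int nfs_get_sb(struct file_system_type *fs_type, int flags,
		      const char *dev_name, void *raw_data,
		      struct vfsmount *mnt)
{
	/* Deny the mount until the RPC layer handles network
	 * namespaces correctly; this gives predictable semantics. */
	if (current->nsproxy->net_ns != &init_net)
		return -EINVAL;

	/* ... existing mount processing continues unchanged ... */
	return 0;
}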

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
@ 2009-05-13  0:04         ` Eric W. Biederman
  0 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2009-05-13  0:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Matt Helsley, Containers, linux-nfs

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> Finally, what happens if someone decides to set up a private socket
> namespace, using CLONE_NEWNET, without also using CLONE_NEWNS to create
> a private mount namespace? Would anyone have even the remotest chance in
> hell of figuring out what filesystem is mounted where in the ensuing
> chaos?

Good question.  Multiple NFS servers with the same IP address reachable
from the same machine sounds about as nasty a pickle as it gets.

The only way I can even imagine a setup like that is someone connecting
to a VPN, so they are behind more than one NAT gateway.

Bleh, NAT sucks.

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
@ 2009-05-13  0:13             ` Trond Myklebust
  0 siblings, 0 replies; 18+ messages in thread
From: Trond Myklebust @ 2009-05-13  0:13 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Matt Helsley, Containers, linux-nfs

On Tue, 2009-05-12 at 17:04 -0700, Eric W. Biederman wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
> 
> > Finally, what happens if someone decides to set up a private socket
> > namespace, using CLONE_NEWNET, without also using CLONE_NEWNS to create
> > a private mount namespace? Would anyone have even the remotest chance in
> > hell of figuring out what filesystem is mounted where in the ensuing
> > chaos?
> 
> Good question.  Multiple NFS servers with the same IP address reachable
> from the same machine sounds about as nasty a pickle as it gets.
> 
> The only way I can even imagine a setup like that is someone connecting
> to a VPN, so they are behind more than one NAT gateway.
> 
> Bleh, NAT sucks.

It is doable, though, and it will affect more than just NFS. Pretty much
all networked filesystems are affected.

It begs the question: is there ever any possible justification for
allowing CLONE_NEWNET without implying CLONE_NEWNS?

Trond


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
@ 2009-05-13  0:44                 ` Matt Helsley
  0 siblings, 0 replies; 18+ messages in thread
From: Matt Helsley @ 2009-05-13  0:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Eric W. Biederman, Matt Helsley, Containers, linux-nfs

On Tue, May 12, 2009 at 08:13:24PM -0400, Trond Myklebust wrote:
> On Tue, 2009-05-12 at 17:04 -0700, Eric W. Biederman wrote:
> > Trond Myklebust <trond.myklebust@fys.uio.no> writes:
> > 
> > > Finally, what happens if someone decides to set up a private socket
> > > namespace, using CLONE_NEWNET, without also using CLONE_NEWNS to create
> > > a private mount namespace? Would anyone have even the remotest chance in
> > > hell of figuring out what filesystem is mounted where in the ensuing
> > > chaos?
> > 
> > Good question.  Multiple NFS servers with the same IP address reachable
> > from the same machine sounds about as nasty a pickle as it gets.
> > 
> > The only way I can even imagine a setup like that is someone connecting
> > to a VPN, so they are behind more than one NAT gateway.
> > 
> > Bleh, NAT sucks.
> 
> It is doable, though, and it will affect more than just NFS. Pretty much
> all networked filesystems are affected.
> 
> It begs the question: is there ever any possible justification for
> allowing CLONE_NEWNET without implying CLONE_NEWNS?

There are so many filesystem-based kernel APIs that this is a pervasive
problem IMHO -- not just with CLONE_NEWNET. However, even if we required
CLONE_NEWNET|CLONE_NEWNS, network namespaces would still present a
problem for network filesystems in general.

-Matt

^ permalink raw reply	[flat|nested] 18+ messages in thread
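
If CLONE_NEWNET did imply CLONE_NEWNS, container setup would look
roughly like the following clone(2) call; this is a sketch of the flag
combination under discussion, not an interface proposed in the thread:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static int child(void *arg)
{
	/* Both the network and the mount namespace are private here,
	 * so any NFS mount this task makes stays consistent with the
	 * network it can actually reach. */
	return 0;
}

int main(void)
{
	static char stack[64 * 1024];
	pid_t pid;

	/* The stack grows down on most architectures, so pass the top. */
	pid = clone(child, stack + sizeof(stack),
		    CLONE_NEWNET | CLONE_NEWNS | SIGCHLD, NULL);
	if (pid == -1) {
		perror("clone");
		exit(EXIT_FAILURE);
	}
	waitpid(pid, NULL, 0);
	return 0;
}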

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
@ 2009-05-13  1:05         ` Matt Helsley
  0 siblings, 0 replies; 18+ messages in thread
From: Matt Helsley @ 2009-05-13  1:05 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Matt Helsley, Containers, linux-nfs

On Tue, May 12, 2009 at 05:01:58PM -0700, Eric W. Biederman wrote:
> Matt Helsley <matthltc@us.ibm.com> writes:
> 
> > Sun RPC currently opens sockets from the initial network namespace making it
> > impossible to restrict which NFS servers a container may interact with.
> >
> > For example, the NFS server at 10.0.0.3 reachable from the initial namespace
> > will always be used even if an entirely different server with the address
> > 10.0.0.3 is reachable from a container's network namespace. Hence network
> > namespaces cannot be used to restrict the network access of a container as long
> > as the RPC code opens sockets using the initial network namespace. This is
> > in stark contrast to other protocols like HTTP where the sockets are created in
> > their proper namespaces because kernel threads are not used to open sockets for
> > client network IO.
> >
> > We may plausibly end up with namespaces created by:
> > I) The administrator may mount 10.0.0.3:/export_foo from init's
> > container, clone the mount namespace, and unmount from the original
> > mount namespace.
> >
> > II) The administrator may start a task which clones the mount namespace
> > before mounting 10.0.0.3:/export_foo.
> >
> > Proposed Solution:
> >
> > The network namespace of the task that did the mount best defines which server
> > the "administrator", whether in a container or not, expects to work with.
> > When the mount is done inside a container then that is the network namespace 
> > to use. When the mount is done prior to creating the container then that's the 
> > namespace that should be used.
> >
> > This allows system administrators to isolate network traffic generated by NFS
> > clients by mounting after creating a container. If partial isolation is desired
> > then the administrator may mount before creating a container with a new network
> > namespace. In each case the RPC packets would originate from a consistent
> > namespace.
> >
> > One way to ensure consistent namespace usage would be to hold a reference to
> > the original network namespace as long as the mount exists. This naturally 
> > suggests storing the network namespace reference in the NFS superblock. 
> > However, it may be better to store it with the RPC transport itself since
> > it is directly responsible for (re)opening the sockets.
> >
> > This patch adds a reference to the network namespace to the RPC
> > transport. When the NFS export is mounted the network namespace of
> > the current task establishes which namespace to reference. That
> > reference is stored in the RPC transport and used to open sockets
> > whenever a new socket is required.
> 
> Matt, this may be the basis of something, and the problem is real.
> However, it is clear you have missed a lot of details.

Well, crap. I did not ignore the other RPC services I noticed when I
tried reading the NFS/RPC code, but based on the responses from Chuck,
you, and Trond, I clearly fucked up when I thought I had properly
understood how the RPC code works with the services that support NFS.

I figured that since RPC was the core of these services, it would be a
good place to start trying to address the problem. It looked like the
RPC transport was a good place to deal with all of these services, since
it's responsible for (re)opening the sockets needed to perform RPC IO.
But apparently the transport is not shared the way I thought it was. :/

> So could you first address this problem in nfs_get_sb by
> denying the mount if we are not in the initial network namespace?
> 
> I.e.
> 
> if (current->nsproxy->net_ns != &init_net)
> 	return -EINVAL;
> 
> That should be a lot simpler to get right and at least give reliable
> and predictable semantics.

Yes, that seems like a reasonable preventive measure for now.

	-Matt


^ permalink raw reply	[flat|nested] 18+ messages in thread
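
For reference, the shape Matt's description outlines (capturing the
mount-time namespace in the RPC transport and using it whenever a
socket is reopened) would look roughly like this; the field and helper
names here are assumptions, simplified rather than copied from the
patch:

/* include/linux/sunrpc/xprt.h: hold the mount-time namespace */
struct rpc_xprt {
	/* ... existing fields ... */
	struct net *xprt_net;	/* reference taken with get_net() at mount */
};

/* net/sunrpc/xprtsock.c: open sockets in that namespace */
static int xs_create_sock(struct rpc_xprt *xprt, int family, int type,
			  int protocol, struct socket **sockp)
{
	/* __sock_create() takes an explicit struct net; plain
	 * sock_create_kern() would hardcode init_net here. */
	return __sock_create(xprt->xprt_net, family, type, protocol,
			     sockp, 1);
}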

* Re: [RFC][PATCH] Improve NFS use of network and mount namespaces
@ 2009-05-13  1:11                 ` Eric W. Biederman
  0 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2009-05-13  1:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Matt Helsley, Containers, linux-nfs

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Tue, 2009-05-12 at 17:04 -0700, Eric W. Biederman wrote:
>> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
>> 
>> > Finally, what happens if someone decides to set up a private socket
>> > namespace, using CLONE_NEWNET, without also using CLONE_NEWNS to create
>> > a private mount namespace? Would anyone have even the remotest chance in
>> > hell of figuring out what filesystem is mounted where in the ensuing
>> > chaos?
>> 
>> Good question.  Multiple NFS servers with the same ip address reachable
>> from the same machine sounds about as nasty pickle as it gets.
>> 
>> The only way I can even imagine a setup like that is someone connecting
>> to a vpn.  So they are behind more than one NAT gateway.
>> 
>> Bleh NAT sucks.
>
> It is doable, though, and it will affect more than just NFS. Pretty much
> all networked filesystems are affected.

Good point.  That was an oversight when I did the initial round of
patches denying the unsupported cases outside the initial network
namespace.

> It begs the question: is there ever any possible justification for
> allowing CLONE_NEWNET without implying CLONE_NEWNS?

Superblocks and the like are independent of the mount namespace, so I
don't even see CLONE_NEWNS helping except for looking in
/proc/mounts.

If network filesystems have a path-based identity, a.k.a. an IP
address, this is a problem. If there is some other kind of identity,
like a UUID, this problem might not even matter.

As for the original question: we have test setups at work where we have
tests running in different network namespaces, but they don't conflict
in the filesystem, so CLONE_NEWNS would be redundant as well as
unhelpful.

Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2009-05-13  1:11 UTC | newest]

Thread overview: 9 messages
2009-05-12 21:51 [RFC][PATCH] Improve NFS use of network and mount namespaces Matt Helsley
2009-05-12 22:18 ` Chuck Lever
2009-05-12 23:46 ` Trond Myklebust
2009-05-13  0:04   ` Eric W. Biederman
2009-05-13  0:13     ` Trond Myklebust
2009-05-13  0:44       ` Matt Helsley
2009-05-13  1:11       ` Eric W. Biederman
2009-05-13  0:01 ` Eric W. Biederman
2009-05-13  1:05   ` Matt Helsley
