* RFC: netfilter: nf_conntrack: add support for "conntrack zones"
@ 2010-01-14 14:05 Patrick McHardy
From: Patrick McHardy @ 2010-01-14 14:05 UTC
  To: Netfilter Development Mailinglist; +Cc: Linux Netdev List, containers

[-- Attachment #1: Type: text/plain, Size: 2897 bytes --]

The attached largish patch adds support for "conntrack zones",
which are virtual conntrack tables that can be used to separate
connections from different zones, making it possible to handle
multiple connections with equal identities in conntrack and NAT.

A zone is simply a numerical identifier associated with a network
device that is incorporated into the various hashes and used to
distinguish entries in addition to the connection tuples. Additionally
it is used to separate conntrack defragmentation queues. An iptables
target for the raw table could be used as an alternative to the
network device for assigning conntrack entries to zones.
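
To illustrate the idea, here is a minimal userspace sketch (not the
kernel code) of a table in which the zone takes part in both the hash
and the equality check. All toy_* names, the table layout and the hash
function are assumptions made up for this example; the real patch uses
jhash and the existing conntrack hash tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct nf_conntrack_tuple. */
struct toy_tuple {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t  protonum;
};

/* Hypothetical stand-in for a conntrack entry: a tuple plus its zone. */
struct toy_conn {
	struct toy_tuple tuple;
	uint16_t zone;
	int in_use;
};

#define TOY_TABLE_SIZE 64

struct toy_table {
	struct toy_conn slots[TOY_TABLE_SIZE];
};

static void toy_table_init(struct toy_table *tbl)
{
	memset(tbl, 0, sizeof(*tbl));
}

static int toy_tuple_equal(const struct toy_tuple *a, const struct toy_tuple *b)
{
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port &&
	       a->protonum == b->protonum;
}

/* Toy hash (not jhash): the zone is mixed in next to the tuple fields. */
static unsigned int toy_hash(const struct toy_tuple *t, uint16_t zone)
{
	uint32_t h = t->src_ip * 2654435761u;

	h ^= t->dst_ip * 2246822519u;
	h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
	h ^= t->protonum;
	h ^= zone;
	return h % TOY_TABLE_SIZE;
}

/* Find an entry by (zone, tuple). Entries are never deleted in this
 * sketch, so linear probing can stop at the first empty slot. */
static struct toy_conn *toy_find(struct toy_table *tbl, uint16_t zone,
				 const struct toy_tuple *t)
{
	unsigned int start, i;

	assert(tbl != NULL && t != NULL);
	start = toy_hash(t, zone);
	for (i = 0; i < TOY_TABLE_SIZE; i++) {
		struct toy_conn *c = &tbl->slots[(start + i) % TOY_TABLE_SIZE];

		if (!c->in_use)
			return NULL;
		/* A match needs an equal tuple AND an equal zone. */
		if (toy_tuple_equal(&c->tuple, t) && c->zone == zone)
			return c;
	}
	return NULL;
}

/* Returns 0 on success, -1 if the same identity already exists in the
 * same zone or the table is full. */
static int toy_insert(struct toy_table *tbl, uint16_t zone,
		      const struct toy_tuple *t)
{
	unsigned int start, i;

	assert(tbl != NULL && t != NULL);
	start = toy_hash(t, zone);
	for (i = 0; i < TOY_TABLE_SIZE; i++) {
		struct toy_conn *c = &tbl->slots[(start + i) % TOY_TABLE_SIZE];

		if (!c->in_use) {
			c->tuple = *t;
			c->zone = zone;
			c->in_use = 1;
			return 0;
		}
		if (toy_tuple_equal(&c->tuple, t) && c->zone == zone)
			return -1;
	}
	return -1;
}
```

With this layout the same tuple can be inserted once per zone, while a
second insert of that tuple into the same zone is rejected as a clash.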

This is mainly useful when connecting multiple private networks that
use the same addresses (which unfortunately happens occasionally): the
packets are passed through a set of veth devices and each network is
SNATed to a unique address, after which they can pass through the
"main" zone and be handled like regular non-clashing packets and/or
have NAT applied a second time based, for instance, on the outgoing
interface.

Something like this, with multiple tunl and veth devices, each pair
using a unique zone:

  <tunl0 / zone 1>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to unique network
     |
  <veth1 / zone 1>
  <veth0 / zone 0>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to eth0 address
     |
  <eth0>
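
For a setup like the one pictured, each device needs its zone assigned.
The attached patch exposes the zone as a writable nf_ct_zone attribute
among the net class device attributes, so a small helper along the
following lines could set it. This is only a sketch: it assumes a
kernel with this RFC patch applied and CONFIG_NF_CONNTRACK_ZONES
enabled, it assumes the attribute appears under /sys/class/net/<dev>/,
and the function names are invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the assumed sysfs path of the per-device zone attribute.
 * Returns 0 on success, -1 if the name is empty or the buffer is
 * too small. */
static int zone_attr_path(char *buf, size_t len, const char *ifname)
{
	int n;

	assert(buf != NULL && ifname != NULL);
	if (strlen(ifname) == 0)
		return -1;
	n = snprintf(buf, len, "/sys/class/net/%s/nf_ct_zone", ifname);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write the zone as a decimal number to the given attribute file.
 * The zone is a u16 in the patch, so larger values are rejected here. */
static int write_zone(const char *path, unsigned int zone)
{
	FILE *f;
	int ok;

	if (zone > 0xffff)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", zone) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Convenience wrapper: set the zone of one network device. */
static int set_dev_zone(const char *ifname, unsigned int zone)
{
	char path[128];

	if (zone_attr_path(path, sizeof(path), ifname) < 0)
		return -1;
	return write_zone(path, zone);
}
```

For the diagram above this would mean calling set_dev_zone("tunl0", 1)
and set_dev_zone("veth1", 1), presumably leaving veth0 and eth0 at the
default zone 0.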

As probably everyone has noticed, this is quite similar to what you
can do using network namespaces. The main reason for not using
network namespaces is that it's an all-or-nothing approach: you can't
virtualize just connection tracking. Besides the difficulties in
managing different namespaces from, for instance, an IKE or PPP daemon
running in the initial namespace, network namespaces have quite a
large overhead, especially when used with a large conntrack table.

I'm not too fond of this partial feature duplication myself, but I
couldn't think of a better way to do this without the downsides of
using namespaces. Having partially shared network namespaces would
be great, but it doesn't seem to fit in the design very well.
I'm open to any better suggestions :)

A couple of notes on the patch:

- it's not entirely finished yet (ctnetlink and xt_connlimit are
  missing); I wanted to have a discussion about the general idea first.

- the patch uses ct_extend to avoid increasing the connection tracking
  entry size when this feature is not used. An older version of this
  patch added the zone identifier to the conntrack tuples. This greatly
  simplifies the changes to the code since the zone doesn't have to be
  passed around (something like 40 lines total), but has the downside
  of increasing the tuple size.

- the overhead should be quite small; it's mainly the extra argument
  passing and an occasional extra comparison. Code size increase with
  all netfilter options enabled on x86_64 is 152 bytes.
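
One detail the notes above don't spell out is how the defragmentation
queues are kept apart per zone. In the attached patch each conntrack
defrag user reserves a block of 0x10000 IDs in the enum, and the queue
user is computed as base + zone. The following standalone sketch shows
why two users can never overlap; the TOY_* names are hypothetical and
only mirror the shape of the patched enum ip_defrag_users.

```c
#include <assert.h>
#include <stdint.h>

/* Each conntrack user is followed by an *_END marker 0xffff higher, so
 * the next enumerator starts right after the reserved block. */
enum toy_defrag_users {
	TOY_DEFRAG_LOCAL_DELIVER,
	TOY_DEFRAG_CONNTRACK_IN,
	TOY_DEFRAG_CONNTRACK_IN_END  = TOY_DEFRAG_CONNTRACK_IN + 0xffff,
	TOY_DEFRAG_CONNTRACK_OUT,
	TOY_DEFRAG_CONNTRACK_OUT_END = TOY_DEFRAG_CONNTRACK_OUT + 0xffff,
	TOY_DEFRAG_VS_IN,
};

/* Compile-time check: the OUT block begins directly after the IN block. */
_Static_assert(TOY_DEFRAG_CONNTRACK_OUT == TOY_DEFRAG_CONNTRACK_IN_END + 1,
	       "defrag user blocks must not overlap");

/* Pick the defrag queue user for a packet: base ID of the hook plus the
 * zone. A 16-bit zone always stays inside the reserved block. */
static unsigned int toy_defrag_user(int prerouting, uint16_t zone)
{
	unsigned int base = prerouting ? TOY_DEFRAG_CONNTRACK_IN
				       : TOY_DEFRAG_CONNTRACK_OUT;

	return base + zone;
}
```

Because the zone is a u16, the largest IN user (zone 0xffff) is still
smaller than the smallest OUT user (zone 0), so fragments from different
zones or hooks never land in the same queue.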

Any comments welcome.

[-- Attachment #2: 01.diff --]
[-- Type: text/x-patch, Size: 50283 bytes --]

commit 7f68e7aa55f9e1f9dfd647b60dace4149f27ae1f
Author: Patrick McHardy <kaber@trash.net>
Date:   Thu Jan 14 13:51:06 2010 +0100

    netfilter: nf_conntrack: add support for "conntrack zones"
    
    Normally, each connection needs a unique identity. Conntrack zones allow
    specifying a numerical zone for each interface; connections in different
    zones can use the same identity.
    
    Signed-off-by: Patrick McHardy <kaber@trash.net>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a3fccc8..6e6a209 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -953,6 +953,10 @@ struct net_device {
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	u16			nf_ct_zone;
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/net/ip.h b/include/net/ip.h
index 85108cf..61aface 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -336,8 +336,11 @@ enum ip_defrag_users {
 	IP_DEFRAG_LOCAL_DELIVER,
 	IP_DEFRAG_CALL_RA_CHAIN,
 	IP_DEFRAG_CONNTRACK_IN,
+	__IP_DEFRAG_CONNTRACK_IN_END	= IP_DEFRAG_CONNTRACK_IN + 0xffff,
 	IP_DEFRAG_CONNTRACK_OUT,
+	__IP_DEFRAG_CONNTRACK_OUT_END	= IP_DEFRAG_CONNTRACK_OUT + 0xffff,
 	IP_DEFRAG_CONNTRACK_BRIDGE_IN,
+	__IP_DEFRAG_CONNTRACK_BRIDGE_IN = IP_DEFRAG_CONNTRACK_BRIDGE_IN + 0xffff,
 	IP_DEFRAG_VS_IN,
 	IP_DEFRAG_VS_OUT,
 	IP_DEFRAG_VS_FWD
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ccab594..b82a68d 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -353,8 +353,11 @@ struct inet_frag_queue;
 enum ip6_defrag_users {
 	IP6_DEFRAG_LOCAL_DELIVER,
 	IP6_DEFRAG_CONNTRACK_IN,
+	__IP6_DEFRAG_CONNTRACK_IN	= IP6_DEFRAG_CONNTRACK_IN + 0xffff,
 	IP6_DEFRAG_CONNTRACK_OUT,
+	__IP6_DEFRAG_CONNTRACK_OUT	= IP6_DEFRAG_CONNTRACK_OUT + 0xffff,
 	IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
+	__IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + 0xffff,
 };
 
 struct ip6_create_arg {
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a0904ad..9488ac6 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -198,7 +198,8 @@ extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int null
 extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size);
 
 extern struct nf_conntrack_tuple_hash *
-__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple);
+__nf_conntrack_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple);
 
 extern void nf_conntrack_hash_insert(struct nf_conn *ct);
 extern void nf_ct_delete_from_lists(struct nf_conn *ct);
@@ -267,7 +268,7 @@ extern void
 nf_ct_iterate_cleanup(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data);
 extern void nf_conntrack_free(struct nf_conn *ct);
 extern struct nf_conn *
-nf_conntrack_alloc(struct net *net,
+nf_conntrack_alloc(struct net *net, u16 zone,
 		   const struct nf_conntrack_tuple *orig,
 		   const struct nf_conntrack_tuple *repl,
 		   gfp_t gfp);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 5a449b4..c7a1162 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -20,7 +20,7 @@
 /* This header is used to share core functionality between the
    standalone connection tracking module, and the compatibility layer's use
    of connection tracking. */
-extern unsigned int nf_conntrack_in(struct net *net,
+extern unsigned int nf_conntrack_in(struct net *net, u16 zone,
 				    u_int8_t pf,
 				    unsigned int hooknum,
 				    struct sk_buff *skb);
@@ -49,7 +49,8 @@ nf_ct_invert_tuple(struct nf_conntrack_tuple *inverse,
 
 /* Find a connection corresponding to a tuple. */
 extern struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple);
+nf_conntrack_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple);
 
 extern int __nf_conntrack_confirm(struct sk_buff *skb);
 
diff --git a/include/net/netfilter/nf_conntrack_expect.h b/include/net/netfilter/nf_conntrack_expect.h
index 9a2b9cb..83c49f3 100644
--- a/include/net/netfilter/nf_conntrack_expect.h
+++ b/include/net/netfilter/nf_conntrack_expect.h
@@ -77,13 +77,16 @@ int nf_conntrack_expect_init(struct net *net);
 void nf_conntrack_expect_fini(struct net *net);
 
 struct nf_conntrack_expect *
-__nf_ct_expect_find(struct net *net, const struct nf_conntrack_tuple *tuple);
+__nf_ct_expect_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple);
 
 struct nf_conntrack_expect *
-nf_ct_expect_find_get(struct net *net, const struct nf_conntrack_tuple *tuple);
+nf_ct_expect_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple);
 
 struct nf_conntrack_expect *
-nf_ct_find_expectation(struct net *net, const struct nf_conntrack_tuple *tuple);
+nf_ct_find_expectation(struct net *net, u16 zone,
+		       const struct nf_conntrack_tuple *tuple);
 
 void nf_ct_unlink_expect(struct nf_conntrack_expect *exp);
 void nf_ct_remove_expectations(struct nf_conn *ct);
diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h
index e192dc1..2d2a1f9 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -8,6 +8,7 @@ enum nf_ct_ext_id {
 	NF_CT_EXT_NAT,
 	NF_CT_EXT_ACCT,
 	NF_CT_EXT_ECACHE,
+	NF_CT_EXT_ZONE,
 	NF_CT_EXT_NUM,
 };
 
@@ -15,6 +16,7 @@ enum nf_ct_ext_id {
 #define NF_CT_EXT_NAT_TYPE struct nf_conn_nat
 #define NF_CT_EXT_ACCT_TYPE struct nf_conn_counter
 #define NF_CT_EXT_ECACHE_TYPE struct nf_conntrack_ecache
+#define NF_CT_EXT_ZONE_TYPE struct nf_conntrack_zone
 
 /* Extensions: optional stuff which isn't permanently in struct. */
 struct nf_ct_ext {
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index ca6dcf3..14b6492 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -49,8 +49,8 @@ struct nf_conntrack_l4proto {
 	/* Called when a conntrack entry is destroyed */
 	void (*destroy)(struct nf_conn *ct);
 
-	int (*error)(struct net *net, struct sk_buff *skb, unsigned int dataoff,
-		     enum ip_conntrack_info *ctinfo,
+	int (*error)(struct net *net, u16 zone, struct sk_buff *skb,
+		     unsigned int dataoff, enum ip_conntrack_info *ctinfo,
 		     u_int8_t pf, unsigned int hooknum);
 
 	/* Print out the per-protocol part of the tuple. Return like seq_* */
diff --git a/include/net/netfilter/nf_conntrack_zones.h b/include/net/netfilter/nf_conntrack_zones.h
new file mode 100644
index 0000000..77d430b
--- /dev/null
+++ b/include/net/netfilter/nf_conntrack_zones.h
@@ -0,0 +1,30 @@
+#ifndef _NF_CONNTRACK_ZONES_H
+#define _NF_CONNTRACK_ZONES_H
+
+#include <net/netfilter/nf_conntrack_extend.h>
+
+struct nf_conntrack_zone {
+	u16	id;
+};
+
+static inline u16 nf_ct_zone(const struct nf_conn *ct)
+{
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	struct nf_conntrack_zone *nf_ct_zone;
+	nf_ct_zone = nf_ct_ext_find(ct, NF_CT_EXT_ZONE);
+	if (nf_ct_zone)
+		return nf_ct_zone->id;
+#endif
+	return 0;
+}
+
+static inline u16 nf_ct_dev_zone(const struct net_device *dev)
+{
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	return dev->nf_ct_zone;
+#else
+	return 0;
+#endif
+}
+
+#endif /* _NF_CONNTRACK_ZONES_H */
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index fbc1c74..83d8bf2 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -289,6 +289,23 @@ static ssize_t show_ifalias(struct device *dev,
 	return ret;
 }
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+NETDEVICE_SHOW(nf_ct_zone, fmt_dec);
+
+static int change_nf_ct_zone(struct net_device *net, unsigned long zone)
+{
+	net->nf_ct_zone = zone;
+	return 0;
+}
+
+static ssize_t store_nf_ct_zone(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_nf_ct_zone);
+}
+#endif
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(dev_id, S_IRUGO, show_dev_id, NULL),
@@ -309,6 +326,9 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	__ATTR(nf_ct_zone, S_IRUGO | S_IWUSR, show_nf_ct_zone, store_nf_ct_zone),
+#endif
 	{}
 };
 
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index d171b12..b3a0634 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -23,6 +23,7 @@
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv4/nf_conntrack_ipv4.h>
 #include <net/netfilter/nf_nat_helper.h>
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
@@ -140,7 +141,7 @@ static unsigned int ipv4_conntrack_in(unsigned int hooknum,
 				      const struct net_device *out,
 				      int (*okfn)(struct sk_buff *))
 {
-	return nf_conntrack_in(dev_net(in), PF_INET, hooknum, skb);
+	return nf_conntrack_in(dev_net(in), nf_ct_dev_zone(in), PF_INET, hooknum, skb);
 }
 
 static unsigned int ipv4_conntrack_local(unsigned int hooknum,
@@ -153,7 +154,7 @@ static unsigned int ipv4_conntrack_local(unsigned int hooknum,
 	if (skb->len < sizeof(struct iphdr) ||
 	    ip_hdrlen(skb) < sizeof(struct iphdr))
 		return NF_ACCEPT;
-	return nf_conntrack_in(dev_net(out), PF_INET, hooknum, skb);
+	return nf_conntrack_in(dev_net(out), nf_ct_dev_zone(out), PF_INET, hooknum, skb);
 }
 
 /* Connection tracking may drop packets, but never alters them, so
@@ -266,7 +267,7 @@ getorigdst(struct sock *sk, int optval, void __user *user, int *len)
 		return -EINVAL;
 	}
 
-	h = nf_conntrack_find_get(sock_net(sk), &tuple);
+	h = nf_conntrack_find_get(sock_net(sk), 0, &tuple);
 	if (h) {
 		struct sockaddr_in sin;
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 7afd39b..82b4b30 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -114,7 +114,7 @@ static bool icmp_new(struct nf_conn *ct, const struct sk_buff *skb,
 
 /* Returns conntrack if it dealt with ICMP, and filled in skb fields */
 static int
-icmp_error_message(struct net *net, struct sk_buff *skb,
+icmp_error_message(struct net *net, u16 zone, struct sk_buff *skb,
 		 enum ip_conntrack_info *ctinfo,
 		 unsigned int hooknum)
 {
@@ -146,7 +146,7 @@ icmp_error_message(struct net *net, struct sk_buff *skb,
 
 	*ctinfo = IP_CT_RELATED;
 
-	h = nf_conntrack_find_get(net, &innertuple);
+	h = nf_conntrack_find_get(net, zone, &innertuple);
 	if (!h) {
 		pr_debug("icmp_error_message: no match\n");
 		return -NF_ACCEPT;
@@ -163,7 +163,8 @@ icmp_error_message(struct net *net, struct sk_buff *skb,
 
 /* Small and modified version of icmp_rcv */
 static int
-icmp_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
+icmp_error(struct net *net, u16 zone,
+	   struct sk_buff *skb, unsigned int dataoff,
 	   enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
 {
 	const struct icmphdr *icmph;
@@ -208,7 +209,7 @@ icmp_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
 	    icmph->type != ICMP_REDIRECT)
 		return NF_ACCEPT;
 
-	return icmp_error_message(net, skb, ctinfo, hooknum);
+	return icmp_error_message(net, zone, skb, ctinfo, hooknum);
 }
 
 #if defined(CONFIG_NF_CT_NETLINK) || defined(CONFIG_NF_CT_NETLINK_MODULE)
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 331ead3..488e889 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -16,6 +16,7 @@
 
 #include <linux/netfilter_bridge.h>
 #include <linux/netfilter_ipv4.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
 
 /* Returns new sk_buff, or NULL */
@@ -35,18 +36,18 @@ static int nf_ct_ipv4_gather_frags(struct sk_buff *skb, u_int32_t user)
 	return err;
 }
 
-static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum,
+static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum, u16 zone,
 					      struct sk_buff *skb)
 {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	if (skb->nf_bridge &&
 	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
-		return IP_DEFRAG_CONNTRACK_BRIDGE_IN;
+		return IP_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING)
-		return IP_DEFRAG_CONNTRACK_IN;
+		return IP_DEFRAG_CONNTRACK_IN + zone;
 	else
-		return IP_DEFRAG_CONNTRACK_OUT;
+		return IP_DEFRAG_CONNTRACK_OUT + zone;
 }
 
 static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
@@ -65,7 +66,9 @@ static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
 #endif
 	/* Gather fragments. */
 	if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)) {
-		enum ip_defrag_users user = nf_ct_defrag_user(hooknum, skb);
+		u16 zone = nf_ct_dev_zone(hooknum == NF_INET_PRE_ROUTING ? in : out);
+		enum ip_defrag_users user = nf_ct_defrag_user(hooknum, zone, skb);
+
 		if (nf_ct_ipv4_gather_frags(skb, user))
 			return NF_STOLEN;
 	}
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index fe1a644..64b9979 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -30,6 +30,7 @@
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_l4proto.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 
 static DEFINE_SPINLOCK(nf_nat_lock);
 
@@ -72,13 +73,13 @@ EXPORT_SYMBOL_GPL(nf_nat_proto_put);
 
 /* We keep an extra hash for each conntrack, for fast searching. */
 static inline unsigned int
-hash_by_src(const struct nf_conntrack_tuple *tuple)
+hash_by_src(const struct nf_conntrack_tuple *tuple, u16 zone)
 {
 	unsigned int hash;
 
 	/* Original src, to ensure we map it consistently if poss. */
 	hash = jhash_3words((__force u32)tuple->src.u3.ip,
-			    (__force u32)tuple->src.u.all,
+			    (__force u32)tuple->src.u.all ^ zone,
 			    tuple->dst.protonum, 0);
 	return ((u64)hash * nf_nat_htable_size) >> 32;
 }
@@ -142,12 +143,12 @@ same_src(const struct nf_conn *ct,
 
 /* Only called for SRC manip */
 static int
-find_appropriate_src(struct net *net,
+find_appropriate_src(struct net *net, u16 zone,
 		     const struct nf_conntrack_tuple *tuple,
 		     struct nf_conntrack_tuple *result,
 		     const struct nf_nat_range *range)
 {
-	unsigned int h = hash_by_src(tuple);
+	unsigned int h = hash_by_src(tuple, zone);
 	const struct nf_conn_nat *nat;
 	const struct nf_conn *ct;
 	const struct hlist_node *n;
@@ -155,7 +156,7 @@ find_appropriate_src(struct net *net,
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(nat, n, &net->ipv4.nat_bysource[h], bysource) {
 		ct = nat->ct;
-		if (same_src(ct, tuple)) {
+		if (same_src(ct, tuple) && nf_ct_zone(ct) == zone) {
 			/* Copy source part from reply tuple. */
 			nf_ct_invert_tuplepr(result,
 				       &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
@@ -178,7 +179,7 @@ find_appropriate_src(struct net *net,
    the ip with the lowest src-ip/dst-ip/proto usage.
 */
 static void
-find_best_ips_proto(struct nf_conntrack_tuple *tuple,
+find_best_ips_proto(u16 zone, struct nf_conntrack_tuple *tuple,
 		    const struct nf_nat_range *range,
 		    const struct nf_conn *ct,
 		    enum nf_nat_manip_type maniptype)
@@ -212,7 +213,7 @@ find_best_ips_proto(struct nf_conntrack_tuple *tuple,
 	maxip = ntohl(range->max_ip);
 	j = jhash_2words((__force u32)tuple->src.u3.ip,
 			 range->flags & IP_NAT_RANGE_PERSISTENT ?
-				0 : (__force u32)tuple->dst.u3.ip, 0);
+				0 : (__force u32)tuple->dst.u3.ip ^ zone, 0);
 	j = ((u64)j * (maxip - minip + 1)) >> 32;
 	*var_ipp = htonl(minip + j);
 }
@@ -232,6 +233,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 {
 	struct net *net = nf_ct_net(ct);
 	const struct nf_nat_protocol *proto;
+	u16 zone = nf_ct_zone(ct);
 
 	/* 1) If this srcip/proto/src-proto-part is currently mapped,
 	   and that same mapping gives a unique tuple within the given
@@ -242,7 +244,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	   manips not an issue.  */
 	if (maniptype == IP_NAT_MANIP_SRC &&
 	    !(range->flags & IP_NAT_RANGE_PROTO_RANDOM)) {
-		if (find_appropriate_src(net, orig_tuple, tuple, range)) {
+		if (find_appropriate_src(net, zone, orig_tuple, tuple, range)) {
 			pr_debug("get_unique_tuple: Found current src map\n");
 			if (!nf_nat_used_tuple(tuple, ct))
 				return;
@@ -252,7 +254,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	/* 2) Select the least-used IP/proto combination in the given
 	   range. */
 	*tuple = *orig_tuple;
-	find_best_ips_proto(tuple, range, ct, maniptype);
+	find_best_ips_proto(zone, tuple, range, ct, maniptype);
 
 	/* 3) The per-protocol part of the manip is made to map into
 	   the range to make a unique tuple. */
@@ -330,7 +332,8 @@ nf_nat_setup_info(struct nf_conn *ct,
 	if (have_to_hash) {
 		unsigned int srchash;
 
-		srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+		srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+				      nf_ct_zone(ct));
 		spin_lock_bh(&nf_nat_lock);
 		/* nf_conntrack_alter_reply might re-allocate exntension aera */
 		nat = nfct_nat(ct);
diff --git a/net/ipv4/netfilter/nf_nat_pptp.c b/net/ipv4/netfilter/nf_nat_pptp.c
index 9eb1710..4c06003 100644
--- a/net/ipv4/netfilter/nf_nat_pptp.c
+++ b/net/ipv4/netfilter/nf_nat_pptp.c
@@ -25,6 +25,7 @@
 #include <net/netfilter/nf_nat_rule.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_proto_gre.h>
 #include <linux/netfilter/nf_conntrack_pptp.h>
 
@@ -74,7 +75,7 @@ static void pptp_nat_expected(struct nf_conn *ct,
 
 	pr_debug("trying to unexpect other dir: ");
 	nf_ct_dump_tuple_ip(&t);
-	other_exp = nf_ct_expect_find_get(net, &t);
+	other_exp = nf_ct_expect_find_get(net, nf_ct_zone(ct), &t);
 	if (other_exp) {
 		nf_ct_unexpect_related(other_exp);
 		nf_ct_expect_put(other_exp);
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 0956eba..0db0d7f 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -27,6 +27,7 @@
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 #include <net/netfilter/nf_log.h>
 
@@ -188,18 +189,18 @@ out:
 	return nf_conntrack_confirm(skb);
 }
 
-static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
+static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum, u16 zone,
 						struct sk_buff *skb)
 {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	if (skb->nf_bridge &&
 	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
-		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN;
+		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING)
-		return IP6_DEFRAG_CONNTRACK_IN;
+		return IP6_DEFRAG_CONNTRACK_IN + zone;
 	else
-		return IP6_DEFRAG_CONNTRACK_OUT;
+		return IP6_DEFRAG_CONNTRACK_OUT + zone;
 
 }
 
@@ -210,12 +211,14 @@ static unsigned int ipv6_defrag(unsigned int hooknum,
 				int (*okfn)(struct sk_buff *))
 {
 	struct sk_buff *reasm;
+	u16 zone;
 
 	/* Previously seen (loopback)?  */
 	if (skb->nfct)
 		return NF_ACCEPT;
 
-	reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(hooknum, skb));
+	zone = nf_ct_dev_zone(hooknum == NF_INET_PRE_ROUTING ? in : out);
+	reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(hooknum, zone, skb));
 	/* queued */
 	if (reasm == NULL)
 		return NF_STOLEN;
@@ -230,7 +233,7 @@ static unsigned int ipv6_defrag(unsigned int hooknum,
 	return NF_STOLEN;
 }
 
-static unsigned int __ipv6_conntrack_in(struct net *net,
+static unsigned int __ipv6_conntrack_in(struct net *net, u16 zone,
 					unsigned int hooknum,
 					struct sk_buff *skb,
 					int (*okfn)(struct sk_buff *))
@@ -243,7 +246,7 @@ static unsigned int __ipv6_conntrack_in(struct net *net,
 		if (!reasm->nfct) {
 			unsigned int ret;
 
-			ret = nf_conntrack_in(net, PF_INET6, hooknum, reasm);
+			ret = nf_conntrack_in(net, zone, PF_INET6, hooknum, reasm);
 			if (ret != NF_ACCEPT)
 				return ret;
 		}
@@ -253,7 +256,7 @@ static unsigned int __ipv6_conntrack_in(struct net *net,
 		return NF_ACCEPT;
 	}
 
-	return nf_conntrack_in(net, PF_INET6, hooknum, skb);
+	return nf_conntrack_in(net, zone, PF_INET6, hooknum, skb);
 }
 
 static unsigned int ipv6_conntrack_in(unsigned int hooknum,
@@ -262,7 +265,7 @@ static unsigned int ipv6_conntrack_in(unsigned int hooknum,
 				      const struct net_device *out,
 				      int (*okfn)(struct sk_buff *))
 {
-	return __ipv6_conntrack_in(dev_net(in), hooknum, skb, okfn);
+	return __ipv6_conntrack_in(dev_net(in), nf_ct_dev_zone(in), hooknum, skb, okfn);
 }
 
 static unsigned int ipv6_conntrack_local(unsigned int hooknum,
@@ -277,7 +280,7 @@ static unsigned int ipv6_conntrack_local(unsigned int hooknum,
 			printk("ipv6_conntrack_local: packet too short\n");
 		return NF_ACCEPT;
 	}
-	return __ipv6_conntrack_in(dev_net(out), hooknum, skb, okfn);
+	return __ipv6_conntrack_in(dev_net(out), nf_ct_dev_zone(out), hooknum, skb, okfn);
 }
 
 static struct nf_hook_ops ipv6_conntrack_ops[] __read_mostly = {
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index c7b8bd1..c423818 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -128,7 +128,7 @@ static bool icmpv6_new(struct nf_conn *ct, const struct sk_buff *skb,
 }
 
 static int
-icmpv6_error_message(struct net *net,
+icmpv6_error_message(struct net *net, u16 zone,
 		     struct sk_buff *skb,
 		     unsigned int icmp6off,
 		     enum ip_conntrack_info *ctinfo,
@@ -163,7 +163,7 @@ icmpv6_error_message(struct net *net,
 
 	*ctinfo = IP_CT_RELATED;
 
-	h = nf_conntrack_find_get(net, &intuple);
+	h = nf_conntrack_find_get(net, zone, &intuple);
 	if (!h) {
 		pr_debug("icmpv6_error: no match\n");
 		return -NF_ACCEPT;
@@ -179,7 +179,8 @@ icmpv6_error_message(struct net *net,
 }
 
 static int
-icmpv6_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
+icmpv6_error(struct net *net, u16 zone,
+	     struct sk_buff *skb, unsigned int dataoff,
 	     enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
 {
 	const struct icmp6hdr *icmp6h;
@@ -215,7 +216,7 @@ icmpv6_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
 	if (icmp6h->icmp6_type >= 128)
 		return NF_ACCEPT;
 
-	return icmpv6_error_message(net, skb, dataoff, ctinfo, hooknum);
+	return icmpv6_error_message(net, zone, skb, dataoff, ctinfo, hooknum);
 }
 
 #if defined(CONFIG_NF_CT_NETLINK) || defined(CONFIG_NF_CT_NETLINK_MODULE)
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 634d14a..15374ba 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -83,6 +83,15 @@ config NF_CONNTRACK_SECMARK
 
 	  If unsure, say 'N'.
 
+config NF_CONNTRACK_ZONES
+	bool  'Connection tracking zones'
+	help
+	  This option enables support for connection tracking zones.
+	  Normally, each connection needs to have a unique identity.
+	  Connection tracking zones allow to have multiple connections
+	  using the same identity, as long as they are contained in
+	  different zones.
+
 config NF_CONNTRACK_EVENTS
 	bool "Connection tracking events"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 0e98c32..90909e3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -41,6 +41,7 @@
 #include <net/netfilter/nf_conntrack_extend.h>
 #include <net/netfilter/nf_conntrack_acct.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/nf_nat.h>
 #include <net/netfilter/nf_nat_core.h>
 
@@ -69,7 +70,7 @@ static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
 
 static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
-				  unsigned int size, unsigned int rnd)
+				  u16 zone, unsigned int size, unsigned int rnd)
 {
 	unsigned int n;
 	u_int32_t h;
@@ -80,15 +81,16 @@ static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
 	 */
 	n = (sizeof(tuple->src) + sizeof(tuple->dst.u3)) / sizeof(u32);
 	h = jhash2((u32 *)tuple, n,
-		   rnd ^ (((__force __u16)tuple->dst.u.all << 16) |
+		   zone ^ rnd ^ (((__force __u16)tuple->dst.u.all << 16) |
 			  tuple->dst.protonum));
 
 	return ((u64)h * size) >> 32;
 }
 
-static inline u_int32_t hash_conntrack(const struct nf_conntrack_tuple *tuple)
+static inline u_int32_t hash_conntrack(const struct nf_conntrack_tuple *tuple,
+				       u16 zone)
 {
-	return __hash_conntrack(tuple, nf_conntrack_htable_size,
+	return __hash_conntrack(tuple, zone, nf_conntrack_htable_size,
 				nf_conntrack_hash_rnd);
 }
 
@@ -292,11 +294,12 @@ static void death_by_timeout(unsigned long ul_conntrack)
  * - Caller must lock nf_conntrack_lock before calling this function
  */
 struct nf_conntrack_tuple_hash *
-__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
+__nf_conntrack_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	unsigned int hash = hash_conntrack(tuple);
+	unsigned int hash = hash_conntrack(tuple, zone);
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
@@ -304,7 +307,8 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
-		if (nf_ct_tuple_equal(tuple, &h->tuple)) {
+		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
+		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -326,21 +330,23 @@ EXPORT_SYMBOL_GPL(__nf_conntrack_find);
 
 /* Find a connection corresponding to a tuple. */
 struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple)
+nf_conntrack_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 
 	rcu_read_lock();
 begin:
-	h = __nf_conntrack_find(net, tuple);
+	h = __nf_conntrack_find(net, zone, tuple);
 	if (h) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (unlikely(nf_ct_is_dying(ct) ||
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) {
+			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
+				     nf_ct_zone(ct) != zone)) {
 				nf_ct_put(ct);
 				goto begin;
 			}
@@ -367,9 +373,11 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct,
 void nf_conntrack_hash_insert(struct nf_conn *ct)
 {
 	unsigned int hash, repl_hash;
+	u16 zone;
 
-	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	zone = nf_ct_zone(ct);
+	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, zone);
+	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, zone);
 
 	__nf_conntrack_hash_insert(ct, hash, repl_hash);
 }
@@ -385,6 +393,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conn_help *help;
 	struct hlist_nulls_node *n;
 	enum ip_conntrack_info ctinfo;
+	u16 zone;
 	struct net *net;
 
 	ct = nf_ct_get(skb, &ctinfo);
@@ -397,8 +406,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	if (CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL)
 		return NF_ACCEPT;
 
-	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	zone = nf_ct_zone(ct);
+	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, zone);
+	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, zone);
 
 	/* We're not in hash table, and we refuse to set up related
 	   connections for unconfirmed conns.  But packet copies and
@@ -417,11 +427,13 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	   not in the hash.  If there is, we lost race. */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
-				      &h->tuple))
+				      &h->tuple) &&
+		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				      &h->tuple))
+				      &h->tuple) &&
+		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
 
 	/* Remove from unconfirmed list */
@@ -468,15 +480,19 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	struct net *net = nf_ct_net(ignored_conntrack);
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	unsigned int hash = hash_conntrack(tuple);
+	struct nf_conn *ct;
+	u16 zone = nf_ct_zone(ignored_conntrack);
+	unsigned int hash = hash_conntrack(tuple, zone);
 
 	/* Disable BHs the entire time since we need to disable them at
 	 * least once for the stats anyway.
 	 */
 	rcu_read_lock_bh();
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
-		if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack &&
-		    nf_ct_tuple_equal(tuple, &h->tuple)) {
+		ct = nf_ct_tuplehash_to_ctrack(h);
+		if (ct != ignored_conntrack &&
+		    nf_ct_tuple_equal(tuple, &h->tuple) &&
+		    nf_ct_zone(ct) == zone) {
 			NF_CT_STAT_INC(net, found);
 			rcu_read_unlock_bh();
 			return 1;
@@ -539,7 +555,7 @@ static noinline int early_drop(struct net *net, unsigned int hash)
 	return dropped;
 }
 
-struct nf_conn *nf_conntrack_alloc(struct net *net,
+struct nf_conn *nf_conntrack_alloc(struct net *net, u16 zone,
 				   const struct nf_conntrack_tuple *orig,
 				   const struct nf_conntrack_tuple *repl,
 				   gfp_t gfp)
@@ -557,7 +573,7 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 
 	if (nf_conntrack_max &&
 	    unlikely(atomic_read(&net->ct.count) > nf_conntrack_max)) {
-		unsigned int hash = hash_conntrack(orig);
+		unsigned int hash = hash_conntrack(orig, zone);
 		if (!early_drop(net, hash)) {
 			atomic_dec(&net->ct.count);
 			if (net_ratelimit())
@@ -578,6 +594,7 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 		atomic_dec(&net->ct.count);
 		return ERR_PTR(-ENOMEM);
 	}
+
 	/*
 	 * Let ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
 	 * and ct->tuplehash[IP_CT_DIR_REPLY].hnnode.next unchanged.
@@ -594,6 +611,16 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 #ifdef CONFIG_NET_NS
 	ct->ct_net = net;
 #endif
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	if (zone) {
+		struct nf_conntrack_zone *nf_ct_zone;
+
+		nf_ct_zone = nf_ct_ext_add(ct, NF_CT_EXT_ZONE, GFP_ATOMIC);
+		if (!nf_ct_zone)
+			goto out_free;
+		nf_ct_zone->id = zone;
+	}
+#endif
 
 	/*
 	 * changes to lookup keys must be done before setting refcnt to 1
@@ -601,6 +628,12 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 	smp_wmb();
 	atomic_set(&ct->ct_general.use, 1);
 	return ct;
+
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+out_free:
+	kmem_cache_free(nf_conntrack_cachep, ct);
+	return ERR_PTR(-ENOMEM);
+#endif
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 
@@ -618,7 +651,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_free);
 /* Allocate a new conntrack: we return -ENOMEM if classification
    failed due to stress.  Otherwise it really is unclassifiable. */
 static struct nf_conntrack_tuple_hash *
-init_conntrack(struct net *net,
+init_conntrack(struct net *net, u16 zone,
 	       const struct nf_conntrack_tuple *tuple,
 	       struct nf_conntrack_l3proto *l3proto,
 	       struct nf_conntrack_l4proto *l4proto,
@@ -635,7 +668,7 @@ init_conntrack(struct net *net,
 		return NULL;
 	}
 
-	ct = nf_conntrack_alloc(net, tuple, &repl_tuple, GFP_ATOMIC);
+	ct = nf_conntrack_alloc(net, zone, tuple, &repl_tuple, GFP_ATOMIC);
 	if (IS_ERR(ct)) {
 		pr_debug("Can't allocate conntrack.\n");
 		return (struct nf_conntrack_tuple_hash *)ct;
@@ -651,7 +684,7 @@ init_conntrack(struct net *net,
 	nf_ct_ecache_ext_add(ct, GFP_ATOMIC);
 
 	spin_lock_bh(&nf_conntrack_lock);
-	exp = nf_ct_find_expectation(net, tuple);
+	exp = nf_ct_find_expectation(net, zone, tuple);
 	if (exp) {
 		pr_debug("conntrack: expectation arrives ct=%p exp=%p\n",
 			 ct, exp);
@@ -694,7 +727,7 @@ init_conntrack(struct net *net,
 
 /* On success, returns conntrack ptr, sets skb->nfct and ctinfo */
 static inline struct nf_conn *
-resolve_normal_ct(struct net *net,
+resolve_normal_ct(struct net *net, u16 zone,
 		  struct sk_buff *skb,
 		  unsigned int dataoff,
 		  u_int16_t l3num,
@@ -716,9 +749,10 @@ resolve_normal_ct(struct net *net,
 	}
 
 	/* look for tuple match */
-	h = nf_conntrack_find_get(net, &tuple);
+	h = nf_conntrack_find_get(net, zone, &tuple);
 	if (!h) {
-		h = init_conntrack(net, &tuple, l3proto, l4proto, skb, dataoff);
+		h = init_conntrack(net, zone, &tuple, l3proto, l4proto,
+				   skb, dataoff);
 		if (!h)
 			return NULL;
 		if (IS_ERR(h))
@@ -752,7 +786,7 @@ resolve_normal_ct(struct net *net,
 }
 
 unsigned int
-nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
+nf_conntrack_in(struct net *net, u16 zone, u_int8_t pf, unsigned int hooknum,
 		struct sk_buff *skb)
 {
 	struct nf_conn *ct;
@@ -787,7 +821,8 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 	 * inverse of the return code tells to the netfilter
 	 * core what to do with the packet. */
 	if (l4proto->error != NULL) {
-		ret = l4proto->error(net, skb, dataoff, &ctinfo, pf, hooknum);
+		ret = l4proto->error(net, zone, skb, dataoff, &ctinfo,
+				     pf, hooknum);
 		if (ret <= 0) {
 			NF_CT_STAT_INC_ATOMIC(net, error);
 			NF_CT_STAT_INC_ATOMIC(net, invalid);
@@ -795,7 +830,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		}
 	}
 
-	ct = resolve_normal_ct(net, skb, dataoff, pf, protonum,
+	ct = resolve_normal_ct(net, zone, skb, dataoff, pf, protonum,
 			       l3proto, l4proto, &set_reply, &ctinfo);
 	if (!ct) {
 		/* Not valid part of a connection */
@@ -938,6 +973,12 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 }
 EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
 
+static struct nf_ct_ext_type nf_ct_zone_extend __read_mostly = {
+	.len	= sizeof(struct nf_conntrack_zone),
+	.align	= __alignof__(struct nf_conntrack_zone),
+	.id	= NF_CT_EXT_ZONE,
+};
+
 #if defined(CONFIG_NF_CT_NETLINK) || defined(CONFIG_NF_CT_NETLINK_MODULE)
 
 #include <linux/netfilter/nfnetlink.h>
@@ -1115,6 +1156,7 @@ static void nf_conntrack_cleanup_init_net(void)
 {
 	nf_conntrack_helper_fini();
 	nf_conntrack_proto_fini();
+	nf_ct_extend_unregister(&nf_ct_zone_extend);
 	kmem_cache_destroy(nf_conntrack_cachep);
 }
 
@@ -1193,6 +1235,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	int rnd;
 	struct hlist_nulls_head *hash, *old_hash;
 	struct nf_conntrack_tuple_hash *h;
+	struct nf_conn *ct;
 
 	/* On boot, we can set this without any fancy locking. */
 	if (!nf_conntrack_htable_size)
@@ -1220,8 +1263,10 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 		while (!hlist_nulls_empty(&init_net.ct.hash[i])) {
 			h = hlist_nulls_entry(init_net.ct.hash[i].first,
 					struct nf_conntrack_tuple_hash, hnnode);
+			ct = nf_ct_tuplehash_to_ctrack(h);
 			hlist_nulls_del_rcu(&h->hnnode);
-			bucket = __hash_conntrack(&h->tuple, hashsize, rnd);
+			bucket = __hash_conntrack(&h->tuple, nf_ct_zone(ct),
+						  hashsize, rnd);
 			hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]);
 		}
 	}
@@ -1288,8 +1333,17 @@ static int nf_conntrack_init_init_net(void)
 	if (ret < 0)
 		goto err_helper;
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	ret = nf_ct_extend_register(&nf_ct_zone_extend);
+	if (ret < 0)
+		goto err_extend;
+#endif
 	return 0;
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+err_extend:
+	nf_conntrack_helper_fini();
+#endif
 err_helper:
 	nf_conntrack_proto_fini();
 err_proto:
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index fdf5d2a..5fd0347 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -27,6 +27,7 @@
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_tuple.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 
 unsigned int nf_ct_expect_hsize __read_mostly;
 EXPORT_SYMBOL_GPL(nf_ct_expect_hsize);
@@ -84,7 +85,8 @@ static unsigned int nf_ct_expect_dst_hash(const struct nf_conntrack_tuple *tuple
 }
 
 struct nf_conntrack_expect *
-__nf_ct_expect_find(struct net *net, const struct nf_conntrack_tuple *tuple)
+__nf_ct_expect_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_expect *i;
 	struct hlist_node *n;
@@ -104,12 +106,13 @@ EXPORT_SYMBOL_GPL(__nf_ct_expect_find);
 
 /* Just find a expectation corresponding to a tuple. */
 struct nf_conntrack_expect *
-nf_ct_expect_find_get(struct net *net, const struct nf_conntrack_tuple *tuple)
+nf_ct_expect_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_expect *i;
 
 	rcu_read_lock();
-	i = __nf_ct_expect_find(net, tuple);
+	i = __nf_ct_expect_find(net, zone, tuple);
 	if (i && !atomic_inc_not_zero(&i->use))
 		i = NULL;
 	rcu_read_unlock();
@@ -121,7 +124,8 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_find_get);
 /* If an expectation for this connection is found, it gets delete from
  * global list then returned. */
 struct nf_conntrack_expect *
-nf_ct_find_expectation(struct net *net, const struct nf_conntrack_tuple *tuple)
+nf_ct_find_expectation(struct net *net, u16 zone,
+		       const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_expect *i, *exp = NULL;
 	struct hlist_node *n;
@@ -133,7 +137,8 @@ nf_ct_find_expectation(struct net *net, const struct nf_conntrack_tuple *tuple)
 	h = nf_ct_expect_dst_hash(tuple);
 	hlist_for_each_entry(i, n, &net->ct.expect_hash[h], hnode) {
 		if (!(i->flags & NF_CT_EXPECT_INACTIVE) &&
-		    nf_ct_tuple_mask_cmp(tuple, &i->tuple, &i->mask)) {
+		    nf_ct_tuple_mask_cmp(tuple, &i->tuple, &i->mask) &&
+		    nf_ct_zone(i->master) == zone) {
 			exp = i;
 			break;
 		}
@@ -204,7 +209,8 @@ static inline int expect_matches(const struct nf_conntrack_expect *a,
 {
 	return a->master == b->master && a->class == b->class &&
 		nf_ct_tuple_equal(&a->tuple, &b->tuple) &&
-		nf_ct_tuple_mask_equal(&a->mask, &b->mask);
+		nf_ct_tuple_mask_equal(&a->mask, &b->mask) &&
+		nf_ct_zone(a->master) == nf_ct_zone(b->master);
 }
 
 /* Generally a bad idea to call this: could have matched already. */
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 6636949..a1c8dd9 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -29,6 +29,7 @@
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_h323.h>
 
 /* Parameters */
@@ -1216,7 +1217,7 @@ static struct nf_conntrack_expect *find_expect(struct nf_conn *ct,
 	tuple.dst.u.tcp.port = port;
 	tuple.dst.protonum = IPPROTO_TCP;
 
-	exp = __nf_ct_expect_find(net, &tuple);
+	exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
 	if (exp && exp->master == ct)
 		return exp;
 	return NULL;
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 59d8064..2a9c4c3 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -790,7 +790,7 @@ ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	h = nf_conntrack_find_get(&init_net, &tuple);
+	h = nf_conntrack_find_get(&init_net, 0, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -850,7 +850,7 @@ ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	h = nf_conntrack_find_get(&init_net, &tuple);
+	h = nf_conntrack_find_get(&init_net, 0, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -1184,7 +1184,7 @@ ctnetlink_create_conntrack(const struct nlattr * const cda[],
 	int err = -EINVAL;
 	struct nf_conntrack_helper *helper;
 
-	ct = nf_conntrack_alloc(&init_net, otuple, rtuple, GFP_ATOMIC);
+	ct = nf_conntrack_alloc(&init_net, 0, otuple, rtuple, GFP_ATOMIC);
 	if (IS_ERR(ct))
 		return ERR_PTR(-ENOMEM);
 
@@ -1285,7 +1285,7 @@ ctnetlink_create_conntrack(const struct nlattr * const cda[],
 		if (err < 0)
 			goto err2;
 
-		master_h = nf_conntrack_find_get(&init_net, &master);
+		master_h = nf_conntrack_find_get(&init_net, 0, &master);
 		if (master_h == NULL) {
 			err = -ENOENT;
 			goto err2;
@@ -1333,9 +1333,9 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb,
 
 	spin_lock_bh(&nf_conntrack_lock);
 	if (cda[CTA_TUPLE_ORIG])
-		h = __nf_conntrack_find(&init_net, &otuple);
+		h = __nf_conntrack_find(&init_net, 0, &otuple);
 	else if (cda[CTA_TUPLE_REPLY])
-		h = __nf_conntrack_find(&init_net, &rtuple);
+		h = __nf_conntrack_find(&init_net, 0, &rtuple);
 
 	if (h == NULL) {
 		err = -ENOENT;
@@ -1660,7 +1660,7 @@ ctnetlink_get_expect(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	exp = nf_ct_expect_find_get(&init_net, &tuple);
+	exp = nf_ct_expect_find_get(&init_net, 0, &tuple);
 	if (!exp)
 		return -ENOENT;
 
@@ -1716,7 +1716,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb,
 			return err;
 
 		/* bump usage count to 2 */
-		exp = nf_ct_expect_find_get(&init_net, &tuple);
+		exp = nf_ct_expect_find_get(&init_net, 0, &tuple);
 		if (!exp)
 			return -ENOENT;
 
@@ -1805,7 +1805,7 @@ ctnetlink_create_expect(const struct nlattr * const cda[], u_int8_t u3,
 		return err;
 
 	/* Look for master conntrack of this expectation */
-	h = nf_conntrack_find_get(&init_net, &master_tuple);
+	h = nf_conntrack_find_get(&init_net, 0, &master_tuple);
 	if (!h)
 		return -ENOENT;
 	ct = nf_ct_tuplehash_to_ctrack(h);
@@ -1861,7 +1861,7 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb,
 		return err;
 
 	spin_lock_bh(&nf_conntrack_lock);
-	exp = __nf_ct_expect_find(&init_net, &tuple);
+	exp = __nf_ct_expect_find(&init_net, 0, &tuple);
 
 	if (!exp) {
 		spin_unlock_bh(&nf_conntrack_lock);
diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c
index 3807ac7..ffe2ae6 100644
--- a/net/netfilter/nf_conntrack_pptp.c
+++ b/net/netfilter/nf_conntrack_pptp.c
@@ -28,6 +28,7 @@
 #include <net/netfilter/nf_conntrack.h>
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_proto_gre.h>
 #include <linux/netfilter/nf_conntrack_pptp.h>
 
@@ -123,7 +124,7 @@ static void pptp_expectfn(struct nf_conn *ct,
 		pr_debug("trying to unexpect other dir: ");
 		nf_ct_dump_tuple(&inv_t);
 
-		exp_other = nf_ct_expect_find_get(net, &inv_t);
+		exp_other = nf_ct_expect_find_get(net, nf_ct_zone(ct), &inv_t);
 		if (exp_other) {
 			/* delete other expectation.  */
 			pr_debug("found\n");
@@ -136,7 +137,7 @@ static void pptp_expectfn(struct nf_conn *ct,
 	rcu_read_unlock();
 }
 
-static int destroy_sibling_or_exp(struct net *net,
+static int destroy_sibling_or_exp(struct net *net, u16 zone,
 				  const struct nf_conntrack_tuple *t)
 {
 	const struct nf_conntrack_tuple_hash *h;
@@ -146,7 +147,7 @@ static int destroy_sibling_or_exp(struct net *net,
 	pr_debug("trying to timeout ct or exp for tuple ");
 	nf_ct_dump_tuple(t);
 
-	h = nf_conntrack_find_get(net, t);
+	h = nf_conntrack_find_get(net, zone, t);
 	if (h)  {
 		sibling = nf_ct_tuplehash_to_ctrack(h);
 		pr_debug("setting timeout of conntrack %p to 0\n", sibling);
@@ -157,7 +158,7 @@ static int destroy_sibling_or_exp(struct net *net,
 		nf_ct_put(sibling);
 		return 1;
 	} else {
-		exp = nf_ct_expect_find_get(net, t);
+		exp = nf_ct_expect_find_get(net, zone, t);
 		if (exp) {
 			pr_debug("unexpect_related of expect %p\n", exp);
 			nf_ct_unexpect_related(exp);
@@ -182,7 +183,7 @@ static void pptp_destroy_siblings(struct nf_conn *ct)
 	t.dst.protonum = IPPROTO_GRE;
 	t.src.u.gre.key = help->help.ct_pptp_info.pns_call_id;
 	t.dst.u.gre.key = help->help.ct_pptp_info.pac_call_id;
-	if (!destroy_sibling_or_exp(net, &t))
+	if (!destroy_sibling_or_exp(net, nf_ct_zone(ct), &t))
 		pr_debug("failed to timeout original pns->pac ct/exp\n");
 
 	/* try reply (pac->pns) tuple */
@@ -190,7 +191,7 @@ static void pptp_destroy_siblings(struct nf_conn *ct)
 	t.dst.protonum = IPPROTO_GRE;
 	t.src.u.gre.key = help->help.ct_pptp_info.pac_call_id;
 	t.dst.u.gre.key = help->help.ct_pptp_info.pns_call_id;
-	if (!destroy_sibling_or_exp(net, &t))
+	if (!destroy_sibling_or_exp(net, nf_ct_zone(ct), &t))
 		pr_debug("failed to timeout reply pac->pns ct/exp\n");
 }
 
diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index dd37550..d1c1848 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -561,7 +561,7 @@ static int dccp_packet(struct nf_conn *ct, const struct sk_buff *skb,
 	return NF_ACCEPT;
 }
 
-static int dccp_error(struct net *net, struct sk_buff *skb,
+static int dccp_error(struct net *net, u16 zone, struct sk_buff *skb,
 		      unsigned int dataoff, enum ip_conntrack_info *ctinfo,
 		      u_int8_t pf, unsigned int hooknum)
 {
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 3c96437..2bfe5bf 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -760,7 +760,7 @@ static const u8 tcp_valid_flags[(TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG) + 1] =
 };
 
 /* Protect conntrack agaist broken packets. Code taken from ipt_unclean.c.  */
-static int tcp_error(struct net *net,
+static int tcp_error(struct net *net, u16 zone,
 		     struct sk_buff *skb,
 		     unsigned int dataoff,
 		     enum ip_conntrack_info *ctinfo,
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index 5c5518b..aee7515 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -91,8 +91,8 @@ static bool udp_new(struct nf_conn *ct, const struct sk_buff *skb,
 	return true;
 }
 
-static int udp_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
-		     enum ip_conntrack_info *ctinfo,
+static int udp_error(struct net *net, u16 zone, struct sk_buff *skb,
+		     unsigned int dataoff, enum ip_conntrack_info *ctinfo,
 		     u_int8_t pf,
 		     unsigned int hooknum)
 {
diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c
index 458655b..cc94a67 100644
--- a/net/netfilter/nf_conntrack_proto_udplite.c
+++ b/net/netfilter/nf_conntrack_proto_udplite.c
@@ -89,7 +89,7 @@ static bool udplite_new(struct nf_conn *ct, const struct sk_buff *skb,
 	return true;
 }
 
-static int udplite_error(struct net *net,
+static int udplite_error(struct net *net, u16 zone,
 			 struct sk_buff *skb,
 			 unsigned int dataoff,
 			 enum ip_conntrack_info *ctinfo,
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 4b57216..3b5efc9 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -22,6 +22,7 @@
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_sip.h>
 
 MODULE_LICENSE("GPL");
@@ -777,7 +778,7 @@ static int set_expected_rtp_rtcp(struct sk_buff *skb,
 
 	rcu_read_lock();
 	do {
-		exp = __nf_ct_expect_find(net, &tuple);
+		exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
 
 		if (!exp || exp->master == ct ||
 		    nfct_help(exp->master)->helper != nfct_help(ct)->helper ||
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 028aba6..69da6ef 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -26,6 +26,7 @@
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_acct.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 
 MODULE_LICENSE("GPL");
 
@@ -171,6 +172,11 @@ static int ct_seq_show(struct seq_file *s, void *v)
 		goto release;
 #endif
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	if (seq_printf(s, "zone=%u ", nf_ct_zone(ct)))
+		goto release;
+#endif
+
 	if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use)))
 		goto release;
 
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 8103bef..a637ee6 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -113,7 +113,7 @@ static int count_them(struct xt_connlimit_data *data,
 
 	/* check the saved connections */
 	list_for_each_entry_safe(conn, tmp, hash, list) {
-		found    = nf_conntrack_find_get(&init_net, &conn->tuple);
+		found    = nf_conntrack_find_get(&init_net, 0, &conn->tuple);
 		found_ct = NULL;
 
 		if (found != NULL)
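
To make the lookup change concrete, here is a minimal userspace sketch
(not kernel code) of a zone-qualified table: the zone is mixed into the
bucket hash and compared alongside the tuple, roughly as the hunks above
do with hash_conntrack() and nf_ct_zone(). The struct layouts, the toy
hash and the names tuple, entry, insert and find are assumptions made
for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16u

/* Simplified stand-in for struct nf_conntrack_tuple (hypothetical layout). */
struct tuple {
	uint32_t src, dst;
	uint16_t sport, dport;
	uint8_t  proto;
};

/* One table entry: a tuple plus the zone it was created in. */
struct entry {
	struct tuple  tuple;
	uint16_t      zone;
	struct entry *next;
};

static struct entry *table[NBUCKETS];

static int tuple_equal(const struct tuple *a, const struct tuple *b)
{
	return a->src == b->src && a->dst == b->dst &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->proto == b->proto;
}

/* Toy hash (not the kernel's jhash); the zone is folded into the key. */
static unsigned int hash_tuple(const struct tuple *t, uint16_t zone)
{
	uint32_t h = t->src ^ (t->dst * 2654435761u) ^
		     (((uint32_t)t->sport << 16) | t->dport) ^
		     t->proto ^ zone;

	return h % NBUCKETS;
}

/* Insert at the head of the bucket chosen by tuple and zone. */
static void insert(struct entry *e)
{
	unsigned int h;

	assert(e != NULL);
	h = hash_tuple(&e->tuple, e->zone);
	e->next = table[h];
	table[h] = e;
}

/* A hit requires both an equal tuple and an equal zone. */
static struct entry *find(uint16_t zone, const struct tuple *t)
{
	struct entry *e;

	for (e = table[hash_tuple(t, zone)]; e; e = e->next)
		if (tuple_equal(t, &e->tuple) && e->zone == zone)
			return e;
	return NULL;
}
```

With this model, two entries carrying the same tuple in zones 1 and 2
can coexist, and a lookup for that tuple in zone 0 finds neither.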

^ permalink raw reply related	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
       [not found] ` <4B4F24AC.70105-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
@ 2010-01-14 15:05   ` jamal
  0 siblings, 0 replies; 184+ messages in thread
From: jamal @ 2010-01-14 15:05 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Linux Netdev List,
	containers,
	Netfilter Development Mailinglist, Ben Greear


I've had an equivalent discussion with B Greear (CCed) at one point on
something similar, curious if you solve things differently - couldn't
tell from the patch if you address it.
Comments inline:

On Thu, 2010-01-14 at 15:05 +0100, Patrick McHardy wrote:
> The attached largish patch adds support for "conntrack zones",
> which are virtual conntrack tables that can be used to separate
> connections from different zones, allowing to handle multiple
> connections with equal identities in conntrack and NAT.
>
> A zone is simply a numerical identifier associated with a network
> device that is incorporated into the various hashes and used to
> distinguish entries in addition to the connection tuples. Additionally
> it is used to separate conntrack defragmentation queues. An iptables
> target for the raw table could be used alternatively to the network
> device for assigning conntrack entries to zones.
>
>
> This is mainly useful when connecting multiple private networks using
> the same addresses (which unfortunately happens occasionally) 

Agreed that this would be a main driver of such a feature.
Which means that you need zones (or whatever noun other people use) to
work on not just netfilter, but also routing, ipsec etc.
As a digression: this is trivial to solve with network namespaces. 

> to pass
> the packets through a set of veth devices and SNAT each network to a
> unique address, after which they can pass through the "main" zone and
> be handled like regular non-clashing packets and/or have NAT applied a
> second time based f.i. on the outgoing interface.
> 

The fundamental question I have is:
how do you deal with overlapping addresses?
i.e. zone1 uses 10.0.0.1 and zone2 uses 10.0.0.1, but they are for
different NAT users/endpoints.

> Something like this, with multiple tunl and veth devices, each pair
> using a unique zone:
> 
>   <tunl0 / zone 1>
>      |
>   PREROUTING
>      |
>   FORWARD
>      |
>   POSTROUTING: SNAT to unique network
>      |
>   <veth1 / zone 1>
>   <veth0 / zone 0>
>      |
>   PREROUTING
>      |
>   FORWARD
>      |
>   POSTROUTING: SNAT to eth0 address
>      |
>   <eth0>
> 
> As probably everyone has noticed, this is quite similar to what you
> can do using network namespaces. The main reason for not using
> network namespaces is that it's an all-or-nothing approach, you can't
> virtualize just connection tracking. 

Unless there is a clever approach for overlapping IP addresses (my
question above), I don't see a way around essentially virtualizing the
whole stack, which clone(CLONE_NEWNET) provides.

> Beside the difficulties in
> managing different namespaces from f.i. an IKE or PPP daemon running
> in the initial namespace, 

This is a valid concern against the namespace approach. Existing tools
of course could be taught to know about namespaces - and one could
argue that if you can resolve the overlap IP address issue, then you
_have to_ modify user space anyways.

> network namespaces have a quite large
> overhead, especially when used with a large conntrack table.

Elaboration needed.
I think you said the size on 64 bit increases to 152B per conntrack?
Do you have a hand-wave figure we can use as a metric to elaborate this
point? What would a typical user of this feature have in number of
"zones" and how many conntracks per zone? Actually we could also look
at extremes (huge number vs low numbers)...

You may also want to look at code complexity/maintainability as a
metric for this scheme vs namespaces (which add zero changes to the kernel).
I am pretty sure you will soon be "zoning" on other pieces of the net
stack ;->

> I'm not too fond of this partial feature duplication myself, but I
> couldn't think of a better way to do this without the downsides of
> using namespaces. Having partially shared network namespaces would
> be great, but it doesn't seem to fit in the design very well.
> I'm open for any better suggestion :)

My opinions above.

BTW, why not use skb->mark instead of creating a new semantic construct?

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-01-14 15:05 ` jamal
  2010-01-14 15:37   ` Patrick McHardy
@ 2010-01-14 15:37   ` Patrick McHardy
  2010-01-14 18:32   ` Ben Greear
  2 siblings, 0 replies; 184+ messages in thread
From: Patrick McHardy @ 2010-01-14 15:37 UTC (permalink / raw)
  To: hadi-fAAogVwAN2Kw5LPnMra/2Q
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

jamal wrote:
> I've had an equivalent discussion with B Greear (CCed) at one point on
> something similar, curious if you solve things differently - couldn't
> tell from the patch if you address it.

It's basically the same, except that this patch uses ct_extend
and mark values.

> Comments inline:
> 
> On Thu, 2010-01-14 at 15:05 +0100, Patrick McHardy wrote:
>> The attached largish patch adds support for "conntrack zones",
>> which are virtual conntrack tables that can be used to separate
>> connections from different zones, allowing to handle multiple
>> connections with equal identities in conntrack and NAT.
>>
>> A zone is simply a numerical identifier associated with a network
>> device that is incorporated into the various hashes and used to
>> distinguish entries in addition to the connection tuples. Additionally
>> it is used to seperate conntrack defragmentation queues. An iptables
>> target for the raw table could be used alternatively to the network
>> device for assigning conntrack entries to zones.
>>
>>
>> This is mainly useful when connecting multiple private networks using
>> the same addresses (which unfortunately happens occasionally) 
> 
> Agreed that this would be a main driver of such a feature.
> Which means that you need zones (or whatever noun other people use) to
> work on not just netfilter, but also routing, ipsec etc.

Routing already works fine. I believe IPsec should also work already,
but I haven't tried it.

> As a digression: this is trivial to solve with network namespaces. 
> 
>> to pass
>> the packets through a set of veth devices and SNAT each network to a
>> unique address, after which they can pass through the "main" zone and
>> be handled like regular non-clashing packets and/or have NAT applied a
>> second time based f.i. on the outgoing interface.
>>
> 
> The fundamental question i have is:
> how you deal with overlapping addresses?
> i.e zone1 uses 10.0.0.1 and zone2 uses 10.0.0.1 but they are for
> different NAT users/endpoints.

The zone is set based on some other criteria (in this case the
incoming device). The packets make one pass through the stack
to a veth device and are SNATed in POSTROUTING to non-clashing
addresses. When they come out of the other side of the veth
device, they make a second pass through the network stack and
can be handled like any other packet.

So the setup would be (with 10.0.0.0/24 on if0 and if1):

ip rule add iif if0 lookup t0
ip route add default dev veth0 table t0
iptables -t nat -A POSTROUTING -o veth0 -j NETMAP --to 10.1.0.0/24
echo 1 >/sys/class/net/if0/nf_ct_zone
echo 1 >/sys/class/net/veth0/nf_ct_zone

ip rule add iif if1 lookup t1
ip route add default dev veth2 table t1
iptables -t nat -A POSTROUTING -o veth2 -j NETMAP --to 10.1.1.0/24
echo 2 >/sys/class/net/if1/nf_ct_zone
echo 2 >/sys/class/net/veth2/nf_ct_zone

The mapped packets are received on veth1 and veth3 with non-clashing
addresses.
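
As a rough illustration of the address mapping in the setup above (a
sketch only, not code from the patch): NETMAP keeps the host part of an
address and swaps the network part. The helper names netmap() and
map_source() and the TARGET_BY_IIF table are hypothetical and only
mirror the two rules shown.

```python
import ipaddress

def netmap(addr: str, target_net: str) -> str:
    """Map addr into target_net, keeping the host bits (1:1 NETMAP-style)."""
    net = ipaddress.ip_network(target_net)
    host_bits = int(ipaddress.ip_address(addr)) & int(net.hostmask)
    return str(ipaddress.ip_address(int(net.network_address) | host_bits))

# Hypothetical table mirroring the two NETMAP rules: the incoming
# device selects the non-clashing target network.
TARGET_BY_IIF = {"if0": "10.1.0.0/24", "if1": "10.1.1.0/24"}

def map_source(iif: str, src: str) -> str:
    """Return the source address a packet from iif would carry after mapping."""
    return netmap(src, TARGET_BY_IIF[iif])

# The same private address arrives on both devices but leaves distinct.
print(map_source("if0", "10.0.0.1"))
print(map_source("if1", "10.0.0.1"))
```

The same source address 10.0.0.1 ends up in a different /24 depending
on the incoming device, which is what lets the second pass through the
stack treat the packets as non-clashing.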

>> As probably everyone has noticed, this is quite similar to what you
>> can do using network namespaces. The main reason for not using
>> network namespaces is that its an all-or-nothing approach, you can't
>> virtualize just connection tracking. 
> 
> Unless there is a clever approach for overlapping IP addresses (my
> question above), i dont see a way around essentially virtualizing the
> whole stack which clone(CLONE_NEWNET) provides..

I don't understand the problem.

>> Beside the difficulties in
>> managing different namespaces from f.i. an IKE or PPP daemon running
>> in the initial namespace, 
> 
> This is a valid concern against the namespace approach. Existing tools
> of course could be taught to know about namespaces - and one could
> argue that if you can resolve the overlap IP address issue, then you
> _have to_ modify user space anyways.

I don't think that's true. In any case it's completely impractical
to modify every userspace tool that does something with networking
and potentially make complex configuration changes to have all
those namespaces interact nicely. Currently they are simply not
very well suited for virtualizing selected parts of networking.

>> network namespaces have a quite large
>> overhead, especially when used with a large conntrack table.
> 
> Elaboration needed.
> You said the size in 64 bit increases to 152B per conntrack i think?

I said code size increases by 152 bytes.

> Do you have a hand-wave figure we can use as a metric to elaborate this
> point? What would a typical user of this feature have in number of
> "zones" and how many contracks per zone? Actually we could also look
> at extremes (huge number vs low numbers)...

I'm not sure whether there is a typical user for overlapping
networks :) I know of setups with ~150 overlapping networks.

The number of conntracks per zone doesn't matter since the
table is shared between all zones. Network namespaces would
allocate 150 tables, each of the same size, which might be
quite large.

> You may also wanna look as a metric at code complexity/maintainability
> of this scheme vs namespace (which adds zero changes to the kernel).

There's not a lot of complexity, it's basically passing a numeric
identifier around in a few spots and comparing it. Something like
TOS handling in the routing code.
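
To make "passing a numeric identifier around and comparing it" concrete,
here is a minimal sketch of a single shared table whose lookup key
includes the zone. The ConnTuple and ConntrackTable names are
hypothetical and do not reflect the kernel's data structures.

```python
from typing import Dict, NamedTuple, Tuple

class ConnTuple(NamedTuple):
    """Simplified connection tuple (illustrative, not the kernel layout)."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: str

class ConntrackTable:
    """One shared table; the zone id is simply part of the lookup key."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ConnTuple], dict] = {}

    def find_or_create(self, zone: int, tup: ConnTuple) -> dict:
        # Entries with equal tuples but different zones never collide.
        return self._entries.setdefault((zone, tup), {"zone": zone, "tuple": tup})

    def __len__(self) -> int:
        return len(self._entries)

table = ConntrackTable()
tup = ConnTuple("10.0.0.1", 40000, "192.0.2.10", 80, "tcp")
entry_zone1 = table.find_or_create(1, tup)
entry_zone2 = table.find_or_create(2, tup)
print(len(table), entry_zone1 is entry_zone2)
```

Two connections with identical tuples coexist because the zone differs,
while all zones still share one table.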

> I am pretty sure you will soon be "zoning" on other pieces of the net
> stack ;->

I've thought about that and I don't think that's necessary for this
use case. It's enough to resolve overlapping address ranges, everything
else can be done in the second pass through the stack.

>> I'm not too fond of this partial feature duplication myself, but I
>> couldn't think of a better way to do this without the downsides of
>> using namespaces. Having partially shared network namespaces would
>> be great, but it doesn't seem to fit in the design very well.
>> I'm open for any better suggestion :)
> 
> My opinions above.
> 
> BTW, why not use skb->mark instead of creating a new semantic construct?

Because people are already using it for different purposes.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
       [not found]     ` <4B4F3A50.1050400-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
@ 2010-01-14 17:33       ` jamal
  0 siblings, 0 replies; 184+ messages in thread
From: jamal @ 2010-01-14 17:33 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

On Thu, 2010-01-14 at 16:37 +0100, Patrick McHardy wrote:
> jamal wrote:

> > Agreed that this would be a main driver of such a feature.
> > Which means that you need zones (or whatever noun other people use) to
> > work on not just netfilter, but also routing, ipsec etc.
> 
> Routing already works fine. I believe IPsec should also work already,
> but I haven't tried it.

Maybe further discussion would clarify this point.

> The zone is set based on some other criteria (in this case the
> incoming device).

If you are using a netdev as a reference point, then I take it that
if you add VLANs it should be possible to have multiple zones on a single
physical netdev? Or is there some other way to satisfy that?

>  The packets make one pass through the stack
> to a veth device and are SNATed in POSTROUTING to non-clashing
> addresses. 

Ok - makes sense.
i.e. NAT would work, and policy routing as well as ARP would be fine.
It also looks sufficiently useful for the specific use case you
are interested in.
But back to my question on routing, IPsec etc. (you may not be
interested in solving this problem, but it is what I was getting at
earlier). Let's take for example:
a) network tables like the SAD/SPD tables: how would you separate those on a
per-zone basis? i.e. 10.0.0.1/zone1 could use a different
policy/association than 10.0.0.1/zone2
b) dynamic protocols (routing, IKE etc.): how do you do that without
making both sides understand what is going on?

> > This is a valid concern against the namespace approach. Existing tools
> > of course could be taught to know about namespaces - and one could
> > argue that if you can resolve the overlap IP address issue, then you
> > _have to_ modify user space anyways.
> 
> I don't think thats true. 

Refer to my statements above for an example.

> In any case its completely impractical
> to modify every userspace tool that does something with networking
> and potentially make complex configuration changes to have all
> those namespaces interact nicely. 

Agreed. But the major ones like iproute2 etc. could be taught. We have
namespaces in the kernel already; over a period of time I think changing
the user space tools would be a sensible evolution.

> Currently they are simply not
> very well suited for virtualizing selected parts of networking.

My contention is that it is a lot less headache to just virtualize
the whole network stack and then use what you want than it is to go and
selectively change the network objects.
Note: if I wanted to, today I could run racoon in every namespace
unchanged and it would work, or I could modify racoon to understand
namespaces...

> I'm not sure whether there is a typical user for overlapping
> networks :) I know of setups with ~150 overlapping networks.
> 
> The number of conntracks per zone doesn't matter since the
> table is shared between all zones. network namespaces would
> allocate 150 tables, each of the same size, which might be
> quite large.

That's what I was looking for.
So the difference, to pick the 150 zones example so as to put a number
on it, is that namespaces will consume 150 * X bytes (where X is the
overhead of a conntrack table) and your approach will be (X + 152) bytes,
correct?
What is the typical size of X?

> > You may also wanna look as a metric at code complexity/maintainability
> > of this scheme vs namespace (which adds zero changes to the kernel).
> 
> There's not a lot of complexity, its basically passing a numeric
> identifier around in a few spots and comparing it. Something like
> TOS handling in the routing code.

I think the challenge is whether zones will have to encroach on other
net stack objects or not. You are already touching struct net_device...
A digression: TOS is really different - it has network-level semantics. This
would be more like mark or in some cases ifindex (i.e. local semantics).
 
> > BTW, why not use skb->mark instead of creating a new semantic construct?
> 
> Because people are already using it for different purposes.

tru dat - it only gives you one semantic axis and you need an
additional dimension in your case (namespaces have that resolved via
struct net).

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-01-14 15:05 ` jamal
  2010-01-14 15:37   ` Patrick McHardy
  2010-01-14 15:37   ` Patrick McHardy
@ 2010-01-14 18:32   ` Ben Greear
  2010-01-15 15:03     ` jamal
       [not found]     ` <4B4F6332.50606-my8/4N5VtI7c+919tysfdA@public.gmane.org>
  2 siblings, 2 replies; 184+ messages in thread
From: Ben Greear @ 2010-01-14 18:32 UTC (permalink / raw)
  To: hadi
  Cc: Linux Netdev List, containers,
	Netfilter Development Mailinglist

On 01/14/2010 07:05 AM, jamal wrote:
>
> Ive had an equivalent discussion with B Greear (CCed) at one point on
> something similar, curious if you solve things differently - couldnt
> tell from the patch if you address it.
> Comments inline:
>
> On Thu, 2010-01-14 at 15:05 +0100, Patrick McHardy wrote:
>> The attached largish patch adds support for "conntrack zones",
>> which are virtual conntrack tables that can be used to seperate
>> connections from different zones, allowing to handle multiple
>> connections with equal identities in conntrack and NAT.
>>
>> A zone is simply a numerical identifier associated with a network
>> device that is incorporated into the various hashes and used to
>> distinguish entries in addition to the connection tuples. Additionally
>> it is used to seperate conntrack defragmentation queues. An iptables
>> target for the raw table could be used alternatively to the network
>> device for assigning conntrack entries to zones.
>>
>>
>> This is mainly useful when connecting multiple private networks using
>> the same addresses (which unfortunately happens occasionally)
>
> Agreed that this would be a main driver of such a feature.
> Which means that you need zones (or whatever noun other people use) to
> work on not just netfilter, but also routing, ipsec etc.
> As a digression: this is trivial to solve with network namespaces.

For small or simple cases, this may be true, but it is a lot of work
to write a complex user-space app that manages an arbitrary number of interfaces
and routing tables in an arbitrary number of network namespaces.  With the conntrack-zones
approach, user-space apps do not require any significant changes, and you do not
need the rest of the namespace overhead to accomplish the task.

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-01-14 17:33     ` jamal
@ 2010-01-15 10:15       ` Patrick McHardy
  2010-01-15 10:15       ` Patrick McHardy
  1 sibling, 0 replies; 184+ messages in thread
From: Patrick McHardy @ 2010-01-15 10:15 UTC (permalink / raw)
  To: hadi
  Cc: Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

jamal wrote:
> On Thu, 2010-01-14 at 16:37 +0100, Patrick McHardy wrote:
>> jamal wrote:
> 
>>> Agreed that this would be a main driver of such a feature.
>>> Which means that you need zones (or whatever noun other people use) to
>>> work on not just netfilter, but also routing, ipsec etc.
>> Routing already works fine. I believe IPsec should also work already,
>> but I haven't tried it.
> 
> maybe further discussion  would clarify this point..
> 
>> The zone is set based on some other criteria (in this case the
>> incoming device).
> 
> If you are using a netdev as a reference point, then I take it 
> if you add vlans should be possible to do multiple zones on a single
> physical netdev? Or is there some other way to satisfy that?

Yes, you can assign a zone to each netdev. macvlan will also work.

Using a netfilter target for the raw table might be a better choice
on second thought though, it provides more flexibility and avoids
the netfilter-specific device setting. I'll probably change that.

>>  The packets make one pass through the stack
>> to a veth device and are SNATed in POSTROUTING to non-clashing
>> addresses. 
> 
> Ok - makes sense. 
> i.e NAT would work; and policy routing as well as arp would be fine.
> Also it looks to be sufficiently useful to fit a specific use case you
> are interested in.
> But back to my question on routing, ipsec etc (and you may not be
> interested in solving this problem, but it is what i was getting to
> earlier). Lets take for example: 
> a) network tables like SAD/SPD tables: how you would separate those on a
> per-zone basis? i.e 10.0.0.1/zone1 could use different
> policy/association than 10.0.0.1/zone2

The selectors include an ifindex, which could be used to
distinguish both based on the interface.

> b) dynamic protocols (routing, IKE etc): how do you do that without 
> making both sides understand what is going on?

In the case of IPsec the outer addresses are different; it's only the
selectors which will have similar addresses. A keying daemon should
have no trouble with this. The ifindex would be needed in the
selectors though to make sure each policy is used for the correct
traffic.
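
A small sketch of the idea that an ifindex in the selector can
disambiguate policies whose address selectors overlap. The Policy type,
the lookup() helper, and the interface indexes are hypothetical;
treating ifindex 0 as a wildcard is an assumption of this sketch.

```python
import ipaddress
from typing import List, NamedTuple, Optional

class Policy(NamedTuple):
    """Illustrative policy with an address selector plus an ifindex."""
    name: str
    src_net: str
    ifindex: int  # 0 acts as "any interface" in this sketch

def lookup(policies: List[Policy], src: str, ifindex: int) -> Optional[Policy]:
    """Return the first policy matching both the address and the interface."""
    addr = ipaddress.ip_address(src)
    for policy in policies:
        if addr in ipaddress.ip_network(policy.src_net) and policy.ifindex in (0, ifindex):
            return policy
    return None

# Two tunnels with the same inner network, told apart only by ifindex.
policies = [
    Policy("tunnel-a", "10.0.0.0/24", 3),
    Policy("tunnel-b", "10.0.0.0/24", 4),
]
print(lookup(policies, "10.0.0.1", 3).name)
print(lookup(policies, "10.0.0.1", 4).name)
```

Without the ifindex both lookups would hit the first policy, since the
address selectors are identical.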

Using a routing daemon is unrealistic in this scenario, at
least a single one for all the overlapping networks.

>>> This is a valid concern against the namespace approach. Existing tools
>>> of course could be taught to know about namespaces - and one could
>>> argue that if you can resolve the overlap IP address issue, then you
>>> _have to_ modify user space anyways.
>> I don't think thats true. 
> 
> Refer to my statements above for an example.
> 
>> In any case its completely impractical
>> to modify every userspace tool that does something with networking
>> and potentially make complex configuration changes to have all
>> those namespaces interact nicely. 
> 
> Agreed. But the major ones like iproute2 etc could be taught. We have
> namespaces in the kernel already, over a period of time I think changing
> the user space tools would a sensible evolution.

Yes, that might be useful in any case. But I don't think it would
even work for iproute or other standalone programs: a process can't
associate to an existing namespace except through clone(). So it
needs to run as child of a process already associated with the
namespace.

>> Currently they are simply not
>> very well suited for virtualizing selected parts of networking.
> 
> My contention is that it is a lot less headache to just virtualize 
> all the network stack and then use what you want than it is to go and
> selectively changing the network objects.
> Note: if i wanted today i could run racoon on every namespace 
> unchanged and it would work or i could modify racoon to understand
> namespaces...

See above.

>> I'm not sure whether there is a typical user for overlapping
>> networks :) I know of setups with ~150 overlapping networks.
>>
>> The number of conntracks per zone doesn't matter since the
>> table is shared between all zones. network namespaces would
>> allocate 150 tables, each of the same size, which might be
>> quite large.
> 
> Thats what i was looking for ..
> So the difference, to pick the 150 zones example so as to put a number
> around it, is namespaces will consume 150.X bytes (where X is the
> overhead of a conntrack table) and you approach will be (X + 152) bytes,
> correct?
> What is the typical sizeof X?

No. To give some concrete numbers: assuming a conntrack table of
10MB (large, but reasonable depending on the number of connections)
we get an overhead of:

namespaces: 150 * 10MB memory use
"zones": 152 bytes increased code size

Both approaches additionally need one extra connection tracking
entry of ~300 bytes per connection that is actually handled twice.
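
The comparison above can be written out as plain arithmetic. This is
only a sketch of the figures quoted in this mail; reading "10MB" as
10 MiB is an assumption, and the function names are hypothetical.

```python
TABLE_BYTES = 10 * 1024 * 1024   # assumed: "10MB" read as 10 MiB
NETWORKS = 150                   # overlapping networks from the example
ZONE_CODE_BYTES = 152            # one-time code size increase

def namespace_total(networks: int, table_bytes: int) -> int:
    """One full conntrack table per namespace."""
    return networks * table_bytes

def zone_total(table_bytes: int, code_bytes: int) -> int:
    """One shared table plus the fixed code size increase."""
    return table_bytes + code_bytes

ns_bytes = namespace_total(NETWORKS, TABLE_BYTES)
zone_bytes = zone_total(TABLE_BYTES, ZONE_CODE_BYTES)
print(ns_bytes // (1024 * 1024), "MiB with namespaces")
print(zone_bytes, "bytes with zones")
```

The per-connection cost of the extra tracking entry is the same in both
approaches, so it is left out of the comparison.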

>>> You may also wanna look as a metric at code complexity/maintainability
>>> of this scheme vs namespace (which adds zero changes to the kernel).
>> There's not a lot of complexity, its basically passing a numeric
>> identifier around in a few spots and comparing it. Something like
>> TOS handling in the routing code.
> 
> I think the challenge is whether zones will have to encroach on other
> net stack objects or not. You are already touching structure netdev...

That will go away once I add a target for classification. I completely
agree that it's undesirable to add this in more spots, but this is meant
purely for being able to pass traffic through conntrack/NAT more than
once.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-01-14 17:33     ` jamal
  2010-01-15 10:15       ` Patrick McHardy
@ 2010-01-15 10:15       ` Patrick McHardy
  2010-01-15 15:19         ` jamal
       [not found]         ` <4B50403A.6010507-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Patrick McHardy @ 2010-01-15 10:15 UTC (permalink / raw)
  To: hadi
  Cc: Netfilter Development Mailinglist, Linux Netdev List, containers,
	Ben Greear

jamal wrote:
> On Thu, 2010-01-14 at 16:37 +0100, Patrick McHardy wrote:
>> jamal wrote:
> 
>>> Agreed that this would be a main driver of such a feature.
>>> Which means that you need zones (or whatever noun other people use) to
>>> work on not just netfilter, but also routing, ipsec etc.
>> Routing already works fine. I believe IPsec should also work already,
>> but I haven't tried it.
> 
> maybe further discussion  would clarify this point..
> 
>> The zone is set based on some other criteria (in this case the
>> incoming device).
> 
> If you are using a netdev as a reference point, then I take it 
> if you add vlans should be possible to do multiple zones on a single
> physical netdev? Or is there some other way to satisfy that?

Yes, you can assign a zone to each netdev. macvlan will also work.

Using a netfilter target for the raw table might be a better choice
on second thought though, it provides more flexibility and avoids
the netfilter-specific device setting. I'll probably change that.

>>  The packets make one pass through the stack
>> to a veth device and are SNATed in POSTROUTING to non-clashing
>> addresses. 
> 
> Ok - makes sense. 
> i.e NAT would work; and policy routing as well as arp would be fine.
> Also it looks to be sufficiently useful to fit a specific use case you
> are interested in.
> But back to my question on routing, ipsec etc (and you may not be
> interested in solving this problem, but it is what i was getting to
> earlier). Lets take for example: 
> a) network tables like SAD/SPD tables: how you would separate those on a
> per-zone basis? i.e 10.0.0.1/zone1 could use different
> policy/association than 10.0.0.1/zone2

The selectors include an ifindex, which could be used to
distinguish both based on the interface.

> b) dynamic protocols (routing, IKE etc): how do you do that without 
> making both sides understand what is going on?

In the case of IPsec the outer addresses are different; it's only the
selectors which will have similar addresses. A keying daemon should
have no trouble with this. The ifindex would be needed in the
selectors though to make sure each policy is used for the correct
traffic.

Using a routing daemon is unrealistic in this scenario, at
least a single one for all the overlapping networks.

>>> This is a valid concern against the namespace approach. Existing tools
>>> of course could be taught to know about namespaces - and one could
>>> argue that if you can resolve the overlap IP address issue, then you
>>> _have to_ modify user space anyways.
>> I don't think thats true. 
> 
> Refer to my statements above for an example.
> 
>> In any case its completely impractical
>> to modify every userspace tool that does something with networking
>> and potentially make complex configuration changes to have all
>> those namespaces interact nicely. 
> 
> Agreed. But the major ones like iproute2 etc could be taught. We have
> namespaces in the kernel already, over a period of time I think changing
> the user space tools would a sensible evolution.

Yes, that might be useful in any case. But I don't think it would
even work for iproute or other standalone programs: a process can't
associate with an existing namespace except through clone(), so it
needs to run as a child of a process already associated with the
namespace.

>> Currently they are simply not
>> very well suited for virtualizing selected parts of networking.
> 
> My contention is that it is a lot less headache to just virtualize 
> all the network stack and then use what you want than it is to go and
> selectively changing the network objects.
> Note: if i wanted today i could run racoon on every namespace 
> unchanged and it would work or i could modify racoon to understand
> namespaces...

See above.

>> I'm not sure whether there is a typical user for overlapping
>> networks :) I know of setups with ~150 overlapping networks.
>>
>> The number of conntracks per zone doesn't matter since the
>> table is shared between all zones. network namespaces would
>> allocate 150 tables, each of the same size, which might be
>> quite large.
> 
> Thats what i was looking for ..
> So the difference, to pick the 150 zones example so as to put a number
> around it, is namespaces will consume 150.X bytes (where X is the
> overhead of a conntrack table) and you approach will be (X + 152) bytes,
> correct?
> What is the typical sizeof X?

No. To give some correct numbers: assuming a conntrack table of
10MB (large, but reasonable depending on the number of connections),
we get an overhead of:

namespaces: 150 * 10MB memory use
"zones": 152 bytes increased code size

Both approaches additionally need one extra connection tracking
entry of ~300 bytes per connection that is actually handled twice.
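
As a rough sketch of the arithmetic above: the helper names are made up
for illustration, 10MB is assumed to mean 10 * 1024 * 1024 bytes, and
the 152 bytes are code size, not per-zone data.

```c
#include <assert.h>
#include <stdint.h>

/* Figures from the mail above; the MB definition is an assumption. */
#define TABLE_BYTES	(10ULL * 1024 * 1024)	/* one 10MB conntrack table */
#define ZONE_CODE_BYTES	152ULL			/* increased code size with zones */

/* Network namespaces: one full-size table is allocated per namespace. */
uint64_t netns_table_bytes(unsigned int networks, uint64_t table_bytes)
{
	assert(table_bytes > 0);
	return (uint64_t)networks * table_bytes;
}

/* Zones: a single table shared by all zones, plus the extra code. */
uint64_t zones_table_bytes(uint64_t table_bytes)
{
	assert(table_bytes > 0);
	return table_bytes + ZONE_CODE_BYTES;
}
```

Calling netns_table_bytes(150, TABLE_BYTES) gives the 150 * 10MB figure,
while zones_table_bytes(TABLE_BYTES) stays at one table plus 152 bytes no
matter how many zones are in use.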

>>> You may also wanna look as a metric at code complexity/maintainability
>>> of this scheme vs namespace (which adds zero changes to the kernel).
>> There's not a lot of complexity, its basically passing a numeric
>> identifier around in a few spots and comparing it. Something like
>> TOS handling in the routing code.
> 
> I think the challenge is whether zones will have to encroach on other
> net stack objects or not. You are already touching structure netdev...

That will go away once I add a target for classification. I completely
agree that it's undesirable to add this in more spots, but this is meant
purely for being able to pass traffic through conntrack/NAT more than
once.
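
To illustrate what "passing a numeric identifier around in a few spots
and comparing it" can look like, here is a minimal sketch of a lookup
key that carries a zone next to the connection tuple. The struct and
function names are hypothetical and far simpler than the real
nf_conntrack structures; this is not the code from the patch, and the
hash is a toy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified key; not the kernel's nf_conntrack_tuple. */
struct ct_key {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t  proto;
	uint16_t zone;		/* 0 is taken to be the default zone */
};

/* Two entries match only if both the tuple and the zone are equal. */
bool ct_key_equal(const struct ct_key *a, const struct ct_key *b)
{
	return a->zone == b->zone &&
	       a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port &&
	       a->proto == b->proto;
}

/* Mix the zone into the bucket hash (FNV-1a-like mixing over 32-bit words). */
uint32_t ct_key_hash(const struct ct_key *k, uint32_t buckets)
{
	uint32_t words[5] = { k->src_ip, k->dst_ip,
			      ((uint32_t)k->src_port << 16) | k->dst_port,
			      k->proto, k->zone };
	uint32_t h = 2166136261u;

	assert(buckets > 0);
	for (int i = 0; i < 5; i++) {
		h ^= words[i];
		h *= 16777619u;
	}
	return h % buckets;
}
```

With a key like this, two connections with identical addresses and ports
but different zones never compare equal, which is the property needed
for overlapping networks.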

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
       [not found]     ` <4B4F6332.50606-my8/4N5VtI7c+919tysfdA@public.gmane.org>
@ 2010-01-15 15:03       ` jamal
  0 siblings, 0 replies; 184+ messages in thread
From: jamal @ 2010-01-15 15:03 UTC (permalink / raw)
  To: Ben Greear
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist

On Thu, 2010-01-14 at 10:32 -0800, Ben Greear wrote:

> For small or simple cases, this may be true..but there is a lot of work
> to make a complex user-space app that manages arbitrary amounts of interfaces
> routing tables in an arbitrary amount of network namespaces.  With the conntrack-zones
> approach, user-space apps do not require any significant changes, and you do not
> need the rest of the namespace overhead to accomplish the task.

I think for your use case what you state is true. In the general case,
it is not.
Note: I am not arguing against the patch - just that, compared to
namespaces, it is not a solution for the generic scenario.

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
       [not found]         ` <4B50403A.6010507-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
@ 2010-01-15 15:19           ` jamal
  0 siblings, 0 replies; 184+ messages in thread
From: jamal @ 2010-01-15 15:19 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

On Fri, 2010-01-15 at 11:15 +0100, Patrick McHardy wrote:
> jamal wrote:

> > b) dynamic protocols (routing, IKE etc): how do you do that without 
> > making both sides understand what is going on?
> 
> In case of IPsec the outer addresses are different, it's only the
> selectors which will have similar addresses. A keying daemon should
> have no trouble with this. The ifindex would be needed in the
> selectors though to make sure each policy is used for the correct
> traffic.

You need user space to know the mapping between an
ifindex and a zone. It may work, perhaps with that info given explicitly
in the config, with tunnel mode/ESP.

> A routing daemon is unrealistic to be used in this scenario, at
> least a single one for all the overlapping networks.

I think in general, it would be hard to deal with anything that requires
dynamic control where one or more peers have to discover each other once
you have IP overlap. You will have to change those user space apps.

In any case, for what you seem to intend this for, i think it works.

> > Agreed. But the major ones like iproute2 etc could be taught. We have
> > namespaces in the kernel already, over a period of time I think changing
> > the user space tools would a sensible evolution.
> 
> Yes, that might be useful in any case. But I don't think it would
> even work for iproute or other standalone programs, a process can't
> associate to an existing namespace except through clone(). So it
> needs to run as child of a process already associated with the
> namespace.

The mechanics are not there, yet. But if i had sufficient permission,
and was able to find the namespaces when i ask and/or get events when one
is created, it should just be a matter of sending it a message.
The current approach to, say, migrating a veth via iproute2 requires that
we know the pid of the target namespace. That's a usability issue.
I tried to muck with namespaces and if you use a library like lxc
you can do it - but it is a hack as it stands today (and merging
iproute2 with lxc is questionable).


>  (X + 152) bytes,
> > correct?
> > What is the typical sizeof X?
> 
> No, to give some correct number. Assuming a conntrack table of
> 10MB (large, but reasonable depending on the number of connections)
> we get an overhead of:
> 
> namespaces: 150 * 10MB memory use
> "zones": 152 bytes increased code size

That is substantial if you are doing an embedded device.
But otherwise, RAM is so cheap that i would take usability
any day for an extra $5.

BTW, I think the zones approach will still use more than 10MB
in this case given it encompasses all "zones" whereas namespace only
does it for a single mapped "zone".

> Both approaches additionally need one extra connection tracking
> entry of ~300 bytes per connection that is actually handled twice.

Ok, so computation is not a differentiator.

> That will go away once I add a target for classification. 

Makes sense.
On a side note: I wouldn't mind seeing some field in struct
netdev for some general purpose grouping/IDing which could be
set from user space.

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-01-15 15:19         ` jamal
@ 2010-02-22 20:46           ` Eric W. Biederman
  2010-02-22 20:46           ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-22 20:46 UTC (permalink / raw)
  To: hadi-fAAogVwAN2Kw5LPnMra/2Q
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist

jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> writes:

>> > Agreed. But the major ones like iproute2 etc could be taught. We have
>> > namespaces in the kernel already, over a period of time I think changing
>> > the user space tools would a sensible evolution.
>> 
>> Yes, that might be useful in any case. But I don't think it would
>> even work for iproute or other standalone programs, a process can't
>> associate to an existing namespace except through clone(). So it
>> needs to run as child of a process already associated with the
>> namespace.
>
> The mechanics are not there, yet. But if i had sufficient permission,
> and was able to find the namespaces when i ask and/or get events when it
> is created it should be an issue of sending it a message.
> The current approach to say migrate a veth via iproute2 requires we 
> know the pid of the target namespace. Thats a usability issue.
> I tried to muck with namespaces and if you use a library like lxc
> you can do it - but it is a hack as it stands today (and merging
> iproute2 with lxc is questionable).

This is one of the long standing issues that we have always known
we needed to solve, but have not taken the time to do it.  Now that
the need is more real it looks about time to solve this one.

There are currently two problems:
1) A process is needed to hold a reference to the network namespace.
2) We use pids, which are an awkward way of talking about network
   namespaces.

The solution I have been playing with involves:
- Using a file descriptor to refer to a network namespace.
- Using a trivial virtual filesystem to persistently hold onto
  a namespace without the need of a process.
- Having a convention of mounting the fs at something like
  /var/run/netns/<name>

That solves the naming problem, and it should allow iproute and
its kin to have support without being closely integrated with
lxc or anything else that creates namespaces.

It is a big conversation, and it is something that has to be done
right, but it looks like the problem is finally real enough that
it is time to solve it.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
       [not found]             ` <m13a0tf17t.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-22 21:55               ` jamal
  0 siblings, 0 replies; 184+ messages in thread
From: jamal @ 2010-02-22 21:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist

On Mon, 2010-02-22 at 12:46 -0800, Eric W. Biederman wrote:
> jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> writes:

> 
> This is one of the long standing issues that we have always known
> we needed to solve, but have not taken the time to do it.  Now that
> the need is more real it looks about time to solve this one.
> 
> There are currently two problems.
> 1) A process is needed to hold a reference to the network namespace.
> 2) We use pids which are an awkward way of talking about network
>    namespaces.
> 
> The solution I have been playing with involves.
> - Using a file descriptor to refer to a network namespace.
> - Using a trivial virtual filesystem to persistently hold onto
>   a namespace without the need of a process.
> - Have a convention of mounting the fs at something like
>   /var/run/netns/<name>
> 

I didn't quite follow how i could use the above to do:
"ip ns <name/id> route add blah" from namespace0.

I tend to think in packets and wires instead of files;
How about just allowing a "control" channel from which
i could discover the namespace?
Example, assuming i have the right permissions:
1) listen to async events, for example on a multicast bus, when
a namespace is created or destroyed. Provide me a little more info on
the created namespace such as its pid, name(?), types of namespace, etc.
2) send a query to dump existing namespaces or query by name, id etc.
I get the same details as above.

Using genetlink should provide you with sufficient ability to do this.

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-22 21:55             ` jamal
@ 2010-02-22 23:17               ` Eric W. Biederman
  2010-02-22 23:17               ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-22 23:17 UTC (permalink / raw)
  To: hadi-fAAogVwAN2Kw5LPnMra/2Q
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist

jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> writes:

> On Mon, 2010-02-22 at 12:46 -0800, Eric W. Biederman wrote:
>> jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> writes:
>
>> 
>> This is one of the long standing issues that we have always known
>> we needed to solve, but have not taken the time to do it.  Now that
>> the need is more real it looks about time to solve this one.
>> 
>> There are currently two problems.
>> 1) A process is needed to hold a reference to the network namespace.
>> 2) We use pids which are an awkward way of talking about network
>>    namespaces.
>> 
>> The solution I have been playing with involves.
>> - Using a file descriptor to refer to a network namespace.
>> - Using a trivial virtual filesystem to persistently hold onto
>>   a namespace without the need of a process.
>> - Have a convention of mounting the fs at something like
>>   /var/run/netns/<name>
>> 
>
> I didnt quiet follow how i could use the above to do:
> "ip ns <name/id> route add blah" from namespace0.
>
> I tend to think in packets and wires instead of files;
> How about just allowing a "control" channel from which
> i could discover the namespace?
> Example, assuming i have the right permissions:
> 1) listen to async events example on a multicast bus when
> a namespace is created or destroyed. Provide me a little more info on
> the created namespace such as its pid, name(?), types of namespace, etc
> 2) send a query to dump existing namespace or query by name, id etc.
> I get the same details as above.
>
> using genetlink should provide you with sufficient ability to do this.

What I am thinking is:

"ip ns <name> route add blah" is:
fd = open("/var/run/netns/<name>");
sys_setns(fd);  /* Like unshare but takes an existing namespace */
/* Then the rest of the existing ip command */

"ip ns list" is:
dfd = open("/var/run/netns", O_DIRECTORY);
getdents(dfd, buf, count);

"ip ns new <name>" is:
unshare(CLONE_NEWNS);
fd = nsfd(NETNS);
mkdir("/var/run/netns/<name>");
mount("none", "/var/run/netns/<name>", "ns", 0, fd);

Using unix domain names means that which namespaces you see is under
the control of userspace, which allows for nested containers (something
I use today), and ultimately container migration.

Using genetlink userspace doesn't result in a nestable implementation
unless I introduce yet another namespace, ugh.
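
The "ip ns list" step above can be sketched in plain C with
opendir()/readdir(), the portable wrappers around getdents. The
function name and the fixed-size name buffer are assumptions made for
illustration; the directory to pass in would be the /var/run/netns
convention proposed above.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define NETNS_NAME_MAX 256

/*
 * Collect up to max entry names from dir (e.g. "/var/run/netns") into
 * names. Returns the number of names stored, or -1 if the directory
 * cannot be opened.
 */
int list_netns(const char *dir, char names[][NETNS_NAME_MAX], int max)
{
	DIR *d;
	struct dirent *de;
	int n = 0;

	assert(dir != NULL && max >= 0);
	d = opendir(dir);
	if (!d)
		return -1;

	while (n < max && (de = readdir(d)) != NULL) {
		/* Skip the "." and ".." entries. */
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(names[n], NETNS_NAME_MAX, "%s", de->d_name);
		n++;
	}
	closedir(d);
	return n;
}
```

Because the listing is just a directory read, what a caller sees depends
only on what is mounted at that path in its own mount namespace.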

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
       [not found]                 ` <m1wry46es9.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-23 13:27                   ` jamal
  2010-02-23 14:07                     ` Eric W. Biederman
  2010-02-23 14:07                     ` Eric W. Biederman
  0 siblings, 2 replies; 184+ messages in thread
From: jamal @ 2010-02-23 13:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist

On Mon, 2010-02-22 at 15:17 -0800, Eric W. Biederman wrote:

> What I am thinking is:
> 
> "ip ns <name> route add blah" is:
> fd = open("/var/run/netns/<name>");
> sys_setns(fd);  /* Like unshare but takes an existing namespace */
> /* Then the rest of the existing ip command */

The other two below make some sense. For the above:
does the point after sys_setns(fd) allow me to do I/O inside
ns <name>? Can i do open() and get a fd from ns <name>?

> "ip ns list" is:
> dfd = open("/var/run/netns", O_DIRECTORY);
> getdents(dfd, buf, count);
> 
> "ip ns new <name>" is:
> unshare(CLONE_NEWNS);
> fd = nsfd(NETNS);
> mkdir("/var/run/netns/<name>");
> mount("none", "/var/run/netns/<name>", "ns", 0, fd);
> 
> Using unix domain names means that which namespaces you see is under
> control of userspace.  Which allows for nested containers (something I
> use today), and ultimately container migration.

The only problem that i see is events are not as nice. I take it i am 
going to get something like an inotify when a new namespace is created?

> Using genetlink userspace doesn't result in a nestable implementation
> unless I introduce yet another namespace, ugh.

Is it not just a naming convention that you are dealing with?
For example, in your scheme above a nested namespace shows up as:
/var/run/netns/<name>/<nestedname>, no?

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 13:27                   ` jamal
@ 2010-02-23 14:07                     ` Eric W. Biederman
  2010-02-23 14:20                       ` jamal
       [not found]                       ` <m1iq9ocafv.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-23 14:07                     ` Eric W. Biederman
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-23 14:07 UTC (permalink / raw)
  To: hadi
  Cc: Patrick McHardy, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

jamal <hadi@cyberus.ca> writes:

> On Mon, 2010-02-22 at 15:17 -0800, Eric W. Biederman wrote:
>
>> What I am thinking is:
>> 
>> "ip ns <name> route add blah" is:
>> fd = open("/var/run/netns/<name>");
>> sys_setns(fd);  /* Like unshare but takes an existing namespace */
>> /* Then the rest of the existing ip command */
>
> The other two below make some sense; For the above:
> Does the point after sys_setns(fd) allow me to do io inside
> ns <name>? Can i do open() and get a fd from ns <name>?

Yes.  My intention is that current->nsproxy->net_ns be changed.
We can already change it in unshare so this is feasible.
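
As a rough illustration of the unshare side of this, the "ip ns new"
steps quoted below could look something like the following sketch. It
rests on assumptions that are not in this thread: nsfd() and the "ns"
filesystem type are only proposals here, so the sketch instead
bind-mounts /proc/self/ns/net (an interface from later kernels), uses
CLONE_NEWNET as the flag that actually creates a network namespace, and
uses a plain file rather than a directory as the mount point.
netns_create() and its parameters are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: create a new network namespace and pin it at
 * <dir>/<name> so that it outlives the calling process.
 * Returns 0 on success, -1 with errno set on failure. */
int netns_create(const char *dir, const char *name)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* The mount point is an empty file; O_EXCL rejects duplicate names. */
    int fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0);
    if (fd < 0)
        return -1;
    close(fd);

    /* unshare() switches this process to a fresh network namespace;
     * the bind mount then keeps a reference to it (assumed interface,
     * standing in for the nsfd()/"ns" mount proposed in the thread). */
    if (unshare(CLONE_NEWNET) < 0 ||
        mount("/proc/self/ns/net", path, "none", MS_BIND, NULL) < 0) {
        int saved = errno;
        unlink(path);
        errno = saved;
        return -1;
    }
    return 0;
}
```

This needs CAP_SYS_ADMIN, and it leaves the calling process inside the
new namespace, which is acceptable for a short-lived command such as ip
but not for a long-running daemon.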

>> "ip ns list" is:
>> dfd = open("/var/run/netns", O_DIRECTORY);
>> getdents(dfd, buf, count);
>> 
>> "ip ns new <name>" is:
>> unshare(CLONE_NEWNS);
>> fd = nsfd(NETNS);
>> mkdir("/var/run/netns/<name>");
>> mount("none", "/var/run/netns/<name>", "ns", 0, fd);
>> 
>> Using unix domain names means that which namespaces you see is under
>> control of userspace.  Which allows for nested containers (something I
>> use today), and ultimately container migration.
>
> The only problem that i see is events are not as nice. I take it i am 
> going to get something like an inotify when a new namespace is created?

Yes.  Inotify would at the very least see that mkdir.  You could also
use poll on /proc/mounts to see the set of mounts change.
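
A minimal sketch of the inotify side, assuming namespaces show up as
new entries in a directory such as /var/run/netns. The function names
are hypothetical, and the sketch only reports creations; removals
would be handled the same way with IN_DELETE.

```c
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Start watching `dir` for newly created entries.
 * Returns an inotify fd, or -1 on error. */
int netns_watch_open(const char *dir)
{
    int ifd = inotify_init1(IN_CLOEXEC);
    if (ifd < 0)
        return -1;
    if (inotify_add_watch(ifd, dir, IN_CREATE) < 0) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/* Wait until an entry called `name` is created in the watched
 * directory. Returns 1 when it is seen, 0 if a poll() call times out,
 * -1 on error. timeout_ms applies to each poll() call, not to the
 * wait as a whole. */
int netns_watch_wait_for(int ifd, const char *name, int timeout_ms)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    struct pollfd pfd = { .fd = ifd, .events = POLLIN };

    for (;;) {
        int ready = poll(&pfd, 1, timeout_ms);
        if (ready <= 0)
            return ready;               /* 0: timeout, -1: error */

        ssize_t n = read(ifd, buf, sizeof(buf));
        if (n <= 0)
            return -1;

        /* One read() may return several variable-length events. */
        for (char *p = buf; p < buf + n; ) {
            const struct inotify_event *ev =
                (const struct inotify_event *)p;
            if ((ev->mask & IN_CREATE) && ev->len) {
                printf("created: %s\n", ev->name);
                if (strcmp(ev->name, name) == 0)
                    return 1;
            }
            p += sizeof(*ev) + ev->len;
        }
    }
}
```

Note that inotify only sees the directory entry appear; it says
nothing about whether the namespace has been mounted on it yet.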

>> Using genetlink userspace doesn't result in a nestable implementation
>> unless I introduce yet another namespace, ugh.
>
> Is it not just a naming convention that you are dealing with?
> Example in your scheme above a nested namespace shows up as:
> /var/run/netns/<name>/<nestedname>, no?

No.  More like:

For the outer namespace:
/var/run/netns/<name>

For the inner namespace:
/some/random/fs/path/to/a/chroot/var/run/netns/<name>

For a doubly nested scenario:
/some/random/fs/path/to/a/chroot/some/other/random/fs/path/to/another/chroot/var/run/netns/<name>

Since I would be using mount namespaces instead of chroot it is not
strictly required that the fs paths nest at all.
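
To make the path layout concrete, here is a small sketch of "ip ns
list" that takes the container root as a parameter. The root argument
and both function names are assumptions for illustration; with mount
namespaces each container would simply see its own /var/run/netns and
pass an empty root.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Build "<root>/var/run/netns" into buf.
 * Returns 0 on success, -1 if the result does not fit. */
int netns_dir(char *buf, size_t len, const char *root)
{
    int n = snprintf(buf, len, "%s/var/run/netns", root);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Print the namespace names visible under `root`.
 * Returns the number of names printed, or -1 on error. */
int netns_list(const char *root)
{
    char dir[4096];
    if (netns_dir(dir, sizeof(dir), root) < 0)
        return -1;

    DIR *d = opendir(dir);
    if (!d)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        /* Skip the "." and ".." entries. */
        if (strcmp(ent->d_name, ".") == 0 ||
            strcmp(ent->d_name, "..") == 0)
            continue;
        printf("%s\n", ent->d_name);
        count++;
    }
    closedir(d);
    return count;
}
```

The outer namespace would call netns_list(""), while a tool looking
into a chroot-style container from outside would pass that container's
root path.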

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 14:07                     ` Eric W. Biederman
@ 2010-02-23 14:20                       ` jamal
  2010-02-23 20:00                         ` Eric W. Biederman
  2010-02-23 20:00                         ` Eric W. Biederman
       [not found]                       ` <m1iq9ocafv.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: jamal @ 2010-02-23 14:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Patrick McHardy, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

Added Daniel to the discussion..

On Tue, 2010-02-23 at 06:07 -0800, Eric W. Biederman wrote:
> jamal <hadi@cyberus.ca> writes:

> > Does the point after sys_setns(fd) allow me to do io inside
> > ns <name>? Can i do open() and get a fd from ns <name>?
> 
> Yes.  My intention is that current->nsproxy->net_ns be changed.
> We can already change it in unshare so this is feasible.

I like it if it makes it as easy as it sounds ;-> With lxc,
I essentially have to create a proxy process inside the
namespace and talk to it over a unix domain socket to open fds
inside the ns. Do I still need that?

> > The only problem that i see is events are not as nice. I take it i am 
> > going to get something like an inotify when a new namespace is created?
> 
> Yes.  Inotify would at the very least see that mkdir.  You could also
> use poll on /proc/mounts to see the set of mounts change.

It is not as nice but livable. I suppose attributes of the specific
namespace are retrieved somewhere there as well..

> > Is it not just a naming convention that you are dealing with?
> > Example in your scheme above a nested namespace shows up as:
> > /var/run/netns/<name>/<nestedname>, no?
> 
> No.  More like:
> 
> For the outer namespace:
> /var/run/netns/<name>
> 
> For the inner namespace:
> /some/random/fs/path/to/a/chroot/var/run/netns/<name>
> 
> For a doubly nested scenario:
> /some/random/fs/path/to/a/chroot/some/other/random/fs/path/to/another/chroot/var/run/netns/<name>
> 
> Since I would be using mount namespaces instead of chroot it is not
> strictly required that the fs paths nest at all.

Ok.

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 14:20                       ` jamal
@ 2010-02-23 20:00                         ` Eric W. Biederman
  2010-02-23 23:09                           ` jamal
                                             ` (2 more replies)
  2010-02-23 20:00                         ` Eric W. Biederman
  1 sibling, 3 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-23 20:00 UTC (permalink / raw)
  To: hadi
  Cc: Daniel Lezcano, Patrick McHardy, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

jamal <hadi@cyberus.ca> writes:

> Added Daniel to the discussion..
>
> On Tue, 2010-02-23 at 06:07 -0800, Eric W. Biederman wrote:
>> jamal <hadi@cyberus.ca> writes:
>
>> > Does the point after sys_setns(fd) allow me to do io inside
>> > ns <name>? Can i do open() and get a fd from ns <name>?
>> 
>> Yes.  My intention is that current->nsproxy->net_ns be changed.
>> We can already change it in unshare so this is feasible.
>
> I like it if it makes it as easy as it sounds;-> With lxc,
> i essentially have to create a proxy process inside the
> namespace that i use unix domain to open fds inside the ns.
> Do i still need that?

The point of the mount is to hold a persistent reference to the
namespace without using a process.

The point of the to-be-written set_ns call is to change
the default network namespace of the process such that all future
open/bind/socket calls happen in the referenced network namespace.
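
A sketch of what that would let a tool do, written against setns(2)
as it exists in current Linux and glibc, which may differ from the
call as planned here. The /proc/self/ns/net path used to get back to
the original namespace and the socket_in_netns() name are assumptions
that do not come from this thread.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a socket inside the network namespace
 * referenced by ns_path (for example a mount under /var/run/netns),
 * then switch the caller back to the namespace it started in.
 * Returns the socket fd, or -1 on error. */
int socket_in_netns(const char *ns_path, int domain, int type)
{
    int sock = -1;

    /* Keep a handle on the current namespace so we can return to it. */
    int orig = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
    if (orig < 0)
        return -1;

    int target = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (target < 0) {
        close(orig);
        return -1;
    }

    if (setns(target, CLONE_NEWNET) == 0) {
        /* From here on, socket() is created in the target namespace. */
        sock = socket(domain, type, 0);

        /* Switch back; give up the socket if that fails, so the
         * caller never continues in the wrong namespace unnoticed. */
        if (setns(orig, CLONE_NEWNET) != 0 && sock >= 0) {
            close(sock);
            sock = -1;
        }
    }

    close(target);
    close(orig);
    return sock;
}
```

The returned socket stays bound to the target namespace after the
caller has switched back, so no proxy process is needed inside the
namespace.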

There are a few stray places like sysfs where it is the mount point,
not current->nsproxy->net_ns, that will determine what we see.

>> > The only problem that i see is events are not as nice. I take it i am 
>> > going to get something like an inotify when a new namespace is created?
>> 
>> Yes.  Inotify would at the very least see that mkdir.  You could also
>> use poll on /proc/mounts to see the set of mounts change.
>
> It is not as nice but livable. I suppose attributes of the specific
> namespace are retrieved somewhere there as well..

Attributes of the specific namespace?

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 20:00                         ` Eric W. Biederman
@ 2010-02-23 23:09                           ` jamal
  2010-02-24  1:43                             ` Eric W. Biederman
                                               ` (3 more replies)
       [not found]                           ` <m1r5obbu2w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-23 23:49                           ` Matt Helsley
  2 siblings, 4 replies; 184+ messages in thread
From: jamal @ 2010-02-23 23:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Patrick McHardy, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

On Tue, 2010-02-23 at 12:00 -0800, Eric W. Biederman wrote:

> The point of the mount is to hold a persistent reference to the
> namespace without using a process.
> 
> The point of the to-be-written set_ns call is to change
> the default network namespace of the process such that all future
> open/bind/socket calls happen in the referenced network namespace.

Ok, i like it ;-> Patches RSN? Let me know if you want someone to test..

> There are a few stray places like sysfs where it is the mount point,
> not current->nsproxy->net_ns, that will determine what we see.

Is sysfs considered "usable enough" for namespaces?

> Attributes of the specific namespace?

Well, for example what is being un/shared etc.

cheers,
jamal

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 20:00                         ` Eric W. Biederman
  2010-02-23 23:09                           ` jamal
       [not found]                           ` <m1r5obbu2w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-23 23:49                           ` Matt Helsley
  2010-02-24  1:32                             ` Eric W. Biederman
       [not found]                             ` <20100223234942.GO3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2 siblings, 2 replies; 184+ messages in thread
From: Matt Helsley @ 2010-02-23 23:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

On Tue, Feb 23, 2010 at 12:00:55PM -0800, Eric W. Biederman wrote:
> jamal <hadi@cyberus.ca> writes:
> 
> > Added Daniel to the discussion..
> >
> > On Tue, 2010-02-23 at 06:07 -0800, Eric W. Biederman wrote:
> >> jamal <hadi@cyberus.ca> writes:
> >
> >> > Does the point after sys_setns(fd) allow me to do io inside
> >> > ns <name>? Can i do open() and get a fd from ns <name>?
> >> 
> >> Yes.  My intention is that current->nsproxy->net_ns be changed.
> >> We can already change it in unshare so this is feasible.
> >
> > I like it if it makes it as easy as it sounds;-> With lxc,
> > i essentially have to create a proxy process inside the
> > namespace that i use unix domain to open fds inside the ns.
> > Do i still need that?
> 
> The point of the mount is to hold a persistent reference to the
> namespace without using a process.

I think technically it's still held using processes, only now it's
much more indirect:

netns <- mount <- mount namespace(s) <- process(es)

The big difference is we'd be waiting for all the processes
sharing that mount (or dups of it in multiple mount namespaces) to
exit too -- not just those sharing the netns.

Using a mount requires keeping names for the namespaces themselves
in the kernel which is a problem we've largely avoided so far.
The nscgroup is an example of the messes that creates, I think. And it
further complicates c/r -- we'd need to checkpoint and recreate the
names of the namespaces too. So we'll need a namespace for the names of
the namespaces to make restart reliable won't we? Makes my head spin...

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 23:49                           ` Matt Helsley
@ 2010-02-24  1:32                             ` Eric W. Biederman
  2010-02-24  1:39                               ` Serge E. Hallyn
       [not found]                               ` <m18waj2zc8.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                             ` <20100223234942.GO3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-24  1:32 UTC (permalink / raw)
  To: Matt Helsley
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Matt Helsley <matthltc@us.ibm.com> writes:

> On Tue, Feb 23, 2010 at 12:00:55PM -0800, Eric W. Biederman wrote:
>> jamal <hadi@cyberus.ca> writes:
>> 
>> > Added Daniel to the discussion..
>> >
>> > On Tue, 2010-02-23 at 06:07 -0800, Eric W. Biederman wrote:
>> >> jamal <hadi@cyberus.ca> writes:
>> >
>> >> > Does the point after sys_setns(fd) allow me to do io inside
>> >> > ns <name>? Can i do open() and get a fd from ns <name>?
>> >> 
>> >> Yes.  My intention is that current->nsproxy->net_ns be changed.
>> >> We can already change it in unshare so this is feasible.
>> >
>> > I like it if it makes it as easy as it sounds;-> With lxc,
>> > i essentially have to create a proxy process inside the
>> > namespace that i use unix domain to open fds inside the ns.
>> > Do i still need that?
>> 
>> The point of the mount is to hold a persistent reference to the
>> namespace without using a process.
>
> I think technically it's still held using processes, only now it's
> much more indirect:
>
> netns <- mount <- mount namespace(s) <- process(es)

True. The practical difference is that it doesn't require a dedicated
process which is a big improvement operationally.

> Using a mount requires keeping names for the namespaces themselves
> in the kernel which is a problem we've largely avoided so far.
> The nscgroup is an example of the messes that creates, I think. And it
> further complicates c/r -- we'd need to checkpoint and recreate the
> names of the namespaces too. So we'll need a namespace for the names of
> the namespaces to make restart reliable won't we? Makes my head spin...

This is strictly different.  It may require a bit of extra support from
checkpoint/restart because it introduces some more user visible objects
but the names themselves are nothing special.  The name that userspace
sees and deals with is the name of the mount point.  No new namespaces
are required.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-24  1:32                             ` Eric W. Biederman
@ 2010-02-24  1:39                               ` Serge E. Hallyn
       [not found]                               ` <m18waj2zc8.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 0 replies; 184+ messages in thread
From: Serge E. Hallyn @ 2010-02-24  1:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Matt Helsley <matthltc@us.ibm.com> writes:
> 
> > On Tue, Feb 23, 2010 at 12:00:55PM -0800, Eric W. Biederman wrote:
> >> jamal <hadi@cyberus.ca> writes:
> >> 
> >> > Added Daniel to the discussion..
> >> >
> >> > On Tue, 2010-02-23 at 06:07 -0800, Eric W. Biederman wrote:
> >> >> jamal <hadi@cyberus.ca> writes:
> >> >
> >> >> > Does the point after sys_setns(fd) allow me to do io inside
> >> >> > ns <name>? Can i do open() and get a fd from ns <name>?
> >> >> 
> >> >> Yes.  My intention is that current->nsproxy->net_ns be changed.
> >> >> We can already change it in unshare so this is feasible.
> >> >
> >> > I like it if it makes it as easy as it sounds;-> With lxc,
> >> > i essentially have to create a proxy process inside the
> >> > namespace that i use unix domain to open fds inside the ns.
> >> > Do i still need that?
> >> 
> >> The point of the mount is to hold a persistent reference to the
> >> namespace without using a process.
> >
> > I think technically it's still held using processes, only now it's
> > much more indirect:
> >
> > netns <- mount <- mount namespace(s) <- process(es)
> 
> True. The practical difference is that it doesn't require a dedicated
> process which is a big improvement operationally.
> 
> > Using a mount requires keeping names for the namespaces themselves
> > in the kernel which is a problem we've largely avoided so far.
> > The nscgroup is an example of the messes that creates, I think. And it
> > further complicates c/r -- we'd need to checkpoint and recreate the
> > names of the namespaces too. So we'll need a namespace for the names of
> > the namespaces to make restart reliable won't we? Makes my head spin...
> 
> This is strictly different.  It may require a bit of extra support from
> checkpoint/restart because it introduces some more user visible objects
> but the names themselves are nothing special.  The name that userspace
> sees and deals with is the name of the mount point.  No new namespaces
> are required.

Shouldn't be a big deal - assuming the mount is of a special type
for a network ns, we just record the objref for the checkpointed
network ns.  We don't need a namespace for the namespaces - we just
need unique names for the checkpoint image (the objref, which is
unique per netns).

Guess it really is about time that I work on some clean patches
for checkpoint/restart of mount namespaces and mounts.

-serge

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"
  2010-02-23 23:09                           ` jamal
@ 2010-02-24  1:43                             ` Eric W. Biederman
  2010-02-24  1:43                             ` Eric W. Biederman
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-24  1:43 UTC (permalink / raw)
  To: hadi-fAAogVwAN2Kw5LPnMra/2Q
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

jamal <hadi@cyberus.ca> writes:

> On Tue, 2010-02-23 at 12:00 -0800, Eric W. Biederman wrote:
>
>> The point of the mount is to hold a persistent reference to the
>> namespace without using a process.
>>
>> The point of the to-be-written set_ns call is to change
>> the default network namespace of the process such that all future
>> open/bind/socket calls happen in the referenced network namespace.
>
> Ok, i like it ;-> Patches RSN? Let me know if you want someone to test..

My target will be 2.6.35.   There is an old prototype implementation
that hit the containers list and I think netdev a year or so ago.
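
To make the set_ns semantics quoted above concrete, here is a minimal
userspace sketch.  It does not use the prototype mentioned here; it assumes
the setns(2) call and the /proc/<pid>/ns/net files available in Linux 3.0
and later, and the function name socket_in_netns is made up for
illustration.  After a successful switch, socket() creates its socket in
the referenced network namespace.

```c
#define _GNU_SOURCE		/* for setns() and CLONE_NEWNET */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Switch the calling thread into the network namespace referred to by
 * ns_path, then create a UDP socket there.  Returns the socket fd, or -1
 * with errno set by the failing call.
 *
 * Assumption: ns_path names a namespace file such as /proc/<pid>/ns/net,
 * or a bind mount of one.  The caller stays in the new namespace after
 * this function returns.
 */
int socket_in_netns(const char *ns_path)
{
	int nsfd, saved_errno;

	assert(ns_path != NULL);

	nsfd = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (nsfd < 0)
		return -1;

	if (setns(nsfd, CLONE_NEWNET) < 0) {
		saved_errno = errno;	/* close() may overwrite errno */
		close(nsfd);
		errno = saved_errno;
		return -1;
	}
	close(nsfd);	/* the namespace reference is now held by the task */

	return socket(AF_INET, SOCK_DGRAM, 0);
}
```

Called as socket_in_netns("/proc/1234/ns/net") with sufficient privilege
(CAP_SYS_ADMIN), the returned socket lives in the network namespace of pid
1234.  Without that privilege setns() fails and the function returns -1.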

>> There are a few stray places like sysfs where it is the mount point,
>> not current->nsproxy->net_ns, that will determine what we see.
>
> Is sysfs considered "usable enough" for namespaces?

Mine is ;) I had a bad cold and didn't get through all of the patches
this development cycle, just all the prereqs.  I plan on getting that
final conversation started as soon as 2.6.34-rc1 hits.

>> Attributes of the specific namespace?
>
> Well, for example, what is being un/shared etc.

Got it.  Implementation-wise I'm going to stash a pointer
to the namespace in an inode or super block, simple.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-23 23:09                           ` jamal
                                               ` (2 preceding siblings ...)
  2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
@ 2010-02-25 20:57                             ` Eric W. Biederman
  3 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 20:57 UTC (permalink / raw)
  To: hadi-fAAogVwAN2Kw5LPnMra/2Q
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano


Introduce two new system calls:
int nsfd(pid_t pid, unsigned long nstype);
int setns(unsigned long nstype, int fd);

These two new system calls address three specific problems that can
make namespaces hard to work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the
  child of the original creator.
- Namespaces don't have names that userspace can use to talk
  about them.

The nsfd() system call returns a file descriptor that can
be used to talk about a specific namespace, and to keep
the specified namespace alive.

The fd returned by nsfd() can be bind mounted as:
mount --bind /proc/self/fd/N /some/filesystem/path
to keep the namespace alive indefinitely as long as
it is mounted.

open works on the fd returned by nsfd() so another
process can get a hold of it and do interesting things.

Overall that allows for persistent naming of namespaces
according to userspace policy.

setns() allows changing the namespace of the current process
to a namespace that originates with nsfd().

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

This is just my first pass at this, and not yet compile tested.
I was pleasantly surprised at how easy all of this was to implement.

I have verified mount will let me bind mount /proc/self/fd/N so
there is nothing special needed for the mount case, except
getting the reference counting and lifetime rules correct for
my filesystem objects.
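
For reference, the bind mount step described above can also be done
directly with mount(2).  This is only a sketch of that one step: it takes
an already open file descriptor, however it was obtained, and the helper
names proc_fd_path and pin_ns_fd are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mount.h>

/* Format "/proc/self/fd/<fd>" into buf; returns 0, or -1 if it does not fit. */
int proc_fd_path(int fd, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);
	if (fd < 0)
		return -1;

	n = snprintf(buf, len, "/proc/self/fd/%d", fd);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Rough equivalent of: mount --bind /proc/self/fd/N target
 *
 * Assumptions: target already exists as a regular file, and the caller
 * has CAP_SYS_ADMIN.  Returns 0 on success, -1 on failure.
 */
int pin_ns_fd(int fd, const char *target)
{
	char src[64];

	if (proc_fd_path(fd, src, sizeof(src)) < 0)
		return -1;

	/* With MS_BIND the filesystem type and data arguments are ignored. */
	return mount(src, target, NULL, MS_BIND, NULL);
}
```

Once pin_ns_fd() succeeds the original descriptor can be closed; the
mount keeps the reference until target is unmounted.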

 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/include/asm/unistd_64.h   |    4 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 fs/Makefile                        |    2 +-
 fs/nsfd.c                          |  278 ++++++++++++++++++++++++++++++++++++
 include/linux/magic.h              |    1 +
 include/linux/nsproxy.h            |    1 +
 include/linux/nstype.h             |    6 +
 kernel/nsproxy.c                   |   17 +++
 10 files changed, 315 insertions(+), 2 deletions(-)
 create mode 100644 fs/nsfd.c
 create mode 100644 include/linux/nstype.h

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 53147ad..9fd33de 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,6 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad sys_nsfd
+	.quad sys_setns
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3baf379..5b7833c 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_nsfd		338
+#define __NR_setns		339
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 340
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 4843f7b..260d542 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_nsfd				300
+__SYSCALL(__NR_nsfd, sys_nsfd)
+#define __NR_setns				301
+__SYSCALL(__NR_setns, sys_setns)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 15228b5..e09a45b 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long sys_nsfd
+	.long sys_setns
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..74d5091 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o
+		stack.o fs_struct.o nsfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/nsfd.c b/fs/nsfd.c
new file mode 100644
index 0000000..71bcc55
--- /dev/null
+++ b/fs/nsfd.c
@@ -0,0 +1,278 @@
+#include <linux/nstype.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <net/net_namespace.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/cred.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/nsproxy.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+
+static struct vfsmount *nsfd_mnt __read_mostly;
+static struct inode *nsfd_inode;
+
+static const struct file_operations nsfd_file_operations = {
+	.llseek = no_llseek,
+};
+
+
+static int nsfd_get_sb(struct file_system_type *fs_type, int flags,
+	const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	return get_sb_pseudo(fs_type, "nsfd:", NULL, NSFD_FS_MAGIC, mnt);
+}
+
+static char *nsfd_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	static const char name[] = "nsfd";
+
+	if (sizeof(name) > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	return memcpy(buffer, name, sizeof(name));
+}
+
+static const struct dentry_operations nsfd_dentry_operations = {
+	.d_dname		= nsfd_dname,
+};
+
+static struct file_system_type nsfd_fs_type = {
+	.name		= "nsfd",
+	.get_sb		= nsfd_get_sb,
+	.kill_sb	= kill_anon_super,
+	
+};
+
+static void netns_dentry_release(struct dentry *dentry)
+{
+	put_net(dentry->d_fsdata);
+	dentry->d_fsdata = NULL;
+}
+
+static const struct dentry_operations netns_dentry_operations = {
+	.d_dname	= nsfd_dname,
+	.d_release	= netns_dentry_release,
+};
+
+static const struct dentry_operations *nsfd_dops[] = {
+	[NSTYPE_NET] = &netns_dentry_operations,
+};
+
+static const struct dentry_operations *nstype_dops(unsigned long nstype)
+{
+	const struct dentry_operations *d_op = NULL;
+
+	if (nstype < sizeof(nsfd_dops)/sizeof(nsfd_dops[0]))
+		d_op = nsfd_dops[nstype];
+
+	return d_op;
+}
+
+static struct file *nsfd_fget(int fd, unsigned long nstype)
+{
+	const struct dentry_operations *d_op;
+	struct file *file;
+
+	d_op = nstype_dops(nstype);
+	if (!d_op)
+		return ERR_PTR(-EINVAL);
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+
+	if (file->f_op != &nsfd_file_operations)
+		goto out_invalid;
+
+	if (file->f_path.dentry->d_op != d_op)
+		goto out_invalid;
+
+	return file;
+
+out_invalid:
+	fput(file);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct inode *nsfd_mkinode(void)
+{
+	struct inode *inode;
+	inode = new_inode(nsfd_mnt->mnt_sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode->i_fop = &nsfd_file_operations;
+
+	/*
+	 * Mark the inode dirty from the very beginning,
+	 * that way it will never be moved to the dirty
+	 * list because mark_inode_dirty() will think that
+	 * it already _is_ on the dirty list.
+	 */
+	inode->i_state = I_DIRTY;
+	inode->i_mode = S_IRUSR | S_IWUSR;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
+	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	return inode;
+}
+
+
+static struct file *nsfd_getfile(void)
+{
+	struct qstr name = { .name = "" };
+	struct path path;
+	struct file *file;
+
+	path.dentry = d_alloc(nsfd_mnt->mnt_sb->s_root, &name);
+	if (!path.dentry)
+		return ERR_PTR(-ENOMEM);
+
+	path.mnt = mntget(nsfd_mnt);
+
+	/*
+	 * We know the nsfd_inode inode count is always greater than zero,
+	 * so we can avoid doing an igrab() and we can use an open-coded
+	 * atomic_inc().
+	 */
+	atomic_inc(&nsfd_inode->i_count);
+	path.dentry->d_op = &nsfd_dentry_operations;
+	d_instantiate(path.dentry, nsfd_inode);
+
+	file = alloc_file(&path, FMODE_READ, &nsfd_file_operations);
+	if (!file) {
+		path_put(&path);
+		return ERR_PTR(-ENFILE);
+	}
+	file->f_mapping = nsfd_inode->i_mapping;
+
+	file->f_pos = 0;
+	file->f_flags = O_RDONLY;
+	file->f_version = 0;
+	file->private_data = NULL;
+
+	return file;
+}
+
+static void *nsfd_getns(pid_t pid, unsigned long nstype)
+{
+	struct task_struct *task;
+	struct nsproxy *nsproxy;
+	void *ns;
+
+	ns = ERR_PTR(-ESRCH);
+	rcu_read_lock();
+	if (pid == 0)
+		task = current;
+	else
+		task = find_task_by_vpid(pid);
+	if (!task)
+		goto out;
+
+	ns = ERR_PTR(-EPERM);
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
+		goto out;
+
+	ns = ERR_PTR(-ESRCH);
+	nsproxy = task_nsproxy(task);
+	if (!nsproxy)
+		goto out;
+
+	ns = ERR_PTR(-EINVAL);
+	switch(nstype) {
+	case NSTYPE_NET:
+		ns = get_net(nsproxy->net_ns);
+		break;
+	}
+out:
+	rcu_read_unlock();
+	return ns;
+}
+
+SYSCALL_DEFINE2(nsfd, pid_t, pid, unsigned long, nstype)
+{
+	const struct dentry_operations *d_op;
+	struct file *file;
+	int fd;
+	void *ns;
+
+	d_op = nstype_dops(nstype);
+	if (!d_op)
+		return -EINVAL;
+
+	file = nsfd_getfile();
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ns = nsfd_getns(pid, nstype);
+	if (IS_ERR(ns)) {
+		fput(file);
+		return PTR_ERR(ns);
+	}
+
+	file->f_dentry->d_fsdata = ns;
+	file->f_dentry->d_op = d_op;
+	
+	fd = get_unused_fd();
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+}
+
+
+SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
+{
+	struct file *file;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	file = nsfd_fget(fd, nstype);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	set_namespace(nstype, file->f_dentry->d_fsdata);
+
+	fput(file);
+	return 0;
+}
+
+
+static int __init nsfd_init(void)
+{
+	int error;
+
+	error = register_filesystem(&nsfd_fs_type);
+	if (error)
+		goto err_exit;
+
+	nsfd_mnt  = kern_mount(&nsfd_fs_type);
+	if (IS_ERR(nsfd_mnt)) {
+		error = PTR_ERR(nsfd_mnt);
+		goto err_unregister_filesystem;
+	}
+
+	nsfd_inode = nsfd_mkinode();
+	if (IS_ERR(nsfd_inode)) {
+		error = PTR_ERR(nsfd_inode);
+		goto err_mntput;
+	}
+
+	return 0;
+
+err_mntput:
+	mntput(nsfd_mnt);
+err_unregister_filesystem:
+	unregister_filesystem(&nsfd_fs_type);
+err_exit:
+	panic(KERN_ERR "nsfd_init() failed (%d)\n", error);
+}
+
+fs_initcall(nsfd_init);
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 76285e0..a4fe6eb 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -26,6 +26,7 @@
 #define ISOFS_SUPER_MAGIC	0x9660
 #define JFFS2_SUPER_MAGIC	0x72b6
 #define ANON_INODE_FS_MAGIC	0x09041934
+#define NSFD_FS_MAGIC		0x6e736664
 
 #define MINIX_SUPER_MAGIC	0x137F		/* original minix fs */
 #define MINIX_SUPER_MAGIC2	0x138F		/* minix fs, 30 char names */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 7b370c7..45f1e07 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -65,6 +65,7 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 int copy_namespaces(unsigned long flags, struct task_struct *tsk);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+void set_namespace(unsigned long nstype, void *ns);
 void free_nsproxy(struct nsproxy *ns);
 int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
 	struct fs_struct *);
diff --git a/include/linux/nstype.h b/include/linux/nstype.h
new file mode 100644
index 0000000..3bdf856
--- /dev/null
+++ b/include/linux/nstype.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_NSTYPE_H
+#define _LINUX_NSTYPE_H
+
+#define NSTYPE_NET 0
+
+#endif /* _LINUX_NSTYPE_H */
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..574461c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -21,6 +21,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/nstype.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -221,6 +222,22 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+void set_namespace(unsigned long nstype, void *ns)
+{
+	struct task_struct *tsk = current;
+	struct nsproxy *new_nsproxy;
+
+	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+	switch(nstype) {
+	case NSTYPE_NET:
+		put_net(new_nsproxy->net_ns);
+		new_nsproxy->net_ns = get_net(ns);
+		break;
+	}
+
+	switch_task_namespaces(tsk, new_nsproxy);
+}
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
-- 
1.6.5.2.143.g8cc62

^ permalink raw reply related	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-25 21:31                                 ` Daniel Lezcano
  2010-02-25 21:46                                 ` Matt Helsley
                                                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-02-25 21:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Eric W. Biederman wrote:
> Introduce two new system calls:
> int nsfd(pid_t pid, unsigned long nstype);
> int setns(unsigned long nstype, int fd);
>
> These two new system calls address three specific problems that can
> make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the
>   child of the original creator.
> - Namespaces don't have names that userspace can use to talk
>   about them.
>
> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.
>
> The fd returned by nsfd() can be bind mounted as:
> mount --bind /proc/self/fd/N /some/filesystem/path
> to keep the namespace alive indefinitely as long as
> it is mounted.
>
> open works on the fd returned by nsfd() so another
> process can get a hold of it and do interesting things.
>
> Overall that allows for persistent naming of namespaces
> according to userspace policy.
>
> setns() allows changing the namespace of the current process
> to a namespace that originates with nsfd().
>
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>   

Is it planned to support all the namespaces for 'nsfd'?
I mean, will it be possible to specify an OR'ed combination of nstype values
to grab references to several namespaces of the targeted process at once?

For example: nsfd(1234, NSTYPE_NET | NSTYPE_IPC | NSTYPE_MNT)

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
@ 2010-02-25 21:31                               ` Daniel Lezcano
  2010-02-25 21:49                                 ` Eric W. Biederman
       [not found]                                 ` <4B86EC45.3060005-GANU6spQydw@public.gmane.org>
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
                                                 ` (3 subsequent siblings)
  4 siblings, 2 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-02-25 21:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Eric W. Biederman wrote:
> Introduce two new system calls:
> int nsfd(pid_t pid, unsigned long nstype);
> int setns(unsigned long nstype, int fd);
>
> These two new system calls address three specific problems that can
> make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the
>   child of the original creator.
> - Namespaces don't have names that userspace can use to talk
>   about them.
>
> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.
>
> The fd returned by nsfd() can be bind mounted as:
> mount --bind /proc/self/fd/N /some/filesystem/path
> to keep the namespace alive indefinitely as long as
> it is mounted.
>
> open works on the fd returned by nsfd() so another
> process can get a hold of it and do interesting things.
>
> Overall that allows for persistent naming of namespaces
> according to userspace policy.
>
> setns() allows changing the namespace of the current process
> to a namespace that originates with nsfd().
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>   

Is it planned to support all the namespaces for 'nsfd'?
I mean, will it be possible to specify an OR'ed combination of nstype values
to grab references to several namespaces of the targeted process at once?

For example: nsfd(1234, NSTYPE_NET | NSTYPE_IPC | NSTYPE_MNT)

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-25 21:31                                 ` Daniel Lezcano
@ 2010-02-25 21:46                                 ` Matt Helsley
  2010-02-26  1:09                                 ` Matt Helsley
                                                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 184+ messages in thread
From: Matt Helsley @ 2010-02-25 21:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

On Thu, Feb 25, 2010 at 12:57:02PM -0800, Eric W. Biederman wrote:
> 
> Introduce two new system calls:
> int nsfd(pid_t pid, unsigned long nstype);
> int setns(unsigned long nstype, int fd);
> 
> These two new system calls address three specific problems that can
> make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the
>   child of the original creator.
> - Namespaces don't have names that userspace can use to talk
>   about them.
> 
> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.
> 
> The fd returned by nsfd() can be bind mounted as:
> mount --bind /proc/self/fd/N /some/filesystem/path
> to keep the namespace alive indefinitely as long as
> it is mounted.
> 
> open works on the fd returned by nsfd() so another
> process can get a hold of it and do interesting things.
> 
> Overall that allows for persistent naming of namespaces
> according to userspace policy.
> 
> setns() allows changing the namespace of the current process
> to a namespace that originates with nsfd().
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Hi Eric,

	Seems like an ok concept to me. Did you try doing this with
anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
them alive and name them?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
  2010-02-25 21:31                               ` Daniel Lezcano
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-25 21:46                               ` Matt Helsley
  2010-02-25 21:54                                 ` Eric W. Biederman
                                                   ` (2 more replies)
  2010-02-26  1:09                               ` Matt Helsley
  2010-02-26  3:15                               ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
  4 siblings, 3 replies; 184+ messages in thread
From: Matt Helsley @ 2010-02-25 21:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

On Thu, Feb 25, 2010 at 12:57:02PM -0800, Eric W. Biederman wrote:
> 
> Introduce two new system calls:
> int nsfd(pid_t pid, unsigned long nstype);
> int setns(unsigned long nstype, int fd);
> 
> These two new system calls address three specific problems that can
> make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the
>   child of the original creator.
> - Namespaces don't have names that userspace can use to talk
>   about them.
> 
> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.
> 
> The fd returned by nsfd() can be bind mounted as:
> mount --bind /proc/self/fd/N /some/filesystem/path
> to keep the namespace alive indefinitely as long as
> it is mounted.
> 
> open works on the fd returned by nsfd() so another
> process can get a hold of it and do interesting things.
> 
> Overall that allows for persistent naming of namespaces
> according to userspace policy.
> 
> setns() allows changing the namespace of the current process
> to a namespace that originates with nsfd().
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Hi Eric,

	Seems like an ok concept to me. Did you try doing this with
anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
them alive and name them?

Cheers,
	-Matt Helsley
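
The bind-mount step quoted above (mount --bind /proc/self/fd/N /some/filesystem/path)
can be sketched in C with mount(2) and MS_BIND. This is only an illustration,
not code from the patch: the helper names fd_proc_path() and pin_fd_at() are
made up here, the target path must already exist, and whether the kernel
accepts the bind mount for a particular fd is a separate question that the
thread goes on to discuss.

```c
#include <stdio.h>
#include <sys/mount.h>

/*
 * Build the /proc/self/fd/N path for descriptor 'fd'.
 * Returns 0 on success, -1 if 'buf' is too small.
 * (Hypothetical helper, not part of the posted patch.)
 */
static int fd_proc_path(int fd, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Rough equivalent of: mount --bind /proc/self/fd/N target
 * Returns 0 on success, -1 on error (for example when the caller lacks
 * privilege or 'target' does not exist).
 * (Hypothetical helper, not part of the posted patch.)
 */
static int pin_fd_at(int fd, const char *target)
{
	char src[32];

	if (fd_proc_path(fd, src, sizeof(src)) < 0)
		return -1;
	return mount(src, target, NULL, MS_BIND, NULL);
}
```

A caller would create an empty file at the chosen path first and then call
pin_fd_at(fd, path); the mount, rather than any process, then holds the
reference.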

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                 ` <4B86EC45.3060005-GANU6spQydw@public.gmane.org>
@ 2010-02-25 21:49                                   ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 21:49 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Introduce two new system calls:
>> int nsfd(pid_t pid, unsigned long nstype);
>> int setns(unsigned long nstype, int fd);
>>
>> These two new system calls address three specific problems that can
>> make namespaces hard to work with.
>> - Namespaces require a dedicated process to pin them in memory.
>> - It is not possible to use a namespace unless you are the
>>   child of the original creator.
>> - Namespaces don't have names that userspace can use to talk
>>   about them.
>>
>> The nsfd() system call returns a file descriptor that can
>> be used to talk about a specific namespace, and to keep
>> the specified namespace alive.
>>
>> The fd returned by nsfd() can be bind mounted as:
>> mount --bind /proc/self/fd/N /some/filesystem/path
>> to keep the namespace alive indefinitely as long as
>> it is mounted.
>>
>> open works on the fd returned by nsfd() so another
>> process can get a hold of it and do interesting things.
>>
>> Overall that allows for persistent naming of namespaces
>> according to userspace policy.
>>
>> setns() allows changing the namespace of the current process
>> to a namespace that originates with nsfd().
>>
>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> ---
>>   
>
> Is it planned to support all the namespaces for 'nsfd' ?
> I mean will it be possible to specify an Or'ed combination of nstype to grab a
> reference for several namespaces at a time of the targeted process ?
>
> for example : nsfd( 1234, NSTYPE_NET | NSTYPE_IPC, NSTYPE_MNT)

No, the plan is only one namespace at a time.

It would not be much of a change to support multiple namespaces,
but I don't think I want to go there.  Bitmaps filling up are
ugly and I don't see what would be gained.

It does make sense to support all of the namespaces we can support
with unshare, but with nstype as an enumeration not as a bitmap.

This is slightly better than the earlier version that used a netlink
socket as the reference as I can give it the semantics of a deleted
file and only when that file goes away drop the reference on the
namespace.  It is also better in that this interface can support all
of the namespaces, without adding yet another syscall.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 21:31                               ` Daniel Lezcano
@ 2010-02-25 21:49                                 ` Eric W. Biederman
       [not found]                                   ` <m1mxyx0yv7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
                                                     ` (2 more replies)
       [not found]                                 ` <4B86EC45.3060005-GANU6spQydw@public.gmane.org>
  1 sibling, 3 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 21:49 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> Introduce two new system calls:
>> int nsfd(pid_t pid, unsigned long nstype);
>> int setns(unsigned long nstype, int fd);
>>
>> These two new system calls address three specific problems that can
>> make namespaces hard to work with.
>> - Namespaces require a dedicated process to pin them in memory.
>> - It is not possible to use a namespace unless you are the
>>   child of the original creator.
>> - Namespaces don't have names that userspace can use to talk
>>   about them.
>>
>> The nsfd() system call returns a file descriptor that can
>> be used to talk about a specific namespace, and to keep
>> the specified namespace alive.
>>
>> The fd returned by nsfd() can be bind mounted as:
>> mount --bind /proc/self/fd/N /some/filesystem/path
>> to keep the namespace alive indefinitely as long as
>> it is mounted.
>>
>> open works on the fd returned by nsfd() so another
>> process can get a hold of it and do interesting things.
>>
>> Overall that allows for persistent naming of namespaces
>> according to userspace policy.
>>
>> setns() allows changing the namespace of the current process
>> to a namespace that originates with nsfd().
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>   
>
> Is it planned to support all the namespaces for 'nsfd' ?
> I mean will it be possible to specify an Or'ed combination of nstype to grab a
> reference for several namespaces at a time of the targeted process ?
>
> for example : nsfd( 1234, NSTYPE_NET | NSTYPE_IPC, NSTYPE_MNT)

No, the plan is only one namespace at a time.

It would not be much of a change to support multiple namespaces,
but I don't think I want to go there.  Bitmaps filling up are
ugly and I don't see what would be gained.

It does make sense to support all of the namespaces we can support
with unshare, but with nstype as an enumeration not as a bitmap.

This is slightly better than the earlier version that used a netlink
socket as the reference as I can give it the semantics of a deleted
file and only when that file goes away drop the reference on the
namespace.  It is also better in that this interface can support all
of the namespaces, without adding yet another syscall.

Eric
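
To illustrate the enumeration-versus-bitmap point, here is a minimal sketch.
Only NSTYPE_NET == 0 comes from the posted nstype.h; NSTYPE_IPC, NSTYPE_MNT,
NSTYPE_COUNT and both helper functions are hypothetical and exist purely for
the example.

```c
#include <stdbool.h>

/*
 * NSTYPE_NET == 0 is taken from the posted nstype.h.
 * Every other name in this enum is a hypothetical addition.
 */
enum nstype {
	NSTYPE_NET = 0,
	NSTYPE_IPC = 1,		/* hypothetical */
	NSTYPE_MNT = 2,		/* hypothetical */
	NSTYPE_COUNT		/* hypothetical: number of known types */
};

/* An enumerated nstype names exactly one namespace, so a range check is enough. */
static bool nstype_is_valid(unsigned long nstype)
{
	return nstype < NSTYPE_COUNT;
}

/* What a caller would actually pass if it OR'ed two enumerated values together. */
static unsigned long nstype_or(unsigned long a, unsigned long b)
{
	return a | b;
}
```

With this numbering, NSTYPE_NET | NSTYPE_IPC evaluates to 1, which cannot be
told apart from NSTYPE_IPC alone, and NSTYPE_IPC | NSTYPE_MNT evaluates to 3,
which is not a valid type at all. A bitmap interface would need a different,
one-bit-per-type numbering.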

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                 ` <20100225214656.GS3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-02-25 21:54                                   ` Eric W. Biederman
  2010-02-26  0:53                                   ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 21:54 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> 	Seems like an ok concept to me. Did you try doing this with
> anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
> them alive and name them?

I used a normal file.  anon_inodes strictly speaking might work, but they
keep their state in the struct file not in the struct dentry.  So even
if the anon_inodes survived they would not be good for anything.  Otherwise
I would have just reused the anon_inodes.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 21:46                               ` Matt Helsley
@ 2010-02-25 21:54                                 ` Eric W. Biederman
       [not found]                                 ` <20100225214656.GS3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-02-26  0:53                                 ` Eric W. Biederman
  2 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 21:54 UTC (permalink / raw)
  To: Matt Helsley
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn

Matt Helsley <matthltc@us.ibm.com> writes:

> 	Seems like an ok concept to me. Did you try doing this with
> anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
> them alive and name them?

I used a normal file.  anon_inodes strictly speaking might work, but they
keep their state in the struct file not in the struct dentry.  So even
if the anon_inodes survived they would not be good for anything.  Otherwise
I would have just reused the anon_inodes.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                   ` <m1mxyx0yv7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-25 22:13                                     ` Daniel Lezcano
  2010-02-26 20:35                                     ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-02-25 22:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Introduce two new system calls:
>>> int nsfd(pid_t pid, unsigned long nstype);
>>> int setns(unsigned long nstype, int fd);
>>>
>>> These two new system calls address three specific problems that can
>>> make namespaces hard to work with.
>>> - Namespaces require a dedicated process to pin them in memory.
>>> - It is not possible to use a namespace unless you are the
>>>   child of the original creator.
>>> - Namespaces don't have names that userspace can use to talk
>>>   about them.
>>>
>>> The nsfd() system call returns a file descriptor that can
>>> be used to talk about a specific namespace, and to keep
>>> the specified namespace alive.
>>>
>>> The fd returned by nsfd() can be bind mounted as:
>>> mount --bind /proc/self/fd/N /some/filesystem/path
>>> to keep the namespace alive indefinitely as long as
>>> it is mounted.
>>>
>>> open works on the fd returned by nsfd() so another
>>> process can get a hold of it and do interesting things.
>>>
>>> Overall that allows for persistent naming of namespaces
>>> according to userspace policy.
>>>
>>> setns() allows changing the namespace of the current process
>>> to a namespace that originates with nsfd().
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>> ---
>>>   
>>>       
>> Is it planned to support all the namespaces for 'nsfd' ?
>> I mean will it be possible to specify an Or'ed combination of nstype to grab a
>> reference for several namespaces at a time of the targeted process ?
>>
>> for example : nsfd( 1234, NSTYPE_NET | NSTYPE_IPC, NSTYPE_MNT)
>>     
>
> No, the plan is only one namespace at a time.
>
> It would not be much of a change to support multiple namespaces,
> but I don't think I want to go there.  Bitmaps filling up are
> ugly and I don't see what would be gained.
>   
The idea I had in mind when I asked this question was whether we can "move" a
process into a container, i.e. a set of namespaces :)

> It does make sense to support all of the namespaces we can support
> with unshare, but with nstype as an enumeration not as a bitmap.
>   
I suppose that when you say "to support all of the namespaces we can support
with *unshare*", you exclude the pid namespace, which is created only
with clone, right? Do you think we can extend the concept to all the
namespaces, including the pid_namespace?

> This is slightly better than the earlier version that used a netlink
> socket as the reference as I can give it the semantics of a deleted
> file and only when that file goes away drop the reference on the
> namespace.  It is also better in that this interface can support all
> of the namespaces, without adding yet another syscall.
>   
I like the idea :)

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 21:49                                 ` Eric W. Biederman
       [not found]                                   ` <m1mxyx0yv7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-25 22:13                                   ` Daniel Lezcano
  2010-02-25 22:31                                     ` Eric W. Biederman
       [not found]                                     ` <4B86F5EC.60902-GANU6spQydw@public.gmane.org>
  2010-02-26 20:35                                   ` Eric W. Biederman
  2 siblings, 2 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-02-25 22:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Introduce two new system calls:
>>> int nsfd(pid_t pid, unsigned long nstype);
>>> int setns(unsigned long nstype, int fd);
>>>
>>> These two new system calls address three specific problems that can
>>> make namespaces hard to work with.
>>> - Namespaces require a dedicated process to pin them in memory.
>>> - It is not possible to use a namespace unless you are the
>>>   child of the original creator.
>>> - Namespaces don't have names that userspace can use to talk
>>>   about them.
>>>
>>> The nsfd() system call returns a file descriptor that can
>>> be used to talk about a specific namespace, and to keep
>>> the specified namespace alive.
>>>
>>> The fd returned by nsfd() can be bind mounted as:
>>> mount --bind /proc/self/fd/N /some/filesystem/path
>>> to keep the namespace alive indefinitely as long as
>>> it is mounted.
>>>
>>> open works on the fd returned by nsfd() so another
>>> process can get a hold of it and do interesting things.
>>>
>>> Overall that allows for persistent naming of namespaces
>>> according to userspace policy.
>>>
>>> setns() allows changing the namespace of the current process
>>> to a namespace that originates with nsfd().
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>> ---
>>>   
>>>       
>> Is it planned to support all the namespaces for 'nsfd' ?
>> I mean will it be possible to specify an Or'ed combination of nstype to grab a
>> reference for several namespaces at a time of the targeted process ?
>>
>> for example : nsfd( 1234, NSTYPE_NET | NSTYPE_IPC, NSTYPE_MNT)
>>     
>
> No, the plan is only one namespace at a time.
>
> It would not be much of a change to support multiple namespaces,
> but I don't think I want to go there.  Bitmaps filling up are
> ugly and I don't see what would be gained.
>   
The idea I had in mind when I asked this question was whether we can "move" a
process into a container, i.e. a set of namespaces :)

> It does make sense to support all of the namespaces we can support
> with unshare, but with nstype as an enumeration not as a bitmap.
>   
I suppose that when you say "to support all of the namespaces we can support
with *unshare*", you exclude the pid namespace, which is created only
with clone, right? Do you think we can extend the concept to all the
namespaces, including the pid_namespace?

> This is slightly better than the earlier version that used a netlink
> socket as the reference as I can give it the semantics of a deleted
> file and only when that file goes away drop the reference on the
> namespace.  It is also better in that this interface can support all
> of the namespaces, without adding yet another syscall.
>   
I like the idea :)


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                     ` <4B86F5EC.60902-GANU6spQydw@public.gmane.org>
@ 2010-02-25 22:31                                       ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 22:31 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

>> No, the plan is only one namespace at a time.
>>
>> It would not be much of a change to support multiple namespaces,
>> but I don't think I want to go there.  Bitmaps filling up are
>> ugly and I don't see what would be gained.
>>   
> The idea I had in mind when I asked this question was if we can "move" a process
> inside a container, aka a set of namespaces :)

Yes.

>> It does make sense to support all of the namespaces we can support
>> with unshare, but with nstype as an enumeration not as a bitmap.
>>   
> I suppose when you say "to support all of the namespaces we can support with
> *unshare*", you exclude the pid namespace which is created only with clone,
> right ? Do you think we can extend the concept to all the namespaces including
> the pid_namespace ?

Yes, and I think also the credential/uid namespace.

It is possible that this could be the basis for a general purpose
enter, but that is not the primary motivation.  I am after the
easy, simple cases, so that I can modify /sbin/ip to take advantage
of it.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 22:13                                   ` Daniel Lezcano
@ 2010-02-25 22:31                                     ` Eric W. Biederman
       [not found]                                     ` <4B86F5EC.60902-GANU6spQydw@public.gmane.org>
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-25 22:31 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Daniel Lezcano <daniel.lezcano@free.fr> writes:

>> No, the plan is only one namespace at a time.
>>
>> It would not be much of a change to support multiple namespaces,
>> but I don't think I want to go there.  Bitmaps filling up are
>> ugly and I don't see what would be gained.
>>   
> The idea I had in mind when I asked this question was if we can "move" a process
> inside a container, aka a set of namespaces :)

Yes.

>> It does make sense to support all of the namespaces we can support
>> with unshare, but with nstype as an enumeration not as a bitmap.
>>   
> I suppose when you say "to support all of the namespaces we can support with
> *unshare*", you exclude the pid namespace which is created only with clone,
> right ? Do you think we can extend the concept to all the namespaces including
> the pid_namespace ?

Yes, and I think also the credential/uid namespace.

It is possible that this could be the basis for a general purpose
enter, but that is not the primary motivation.  I am after the
easy, simple cases, so that I can modify /sbin/ip to take advantage
of it.

Eric
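
As a sketch of what "moving" a process into a set of namespaces looks like
when each call handles one namespace, the loop below joins them one at a time.
Note the assumption: it uses the setns(2) call and the /proc/<pid>/ns/ files
that current Linux provides, not the nsfd()/setns() signatures proposed in
this RFC, and the helper names join_one_ns() and join_namespaces() are made up
for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Join one namespace of process 'pid'.  'name' is an entry under
 * /proc/<pid>/ns, e.g. "net" or "ipc".  Returns 0 on success, -1 on error.
 * (Hypothetical helper; not the interface proposed in the RFC.)
 */
static int join_one_ns(pid_t pid, const char *name)
{
	char path[64];
	int fd, ret;

	snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, name);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	ret = setns(fd, 0);	/* 0: accept whatever namespace type fd refers to */
	close(fd);
	return ret;
}

/* Join several namespaces, one call per namespace; stop at the first failure. */
static int join_namespaces(pid_t pid, const char *const names[], int count)
{
	int i;

	for (i = 0; i < count; i++)
		if (join_one_ns(pid, names[i]) < 0)
			return -1;
	return 0;
}
```

Because the namespaces are joined one by one, a failure part-way through
leaves the caller in a mix of old and new namespaces, so a real tool would
normally do this in a child process that can simply exit on error.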


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                 ` <20100225214656.GS3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-02-25 21:54                                   ` Eric W. Biederman
@ 2010-02-26  0:53                                   ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26  0:53 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:


> 	Seems like an ok concept to me. Did you try doing this with
> anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
> them alive and name them?

Of course this part doesn't work in my patch because I have the wrong
mnt_ns on my mount and MS_NOUSER on my superblock.

MS_NOUSER is easy to get past.  Getting a vfsmount in the proper mnt
namespace could be tricky.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 21:46                               ` Matt Helsley
  2010-02-25 21:54                                 ` Eric W. Biederman
       [not found]                                 ` <20100225214656.GS3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-02-26  0:53                                 ` Eric W. Biederman
  2 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26  0:53 UTC (permalink / raw)
  To: Matt Helsley
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn

Matt Helsley <matthltc@us.ibm.com> writes:


> 	Seems like an ok concept to me. Did you try doing this with
> anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
> them alive and name them?

Of course this part doesn't work in my patch because I have the wrong
mnt_ns on my mount and MS_NOUSER on my superblock.

MS_NOUSER is easy to get past.  Getting a vfsmount in the proper mnt
namespace could be tricky.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-25 21:31                                 ` Daniel Lezcano
  2010-02-25 21:46                                 ` Matt Helsley
@ 2010-02-26  1:09                                 ` Matt Helsley
  2010-02-26  3:15                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
                                                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 184+ messages in thread
From: Matt Helsley @ 2010-02-26  1:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

On Thu, Feb 25, 2010 at 12:57:02PM -0800, Eric W. Biederman wrote:
> 
> Introduce two new system calls:
> int nsfd(pid_t pid, unsigned long nstype);
> int setns(unsigned long nstype, int fd);
> 
> These two new system calls address three specific problems that can
> make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the
>   child of the original creator.
> - Namespaces don't have names that userspace can use to talk
>   about them.
> 
> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.
> 
> The fd returned by nsfd() can be bind mounted as:
> mount --bind /proc/self/fd/N /some/filesystem/path
> to keep the namespace alive indefinitely as long as
> it is mounted.
> 
> open works on the fd returned by nsfd() so another
> process can get a hold of it and do interesting things.
> 
> Overall that allows for persistent naming of namespaces
> according to userspace policy.
> 
> setns() allows changing the namespace of the current process
> to a namespace that originates with nsfd().
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
> 
> This is just my first pass at this, and not yet compile tested.
> I was pleasantly surprised at how easy all of this was to implement.

<snip>

> +SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
> +{
> +	struct file *file;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;

Is this check preliminary? In the future would we check against the
owner of the target namespace too? Naturally that will require tagging
each namespace with an owner but I thought that was already part of the
plan...

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
                                                 ` (2 preceding siblings ...)
  2010-02-25 21:46                               ` Matt Helsley
@ 2010-02-26  1:09                               ` Matt Helsley
       [not found]                                 ` <20100226010915.GA20106-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-02-26  1:26                                 ` Eric W. Biederman
  2010-02-26  3:15                               ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
  4 siblings, 2 replies; 184+ messages in thread
From: Matt Helsley @ 2010-02-26  1:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

On Thu, Feb 25, 2010 at 12:57:02PM -0800, Eric W. Biederman wrote:
> 
> Introduce two new system calls:
> int nsfd(pid_t pid, unsigned long nstype);
> int setns(unsigned long nstype, int fd);
> 
> These two new system calls address three specific problems that can
> make namespaces hard to work with.
> - Namespaces require a dedicated process to pin them in memory.
> - It is not possible to use a namespace unless you are the
>   child of the original creator.
> - Namespaces don't have names that userspace can use to talk
>   about them.
> 
> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.
> 
> The fd returned by nsfd() can be bind mounted as:
> mount --bind /proc/self/fd/N /some/filesystem/path
> to keep the namespace alive indefinitely as long as
> it is mounted.
> 
> open works on the fd returned by nsfd() so another
> process can get a hold of it and do interesting things.
> 
> Overall that allows for persistent naming of namespaces
> according to userspace policy.
> 
> setns() allows changing the namespace of the current process
> to a namespace that originates with nsfd().
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
> 
> This is just my first pass at this, and not yet compile tested.
> I was pleasantly surprised at how easy all of this was to implement.

<snip>

> +SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
> +{
> +	struct file *file;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;

Is this check preliminary? In the future would we check against the
owner of the target namespace too? Naturally that will require tagging
each namespace with an owner but I thought that was already part of the
plan...

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                 ` <20100226010915.GA20106-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-02-26  1:26                                   ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26  1:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Ben Greear, Linux Netdev List,
	containers,
	Netfilter Development Mailinglist, Daniel Lezcano

Matt Helsley <matthltc@us.ibm.com> writes:

> On Thu, Feb 25, 2010 at 12:57:02PM -0800, Eric W. Biederman wrote:
>> 
>> Introduce two new system calls:
>> int nsfd(pid_t pid, unsigned long nstype);
>> int setns(unsigned long nstype, int fd);
>> 
>> These two new system calls address three specific problems that can
>> make namespaces hard to work with.
>> - Namespaces require a dedicated process to pin them in memory.
>> - It is not possible to use a namespace unless you are the
>>   child of the original creator.
>> - Namespaces don't have names that userspace can use to talk
>>   about them.
>> 
>> The nsfd() system call returns a file descriptor that can
>> be used to talk about a specific namespace, and to keep
>> the specified namespace alive.
>> 
>> The fd returned by nsfd() can be bind mounted as:
>> mount --bind /proc/self/fd/N /some/filesystem/path
>> to keep the namespace alive indefinitely as long as
>> it is mounted.
>> 
>> open works on the fd returned by nsfd() so another
>> process can get a hold of it and do interesting things.
>> 
>> Overall that allows for persistent naming of namespaces
>> according to userspace policy.
>> 
>> setns() allows changing the namespace of the current process
>> to a namespace that originates with nsfd().
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>> 
>> This is just my first pass at this, and not yet compile tested.
>> I was pleasantly surprised at how easy all of this was to implement.
>
> <snip>
>
>> +SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
>> +{
>> +	struct file *file;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>
> Is this check preliminary? In the future would we check against the
> owner of the target namespace too? Naturally that will require tagging
> each namespace with an owner but I thought that was already part of the
> plan...

We aren't modifying the namespace here, so namespace owners are
irrelevant.

We are modifying the process, so we need to have CAP_SYS_ADMIN in the
process's credential/uid namespace.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26  1:09                               ` Matt Helsley
       [not found]                                 ` <20100226010915.GA20106-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-02-26  1:26                                 ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26  1:26 UTC (permalink / raw)
  To: Matt Helsley
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn

Matt Helsley <matthltc@us.ibm.com> writes:

> On Thu, Feb 25, 2010 at 12:57:02PM -0800, Eric W. Biederman wrote:
>> 
>> Introduce two new system calls:
>> int nsfd(pid_t pid, unsigned long nstype);
>> int setns(unsigned long nstype, int fd);
>> 
>> These two new system calls address three specific problems that can
>> make namespaces hard to work with.
>> - Namespaces require a dedicated process to pin them in memory.
>> - It is not possible to use a namespace unless you are the
>>   child of the original creator.
>> - Namespaces don't have names that userspace can use to talk
>>   about them.
>> 
>> The nsfd() system call returns a file descriptor that can
>> be used to talk about a specific namespace, and to keep
>> the specified namespace alive.
>> 
>> The fd returned by nsfd() can be bind mounted as:
>> mount --bind /proc/self/fd/N /some/filesystem/path
>> to keep the namespace alive indefinitely as long as
>> it is mounted.
>> 
>> open works on the fd returned by nsfd() so another
>> process can get a hold of it and do interesting things.
>> 
>> Overall that allows for persistent naming of namespaces
>> according to userspace policy.
>> 
>> setns() allows changing the namespace of the current process
>> to a namespace that originates with nsfd().
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>> 
>> This is just my first pass at this, and not yet compile tested.
>> I was pleasantly surprised at how easy all of this was to implement.
>
> <snip>
>
>> +SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
>> +{
>> +	struct file *file;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>
> Is this check preliminary? In the future would we check against the
> owner of the target namespace too? Naturally that will require tagging
> each namespace with an owner but I thought that was already part of the
> plan...

We aren't modifying the namespace here, so namespace owners are
irrelevant.

We are modifying the process, so we need to have CAP_SYS_ADMIN in the
process's credential/uid namespace.
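
To make that concrete from the userspace side: the check is against the
caller's own effective capability set, which a process can read from the
CapEff line of /proc/self/status.  A minimal sketch, assuming the usual
capability number CAP_SYS_ADMIN == 21 from <linux/capability.h>; the
helper names below are made up for illustration and are not part of the
patch:

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to match CAP_SYS_ADMIN in <linux/capability.h>. */
#define EX_CAP_SYS_ADMIN 21

/*
 * Find the "CapEff:" line in /proc/<pid>/status style text and parse
 * its hex mask.  Returns 0 on success, -1 if the line is missing or
 * has no hex digits.
 */
int parse_cap_eff(const char *status_text, uint64_t *mask)
{
	const char *line = status_text;

	while (line) {
		if (strncmp(line, "CapEff:", 7) == 0) {
			const char *p = line + 7;

			/* Skip the separator, but never cross a newline. */
			while (*p == ' ' || *p == '\t')
				p++;
			if (!isxdigit((unsigned char)*p))
				return -1;
			*mask = strtoull(p, NULL, 16);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Returns 1 if bit 'cap' is set in the mask, 0 otherwise. */
int has_cap(uint64_t mask, unsigned int cap)
{
	return cap < 64 && ((mask >> cap) & 1);
}

/* Returns 1/0 for the calling process, or -1 if status is unreadable. */
int caller_has_cap_sys_admin(void)
{
	char buf[8192];
	uint64_t mask;
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	if (parse_cap_eff(buf, &mask) != 0)
		return -1;
	return has_cap(mask, EX_CAP_SYS_ADMIN);
}
```

A caller for which caller_has_cap_sys_admin() returns 0 would get -EPERM
from setns() as posted, no matter who created the target namespace.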

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
                                                   ` (2 preceding siblings ...)
  2010-02-26  1:09                                 ` Matt Helsley
@ 2010-02-26  3:15                                 ` Eric W. Biederman
  2010-02-26 21:13                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Pavel Emelyanov
  2010-05-27 12:28                                 ` [Devel] " Enrico Weigelt
  5 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26  3:15 UTC (permalink / raw)
  To: hadi
  Cc: Linux Netdev List,
	containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano


Introduce two new system calls:
int nsfd(pid_t pid, unsigned long nstype);
int setns(unsigned long nstype, int fd);

These two new system calls address three specific problems that can
make namespaces hard to work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the
  child of the original creator.
- Namespaces don't have names that userspace can use to talk
  about them.

The nsfd() system call returns a file descriptor that can
be used to talk about a specific namespace, and to keep
the specified namespace alive.

The file descriptor returned from nsfd has the lifetime
semantics of a deleted file.  As long as the fd is
open or it is bind mounted into the filesystem
namespace the namespace will be kept alive.

The fd returned by nsfd() can be bind mounted as:
mount --bind /proc/self/fd/N /some/filesystem/path

open works on the fd returned by nsfd() so another
process can get a hold of it and do interesting things.

Overall that allows for naming of namespaces with
userspace policy.

setns() allows changing the namespace of the current process
to a namespace that originates with nsfd().

v2: The code is tested and works in the common case.
    The vfs has some of the strangest rules...

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

Enough for one day.  This code works, now it just needs
some more use/testing and careful scrutiny before 2.6.35 rolls
around.
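
For review purposes, the lifetime rule in the description boils down to
one namespace reference per nsfd dentry: nsfd() takes it with get_net()
and ->d_release drops it with put_net().  The following is only a toy
userspace model of that counting, with invented names (toy_*); it is not
kernel code and it ignores locking entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct net: a refcount plus a "was freed" marker. */
struct toy_ns {
	int count;
	bool freed;
};

/* Stand-in for the nsfd dentry: pins one namespace via fsdata. */
struct toy_dentry {
	struct toy_ns *fsdata;
	int refs;
};

struct toy_ns *toy_get(struct toy_ns *ns)
{
	ns->count++;
	return ns;
}

void toy_put(struct toy_ns *ns)
{
	if (--ns->count == 0)
		ns->freed = true;
}

/* Models nsfd(): a fresh dentry takes its own namespace reference. */
void toy_nsfd(struct toy_dentry *d, struct toy_ns *ns)
{
	d->fsdata = toy_get(ns);
	d->refs = 1;
}

/* Models a bind mount or a second open(): one more dentry user. */
void toy_dget(struct toy_dentry *d)
{
	d->refs++;
}

/* Models close()/umount: the last user triggers the ->d_release step. */
void toy_dput(struct toy_dentry *d)
{
	if (--d->refs == 0) {
		toy_put(d->fsdata);
		d->fsdata = NULL;
	}
}
```

In this model the namespace outlives the process that created it for as
long as either the fd or the bind mount still holds the dentry.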

 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/include/asm/unistd_64.h   |    4 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 fs/Makefile                        |    2 +-
 fs/nsfd.c                          |  320 ++++++++++++++++++++++++++++++++++++
 include/linux/magic.h              |    1 +
 include/linux/nsproxy.h            |    1 +
 include/linux/nstype.h             |    6 +
 kernel/nsproxy.c                   |   17 ++
 10 files changed, 357 insertions(+), 2 deletions(-)
 create mode 100644 fs/nsfd.c
 create mode 100644 include/linux/nstype.h

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 53147ad..9fd33de 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,6 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad sys_nsfd
+	.quad sys_setns
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3baf379..5b7833c 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_nsfd		338
+#define __NR_setns		339
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 340
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 4843f7b..260d542 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_nsfd				300
+__SYSCALL(__NR_nsfd, sys_nsfd)
+#define __NR_setns				301
+__SYSCALL(__NR_setns, sys_setns)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 15228b5..e09a45b 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long sys_nsfd
+	.long sys_setns
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..74d5091 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o
+		stack.o fs_struct.o nsfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/nsfd.c b/fs/nsfd.c
new file mode 100644
index 0000000..ec04a1e
--- /dev/null
+++ b/fs/nsfd.c
@@ -0,0 +1,320 @@
+#include <linux/nstype.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <net/net_namespace.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/cred.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/nsproxy.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fs_struct.h>
+
+static struct vfsmount *nsfd_mnt __read_mostly;
+static struct inode *nsfd_inode;
+
+static const struct file_operations nsfd_file_operations = {
+	.llseek = no_llseek,
+};
+
+static const struct super_operations nsfd_super_operations = {
+	.statfs		= simple_statfs,
+};
+
+static char *nsfd_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	static const char name[] = "nsfd";
+
+	if (sizeof(name) > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	return memcpy(buffer, name, sizeof(name));
+}
+
+static const struct dentry_operations nsfd_dentry_operations = {
+	.d_dname		= nsfd_dname,
+};
+
+static struct inode *nsfd_mkinode(struct super_block *sb)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	inode->i_fop = &nsfd_file_operations;
+
+	/*
+	 * Mark the inode dirty from the very beginning,
+	 * that way it will never be moved to the dirty
+	 * list because mark_inode_dirty() will think that
+	 * it already _is_ on the dirty list.
+	 */
+	inode->i_state	= I_DIRTY;
+	inode->i_ino	= 1;
+	inode->i_mode	= S_IFREG | S_IRUSR | S_IWUSR;
+	inode->i_atime	= inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	inode->i_flags	= S_IMMUTABLE;
+
+	return inode;
+}
+
+static struct dentry *nsfd_alloc_dentry(struct inode *inode)
+{
+	struct dentry *dentry;
+
+	/*
+	 * We know the nsfd_inode inode count is always greater than zero,
+	 * so we can avoid doing an igrab() and we can use an open-coded
+	 * atomic_inc().
+	 */
+	dentry = d_alloc_root(inode);
+	if (dentry) {
+		atomic_inc(&inode->i_count);
+		dentry->d_op = &nsfd_dentry_operations;
+	}
+	return dentry;
+}
+
+static int nsfd_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *inode = NULL;
+
+	sb->s_flags		= 0;
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_blocksize		= PAGE_SIZE;
+	sb->s_blocksize_bits	= PAGE_SHIFT;
+	sb->s_magic 		= NSFD_FS_MAGIC;
+	sb->s_op		= &nsfd_super_operations;
+	sb->s_time_gran		= 1;
+
+	inode = nsfd_mkinode(sb);
+	if (!inode)
+		goto Enomem;
+
+	sb->s_root = nsfd_alloc_dentry(inode);
+	if (!sb->s_root)
+		goto Enomem;
+
+	/* Save the inode for later.. */
+	nsfd_inode = inode;
+
+	return 0;
+
+Enomem:
+	iput(inode);
+	return -ENOMEM;
+}
+
+static int nsfd_get_sb(struct file_system_type *fs_type, int flags,
+	const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	/* We can't use get_sb_pseudo because that sets MS_NOUSER */
+	return get_sb_single(fs_type, 0, NULL, nsfd_fill_super, mnt);
+}
+
+
+static struct file_system_type nsfd_fs_type = {
+	.name		= "nsfd",
+	.get_sb		= nsfd_get_sb,
+	.kill_sb	= kill_anon_super,
+	
+};
+
+static void netns_dentry_release(struct dentry *dentry)
+{
+	put_net(dentry->d_fsdata);
+	dentry->d_fsdata = NULL;
+}
+
+static const struct dentry_operations netns_dentry_operations = {
+	.d_dname	= nsfd_dname,
+	.d_release	= netns_dentry_release,
+};
+
+static const struct dentry_operations *nsfd_dops[] = {
+	[NSTYPE_NET] = &netns_dentry_operations,
+};
+
+static const struct dentry_operations *nstype_dops(unsigned long nstype)
+{
+	const struct dentry_operations *d_op = NULL;
+
+	if (nstype < sizeof(nsfd_dops)/sizeof(nsfd_dops[0]))
+		d_op = nsfd_dops[nstype];
+
+	return d_op;
+}
+
+static struct file *nsfd_fget(int fd, unsigned long nstype)
+{
+	const struct dentry_operations *d_op;
+	struct file *file;
+
+	d_op = nstype_dops(nstype);
+	if (!d_op)
+		return ERR_PTR(-EINVAL);
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+
+	if (file->f_op != &nsfd_file_operations)
+		goto out_invalid;
+
+	if (file->f_path.dentry->d_op != d_op)
+		goto out_invalid;
+
+	return file;
+
+out_invalid:
+	fput(file);
+	return ERR_PTR(-EINVAL);
+}
+
+
+static struct file *nsfd_getfile(void)
+{
+	struct path path;
+	struct file *file;
+
+	path.dentry = nsfd_alloc_dentry(nsfd_inode);
+	if (!path.dentry)
+		return ERR_PTR(-ENOMEM);
+
+	/* HACK I need a vfsmnt with mnt_ns == current_nsproxy_mnt_ns
+	 * and (mnt_sb->s_flags & MS_NOUSER) == 0.  The only way I can
+	 * get such a vfsmount without having an instance of my filesystem
+	 * mounted in the namespace is to steal one.
+	 */
+	path.mnt = mntget(current->fs->root.mnt);
+
+	file = alloc_file(&path, FMODE_READ, &nsfd_file_operations);
+	if (!file) {
+		path_put(&path);
+		return ERR_PTR(-ENFILE);
+	}
+	file->f_mapping = nsfd_inode->i_mapping;
+
+	file->f_pos = 0;
+	file->f_flags = O_RDONLY;
+	file->f_version = 0;
+	file->private_data = NULL;
+
+	return file;
+}
+
+static void *nsfd_getns(pid_t pid, unsigned long nstype)
+{
+	struct task_struct *task;
+	struct nsproxy *nsproxy;
+	void *ns;
+
+	ns = ERR_PTR(-ESRCH);
+	rcu_read_lock();
+	if (pid == 0)
+		task = current;
+	else
+		task = find_task_by_vpid(pid);
+	if (!task)
+		goto out;
+
+	ns = ERR_PTR(-EPERM);
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
+		goto out;
+
+	ns = ERR_PTR(-ESRCH);
+	nsproxy = task_nsproxy(task);
+	if (!nsproxy)
+		goto out;
+
+	ns = ERR_PTR(-EINVAL);
+	switch(nstype) {
+	case NSTYPE_NET:
+		ns = get_net(nsproxy->net_ns);
+		break;
+	}
+out:
+	rcu_read_unlock();
+	return ns;
+}
+
+SYSCALL_DEFINE2(nsfd, pid_t, pid, unsigned long, nstype)
+{
+	const struct dentry_operations *d_op;
+	struct file *file;
+	int fd;
+	void *ns;
+
+	d_op = nstype_dops(nstype);
+	if (!d_op)
+		return -EINVAL;
+
+	file = nsfd_getfile();
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ns = nsfd_getns(pid, nstype);
+	if (IS_ERR(ns)) {
+		fput(file);
+		return PTR_ERR(ns);
+	}
+
+	file->f_dentry->d_fsdata = ns;
+	file->f_dentry->d_op = d_op;
+	
+	fd = get_unused_fd();
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+}
+
+
+SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
+{
+	struct file *file;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	file = nsfd_fget(fd, nstype);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	set_namespace(nstype, file->f_dentry->d_fsdata);
+
+	fput(file);
+	return 0;
+}
+
+
+static int __init nsfd_init(void)
+{
+	int error;
+
+	error = register_filesystem(&nsfd_fs_type);
+	if (error)
+		goto err_exit;
+
+	nsfd_mnt  = kern_mount(&nsfd_fs_type);
+	if (IS_ERR(nsfd_mnt)) {
+		error = PTR_ERR(nsfd_mnt);
+		goto err_unregister_filesystem;
+	}
+
+	return 0;
+
+err_unregister_filesystem:
+	unregister_filesystem(&nsfd_fs_type);
+err_exit:
+	panic(KERN_ERR "nsfd_init() failed (%d)\n", error);
+}
+
+fs_initcall(nsfd_init);
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 76285e0..a4fe6eb 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -26,6 +26,7 @@
 #define ISOFS_SUPER_MAGIC	0x9660
 #define JFFS2_SUPER_MAGIC	0x72b6
 #define ANON_INODE_FS_MAGIC	0x09041934
+#define NSFD_FS_MAGIC		0x6e736664
 
 #define MINIX_SUPER_MAGIC	0x137F		/* original minix fs */
 #define MINIX_SUPER_MAGIC2	0x138F		/* minix fs, 30 char names */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 7b370c7..45f1e07 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -65,6 +65,7 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 int copy_namespaces(unsigned long flags, struct task_struct *tsk);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+void set_namespace(unsigned long nstype, void *ns);
 void free_nsproxy(struct nsproxy *ns);
 int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
 	struct fs_struct *);
diff --git a/include/linux/nstype.h b/include/linux/nstype.h
new file mode 100644
index 0000000..3bdf856
--- /dev/null
+++ b/include/linux/nstype.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_NSTYPE_H
+#define _LINUX_NSTYPE_H
+
+#define NSTYPE_NET 0
+
+#endif /* _LINUX_NSTYPE_H */
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..574461c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -21,6 +21,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/nstype.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -221,6 +222,22 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+void set_namespace(unsigned long nstype, void *ns)
+{
+	struct task_struct *tsk = current;
+	struct nsproxy *new_nsproxy;
+
+	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+	switch(nstype) {
+	case NSTYPE_NET:
+		put_net(new_nsproxy->net_ns);
+		new_nsproxy->net_ns = get_net(ns);
+		break;
+	}
+
+	switch_task_namespaces(tsk, new_nsproxy);
+}
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
-- 
1.6.5.2.143.g8cc62

^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2
  2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
                                                 ` (3 preceding siblings ...)
  2010-02-26  1:09                               ` Matt Helsley
@ 2010-02-26  3:15                               ` Eric W. Biederman
       [not found]                                 ` <m18wagy9f3.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  4 siblings, 1 reply; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26  3:15 UTC (permalink / raw)
  To: hadi
  Cc: Daniel Lezcano, Patrick McHardy, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Serge Hallyn,
	Matt Helsley


Introduce two new system calls:
int nsfd(pid_t pid, unsigned long nstype);
int setns(unsigned long nstype, int fd);

These two new system calls address three specific problems that can
make namespaces hard to work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the
  child of the original creator.
- Namespaces don't have names that userspace can use to talk
  about them.

The nsfd() system call returns a file descriptor that can
be used to talk about a specific namespace, and to keep
the specified namespace alive.

The file descriptor returned from nsfd has the lifetime
semantics of a deleted file.  As long as the fd is
open or it is bind mounted into the filesystem
namespace the namespace will be kept alive.

The fd returned by nsfd() can be bind mounted as:
mount --bind /proc/self/fd/N /some/filesystem/path

open works on the fd returned by nsfd() so another
process can get a hold of it and do interesting things.

Overall that allows for naming of namespaces with
userspace policy.

setns() allows changing the namespace of the current process
to a namespace that originates with nsfd().

v2: The code is tested and works in the common case.
    The vfs has some of the strangest rules...

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

Enough for one day.  This code works, now it just needs
some more use/testing and careful scrutiny before 2.6.35 rolls
around.
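
For anyone who wants to try it, the intended userspace flow is roughly
the sketch below.  Treat it as an illustration under assumptions: the
syscall numbers are the x86_64 ones this patch assigns and are not
stable ABI, the ex_* names are made up, the bind-mount target is assumed
to be an existing regular file, and the wrappers must only be called on
a kernel that actually carries this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* x86_64 numbers from this patch; meaningless on an unpatched kernel. */
#define EX_NR_NSFD	300L
#define EX_NR_SETNS	301L
#define EX_NSTYPE_NET	0UL	/* NSTYPE_NET from include/linux/nstype.h */

int ex_nsfd(pid_t pid, unsigned long nstype)
{
	return (int)syscall(EX_NR_NSFD, (long)pid, nstype);
}

int ex_setns(unsigned long nstype, int fd)
{
	return (int)syscall(EX_NR_SETNS, nstype, (long)fd);
}

/* Build the /proc/self/fd/N source path; -1 if buf is too small. */
int ex_fd_path(int fd, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Pin the network namespace of 'pid' by bind mounting it on 'target'. */
int ex_pin_netns(pid_t pid, const char *target)
{
	char src[64];
	int ret;
	int fd = ex_nsfd(pid, EX_NSTYPE_NET);

	if (fd < 0)
		return -1;
	ret = ex_fd_path(fd, src, sizeof(src));
	if (ret == 0)
		ret = mount(src, target, NULL, MS_BIND, NULL);
	close(fd);	/* on success the bind mount now holds the namespace */
	return ret;
}

/* Later, possibly from an unrelated process: join the pinned namespace. */
int ex_join_netns(const char *path)
{
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	ret = ex_setns(EX_NSTYPE_NET, fd);
	close(fd);
	return ret;
}
```

With that, ex_pin_netns(0, "/some/filesystem/path") followed by
ex_join_netns() on the same path from another process would exercise
both syscalls and the bind mount case.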

 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/include/asm/unistd_64.h   |    4 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 fs/Makefile                        |    2 +-
 fs/nsfd.c                          |  320 ++++++++++++++++++++++++++++++++++++
 include/linux/magic.h              |    1 +
 include/linux/nsproxy.h            |    1 +
 include/linux/nstype.h             |    6 +
 kernel/nsproxy.c                   |   17 ++
 10 files changed, 357 insertions(+), 2 deletions(-)
 create mode 100644 fs/nsfd.c
 create mode 100644 include/linux/nstype.h

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 53147ad..9fd33de 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,6 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad sys_nsfd
+	.quad sys_setns
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3baf379..5b7833c 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_nsfd		338
+#define __NR_setns		339
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 340
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 4843f7b..260d542 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_nsfd				300
+__SYSCALL(__NR_nsfd, sys_nsfd)
+#define __NR_setns				301
+__SYSCALL(__NR_setns, sys_setns)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 15228b5..e09a45b 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long sys_nsfd
+	.long sys_setns
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..74d5091 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o
+		stack.o fs_struct.o nsfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/nsfd.c b/fs/nsfd.c
new file mode 100644
index 0000000..ec04a1e
--- /dev/null
+++ b/fs/nsfd.c
@@ -0,0 +1,320 @@
+#include <linux/nstype.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <net/net_namespace.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/cred.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/nsproxy.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fs_struct.h>
+
+static struct vfsmount *nsfd_mnt __read_mostly;
+static struct inode *nsfd_inode;
+
+static const struct file_operations nsfd_file_operations = {
+	.llseek = no_llseek,
+};
+
+static const struct super_operations nsfd_super_operations = {
+	.statfs		= simple_statfs,
+};
+
+static char *nsfd_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	static const char name[] = "nsfd";
+
+	if (sizeof(name) > buflen)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	return memcpy(buffer, name, sizeof(name));
+}
+
+static const struct dentry_operations nsfd_dentry_operations = {
+	.d_dname		= nsfd_dname,
+};
+
+static struct inode *nsfd_mkinode(struct super_block *sb)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	inode->i_fop = &nsfd_file_operations;
+
+	/*
+	 * Mark the inode dirty from the very beginning,
+	 * that way it will never be moved to the dirty
+	 * list because mark_inode_dirty() will think that
+	 * it already _is_ on the dirty list.
+	 */
+	inode->i_state	= I_DIRTY;
+	inode->i_ino	= 1;
+	inode->i_mode	= S_IFREG | S_IRUSR | S_IWUSR;
+	inode->i_atime	= inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	inode->i_flags	= S_IMMUTABLE;
+
+	return inode;
+}
+
+static struct dentry *nsfd_alloc_dentry(struct inode *inode)
+{
+	struct dentry *dentry;
+
+	/*
+	 * We know the nsfd_inode inode count is always greater than zero,
+	 * so we can avoid doing an igrab() and we can use an open-coded
+	 * atomic_inc().
+	 */
+	dentry = d_alloc_root(inode);
+	if (dentry) {
+		atomic_inc(&inode->i_count);
+		dentry->d_op = &nsfd_dentry_operations;
+	}
+	return dentry;
+}
+
+static int nsfd_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct inode *inode = NULL;
+
+	sb->s_flags		= 0;
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_blocksize		= PAGE_SIZE;
+	sb->s_blocksize_bits	= PAGE_SHIFT;
+	sb->s_magic 		= NSFD_FS_MAGIC;
+	sb->s_op		= &nsfd_super_operations;
+	sb->s_time_gran		= 1;
+
+	inode = nsfd_mkinode(sb);
+	if (!inode)
+		goto Enomem;
+
+	sb->s_root = nsfd_alloc_dentry(inode);
+	if (!sb->s_root)
+		goto Enomem;
+
+	/* Save the inode for later.. */
+	nsfd_inode = inode;
+
+	return 0;
+
+Enomem:
+	iput(inode);
+	return -ENOMEM;
+}
+
+static int nsfd_get_sb(struct file_system_type *fs_type, int flags,
+	const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	/* We can't use get_sb_pseudo because that sets MS_NOUSER */
+	return get_sb_single(fs_type, 0, NULL, nsfd_fill_super, mnt);
+}
+
+
+static struct file_system_type nsfd_fs_type = {
+	.name		= "nsfd",
+	.get_sb		= nsfd_get_sb,
+	.kill_sb	= kill_anon_super,
+	
+};
+
+static void netns_dentry_release(struct dentry *dentry)
+{
+	put_net(dentry->d_fsdata);
+	dentry->d_fsdata = NULL;
+}
+
+static const struct dentry_operations netns_dentry_operations = {
+	.d_dname	= nsfd_dname,
+	.d_release	= netns_dentry_release,
+};
+
+static const struct dentry_operations *nsfd_dops[] = {
+	[NSTYPE_NET] = &netns_dentry_operations,
+};
+
+static const struct dentry_operations *nstype_dops(unsigned long nstype)
+{
+	const struct dentry_operations *d_op = NULL;
+
+	if (nstype < sizeof(nsfd_dops)/sizeof(nsfd_dops[0]))
+		d_op = nsfd_dops[nstype];
+
+	return d_op;
+}
+
+static struct file *nsfd_fget(int fd, unsigned long nstype)
+{
+	const struct dentry_operations *d_op;
+	struct file *file;
+
+	d_op = nstype_dops(nstype);
+	if (!d_op)
+		return ERR_PTR(-EINVAL);
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+
+	if (file->f_op != &nsfd_file_operations)
+		goto out_invalid;
+
+	if (file->f_path.dentry->d_op != d_op)
+		goto out_invalid;
+
+	return file;
+
+out_invalid:
+	fput(file);
+	return ERR_PTR(-EINVAL);
+}
+
+
+static struct file *nsfd_getfile(void)
+{
+	struct path path;
+	struct file *file;
+
+	path.dentry = nsfd_alloc_dentry(nsfd_inode);
+	if (!path.dentry)
+		return ERR_PTR(-ENOMEM);
+
+	/* HACK I need a vfsmnt with mnt_ns == current_nsproxy_mnt_ns
+	 * and (mnt_sb->s_flags & MS_NOUSER) == 0.  The only way I can
+	 * get such a vfsmount without having an instance of my filesystem
+	 * mounted in the namespace is to steal one.
+	 */
+	path.mnt = mntget(current->fs->root.mnt);
+
+	file = alloc_file(&path, FMODE_READ, &nsfd_file_operations);
+	if (!file) {
+		path_put(&path);
+		return ERR_PTR(-ENFILE);
+	}
+	file->f_mapping = nsfd_inode->i_mapping;
+
+	file->f_pos = 0;
+	file->f_flags = O_RDONLY;
+	file->f_version = 0;
+	file->private_data = NULL;
+
+	return file;
+}
+
+static void *nsfd_getns(pid_t pid, unsigned long nstype)
+{
+	struct task_struct *task;
+	struct nsproxy *nsproxy;
+	void *ns;
+
+	ns = ERR_PTR(-ESRCH);
+	rcu_read_lock();
+	if (pid == 0)
+		task = current;
+	else
+		task = find_task_by_vpid(pid);
+	if (!task)
+		goto out;
+
+	ns = ERR_PTR(-EPERM);
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
+		goto out;
+
+	ns = ERR_PTR(-ESRCH);
+	nsproxy = task_nsproxy(task);
+	if (!nsproxy)
+		goto out;
+
+	ns = ERR_PTR(-EINVAL);
+	switch(nstype) {
+	case NSTYPE_NET:
+		ns = get_net(nsproxy->net_ns);
+		break;
+	}
+out:
+	rcu_read_unlock();
+	return ns;
+}
+
+SYSCALL_DEFINE2(nsfd, pid_t, pid, unsigned long, nstype)
+{
+	const struct dentry_operations *d_op;
+	struct file *file;
+	int fd;
+	void *ns;
+
+	d_op = nstype_dops(nstype);
+	if (!d_op)
+		return -EINVAL;
+
+	file = nsfd_getfile();
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ns = nsfd_getns(pid, nstype);
+	if (IS_ERR(ns)) {
+		fput(file);
+		return PTR_ERR(ns);
+	}
+
+	file->f_dentry->d_fsdata = ns;
+	file->f_dentry->d_op = d_op;
+	
+	fd = get_unused_fd();
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+}
+
+
+SYSCALL_DEFINE2(setns, unsigned long, nstype, int, fd)
+{
+	struct file *file;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	file = nsfd_fget(fd, nstype);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	set_namespace(nstype, file->f_dentry->d_fsdata);
+
+	fput(file);
+	return 0;
+}
+
+
+static int __init nsfd_init(void)
+{
+	int error;
+
+	error = register_filesystem(&nsfd_fs_type);
+	if (error)
+		goto err_exit;
+
+	nsfd_mnt  = kern_mount(&nsfd_fs_type);
+	if (IS_ERR(nsfd_mnt)) {
+		error = PTR_ERR(nsfd_mnt);
+		goto err_unregister_filesystem;
+	}
+
+	return 0;
+
+err_unregister_filesystem:
+	unregister_filesystem(&nsfd_fs_type);
+err_exit:
+	panic(KERN_ERR "nsfd_init() failed (%d)\n", error);
+}
+
+fs_initcall(nsfd_init);
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 76285e0..a4fe6eb 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -26,6 +26,7 @@
 #define ISOFS_SUPER_MAGIC	0x9660
 #define JFFS2_SUPER_MAGIC	0x72b6
 #define ANON_INODE_FS_MAGIC	0x09041934
+#define NSFD_FS_MAGIC		0x6e736664
 
 #define MINIX_SUPER_MAGIC	0x137F		/* original minix fs */
 #define MINIX_SUPER_MAGIC2	0x138F		/* minix fs, 30 char names */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 7b370c7..45f1e07 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -65,6 +65,7 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 int copy_namespaces(unsigned long flags, struct task_struct *tsk);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+void set_namespace(unsigned long nstype, void *ns);
 void free_nsproxy(struct nsproxy *ns);
 int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
 	struct fs_struct *);
diff --git a/include/linux/nstype.h b/include/linux/nstype.h
new file mode 100644
index 0000000..3bdf856
--- /dev/null
+++ b/include/linux/nstype.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_NSTYPE_H
+#define _LINUX_NSTYPE_H
+
+#define NSTYPE_NET 0
+
+#endif /* _LINUX_NSTYPE_H */
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..574461c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -21,6 +21,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/nstype.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -221,6 +222,22 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+void set_namespace(unsigned long nstype, void *ns)
+{
+	struct task_struct *tsk = current;
+	struct nsproxy *new_nsproxy;
+
+	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+	switch(nstype) {
+	case NSTYPE_NET:
+		put_net(new_nsproxy->net_ns);
+		new_nsproxy->net_ns = get_net(ns);
+		break;
+	}
+
+	switch_task_namespaces(tsk, new_nsproxy);
+}
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
-- 
1.6.5.2.143.g8cc62
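
For readers following the patch, the intended userspace flow appears to be:
call nsfd() to obtain a file descriptor that pins a task's network namespace,
then hand that descriptor to setns(). Below is a minimal sketch of that flow,
assuming the argument order used in this RFC: nsfd(pid, nstype) and
setns(nstype, fd). The part of the patch shown here assigns no syscall
numbers, so the two calls are passed in as function pointers. The names
nsfd_fn, setns_fn, join_netns_of, stub_nsfd and stub_setns are hypothetical
and exist only for illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define NSTYPE_NET 0	/* value from include/linux/nstype.h in the patch */

/* Hypothetical stand-ins for the two proposed syscalls. */
typedef int (*nsfd_fn)(pid_t pid, unsigned long nstype);
typedef int (*setns_fn)(unsigned long nstype, int fd);

/*
 * Join the network namespace of @pid (0 means the caller itself,
 * as in nsfd_getns()).  Returns 0 on success, negative on failure.
 */
int join_netns_of(pid_t pid, nsfd_fn do_nsfd, setns_fn do_setns)
{
	int fd, ret;

	fd = do_nsfd(pid, NSTYPE_NET);
	if (fd < 0)
		return fd;

	ret = do_setns(NSTYPE_NET, fd);

	/*
	 * set_namespace() takes its own reference with get_net(), so the
	 * descriptor is no longer needed once setns() has returned.
	 */
	close(fd);
	return ret;
}

/* Stubs so the flow can be tried without a patched kernel. */
int stub_nsfd(pid_t pid, unsigned long nstype)
{
	(void)pid;
	if (nstype != NSTYPE_NET)
		return -1;
	return open("/dev/null", O_RDONLY);
}

int stub_setns(unsigned long nstype, int fd)
{
	return (nstype == NSTYPE_NET && fd >= 0) ? 0 : -1;
}
```

On a kernel carrying this patch, the stubs would be replaced by thin
syscall() wrappers once numbers are assigned.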


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                   ` <m1mxyx0yv7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-25 22:13                                     ` Daniel Lezcano
@ 2010-02-26 20:35                                     ` Eric W. Biederman
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 20:35 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano


> No, the plan is only one namespace at a time.

Looking at this a bit more I am frustrated and relieved.

I was looking at what it would take to join an arbitrary mount
namespace and I realized it is completely non-obvious what fs->root
and fs->pwd should be set to.

If I leave them untouched the new mount namespace is useless,
as all path lookups will give results in a different mount namespace,
so not even mount or umount can be used.

I can not change fs->root to mnt_ns->root as that is rootfs and there
is always something mounted on top so I can not use that.

In comparison an unshare of the mount namespace doesn't have to move
fs->root or fs->pwd at all and only has to update their mounts to
the corresponding mounts in the new mount namespace.

I might be able to find the topmost root filesystem and put at least
root there, but I'm not particularly fond of that option.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-25 21:49                                 ` Eric W. Biederman
       [not found]                                   ` <m1mxyx0yv7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-25 22:13                                   ` Daniel Lezcano
@ 2010-02-26 20:35                                   ` Eric W. Biederman
  2 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 20:35 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: hadi, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano


> No, the plan is only one namespace at a time.

Looking at this a bit more I am frustrated and relieved.

I was looking at what it would take to join an arbitrary mount
namespace and I realized it is completely non-obvious what fs->root
and fs->pwd should be set to.

If I leave them untouched the new mount namespace is useless,
as all path lookups will give results in a different mount namespace,
so not even mount or umount can be used.

I can not change fs->root to mnt_ns->root as that is rootfs and there
is always something mounted on top so I can not use that.

In comparison an unshare of the mount namespace doesn't have to move
fs->root or fs->pwd at all and only has to update their mounts to
the corresponding mounts in the new mount namespace.

I might be able to find the topmost root filesystem and put at least
root there, but I'm not particularly fond of that option.

Eric






^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
                                                   ` (3 preceding siblings ...)
  2010-02-26  3:15                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
@ 2010-02-26 21:13                                 ` Pavel Emelyanov
       [not found]                                   ` <4B883987.6090408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-02-26 21:24                                   ` Eric W. Biederman
  2010-05-27 12:28                                 ` [Devel] " Enrico Weigelt
  5 siblings, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-26 21:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

> +static struct inode *nsfd_mkinode(void)
> +{
> +	struct inode *inode;
> +	inode = new_inode(nsfd_mnt->mnt_sb);
> +	if (!inode)
> +		return ERR_PTR(-ENOMEM);
> +
> +	inode->i_fop = &nsfd_file_operations;
> +
> +	/*
> +	 * Mark the inode dirty from the very beginning,
> +	 * that way it will never be moved to the dirty
> +	 * list because mark_inode_dirty() will think that
> +	 * it already _is_ on the dirty list.
> +	 */
> +	inode->i_state = I_DIRTY;
> +	inode->i_mode = S_IRUSR | S_IWUSR;
> +	inode->i_uid = current_fsuid();
> +	inode->i_gid = current_fsgid();
> +	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> +	return inode;
> +}

Why not use anon inodes?

> diff --git a/include/linux/nstype.h b/include/linux/nstype.h
> new file mode 100644
> index 0000000..3bdf856
> --- /dev/null
> +++ b/include/linux/nstype.h
> @@ -0,0 +1,6 @@
> +#ifndef _LINUX_NSTYPE_H
> +#define _LINUX_NSTYPE_H
> +
> +#define NSTYPE_NET 0
> +
> +#endif /* _LINUX_NSTYPE_H */

Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
I currently have a way to create all namespaces we have with one
syscall. Why don't we have an ability to enter them all with one syscall?
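
To make the comparison concrete: namespace creation uses one CLONE_NEW* bit
per namespace kind, so any combination fits in a single flags argument to
clone() or unshare(), whereas the RFC's setns() takes a single nstype per
call. The sketch below only builds the mask and counts the calls. The helper
names ns_creation_mask and ns_enter_calls are hypothetical, and the flag
values are taken from <linux/sched.h>.

```c
#include <linux/sched.h>	/* CLONE_NEW* flag values */

/* Namespace kinds that have a CLONE_NEW* creation flag. */
const int ns_create_flags[] = {
	CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC,
	CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET,
};
#define NS_KINDS (sizeof(ns_create_flags) / sizeof(ns_create_flags[0]))

/* Creation: any set of namespaces folds into one flags argument. */
int ns_creation_mask(const int *flags, int n)
{
	int i, mask = 0;

	for (i = 0; i < n; i++)
		mask |= flags[i];
	return mask;
}

/*
 * Entering: with one namespace per setns() call, as in the RFC,
 * the same set needs one call per namespace kind present in @mask.
 */
int ns_enter_calls(int mask)
{
	unsigned int i;
	int calls = 0;

	for (i = 0; i < NS_KINDS; i++)
		if (mask & ns_create_flags[i])
			calls++;
	return calls;
}
```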

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                   ` <4B883987.6090408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-26 21:24                                     ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 21:24 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

>> +static struct inode *nsfd_mkinode(void)
>> +{
>> +	struct inode *inode;
>> +	inode = new_inode(nsfd_mnt->mnt_sb);
>> +	if (!inode)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	inode->i_fop = &nsfd_file_operations;
>> +
>> +	/*
>> +	 * Mark the inode dirty from the very beginning,
>> +	 * that way it will never be moved to the dirty
>> +	 * list because mark_inode_dirty() will think that
>> +	 * it already _is_ on the dirty list.
>> +	 */
>> +	inode->i_state = I_DIRTY;
>> +	inode->i_mode = S_IRUSR | S_IWUSR;
>> +	inode->i_uid = current_fsuid();
>> +	inode->i_gid = current_fsgid();
>> +	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>> +	return inode;
>> +}
>
> Why not use anon inodes?

Because you can't mount them anywhere.

>> diff --git a/include/linux/nstype.h b/include/linux/nstype.h
>> new file mode 100644
>> index 0000000..3bdf856
>> --- /dev/null
>> +++ b/include/linux/nstype.h
>> @@ -0,0 +1,6 @@
>> +#ifndef _LINUX_NSTYPE_H
>> +#define _LINUX_NSTYPE_H
>> +
>> +#define NSTYPE_NET 0
>> +
>> +#endif /* _LINUX_NSTYPE_H */
>
> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
> I currently have a way to create all namespaces we have with one
> syscall. Why don't we have an ability to enter them all with one syscall?

The CLONE_NEWXXX series of bits has been a royal pain to work with,
and it appears to be an unnecessary complication for no gain.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:13                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Pavel Emelyanov
       [not found]                                   ` <4B883987.6090408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-26 21:24                                   ` Eric W. Biederman
  2010-02-26 21:34                                     ` Pavel Emelyanov
       [not found]                                     ` <m1bpfbwuze.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 21:24 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

>> +static struct inode *nsfd_mkinode(void)
>> +{
>> +	struct inode *inode;
>> +	inode = new_inode(nsfd_mnt->mnt_sb);
>> +	if (!inode)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	inode->i_fop = &nsfd_file_operations;
>> +
>> +	/*
>> +	 * Mark the inode dirty from the very beginning,
>> +	 * that way it will never be moved to the dirty
>> +	 * list because mark_inode_dirty() will think that
>> +	 * it already _is_ on the dirty list.
>> +	 */
>> +	inode->i_state = I_DIRTY;
>> +	inode->i_mode = S_IRUSR | S_IWUSR;
>> +	inode->i_uid = current_fsuid();
>> +	inode->i_gid = current_fsgid();
>> +	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>> +	return inode;
>> +}
>
> Why not use anon inodes?

Because you can't mount them anywhere.

>> diff --git a/include/linux/nstype.h b/include/linux/nstype.h
>> new file mode 100644
>> index 0000000..3bdf856
>> --- /dev/null
>> +++ b/include/linux/nstype.h
>> @@ -0,0 +1,6 @@
>> +#ifndef _LINUX_NSTYPE_H
>> +#define _LINUX_NSTYPE_H
>> +
>> +#define NSTYPE_NET 0
>> +
>> +#endif /* _LINUX_NSTYPE_H */
>
> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
> I currently have a way to create all namespaces we have with one
> syscall. Why don't we have an ability to enter them all with one syscall?

The CLONE_NEWXXX series of bits has been a royal pain to work with,
and it appears to be an unnecessary complication for no gain.

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                     ` <m1bpfbwuze.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-26 21:34                                       ` Pavel Emelyanov
  2010-02-26 21:35                                       ` Pavel Emelyanov
  1 sibling, 0 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-26 21:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>> I currently have a way to create all namespaces we have with one
>> syscall. Why don't we have an ability to enter them all with one syscall?
> 
> The CLONE_NEWXXX series of bits has been a royal pain to work with,
> and it appears to be an unnecessary complication for no gain.

That's the answer for the "Yet another set..." question.
How about the "Why don't we have..." one?

> Eric
> 
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:24                                   ` Eric W. Biederman
@ 2010-02-26 21:34                                     ` Pavel Emelyanov
       [not found]                                       ` <4B883E6F.1060907-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-02-26 21:42                                       ` Eric W. Biederman
       [not found]                                     ` <m1bpfbwuze.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-26 21:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>> I currently have a way to create all namespaces we have with one
>> syscall. Why don't we have an ability to enter them all with one syscall?
> 
> The CLONE_NEWXXX series of bits has been a royal pain to work with,
> and it appears to be an unnecessary complication for no gain.

That's the answer for the "Yet another set..." question.
How about the "Why don't we have..." one?

> Eric
> 
> 


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                     ` <m1bpfbwuze.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-26 21:34                                       ` Pavel Emelyanov
@ 2010-02-26 21:35                                       ` Pavel Emelyanov
  2010-02-26 21:49                                         ` Eric W. Biederman
       [not found]                                         ` <4B883EAF.5020607-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-26 21:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>>> +static struct inode *nsfd_mkinode(void)
>>> +{
>>> +	struct inode *inode;
>>> +	inode = new_inode(nsfd_mnt->mnt_sb);
>>> +	if (!inode)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	inode->i_fop = &nsfd_file_operations;
>>> +
>>> +	/*
>>> +	 * Mark the inode dirty from the very beginning,
>>> +	 * that way it will never be moved to the dirty
>>> +	 * list because mark_inode_dirty() will think that
>>> +	 * it already _is_ on the dirty list.
>>> +	 */
>>> +	inode->i_state = I_DIRTY;
>>> +	inode->i_mode = S_IRUSR | S_IWUSR;
>>> +	inode->i_uid = current_fsuid();
>>> +	inode->i_gid = current_fsgid();
>>> +	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>>> +	return inode;
>>> +}
>> Why not use anon inodes?
> 
> Because you can't mount them anywhere.

Worth changing them that way?

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                       ` <4B883E6F.1060907-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-26 21:42                                         ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 21:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>> I currently have a way to create all namespaces we have with one
>>> syscall. Why don't we have an ability to enter them all with one syscall?
>> 
>> The CLONE_NEWXXX series of bits has been a royal pain to work with,
>> and it appears to be an unnecessary complication for no gain.
>
> That's the answer for the "Yet another set..." question.
> How about the "Why don't we have..." one?

I am not certain which question you are asking:

Why don't we have an ability to enter all namespaces with one syscall
invocation?

Why don't we have a syscall that allows us to enter every namespace?

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:34                                     ` Pavel Emelyanov
       [not found]                                       ` <4B883E6F.1060907-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-26 21:42                                       ` Eric W. Biederman
  2010-02-26 21:58                                         ` Oren Laadan
                                                           ` (2 more replies)
  1 sibling, 3 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 21:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>> I currently have a way to create all namespaces we have with one
>>> syscall. Why don't we have an ability to enter them all with one syscall?
>> 
>> The CLONE_NEWXXX series of bits has been a royal pain to work with,
>> and it appears to be an unnecessary complication for no gain.
>
> That's the answer for the "Yet another set..." question.
> How about the "Why don't we have..." one?

I am not certain which question you are asking:

Why don't we have an ability to enter all namespaces with one syscall
invocation?

Why don't we have a syscall that allows us to enter every namespace?

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                         ` <4B883EAF.5020607-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-26 21:49                                           ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 21:49 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>> 
>>>> +static struct inode *nsfd_mkinode(void)
>>>> +{
>>>> +	struct inode *inode;
>>>> +	inode = new_inode(nsfd_mnt->mnt_sb);
>>>> +	if (!inode)
>>>> +		return ERR_PTR(-ENOMEM);
>>>> +
>>>> +	inode->i_fop = &nsfd_file_operations;
>>>> +
>>>> +	/*
>>>> +	 * Mark the inode dirty from the very beginning,
>>>> +	 * that way it will never be moved to the dirty
>>>> +	 * list because mark_inode_dirty() will think that
>>>> +	 * it already _is_ on the dirty list.
>>>> +	 */
>>>> +	inode->i_state = I_DIRTY;
>>>> +	inode->i_mode = S_IRUSR | S_IWUSR;
>>>> +	inode->i_uid = current_fsuid();
>>>> +	inode->i_gid = current_fsgid();
>>>> +	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>>>> +	return inode;
>>>> +}
>>> Why not use anon inodes?
>> 
>> Because you can't mount them anywhere.
>
> Worth changing them that way?

I don't think so.  They keep all of their state in struct file.  To be
usefully bind mounted you need to keep your state in the dentry or the
inode.
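
A toy userspace model of this point, under loud assumptions: the toy_*
structures below are simplified stand-ins and not the kernel's struct file
or struct dentry. Each open of a path (for example through a bind mount)
is modelled as building a fresh file object around the same dentry. State
hung off the dentry, as the patch does with d_fsdata, is reachable from
every such open, while state kept only in one file's private_data is not.

```c
#include <stddef.h>

/* Simplified stand-ins; not the real kernel structures. */
struct toy_dentry {
	void *d_fsdata;		/* per-dentry state, shared by all opens */
};

struct toy_file {
	struct toy_dentry *dentry;
	void *private_data;	/* per-open state, starts out empty */
};

/* Model of open(): a new file object wrapped around an existing dentry. */
struct toy_file toy_open(struct toy_dentry *dentry)
{
	struct toy_file file = { .dentry = dentry, .private_data = NULL };

	return file;
}

/* What a later open can recover: only the state kept in the dentry. */
void *toy_state_visible_to(const struct toy_file *file)
{
	return file->private_data ? file->private_data
				  : file->dentry->d_fsdata;
}
```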

Ultimately what I have done is fix rootfs so it supports bind mounts and
used rootfs inodes.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:35                                       ` Pavel Emelyanov
@ 2010-02-26 21:49                                         ` Eric W. Biederman
       [not found]                                         ` <4B883EAF.5020607-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 21:49 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>>>> +static struct inode *nsfd_mkinode(void)
>>>> +{
>>>> +	struct inode *inode;
>>>> +	inode = new_inode(nsfd_mnt->mnt_sb);
>>>> +	if (!inode)
>>>> +		return ERR_PTR(-ENOMEM);
>>>> +
>>>> +	inode->i_fop = &nsfd_file_operations;
>>>> +
>>>> +	/*
>>>> +	 * Mark the inode dirty from the very beginning,
>>>> +	 * that way it will never be moved to the dirty
>>>> +	 * list because mark_inode_dirty() will think that
>>>> +	 * it already _is_ on the dirty list.
>>>> +	 */
>>>> +	inode->i_state = I_DIRTY;
>>>> +	inode->i_mode = S_IRUSR | S_IWUSR;
>>>> +	inode->i_uid = current_fsuid();
>>>> +	inode->i_gid = current_fsgid();
>>>> +	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>>>> +	return inode;
>>>> +}
>>> Why not use anon inodes?
>> 
>> Because you can't mount them anywhere.
>
> Worth changing them that way?

I don't think so.  They keep all of their state in struct file.  To be
usefully bind mounted you need to keep your state in the dentry or the
inode.

Ultimately what I have done is fix rootfs so it supports bind mounts and
used rootfs inodes.

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                         ` <m13a0nwu6p.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-26 21:58                                           ` Oren Laadan
  2010-02-27  8:30                                           ` Pavel Emelyanov
  1 sibling, 0 replies; 184+ messages in thread
From: Oren Laadan @ 2010-02-26 21:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano



Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>> I currently have a way to create all namespaces we have with one
>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>> The CLONE_NEWXXX series of bits has been a royal pain to work with,
>>> and it appears to be an unnecessary complication for no gain.
>> That's the answer for the "Yet another set..." question.
>> How about the "Why don't we have..." one?
> 
> I am not certain which question you are asking:
> 
> Why don't we have an ability to enter all namespaces with one syscall
> invocation?

That's how I understood the question, and I, too, wonder why not?

By the way, an alternative to using a bitmap is to change the prototype
of setns() to accept an array of FDs:

	int setns(int *fds, int nfds);

So the process will atomically enter all the namespaces as specified
by the FDs.
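
For comparison, here is a minimal sketch of what userspace would have to do
with a one-namespace-at-a-time interface: loop over the descriptors and join
each in turn. The names join_one_fn, join_each and stub_join are hypothetical;
join_one stands in for whatever single-namespace call the kernel ends up
providing.

```c
/* Hypothetical callback that joins the namespace behind one fd. */
typedef int (*join_one_fn)(int fd);

/*
 * Userspace emulation of the suggested setns(int *fds, int nfds).
 * Returns nfds if every namespace was joined.  Otherwise returns the
 * number joined before the first failure; those joins stay in effect,
 * which is the partial state a single kernel-side call could avoid.
 */
int join_each(const int *fds, int nfds, join_one_fn join_one)
{
	int i;

	for (i = 0; i < nfds; i++) {
		if (join_one(fds[i]) < 0)
			break;
	}
	return i;
}

/* Stub for trying the loop without a patched kernel: rejects bad fds. */
int stub_join(int fd)
{
	return fd >= 0 ? 0 : -1;
}
```

A return value smaller than nfds tells the caller it is now in a mix of old
and new namespaces, with no way to roll back from this loop alone.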

Oren.

> 
> Why don't we have a syscall that allows us to enter every namespace?
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:42                                       ` Eric W. Biederman
@ 2010-02-26 21:58                                         ` Oren Laadan
  2010-02-26 22:16                                           ` Eric W. Biederman
       [not found]                                           ` <4B8843FE.4000404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-02-27  8:30                                         ` Pavel Emelyanov
       [not found]                                         ` <m13a0nwu6p.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2 siblings, 2 replies; 184+ messages in thread
From: Oren Laadan @ 2010-02-26 21:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Ben Greear, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Daniel Lezcano



Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>> I currently have a way to create all namespaces we have with one
>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>> The CLONE_NEWXXX series of bits has been a royal pain to work with,
>>> and it appears to be an unnecessary complication for no gain.
>> That's the answer for the "Yet another set..." question.
>> How about the "Why don't we have..." one?
> 
> I am not certain which question you are asking:
> 
> Why don't we have an ability to enter all namespaces with one syscall
> invocation?

That's how I understood the question, and I, too, wonder why not?

By the way, an alternative to using a bitmap is to change the prototype
of setns() to accept an array of FDs:

	int setns(int *fds, int nfds);

So the process will atomically enter all the namespaces as specified
by the FDs.

Oren.

> 
> Why don't we have a syscall that allows us to enter every namespace?
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                           ` <4B8843FE.4000404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-02-26 22:16                                             ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 22:16 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>
>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>> I currently have a way to create all namespaces we have with one
>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>> The CLONE_NEWXXX series of bits has been a royal pain to work with,
>>>> and it appears to be an unnecessary complication for no gain.
>>> That's the answer for the "Yet another set..." question.
>>> How about the "Why don't we have..." one?
>>
>> I am not certain which question you are asking:
>>
>> Why don't we have an ability to enter all namespaces with one syscall
>> invocation?
>
> That's how I understood the question, and I, too, wonder why not ?
>
> By the way, an alternative to using bitmap is to change the prototype
> of setns() to accept an array of FD's:
>
> 	int setns(int *fds, int nfds);
>
> So the process will atomically enter all the namespaces as specified
> by the FDs.

We could.  Mostly I implemented things in the simplest way possible.

Semantically I know of no reason why we need to enter more than one namespace
at once, and I don't expect entering a namespace to be on anyone's fast
path so every last drop of performance was not crucial.

The only justification I can think of for more than one namespace at a
time is that because we have a synchronize_rcu() in the kernel we can
loop in the kernel much more quickly than we can loop in userspace.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:58                                         ` Oren Laadan
@ 2010-02-26 22:16                                           ` Eric W. Biederman
  2010-02-26 22:52                                             ` Oren Laadan
       [not found]                                             ` <m1zl2vtzg4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                                           ` <4B8843FE.4000404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 22:16 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Pavel Emelyanov, Ben Greear, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Daniel Lezcano

Oren Laadan <orenl@cs.columbia.edu> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>> I currently have a way to create all namespaces we have with one
>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>>> and it appears to be unnecessary complications for no gain.
>>> That's the answer for the "Yet another set..." question.
>>> How about the "Why don't we have..." one?
>>
>> I am not certain which question you are asking:
>>
>> Why don't we have an ability to enter all namespaces with one syscall
>> invocation?
>
> That's how I understood the question, and I, too, wonder why not ?
>
> By the way, an alternative to using bitmap is to change the prototype
> of setns() to accept an array of FD's:
>
> 	int setns(int *fds, int nfds);
>
> So the process will atomically enter all the namespaces as specified
> by the FDs.

We could.  Mostly I implemented things in the simplest way possible.

Semantically I know of no reason why we need to enter more than one namespace
at once, and I don't expect entering a namespace to be on anyone's fast
path so every last drop of performance was not crucial.

The only justification I can think of for more than one namespace at a
time is that because we have a synchronize_rcu() in the kernel we can
loop in the kernel much more quickly than we can loop in userspace.
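
For comparison, the userspace loop mentioned above could look roughly like
the sketch below. It assumes the two-argument setns(int fd, int nstype)
that glibc exposes today, which may differ from the prototype in the RFC
patch; enter_namespaces() is a hypothetical helper name, not part of any
proposal in this thread.

```c
#define _GNU_SOURCE
#include <sched.h>   /* setns() */
#include <stddef.h>

/*
 * Hypothetical helper: enter each namespace referred to by fds[] in order,
 * issuing one setns() call per descriptor.
 *
 * Returns 0 on success, or -1 with errno set by the failing setns() call.
 * This is NOT atomic: if call i fails, the caller has already joined the
 * namespaces behind fds[0..i-1].
 */
int enter_namespaces(const int *fds, int nfds)
{
	for (int i = 0; i < nfds; i++) {
		/* nstype 0: accept whatever namespace type the fd refers to */
		if (setns(fds[i], 0) == -1)
			return -1;
	}
	return 0;
}
```

The non-atomic failure case in the comment is the practical difference
between this loop and the array-taking setns() suggested above.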

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                             ` <m1zl2vtzg4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-26 22:52                                               ` Oren Laadan
  0 siblings, 0 replies; 184+ messages in thread
From: Oren Laadan @ 2010-02-26 22:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano



Eric W. Biederman wrote:
> Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>
>>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>>> I currently have a way to create all namespaces we have with one
>>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>>>> and it appears to be unnecessary complications for no gain.
>>>> That's the answer for the "Yet another set..." question.
>>>> How about the "Why don't we have..." one?
>>> I am not certain which question you are asking:
>>>
>>> Why don't we have an ability to enter all namespaces with one syscall
>>> invocation?
>> That's how I understood the question, and I, too, wonder why not ?
>>
>> By the way, an alternative to using bitmap is to change the prototype
>> of setns() to accept an array of FD's:
>>
>> 	int setns(int *fds, int nfds);
>>
>> So the process will atomically enter all the namespaces as specified
>> by the FDs.
> 
> We could.  Mostly I implemented things in the simplest way possible.
> 
> Semantically I know of no reason why need to enter more than one namespace
> at once, and I don't expect entering a namespace to be on anyone's fast
> path so every last drop of performance was not crucial.
> 
> The only justification I can think of for more than one namespace at a
> time is that because we have a synchronize_rcu() in the kernel we can
> loop in the kernel much more quickly than we can loop in userspace.

I can't think of a specific scenario, but I wonder if there would
be a problem (security or otherwise) with a process that only
partly belongs to a container, even if for a short time?

Oren.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 22:16                                           ` Eric W. Biederman
@ 2010-02-26 22:52                                             ` Oren Laadan
  2010-02-26 23:13                                               ` Eric W. Biederman
       [not found]                                               ` <4B885093.4070807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
       [not found]                                             ` <m1zl2vtzg4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Oren Laadan @ 2010-02-26 22:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Ben Greear, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Daniel Lezcano



Eric W. Biederman wrote:
> Oren Laadan <orenl@cs.columbia.edu> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>>> I currently have a way to create all namespaces we have with one
>>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>>>> and it appears to be unnecessary complications for no gain.
>>>> That's the answer for the "Yet another set..." question.
>>>> How about the "Why don't we have..." one?
>>> I am not certain which question you are asking:
>>>
>>> Why don't we have an ability to enter all namespaces with one syscall
>>> invocation?
>> That's how I understood the question, and I, too, wonder why not ?
>>
>> By the way, an alternative to using bitmap is to change the prototype
>> of setns() to accept an array of FD's:
>>
>> 	int setns(int *fds, int nfds);
>>
>> So the process will atomically enter all the namespaces as specified
>> by the FDs.
> 
> We could.  Mostly I implemented things in the simplest way possible.
> 
> Semantically I know of no reason why need to enter more than one namespace
> at once, and I don't expect entering a namespace to be on anyone's fast
> path so every last drop of performance was not crucial.
> 
> The only justification I can think of for more than one namespace at a
> time is that because we have a synchronize_rcu() in the kernel we can
> loop in the kernel much more quickly than we can loop in userspace.

I can't think of a specific scenario, but I wonder if there would
be a problem (security or otherwise) with a process that only
partly belongs to a container, even if for a short time?

Oren.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                               ` <4B885093.4070807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-02-26 23:13                                                 ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 23:13 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear, Daniel Lezcano

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:

> Can't think of a specific scenario, but I wonder if there would
> be a problem (security or otherwise) with a process that only
> partly belongs to a container, even if for a short time ?

If we can find an instance of that then there are fundamental problems
with setns.

The driving use case right now is for things like network namespaces where
userspace really wants to have several at once, and wants to be able to
control them all.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 22:52                                             ` Oren Laadan
@ 2010-02-26 23:13                                               ` Eric W. Biederman
       [not found]                                               ` <4B885093.4070807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-26 23:13 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Pavel Emelyanov, Ben Greear, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Daniel Lezcano

Oren Laadan <orenl@cs.columbia.edu> writes:

> Can't think of a specific scenario, but I wonder if there would
> be a problem (security or otherwise) with a process that only
> partly belongs to a container, even if for a short time ?

If we can find an instance of that then there are fundamental problems
with setns.

The driving use case right now is for things like network namespaces where
userspace really wants to have several at once, and wants to be able to
control them all.

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                         ` <m13a0nwu6p.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-26 21:58                                           ` Oren Laadan
@ 2010-02-27  8:30                                           ` Pavel Emelyanov
  1 sibling, 0 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27  8:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>> I currently have a way to create all namespaces we have with one
>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>> and it appears to be unnecessary complications for no gain.
>> That's the answer for the "Yet another set..." question.
>> How about the "Why don't we have..." one?
> 
> I am not certain which question you are asking:
> 
> Why don't we have an ability to enter all namespaces with one syscall
> invocation?

Exactly. Please add at least the NSTYPE_NSPROXY or whatever, that will
pin all namespaces of a given pid from the very beginning.

> Why don't we have a syscall that allows us to enter every namespace?

This one is done in the patch, no?

Although the approach is OK for me, there's one design issue that came
to my mind recently: can we use this fd to wait for a namespace to
stop? I currently don't see this ability, but this is something I require
badly.

Thoughts?

> Eric
> 
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-26 21:42                                       ` Eric W. Biederman
  2010-02-26 21:58                                         ` Oren Laadan
@ 2010-02-27  8:30                                         ` Pavel Emelyanov
       [not found]                                           ` <4B88D80A.8010701-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-02-27  9:04                                           ` Eric W. Biederman
       [not found]                                         ` <m13a0nwu6p.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2 siblings, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27  8:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>> I currently have a way to create all namespaces we have with one
>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>> and it appears to be unnecessary complications for no gain.
>> That's the answer for the "Yet another set..." question.
>> How about the "Why don't we have..." one?
> 
> I am not certain which question you are asking:
> 
> Why don't we have an ability to enter all namespaces with one syscall
> invocation?

Exactly. Please add at least the NSTYPE_NSPROXY or whatever, that will
pin all namespaces of a given pid from the very beginning.

> Why don't we have a syscall that allows us to enter every namespace?

This one is done in the patch, no?

Although the approach is OK for me, there's one design issue that came
to my mind recently: can we use this fd to wait for a namespace to
stop? I currently don't see this ability, but this is something I require
badly.

Thoughts?

> Eric
> 
> 


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                           ` <4B88D80A.8010701-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-27  9:04                                             ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27  9:04 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>> 
>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>> I currently have a way to create all namespaces we have with one
>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>>> and it appears to be unnecessary complications for no gain.
>>> That's the answer for the "Yet another set..." question.
>>> How about the "Why don't we have..." one?
>> 
>> I am not certain which question you are asking:
>> 
>> Why don't we have an ability to enter all namespaces with one syscall
>> invocation?
>
> Exactly. Please add at least the NSTYPE_NSPROXY or whatever, that will
> pin all namespaces of a given pid from the very beginning.

For nsfd(2) that is doable.  At least for now setns can't restore it.

>> Why don't we have a syscall that allows us to enter every namespace?
>
> This one is done in the patch, no?
>
> Although the approach is OK for me, there's one design issue, that came
> up to my mind recently: can we use this fd to wail for a namespace to 
> stop? I currently don't see this ability, but this is something I require
> badly.

I have designed these file descriptors to pin the namespaces, so
waiting for them to exit isn't something they can do now.  It makes a
lot of sense to have similar ones that take weak references to the namespaces
that we can use to wait for a namespace to exit.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27  8:30                                         ` Pavel Emelyanov
       [not found]                                           ` <4B88D80A.8010701-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-27  9:04                                           ` Eric W. Biederman
       [not found]                                             ` <m1mxyvrqvk.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 1 reply; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27  9:04 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>> I currently have a way to create all namespaces we have with one
>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>>> and it appears to be unnecessary complications for no gain.
>>> That's the answer for the "Yet another set..." question.
>>> How about the "Why don't we have..." one?
>> 
>> I am not certain which question you are asking:
>> 
>> Why don't we have an ability to enter all namespaces with one syscall
>> invocation?
>
> Exactly. Please add at least the NSTYPE_NSPROXY or whatever, that will
> pin all namespaces of a given pid from the very beginning.

For nsfd(2) that is doable.  At least for now setns can't restore it.

>> Why don't we have a syscall that allows us to enter every namespace?
>
> This one is done in the patch, no?
>
> Although the approach is OK for me, there's one design issue, that came
> up to my mind recently: can we use this fd to wail for a namespace to 
> stop? I currently don't see this ability, but this is something I require
> badly.

I have designed these file descriptors to pin the namespaces, so
waiting for them to exit isn't something they can do now.  It makes a
lot of sense to have similar ones that take weak references to the namespaces
that we can use to wait for a namespace to exit.
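
The difference between a pinning and a weak reference can be shown with
a toy model. This is only a sketch under stated assumptions: struct toy_ns
and the toy_ns_*() helpers are hypothetical names for illustration and do
not correspond to the kernel's namespace refcounting code.

```c
#include <stdbool.h>

/* Toy namespace: "alive" while at least one strong reference exists. */
struct toy_ns {
	int  strong;   /* count of pinning (strong) references */
	bool alive;    /* cleared when the last strong reference is dropped */
};

/* Take a pinning reference, as the file descriptors described above do. */
void toy_ns_get(struct toy_ns *ns)
{
	ns->strong++;
}

/* Drop a pinning reference; the namespace exits when none remain. */
void toy_ns_put(struct toy_ns *ns)
{
	if (--ns->strong == 0)
		ns->alive = false;
}

/*
 * A weak observer does not touch the count, so it never keeps the
 * namespace alive and can therefore see it exit.
 */
bool toy_ns_weak_alive(const struct toy_ns *ns)
{
	return ns->alive;
}
```

A holder of a pinning reference can never observe the exit, because its
own reference is what prevents it; only the weak observer can.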

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                             ` <m1mxyvrqvk.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-27  9:21                                               ` Pavel Emelyanov
  2010-02-27  9:42                                                 ` Eric W. Biederman
       [not found]                                                 ` <4B88E431.6040609-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27  9:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>
>>>>>> Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
>>>>>> I currently have a way to create all namespaces we have with one
>>>>>> syscall. Why don't we have an ability to enter them all with one syscall?
>>>>> The CLONE_NEWXXX series of bits has been an royal pain to work with,
>>>>> and it appears to be unnecessary complications for no gain.
>>>> That's the answer for the "Yet another set..." question.
>>>> How about the "Why don't we have..." one?
>>> I am not certain which question you are asking:
>>>
>>> Why don't we have an ability to enter all namespaces with one syscall
>>> invocation?
>> Exactly. Please add at least the NSTYPE_NSPROXY or whatever, that will
>> pin all namespaces of a given pid from the very beginning.
> 
> For nsfd(2) that is doable.  At least for now setns can't restore it.

Thanks. What's the problem with setns?

>>> Why don't we have a syscall that allows us to enter every namespace?
>> This one is done in the patch, no?
>>
>> Although the approach is OK for me, there's one design issue, that came
>> up to my mind recently: can we use this fd to wail for a namespace to 
>> stop? I currently don't see this ability, but this is something I require
>> badly.
> 
> I have designed these file descriptors to pin the namespaces, so
> waiting for them to exit isn't something they can do now.  It makes a
> lot of sense to have similar ones that take  weak references to the namespaces
> that we can use to wait for a namespace to exit.

Yes, I saw this from the patches. Eric, I'd very much appreciate it if we
work out a solution that will allow us to kill two birds with one stone.
I do not want to invent yet another bunch of system calls for "taking
a weak reference".

As a "brainstorm" start: can we use inotify/dnotify for this?
Or maybe we should better equip the nsfd call with a flags argument and
add a flag for a weak reference? In that case, how shall we get a
notification that the namespace is dead? With poll? Maybe it is worth making
sys_close return only when the namespace is dead (by providing a
proper ->release callback for the file)?

> Eric
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                 ` <4B88E431.6040609-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-27  9:42                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27  9:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Thanks. What's the problem with setns?

Joining a preexisting namespace is roughly the same problem as
unsharing a namespace.  We simply haven't figured out how to do it
safely for the pid and the uid namespaces.

>> I have designed these file descriptors to pin the namespaces, so
>> waiting for them to exit isn't something they can do now.  It makes a
>> lot of sense to have similar ones that take  weak references to the namespaces
>> that we can use to wait for a namespace to exit.
>
> Yes, I saw this from patches. Eric, I'd very much appreciate if we
> workout a solution that will allow us to kill two birds with one stone.
> I do not want to invent yet another bunch of system calls for "taking
> weak reference".

Definitely.  I only consider the current interface to be mushy, not
set in stone.

> As a "brain storm" start up. Can we use inotify/dnotify for this? 
> Or maybe we should better equip the nsfd call with flags argument and 
> add a flag for weak reference? In that case - how shall we get a 
> notification about namespace is dead? With poll? Maybe worth making
> the sys_close return only when the namespace is dead (by providing a
> proper ->release callback of a file)?

We would want poll to work, anything else is a weird work-around.
The challenging part is that we don't have any infrastructure for
notifying when a namespace goes away.  So that has to be built before
we can wire it up to userspace.  I don't expect it is too difficult
but there is work to be done.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27  9:21                                               ` Pavel Emelyanov
@ 2010-02-27  9:42                                                 ` Eric W. Biederman
  2010-02-27 16:16                                                   ` Pavel Emelyanov
       [not found]                                                   ` <m1bpfbqajn.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                                                 ` <4B88E431.6040609-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27  9:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

> Thanks. What's the problem with setns?

Joining a preexisting namespace is roughly the same problem as
unsharing a namespace.  We simply haven't figured out how to do it
safely for the pid and the uid namespaces.

>> I have designed these file descriptors to pin the namespaces, so
>> waiting for them to exit isn't something they can do now.  It makes a
>> lot of sense to have similar ones that take  weak references to the namespaces
>> that we can use to wait for a namespace to exit.
>
> Yes, I saw this from patches. Eric, I'd very much appreciate if we
> workout a solution that will allow us to kill two birds with one stone.
> I do not want to invent yet another bunch of system calls for "taking
> weak reference".

Definitely.  I only consider the current interface to be mushy, not
set in stone.

> As a "brain storm" start up. Can we use inotify/dnotify for this? 
> Or maybe we should better equip the nsfd call with flags argument and 
> add a flag for weak reference? In that case - how shall we get a 
> notification about namespace is dead? With poll? Maybe worth making
> the sys_close return only when the namespace is dead (by providing a
> proper ->release callback of a file)?

We would want poll to work, anything else is a weird work-around.
The challenging part is that we don't have any infrastructure for
notifying when a namespace goes away.  So that has to be built before
we can wire it up to userspace.  I don't expect it is too difficult
but there is work to be done.
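
From the userspace side, a poll-based wait might look like the sketch
below. No namespace file descriptor supports this today, so the example
is only an assumption about the eventual semantics: wait_for_exit() is a
hypothetical helper, and in a test the read end of a pipe can stand in
for the namespace fd, since it reports POLLHUP once the write end is
closed.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait until fd reports readability, hang-up or an
 * error, or until timeout_ms milliseconds have passed.
 *
 * Returns 1 if an event arrived, 0 on timeout, -1 on failure (poll()
 * itself failed, or fd is not a valid open descriptor).
 */
int wait_for_exit(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;              /* 0: timeout, -1: poll() error */
	if (pfd.revents & POLLNVAL)
		return -1;               /* fd was not open */
	return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}
```

The caller side stays the same whatever kernel infrastructure ends up
raising the event, which is the attraction of poll over a new mechanism.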

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                   ` <m1bpfbqajn.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-27 16:16                                                     ` Pavel Emelyanov
  0 siblings, 0 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27 16:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>> Thanks. What's the problem with setns?
> 
> joining a preexisting namespace is roughly the same problem as
> unsharing a namespace.  We simply haven't figure out how to do it
> safely for the pid and the uid namespaces.

The pid may change after this for sure. What problems do you know
about it? What if we try to allocate the same PID in a new space
or return -EBUSY? This will be a good starting point. If we manage
to fix it later this will not break the API at all.

>>> I have designed these file descriptors to pin the namespaces, so
>>> waiting for them to exit isn't something they can do now.  It makes a
>>> lot of sense to have similar ones that take  weak references to the namespaces
>>> that we can use to wait for a namespace to exit.
>> Yes, I saw this from patches. Eric, I'd very much appreciate if we
>> workout a solution that will allow us to kill two birds with one stone.
>> I do not want to invent yet another bunch of system calls for "taking
>> weak reference".
> 
> Definitely.  I only consider the current interface to be a mushy not
> set in stone.

OK. The interface is good. I just don't want you to send it for inclusion
until we decide what to do with waiting.

>> As a "brain storm" start up. Can we use inotify/dnotify for this? 
>> Or maybe we should better equip the nsfd call with flags argument and 
>> add a flag for weak reference? In that case - how shall we get a 
>> notification about namespace is dead? With poll? Maybe worth making
>> the sys_close return only when the namespace is dead (by providing a
>> proper ->release callback of a file)?
> 
> We would want poll to work, anything else is a weird work-around.
> The challenging part is that we don't have any infrastructure for
> notifying when a namespace goes away.  So that has to be built before
> we can wire it up to userspace.  I don't expect it is too difficult
> but there is work to be done.

Poll is OK with me. As far as the notification is concerned - that's also
done in OpenVZ. If you are OK to wait for a week or two I can do it for net
namespaces.

> Eric
> 
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27  9:42                                                 ` Eric W. Biederman
@ 2010-02-27 16:16                                                   ` Pavel Emelyanov
  2010-02-27 19:08                                                     ` Eric W. Biederman
       [not found]                                                     ` <4B894564.7080104-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
       [not found]                                                   ` <m1bpfbqajn.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27 16:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> Thanks. What's the problem with setns?
> 
> joining a preexisting namespace is roughly the same problem as
> unsharing a namespace.  We simply haven't figure out how to do it
> safely for the pid and the uid namespaces.

The pid may change after this for sure. What problems do you know
about it? What if we try to allocate the same PID in a new space
or return -EBUSY? This will be a good starting point. If we manage
to fix it later this will not break the API at all.
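
The "same PID or -EBUSY" policy can be sketched as a toy model. This is
not kernel code and makes no claim about how pid allocation is actually
implemented: struct toy_pid_ns, TOY_PID_MAX and toy_join_pid_ns() are
hypothetical names used only to illustrate the proposed behaviour.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_PID_MAX 64   /* hypothetical limit, kept tiny for illustration */

/* Toy pid namespace: one "in use" flag per numeric pid. */
struct toy_pid_ns {
	bool used[TOY_PID_MAX];
};

/*
 * Try to claim the caller's current numeric pid in the target namespace.
 *
 * Returns the pid on success, -EBUSY if that number is already taken in
 * the target namespace, or -EINVAL if the pid is out of range.
 */
int toy_join_pid_ns(struct toy_pid_ns *ns, int pid)
{
	if (pid <= 0 || pid >= TOY_PID_MAX)
		return -EINVAL;
	if (ns->used[pid])
		return -EBUSY;
	ns->used[pid] = true;
	return pid;
}
```

Because a clash is reported as an error rather than silently resolved,
a later kernel could relax the policy without changing what callers see
on success.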

>>> I have designed these file descriptors to pin the namespaces, so
>>> waiting for them to exit isn't something they can do now.  It makes a
>>> lot of sense to have similar ones that take  weak references to the namespaces
>>> that we can use to wait for a namespace to exit.
>> Yes, I saw this from patches. Eric, I'd very much appreciate if we
>> workout a solution that will allow us to kill two birds with one stone.
>> I do not want to invent yet another bunch of system calls for "taking
>> weak reference".
> 
> Definitely.  I only consider the current interface to be a mushy not
> set in stone.

OK. The interface is good. I just don't want you to send it for inclusion
until we decide what to do about waiting.

>> As a "brain storm" start up. Can we use inotify/dnotify for this? 
>> Or maybe we should better equip the nsfd call with flags argument and 
>> add a flag for weak reference? In that case - how shall we get a 
>> notification about namespace is dead? With poll? Maybe worth making
>> the sys_close return only when the namespace is dead (by providing a
>> proper ->release callback of a file)?
> 
> We would want poll to work, anything else is a weird work-around.
> The challenging part is that we don't have any infrastructure for
> notifying when a namespace goes away.  So that has to be built before
> we can wire it up to userspace.  I don't expect it is too difficult
> but there is work to be done.

Poll is OK with me. As far as the notification is concerned - that's also
done in OpenVZ. If you are OK to wait for a week or two I can do it for net
namespaces.

> Eric
> 
> 


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                     ` <4B894564.7080104-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-27 19:08                                                       ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27 19:08 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>> 
>>> Thanks. What's the problem with setns?
>> 
>> joining a preexisting namespace is roughly the same problem as
>> unsharing a namespace.  We simply haven't figure out how to do it
>> safely for the pid and the uid namespaces.
>
> The pid may change after this for sure. What problems do you know
> about it? What if we try to allocate the same PID in a new space
> or return -EBUSY? This will be a good starting point. If we manage
> to fix it later this will not break the API at all.

Parentage.  The pid is the identity of a process and all kinds of things
make assumptions in all kinds of strange places.  I don't see how
waitpid can work if you change the pid.

glibc doesn't cope if you change someone's pid.

>> Definitely.  I only consider the current interface to be a mushy not
>> set in stone.
>
> OK. The interface is good. I just don't want you to send it for an inclusion
> until we decide what to do with waiting.

Sure.  I am getting a jump on 2.6.35, not aiming for inclusion in this merge
window.  There is plenty of time.

>
> Poll is OK with me. As far as the notification is concerned - that's also
> done in OpenVZ. If you are OK to wait for a week or two I can do it for net
> namespaces.

Seems reasonable.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27 16:16                                                   ` Pavel Emelyanov
@ 2010-02-27 19:08                                                     ` Eric W. Biederman
       [not found]                                                       ` <m1iq9io5sc.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-02-27 19:29                                                       ` Pavel Emelyanov
       [not found]                                                     ` <4B894564.7080104-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27 19:08 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>>> Thanks. What's the problem with setns?
>> 
>> joining a preexisting namespace is roughly the same problem as
>> unsharing a namespace.  We simply haven't figure out how to do it
>> safely for the pid and the uid namespaces.
>
> The pid may change after this for sure. What problems do you know
> about it? What if we try to allocate the same PID in a new space
> or return -EBUSY? This will be a good starting point. If we manage
> to fix it later this will not break the API at all.

Parentage.  The pid is the identity of a process and all kinds of things
make assumptions in all kinds of strange places.  I don't see how
waitpid can work if you change the pid.

glibc doesn't cope if you change someone's pid.

>> Definitely.  I only consider the current interface to be a mushy not
>> set in stone.
>
> OK. The interface is good. I just don't want you to send it for an inclusion
> until we decide what to do with waiting.

Sure.  I am getting a jump on 2.6.35, not aiming for inclusion in this merge
window.  There is plenty of time.

>
> Poll is OK with me. As far as the notification is concerned - that's also
> done in OpenVZ. If you are OK to wait for a week or two I can do it for net
> namespaces.

Seems reasonable.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                       ` <m1iq9io5sc.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-27 19:29                                                         ` Pavel Emelyanov
  0 siblings, 0 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>
>>>> Thanks. What's the problem with setns?
>>> joining a preexisting namespace is roughly the same problem as
>>> unsharing a namespace.  We simply haven't figure out how to do it
>>> safely for the pid and the uid namespaces.
>> The pid may change after this for sure. What problems do you know
>> about it? What if we try to allocate the same PID in a new space
>> or return -EBUSY? This will be a good starting point. If we manage
>> to fix it later this will not break the API at all.
> 
> Parentage.  The pid is the identity of a process and all kinds of things
> make assumptions in all kinds of strange places.  I don't see how
> waitpid can work if you change the pid.

Agreed. But what if we enter a pid namespace that is a sub-namespace of the current
one? In that case the parent will still see the task by its old pid. We can restrict
the first version of entering with this rule as well, and this restriction will not
block us in the typical use case (I mean entering a container from the host).

> glibc doesn't cope if you change someones pid.

OK, but what if we try to allocate the same pid, returning -EBUSY on failure?

My aim is to provide even a restricted enter. For most cases this
should work and make our lives easier. So, two restrictions for now:
a) enter a sub-namespace
b) allocate the same pid as we have now

Hm? :)
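
The two restrictions above can be sketched as a small user-space toy model.
This is only an illustration under stated assumptions, not kernel code: the
names toy_pid_ns, toy_enter and TOY_PID_MAX are made up, and returning
-EINVAL for a violation of rule (a) is a guess (the thread only names -EBUSY
for rule (b)). A real task would also keep a pid at every namespace level,
which the toy ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PID_MAX 64

/* Toy pid namespace: a parent link plus a table of pids in use. */
struct toy_pid_ns {
    struct toy_pid_ns *parent;      /* NULL for the initial namespace */
    bool used[TOY_PID_MAX];         /* used[n] is true if pid n is taken */
};

/* True if 'ns' is a strict descendant of 'ancestor'. */
static bool toy_is_descendant(const struct toy_pid_ns *ns,
                              const struct toy_pid_ns *ancestor)
{
    for (ns = ns->parent; ns != NULL; ns = ns->parent)
        if (ns == ancestor)
            return true;
    return false;
}

/*
 * Restricted enter: returns 0 on success, -EINVAL (assumed error code) if
 * 'target' is not a sub-namespace of 'cur' (rule a), and -EBUSY if 'pid'
 * is already taken in 'target' (rule b).
 */
int toy_enter(const struct toy_pid_ns *cur, struct toy_pid_ns *target, int pid)
{
    assert(pid > 0 && pid < TOY_PID_MAX);
    if (!toy_is_descendant(target, cur))
        return -EINVAL;
    if (target->used[pid])
        return -EBUSY;
    target->used[pid] = true;       /* same number is now reserved in target */
    return 0;
}
```

With a host namespace and one container namespace below it, entering with a
free pid number succeeds, entering with a number already in use in the
container (such as its pid 1) gives -EBUSY, and entering "upwards" from the
container into the host is refused.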

>>> Definitely.  I only consider the current interface to be a mushy not
>>> set in stone.
>> OK. The interface is good. I just don't want you to send it for an inclusion
>> until we decide what to do with waiting.
> 
> Sure.  I am get a jump on 2.6.35 not aiming for inclusion this merge
> window.  There is plenty of time.

Good!

>> Poll is OK with me. As far as the notification is concerned - that's also
>> done in OpenVZ. If you are OK to wait for a week or two I can do it for net
>> namespaces.
> 
> Seems reasonable.

OK. I'll spend some time playing with it next week then.

> Eric
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27 19:08                                                     ` Eric W. Biederman
       [not found]                                                       ` <m1iq9io5sc.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-27 19:29                                                       ` Pavel Emelyanov
  2010-02-27 19:44                                                         ` Eric W. Biederman
       [not found]                                                         ` <4B89727C.9040602-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-02-27 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Thanks. What's the problem with setns?
>>> joining a preexisting namespace is roughly the same problem as
>>> unsharing a namespace.  We simply haven't figure out how to do it
>>> safely for the pid and the uid namespaces.
>> The pid may change after this for sure. What problems do you know
>> about it? What if we try to allocate the same PID in a new space
>> or return -EBUSY? This will be a good starting point. If we manage
>> to fix it later this will not break the API at all.
> 
> Parentage.  The pid is the identity of a process and all kinds of things
> make assumptions in all kinds of strange places.  I don't see how
> waitpid can work if you change the pid.

Agreed. But what if we enter a pid namespace that is a sub-namespace of the current
one? In that case the parent will still see the task by its old pid. We can restrict
the first version of entering with this rule as well, and this restriction will not
block us in the typical use case (I mean entering a container from the host).

> glibc doesn't cope if you change someones pid.

OK, but what if we try to allocate the same pid, returning -EBUSY on failure?

My aim is to provide even a restricted enter. For most cases this
should work and make our lives easier. So, two restrictions for now:
a) enter a sub-namespace
b) allocate the same pid as we have now

Hm? :)

>>> Definitely.  I only consider the current interface to be a mushy not
>>> set in stone.
>> OK. The interface is good. I just don't want you to send it for an inclusion
>> until we decide what to do with waiting.
> 
> Sure.  I am get a jump on 2.6.35 not aiming for inclusion this merge
> window.  There is plenty of time.

Good!

>> Poll is OK with me. As far as the notification is concerned - that's also
>> done in OpenVZ. If you are OK to wait for a week or two I can do it for net
>> namespaces.
> 
> Seems reasonable.

OK. I'll spend some time playing with it next week then.

> Eric
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                         ` <4B89727C.9040602-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-02-27 19:44                                                           ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27 19:44 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>> 
>>> Eric W. Biederman wrote:
>>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>
>>>>> Thanks. What's the problem with setns?
>>>> joining a preexisting namespace is roughly the same problem as
>>>> unsharing a namespace.  We simply haven't figure out how to do it
>>>> safely for the pid and the uid namespaces.
>>> The pid may change after this for sure. What problems do you know
>>> about it? What if we try to allocate the same PID in a new space
>>> or return -EBUSY? This will be a good starting point. If we manage
>>> to fix it later this will not break the API at all.
>> 
>> Parentage.  The pid is the identity of a process and all kinds of things
>> make assumptions in all kinds of strange places.  I don't see how
>> waitpid can work if you change the pid.
>
> Agree. But what if we enter a pid space, which is a subnamespace of a current
> one? In that case parent will still see the task by its old pid. We can restrict
> first version of entering with this rule as well and this restriction will not
> block us in typical usecase (I mean enter a container from a host).

The last time I was thinking about pid namespaces and unshare, the idea I came
to was that unshare of the pid namespace should only affect which pid namespace
your children are in.

I remember that to do that there were a few cases where you would have to access
task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
simple.
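
For illustration, the "only the children move" semantics can be observed from
user space with unshare(2) and CLONE_NEWPID, which as far as I know behave
this way in later mainline kernels (Linux 3.8 and newer). That is an
assumption about kernels newer than this thread, not something the patches
here provide. The helper name child_is_pid1_after_unshare is made up, and the
call needs CAP_SYS_ADMIN, so the sketch reports -1 when it cannot run.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the first child forked after unshare(CLONE_NEWPID) sees
 * itself as pid 1 while the unsharing process keeps its own pid,
 * 0 if not, and -1 if the experiment could not run (for example
 * unshare() failing with EPERM without CAP_SYS_ADMIN).
 */
int child_is_pid1_after_unshare(void)
{
    int status;
    pid_t helper = fork();      /* keep the unshare away from the caller */

    if (helper < 0)
        return -1;
    if (helper == 0) {
        pid_t before = getpid();
        pid_t child;

        if (unshare(CLONE_NEWPID) != 0)
            _exit(2);
        if (getpid() != before)     /* our own pid must not change */
            _exit(1);
        child = fork();             /* first child lands in the new namespace */
        if (child < 0)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1);
        if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
            _exit(2);
        _exit(WEXITSTATUS(status));
    }
    if (waitpid(helper, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

The extra helper process is a design choice of the sketch: once a process has
unshared its pid namespace all of its later children go to the new namespace,
so doing it in a throwaway process leaves the caller untouched.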

>> glibc doesn't cope if you change someones pid.
>
> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>
> My aim is to provide even a restricted enter. For most of the cases this
> should work and make our lives easier. So two restrictions currently:
> a) enter a sub namespace
> b) allocate the same pid as we have now
>
> Hm? :)

Replacing struct pid is guaranteed to do all kinds of nasty things with
signal handling and the like; de_thread is nasty enough and you are talking
about something worse.  So if we can change pid namespaces without changing
the pid I am for it.


Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27 19:29                                                       ` Pavel Emelyanov
@ 2010-02-27 19:44                                                         ` Eric W. Biederman
  2010-02-28 22:05                                                           ` Daniel Lezcano
       [not found]                                                           ` <m1ljeempk6.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                                                         ` <4B89727C.9040602-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-02-27 19:44 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Pavel Emelyanov <xemul@parallels.com> writes:

> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>>> Eric W. Biederman wrote:
>>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>>
>>>>> Thanks. What's the problem with setns?
>>>> joining a preexisting namespace is roughly the same problem as
>>>> unsharing a namespace.  We simply haven't figure out how to do it
>>>> safely for the pid and the uid namespaces.
>>> The pid may change after this for sure. What problems do you know
>>> about it? What if we try to allocate the same PID in a new space
>>> or return -EBUSY? This will be a good starting point. If we manage
>>> to fix it later this will not break the API at all.
>> 
>> Parentage.  The pid is the identity of a process and all kinds of things
>> make assumptions in all kinds of strange places.  I don't see how
>> waitpid can work if you change the pid.
>
> Agree. But what if we enter a pid space, which is a subnamespace of a current
> one? In that case parent will still see the task by its old pid. We can restrict
> first version of entering with this rule as well and this restriction will not
> block us in typical usecase (I mean enter a container from a host).

The last time I was thinking about pid namespaces and unshare, the idea I came
to was that unshare of the pid namespace should only affect which pid namespace
your children are in.

I remember that to do that there were a few cases where you would have to access
task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
simple.

>> glibc doesn't cope if you change someones pid.
>
> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>
> My aim is to provide even a restricted enter. For most of the cases this
> should work and make our lives easier. So two restrictions currently:
> a) enter a sub namespace
> b) allocate the same pid as we have now
>
> Hm? :)

Replacing struct pid is guaranteed to do all kinds of nasty things with
signal handling and the like; de_thread is nasty enough and you are talking
about something worse.  So if we can change pid namespaces without changing
the pid I am for it.


Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                           ` <m1ljeempk6.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-02-28 22:05                                                             ` Daniel Lezcano
  0 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-02-28 22:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>
>>>> Eric W. Biederman wrote:
>>>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>>
>>>>>> Thanks. What's the problem with setns?
>>>>> joining a preexisting namespace is roughly the same problem as
>>>>> unsharing a namespace.  We simply haven't figure out how to do it
>>>>> safely for the pid and the uid namespaces.
>>>> The pid may change after this for sure. What problems do you know
>>>> about it? What if we try to allocate the same PID in a new space
>>>> or return -EBUSY? This will be a good starting point. If we manage
>>>> to fix it later this will not break the API at all.
>>> Parentage.  The pid is the identity of a process and all kinds of things
>>> make assumptions in all kinds of strange places.  I don't see how
>>> waitpid can work if you change the pid.
>> Agree. But what if we enter a pid space, which is a subnamespace of a current
>> one? In that case parent will still see the task by its old pid. We can restrict
>> first version of entering with this rule as well and this restriction will not
>> block us in typical usecase (I mean enter a container from a host).
> 
> When I was thinking about pid namespaces and unshare last time.  The idea I came
> to was we unshare of the pid namespace should only affect which pid namespace
> your children are in.
> 
> I remember that do that there were a few cases where you would have to access
> task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
> simple.
> 
>>> glibc doesn't cope if you change someones pid.
>> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>>
>> My aim is to provide even a restricted enter. For most of the cases this
>> should work and make our lives easier. So two restrictions currently:
>> a) enter a sub namespace
>> b) allocate the same pid as we have now
>>
>> Hm? :)
> 
> Replacing struct pid is guaranteed to do all kinds of nasty things with
> signal handling and the like, de_thread is nasty enough and you are talking
> something worse.  So if we can change pid namespaces without changing
> the pid I am for it.

I agree with all the points you and Pavel talked about, but I don't
feel comfortable having the current process switch pid namespaces
because of the process tree hierarchy (what will be the parent of the
process when you enter the pid namespace, for example?). What is the
difference from sys_bindns or sys_hijack, proposed a couple of
years ago?

I made a suggestion some weeks ago about a new syscall 'cloneat' where
the child process becomes the child of the targeted process specified in
the syscall. Maybe it would be interesting to replace 'setns' by, or
add, a 'cloneat' syscall with the file descriptor passed as a parameter.
The copy_process function would not use the nsproxy of the caller but
the one provided in the fd argument.

The newly created process becomes the child of the process from which we
retrieved the namespace with nsfd, and that process has to 'waitpid' it (the
caller of 'cloneat' cannot wait for it). It's a bit similar to the
CLONE_PARENT flag, except the creation order is inverted (the father
creates for the child).

So when entering the container, we specify the pid 1 of the container 
which is usually a child reaper.
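
The parentage rule proposed for 'cloneat' can be sketched as a toy model.
This is not a real interface: no such syscall exists in this thread's
patches, and toy_task, toy_cloneat and toy_can_wait are made-up names. The
sketch only shows who ends up as parent and whose namespaces the child gets.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
    int pid;
    int nsproxy_id;             /* stand-in for the task's nsproxy */
    struct toy_task *parent;    /* the task that may waitpid() on us */
};

/*
 * Toy 'cloneat': fill in 'child' as created by 'caller' on behalf of
 * 'target'. Parentage and namespaces come from the target, not the caller.
 */
void toy_cloneat(const struct toy_task *caller, struct toy_task *target,
                 struct toy_task *child, int new_pid)
{
    assert(caller != NULL && target != NULL && child != NULL);
    child->pid = new_pid;
    child->nsproxy_id = target->nsproxy_id;
    child->parent = target;
}

/* A task can only wait for its own children. */
int toy_can_wait(const struct toy_task *waiter, const struct toy_task *child)
{
    return child->parent == waiter;
}
```

If a host shell calls toy_cloneat() with the container's pid 1 as target, the
new task carries the container's nsproxy and only the container's pid 1 can
wait for it, which is why a child reaper is the natural target.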

Does it make sense?

Thanks
   -- Daniel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-27 19:44                                                         ` Eric W. Biederman
@ 2010-02-28 22:05                                                           ` Daniel Lezcano
  2010-03-01 19:24                                                             ` Eric W. Biederman
                                                                               ` (4 more replies)
       [not found]                                                           ` <m1ljeempk6.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 5 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-02-28 22:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, hadi, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> Eric W. Biederman wrote:
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Eric W. Biederman wrote:
>>>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>>>
>>>>>> Thanks. What's the problem with setns?
>>>>> joining a preexisting namespace is roughly the same problem as
>>>>> unsharing a namespace.  We simply haven't figure out how to do it
>>>>> safely for the pid and the uid namespaces.
>>>> The pid may change after this for sure. What problems do you know
>>>> about it? What if we try to allocate the same PID in a new space
>>>> or return -EBUSY? This will be a good starting point. If we manage
>>>> to fix it later this will not break the API at all.
>>> Parentage.  The pid is the identity of a process and all kinds of things
>>> make assumptions in all kinds of strange places.  I don't see how
>>> waitpid can work if you change the pid.
>> Agree. But what if we enter a pid space, which is a subnamespace of a current
>> one? In that case parent will still see the task by its old pid. We can restrict
>> first version of entering with this rule as well and this restriction will not
>> block us in typical usecase (I mean enter a container from a host).
> 
> When I was thinking about pid namespaces and unshare last time.  The idea I came
> to was we unshare of the pid namespace should only affect which pid namespace
> your children are in.
> 
> I remember that do that there were a few cases where you would have to access
> task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
> simple.
> 
>>> glibc doesn't cope if you change someones pid.
>> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>>
>> My aim is to provide even a restricted enter. For most of the cases this
>> should work and make our lives easier. So two restrictions currently:
>> a) enter a sub namespace
>> b) allocate the same pid as we have now
>>
>> Hm? :)
> 
> Replacing struct pid is guaranteed to do all kinds of nasty things with
> signal handling and the like, de_thread is nasty enough and you are talking
> something worse.  So if we can change pid namespaces without changing
> the pid I am for it.

I agree with all the points you and Pavel talked about, but I don't
feel comfortable having the current process switch pid namespaces
because of the process tree hierarchy (what will be the parent of the
process when you enter the pid namespace, for example?). What is the
difference from sys_bindns or sys_hijack, proposed a couple of
years ago?

I made a suggestion some weeks ago about a new syscall 'cloneat' where
the child process becomes the child of the targeted process specified in
the syscall. Maybe it would be interesting to replace 'setns' by, or
add, a 'cloneat' syscall with the file descriptor passed as a parameter.
The copy_process function would not use the nsproxy of the caller but
the one provided in the fd argument.

The newly created process becomes the child of the process from which we
retrieved the namespace with nsfd, and that process has to 'waitpid' it (the
caller of 'cloneat' cannot wait for it). It's a bit similar to the
CLONE_PARENT flag, except the creation order is inverted (the father
creates for the child).

So when entering the container, we specify the pid 1 of the container 
which is usually a child reaper.

Does it make sense?

Thanks
   -- Daniel





^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                             ` <4B8AE8C1.1030305-GANU6spQydw@public.gmane.org>
@ 2010-03-01 19:24                                                               ` Eric W. Biederman
  2010-03-01 21:42                                                               ` Eric W. Biederman
                                                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-01 19:24 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:


>> Replacing struct pid is guaranteed to do all kinds of nasty things with
>> signal handling and the like, de_thread is nasty enough and you are talking
>> something worse.  So if we can change pid namespaces without changing
>> the pid I am for it.
>
> I agree with all the points you and Pavel you talked about but I don't feel
> comfortable to have the current process to switch the pid namespace because of
> the process tree hierarchy (what will be the parent of the process when you
> enter the pid namespace for example). What is the difference with the sys_bindns
> or the sys_hijack, proposed a couple of years ago ?

I was not aiming at the general enter case.  There is a very specific case
in networking where we only need a network namespace, not full-blown containers,
so I was seeing what could be done to handle the easy case.

The big idea is solving the namespace naming issues with bind mounts and file
descriptors.  All of the rest is window dressing for that idea.

setns looks like the easy way but what is really needed for the network namespace
is a way to open sockets that are in a specified network namespace.
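
As a rough sketch of "open a socket in a specified network namespace", the
pattern below switches into the target namespace, creates the socket, and
switches back. It assumes the setns(2) call as found in later mainline
kernels and glibc (setns(fd, nstype)), which may differ from the interface in
this RFC, and it assumes the namespace is named by a path such as
/proc/<pid>/ns/net or a bind mount of it. The helper name socket_in_netns is
made up. It needs CAP_SYS_ADMIN, and setns() only affects the calling thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a socket inside the network namespace referred to by 'ns_path'.
 * Returns the socket fd, or -1 with errno set.
 */
int socket_in_netns(const char *ns_path, int domain, int type, int protocol)
{
    int sock = -1, saved_errno;
    int self_ns, target_ns;

    /* Remember where we came from so we can switch back. */
    self_ns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
    if (self_ns < 0)
        return -1;

    target_ns = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (target_ns < 0) {
        saved_errno = errno;
        close(self_ns);
        errno = saved_errno;
        return -1;
    }

    if (setns(target_ns, CLONE_NEWNET) == 0) {
        sock = socket(domain, type, protocol);  /* stays in the target netns */
        saved_errno = errno;
        if (setns(self_ns, CLONE_NEWNET) != 0) {
            /* Could not get back home: report that instead. */
            saved_errno = errno;
            if (sock >= 0)
                close(sock);
            sock = -1;
        }
    } else {
        saved_errno = errno;
    }

    close(target_ns);
    close(self_ns);
    errno = saved_errno;
    return sock;
}
```

A socket keeps the network namespace it was created in, so the returned
descriptor can be used normally after the thread has returned to its
original namespace.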

> I did a suggestion some weeks ago about a new syscall 'cloneat' where the child
> process becomes the child of the targeted process specified in the
> syscall. Maybe it would be interesting to replace the 'setns' by, or add, a
> cloneat' syscall with the file descriptor passed as parameter. The copy_process
> function shall not use the nsproxy of the caller but the one provided in the fd
> argument.
>
> The newly created process becomes the child of the process where we retrieve the
> namespace with nsfd and this one have to 'waitpid' it, (the caller of 'cloneat'
> can not wait it). It's a bit similar with the CLONE_PARENT flag, except the
> creation order is inverted (the father creates for the child).
>
> So when entering the container, we specify the pid 1 of the container which is
> usually a child reaper.
>
> Does it make sense ?

Essentially.  I am not hugely interested in solving the general case
if it takes us off into tangents about pid namespace semantics.

I have just realized that while the original use case for having unix
domain sockets able to work across network namespaces was a little
weak, there are much better arguments.  Operationally it is a game
changer.  In the case where you don't need to support migration it
allows direct access to your X server and greatly simplifies the
design of a server designed to start processes in your container.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-28 22:05                                                           ` Daniel Lezcano
@ 2010-03-01 19:24                                                             ` Eric W. Biederman
  2010-03-01 21:42                                                             ` Eric W. Biederman
                                                                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-01 19:24 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, hadi, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Daniel Lezcano <daniel.lezcano@free.fr> writes:


>> Replacing struct pid is guaranteed to do all kinds of nasty things with
>> signal handling and the like, de_thread is nasty enough and you are talking
>> something worse.  So if we can change pid namespaces without changing
>> the pid I am for it.
>
> I agree with all the points you and Pavel you talked about but I don't feel
> comfortable to have the current process to switch the pid namespace because of
> the process tree hierarchy (what will be the parent of the process when you
> enter the pid namespace for example). What is the difference with the sys_bindns
> or the sys_hijack, proposed a couple of years ago ?

I was not aiming at the general enter case.  There is a very specific case
in networking where we only need a network namespace, not full-blown containers,
so I was seeing what could be done to handle the easy case.

The big idea is solving the namespace naming issues with bind mounts and file
descriptors.  All of the rest is window dressing for that idea.

setns looks like the easy way but what is really needed for the network namespace
is a way to open sockets that are in a specified network namespace.

> I did a suggestion some weeks ago about a new syscall 'cloneat' where the child
> process becomes the child of the targeted process specified in the
> syscall. Maybe it would be interesting to replace the 'setns' by, or add, a
> 'cloneat' syscall with the file descriptor passed as parameter. The copy_process
> function shall not use the nsproxy of the caller but the one provided in the fd
> argument.
>
> The newly created process becomes the child of the process where we retrieve the
> namespace with nsfd and this one have to 'waitpid' it, (the caller of 'cloneat'
> can not wait it). It's a bit similar with the CLONE_PARENT flag, except the
> creation order is inverted (the father creates for the child).
>
> So when entering the container, we specify the pid 1 of the container which is
> usually a child reaper.
>
> Does it make sense ?

Essentially.  I am not hugely interested in solving the general case
if it takes us off into tangents about pid namespace semantics.

I have just realized that while the original use case for having unix
domain sockets able to work across network namespaces was a little
weak, there are much better arguments.  Operationally it is a game
changer.  In the case where you don't need to support migration it
allows direct access to your X server and greatly simplifies the
design of a server designed to start processes in your container.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                             ` <4B8AE8C1.1030305-GANU6spQydw@public.gmane.org>
  2010-03-01 19:24                                                               ` Eric W. Biederman
@ 2010-03-01 21:42                                                               ` Eric W. Biederman
  2010-03-02 15:03                                                               ` Pavel Emelyanov
  2010-03-03 20:59                                                               ` Oren Laadan
  3 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-01 21:42 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> I agree with all the points you and Pavel you talked about but I don't feel
> comfortable to have the current process to switch the pid namespace because of
> the process tree hierarchy (what will be the parent of the process when you
> enter the pid namespace for example). What is the difference with the sys_bindns
> or the sys_hijack, proposed a couple of years ago ?

I think what has changed is:
- We have completed most of the namespace work.
- We have operational experience with the current namespaces.
- We have people not in the core containers group feeling the pain
  of not having some of these features.

So I think we are at a point where we can perhaps talk about these
things and finally solve some of these issues.

Clearly how to enter a container is on your and Pavel's mind as big
concerns.  I am aiming a little lower.

I am of two minds about my patches.  Right now they are a brilliant
proof of concept that we can name namespaces without needing a
namespace for the names of namespaces, and start to be a practical
solution to the join problem.   At the same time, I'm not certain
I like a solution that requires yet more syscalls so I ask myself
is there not yet a simpler way.

Hopefully we can resolve something before the next merge window.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                               ` <m1ljebwwgd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-02 13:10                                                                 ` Cedric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cedric Le Goater @ 2010-03-02 13:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

On 03/01/2010 10:42 PM, Eric W. Biederman wrote:
> I am of two minds about my patches.  Right now they are a brilliant
> proof of concept that we can name namespaces without needing a
> namespace for the names of namespaces, and start to be a practical
> solution to the join problem.   At the same time, I'm not certain
> I like a solution that requires yet more syscalls so I ask myself
> is there not yet a simpler way.

thinking aloud,

what if you made the nsproxy a vfs_inode ? we could then mount the nsfs
to do all sorts of fs operations on the object, like easily getting notified
of its deletion. we would need to find a meaningful name, probably the inode
number.

one syscall (nsfd) would be required to get the nsproxy of a task (pid).
you can't guess that from an inode number.


C.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                             ` <4B8AE8C1.1030305-GANU6spQydw@public.gmane.org>
  2010-03-01 19:24                                                               ` Eric W. Biederman
  2010-03-01 21:42                                                               ` Eric W. Biederman
@ 2010-03-02 15:03                                                               ` Pavel Emelyanov
  2010-03-03 20:59                                                               ` Oren Laadan
  3 siblings, 0 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-03-02 15:03 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Eric W. Biederman, Ben Greear

> I agree with all the points you and Pavel you talked about but I don't 
> feel comfortable to have the current process to switch the pid namespace 
> because of the process tree hierarchy (what will be the parent of the 
> process when you enter the pid namespace for example).

The answer is - the one, that used to be. I see no problems with it.
Do you?

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                               ` <4B8D28CF.8060304-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-03-02 15:14                                                                 ` Jan Engelhardt
  2010-03-02 21:19                                                                 ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 184+ messages in thread
From: Jan Engelhardt @ 2010-03-02 15:14 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Eric W. Biederman, Ben Greear

On Tuesday 2010-03-02 16:03, Pavel Emelyanov wrote:

>> I agree with all the points you and Pavel you talked about but I don't 
>> feel comfortable to have the current process to switch the pid namespace 
>> because of the process tree hierarchy (what will be the parent of the 
>> process when you enter the pid namespace for example).
>
>The answer is - the one, that used to be. I see no problems with it.
>Do you?

But perhaps it could be named "namespacefd" instead of nsfd, to reduce 
potential clashes (because glibc will usually just use the same name 
when making the syscall available as a C function).

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                               ` <4B8D28CF.8060304-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-03-02 15:14                                                                 ` Jan Engelhardt
@ 2010-03-02 21:19                                                                 ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 184+ messages in thread
From: Sukadev Bhattiprolu @ 2010-03-02 21:19 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Eric W. Biederman, Ben Greear

Pavel Emelyanov [xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org] wrote:
| > I agree with all the points you and Pavel you talked about but I don't 
| > feel comfortable to have the current process to switch the pid namespace 
| > because of the process tree hierarchy (what will be the parent of the 
| > process when you enter the pid namespace for example).
| 
| The answer is - the one, that used to be. I see no problems with it.
| Do you?

Just to be clear, when a process unshares its pid namespace, it takes
on additional pid nr (== 1) in the new namespace but retains its original
pid nr(s) in the parent (ancestor) namespaces right ?

i.e. the process becomes the container-init of the new namespace. When it
exits, all its children belonging to the new namespace are killed too,
but any children in the parent namespace (i.e children created before
unshare()) are not killed.

After the unshare() the process will not be able to signal any children
it created before the unshare() (because their active pid namespaces are
different).

Sukadev

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                 ` <alpine.LSU.2.01.1003021613570.17303-SHaQjdQMGhDmsUXKMKRlFA@public.gmane.org>
@ 2010-03-02 21:45                                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-02 21:45 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Jan Engelhardt <jengelh-nopoi9nDyk+ELgA04lAiVw@public.gmane.org> writes:

> On Tuesday 2010-03-02 16:03, Pavel Emelyanov wrote:
>
>>> I agree with all the points you and Pavel you talked about but I don't 
>>> feel comfortable to have the current process to switch the pid namespace 
>>> because of the process tree hierarchy (what will be the parent of the 
>>> process when you enter the pid namespace for example).
>>
>>The answer is - the one, that used to be. I see no problems with it.
>>Do you?
>
> But perhaps it could be named "namespacefd" instead of nsfd, to reduce 
> potential clashes (because glibc will usually just use the same name 
> when making the syscall available as a C function).

Maybe.  namespacefd seems like a real mouthful.  I agree nsfd might be
a bit non-obvious for a rarish syscall.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                 ` <20100302211942.GA17816-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-02 22:13                                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-02 22:13 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> Pavel Emelyanov [xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org] wrote:
> | > I agree with all the points you and Pavel you talked about but I don't 
> | > feel comfortable to have the current process to switch the pid namespace 
> | > because of the process tree hierarchy (what will be the parent of the 
> | > process when you enter the pid namespace for example).
> | 
> | The answer is - the one, that used to be. I see no problems with it.
> | Do you?
>
> Just to be clear, when a process unshares its pid namespace, it takes
> on additional pid nr (== 1) in the new namespace but retains its original
> pid nr(s) in the parent (ancestor) namespaces right ?
>
> i.e the process becomes the container-init of the new namespace. When it
> exits, all its children belonging to the new namespace are killed too,
> but any children in the parent namespace (i.e children created before
> unshare()) are not killed.
>
> After the unshare() the process will not be able to signal any children
> it created before the unshare() (bc their active pid namespaces are
> different)

The only case that I see as being simple and unsurprising works a bit
differently:

We currently have:

ns_of_pid(task_pid(tsk))
tsk->nsproxy->pid_ns


I would reduce the usage of tsk->nsproxy->pid_ns as much as possible,
and use ns_of_pid(task_pid(tsk)) for all of the routine things that
need to know the pid namespace of a process.  Possibly even to the point
of reversing the order of the upid array so using it is more efficient.

I would leave tsk->nsproxy->pid_ns for use by fork/clone when allocating
a child's pid number.
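
To make the distinction between the two lookups concrete, here is a small userspace model of how a struct pid carries one number per namespace level. It is only a sketch: the *_model names are invented for this example, the array is fixed-size for simplicity, and only the idea that the pid's own namespace is the one at its deepest level is taken from the kernel.

```c
#include <stddef.h>

#define MODEL_MAX_LEVELS 4	/* arbitrary bound for this sketch */

struct pid_namespace_model {
	unsigned int level;			/* 0 for the initial namespace */
	struct pid_namespace_model *parent;
};

struct upid_model {
	int nr;					/* number seen in ->ns */
	struct pid_namespace_model *ns;
};

struct pid_model {
	unsigned int level;			/* deepest namespace level */
	struct upid_model numbers[MODEL_MAX_LEVELS];
};

/* Namespace the pid was allocated in: the entry at its deepest level. */
struct pid_namespace_model *ns_of_pid_model(const struct pid_model *pid)
{
	return pid ? pid->numbers[pid->level].ns : NULL;
}

/* Number of this pid as seen from ns, or 0 if it is not visible there. */
int pid_nr_ns_model(const struct pid_model *pid,
		    const struct pid_namespace_model *ns)
{
	if (pid && ns->level <= pid->level &&
	    pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}
```

In this model the namespace answer comes from the pid itself, so it cannot change under a running task. A separate per-task pointer, playing the role of tsk->nsproxy->pid_ns, could then be changed by an unshare and would only matter when numbering children.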

The unsharing process would have to become the child reaper.  I think the first
child would become pid 1 in that pid namespace.


From an implementation point of view who gets pid 1 when the child_reaper is
not visible inside the pid namespace doesn't make much difference but we would
want to carefully look at the details so we minimize userspace confusion.
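
For reference, a userspace sketch of the "first child becomes pid 1" behaviour. It assumes the semantics of unshare(CLONE_NEWPID) as documented for later mainline kernels, where the caller keeps its own pid and only its children land in the new namespace; it is not a test of the patch in this thread. The function name is made up, and everything runs in a forked helper so the caller's own namespace setup is left alone.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the pid that the first child sees for itself after its
 * parent has unshared the pid namespace, or -1 if any step fails
 * (for example when the caller lacks CAP_SYS_ADMIN).
 */
int first_child_pid_after_unshare(void)
{
	int pfd[2];
	int result = -1;
	pid_t helper;

	if (pipe(pfd) < 0)
		return -1;

	helper = fork();
	if (helper < 0) {
		close(pfd[0]);
		close(pfd[1]);
		return -1;
	}

	if (helper == 0) {
		pid_t child;

		close(pfd[0]);
		/* The helper itself keeps its pid; only children move. */
		if (unshare(CLONE_NEWPID) < 0)
			_exit(1);

		child = fork();
		if (child < 0)
			_exit(1);
		if (child == 0) {
			/* First process created in the new pid namespace. */
			int self = (int)getpid();

			if (write(pfd[1], &self, sizeof(self)) != sizeof(self))
				_exit(1);
			_exit(0);
		}
		waitpid(child, NULL, 0);
		_exit(0);
	}

	close(pfd[1]);
	/* A closed pipe with no data means the helper gave up early. */
	if (read(pfd[0], &result, sizeof(result)) != (ssize_t)sizeof(result))
		result = -1;
	close(pfd[0]);
	waitpid(helper, NULL, 0);
	return result;
}
```

With sufficient privilege the function is expected to return 1; without it, -1. The helper acts as the parent of the new namespace's init while staying outside that namespace.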


I don't think a process tree rooted at pid 0 is a show stopper.  It is
somewhat confusing but we already have a forked process tree today,
and user space certainly hasn't fallen over.  In the case of a join, if you want
to live properly in the process tree you can daemonize and become a child
of init.




I think replacing a struct pid with another struct pid allocated in
a descendant pid_namespace (but which has all of the same struct upid values
as the first struct pid) is a disastrous idea.  It destroys the
uniqueness of struct pid, and we have a lot of places where we check
for equality of pid pointers; those would now be broken.
Other things like proc directories also use a cached struct pid and
would start thinking the process was gone when it was not.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                   ` <m1y6iaqsmm.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-03  0:07                                                                     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 184+ messages in thread
From: Sukadev Bhattiprolu @ 2010-03-03  0:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| 
| I think replacing a struct pid for another struct pid allocated in
| descendant pid_namespace (but has all of the same struct upid values
| as the first struct pid) is a disastrous idea.  It destroys the

True. Sorry, I did not mean we would need a new 'struct pid' for an
existing process. I think we talked earlier of finding a way of attaching
additional pid numbers to the same struct pid.

Sukadev

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-03  0:07                                                                   ` Sukadev Bhattiprolu
       [not found]                                                                     ` <20100303000743.GA13744-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-03  0:46                                                                     ` Eric W. Biederman
  2010-03-03 15:38                                                                       ` Serge E. Hallyn
                                                                                         ` (2 more replies)
  1 sibling, 3 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03  0:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Pavel Emelyanov, Daniel Lezcano, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | 
> | I think replacing a struct pid for another struct pid allocated in
> | descendant pid_namespace (but has all of the same struct upid values
> | as the first struct pid) is a disastrous idea.  It destroys the
>
> True. Sorry, I did not mean we would need a new 'struct pid' for an
> existing process. I think we talked earlier of finding a way of attaching
> additional pid numbers to the same struct pid.

I just played with this, and if you make the semantics of unshare(CLONE_NEWPID)
be that you become the idle task (aka pid 0) rather than the init task (pid 1),
the implementation is trivial.
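
To make the distinction the patch leans on concrete, here is a small
userspace toy model, not kernel code. It assumes simplified stand-ins for
struct pid, struct upid and struct pid_namespace; struct task, toy_fork(),
toy_unshare_newpid(), toy_pid_nr_ns() and toy_active_pid_ns() are hypothetical
names. It sketches how, after the unshare, the caller's own pid still lives
in the old namespace while the nsproxy-style pointer selects the namespace
for its children, and how the first child gets number 1 and becomes the
child reaper.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4

struct task;

struct pid_namespace {
	unsigned int level;              /* 0 for the initial namespace */
	int last_pid;                    /* last number handed out here */
	struct pid_namespace *parent;
	struct task *child_reaper;
};

struct upid {
	int nr;
	struct pid_namespace *ns;
};

struct pid {
	unsigned int level;              /* deepest level this pid is visible at */
	struct upid numbers[TOY_MAX_LEVEL];
};

struct task {
	struct pid pid;                       /* embedded here for brevity */
	struct pid_namespace *nsproxy_pid_ns; /* namespace for future children */
};

/* Namespace the task's own pid was allocated in. */
static struct pid_namespace *toy_active_pid_ns(const struct task *t)
{
	return t->pid.numbers[t->pid.level].ns;
}

/* Number of pid as seen from ns, or 0 if it is not visible there. */
static int toy_pid_nr_ns(const struct pid *pid, const struct pid_namespace *ns)
{
	if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}

/* Assumed unshare semantics: only the namespace for children changes. */
static void toy_unshare_newpid(struct task *t, struct pid_namespace *new_ns)
{
	new_ns->level = t->nsproxy_pid_ns->level + 1;
	new_ns->last_pid = 0;
	new_ns->parent = t->nsproxy_pid_ns;
	new_ns->child_reaper = NULL;
	t->nsproxy_pid_ns = new_ns;
}

/* Give the child one number per level, from the parent's child namespace up. */
static void toy_fork(const struct task *parent, struct task *child)
{
	struct pid_namespace *ns = parent->nsproxy_pid_ns;
	struct pid_namespace *p;

	assert(ns->level < TOY_MAX_LEVEL);
	child->nsproxy_pid_ns = ns;
	child->pid.level = ns->level;
	for (p = ns; p != NULL; p = p->parent) {
		child->pid.numbers[p->level].nr = ++p->last_pid;
		child->pid.numbers[p->level].ns = p;
	}
	/* Same test as the copy_process() hunks: number 1 at the deepest level. */
	if (child->pid.numbers[child->pid.level].nr == 1)
		ns->child_reaper = child;
}
```

In this model the unsharing task has no number in the new namespace, so
toy_pid_nr_ns() reports 0 for it there, which is one way to read "you
become pid 0".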

Eric
----

 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 fs/proc/root.c                            |    2 +-
 init/main.c                               |    9 ---------
 kernel/cgroup.c                           |    2 +-
 kernel/fork.c                             |   16 +++++++++++++---
 kernel/nsproxy.c                          |    2 +-
 kernel/perf_event.c                       |    2 +-
 kernel/pid.c                              |    8 ++++----
 kernel/signal.c                           |    9 ++++-----
 kernel/sysctl_binary.c                    |    2 +-
 11 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4678078..b7f2026 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1094,7 +1094,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
 		LOAD_INT(c), LOAD_FRAC(c),
 		count_active_contexts(),
 		atomic_read(&nr_spu_contexts),
-		current->nsproxy->pid_ns->last_pid);
+		task_active_pid_ns(current)->last_pid);
 	return 0;
 }
 
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index 3b3c366..4e6985e 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -125,7 +125,7 @@ void mconsole_log(struct mc_request *req)
 void mconsole_proc(struct mc_request *req)
 {
 	struct nameidata nd;
-	struct vfsmount *mnt = current->nsproxy->pid_ns->proc_mnt;
+	struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
 	struct file *file;
 	int n, err;
 	char *ptr = req->request.data, *buf;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b080b79..fbcd3f8 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -57,7 +57,7 @@ static int proc_get_sb(struct file_system_type *fs_type,
 	if (flags & MS_KERNMOUNT)
 		ns = (struct pid_namespace *)data;
 	else
-		ns = current->nsproxy->pid_ns;
+		ns = task_active_pid_ns(current);
 
 	sb = sget(fs_type, proc_test_super, proc_set_super, ns);
 	if (IS_ERR(sb))
diff --git a/init/main.c b/init/main.c
index 4cb47a1..67e40fc 100644
--- a/init/main.c
+++ b/init/main.c
@@ -851,15 +851,6 @@ static int __init kernel_init(void * unused)
 	 * init can run on any cpu.
 	 */
 	set_cpus_allowed_ptr(current, cpu_all_mask);
-	/*
-	 * Tell the world that we're going to be the grim
-	 * reaper of innocent orphaned children.
-	 *
-	 * We don't want people to have to make incorrect
-	 * assumptions about where in the task array this
-	 * can be found.
-	 */
-	init_pid_ns.child_reaper = current;
 
 	cad_pid = task_pid(current);
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index aa3bee5..737d2eb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2453,7 +2453,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
 {
 	struct cgroup_pidlist *l;
 	/* don't need task_nsproxy() if we're looking at ourself */
-	struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
+	struct pid_namespace *ns = get_pid_ns(task_active_pid_ns(current));
 	/*
 	 * We can't drop the pidlist_mutex before taking the l->mutex in case
 	 * the last ref-holder is trying to remove l from the list at the same
diff --git a/kernel/fork.c b/kernel/fork.c
index f88bd98..832c035 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1172,7 +1172,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		if (!pid)
 			goto bad_fork_cleanup_io;
 
-		if (clone_flags & CLONE_NEWPID) {
+		if (pid->numbers[pid->level].nr == 1) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
 			if (retval < 0)
 				goto bad_fork_free_pid;
@@ -1279,7 +1279,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		tracehook_finish_clone(p, clone_flags, trace);
 
 		if (thread_group_leader(p)) {
-			if (clone_flags & CLONE_NEWPID)
+			if (pid->numbers[pid->level].nr == 1)
 				p->nsproxy->pid_ns->child_reaper = p;
 
 			p->signal->leader_pid = pid;
@@ -1539,10 +1539,19 @@ static void check_unshare_flags(unsigned long *flags_ptr)
 		*flags_ptr |= CLONE_THREAD;
 
 	/*
+	 * If unsharing the pid namespace and the task was created
+	 * using CLONE_THREAD, then must unshare the thread.
+	 */
+	if ((*flags_ptr & CLONE_NEWPID) &&
+	    (atomic_read(&current->signal->count) > 1))
+		*flags_ptr |= CLONE_THREAD;
+
+	/*
 	 * If unsharing namespace, must also unshare filesystem information.
 	 */
 	if (*flags_ptr & CLONE_NEWNS)
 		*flags_ptr |= CLONE_FS;
+
 }
 
 /*
@@ -1647,7 +1656,8 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 	err = -EINVAL;
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
-				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET))
+				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
+				CLONE_NEWPID))
 		goto bad_unshare_out;
 
 	/*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e3be4ef..1d023d5 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -173,7 +173,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET)))
+			       CLONE_NEWNET | CLONE_NEWPID)))
 		return 0;
 
 	if (!capable(CAP_SYS_ADMIN))
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 2ae7409..74865cd 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4436,7 +4436,7 @@ perf_event_alloc(struct perf_event_attr *attr,
 
 	event->parent		= parent_event;
 
-	event->ns		= get_pid_ns(current->nsproxy->pid_ns);
+	event->ns		= get_pid_ns(task_active_pid_ns(current));
 	event->id		= atomic64_inc_return(&perf_event_id);
 
 	event->state		= PERF_EVENT_STATE_INACTIVE;
diff --git a/kernel/pid.c b/kernel/pid.c
index 2e17c9c..6b64a82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -305,7 +305,7 @@ EXPORT_SYMBOL_GPL(find_pid_ns);
 
 struct pid *find_vpid(int nr)
 {
-	return find_pid_ns(nr, current->nsproxy->pid_ns);
+	return find_pid_ns(nr, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(find_vpid);
 
@@ -385,7 +385,7 @@ struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 
 struct task_struct *find_task_by_vpid(pid_t vnr)
 {
-	return find_task_by_pid_ns(vnr, current->nsproxy->pid_ns);
+	return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
 }
 
 struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
@@ -437,7 +437,7 @@ pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
 
 pid_t pid_vnr(struct pid *pid)
 {
-	return pid_nr_ns(pid, current->nsproxy->pid_ns);
+	return pid_nr_ns(pid, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(pid_vnr);
 
@@ -448,7 +448,7 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 
 	rcu_read_lock();
 	if (!ns)
-		ns = current->nsproxy->pid_ns;
+		ns = task_active_pid_ns(current);
 	if (likely(pid_alive(task))) {
 		if (type != PIDTYPE_PID)
 			task = task->group_leader;
diff --git a/kernel/signal.c b/kernel/signal.c
index 934ae5e..885b699 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1438,16 +1438,15 @@ int do_notify_parent(struct task_struct *tsk, int sig)
 	 * we are under tasklist_lock here so our parent is tied to
 	 * us and cannot exit and release its namespace.
 	 *
-	 * the only it can is to switch its nsproxy with sys_unshare,
-	 * bu uncharing pid namespaces is not allowed, so we'll always
-	 * see relevant namespace
+	 * The only way it can is to switch its nsproxy with sys_unshare,
+	 * but we use the pid_namespace for task_pid which never changes.
 	 *
 	 * write_lock() currently calls preempt_disable() which is the
 	 * same as rcu_read_lock(), but according to Oleg, this is not
 	 * correct to rely on this
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
+	info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(tsk->parent));
 	info.si_uid = __task_cred(tsk)->uid;
 	rcu_read_unlock();
 
@@ -1518,7 +1517,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	 * see comment in do_notify_parent() abot the following 3 lines
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
+	info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
 	info.si_uid = __task_cred(tsk)->uid;
 	rcu_read_unlock();
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 8f5d16e..1e4da59 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -1356,7 +1356,7 @@ static ssize_t binary_sysctl(const int *name, int nlen,
 		goto out_putname;
 	}
 
-	mnt = current->nsproxy->pid_ns->proc_mnt;
+	mnt = task_active_pid_ns(current)->proc_mnt;
 	result = vfs_path_lookup(mnt->mnt_root, mnt, pathname, 0, &nd);
 	if (result)
 		goto out_putname;
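
The two flag implications visible in the check_unshare_flags() hunk above
can be modeled in userspace as follows. This is a sketch only: the TOY_
constants copy the CLONE_* values from the Linux UAPI headers,
threads_in_group stands in for atomic_read(&current->signal->count), and
the implications the real function applies before these two are left out.

```c
#include <assert.h>

/* Values copied from the Linux UAPI CLONE_* definitions. */
#define TOY_CLONE_FS      0x00000200UL
#define TOY_CLONE_THREAD  0x00010000UL
#define TOY_CLONE_NEWNS   0x00020000UL
#define TOY_CLONE_NEWPID  0x20000000UL

/* Return flags with the implied bits from the hunk added. */
static unsigned long toy_check_unshare_flags(unsigned long flags,
					     int threads_in_group)
{
	assert(threads_in_group >= 1);

	/* Unsharing the pid namespace from a multi-threaded group
	 * implies unsharing the thread as well. */
	if ((flags & TOY_CLONE_NEWPID) && threads_in_group > 1)
		flags |= TOY_CLONE_THREAD;

	/* Unsharing the mount namespace implies unsharing fs information. */
	if (flags & TOY_CLONE_NEWNS)
		flags |= TOY_CLONE_FS;

	return flags;
}
```

For example, toy_check_unshare_flags(TOY_CLONE_NEWPID, 3) returns the
input with TOY_CLONE_THREAD added, while the same call with a single
thread returns the flags unchanged.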

^ permalink raw reply related	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-03  0:46                                                                     ` Eric W. Biederman
@ 2010-03-03 15:38                                                                       ` Serge E. Hallyn
  2010-03-03 19:47                                                                         ` Eric W. Biederman
       [not found]                                                                         ` <20100303153800.GA937-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-03-03 16:50                                                                       ` Pavel Emelyanov
       [not found]                                                                       ` <m1ocj6qljj.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2 siblings, 2 replies; 184+ messages in thread
From: Serge E. Hallyn @ 2010-03-03 15:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sukadev Bhattiprolu, Pavel Emelyanov, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
> 
> > Eric W. Biederman [ebiederm@xmission.com] wrote:
> > | 
> > | I think replacing a struct pid for another struct pid allocated in
> > | descendant pid_namespace (but has all of the same struct upid values
> > | as the first struct pid) is a disastrous idea.  It destroys the
> >
> > True. Sorry, I did not mean we would need a new 'struct pid' for an
> > existing process. I think we talked earlier of finding a way of attaching
> > additional pid numbers to the same struct pid.
> 
> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
> to be that you become the idle task aka pid 0, and not the init task pid 1 the
> implementation is trivial.

Heh, and then (browsing through your copy_process() patch hunks) the next
forked task becomes the child reaper for the new pidns?  <shrug>  why not
I guess.
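
The behaviour described above (the next forked task becomes the child
reaper of the new pidns) can be probed from userspace. The sketch below
assumes a Linux system with the kernel headers installed;
probe_unshare_newpid() and the PROBE_ codes are made-up names. The
unshare(2) man page documents CLONE_NEWPID for unshare() since Linux 3.8,
with the first child getting pid 1. The probe does the unshare in a
throwaway helper process so the caller's own fork behaviour is untouched,
and it reports a refusal (for example EPERM without CAP_SYS_ADMIN, or
EINVAL on a kernel without this support) instead of failing.

```c
#include <assert.h>
#include <linux/sched.h>   /* CLONE_NEWPID */
#include <sys/syscall.h>   /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum probe_result {
	PROBE_CHILD_IS_PID1   = 0, /* first child saw getpid() == 1 */
	PROBE_CHILD_NOT_PID1  = 1, /* first child saw some other pid */
	PROBE_UNSHARE_REFUSED = 2, /* unshare() returned an error */
	PROBE_OTHER_ERROR     = 3  /* fork() or waitpid() failed */
};

static int probe_unshare_newpid(void)
{
	int status;
	pid_t helper = fork();

	if (helper < 0)
		return PROBE_OTHER_ERROR;

	if (helper == 0) {
		pid_t child;

		/* Raw syscall, so the sketch does not depend on _GNU_SOURCE
		 * being defined before the first include. */
		if (syscall(SYS_unshare, (long)CLONE_NEWPID) != 0)
			_exit(PROBE_UNSHARE_REFUSED);

		child = fork();
		if (child < 0)
			_exit(PROBE_OTHER_ERROR);
		if (child == 0)
			_exit(getpid() == 1 ? PROBE_CHILD_IS_PID1
					    : PROBE_CHILD_NOT_PID1);

		if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
			_exit(PROBE_OTHER_ERROR);
		_exit(WEXITSTATUS(status));
	}

	if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status))
		return PROBE_OTHER_ERROR;
	return WEXITSTATUS(status);
}
```

The helper itself keeps its old pid throughout; only the child forked
after the unshare lands in the new namespace.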

Now if that child reaper then gets killed, will the idle task get killed too?
And if not, then the idle task can just keep re-populating the new pidns with
new idle tasks...

If this brought us a step closer to entering an existing pidns that would
be one thing, but is there actually any advantage to being able to
unshare a new pidns?  Oh, I guess there is - PAM can then use it at
login, which might be neat.

-serge

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-03  0:46                                                                     ` Eric W. Biederman
  2010-03-03 15:38                                                                       ` Serge E. Hallyn
@ 2010-03-03 16:50                                                                       ` Pavel Emelyanov
       [not found]                                                                         ` <4B8E9370.3050300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-03-03 20:16                                                                         ` Eric W. Biederman
       [not found]                                                                       ` <m1ocj6qljj.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2 siblings, 2 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-03-03 16:50 UTC (permalink / raw)
  To: Eric W. Biederman, Sukadev Bhattiprolu, Daniel Lezcano, Serge Hallyn
  Cc: Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
> to be that you become the idle task aka pid 0, and not the init task pid 1 the
> implementation is trivial.

This is not ... handy - if you have pid 0 after entering, you obviously
can't perform two parallel enters. The way I see it:

As far as the numbers reported to userspace are concerned:
1. the task that enters is still visible to its old parent under its old pid
2. the task that enters gets some pid within the namespace it enters
   and reports its parent as pid 1 (init obviously doesn't
   care)
3. we _can_ try to allocate a new pid equal to the old one so that
   glibc stays happy
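
Point 3 can be sketched with a toy number allocator. This is an
illustration under assumptions, not the kernel's pidmap code: struct
toy_pidmap, TOY_PID_MAX and toy_alloc_entering_pid() are hypothetical, and
the fallback simply takes the lowest free number, which is not how the
kernel picks pids.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PID_MAX 64

struct toy_pidmap {
	bool used[TOY_PID_MAX];  /* used[nr] is true if nr is taken */
};

/*
 * Pick a number for a task entering the namespace described by map.
 * Prefer the number it had before (want); otherwise fall back to the
 * lowest free number above 1, since 1 belongs to init.
 * Returns -1 if nothing is free.
 */
static int toy_alloc_entering_pid(struct toy_pidmap *map, int want)
{
	int nr;

	assert(map != 0);

	if (want > 1 && want < TOY_PID_MAX && !map->used[want]) {
		map->used[want] = true;
		return want;
	}

	for (nr = 2; nr < TOY_PID_MAX; nr++) {
		if (!map->used[nr]) {
			map->used[nr] = true;
			return nr;
		}
	}
	return -1;
}
```

The word "try" matters: when the old number is already taken in the target
namespace, the entering task ends up with a different one.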


As far as the pointers are concerned:
1. the parent pointer doesn't change
2. the task_pid(tsk) one (i.e. the struct pid * one) _can_ change if
   a) we don't allow threads to enter (so the de_thread problem is handled)
   b) we don't allow leaving the group/session, i.e. we check that the
      entering task is the only one living in its pgid/sid
   c) we wait for a quiescent state to pass before destroying
      the old pid, to handle the race with sys_kill()
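
Conditions a) and b) amount to an eligibility check before the struct pid
is swapped. A minimal sketch, assuming the counts are already known:
struct toy_task_info and toy_may_replace_pid() are hypothetical, and
reading b) as "the entering task is the sole member of its process group
and session" is an interpretation. Condition c) is a deferral of the free
rather than a check, so it is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_task_info {
	int threads_in_group;  /* tasks in the entering task's thread group */
	int tasks_in_pgrp;     /* tasks in its process group, itself included */
	int tasks_in_session;  /* tasks in its session, itself included */
};

/* True if the task's struct pid could be replaced under rules a) and b). */
static bool toy_may_replace_pid(const struct toy_task_info *t)
{
	assert(t->threads_in_group >= 1);

	/* a) multi-threaded tasks may not enter */
	if (t->threads_in_group != 1)
		return false;

	/* b) nobody else may depend on this task's pgid or sid */
	if (t->tasks_in_pgrp != 1 || t->tasks_in_session != 1)
		return false;

	return true;
}
```

A single-threaded task that is alone in its process group and session
passes; adding a second thread or a second group member makes it fail.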

Thoughts/questions? (A "this is a nasty problem" answer is not acceptable;
real code problems/races, please.)

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                         ` <20100303153800.GA937-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-03 19:47                                                                           ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 19:47 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
>> 
>> > Eric W. Biederman [ebiederm@xmission.com] wrote:
>> > | 
>> > | I think replacing a struct pid for another struct pid allocated in
>> > | descendant pid_namespace (but has all of the same struct upid values
>> > | as the first struct pid) is a disastrous idea.  It destroys the
>> >
>> > True. Sorry, I did not mean we would need a new 'struct pid' for an
>> > existing process. I think we talked earlier of finding a way of attaching
>> > additional pid numbers to the same struct pid.
>> 
>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>> implementation is trivial.
>
> Heh, and then (browsing through your copy_process() patch hunks) the next
> forked task becomes the child reaper for the new pidns?  <shrug>  why not
> I guess.
>
> Now if that child reaper then gets killed, will the idle task get killed too?

No.

> And if not, then idle task can just re-populating the new pidns with new
> idle tasks...

After zap_pid_namespace interesting...

> If this brought us a step closer to entering an existing pidns that would
> be one thing, but is there actually any advantage to being able to
> unshare a new pidns?  Oh, I guess there is - PAM can then use it at
> login, which might be neat.

I have to say that the semantics of my patch are unworkable for
unshare.  Unless I am mistaken, for PAM to use it the current process
must fully change and become what it needs to be.
Requiring an extra fork to fully complete the process is a problem.

Scratch one bright idea.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-03 15:38                                                                       ` Serge E. Hallyn
@ 2010-03-03 19:47                                                                         ` Eric W. Biederman
  2010-03-04 21:45                                                                           ` Eric W. Biederman
       [not found]                                                                           ` <m13a0hmblr.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                                                                         ` <20100303153800.GA937-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 19:47 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sukadev Bhattiprolu, Pavel Emelyanov, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
>> 
>> > Eric W. Biederman [ebiederm@xmission.com] wrote:
>> > | 
>> > | I think replacing a struct pid for another struct pid allocated in
>> > | descendant pid_namespace (but has all of the same struct upid values
>> > | as the first struct pid) is a disastrous idea.  It destroys the
>> >
>> > True. Sorry, I did not mean we would need a new 'struct pid' for an
>> > existing process. I think we talked earlier of finding a way of attaching
>> > additional pid numbers to the same struct pid.
>> 
>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>> implementation is trivial.
>
> Heh, and then (browsing through your copy_process() patch hunks) the next
> forked task becomes the child reaper for the new pidns?  <shrug>  why not
> I guess.
>
> Now if that child reaper then gets killed, will the idle task get killed too?

No.

> And if not, then idle task can just re-populating the new pidns with new
> idle tasks...

After zap_pid_namespace interesting...

> If this brought us a step closer to entering an existing pidns that would
> be one thing, but is there actually any advantage to being able to
> unshare a new pidns?  Oh, I guess there is - PAM can then use it at
> login, which might be neat.

I have to say that the semantics of my patch are unworkable for
unshare.  Unless I am mistaken, for PAM to use it the current process
must fully change and become what it needs to be.  Requiring an extra
fork to fully complete the process is a problem.

Scratch one bright idea.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                         ` <4B8E9370.3050300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-03-03 20:16                                                                           ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 20:16 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>> implementation is trivial.
>
> This is not ... handy - if after enter you have pid 0 you obviously
> can't perform 2 parallel enters.

2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
You have pid 0 because your pid simply does not map.

There is nothing that makes two parallel enters impossible in that.
Even today we have one thread per cpu that has task->pid == &init_struct_pid
which is pid 0.

For the case of unshare, where we are designed to be used with PAM, I don't
think my proposed semantics work.  For a join, needing an extra fork before
you are really in the pid namespace should be minor.

> The way I see it:
>
> As far as the numbers reported to the userspace are concerned:
> 1. task, that enters is still visible by its old parent by old pid
> 2. task, that enters gets some pid within the entering namespace
>    and reports its parent pid to have pid 1 (init obviously doesn't
>    care)
> 3. we _can_ try to allocate new pid equal to the old one so that
>    glibc stays happy
>
>
> As far as the pointers are concerned:
> 1. parent pointer doesn't change
> 2. task_pid(tsk) one (i.e. struct pid * one) _can_ change if
>    a) we don't allow threads enter (de_thread problem is handeled)
>    b) we don't allow leave the group/session, i.e. check, that there
>       is the only one task that enters lives in its pgid/sid
>    c) we wait for the quiescent state to pass by before destroying
>       the old pid to handle race with sys_kill()

That doesn't handle the case of cached struct pids.  A good example is
waitpid, where it waits for a specific struct pid.  Which means that
allocating a new struct pid and changing task->pid will cause
waitpid(pid) to wait forever...

To change struct pid would require the refcount on struct pid to show
no references from anywhere except the task_struct.

At the cost of a little memory we can solve that problem for unshare
if we have an extra upid in struct pid; how we verify there is space
in struct pid I'm not certain.

I do think that, at least until someone calls exec, the namespace pids
reported to the process itself should not change.  That is kill and
waitpid etc.  Which suggests an implementation the opposite of what
I proposed.  With ns_of_pid(task_pid(current)) being used as the
pid namespace of children, and current->nsproxy->pid_ns not changing
in the case of unshare.

Shrug.

Or perhaps this is a case where we can implement join with
an extra process but we can't implement unshare, because the effect
cannot be immediate.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-03 16:50                                                                       ` Pavel Emelyanov
       [not found]                                                                         ` <4B8E9370.3050300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-03-03 20:16                                                                         ` Eric W. Biederman
  2010-03-05 19:18                                                                           ` Pavel Emelyanov
       [not found]                                                                           ` <m17hptjh3m.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 20:16 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Sukadev Bhattiprolu, Daniel Lezcano, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Pavel Emelyanov <xemul@parallels.com> writes:

>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>> implementation is trivial.
>
> This is not ... handy - if after enter you have pid 0 you obviously
> can't perform 2 parallel enters.

2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
You have pid 0 because your pid simply does not map.

There is nothing that makes two parallel enters impossible in that.
Even today we have one thread per cpu that has task->pid == &init_struct_pid
which is pid 0.
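
To make "your pid simply does not map" concrete, here is a rough
userspace sketch of the lookup, assuming a simplified struct pid that
carries one upid per namespace level.  The struct layouts and
MAX_LEVELS are illustrative assumptions, not the kernel definitions.

```c
#include <stddef.h>

/* Toy userspace model; the struct layouts are simplified assumptions. */
struct pid_namespace {
	int level;                      /* 0 for the initial namespace */
	struct pid_namespace *parent;
};

struct upid {
	int nr;                         /* pid number as seen in ns */
	struct pid_namespace *ns;
};

#define MAX_LEVELS 4                    /* arbitrary bound for the sketch */

struct pid {
	int level;                      /* deepest namespace this pid lives in */
	struct upid numbers[MAX_LEVELS];
};

/* Return the number of pid as seen from ns, or 0 if it does not map. */
int pid_nr_ns(const struct pid *pid, const struct pid_namespace *ns)
{
	if (pid && ns && ns->level >= 0 && ns->level <= pid->level) {
		const struct upid *upid = &pid->numbers[ns->level];

		if (upid->ns == ns)
			return upid->nr;
	}
	return 0;
}
```

A pid allocated only in the initial namespace has no entry for a child
namespace, so a lookup from inside that child yields 0.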

For the case of unshare, where we are designed to be used with PAM, I don't
think my proposed semantics work.  For a join, needing an extra fork before
you are really in the pid namespace should be minor.

> The way I see it:
>
> As far as the numbers reported to the userspace are concerned:
> 1. task, that enters is still visible by its old parent by old pid
> 2. task, that enters gets some pid within the entering namespace
>    and reports its parent pid to have pid 1 (init obviously doesn't
>    care)
> 3. we _can_ try to allocate new pid equal to the old one so that
>    glibc stays happy
>
>
> As far as the pointers are concerned:
> 1. parent pointer doesn't change
> 2. task_pid(tsk) one (i.e. struct pid * one) _can_ change if
>    a) we don't allow threads enter (de_thread problem is handeled)
>    b) we don't allow leave the group/session, i.e. check, that there
>       is the only one task that enters lives in its pgid/sid
>    c) we wait for the quiescent state to pass by before destroying
>       the old pid to handle race with sys_kill()

That doesn't handle the case of cached struct pids.  A good example is
waitpid, where it waits for a specific struct pid.  Which means that
allocating a new struct pid and changing task->pid will cause
waitpid(pid) to wait forever...

To change struct pid would require the refcount on struct pid to show
no references from anywhere except the task_struct.
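
The cached struct pid problem can be sketched in userspace as follows.
This is only a toy model under the assumption that a waiter compares
struct pid pointers rather than numbers; struct waiter, wait_matches()
and replace_pid() are hypothetical names made up for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: identity is the address of struct pid, not the number. */
struct pid {
	int nr;
};

struct task {
	struct pid *pid;
	bool exited;
};

struct waiter {
	const struct pid *target;       /* cached when the wait started */
};

/* True if the waiter would be woken by this task exiting. */
bool wait_matches(const struct waiter *w, const struct task *t)
{
	return t->exited && t->pid == w->target;
}

/* Hypothetical "enter": give the task a fresh struct pid, same number. */
void replace_pid(struct task *t, struct pid *fresh)
{
	fresh->nr = t->pid->nr;
	t->pid = fresh;
}
```

Once replace_pid() has run, the number is unchanged but the waiter's
cached pointer no longer matches, so the exit is never noticed.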

At the cost of a little memory we can solve that problem for unshare
if we have an extra upid in struct pid; how we verify there is space
in struct pid I'm not certain.

I do think that, at least until someone calls exec, the namespace pids
reported to the process itself should not change.  That is kill and
waitpid etc.  Which suggests an implementation the opposite of what
I proposed.  With ns_of_pid(task_pid(current)) being used as the
pid namespace of children, and current->nsproxy->pid_ns not changing
in the case of unshare.

Shrug.

Or perhaps this is a case where we can implement join with
an extra process but we can't implement unshare, because the effect
cannot be immediate.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2
       [not found]                                 ` <m18wagy9f3.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-03 20:29                                     ` Jonathan Corbet
  0 siblings, 0 replies; 184+ messages in thread
From: Jonathan Corbet @ 2010-03-03 20:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Quick question:

> +void set_namespace(unsigned long nstype, void *ns)
> +{
> +	struct task_struct *tsk = current;
> +	struct nsproxy *new_nsproxy;
> +
> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
> +	switch(nstype) {
> +	case NSTYPE_NET:
> +		put_net(new_nsproxy->net_ns);
> +		new_nsproxy->net_ns = get_net(ns);
> +		break;
> +	}
> +
> +	switch_task_namespaces(tsk, new_nsproxy);
> +}

I assume that, at some future point when more than one namespace type
is supported, there will be a check to ensure that the type of the given
namespace matches nstype?  I can imagine all kinds of mayhem that could
result in the case of an accidental (or intentional) mismatch.

Actually, why does setns() require the nstype parameter at all?  A
namespace fd is certainly going to have to know what sort of namespace
it represents...

Thanks,

jon

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2
@ 2010-03-03 20:29                                     ` Jonathan Corbet
  0 siblings, 0 replies; 184+ messages in thread
From: Jonathan Corbet @ 2010-03-03 20:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Quick question:

> +void set_namespace(unsigned long nstype, void *ns)
> +{
> +	struct task_struct *tsk = current;
> +	struct nsproxy *new_nsproxy;
> +
> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
> +	switch(nstype) {
> +	case NSTYPE_NET:
> +		put_net(new_nsproxy->net_ns);
> +		new_nsproxy->net_ns = get_net(ns);
> +		break;
> +	}
> +
> +	switch_task_namespaces(tsk, new_nsproxy);
> +}

I assume that, at some future point when more than one namespace type
is supported, there will be a check to ensure that the type of the given
namespace matches nstype?  I can imagine all kinds of mayhem that could
result in the case of an accidental (or intentional) mismatch.

Actually, why does setns() require the nstype parameter at all?  A
namespace fd is certainly going to have to know what sort of namespace
it represents...

Thanks,

jon

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2
       [not found]                                     ` <20100303132931.11afb659-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org>
@ 2010-03-03 20:50                                       ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 20:50 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Ben Greear, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Daniel Lezcano

Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org> writes:

> Quick question:
>
>> +void set_namespace(unsigned long nstype, void *ns)
>> +{
>> +	struct task_struct *tsk = current;
>> +	struct nsproxy *new_nsproxy;
>> +
>> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
>> +	switch(nstype) {
>> +	case NSTYPE_NET:
>> +		put_net(new_nsproxy->net_ns);
>> +		new_nsproxy->net_ns = get_net(ns);
>> +		break;
>> +	}
>> +
>> +	switch_task_namespaces(tsk, new_nsproxy);
>> +}
>
> I assume that, at some future point when more than one namespace type
> is supported, there will be a check to ensure that the type of the given
> namespace matches nstype?  I can imagine all kinds of mayhem that could
> result in the case of an accidental (or intentional) mismatch.
>
> Actually, why does setns() require the nstype parameter at all?  A
> namespace fd is certainly going to have to know what sort of namespace
> it represents...

But userspace might not know for certain and want to check that it is
getting what it expected.  It could be confusing if you think you are
changing your network stack and all of a sudden sysv ipc shared memory
was changed instead.

As for the check that nstype is valid that happens earlier in setns.

The plan is to post a patch series with all of the namespace types.


Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2
  2010-03-03 20:29                                     ` Jonathan Corbet
  (?)
@ 2010-03-03 20:50                                     ` Eric W. Biederman
  -1 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 20:50 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: hadi, Daniel Lezcano, Patrick McHardy, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear,
	Serge Hallyn, Matt Helsley

Jonathan Corbet <corbet@lwn.net> writes:

> Quick question:
>
>> +void set_namespace(unsigned long nstype, void *ns)
>> +{
>> +	struct task_struct *tsk = current;
>> +	struct nsproxy *new_nsproxy;
>> +
>> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
>> +	switch(nstype) {
>> +	case NSTYPE_NET:
>> +		put_net(new_nsproxy->net_ns);
>> +		new_nsproxy->net_ns = get_net(ns);
>> +		break;
>> +	}
>> +
>> +	switch_task_namespaces(tsk, new_nsproxy);
>> +}
>
> I assume that, at some future point when more than one namespace type
> is supported, there will be a check to ensure that the type of the given
> namespace matches nstype?  I can imagine all kinds of mayhem that could
> result in the case of an accidental (or intentional) mismatch.
>
> Actually, why does setns() require the nstype parameter at all?  A
> namespace fd is certainly going to have to know what sort of namespace
> it represents...

But userspace might not know for certain and want to check that it is
getting what it expected.  It could be confusing if you think you are
changing your network stack and all of a sudden sysv ipc shared memory
was changed instead.

As for the check that nstype is valid that happens earlier in setns.
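
Roughly, the kind of type check being discussed could look like the
sketch below.  It is a userspace illustration only: struct ns_handle,
check_nstype() and NSTYPE_IPC are hypothetical names (the quoted hunk
only has NSTYPE_NET), and how the real setns code performs its check is
not shown in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical type tags; only NSTYPE_NET appears in the quoted patch. */
enum nstype { NSTYPE_NET = 1, NSTYPE_IPC = 2 };

/* Hypothetical handle standing in for what a namespace fd refers to. */
struct ns_handle {
	enum nstype type;
	void *ns;
};

/* Return 0 if the handle has the type the caller asked for, else -EINVAL. */
int check_nstype(enum nstype wanted, const struct ns_handle *h)
{
	if (!h || h->type != wanted)
		return -EINVAL;
	return 0;
}
```

With a check like this, asking for a network namespace while holding an
ipc namespace handle fails up front instead of silently switching the
wrong namespace.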

The plan is to post a patch series with all of the namespace types.


Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                             ` <4B8AE8C1.1030305-GANU6spQydw@public.gmane.org>
                                                                                 ` (2 preceding siblings ...)
  2010-03-02 15:03                                                               ` Pavel Emelyanov
@ 2010-03-03 20:59                                                               ` Oren Laadan
  3 siblings, 0 replies; 184+ messages in thread
From: Oren Laadan @ 2010-03-03 20:59 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Eric W. Biederman, Ben Greear



Daniel Lezcano wrote:
> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>
>>> Eric W. Biederman wrote:
>>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>
>>>>> Eric W. Biederman wrote:
>>>>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>>>
>>>>>>> Thanks. What's the problem with setns?
>>>>>> joining a preexisting namespace is roughly the same problem as
>>>>>> unsharing a namespace.  We simply haven't figure out how to do it
>>>>>> safely for the pid and the uid namespaces.
>>>>> The pid may change after this for sure. What problems do you know
>>>>> about it? What if we try to allocate the same PID in a new space
>>>>> or return -EBUSY? This will be a good starting point. If we manage
>>>>> to fix it later this will not break the API at all.
>>>> Parentage.  The pid is the identity of a process and all kinds of things
>>>> make assumptions in all kinds of strange places.  I don't see how
>>>> waitpid can work if you change the pid.
>>> Agree. But what if we enter a pid space, which is a subnamespace of a current
>>> one? In that case parent will still see the task by its old pid. We can restrict
>>> first version of entering with this rule as well and this restriction will not
>>> block us in typical usecase (I mean enter a container from a host).
>> When I was thinking about pid namespaces and unshare last time.  The idea I came
>> to was we unshare of the pid namespace should only affect which pid namespace
>> your children are in.
>>
>> I remember that do that there were a few cases where you would have to access
>> task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
>> simple.
>>
>>>> glibc doesn't cope if you change someones pid.
>>> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>>>
>>> My aim is to provide even a restricted enter. For most of the cases this
>>> should work and make our lives easier. So two restrictions currently:
>>> a) enter a sub namespace
>>> b) allocate the same pid as we have now
>>>
>>> Hm? :)
>> Replacing struct pid is guaranteed to do all kinds of nasty things with
>> signal handling and the like, de_thread is nasty enough and you are talking
>> something worse.  So if we can change pid namespaces without changing
>> the pid I am for it.
> 
> I agree with all the points you and Pavel you talked about but I don't 
> feel comfortable to have the current process to switch the pid namespace 
> because of the process tree hierarchy (what will be the parent of the 
> process when you enter the pid namespace for example). What is the 
> difference with the sys_bindns or the sys_hijack, proposed a couple of 
> years ago ?
> 
> I did a suggestion some weeks ago about a new syscall 'cloneat' where 
> the child process becomes the child of the targeted process specified in 
> the syscall. Maybe it would be interesting to replace the 'setns' by, or 
> add, a 'cloneat' syscall with the file descriptor passed as parameter. 
> The copy_process function shall not use the nsproxy of the caller but 
> the one provided in the fd argument.
> 
> The newly created process becomes the child of the process where we 
> retrieve the namespace with nsfd and this one have to 'waitpid' it, (the 
> caller of 'cloneat' can not wait it). It's a bit similar with the 
> CLONE_PARENT flag, except the creation order is inverted (the father 
> creates for the child).
> 
> So when entering the container, we specify the pid 1 of the container 
> which is usually a child reaper.
> 
> Does it make sense ?

For what it's worth, I think that this suggestion (cloneat) is so far
the cleanest way to allow a process to enter an existing namespace.

Oren.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-02-28 22:05                                                           ` Daniel Lezcano
                                                                               ` (3 preceding siblings ...)
  2010-03-02 15:03                                                             ` Pavel Emelyanov
@ 2010-03-03 20:59                                                             ` Oren Laadan
  2010-03-03 21:05                                                               ` Eric W. Biederman
       [not found]                                                               ` <4B8ECD99.3040107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  4 siblings, 2 replies; 184+ messages in thread
From: Oren Laadan @ 2010-03-03 20:59 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Eric W. Biederman, Pavel Emelyanov, Linux Netdev List,
	containers, Netfilter Development Mailinglist, Ben Greear



Daniel Lezcano wrote:
> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>>> Eric W. Biederman wrote:
>>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>>
>>>>> Eric W. Biederman wrote:
>>>>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>>>>
>>>>>>> Thanks. What's the problem with setns?
>>>>>> joining a preexisting namespace is roughly the same problem as
>>>>>> unsharing a namespace.  We simply haven't figure out how to do it
>>>>>> safely for the pid and the uid namespaces.
>>>>> The pid may change after this for sure. What problems do you know
>>>>> about it? What if we try to allocate the same PID in a new space
>>>>> or return -EBUSY? This will be a good starting point. If we manage
>>>>> to fix it later this will not break the API at all.
>>>> Parentage.  The pid is the identity of a process and all kinds of things
>>>> make assumptions in all kinds of strange places.  I don't see how
>>>> waitpid can work if you change the pid.
>>> Agree. But what if we enter a pid space, which is a subnamespace of a current
>>> one? In that case parent will still see the task by its old pid. We can restrict
>>> first version of entering with this rule as well and this restriction will not
>>> block us in typical usecase (I mean enter a container from a host).
>> When I was thinking about pid namespaces and unshare last time.  The idea I came
>> to was we unshare of the pid namespace should only affect which pid namespace
>> your children are in.
>>
>> I remember that do that there were a few cases where you would have to access
>> task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
>> simple.
>>
>>>> glibc doesn't cope if you change someones pid.
>>> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>>>
>>> My aim is to provide even a restricted enter. For most of the cases this
>>> should work and make our lives easier. So two restrictions currently:
>>> a) enter a sub namespace
>>> b) allocate the same pid as we have now
>>>
>>> Hm? :)
>> Replacing struct pid is guaranteed to do all kinds of nasty things with
>> signal handling and the like, de_thread is nasty enough and you are talking
>> something worse.  So if we can change pid namespaces without changing
>> the pid I am for it.
> 
> I agree with all the points you and Pavel you talked about but I don't 
> feel comfortable to have the current process to switch the pid namespace 
> because of the process tree hierarchy (what will be the parent of the 
> process when you enter the pid namespace for example). What is the 
> difference with the sys_bindns or the sys_hijack, proposed a couple of 
> years ago ?
> 
> I did a suggestion some weeks ago about a new syscall 'cloneat' where 
> the child process becomes the child of the targeted process specified in 
> the syscall. Maybe it would be interesting to replace the 'setns' by, or 
> add, a 'cloneat' syscall with the file descriptor passed as parameter. 
> The copy_process function shall not use the nsproxy of the caller but 
> the one provided in the fd argument.
> 
> The newly created process becomes the child of the process where we 
> retrieve the namespace with nsfd and this one have to 'waitpid' it, (the 
> caller of 'cloneat' can not wait it). It's a bit similar with the 
> CLONE_PARENT flag, except the creation order is inverted (the father 
> creates for the child).
> 
> So when entering the container, we specify the pid 1 of the container 
> which is usually a child reaper.
> 
> Does it make sense ?

For what it's worth, I think that this suggestion (cloneat) is so far
the cleanest way to allow a process to enter an existing namespace.

Oren.


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                               ` <4B8ECD99.3040107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-03 21:05                                                                 ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 21:05 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:

> Daniel Lezcano wrote:
>>
>> I agree with all the points you and Pavel you talked about but I don't feel
>> comfortable to have the current process to switch the pid namespace because of
>> the process tree hierarchy (what will be the parent of the process when you
>> enter the pid namespace for example). What is the difference with the
>> sys_bindns or the sys_hijack, proposed a couple of years ago ?
>>
>> I did a suggestion some weeks ago about a new syscall 'cloneat' where the
>> child process becomes the child of the targeted process specified in the
>> syscall. Maybe it would be interesting to replace the 'setns' by, or add, a
>> 'cloneat' syscall with the file descriptor passed as parameter. The
>> copy_process function shall not use the nsproxy of the caller but the one
>> provided in the fd argument.
>>
>> The newly created process becomes the child of the process where we retrieve
>> the namespace with nsfd and this one have to 'waitpid' it, (the caller of
>> 'cloneat' can not wait it). It's a bit similar with the CLONE_PARENT flag,
>> except the creation order is inverted (the father creates for the child).
>>
>> So when entering the container, we specify the pid 1 of the container which is
>> usually a child reaper.
>>
>> Does it make sense ?
>
> For what it's worth, I think that this suggestion (cloneat) is so far
> the cleanest way to allow a process to enter an existing namespace.

If the goal is to enter a container you are probably right.  I don't
think I have seen how scary the cloneat code is.

At least for the network namespace there is a lot of value in being
able to just change that single namespace.  Having multiple logical
network stacks has its challenges but has a lot of practical
applications.  Especially when there is the possibility of private
ipv4 addresses overlapping, or you have interfaces where you never
want to forward between them but you want forwarding enabled.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-03 20:59                                                             ` Oren Laadan
@ 2010-03-03 21:05                                                               ` Eric W. Biederman
       [not found]                                                                 ` <m18wa9glpo.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                                                               ` <4B8ECD99.3040107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 1 reply; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-03 21:05 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Daniel Lezcano, Pavel Emelyanov, Linux Netdev List, containers,
	Netfilter Development Mailinglist, Ben Greear

Oren Laadan <orenl@cs.columbia.edu> writes:

> Daniel Lezcano wrote:
>>
>> I agree with all the points you and Pavel you talked about but I don't feel
>> comfortable to have the current process to switch the pid namespace because of
>> the process tree hierarchy (what will be the parent of the process when you
>> enter the pid namespace for example). What is the difference with the
>> sys_bindns or the sys_hijack, proposed a couple of years ago ?
>>
>> I did a suggestion some weeks ago about a new syscall 'cloneat' where the
>> child process becomes the child of the targeted process specified in the
>> syscall. Maybe it would be interesting to replace the 'setns' by, or add, a
>> 'cloneat' syscall with the file descriptor passed as parameter. The
>> copy_process function shall not use the nsproxy of the caller but the one
>> provided in the fd argument.
>>
>> The newly created process becomes the child of the process where we retrieve
>> the namespace with nsfd and this one have to 'waitpid' it, (the caller of
>> 'cloneat' can not wait it). It's a bit similar with the CLONE_PARENT flag,
>> except the creation order is inverted (the father creates for the child).
>>
>> So when entering the container, we specify the pid 1 of the container which is
>> usually a child reaper.
>>
>> Does it make sense ?
>
> For what it's worth, I think that this suggestion (cloneat) is so far
> the cleanest way to allow a process to enter an existing namespace.

If the goal is to enter a container you are probably right.  I don't
think I have seen how scary the cloneat code is.

At least for the network namespace there is a lot of value in being
able to just change that single namespace.  Having multiple logical
network stacks has its challenges but has a lot of practical
applications.  Especially when there is the possibility of private
ipv4 addresses overlapping, or you have interfaces where you never
want to forward between them but you want forwarding enabled.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                           ` <m13a0hmblr.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-04 21:45                                                                             ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-04 21:45 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>
>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>> Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
>>> 
>>> > Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
>>> > | 
>>> > | I think replacing a struct pid for another struct pid allocated in
>>> > | descendant pid_namespace (but has all of the same struct upid values
>>> > | as the first struct pid) is a disastrous idea.  It destroys the
>>> >
>>> > True. Sorry, I did not mean we would need a new 'struct pid' for an
>>> > existing process. I think we talked earlier of finding a way of attaching
>>> > additional pid numbers to the same struct pid.
>>> 
>>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>>> implementation is trivial.
>>
>> Heh, and then (browsing through your copy_process() patch hunks) the next
>> forked task becomes the child reaper for the new pidns?  <shrug>  why not
>> I guess.
>>
>> Now if that child reaper then gets killed, will the idle task get killed too?
>
> No.
>
>> And if not, then the idle task can just re-populate the new pidns with new
>> idle tasks...
>
> After zap_pid_namespace interesting...
>
>> If this brought us a step closer to entering an existing pidns that would
>> be one thing, but is there actually any advantage to being able to
>> unshare a new pidns?  Oh, I guess there is - PAM can then use it at
>> login, which might be neat.
>
> I have to say that the semantics of my patch are unworkable for
> unshare.  Unless I am mistaken for PAM to use it requires that the
> current process fully change and become what it needs to be.
> Requiring an extra fork to fully complete the process is a problem.
>
> Scratch one bright idea.

Maybe not.  I just looked and in the vast majority of cases the login
process goes like this.

{
	setup stuff, including pam
	child = fork();
	if (!child) {
		setuid();
		exec /bin/bash
	}
	waitpid(child);

	pam and other cleanup
}

So an unshare of the pid namespace that doesn't really take effect
until we fork may actually be usable from pam, and in fact is probably
the preferred implementation.  It looks like neither openssh nor login
from util-linux-ng will cope properly with getting any pid back from
wait() except the pid of their child.  It looks like they both will
terminate.  Which means if you log in to a new pid namespace (where the
unsharing process becomes pid 1) and call nohup, everything will get
killed and you will be logged out.
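
The login flow described above could be sketched in C roughly as follows.
This is only an illustration under stated assumptions: run_session() and its
want_new_pidns flag are hypothetical names, not code from openssh or
util-linux-ng, and the sketch assumes the semantics discussed in this message,
where unshare(CLONE_NEWPID) leaves the caller's own pid alone and only the
next forked child lands in the new pid namespace.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for unshare() and CLONE_NEWPID */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv the way a login program runs a session.
 * Returns the child's exit status, or -1 on error.
 */
int run_session(char *const argv[], int want_new_pidns)
{
	assert(argv && argv[0]);

	/*
	 * Assumed semantics: the caller keeps its pid; only children
	 * forked after this call are placed in the new pid namespace.
	 * Without privilege this fails (EPERM) and we simply carry on
	 * in the old namespace.
	 */
	if (want_new_pidns && unshare(CLONE_NEWPID) == -1)
		perror("unshare(CLONE_NEWPID)");

	pid_t child = fork();
	if (child == -1)
		return -1;
	if (child == 0) {
		/* A real login program would drop privileges here. */
		execvp(argv[0], argv);
		_exit(127);	/* exec failed */
	}

	/* The parent only ever waits for the pid it forked itself. */
	int status;
	pid_t got;
	do {
		got = waitpid(child, &status, 0);
	} while (got == -1 && errno == EINTR);

	if (got != child || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Calling run_session(argv, 0) skips the unshare and behaves like the plain
fork/exec/waitpid pattern in the pseudocode; passing 1 adds the unshare step
before the fork.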

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                             ` <m1pr3j92x8.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-04 22:55                                                                               ` Jan Engelhardt
  0 siblings, 0 replies; 184+ messages in thread
From: Jan Engelhardt @ 2010-03-04 22:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu


On Thursday 2010-03-04 22:45, Eric W. Biederman wrote:
>
>So an unshare of the pid namespace that doesn't really take effect
>until we fork may actually be usable from pam, and in fact is probably
>the preferred implementation.  It looks like neither openssh nor login
>from util-linux-ng will cope properly with getting any pid back from
>wait() except the pid of their child.

Correct; I can tell from experience with pam_mount. GDM for example is 
very unhappy if you fork/exit processes in PAM modules and don't hide 
the fact by bending SIGCHLD from gdm_handler to mypam_handler (which 
itself is racy, suppose GDM re-set the SIGCHLD handler midway through).

(In this particular case however, I'd prefer if login programs like GDM 
just ignored any PIDs they did not spawn in the first place instead of 
moaning around.)
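
A wait loop that ignores pids the program did not spawn might look like the
sketch below. It is not taken from GDM or any other login program;
wait_for_own_child() is a hypothetical name, and the only assumption is that
the caller remembers the pid returned by its own fork().

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until `expected` terminates, quietly
 * reaping any other child that turns up first (for example one forked
 * by a PAM module). Returns the exit status of `expected`, or -1 if it
 * was killed by a signal or is not a child of the caller.
 */
int wait_for_own_child(pid_t expected)
{
	assert(expected > 0);

	for (;;) {
		int status;
		pid_t got = wait(&status);

		if (got == -1) {
			if (errno == EINTR)
				continue;
			return -1;	/* ECHILD: no children left */
		}
		if (got != expected)
			continue;	/* not a pid we spawned: ignore it */

		return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
	}
}
```

The design choice here is to treat an unexpected pid as noise rather than as
an error, which is the behaviour asked for in the parenthetical remark above.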

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                           ` <m17hptjh3m.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-05 19:18                                                                             ` Pavel Emelyanov
  0 siblings, 0 replies; 184+ messages in thread
From: Pavel Emelyanov @ 2010-03-05 19:18 UTC (permalink / raw)
  To: Eric W. Biederman, Sukadev Bhattiprolu
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear

> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
> You have pid 0 because your pid simply does not map.

Oh, I see.

> There is nothing that makes two parallel enters impossible in that.
> Even today we have one thread per cpu that has task->pid == &init_struct_pid
> which is pid 0.

How about the forked processes then? Who will be their parent?

> For the case of unshare, where it is designed to be used with PAM, I don't
> think my proposed semantics work.  For a join, needing an extra fork before
> you are really in the pid namespace should be minor.

Hm... One more proposal - can we adopt the planned new fork_with_pids system
call to fork the process right into a new pid namespace?

> That doesn't handle the case of cached struct pids.  A good example is
> waitpid, where it waits for a specific struct pid.  Which means that
> allocating a new struct pid and changing task->pid will cause
> waitpid(pid) to wait forever...

OK. Good example. Thanks.

> To change struct pid would require the refcount on struct pid to show
> no references from anywhere except the task_struct.

I think it is OK to return -EBUSY for this, and to fix waitpid
accordingly so it does not block in this common case. All the others,
I think, can stay as they are.

> At the cost of a little memory we can solve that problem for unshare
> if we have an extra upid in struct pid; how we verify there is space
> in struct pid I'm not certain.
> 
> I do think that at least until someone calls exec, the namespace pids
> reported to the process itself should not change.  That is kill and

Wait a second - in that case the wait will be blocked too! No?

> waitpid etc.  Which suggests an implementation the opposite of what
> I proposed.  With ns_of_pid(task_pid(current)) being used as the
> pid namespace of children, and current->nsproxy->pid_ns not changing
> in the case of unshare.
> 
> Shrug.
> 
> Or perhaps this is a case where we can implement join with
> an extra process but we can't implement unshare, because the effect
> cannot be immediate.

Well, I'm talking only about the join now.

> Eric
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                             ` <4B9158F5.5040205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-03-05 20:26                                                                               ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-05 20:26 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

>> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
>> You have pid 0 because your pid simply does not map.
>
> Oh, I see.
>
>> There is nothing that makes two parallel enters impossible in that.
>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>> which is pid 0.
>
> How about the forked processes then? Who will be their parent?

The normal rules of parentage apply.  So the child will simply
see its parent as ppid == 0.  If that child daemonizes it will become
a child of the pid namespace's init.

This is a lot like something that gets started from call_usermodehelper.  Its
parent process is not a descendant of init either.


The implementation of the join is to simply change current->nsproxy->pid_ns.
Then to use it you simply fork to get a child in the target pid namespace.

>> For the case of unshare, where it is designed to be used with PAM, I don't
>> think my proposed semantics work.  For a join, needing an extra fork before
>> you are really in the pid namespace should be minor.
>
> Hm... One more proposal - can we adopt the planned new fork_with_pids system
> call to fork the process right into a new pid namespace?

In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
don't think anything I am doing fundamentally undermines it.  The use
case of doing things in fork is that there is automatic inheritance of
everything.  All of the namespaces and all of the control groups, and
possibly also the parent process.  It does have the high cost that the
process we are copying from must be stopped because there are no locks
that let us take everything.  I haven't looked at the recent proposals
to see if anyone has solved that problem cleanly.



If we can do a sys_hijack/sys_cloneat style of join, that means we can
afford a fork.  At which point my proposed pid namespace semantics
should be fine.

aka:
setns(NSTYPE_PID);
pid = fork();
if (pid == 0) {
	getpid() == 2;	/* Or whatever the first free pid is in the joined pid namespace */
	getppid() == 0;
} else {
	pid == 6400;	/* Or whatever the first free pid is in the original pid namespace */
	waitpid(pid);
}
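
For comparison, the same join-then-fork pattern can be sketched against the
setns(2) call that current Linux provides, which takes a namespace file
descriptor and a CLONE_NEW* type rather than the setns(NSTYPE_PID) form
written above. The helper name fork_into_pidns() and the idea of passing a
path such as /proc/<pid>/ns/pid are assumptions made for illustration, and
the call needs sufficient privilege to succeed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for setns() and CLONE_NEWPID */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: join the pid namespace named by ns_path and
 * fork one child into it. Returns the child's pid as seen from the
 * caller's original namespace, or -1 on error.
 */
pid_t fork_into_pidns(const char *ns_path)
{
	assert(ns_path != NULL);

	int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (fd == -1)
		return -1;

	/* Affects only children created from now on; our own pid stays. */
	int rc = setns(fd, CLONE_NEWPID);
	close(fd);
	if (rc == -1)
		return -1;

	pid_t pid = fork();
	if (pid == 0) {
		/*
		 * The child gets a pid in the joined namespace; getppid()
		 * reports 0 when the parent lives in a different one.
		 */
		dprintf(STDOUT_FILENO, "child: pid=%d ppid=%d\n",
			(int)getpid(), (int)getppid());
		_exit(0);
	}
	if (pid > 0)
		waitpid(pid, NULL, 0);	/* parent waits on the pid it saw */
	return pid;
}
```

Note that a successful call changes where all later children of the caller
are created, so a program that wants only one process inside the target
namespace would typically do this in a short-lived helper process.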

>> That doesn't handle the case of cached struct pids.  A good example is
>> waitpid, where it waits for a specific struct pid.  Which means that
>> allocating a new struct pid and changing task->pid will cause
>> waitpid(pid) to wait forever...
>
> OK. Good example. Thanks.
>
>> To change struct pid would require the refcount on struct pid to show
>> no references from anywhere except the task_struct.
>
> I think it is OK to return -EBUSY for this, and to fix waitpid
> accordingly so it does not block in this common case. All the others,
> I think, can stay as they are.

That would probably work.  setsid() and setpgrp() have similar sorts
of restrictions.  That is both more challenging and more limiting than
the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I
would prefer to keep this sort of thing as a last resort.

>> At the cost of a little memory we can solve that problem for unshare
>> if we have an extra upid in struct pid; how we verify there is space
>> in struct pid I'm not certain.
>> 
>> I do think that at least until someone calls exec, the namespace pids
>> reported to the process itself should not change.  That is kill and
>
> Wait a second - in that case the wait will be blocked too! No?

If all we do is populate an unused struct upid in struct pid there
isn't a chance of a problem.  
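
Why filling an unused slot is harmless to existing references can be seen in
a small userspace model. This is a simplified sketch, not kernel code: the
kernel's struct pid does carry one struct upid per namespace level, but
MAX_LEVELS, pid_nr_in() and pid_add_level() are names invented here, and
reference counting and locking are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4	/* arbitrary capacity chosen for this model */

struct pid_ns {			/* simplified stand-in for pid_namespace */
	int level;
	const struct pid_ns *parent;
};

struct upid {
	int nr;				/* the number seen in `ns` */
	const struct pid_ns *ns;
};

struct pid {
	int level;			/* deepest populated slot */
	struct upid numbers[MAX_LEVELS];	/* one entry per level */
};

/* Number of `pid` as seen from `ns`, or 0 if it has none there. */
int pid_nr_in(const struct pid *pid, const struct pid_ns *ns)
{
	if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}

/*
 * Populate the next unused slot in place. The struct pid itself is not
 * replaced, so anyone holding a pointer to it still sees the old
 * numbers unchanged. Returns 0 on success, -1 if there is no room or
 * `child_ns` is not a direct child of the current deepest namespace.
 */
int pid_add_level(struct pid *pid, const struct pid_ns *child_ns, int nr)
{
	assert(pid->level >= 0 && pid->level < MAX_LEVELS);

	if (pid->level + 1 >= MAX_LEVELS)
		return -1;
	if (child_ns->level != pid->level + 1 ||
	    child_ns->parent != pid->numbers[pid->level].ns)
		return -1;

	pid->level++;
	pid->numbers[pid->level].nr = nr;
	pid->numbers[pid->level].ns = child_ns;
	return 0;
}
```

In this model a waiter that cached a pointer to the struct pid before
pid_add_level() ran still resolves the same number in the original namespace
afterwards, which is the property a waitpid() on a cached struct pid relies on.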

>> waitpid etc.  Which suggests an implementation the opposite of what
>> I proposed.  With ns_of_pid(task_pid(current)) being used as the
>> pid namespace of children, and current->nsproxy->pid_ns not changing
>> in the case of unshare.
>> 
>> Shrug.
>> 
>> Or perhaps this is a case where we can implement join with
>> an extra process but we can't implement unshare, because the effect
>> cannot be immediate.
>
> Well, I'm talking only about the join now.

Overall it sounds like the semantics I have proposed with
unshare(CLONE_NEWPID) are workable, and simple to implement.  The
extra fork is a bit surprising but it certainly does not
look like a show stopper for implementing a pid namespace join.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-05 20:26                                                                             ` Eric W. Biederman
@ 2010-03-06 14:47                                                                               ` Daniel Lezcano
       [not found]                                                                                 ` <4B926B1B.5070207-GANU6spQydw@public.gmane.org>
  0 siblings, 1 reply; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-06 14:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
>   
>>> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
>>> You have pid 0 because your pid simply does not map.
>>>       
>> Oh, I see.
>>
>>     
>>> There is nothing that makes two parallel enters impossible in that.
>>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>>> which is pid 0.
>>>       
>> How about the forked processes then? Who will be their parent?
>>     
>
> The normal rules of parentage apply.  So the child will simply
> see its parent as ppid == 0.  If that child daemonizes it will become
> a child of the pid namespace's init.
>
> This is a lot like something that gets started from call_usermodehelper.  Its
> parent process is not a descendant of init either.
>
>
> The implementation of the join is to simply change current->nsproxy->pid_ns.
> Then to use it you simply fork to get a child in the target pid namespace.
>   
If the normal rules of parentage apply, that means pid 0 has to wait
for its child.
If we are in the scenario of pid 0 with its child pid 1234, and we kill
pid 1 of the pid namespace, I suppose pid 1234 will be killed too.
Pid 0 will stay in the pid namespace and will be able to fork a
new pid 1 again.

I think Serge already reported that...

That sounds good :)
>>> For the case of unshare, where it is designed to be used with PAM, I don't
>>> think my proposed semantics work.  For a join, needing an extra fork before
>>> you are really in the pid namespace should be minor.
>>>       
>> Hm... One more proposal - can we adopt the planned new fork_with_pids system
>> call to fork the process right into a new pid namespace?
>>     
>
> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
> don't think anything I am doing fundamentally undermines it.  The use
> case of doing things in fork is that there is automatic inheritance of
> everything.  All of the namespaces and all of the control groups, and
> possibly also the parent process.  
And also the rootfs for executing the command inside the container (eg. 
shutdown), the uid/gid (if there is a user namespace), the mount points, ...
But I suppose we can do the same with setns for all the namespaces and 
chrooting within the container rootfs.

What I see is a problem with the tty. For example, we cloneat the init 
process of the container which is usually /sbin/init but this one has 
its tty mapped to /dev/console, so the output of the exec'ed command 
will go to the console.
> It does have the high cost that the
> process we are copying from must be stopped because there are no locks
> that let us take everything.  I haven't looked at the recent proposals
> to see if anyone has solved that problem cleanly.
>   
Right.

> If we can do a sys_hijack/sys_cloneat style of join, that means we can
> afford a fork.  At which point my proposed pid namespace semantics
> should be fine.
>
> aka:
> setns(NSTYPE_PID);
> pid = fork();
> if (pid == 0) {
> 	getpid() == 2;	/* Or whatever the first free pid is in the joined pid namespace */
> 	getppid() == 0;
> } else {
> 	pid == 6400;	/* Or whatever the first free pid is in the original pid namespace */
> 	waitpid(pid);
> }
>
>   
>>> That doesn't handle the case of cached struct pids.  A good example is
>>> waitpid, where it waits for a specific struct pid.  Which means that
>>> allocating a new struct pid and changing task->pid will cause
>>> waitpid(pid) to wait forever...
>>>       
>> OK. Good example. Thanks.
>>
>>     
>>> To change struct pid would require the refcount on struct pid to show
>>> no references from anywhere except the task_struct.
>>>       
>> I think it is OK to return -EBUSY for this, and to fix waitpid
>> accordingly so it does not block in this common case. All the others,
>> I think, can stay as they are.
>>     
>
> That would probably work.  setsid() and setpgrp() have similar sorts
> of restrictions.  That is both more challenging and more limiting than
> the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I
> would prefer to keep this sort of thing as a last resort.
>
>   
>>> At the cost of a little memory we can solve that problem for unshare
>>> if we have an extra upid in struct pid; how we verify there is space
>>> in struct pid I'm not certain.
>>>
>>> I do think that at least until someone calls exec, the namespace pids
>>> reported to the process itself should not change.  That is kill and
>>>       
>> Wait a second - in that case the wait will be blocked too! No?
>>     
>
> If all we do is populate an unused struct upid in struct pid there
> isn't a chance of a problem.  
>
>   
>>> waitpid etc.  Which suggests an implementation the opposite of what
>>> I proposed.  With ns_of_pid(task_pid(current)) being used as the
>>> pid namespace of children, and current->nsproxy->pid_ns not changing
>>> in the case of unshare.
>>>
>>> Shrug.
>>>
>>> Or perhaps this is a case where we can implement join with
>>> an extra process but we can't implement unshare, because the effect
>>> cannot be immediate.
>>>       
>> Well, I'm talking only about the join now.
>>     
>
> Overall it sounds like the semantics I have proposed with
> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
> extra fork is a bit surprising but it certainly does not
> look like a show stopper for implementing a pid namespace join.
>   
I agree, it's some kind of "ghost" process.
IMO, with a bit of userspace code it would be possible to enter or exec 
a command inside a container with nsfd, setns.
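
A rough sketch of what that bit of userspace code might look like, to
show where the extra fork sits. The caller joins first and keeps its old
pid (the "ghost"), and the forked child is the process that really runs
inside. run_inside(), join_fn and join_nothing() are made-up names; the
join hook only stands in for whatever nsfd/setns calls end up existing.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook: stands in for "join the container's namespaces".
 * Returns 0 on success. */
typedef int (*join_fn)(void);

/* Join first, then fork: the caller keeps its old pid and only acts as
 * the parent, while the child is the process that runs the command.
 * Returns the command's exit status, or -1 on error. */
int run_inside(join_fn join, char *const argv[])
{
	pid_t pid;
	int status;

	assert(argv && argv[0]);

	if (join && join() != 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* only reached if exec failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* No-op join, so the fork/exec/wait path can be tried unprivileged. */
int join_nothing(void)
{
	return 0;
}
```

Called with join_nothing and a command such as /bin/sh -c "exit 7", it
simply hands back the child's exit status; a real join hook would need
the privileges the namespace calls require.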

+1 to test your patchset Eric :)

Just a small suggestion: wouldn't the syscall names "nsopen" / "nsattach"
be clearer?

Jumping back, one question about the nsfd and polling to wait for the
end of the namespace.
If we have an open file descriptor on a specific namespace, we hold
a reference on it, so the namespace won't be destroyed until we
close the fd that is used to poll for the end of the namespace, no? Did I
miss something?

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                 ` <4B926B1B.5070207-GANU6spQydw@public.gmane.org>
@ 2010-03-06 20:48                                                                                   ` Eric W. Biederman
       [not found]                                                                                     ` <m1aaulyy5c.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  0 siblings, 1 reply; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-06 20:48 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> Eric W. Biederman wrote:

> If the normal rules of parentage apply, that means pid 0 has to wait for its child.
> If we are in the scenario of pid 0 with its child pid 1234, and we kill pid 1 of
> the pid namespace, I suppose pid 1234 will be killed too.
> Pid 0 will stay in the pid namespace and will be able to fork a new pid
> 1 again.
>
> I think Serge already reported that...
>
> That sounds good :)

I expect zap_pid_ns_processes should also arrange that we cannot allocate any
more processes.  We certainly need to do something explicit or pid 1 won't
be allocated.  It might make sense to resurrect a pid namespace after its
death but it is definitely weird.

>> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
>> don't think anything I am doing fundamentally undermines it.  The use
>> case of doing things in fork is that there is automatic inheritance of
>> everything.  All of the namespaces and all of the control groups, and
>> possibly also the parent process.  
> And also the rootfs for executing the command inside the container
> (eg. shutdown), the uid/gid (if there is a user namespace), the mount points,
> ...
> But I suppose we can do the same with setns for all the namespaces and chrooting
> within the container rootfs.
>
> What I see is a problem with the tty. For example, we cloneat the init process
> of the container which is usually /sbin/init but this one has its tty mapped to
> /dev/console, so the output of the exec'ed command will go to the console.

My original thinking was that the fd's would come from the caller of sys_cloneat....

>> Overall it sounds like the semantics I have proposed with
>> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
>> extra fork is a bit surprising but it certainly does not
>> look like a show stopper for implementing a pid namespace join.
>>   
> I agree, it's some kind of "ghost" process.
> IMO, with a bit of userspace code it would be possible to enter or exec a
> command inside a container with nsfd, setns.
>
> +1 to test your patchset Eric :)

I will see about reposting sometime soon.

> Just a mindless suggestion, the "nsopen" / "nsattach" syscall names should be
> more clear no ?

Not bad suggestions.

I am going to explore a bit more.  Given that nsfd is using the same
permission checks as a proc file, I think I can just make it a proc
file.  Something like "/proc/<pid>/ns/net".  With a little luck that
won't suck too badly.
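
A rough userspace sketch of how such a proc file might be used. It
assumes the join call takes the open fd plus a type argument, which is
the two-argument setns(fd, nstype) that glibc declares in <sched.h> on
later kernels, not the setns(NSTYPE_PID) form from earlier in this
thread. ns_path() and join_ns_of() are made-up helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/<name>" into buf.  Returns 0 on success, -1 if
 * the name is empty or the buffer is too small. */
int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
	int n;

	if (!name || strlen(name) == 0)
		return -1;
	n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open the namespace file of another task and join that namespace.
 * Needs suitable privilege.  Returns 0 on success, -1 on failure. */
int join_ns_of(pid_t pid, const char *name)
{
	char path[64];
	int fd, ret;

	if (ns_path(path, sizeof(path), pid, name) < 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ret = setns(fd, 0);	/* 0: accept whatever type the fd refers to */
	close(fd);
	return ret;
}
```

The point of the proc file is visible in join_ns_of(): the permission
check happens at open() time like for any other proc file, and the fd is
only needed until the join has been done.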

> Jumping back, one question about the nsfd and the poll for waiting the end of
> the namespace.
> If we have an open file descriptor on a specific namespace, we grab a
> reference on this one, so the namespace won't be destroyed until we close the fd
> which is used to poll the end of the namespace, no ? Did I miss something ?

Not really.  The assumption was that there would be a very similar
file descriptor that we could use with poll.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                     ` <m1aaulyy5c.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-06 21:26                                                                                       ` Daniel Lezcano
       [not found]                                                                                         ` <4B92C886.9020507-GANU6spQydw@public.gmane.org>
  2010-03-08  8:32                                                                                         ` Eric W. Biederman
  0 siblings, 2 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-06 21:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>
>   
>> If the normal rules of parentage apply, that means pid 0 has to wait for its child.
>> If we are in the scenario of pid 0 with its child pid 1234, and we kill pid 1 of
>> the pid namespace, I suppose pid 1234 will be killed too.
>> Pid 0 will stay in the pid namespace and will be able to fork a new pid
>> 1 again.
>>
>> I think Serge already reported that...
>>
>> That sounds good :)
>>     
>
> I expect zap_pid_ns_processes should also arrange that we cannot allocate any
> more processes.  We certainly need to do something explicit or pid 1 won't
> be allocated.  It might make sense to resurrect a pid namespace after its
> death but it is definitely weird.
>   
Mmh, yes. But that was just an idea, maybe a bit outside the scope you
are aiming for.

>>> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
>>> don't think anything I am doing fundamentally undermines it.  The use
>>> case of doing things in fork is that there is automatic inheritance of
>>> everything.  All of the namespaces and all of the control groups, and
>>> possibly also the parent process.  
>>>       
>> And also the rootfs for executing the command inside the container
>> (eg. shutdown), the uid/gid (if there is a user namespace), the mount points,
>> ...
>> But I suppose we can do the same with setns for all the namespaces and chrooting
>> within the container rootfs.
>>
>> What I see is a problem with the tty. For example, we cloneat the init process
>> of the container which is usually /sbin/init but this one has its tty mapped to
>> /dev/console, so the output of the exec'ed command will go to the console.
>>     
>
> My original thinking was that the fd's would come from the caller of sys_cloneat....
Oh, ok :s

>>> Overall it sounds like the semantics I have proposed with
>>> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
>>> extra fork is a bit surprising but it certainly does not
>>> look like a show stopper for implementing a pid namespace join.
>>>   
>>>       
>> I agree, it's some kind of "ghost" process.
>> IMO, with a bit of userspace code it would be possible to enter or exec a
>> command inside a container with nsfd, setns.
>>
>> +1 to test your patchset Eric :)
>>     
>
> I will see about reposting sometime soon.
>   
Great ! thanks.

>> Just a mindless suggestion, the "nsopen" / "nsattach" syscall names should be
>> more clear no ?
>>     
>
> Not bad suggestions.
>
> I am going to explore a bit more.  Given that nsfd is using the same
> permission checks as a proc file, I think I can just make it a proc
> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
> won't suck too badly.
>   
Ah ! yes. Good idea.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-06 21:26                                                                                       ` Daniel Lezcano
       [not found]                                                                                         ` <4B92C886.9020507-GANU6spQydw@public.gmane.org>
@ 2010-03-08  8:32                                                                                         ` Eric W. Biederman
  2010-03-08 16:54                                                                                           ` Daniel Lezcano
                                                                                                             ` (2 more replies)
  1 sibling, 3 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08  8:32 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear


I have taken a snapshot of my development tree and placed it at:


git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git


>> I am going to explore a bit more.  Given that nsfd is using the same
>> permission checks as a proc file, I think I can just make it a proc
>> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
>> won't suck too badly.
>>   
> Ah ! yes. Good idea.

It is a hair more code to use proc files but nothing worth counting.

Probably the biggest thing I am aware of right now in my development
tree is getting uids to pass properly between unix domain sockets.
I wound up writing this cred_to_ucred function.

Serge can you take a look and check my logic, and do you have
any idea of where we should place something like pid_vnr but
for the uid namespace?

void cred_to_ucred(struct pid *pid, const struct cred *cred,
		   struct ucred *ucred)
{
	ucred->pid = pid_vnr(pid);
	ucred->uid = ucred->gid = -1;
	if (cred) {
		struct user_namespace *cred_ns = cred->user->user_ns;
		struct user_namespace *current_ns = current_user_ns();
		struct user_namespace *tmp;

		if (likely(cred_ns == current_ns)) {
			ucred->uid = cred->euid;
			ucred->gid = cred->egid;
		} else {
			/* Is cred in a child user namespace */
			tmp = cred_ns;
			do {
				tmp = tmp->creator->user_ns;
				if (tmp == current_ns) {
					ucred->uid = tmp->creator->uid;
					ucred->gid = overflowgid;
					return;
				}
			} while (tmp != &init_user_ns);

			/* Is cred the creator of my user namespace,
			 * or the creator of one of its parents?
			 */
			for (tmp = current_ns; tmp != &init_user_ns;
			     tmp = tmp->creator->user_ns) {
				if (cred->user == tmp->creator) {
					ucred->uid = 0;
					ucred->gid = 0;
					return;
				}
			}

			/* No user namespace relationship so no mapping */
			ucred->uid = overflowuid;
			ucred->gid = overflowgid;
		}
	}
}

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08  8:32                                                                                         ` Eric W. Biederman
@ 2010-03-08 16:54                                                                                           ` Daniel Lezcano
       [not found]                                                                                             ` <4B952BBE.6070507-GANU6spQydw@public.gmane.org>
  2010-03-08 17:29                                                                                             ` Eric W. Biederman
       [not found]                                                                                           ` <m1fx4bxlfy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-03-08 17:07                                                                                           ` Serge E. Hallyn
  2 siblings, 2 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-08 16:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Eric W. Biederman wrote:
> I have taken a snapshot of my development tree and placed it at:
>
>
> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>   

Hi Eric,

thanks for the pointer.

I tried to boot the kernel under qemu and I got this oops:

Loading /lib/kbd/keymaps/i386/azerty/fr.map
Creating block device nodes.
Creating character device nodes.
Making device-mapper control node
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff812df7e7>] netlink_broadcast+0x1bd/0x384
PGD 3cfd0067 PUD 3cfc1067 PMD 0
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/class/firmware/timeout
CPU 0
Pid: 841, comm: modprobe Not tainted 2.6.33 #1 /
RIP: 0010:[<ffffffff812df7e7>]  [<ffffffff812df7e7>] 
netlink_broadcast+0x1bd/0x384
RSP: 0018:ffff88003cfc3ca8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88003ce947f0 RCX: ffff88003cf877f0
RDX: ffff88003f939dd0 RSI: ffff88003f939ef0 RDI: ffff88003f939e00
RBP: ffff88003cfc3d18 R08: ffff88003f939d98 R09: ffff88003f939e88
R10: ffff88003cf87818 R11: 0000000000000286 R12: ffff88003f939d98
R13: ffff88003f939e88 R14: ffff88003ce94818 R15: ffff88003ce94800
FS:  00007f23a90a06f0(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003cfcd000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 841, threadinfo ffff88003cfc2000, task 
ffff88003d1db058)
Stack:
 0000000000000270 00000000000000d0 ffff88003f9377f0 ffffffff8203d630
<0> ffff88003f939f5c 0000000000000000 0000000000000001 0000000000000000
<0> ffff88003cf03000 0000000000000020 0000000000000004 ffff88003f939e88
Call Trace:
 [<ffffffff8121c1bd>] kobject_uevent_env+0x414/0x59b
 [<ffffffff81117432>] ? sysfs_create_file+0x25/0x27
 [<ffffffff8105eb13>] ? module_add_modinfo_attrs+0xd6/0xfc
 [<ffffffff8121c34f>] kobject_uevent+0xb/0xd
 [<ffffffff8105eba4>] mod_sysfs_setup+0x6b/0x99
 [<ffffffff810602aa>] load_module+0x12a2/0x16f1
 [<ffffffff81060b34>] sys_init_module+0x60/0x230
 [<ffffffff81002928>] system_call_fastpath+0x16/0x1b
Code: 00 ff c8 74 2a 8b 75 98 4c 89 ef e8 22 f8 fd ff 48 85 c0 49 89 c4 
74 3c 48 8d 50 38 48 8b 40 38 48 85 c0 74 0
2 ff 00 48 8b 42 08 <ff> 00 eb 4f 48 8b 55 b0 ff 02 49 83 7d 10 00 74 08 
4c 89 ef e8
RIP  [<ffffffff812df7e7>] netlink_broadcast+0x1bd/0x384
 RSP <ffff88003cfc3ca8>
CR2: 0000000000000000
---[ end trace 979d62b87f68fab3 ]---
netlink_recvmsg: missing NETLINK_CB proto: 15 pid: 0 dsg_group: 1
destructor = (null)
nlmsghdr: len=1080321121 type=27951 flags=646f seq=795176053 pid=1769169779
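
For anyone reading the trace, a small sketch of how the low bits of the
number on the "Oops:" line decode on x86. The PF_* names and
describe_fault() are illustrative and not taken from the kernel tree,
and only the three low bits are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Low bits of the x86 page-fault error code shown on the "Oops:" line. */
#define PF_PROT		0x1	/* 0: page not present, 1: protection fault */
#define PF_WRITE	0x2	/* 0: read access,      1: write access */
#define PF_USER		0x4	/* 0: kernel mode,      1: user mode */

/* Write a one-line description of the error code into buf.
 * Returns what snprintf() returns. */
int describe_fault(unsigned long code, char *buf, size_t len)
{
	return snprintf(buf, len, "%s %s in %s mode",
			(code & PF_WRITE) ? "write to" : "read from",
			(code & PF_PROT) ? "protected page" : "not-present page",
			(code & PF_USER) ? "user" : "kernel");
}
```

For the 0002 above this gives a kernel-mode write to a not-present page,
which fits the NULL address in CR2.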

I will try to investigate further later - kid duty :)

  -- Daniel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08  8:32                                                                                         ` Eric W. Biederman
  2010-03-08 16:54                                                                                           ` Daniel Lezcano
       [not found]                                                                                           ` <m1fx4bxlfy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-08 17:07                                                                                           ` Serge E. Hallyn
       [not found]                                                                                             ` <20100308170719.GD6399-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-03-08 17:35                                                                                             ` Eric W. Biederman
  2 siblings, 2 replies; 184+ messages in thread
From: Serge E. Hallyn @ 2010-03-08 17:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Pavel Emelyanov, Sukadev Bhattiprolu,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> I have taken a snapshot of my development tree and placed it at:
> 
> 
> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
> 
> 
> >> I am going to explore a bit more.  Given that nsfd is using the same
> >> permission checks as a proc file, I think I can just make it a proc
> >> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
> >> won't suck too badly.
> >>   
> > Ah ! yes. Good idea.
> 
> It is a hair more code to use proc files but nothing worth counting.
> 
> Probably the biggest thing I am aware of right now in my development
> tree is getting uids to pass properly between unix domain sockets.
> I wound up writing this cred_to_ucred function.
> 
> Serge can you take a look and check my logic, and do you have
> any idea of where we should place something like pid_vnr but
> for the uid namespace?

Well my first thought was user_namespace, but I'm thinking kernel/cred.c is
the best place for it.

> void cred_to_ucred(struct pid *pid, const struct cred *cred,
> 		   struct ucred *ucred)
> {
> 	ucred->pid = pid_vnr(pid);
> 	ucred->uid = ucred->gid = -1;
> 	if (cred) {
> 		struct user_namespace *cred_ns = cred->user->user_ns;
> 		struct user_namespace *current_ns = current_user_ns();
> 		struct user_namespace *tmp;
> 
> 		if (likely(cred_ns == current_ns)) {
> 			ucred->uid = cred->euid;
> 			ucred->gid = cred->egid;
> 		} else {
> 			/* Is cred in a child user namespace */
> 			tmp = cred_ns;
> 			do {
> 				tmp = tmp->creator->user_ns;
> 				if (tmp == current_ns) {

	Hmm, I think you want to catch one level up - so the creator itself
	is in current_user_ns, so

	do {
		if (tmp->creator->user_ns == current_ns) {
			ucred->uid = tmp->creator->uid;
			ucred->gid = tmp->creator_gid;
			return;
		}
		tmp = tmp->creator->user_ns;
	} while (tmp != &init_user_ns);

> 					ucred->uid = tmp->creator->uid;
> 					ucred->gid = overflowgid;

			should we start recording a user_ns->creator_gid
			instead?

> 					return;
> 				}
> 			} while (tmp != &init_user_ns);
> 
> 			/* Is cred the creator of my user namespace,
> 			 * or the creator of one of it's parents?
> 			 */
> 			for( tmp = current_ns; tmp != &init_user_ns;
> 			     tmp = tmp->creator->user_ns) {
> 				if (cred->user == tmp->creator) {
> 					ucred->uid = 0;
> 					ucred->gid = 0;
> 					return;
> 				}
> 			}

That looks right.

> 			/* No user namespace relationship so no mapping */
> 			ucred->uid = overflowuid;
> 			ucred->gid = overflowgid;
> 		}
> 	}
> }
> 
> Eric
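
To make the three cases concrete, here is a userspace model of the
lookup with the one-level-up check suggested above. The structs,
map_uid() and OVERFLOW_ID are simplified stand-ins for illustration, not
the kernel definitions, and only the uid side is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; only the fields the lookup touches are kept. */
struct user_ns_model;
struct user_model {
	unsigned int uid;
	struct user_ns_model *user_ns;	/* namespace this user lives in */
};
struct user_ns_model {
	struct user_model *creator;	/* user that created this namespace */
};

#define OVERFLOW_ID 65534u	/* the default overflowuid value */

static struct user_ns_model init_ns;
static struct user_model root_user = { 0, &init_ns };
static struct user_ns_model init_ns = { &root_user };

/* Return the uid that viewer_ns should see for cred_user/euid. */
unsigned int map_uid(const struct user_model *cred_user, unsigned int euid,
		     const struct user_ns_model *viewer_ns)
{
	const struct user_ns_model *tmp;

	assert(cred_user && viewer_ns);

	/* Same namespace: no translation needed. */
	if (cred_user->user_ns == viewer_ns)
		return euid;

	/* cred lives in a descendant namespace: walk up until we find
	 * the creator that itself lives in viewer_ns. */
	tmp = cred_user->user_ns;
	do {
		if (tmp->creator->user_ns == viewer_ns)
			return tmp->creator->uid;
		tmp = tmp->creator->user_ns;
	} while (tmp != &init_ns);

	/* cred created viewer_ns or one of its parents: show it as root. */
	for (tmp = viewer_ns; tmp != &init_ns; tmp = tmp->creator->user_ns)
		if (cred_user == tmp->creator)
			return 0;

	/* No user namespace relationship, so no mapping. */
	return OVERFLOW_ID;
}
```

With alice (uid 1000 in the initial namespace) creating a child
namespace, a user inside the child shows up as 1000 from the initial
namespace, alice shows up as 0 from inside the child, and a user from an
unrelated sibling namespace maps to the overflow id.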

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                             ` <20100308170719.GD6399-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-08 17:35                                                                                               ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 17:35 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> 
>> I have take an snapshot of my development tree and placed it at.
>> 
>> 
>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>> 
>> 
>> >> I am going to explore a bit more.  Given that nsfd is using the same
>> >> permission checks as a proc file, I think I can just make it a proc
>> >> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
>> >> won't suck too badly.
>> >>   
>> > Ah ! yes. Good idea.
>> 
>> It is a hair more code to use proc files but nothing worth counting.
>> 
>> Probably the biggest thing I am aware of right now in my development
>> tree is in getting uids to pass properly between unix domain sockets
>> I would up writing this cred_to_ucred function.
>> 
>> Serge can you take a look and check my logic, and do you have
>> any idea of where we should place something like pid_vnr but
>> for the uid namespace?
>
> Well my first thought was user_namespace, but I'm thinking kernel/cred.c is
> the best place for it.

Thanks.

>> void cred_to_ucred(struct pid *pid, const struct cred *cred,
>> 		   struct ucred *ucred)
>> {
>> 	ucred->pid = pid_vnr(pid);
>> 	ucred->uid = ucred->gid = -1;
>> 	if (cred) {
>> 		struct user_namespace *cred_ns = cred->user->user_ns;
>> 		struct user_namespace *current_ns = current_user_ns();
>> 		struct user_namespace *tmp;
>> 
>> 		if (likely(cred_ns == current_ns)) {
>> 			ucred->uid = cred->euid;
>> 			ucred->gid = cred->egid;
>> 		} else {
>> 			/* Is cred in a child user namespace */
>> 			tmp = cred_ns;
>> 			do {
>> 				tmp = tmp->creator->user_ns;
>> 				if (tmp == current_ns) {
>
> 	Hmm, I think you want to catch one level up - so the creator itself
> 	is in current_user_ns, so

>
> 	do {
> 		if (tmp->creator->user_ns == current_ns) {
> 			ucred->uid = tmp->creator->uid;
> 			ucred->gid = tmp->creator_gid;
> 			return;
> 		}
> 		tmp = tmp->creator->user_ns;
> 	} while (tmp != &init_user_ns);

Good catch.  

>> 					ucred->uid = tmp->creator->uid;
>> 					ucred->gid = overflowgid;
>
> 			should we start recording a user_ns->creator_gid
> 			instead?

I had a similar question.  Possibly we can just grab the creator's cred.


>> 					return;
>> 				}
>> 			} while (tmp != &init_user_ns);
>> 
>> 			/* Is cred the creator of my user namespace,
>> 			 * or the creator of one of it's parents?
>> 			 */
>> 			for( tmp = current_ns; tmp != &init_user_ns;
>> 			     tmp = tmp->creator->user_ns) {
>> 				if (cred->user == tmp->creator) {
>> 					ucred->uid = 0;
>> 					ucred->gid = 0;
>> 					return;
>> 				}
>> 			}
>
> That looks right.
>
>> 			/* No user namespace relationship so no mapping */
>> 			ucred->uid = overflowuid;
>> 			ucred->gid = overflowgid;
>> 		}
>> 	}
>> }

Eric
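
The mapping logic discussed above can be sketched in plain userspace C,
which may make the three cases easier to follow. This is only a model under
stated assumptions: struct user, struct user_ns, struct ids, map_ids,
init_ns, root_user and SKETCH_OVERFLOW_ID are hypothetical stand-ins, not
the kernel's types. The first loop uses the shape Serge suggested, and the
gid is taken from the creator, which assumes the creator's gid is recorded
somewhere (the open question in the thread).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OVERFLOW_ID 65534U	/* stands in for overflowuid/overflowgid */

struct user_ns;

/* Hypothetical, simplified user: ids plus the namespace the user lives in. */
struct user {
	unsigned int uid;
	unsigned int gid;
	struct user_ns *ns;
};

struct user_ns {
	struct user *creator;	/* user that created this namespace */
};

struct ids {
	unsigned int uid;
	unsigned int gid;
};

/* The initial namespace is created by root, who lives inside it. */
static struct user_ns init_ns;
static struct user root_user = { 0, 0, &init_ns };
static struct user_ns init_ns = { &root_user };

/* Map the ids of `cred` into the view of a task living in `current_ns`. */
static struct ids map_ids(const struct user *cred,
			  const struct user_ns *current_ns)
{
	struct ids out = { SKETCH_OVERFLOW_ID, SKETCH_OVERFLOW_ID };
	const struct user_ns *tmp;

	assert(cred != NULL && current_ns != NULL);

	/* Same namespace: the ids are used as they are. */
	if (cred->ns == current_ns) {
		out.uid = cred->uid;
		out.gid = cred->gid;
		return out;
	}

	/* cred lives in a descendant namespace: report the creator in ours. */
	tmp = cred->ns;
	do {
		if (tmp->creator->ns == current_ns) {
			out.uid = tmp->creator->uid;
			out.gid = tmp->creator->gid;
			return out;
		}
		tmp = tmp->creator->ns;
	} while (tmp != &init_ns);

	/* cred created our namespace or one of its ancestors: it shows as root. */
	for (tmp = current_ns; tmp != &init_ns; tmp = tmp->creator->ns) {
		if (cred == tmp->creator) {
			out.uid = 0;
			out.gid = 0;
			return out;
		}
	}

	/* No namespace relationship, so no mapping: overflow ids. */
	return out;
}
```

In this model a user two namespaces below the caller is reported as the
creator of the intermediate chain that lives in the caller's namespace,
while a user in an unrelated sibling namespace falls through to the
overflow ids.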

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                               ` <m1pr3eu36u.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-08 17:47                                                                                                 ` Serge E. Hallyn
  0 siblings, 0 replies; 184+ messages in thread
From: Serge E. Hallyn @ 2010-03-08 17:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> 
> >> I have take an snapshot of my development tree and placed it at.
> >> 
> >> 
> >> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
> >> 
> >> 
> >> >> I am going to explore a bit more.  Given that nsfd is using the same
> >> >> permission checks as a proc file, I think I can just make it a proc
> >> >> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
> >> >> won't suck too badly.
> >> >>   
> >> > Ah ! yes. Good idea.
> >> 
> >> It is a hair more code to use proc files but nothing worth counting.
> >> 
> >> Probably the biggest thing I am aware of right now in my development
> >> tree is in getting uids to pass properly between unix domain sockets
> >> I would up writing this cred_to_ucred function.
> >> 
> >> Serge can you take a look and check my logic, and do you have
> >> any idea of where we should place something like pid_vnr but
> >> for the uid namespace?
> >
> > Well my first thought was user_namespace, but I'm thinking kernel/cred.c is
> > the best place for it.
> 
> Thanks.
> 
> >> void cred_to_ucred(struct pid *pid, const struct cred *cred,
> >> 		   struct ucred *ucred)
> >> {
> >> 	ucred->pid = pid_vnr(pid);
> >> 	ucred->uid = ucred->gid = -1;
> >> 	if (cred) {
> >> 		struct user_namespace *cred_ns = cred->user->user_ns;
> >> 		struct user_namespace *current_ns = current_user_ns();
> >> 		struct user_namespace *tmp;
> >> 
> >> 		if (likely(cred_ns == current_ns)) {
> >> 			ucred->uid = cred->euid;
> >> 			ucred->gid = cred->egid;
> >> 		} else {
> >> 			/* Is cred in a child user namespace */
> >> 			tmp = cred_ns;
> >> 			do {
> >> 				tmp = tmp->creator->user_ns;
> >> 				if (tmp == current_ns) {
> >
> > 	Hmm, I think you want to catch one level up - so the creator itself
> > 	is in current_user_ns, so
> 
> >
> > 	do {
> > 		if (tmp->creator->user_ns == current_ns) {
> > 			ucred->uid = tmp->creator->uid;
> > 			ucred->gid = tmp->creator_gid;
> > 			return;
> > 		}
> > 		tmp = tmp->creator->user_ns;
> > 	} while (tmp != &init_user_ns);
> 
> Good catch.  
> 
> >> 					ucred->uid = tmp->creator->uid;
> >> 					ucred->gid = overflowgid;
> >
> > 			should we start recording a user_ns->creator_gid
> > 			instead?
> 
> I had a similar question.  Possibly we can just grab the creators cred.

Oh, yeah, make user_ns->creator a cred, excellent idea - then we have
the LSM and capability fields cached as well.

-serge

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                               ` <m11vfuvi1t.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-08 19:57                                                                                                 ` Daniel Lezcano
  0 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-08 19:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> I have take an snapshot of my development tree and placed it at.
>>>
>>>
>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>   
>>>       
>> Hi Eric,
>>
>> thanks for the pointer.
>>
>> I tried to boot the kernel under qemu and I got this oops:
>>     
>
> I am clearly running an old userspace on my test machine.  No udev.
> It looks like udev has a long standing netlink misfeature, where
> it does not initializing NETLINK_CB....
>
>
> >From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Date: Mon, 8 Mar 2010 09:25:20 -0800
> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>   
Thanks.

I was able to boot but I have the following warning:

------------[ cut here ]------------
WARNING: at net/netlink/af_netlink.c:198 netlink_sock_destruct+0x72/0xac()
Hardware name:
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 840, comm: nash-hotplug Tainted: G        W  2.6.33 #2
Call Trace:
 [<ffffffff812df182>] ? netlink_sock_destruct+0x72/0xac
 [<ffffffff8102ca29>] warn_slowpath_common+0x77/0xa4
 [<ffffffff8102ca65>] warn_slowpath_null+0xf/0x11
 [<ffffffff812df182>] netlink_sock_destruct+0x72/0xac
 [<ffffffff812bb2a4>] __sk_free+0x1e/0x118
 [<ffffffff812bb40d>] sk_free+0x19/0x1b
 [<ffffffff812e0dc2>] netlink_release+0x246/0x253
 [<ffffffff812b825a>] sock_release+0x1a/0x6b
 [<ffffffff812b82cd>] sock_close+0x22/0x26
 [<ffffffff810c7823>] __fput+0x11b/0x1d7
 [<ffffffff810c78f6>] fput+0x17/0x19
 [<ffffffff810c4ae2>] filp_close+0x67/0x72
 [<ffffffff8102e75c>] put_files_struct+0x6a/0xd4
 [<ffffffff8102e80d>] exit_files+0x47/0x4f
 [<ffffffff8102fe59>] do_exit+0x1eb/0x693
 [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
 [<ffffffff81030373>] do_group_exit+0x72/0x9b
 [<ffffffff8103f37c>] get_signal_to_deliver+0x3a1/0x3c1
 [<ffffffff81001e8e>] do_notify_resume+0x8d/0x6ea
 [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
 [<ffffffff8102851e>] ? finish_task_switch+0x6a/0xb3
 [<ffffffff810284b4>] ? finish_task_switch+0x0/0xb3
 [<ffffffff813867aa>] ? retint_signal+0x11/0x87
 [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
 [<ffffffff813867df>] retint_signal+0x46/0x87
---[ end trace d4a1e4cbaa70d63d ]---


And I have a kernel panic when exiting a network namespace using a macvlan:

linux-swk0 login: BUG: unable to handle kernel paging request at ffff880035475678
IP: [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
PGD 160b063 PUD 160f063 PMD 2aa067 PTE 35475160
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/net/eth0/flags
CPU 0
Pid: 10, comm: netns Tainted: G        W  2.6.33 #2 /
RIP: 0010:[<ffffffff8128dbef>]  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
RSP: 0018:ffff88003f92bc50  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880035440800 RCX: ffff880035440800
RDX: ffff880035475678 RSI: ffff88003f913710 RDI: ffff88003cde9800
RBP: ffff88003f92bc70 R08: 0000000000000004 R09: 0000000000000000
R10: 0080000000000000 R11: ffff88003f92bbf0 R12: ffff88003cde9800
R13: ffff880035440de0 R14: 0080000000000000 R15: 0000000800000000
FS:  0000000000000000(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880035475678 CR3: 000000003eb41000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process netns (pid: 10, threadinfo ffff88003f92a000, task ffff88003f913058)
Stack:
 ffffffff814328a0 ffff880035440800 ffffffff814328a0 ffff88003553a800
<0> ffff88003f92bc90 ffffffff812c9150 ffff880035440800 ffff88003f92bd00
<0> ffff88003f92bcd0 ffffffff812c9259 ffff88003f92bcd0 ffff88003f92bd00
Call Trace:
 [<ffffffff812c9150>] dev_close+0x86/0xa8
 [<ffffffff812c9259>] rollback_registered_many+0xe7/0x208
 [<ffffffff812c9390>] unregister_netdevice_many+0x16/0x62
 [<ffffffff812c952d>] default_device_exit_batch+0x9f/0xb3
 [<ffffffff812c3906>] ops_exit_list+0x4e/0x56
 [<ffffffff812c40f4>] cleanup_net+0xfe/0x1b7
 [<ffffffff81042db6>] worker_thread+0x227/0x32d
 [<ffffffff81042d60>] ? worker_thread+0x1d1/0x32d
 [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
 [<ffffffff812c3ff6>] ? cleanup_net+0x0/0x1b7
 [<ffffffff810466ae>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff81042b8f>] ? worker_thread+0x0/0x32d
 [<ffffffff810462e0>] kthread+0x7c/0x84
 [<ffffffff810035b4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8138673a>] ? restore_args+0x0/0x30
 [<ffffffff81046264>] ? kthread+0x0/0x84
 [<ffffffff810035b0>] ? kernel_thread_helper+0x0/0x10
Code: 01 00 00 02 74 0b 83 ce ff 4c 89 e7 e8 a1 8f 03 00 48 8b b3 50 02 00 00 4c 89 e7 e8 df 8e 03 00 49 8b 45 18 49 8b 55 20 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 be 00 02 20 00 00 00 ad de 49 89
RIP  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
 RSP <ffff88003f92bc50>
CR2: ffff880035475678
---[ end trace d4a1e4cbaa70d63e ]---

addr2line -e ./vmlinux ffffffff812c9150 gives net/core/dev.c:1252
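
As an aid to reading the "Oops: 0002" line above, here is a small sketch
that decodes the low bits of an x86 page-fault error code. The bit meanings
are the architectural ones; struct pf_info and decode_pf_error are names
made up for this illustration, and the 16-bit limit in the assert is just a
sanity bound chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Decoded view of the low bits of an x86 page-fault error code. */
struct pf_info {
	bool protection;	/* bit 0: set = protection fault, clear = not present */
	bool write;		/* bit 1: set = write access, clear = read access */
	bool user;		/* bit 2: set = fault raised in user mode */
	bool reserved_bit;	/* bit 3: reserved bit set in a paging entry */
	bool instr_fetch;	/* bit 4: fault caused by an instruction fetch */
};

static struct pf_info decode_pf_error(unsigned long code)
{
	struct pf_info info;

	/* The oops header prints the code as four hex digits. */
	assert(code <= 0xffffUL);

	info.protection   = (code & 0x01UL) != 0;
	info.write        = (code & 0x02UL) != 0;
	info.user         = (code & 0x04UL) != 0;
	info.reserved_bit = (code & 0x08UL) != 0;
	info.instr_fetch  = (code & 0x10UL) != 0;
	return info;
}
```

For 0x0002 this gives a kernel-mode write to a page that is not present.
With DEBUG_PAGEALLOC enabled that would be consistent with, though it does
not prove, a write to memory that had already been freed.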

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                 ` <4B9556A9.60206-GANU6spQydw@public.gmane.org>
@ 2010-03-08 20:24                                                                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 20:24 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> I have take an snapshot of my development tree and placed it at.
>>>>
>>>>
>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>         
>>> Hi Eric,
>>>
>>> thanks for the pointer.
>>>
>>> I tried to boot the kernel under qemu and I got this oops:
>>>     
>>
>> I am clearly running an old userspace on my test machine.  No udev.
>> It looks like udev has a long standing netlink misfeature, where
>> it does not initializing NETLINK_CB....
>>
>>
>> >From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>
>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>   
> Thanks.
>
> I was able to boot but I have the following warning:

Thanks for the bug report.

For the moment you might want to drop:
af_netlink:  Allow credentials to work across namespaces.
af_netlink: Debugging in case I have missed something.

Although I am curious if you hit my debugging messages in
netlink recv.

I guess if the goal is to test my nsfd bits you can drop everything
starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
it takes to get uids, gids and pids translated when they cross
namespaces on an af_unix or an af_netlink socket.

At least in the af_netlink case it appears clear I have missed
something.

This is a warning that netlink throws when the packet accounting is messed
up.  So it sounds like you are exercising another path that I failed
to exercise and fix.
> ------------[ cut here ]------------
> WARNING: at net/netlink/af_netlink.c:198 netlink_sock_destruct+0x72/0xac()
> Hardware name:
> Modules linked in: [last unloaded: scsi_wait_scan]
> Pid: 840, comm: nash-hotplug Tainted: G        W  2.6.33 #2
> Call Trace:
> [<ffffffff812df182>] ? netlink_sock_destruct+0x72/0xac
> [<ffffffff8102ca29>] warn_slowpath_common+0x77/0xa4
> [<ffffffff8102ca65>] warn_slowpath_null+0xf/0x11
> [<ffffffff812df182>] netlink_sock_destruct+0x72/0xac
> [<ffffffff812bb2a4>] __sk_free+0x1e/0x118
> [<ffffffff812bb40d>] sk_free+0x19/0x1b
> [<ffffffff812e0dc2>] netlink_release+0x246/0x253
> [<ffffffff812b825a>] sock_release+0x1a/0x6b
> [<ffffffff812b82cd>] sock_close+0x22/0x26
> [<ffffffff810c7823>] __fput+0x11b/0x1d7
> [<ffffffff810c78f6>] fput+0x17/0x19
> [<ffffffff810c4ae2>] filp_close+0x67/0x72
> [<ffffffff8102e75c>] put_files_struct+0x6a/0xd4
> [<ffffffff8102e80d>] exit_files+0x47/0x4f
> [<ffffffff8102fe59>] do_exit+0x1eb/0x693
> [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
> [<ffffffff81030373>] do_group_exit+0x72/0x9b
> [<ffffffff8103f37c>] get_signal_to_deliver+0x3a1/0x3c1
> [<ffffffff81001e8e>] do_notify_resume+0x8d/0x6ea
> [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
> [<ffffffff8102851e>] ? finish_task_switch+0x6a/0xb3
> [<ffffffff810284b4>] ? finish_task_switch+0x0/0xb3
> [<ffffffff813867aa>] ? retint_signal+0x11/0x87
> [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
> [<ffffffff813867df>] retint_signal+0x46/0x87
> ---[ end trace d4a1e4cbaa70d63d ]---
>
>
> And I have a kernel panic when exiting a network namespace using a macvlan:

I wonder/hope this is simply the result of corruption from earlier problems.
I haven't touched anything that should affect the macvlan driver in 2.6.33.

> linux-swk0 login: BUG: unable to handle kernel paging request at
> ffff880035475678
> IP: [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> PGD 160b063 PUD 160f063 PMD 2aa067 PTE 35475160
> Oops: 0002 [#1] DEBUG_PAGEALLOC
> last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/net/eth0/flags
> CPU 0
> Pid: 10, comm: netns Tainted: G        W  2.6.33 #2 /
> RIP: 0010:[<ffffffff8128dbef>]  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> RSP: 0018:ffff88003f92bc50  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880035440800 RCX: ffff880035440800
> RDX: ffff880035475678 RSI: ffff88003f913710 RDI: ffff88003cde9800
> RBP: ffff88003f92bc70 R08: 0000000000000004 R09: 0000000000000000
> R10: 0080000000000000 R11: ffff88003f92bbf0 R12: ffff88003cde9800
> R13: ffff880035440de0 R14: 0080000000000000 R15: 0000000800000000
> FS:  0000000000000000(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880035475678 CR3: 000000003eb41000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process netns (pid: 10, threadinfo ffff88003f92a000, task ffff88003f913058)
> Stack:
> ffffffff814328a0 ffff880035440800 ffffffff814328a0 ffff88003553a800
> <0> ffff88003f92bc90 ffffffff812c9150 ffff880035440800 ffff88003f92bd00
> <0> ffff88003f92bcd0 ffffffff812c9259 ffff88003f92bcd0 ffff88003f92bd00
> Call Trace:
> [<ffffffff812c9150>] dev_close+0x86/0xa8
> [<ffffffff812c9259>] rollback_registered_many+0xe7/0x208
> [<ffffffff812c9390>] unregister_netdevice_many+0x16/0x62
> [<ffffffff812c952d>] default_device_exit_batch+0x9f/0xb3
> [<ffffffff812c3906>] ops_exit_list+0x4e/0x56
> [<ffffffff812c40f4>] cleanup_net+0xfe/0x1b7
> [<ffffffff81042db6>] worker_thread+0x227/0x32d
> [<ffffffff81042d60>] ? worker_thread+0x1d1/0x32d
> [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
> [<ffffffff812c3ff6>] ? cleanup_net+0x0/0x1b7
> [<ffffffff810466ae>] ? autoremove_wake_function+0x0/0x38
> [<ffffffff81042b8f>] ? worker_thread+0x0/0x32d
> [<ffffffff810462e0>] kthread+0x7c/0x84
> [<ffffffff810035b4>] kernel_thread_helper+0x4/0x10
> [<ffffffff8138673a>] ? restore_args+0x0/0x30
> [<ffffffff81046264>] ? kthread+0x0/0x84
> [<ffffffff810035b0>] ? kernel_thread_helper+0x0/0x10
> Code: 01 00 00 02 74 0b 83 ce ff 4c 89 e7 e8 a1 8f 03 00 48 8b b3 50 02 00 00 4c
> 89 e7 e8 df 8e 03 00 49 8b 45 18 49 8b 55 20 48 85 c0 <48> 89 02 74 04 48 89 50
> 08 48 be 00 02 20 00 00 00 ad de 49 89
> RIP  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> RSP <ffff88003f92bc50>
> CR2: ffff880035475678
> ---[ end trace d4a1e4cbaa70d63e ]---
>
> addr2line -e ./vmlinux ffffffff812c9150 gives net/core/dev.c:1252

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08 19:57                                                                                               ` Daniel Lezcano
@ 2010-03-08 20:24                                                                                                 ` Eric W. Biederman
  2010-03-08 20:42                                                                                                   ` Daniel Lezcano
       [not found]                                                                                                   ` <m11vfusgsa.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
       [not found]                                                                                                 ` <4B9556A9.60206-GANU6spQydw@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 20:24 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> I have take an snapshot of my development tree and placed it at.
>>>>
>>>>
>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>         
>>> Hi Eric,
>>>
>>> thanks for the pointer.
>>>
>>> I tried to boot the kernel under qemu and I got this oops:
>>>     
>>
>> I am clearly running an old userspace on my test machine.  No udev.
>> It looks like udev has a long standing netlink misfeature, where
>> it does not initializing NETLINK_CB....
>>
>>
>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>> From: Eric W. Biederman <ebiederm@xmission.com>
>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>   
> Thanks.
>
> I was able to boot but I have the following warning:

Thanks for the bug report.

For the moment you might want to drop:
af_netlink:  Allow credentials to work across namespaces.
af_netlink: Debugging in case I have missed something.

Although I am curious if you hit my debugging messages in
netlink recv.

I guess if the goal is to test my nsfd bits you can drop everything
starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
it takes to get uids, gids and pids translated when they cross
namespaces on an af_unix or an af_netlink socket.

At least in the af_netlink case it appears clear I have missed
something.

This is a warning that netlink throws when the packet accounting is
messed up.  So it sounds like you are exercising another path that I
failed to exercise and fix.
> ------------[ cut here ]------------
> WARNING: at net/netlink/af_netlink.c:198 netlink_sock_destruct+0x72/0xac()
> Hardware name:
> Modules linked in: [last unloaded: scsi_wait_scan]
> Pid: 840, comm: nash-hotplug Tainted: G        W  2.6.33 #2
> Call Trace:
> [<ffffffff812df182>] ? netlink_sock_destruct+0x72/0xac
> [<ffffffff8102ca29>] warn_slowpath_common+0x77/0xa4
> [<ffffffff8102ca65>] warn_slowpath_null+0xf/0x11
> [<ffffffff812df182>] netlink_sock_destruct+0x72/0xac
> [<ffffffff812bb2a4>] __sk_free+0x1e/0x118
> [<ffffffff812bb40d>] sk_free+0x19/0x1b
> [<ffffffff812e0dc2>] netlink_release+0x246/0x253
> [<ffffffff812b825a>] sock_release+0x1a/0x6b
> [<ffffffff812b82cd>] sock_close+0x22/0x26
> [<ffffffff810c7823>] __fput+0x11b/0x1d7
> [<ffffffff810c78f6>] fput+0x17/0x19
> [<ffffffff810c4ae2>] filp_close+0x67/0x72
> [<ffffffff8102e75c>] put_files_struct+0x6a/0xd4
> [<ffffffff8102e80d>] exit_files+0x47/0x4f
> [<ffffffff8102fe59>] do_exit+0x1eb/0x693
> [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
> [<ffffffff81030373>] do_group_exit+0x72/0x9b
> [<ffffffff8103f37c>] get_signal_to_deliver+0x3a1/0x3c1
> [<ffffffff81001e8e>] do_notify_resume+0x8d/0x6ea
> [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
> [<ffffffff8102851e>] ? finish_task_switch+0x6a/0xb3
> [<ffffffff810284b4>] ? finish_task_switch+0x0/0xb3
> [<ffffffff813867aa>] ? retint_signal+0x11/0x87
> [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
> [<ffffffff813867df>] retint_signal+0x46/0x87
> ---[ end trace d4a1e4cbaa70d63d ]---
>
>
> And I have a kernel panic when exiting a network namespace using a macvlan:

I wonder/hope this is simply the result of corruption from earlier problems.
I haven't touched anything that should affect the macvlan driver in 2.6.33.

> linux-swk0 login: BUG: unable to handle kernel paging request at
> ffff880035475678
> IP: [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> PGD 160b063 PUD 160f063 PMD 2aa067 PTE 35475160
> Oops: 0002 [#1] DEBUG_PAGEALLOC
> last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/net/eth0/flags
> CPU 0
> Pid: 10, comm: netns Tainted: G        W  2.6.33 #2 /
> RIP: 0010:[<ffffffff8128dbef>]  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> RSP: 0018:ffff88003f92bc50  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880035440800 RCX: ffff880035440800
> RDX: ffff880035475678 RSI: ffff88003f913710 RDI: ffff88003cde9800
> RBP: ffff88003f92bc70 R08: 0000000000000004 R09: 0000000000000000
> R10: 0080000000000000 R11: ffff88003f92bbf0 R12: ffff88003cde9800
> R13: ffff880035440de0 R14: 0080000000000000 R15: 0000000800000000
> FS:  0000000000000000(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880035475678 CR3: 000000003eb41000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process netns (pid: 10, threadinfo ffff88003f92a000, task ffff88003f913058)
> Stack:
> ffffffff814328a0 ffff880035440800 ffffffff814328a0 ffff88003553a800
> <0> ffff88003f92bc90 ffffffff812c9150 ffff880035440800 ffff88003f92bd00
> <0> ffff88003f92bcd0 ffffffff812c9259 ffff88003f92bcd0 ffff88003f92bd00
> Call Trace:
> [<ffffffff812c9150>] dev_close+0x86/0xa8
> [<ffffffff812c9259>] rollback_registered_many+0xe7/0x208
> [<ffffffff812c9390>] unregister_netdevice_many+0x16/0x62
> [<ffffffff812c952d>] default_device_exit_batch+0x9f/0xb3
> [<ffffffff812c3906>] ops_exit_list+0x4e/0x56
> [<ffffffff812c40f4>] cleanup_net+0xfe/0x1b7
> [<ffffffff81042db6>] worker_thread+0x227/0x32d
> [<ffffffff81042d60>] ? worker_thread+0x1d1/0x32d
> [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
> [<ffffffff812c3ff6>] ? cleanup_net+0x0/0x1b7
> [<ffffffff810466ae>] ? autoremove_wake_function+0x0/0x38
> [<ffffffff81042b8f>] ? worker_thread+0x0/0x32d
> [<ffffffff810462e0>] kthread+0x7c/0x84
> [<ffffffff810035b4>] kernel_thread_helper+0x4/0x10
> [<ffffffff8138673a>] ? restore_args+0x0/0x30
> [<ffffffff81046264>] ? kthread+0x0/0x84
> [<ffffffff810035b0>] ? kernel_thread_helper+0x0/0x10
> Code: 01 00 00 02 74 0b 83 ce ff 4c 89 e7 e8 a1 8f 03 00 48 8b b3 50 02 00 00 4c
> 89 e7 e8 df 8e 03 00 49 8b 45 18 49 8b 55 20 48 85 c0 <48> 89 02 74 04 48 89 50
> 08 48 be 00 02 20 00 00 00 ad de 49 89
> RIP  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> RSP <ffff88003f92bc50>
> CR2: ffff880035475678
> ---[ end trace d4a1e4cbaa70d63e ]---
>
> addr2line -e ./vmlinux ffffffff812c9150 gives net/core/dev.c:1252

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                   ` <m11vfusgsa.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-08 20:42                                                                                                     ` Daniel Lezcano
  0 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-08 20:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>>
>>>   
>>>       
>>>> Eric W. Biederman wrote:
>>>>     
>>>>         
>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>
>>>>>
>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>         
>>>>>           
>>>> Hi Eric,
>>>>
>>>> thanks for the pointer.
>>>>
>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>     
>>>>         
>>> I am clearly running an old userspace on my test machine.  No udev.
>>> It looks like udev has a long standing netlink misfeature, where
>>> it does not initializing NETLINK_CB....
>>>
>>>
>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>   
>>>       
>> Thanks.
>>
>> I was able to boot but I have the following warning:
>>     
>
> Thanks for the bug report.
>   
Thanks to you for the patchset :)

> For the moment you might want to drop:
> af_netlink:  Allow credentials to work across namespaces.
> af_netlink: Debugging in case I have missed something.
>
> Although I am curious if you hit my debugging messages in
> netlink recv.
>   
No, it does not appear (looked for "missing NETLINK_CB proto").

> I guess if the goal is to test my nsfd bits you can drop everything
> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
> it takes to get get uids, gid and pids translated when the cross
> namespaces on an af_unix of an af_netlink socket.
>
> At least in the af_netlink case it appears clear I am have missed
> something.
>
> This is a warning that netlink throws when the packet accounting messed
> up.  So it sounds like you are exercising another path that I failed
> to exercise and fix.
>   
I will keep looking to see if I can find more clues about this warning.

In the meantime I was able to enter the container with the following ugly
program:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;

    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }

    for (i = 0; i < size; i++) {

        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }
    }

    execve(argv[2], &argv[2], NULL);
    perror("execve");

    return 0;
}

At first glance, no problem :)

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08 20:24                                                                                                 ` Eric W. Biederman
@ 2010-03-08 20:42                                                                                                   ` Daniel Lezcano
       [not found]                                                                                                     ` <4B95611C.5060403-GANU6spQydw@public.gmane.org>
  2010-03-08 20:47                                                                                                     ` Eric W. Biederman
       [not found]                                                                                                   ` <m11vfusgsa.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-08 20:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>
>>>   
>>>       
>>>> Eric W. Biederman wrote:
>>>>     
>>>>         
>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>
>>>>>
>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>         
>>>>>           
>>>> Hi Eric,
>>>>
>>>> thanks for the pointer.
>>>>
>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>     
>>>>         
>>> I am clearly running an old userspace on my test machine.  No udev.
>>> It looks like udev has a long standing netlink misfeature, where
>>> it does not initializing NETLINK_CB....
>>>
>>>
>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>> From: Eric W. Biederman <ebiederm@xmission.com>
>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>>   
>>>       
>> Thanks.
>>
>> I was able to boot but I have the following warning:
>>     
>
> Thanks for the bug report.
>   
Thanks to you for the patchset :)

> For the moment you might want to drop:
> af_netlink:  Allow credentials to work across namespaces.
> af_netlink: Debugging in case I have missed something.
>
> Although I am curious if you hit my debugging messages in
> netlink recv.
>   
No, it does not appear (looked for "missing NETLINK_CB proto").

> I guess if the goal is to test my nsfd bits you can drop everything
> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
> it takes to get get uids, gid and pids translated when the cross
> namespaces on an af_unix of an af_netlink socket.
>
> At least in the af_netlink case it appears clear I am have missed
> something.
>
> This is a warning that netlink throws when the packet accounting messed
> up.  So it sounds like you are exercising another path that I failed
> to exercise and fix.
>   
I will keep looking to see if I can find more clues about this warning.

In the meantime I was able to enter the container with the following ugly
program:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;

    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }

    for (i = 0; i < size; i++) {

        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }
    }

    execve(argv[2], &argv[2], NULL);
    perror("execve");

    return 0;
}

At first glance, no problem :)

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                     ` <4B95611C.5060403-GANU6spQydw@public.gmane.org>
@ 2010-03-08 20:47                                                                                                       ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 20:47 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>>>
>>>>         
>>>>> Eric W. Biederman wrote:
>>>>>             
>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>
>>>>>>
>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>                   
>>>>> Hi Eric,
>>>>>
>>>>> thanks for the pointer.
>>>>>
>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>             
>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>> It looks like udev has a long standing netlink misfeature, where
>>>> it does not initializing NETLINK_CB....
>>>>
>>>>
>>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>
>>>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>>         
>>> Thanks.
>>>
>>> I was able to boot but I have the following warning:
>>>     
>>
>> Thanks for the bug report.
>>   
> Thanks to you for the patchset :)
>
>> For the moment you might want to drop:
>> af_netlink:  Allow credentials to work across namespaces.
>> af_netlink: Debugging in case I have missed something.
>>
>> Although I am curious if you hit my debugging messages in
>> netlink recv.
>>   
> No, it does not appear (looked for "missing NETLINK_CB proto").
>
>> I guess if the goal is to test my nsfd bits you can drop everything
>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>> it takes to get get uids, gid and pids translated when the cross
>> namespaces on an af_unix of an af_netlink socket.
>>
>> At least in the af_netlink case it appears clear I am have missed
>> something.
>>
>> This is a warning that netlink throws when the packet accounting messed
>> up.  So it sounds like you are exercising another path that I failed
>> to exercise and fix.
>>   
> I will look forward if I find more clues for this warning.
>
> In the meantime  was able to enter the container with the ugly following
> program:
>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/param.h>
>
> #define __NR_setns 300
>
> int setns(int nstype, int fd)
> {
>    return syscall (__NR_setns, nstype, fd);
> }
>
> int main(int argc, char *argv[])
> {
>    char path[MAXPATHLEN];
>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>    const int size = sizeof(ns) / sizeof(char *);
>    int fd[size];
>    int i;
>
>    if (argc != 3) {
>        fprintf(stderr, "mynsenter <pid> <command>\n");
>        exit(1);
>    }
>
>    for (i = 0; i < size; i++) {
>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>
>        fd[i] = open(path, O_RDONLY);
>        if (fd[i] < 0) {
>            perror("open");
>            return -1;
>        }
>
>    }
>
>    for (i = 0; i < size; i++) {
>
>        if (setns(0, fd[i])) {
>            perror("setns");
>            return -1;
>        }
>    }
>
>    execve(argv[2], &argv[2], NULL);
>    perror("execve");
>
>    return 0;
> }
>
> At the fist glance, no problem :)

No fork(), so your process is completely in the pid namespace?

Eric
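
For comparison, a minimal sketch of the fork() variant the question alludes
to.  It assumes, as in the setns(2) interface later merged into mainline, that
joining a pid namespace only takes effect for children created afterwards; the
thread does not say whether the nsfd-v5 tree behaves this way.  spawn_and_wait
is a hypothetical helper, not part of the posted program, and it would be
called after the setns() loop in place of the direct execve().

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] in a forked child and return its
 * exit status, or -1 on failure or abnormal termination.
 */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (pid == 0) {
        /* Child: created after the namespaces were joined.
         * execv() keeps the caller's environment. */
        execv(argv[0], argv);
        perror("execv");
        _exit(127);
    }

    /* Parent: wait for the child and report how it exited. */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this helper the parent stays around as a thin wrapper while the command
itself runs as a child, so the shell's own pid would be allocated at fork()
time rather than inherited from the caller.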

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08 20:42                                                                                                   ` Daniel Lezcano
       [not found]                                                                                                     ` <4B95611C.5060403-GANU6spQydw@public.gmane.org>
@ 2010-03-08 20:47                                                                                                     ` Eric W. Biederman
  2010-03-08 21:12                                                                                                       ` Daniel Lezcano
       [not found]                                                                                                       ` <m1sk8ar15b.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 20:47 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>>
>>>>         
>>>>> Eric W. Biederman wrote:
>>>>>             
>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>
>>>>>>
>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>                   
>>>>> Hi Eric,
>>>>>
>>>>> thanks for the pointer.
>>>>>
>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>             
>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>> It looks like udev has a long standing netlink misfeature, where
>>>> it does not initializing NETLINK_CB....
>>>>
>>>>
>>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>> From: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>
>>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>>>         
>>> Thanks.
>>>
>>> I was able to boot but I have the following warning:
>>>     
>>
>> Thanks for the bug report.
>>   
> Thanks to you for the patchset :)
>
>> For the moment you might want to drop:
>> af_netlink:  Allow credentials to work across namespaces.
>> af_netlink: Debugging in case I have missed something.
>>
>> Although I am curious if you hit my debugging messages in
>> netlink recv.
>>   
> No, it does not appear (looked for "missing NETLINK_CB proto").
>
>> I guess if the goal is to test my nsfd bits you can drop everything
>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>> it takes to get get uids, gid and pids translated when the cross
>> namespaces on an af_unix of an af_netlink socket.
>>
>> At least in the af_netlink case it appears clear I am have missed
>> something.
>>
>> This is a warning that netlink throws when the packet accounting messed
>> up.  So it sounds like you are exercising another path that I failed
>> to exercise and fix.
>>   
> I will look forward if I find more clues for this warning.
>
> In the meantime  was able to enter the container with the ugly following
> program:
>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/param.h>
>
> #define __NR_setns 300
>
> int setns(int nstype, int fd)
> {
>    return syscall (__NR_setns, nstype, fd);
> }
>
> int main(int argc, char *argv[])
> {
>    char path[MAXPATHLEN];
>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>    const int size = sizeof(ns) / sizeof(char *);
>    int fd[size];
>    int i;
>
>    if (argc != 3) {
>        fprintf(stderr, "mynsenter <pid> <command>\n");
>        exit(1);
>    }
>
>    for (i = 0; i < size; i++) {
>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>
>        fd[i] = open(path, O_RDONLY);
>        if (fd[i] < 0) {
>            perror("open");
>            return -1;
>        }
>
>    }
>
>    for (i = 0; i < size; i++) {
>
>        if (setns(0, fd[i])) {
>            perror("setns");
>            return -1;
>        }
>    }
>
>    execve(argv[2], &argv[2], NULL);
>    perror("execve");
>
>    return 0;
> }
>
> At the fist glance, no problem :)

No fork(), so your process is completely in the pid namespace?

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                       ` <m1sk8ar15b.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-08 21:12                                                                                                         ` Daniel Lezcano
  0 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-08 21:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>>
>>>   
>>>       
>>>> Eric W. Biederman wrote:
>>>>     
>>>>         
>>>>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>>>>
>>>>>         
>>>>>           
>>>>>> Eric W. Biederman wrote:
>>>>>>             
>>>>>>             
>>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>>
>>>>>>>
>>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>>                   
>>>>>>>               
>>>>>> Hi Eric,
>>>>>>
>>>>>> thanks for the pointer.
>>>>>>
>>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>>             
>>>>>>             
>>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>>> It looks like udev has a long standing netlink misfeature, where
>>>>> it does not initializing NETLINK_CB....
>>>>>
>>>>>
>>>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>>> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>>
>>>>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>>>         
>>>>>           
>>>> Thanks.
>>>>
>>>> I was able to boot but I have the following warning:
>>>>     
>>>>         
>>> Thanks for the bug report.
>>>   
>>>       
>> Thanks to you for the patchset :)
>>
>>     
>>> For the moment you might want to drop:
>>> af_netlink:  Allow credentials to work across namespaces.
>>> af_netlink: Debugging in case I have missed something.
>>>
>>> Although I am curious if you hit my debugging messages in
>>> netlink recv.
>>>   
>>>       
>> No, it does not appear (looked for "missing NETLINK_CB proto").
>>
>>     
>>> I guess if the goal is to test my nsfd bits you can drop everything
>>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>>> it takes to get get uids, gid and pids translated when the cross
>>> namespaces on an af_unix of an af_netlink socket.
>>>
>>> At least in the af_netlink case it appears clear I am have missed
>>> something.
>>>
>>> This is a warning that netlink throws when the packet accounting messed
>>> up.  So it sounds like you are exercising another path that I failed
>>> to exercise and fix.
>>>   
>>>       
>> I will look forward if I find more clues for this warning.
>>
>> In the meantime  was able to enter the container with the ugly following
>> program:
>>
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <syscall.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/param.h>
>>
>> #define __NR_setns 300
>>
>> int setns(int nstype, int fd)
>> {
>>    return syscall (__NR_setns, nstype, fd);
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>    char path[MAXPATHLEN];
>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>    const int size = sizeof(ns) / sizeof(char *);
>>    int fd[size];
>>    int i;
>>
>>    if (argc != 3) {
>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>        exit(1);
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>
>>        fd[i] = open(path, O_RDONLY);
>>        if (fd[i] < 0) {
>>            perror("open");
>>            return -1;
>>        }
>>
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>
>>        if (setns(0, fd[i])) {
>>            perror("setns");
>>            return -1;
>>        }
>>    }
>>
>>    execve(argv[2], &argv[2], NULL);
>>    perror("execve");
>>
>>    return 0;
>> }
>>
>> At the fist glance, no problem :)
>>     
>
> No fork() so your processes is completely in the pid namespace?
>   
What I do is attach "/bin/sh" to the container with this program.
The container is a VPS running busybox with full isolation.

echo $$ gives the real pid.
All the forked processes appear in the pid namespace; they are visible 
through /proc with their virtual pids.
I am not able to change to the /proc/self directory (I assume this is 
normal).

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                         ` <4B956852.7050804-GANU6spQydw@public.gmane.org>
@ 2010-03-08 21:25                                                                                                           ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 21:25 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>>>
>>>>         
>>>>> Eric W. Biederman wrote:
>>>>>             
>>>>>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>>>>>>
>>>>>>                   
>>>>>>> Eric W. Biederman wrote:
>>>>>>>                         
>>>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>>>
>>>>>>>>
>>>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>>>                                 
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> thanks for the pointer.
>>>>>>>
>>>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>>>                         
>>>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>>>> It looks like udev has a long standing netlink misfeature, where
>>>>>> it does not initializing NETLINK_CB....
>>>>>>
>>>>>>
>>>>>> >From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>>>> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>>>
>>>>>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>>>>                   
>>>>> Thanks.
>>>>>
>>>>> I was able to boot but I have the following warning:
>>>>>             
>>>> Thanks for the bug report.
>>>>         
>>> Thanks to you for the patchset :)
>>>
>>>     
>>>> For the moment you might want to drop:
>>>> af_netlink:  Allow credentials to work across namespaces.
>>>> af_netlink: Debugging in case I have missed something.
>>>>
>>>> Although I am curious if you hit my debugging messages in
>>>> netlink recv.
>>>>         
>>> No, it does not appear (looked for "missing NETLINK_CB proto").
>>>
>>>     
>>>> I guess if the goal is to test my nsfd bits you can drop everything
>>>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>>>> it takes to get get uids, gid and pids translated when the cross
>>>> namespaces on an af_unix of an af_netlink socket.
>>>>
>>>> At least in the af_netlink case it appears clear I am have missed
>>>> something.
>>>>
>>>> This is a warning that netlink throws when the packet accounting messed
>>>> up.  So it sounds like you are exercising another path that I failed
>>>> to exercise and fix.
>>>>         
>>> I will look forward if I find more clues for this warning.
>>>
>>> In the meantime  was able to enter the container with the ugly following
>>> program:
>>>
>>> #include <unistd.h>
>>> #include <stdlib.h>
>>> #include <stdio.h>
>>> #include <syscall.h>
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <sys/param.h>
>>>
>>> #define __NR_setns 300
>>>
>>> int setns(int nstype, int fd)
>>> {
>>>    return syscall (__NR_setns, nstype, fd);
>>> }
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>    char path[MAXPATHLEN];
>>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>>    const int size = sizeof(ns) / sizeof(char *);
>>>    int fd[size];
>>>    int i;
>>>
>>>    if (argc != 3) {
>>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>>        exit(1);
>>>    }
>>>
>>>    for (i = 0; i < size; i++) {
>>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>>
>>>        fd[i] = open(path, O_RDONLY);
>>>        if (fd[i] < 0) {
>>>            perror("open");
>>>            return -1;
>>>        }
>>>
>>>    }
>>>
>>>    for (i = 0; i < size; i++) {
>>>
>>>        if (setns(0, fd[i])) {
>>>            perror("setns");
>>>            return -1;
>>>        }
>>>    }
>>>
>>>    execve(argv[2], &argv[2], NULL);
>>>    perror("execve");
>>>
>>>    return 0;
>>> }
>>>
>>> At the fist glance, no problem :)
>>>     
>>
>> No fork() so your processes is completely in the pid namespace?
>>   
> What I do is to attach "/bin/sh" to the container with this program.
> The container is a VPS running busybox with the full isolation.
>
> echo $$ gives the real pid.
> All the forked processes appears in the pid namespace, they are visible through
> /proc with the virtual pid.
> I am not able to change to the /proc/self directory (I assume this is normal).

I guess what I meant is that I was expecting something like:
child = fork();
if (child == 0) {
	execve(...);
}
waitpid(child);

This puts /bin/sh in the container as well.
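
A fuller version of the fork-then-exec pattern sketched above might look
like the following. It is only a sketch under stated assumptions:
run_in_child() is a made-up helper name, execv() is used instead of
execve() so the environment is inherited, and the namespace switching done
by the earlier test program is left out.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, exec `path` in the child, and wait for it in the parent.
 * Returns the child's exit status, or -1 on error.
 * run_in_child() is a hypothetical helper name, not from the thread.
 */
static int run_in_child(const char *path, char *const argv[])
{
    int status;
    pid_t child = fork();

    if (child < 0) {
        perror("fork");
        return -1;
    }

    if (child == 0) {
        /* Child: replace the process image; this only returns on failure. */
        execv(path, argv);
        perror("execv");
        _exit(127);
    }

    /* Parent: waitpid() takes the pid by value, not by address. */
    if (waitpid(child, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called after the setns() loop, something like run_in_child("/bin/sh", args)
would start the shell as a child of the attaching process rather than
replacing it.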

I'm not certain about the /proc/self thing; I have never encountered that.
But I guess if your pid is outside of the pid namespace of that instance
of proc, /proc/self will be a broken symlink.

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                           ` <m1lje2qzf4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-08 21:49                                                                                                             ` Serge E. Hallyn
  2010-03-09 10:03                                                                                                             ` Daniel Lezcano
  2010-03-10 21:16                                                                                                             ` Daniel Lezcano
  2 siblings, 0 replies; 184+ messages in thread
From: Serge E. Hallyn @ 2010-03-08 21:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
> 
> This puts /bin/sh in the container as well.
> 
> I'm not certain about the /proc/self thing I have never encountered that.
> But I guess if your pid is outside of the pid namespace of that instance
> of proc /proc/self will be a broken symlink.
> 
> Eric

Hmm, worse than a broken symlink, will it be a wrong symlink if just
the right pid is created in the container?

-serge

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                             ` <20100308214945.GA26617-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-08 22:24                                                                                                               ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-08 22:24 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>> I guess my meaning is I was expecting.
>> child = fork();
>> if (child == 0) {
>> 	execve(...);
>> }
>> waitpid(child);
>> 
>> This puts /bin/sh in the container as well.
>> 
>> I'm not certain about the /proc/self thing I have never encountered that.
>> But I guess if your pid is outside of the pid namespace of that instance
>> of proc /proc/self will be a broken symlink.
>> 
>> Eric
>
> Hmm, worse than a broken symlink, will it be a wrong symlink if just
> the right pid is created in the container?

It won't happen. readlink and follow_link are both based on
task_tgid_nr_ns(current, ns_of_proc), which fails if your process is
not known in that pid namespace.
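
From userspace, the behaviour described above could be observed with a
small check along these lines. This is a sketch only: check_proc_self() is
a made-up helper name, and it assumes a procfs instance is mounted at
/proc. If the calling process is not known in the pid namespace of that
procfs instance, the readlink() call is expected to fail instead of
returning some other process's pid.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare the target of /proc/self with getpid().
 * Returns 1 if they match, 0 if they differ, and -1 if the link cannot
 * be read (errno is left as set by readlink()).
 * check_proc_self() is a hypothetical helper name.
 */
static int check_proc_self(void)
{
    char target[32];
    ssize_t len = readlink("/proc/self", target, sizeof(target) - 1);

    if (len < 0)
        return -1;

    target[len] = '\0';   /* readlink() does not NUL-terminate */
    return strtol(target, NULL, 10) == (long)getpid();
}
```

A return value of -1 corresponds to the "broken symlink" case discussed in
this thread; a return value of 0 would be the "wrong symlink" case.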

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                           ` <m1lje2qzf4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-03-08 21:49                                                                                                             ` Serge E. Hallyn
@ 2010-03-09 10:03                                                                                                             ` Daniel Lezcano
  2010-03-10 21:16                                                                                                             ` Daniel Lezcano
  2 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-09 10:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:

[ ... ]
> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
>
> This puts /bin/sh in the container as well.
>   
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/param.h>

/* Syscall number used by the patchset under test. */
#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;
    pid_t pid;

    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    /* Open every namespace file before switching any namespace. */
    for (i = 0; i < size; i++) {

        snprintf(path, sizeof(path), "/proc/%s/ns/%s", argv[1], ns[i]);

        /* O_CLOEXEC is the open() flag; FD_CLOEXEC is only for fcntl(). */
        fd[i] = open(path, O_RDONLY | O_CLOEXEC);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }

    /* Attach to each namespace in turn. */
    for (i = 0; i < size; i++)
        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }

    pid = fork();
    if (!pid) {

        /* Raw syscall, so the value comes straight from the kernel. */
        fprintf(stderr, "mypid is %ld\n", syscall(__NR_getpid));

        execve(argv[2], &argv[2], NULL);
        perror("execve");
        _exit(127);    /* do not fall through into the parent's code */

    }

    if (pid < 0) {
        perror("fork");
        return -1;
    }

    /* waitpid() takes the pid by value. */
    if (waitpid(pid, NULL, 0) < 0) {
        perror("waitpid");
    }

    return 0;
}

waitpid returns an error:

waitpid: No child processes

The pid number returned by fork is the pid from the init pid namespace, 
but it seems waitpid is looking for a pid belonging to the child pid 
namespace:

waitpid
 -> wait4
   -> find_get_pid
     -> find_vpid
       -> find_pid_ns(nr, current->nsproxy->pid_ns);

current->nsproxy->pid_ns is the namespace we attached to, so the real 
pid returned by fork does not exist in this pid namespace.
Maybe fork should return a pid number belonging to the pid namespace we 
are attached to, no?
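
If the failure really comes from the pid lookup in the call chain above,
one way a test program might sidestep it is to wait for "any child" with
waitpid(-1, ...), so that the parent never hands a pid number back to the
kernel. This is an assumption, not something confirmed in this thread, and
fork_and_reap_any() is a made-up helper name. A minimal sketch:

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, then reap it with waitpid(-1, ...),
 * i.e. "any child", instead of naming the child by pid.
 * Returns the child's exit status, or -1 on error.
 * fork_and_reap_any() is a hypothetical helper name.
 */
static int fork_and_reap_any(int code)
{
    int status;
    pid_t child = fork();

    if (child < 0) {
        perror("fork");
        return -1;
    }

    if (child == 0)
        _exit(code);

    /* Retry if the wait is interrupted by a signal. */
    while (waitpid(-1, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This only works cleanly when the process has a single outstanding child,
since waitpid(-1, ...) reaps whichever child exits first.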

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08 21:25                                                                                                         ` Eric W. Biederman
  2010-03-08 21:49                                                                                                           ` Serge E. Hallyn
@ 2010-03-09 10:03                                                                                                           ` Daniel Lezcano
       [not found]                                                                                                             ` <4B961D09.4010802-GANU6spQydw@public.gmane.org>
  2010-03-09 10:13                                                                                                             ` Eric W. Biederman
       [not found]                                                                                                           ` <m1lje2qzf4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-03-10 21:16                                                                                                           ` Daniel Lezcano
  3 siblings, 2 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-09 10:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Eric W. Biederman wrote:

[ ... ]
> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
>
> This puts /bin/sh in the container as well.
>   
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;
    pid_t pid;
   
    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }
   
    for (i = 0; i < size; i++)
        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }

    pid = fork();
    if (!pid) {

        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));

        execve(argv[2], &argv[2], NULL);
        perror("execve");

    }

    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (waitpid(&pid, NULL, 0) < 0) {
        perror("waitpid");
    }

    return 0;
}

Waitpid returns an error:

waitpid: No child processes

The pid number returned by fork is the pid from the init pid namespace 
but it seems waitpid is waiting for a pid belonging to the child pid 
namespace.

waitpid
 -> wait4
   -> find_get_pid
     -> find_vpid
       -> find_pid_ns(nr, current->nsproxy->pid_ns);

The current->nsproxy->pid_ns is that of the namespace we attached to.
So the real pid returned by fork does not exist in this pid namespace.
Maybe fork should return a pid number belonging to the pid namespace
we are attached to, no?





^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                             ` <4B961D09.4010802-GANU6spQydw@public.gmane.org>
@ 2010-03-09 10:13                                                                                                               ` Eric W. Biederman
  0 siblings, 0 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-09 10:13 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:

> Eric W. Biederman wrote:
>
> [ ... ]
>> I guess my meaning is I was expecting.
>> child = fork();
>> if (child == 0) {
>> 	execve(...);
>> }
>> waitpid(child);
>>
>> This puts /bin/sh in the container as well.
>>   
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/param.h>
>
> #define __NR_setns 300
>
> int setns(int nstype, int fd)
> {
>    return syscall (__NR_setns, nstype, fd);
> }
>
> int main(int argc, char *argv[])
> {
>    char path[MAXPATHLEN];
>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>    const int size = sizeof(ns) / sizeof(char *);
>    int fd[size];
>    int i;
>    pid_t pid;
>    if (argc != 3) {
>        fprintf(stderr, "mynsenter <pid> <command>\n");
>        exit(1);
>    }
>
>    for (i = 0; i < size; i++) {
>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>
>        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
>        if (fd[i] < 0) {
>            perror("open");
>            return -1;
>        }
>
>    }
>    for (i = 0; i < size; i++)
>        if (setns(0, fd[i])) {
>            perror("setns");
>            return -1;
>        }
>
>    pid = fork();
>    if (!pid) {
>
>        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));
>
>        execve(argv[2], &argv[2], NULL);
>        perror("execve");
>
>    }
>
>    if (pid < 0) {
>        perror("fork");
>        return -1;
>    }
>
>    if (waitpid(&pid, NULL, 0) < 0) {
>        perror("waitpid");
>    }
>
>    return 0;
> }

&pid ???  Isn't that a type error?

> Waitpid returns an error:
>
> waitpid: No child processes
>
> The pid number returned by fork is the pid from the init pid namespace but it
> seems waitpid is waiting for a pid belonging to the child pid namespace.
>
> waitpid
> -> wait4
>   -> find_get_pid
>     -> find_vpid
>       -> find_pid_ns(nr, current->nsproxy->pid_ns);

But it isn't.  It is:
           find_pid_ns(nr, task_active_pid_ns(current));
which is:
           find_pid_ns(nr, ns_of_pid(task_pid(current)));

and that is a value that doesn't change when we attach to a pid namespace.

> The current->nsproxy->pid_ns is the one of the namespace we attached to. So the
> real pid returned by the fork does not exist in this pid namespace.
> Maybe fork should return a pid number belonging to the current pid namespace we
> are attached no ?

Do you not have my patch that changed that?

Eric

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-09 10:03                                                                                                           ` Daniel Lezcano
       [not found]                                                                                                             ` <4B961D09.4010802-GANU6spQydw@public.gmane.org>
@ 2010-03-09 10:13                                                                                                             ` Eric W. Biederman
  2010-03-09 10:26                                                                                                               ` Daniel Lezcano
       [not found]                                                                                                               ` <m1ocixn6q3.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 2 replies; 184+ messages in thread
From: Eric W. Biederman @ 2010-03-09 10:13 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>
> [ ... ]
>> I guess my meaning is I was expecting.
>> child = fork();
>> if (child == 0) {
>> 	execve(...);
>> }
>> waitpid(child);
>>
>> This puts /bin/sh in the container as well.
>>   
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/param.h>
>
> #define __NR_setns 300
>
> int setns(int nstype, int fd)
> {
>    return syscall (__NR_setns, nstype, fd);
> }
>
> int main(int argc, char *argv[])
> {
>    char path[MAXPATHLEN];
>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>    const int size = sizeof(ns) / sizeof(char *);
>    int fd[size];
>    int i;
>    pid_t pid;
>    if (argc != 3) {
>        fprintf(stderr, "mynsenter <pid> <command>\n");
>        exit(1);
>    }
>
>    for (i = 0; i < size; i++) {
>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>
>        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
>        if (fd[i] < 0) {
>            perror("open");
>            return -1;
>        }
>
>    }
>    for (i = 0; i < size; i++)
>        if (setns(0, fd[i])) {
>            perror("setns");
>            return -1;
>        }
>
>    pid = fork();
>    if (!pid) {
>
>        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));
>
>        execve(argv[2], &argv[2], NULL);
>        perror("execve");
>
>    }
>
>    if (pid < 0) {
>        perror("fork");
>        return -1;
>    }
>
>    if (waitpid(&pid, NULL, 0) < 0) {
>        perror("waitpid");
>    }
>
>    return 0;
> }

&pid ???  Isn't that a type error?

> Waitpid returns an error:
>
> waitpid: No child processes
>
> The pid number returned by fork is the pid from the init pid namespace but it
> seems waitpid is waiting for a pid belonging to the child pid namespace.
>
> waitpid
> -> wait4
>   -> find_get_pid
>     -> find_vpid
>       -> find_pid_ns(nr, current->nsproxy->pid_ns);

But it isn't.  It is:
           find_pid_ns(nr, task_active_pid_ns(current));
which is:
           find_pid_ns(nr, ns_of_pid(task_pid(current)));

and that is a value that doesn't change when we attach to a pid namespace.

> The current->nsproxy->pid_ns is the one of the namespace we attached to. So the
> real pid returned by the fork does not exist in this pid namespace.
> Maybe fork should return a pid number belonging to the current pid namespace we
> are attached no ?

Do you not have my patch that changed that?

Eric


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                               ` <m1ocixn6q3.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-09 10:26                                                                                                                 ` Daniel Lezcano
  0 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-09 10:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>
>   
>> Eric W. Biederman wrote:
>>
>> [ ... ]
>>     
>>> I guess my meaning is I was expecting.
>>> child = fork();
>>> if (child == 0) {
>>> 	execve(...);
>>> }
>>> waitpid(child);
>>>
>>> This puts /bin/sh in the container as well.
>>>   
>>>       
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <syscall.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/param.h>
>>
>> #define __NR_setns 300
>>
>> int setns(int nstype, int fd)
>> {
>>    return syscall (__NR_setns, nstype, fd);
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>    char path[MAXPATHLEN];
>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>    const int size = sizeof(ns) / sizeof(char *);
>>    int fd[size];
>>    int i;
>>    pid_t pid;
>>    if (argc != 3) {
>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>        exit(1);
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>
>>        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
>>        if (fd[i] < 0) {
>>            perror("open");
>>            return -1;
>>        }
>>
>>    }
>>    for (i = 0; i < size; i++)
>>        if (setns(0, fd[i])) {
>>            perror("setns");
>>            return -1;
>>        }
>>
>>    pid = fork();
>>    if (!pid) {
>>
>>        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));
>>
>>        execve(argv[2], &argv[2], NULL);
>>        perror("execve");
>>
>>    }
>>
>>    if (pid < 0) {
>>        perror("fork");
>>        return -1;
>>    }
>>
>>    if (waitpid(&pid, NULL, 0) < 0) {
>>        perror("waitpid");
>>    }
>>
>>    return 0;
>> }
>>     
>
> &pid ???  Isn't that a type error?
>   
argh ! right :)

Sorry for the noise. Works well now.
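
For reference, a minimal sketch of the fork/execve/waitpid pattern
discussed above, with the child's pid passed to waitpid() by value.
The helper name run_and_wait() is hypothetical, and the setns() calls
from the test program are left out on purpose so that the sketch runs
without privileges; it only illustrates the corrected wait.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the posted program): run argv[0] in
 * a child process and return its exit status, or -1 on failure.
 */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv && argv[0]);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execve(argv[0], argv, NULL);
        _exit(127);     /* only reached if execve() failed */
    }

    /* pass the pid itself, not its address */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

waitpid() is declared in <sys/wait.h>, which the posted program did not
include; with the prototype in scope the compiler would have flagged
the &pid argument.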

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-09 10:13                                                                                                             ` Eric W. Biederman
@ 2010-03-09 10:26                                                                                                               ` Daniel Lezcano
       [not found]                                                                                                               ` <m1ocixn6q3.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  1 sibling, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-09 10:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>
>> [ ... ]
>>     
>>> I guess my meaning is I was expecting.
>>> child = fork();
>>> if (child == 0) {
>>> 	execve(...);
>>> }
>>> waitpid(child);
>>>
>>> This puts /bin/sh in the container as well.
>>>   
>>>       
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <syscall.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/param.h>
>>
>> #define __NR_setns 300
>>
>> int setns(int nstype, int fd)
>> {
>>    return syscall (__NR_setns, nstype, fd);
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>    char path[MAXPATHLEN];
>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>    const int size = sizeof(ns) / sizeof(char *);
>>    int fd[size];
>>    int i;
>>    pid_t pid;
>>    if (argc != 3) {
>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>        exit(1);
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>
>>        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
>>        if (fd[i] < 0) {
>>            perror("open");
>>            return -1;
>>        }
>>
>>    }
>>    for (i = 0; i < size; i++)
>>        if (setns(0, fd[i])) {
>>            perror("setns");
>>            return -1;
>>        }
>>
>>    pid = fork();
>>    if (!pid) {
>>
>>        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));
>>
>>        execve(argv[2], &argv[2], NULL);
>>        perror("execve");
>>
>>    }
>>
>>    if (pid < 0) {
>>        perror("fork");
>>        return -1;
>>    }
>>
>>    if (waitpid(&pid, NULL, 0) < 0) {
>>        perror("waitpid");
>>    }
>>
>>    return 0;
>> }
>>     
>
> &pid ???  Isn't that a type error?
>   
argh ! right :)

Sorry for the noise. Works well now.


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                                                           ` <m1lje2qzf4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  2010-03-08 21:49                                                                                                             ` Serge E. Hallyn
  2010-03-09 10:03                                                                                                             ` Daniel Lezcano
@ 2010-03-10 21:16                                                                                                             ` Daniel Lezcano
  2 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-10 21:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Netfilter Development Mailinglist, Ben Greear,
	Sukadev Bhattiprolu

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> writes:
>   

[ ... ]

> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
>
> This puts /bin/sh in the container as well.
>   

Eric,

at this point I have not run into any obvious bug, and I was able to
enter the container and execute commands directly inside it.

Excellent !

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
  2010-03-08 21:25                                                                                                         ` Eric W. Biederman
                                                                                                                             ` (2 preceding siblings ...)
       [not found]                                                                                                           ` <m1lje2qzf4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-03-10 21:16                                                                                                           ` Daniel Lezcano
  3 siblings, 0 replies; 184+ messages in thread
From: Daniel Lezcano @ 2010-03-10 21:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Sukadev Bhattiprolu, Serge Hallyn,
	Linux Netdev List, containers, Netfilter Development Mailinglist,
	Ben Greear

Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>   

[ ... ]

> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
>
> This puts /bin/sh in the container as well.
>   

Eric,

at this point I have not run into any obvious bug, and I was able to
enter the container and execute commands directly inside it.

Excellent !

Thanks
  -- Daniel



^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Devel] Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                                                 ` <m18wa9glpo.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2010-05-27 12:06                                                                   ` Enrico Weigelt
  0 siblings, 0 replies; 184+ messages in thread
From: Enrico Weigelt @ 2010-05-27 12:06 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA

* Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> schrieb:

> At least for the network namespace there is a lot of value in being
> able to just change that single namespace.  Having multiple logical
> network stacks has it's challenges but has a lot of practical
> applications.  Especially when there is the possibility of private
> ipv4 addresses overlapping, or you have interfaces where you never
> want to forward between them but you want forwarding enabled.

ACK. One practical example: virtualized routes, e.g. for VPNs.
Several years ago, I had a customer who provided VPNs via central
hubs. One of the main problems was that he needed dedicated physical
machines for the VPN hubs due to overlapping IP spaces. We later
migrated them to coLinux-based VMs to save a lot of iron.

In one of my next projects this issue will pop up again.


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Devel] [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
                                                   ` (4 preceding siblings ...)
  2010-02-26 21:13                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Pavel Emelyanov
@ 2010-05-27 12:28                                 ` Enrico Weigelt
       [not found]                                   ` <20100527122800.GC31480-q9I3ByPDOfiE+EvaaNYduQ@public.gmane.org>
  5 siblings, 1 reply; 184+ messages in thread
From: Enrico Weigelt @ 2010-05-27 12:28 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

* Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:

Hi,

> The nsfd() system call returns a file descriptor that can
> be used to talk about a specific namespace, and to keep
> the specified namespace alive.

Why not map it directly into the filesystem (e.g. /proc)
instead of adding yet another syscall?


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Devel] [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                   ` <20100527122800.GC31480-q9I3ByPDOfiE+EvaaNYduQ@public.gmane.org>
@ 2010-05-27 12:44                                     ` Daniel Lezcano
       [not found]                                       ` <4BFE6938.50607-GANU6spQydw@public.gmane.org>
  0 siblings, 1 reply; 184+ messages in thread
From: Daniel Lezcano @ 2010-05-27 12:44 UTC (permalink / raw)
  To: weigelt-EU+a56NjgY8
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 05/27/2010 02:28 PM, Enrico Weigelt wrote:
> * Eric W. Biederman<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>  wrote:
>
> Hi,
>
>    
>> The nsfd() system call returns a file descriptor that can
>> be used to talk about a specific namespace, and to keep
>> the specified namespace alive.
>>      
> Why not directly mapping it into the filesystem (eg. /proc)
> instead of yet another syscall ?
>    

I think that is already done: the latest version maps
/proc/<pid>/ns/[net,pid,...].
You have to open one of these files and use the 'setns' syscall.

Thanks
   -- Daniel

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Devel] [RFC][PATCH] ns: Syscalls for better namespace sharing control.
       [not found]                                       ` <4BFE6938.50607-GANU6spQydw@public.gmane.org>
@ 2010-05-27 15:42                                         ` Enrico Weigelt
  0 siblings, 0 replies; 184+ messages in thread
From: Enrico Weigelt @ 2010-05-27 15:42 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

* Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> wrote:

> I think it is already done, the last version maps 
> /proc/<pid>/ns/[net,pid,...].

Ok.

> You have to open one of these files and use the 'setns' syscall.

The setns() syscall IMHO could also be replaced by procfs
operations, e.g. writing the proper values to exactly these files.


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 184+ messages in thread

* RFC: netfilter: nf_conntrack: add support for "conntrack zones"
@ 2010-01-14 14:05 Patrick McHardy
  0 siblings, 0 replies; 184+ messages in thread
From: Patrick McHardy @ 2010-01-14 14:05 UTC (permalink / raw)
  To: Netfilter Development Mailinglist
  Cc: Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

[-- Attachment #1: Type: text/plain, Size: 2897 bytes --]

The attached largish patch adds support for "conntrack zones",
which are virtual conntrack tables that can be used to separate
connections from different zones, making it possible to handle multiple
connections with equal identities in conntrack and NAT.

A zone is simply a numerical identifier associated with a network
device that is incorporated into the various hashes and used to
distinguish entries in addition to the connection tuples. Additionally
it is used to separate conntrack defragmentation queues. Alternatively,
an iptables target for the raw table could be used instead of the
network device to assign conntrack entries to zones.
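
To illustrate the idea outside the kernel, here is a toy sketch in
which the zone takes part in both the hash and the match. All names
(struct tuple, struct entry, entry_match(), entry_hash()) are made up
for this example, the tuple is heavily simplified, and the FNV-style
mixing merely stands in for whatever hash the real code uses.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a conntrack tuple (IPv4, one direction). */
struct tuple {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* A table entry: the tuple plus the zone it was created in. */
struct entry {
    uint16_t zone;
    struct tuple t;
};

static int tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->src == b->src && a->dst == b->dst &&
           a->sport == b->sport && a->dport == b->dport &&
           a->proto == b->proto;
}

/* An entry matches only if both the zone and the tuple agree. */
int entry_match(const struct entry *e, uint16_t zone, const struct tuple *t)
{
    return e->zone == zone && tuple_equal(&e->t, t);
}

/* Toy mixing step using the FNV prime; not the kernel's hash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    return h * 16777619u;
}

/* Bucket index computed from the tuple fields and the zone. */
uint32_t entry_hash(uint16_t zone, const struct tuple *t, uint32_t buckets)
{
    uint32_t h = 2166136261u;

    assert(buckets > 0);

    h = mix(h, t->src);
    h = mix(h, t->dst);
    h = mix(h, ((uint32_t)t->sport << 16) | t->dport);
    h = mix(h, t->proto);
    h = mix(h, zone);
    return h % buckets;
}
```

With this, two connections with identical tuples can coexist as long
as they were created in different zones, since a lookup in zone 2 never
matches an entry stored for zone 1.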

This is mainly useful when connecting multiple private networks that use
the same addresses (which unfortunately happens occasionally): the
packets are passed through a set of veth devices and each network is
SNATed to a unique address, after which they can pass through the "main"
zone and be handled like regular non-clashing packets and/or have NAT
applied a second time based, for instance, on the outgoing interface.

Something like this, with multiple tunl and veth devices, each pair
using a unique zone:

  <tunl0 / zone 1>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to unique network
     |
  <veth1 / zone 1>
  <veth0 / zone 0>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to eth0 address
     |
  <eth0>

As probably everyone has noticed, this is quite similar to what you
can do using network namespaces. The main reason for not using
network namespaces is that it's an all-or-nothing approach; you can't
virtualize just connection tracking. Besides the difficulties in
managing different namespaces from, for instance, an IKE or PPP daemon
running in the initial namespace, network namespaces have quite a large
overhead, especially when used with a large conntrack table.

I'm not too fond of this partial feature duplication myself, but I
couldn't think of a better way to do this without the downsides of
using namespaces. Having partially shared network namespaces would
be great, but it doesn't seem to fit in the design very well.
I'm open to any better suggestions :)

A couple of notes on the patch:

- it's not entirely finished yet (ctnetlink and xt_connlimit are
  missing), I wanted to have a discussion about the general idea first.

- the patch uses ct_extend to avoid increasing the connection tracking
  entry size when this feature is not used. An older version of this
  patch added the zone identifier to the conntrack tuples. This greatly
  simplifies the changes to the code since the zone doesn't have to be
  passed around (something like 40 lines total), but has the downside
  of increasing the tuple size.

- the overhead should be quite small, it's mainly the extra argument
  passing and an occasional extra comparison. Code size increase with
  all netfilter options enabled on x86_64 is 152 bytes.

Any comments welcome.

[-- Attachment #2: 01.diff --]
[-- Type: text/x-patch, Size: 50343 bytes --]

commit 7f68e7aa55f9e1f9dfd647b60dace4149f27ae1f
Author: Patrick McHardy <kaber-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
Date:   Thu Jan 14 13:51:06 2010 +0100

    netfilter: nf_conntrack: add support for "conntrack zones"
    
    Normally, each connection needs a unique identity. Conntrack zones allow
    to specify a numerical zone for each interface, connections in different
    zones can use the same identity.
    
    Signed-off-by: Patrick McHardy <kaber-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a3fccc8..6e6a209 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -953,6 +953,10 @@ struct net_device {
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	u16			nf_ct_zone;
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/net/ip.h b/include/net/ip.h
index 85108cf..61aface 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -336,8 +336,11 @@ enum ip_defrag_users {
 	IP_DEFRAG_LOCAL_DELIVER,
 	IP_DEFRAG_CALL_RA_CHAIN,
 	IP_DEFRAG_CONNTRACK_IN,
+	__IP_DEFRAG_CONNTRACK_IN_END	= IP_DEFRAG_CONNTRACK_IN + 0xffff,
 	IP_DEFRAG_CONNTRACK_OUT,
+	__IP_DEFRAG_CONNTRACK_OUT_END	= IP_DEFRAG_CONNTRACK_OUT + 0xffff,
 	IP_DEFRAG_CONNTRACK_BRIDGE_IN,
+	__IP_DEFRAG_CONNTRACK_BRIDGE_IN = IP_DEFRAG_CONNTRACK_BRIDGE_IN + 0xffff,
 	IP_DEFRAG_VS_IN,
 	IP_DEFRAG_VS_OUT,
 	IP_DEFRAG_VS_FWD
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ccab594..b82a68d 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -353,8 +353,11 @@ struct inet_frag_queue;
 enum ip6_defrag_users {
 	IP6_DEFRAG_LOCAL_DELIVER,
 	IP6_DEFRAG_CONNTRACK_IN,
+	__IP6_DEFRAG_CONNTRACK_IN	= IP6_DEFRAG_CONNTRACK_IN + 0xffff,
 	IP6_DEFRAG_CONNTRACK_OUT,
+	__IP6_DEFRAG_CONNTRACK_OUT	= IP6_DEFRAG_CONNTRACK_OUT + 0xffff,
 	IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
+	__IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + 0xffff,
 };
 
 struct ip6_create_arg {
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a0904ad..9488ac6 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -198,7 +198,8 @@ extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int *vmalloced, int null
 extern void nf_ct_free_hashtable(void *hash, int vmalloced, unsigned int size);
 
 extern struct nf_conntrack_tuple_hash *
-__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple);
+__nf_conntrack_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple);
 
 extern void nf_conntrack_hash_insert(struct nf_conn *ct);
 extern void nf_ct_delete_from_lists(struct nf_conn *ct);
@@ -267,7 +268,7 @@ extern void
 nf_ct_iterate_cleanup(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data);
 extern void nf_conntrack_free(struct nf_conn *ct);
 extern struct nf_conn *
-nf_conntrack_alloc(struct net *net,
+nf_conntrack_alloc(struct net *net, u16 zone,
 		   const struct nf_conntrack_tuple *orig,
 		   const struct nf_conntrack_tuple *repl,
 		   gfp_t gfp);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 5a449b4..c7a1162 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -20,7 +20,7 @@
 /* This header is used to share core functionality between the
    standalone connection tracking module, and the compatibility layer's use
    of connection tracking. */
-extern unsigned int nf_conntrack_in(struct net *net,
+extern unsigned int nf_conntrack_in(struct net *net, u16 zone,
 				    u_int8_t pf,
 				    unsigned int hooknum,
 				    struct sk_buff *skb);
@@ -49,7 +49,8 @@ nf_ct_invert_tuple(struct nf_conntrack_tuple *inverse,
 
 /* Find a connection corresponding to a tuple. */
 extern struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple);
+nf_conntrack_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple);
 
 extern int __nf_conntrack_confirm(struct sk_buff *skb);
 
diff --git a/include/net/netfilter/nf_conntrack_expect.h b/include/net/netfilter/nf_conntrack_expect.h
index 9a2b9cb..83c49f3 100644
--- a/include/net/netfilter/nf_conntrack_expect.h
+++ b/include/net/netfilter/nf_conntrack_expect.h
@@ -77,13 +77,16 @@ int nf_conntrack_expect_init(struct net *net);
 void nf_conntrack_expect_fini(struct net *net);
 
 struct nf_conntrack_expect *
-__nf_ct_expect_find(struct net *net, const struct nf_conntrack_tuple *tuple);
+__nf_ct_expect_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple);
 
 struct nf_conntrack_expect *
-nf_ct_expect_find_get(struct net *net, const struct nf_conntrack_tuple *tuple);
+nf_ct_expect_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple);
 
 struct nf_conntrack_expect *
-nf_ct_find_expectation(struct net *net, const struct nf_conntrack_tuple *tuple);
+nf_ct_find_expectation(struct net *net, u16 zone,
+		       const struct nf_conntrack_tuple *tuple);
 
 void nf_ct_unlink_expect(struct nf_conntrack_expect *exp);
 void nf_ct_remove_expectations(struct nf_conn *ct);
diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h
index e192dc1..2d2a1f9 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -8,6 +8,7 @@ enum nf_ct_ext_id {
 	NF_CT_EXT_NAT,
 	NF_CT_EXT_ACCT,
 	NF_CT_EXT_ECACHE,
+	NF_CT_EXT_ZONE,
 	NF_CT_EXT_NUM,
 };
 
@@ -15,6 +16,7 @@ enum nf_ct_ext_id {
 #define NF_CT_EXT_NAT_TYPE struct nf_conn_nat
 #define NF_CT_EXT_ACCT_TYPE struct nf_conn_counter
 #define NF_CT_EXT_ECACHE_TYPE struct nf_conntrack_ecache
+#define NF_CT_EXT_ZONE_TYPE struct nf_conntrack_zone
 
 /* Extensions: optional stuff which isn't permanently in struct. */
 struct nf_ct_ext {
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index ca6dcf3..14b6492 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -49,8 +49,8 @@ struct nf_conntrack_l4proto {
 	/* Called when a conntrack entry is destroyed */
 	void (*destroy)(struct nf_conn *ct);
 
-	int (*error)(struct net *net, struct sk_buff *skb, unsigned int dataoff,
-		     enum ip_conntrack_info *ctinfo,
+	int (*error)(struct net *net, u16 zone, struct sk_buff *skb,
+		     unsigned int dataoff, enum ip_conntrack_info *ctinfo,
 		     u_int8_t pf, unsigned int hooknum);
 
 	/* Print out the per-protocol part of the tuple. Return like seq_* */
diff --git a/include/net/netfilter/nf_conntrack_zones.h b/include/net/netfilter/nf_conntrack_zones.h
new file mode 100644
index 0000000..77d430b
--- /dev/null
+++ b/include/net/netfilter/nf_conntrack_zones.h
@@ -0,0 +1,30 @@
+#ifndef _NF_CONNTRACK_ZONES_H
+#define _NF_CONNTRACK_ZONES_H
+
+#include <net/netfilter/nf_conntrack_extend.h>
+
+struct nf_conntrack_zone {
+	u16	id;
+};
+
+static inline u16 nf_ct_zone(const struct nf_conn *ct)
+{
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	struct nf_conntrack_zone *nf_ct_zone;
+	nf_ct_zone = nf_ct_ext_find(ct, NF_CT_EXT_ZONE);
+	if (nf_ct_zone)
+		return nf_ct_zone->id;
+#endif
+	return 0;
+}
+
+static inline u16 nf_ct_dev_zone(const struct net_device *dev)
+{
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	return dev->nf_ct_zone;
+#else
+	return 0;
+#endif
+}
+
+#endif /* _NF_CONNTRACK_ZONES_H */
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index fbc1c74..83d8bf2 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -289,6 +289,23 @@ static ssize_t show_ifalias(struct device *dev,
 	return ret;
 }
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+NETDEVICE_SHOW(nf_ct_zone, fmt_dec);
+
+static int change_nf_ct_zone(struct net_device *net, unsigned long zone)
+{
+	net->nf_ct_zone = zone;
+	return 0;
+}
+
+static ssize_t store_nf_ct_zone(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_nf_ct_zone);
+}
+#endif
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(dev_id, S_IRUGO, show_dev_id, NULL),
@@ -309,6 +326,9 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	__ATTR(nf_ct_zone, S_IRUGO | S_IWUSR, show_nf_ct_zone, store_nf_ct_zone),
+#endif
 	{}
 };
 
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index d171b12..b3a0634 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -23,6 +23,7 @@
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv4/nf_conntrack_ipv4.h>
 #include <net/netfilter/nf_nat_helper.h>
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
@@ -140,7 +141,7 @@ static unsigned int ipv4_conntrack_in(unsigned int hooknum,
 				      const struct net_device *out,
 				      int (*okfn)(struct sk_buff *))
 {
-	return nf_conntrack_in(dev_net(in), PF_INET, hooknum, skb);
+	return nf_conntrack_in(dev_net(in), nf_ct_dev_zone(in), PF_INET, hooknum, skb);
 }
 
 static unsigned int ipv4_conntrack_local(unsigned int hooknum,
@@ -153,7 +154,7 @@ static unsigned int ipv4_conntrack_local(unsigned int hooknum,
 	if (skb->len < sizeof(struct iphdr) ||
 	    ip_hdrlen(skb) < sizeof(struct iphdr))
 		return NF_ACCEPT;
-	return nf_conntrack_in(dev_net(out), PF_INET, hooknum, skb);
+	return nf_conntrack_in(dev_net(out), nf_ct_dev_zone(out), PF_INET, hooknum, skb);
 }
 
 /* Connection tracking may drop packets, but never alters them, so
@@ -266,7 +267,7 @@ getorigdst(struct sock *sk, int optval, void __user *user, int *len)
 		return -EINVAL;
 	}
 
-	h = nf_conntrack_find_get(sock_net(sk), &tuple);
+	h = nf_conntrack_find_get(sock_net(sk), 0, &tuple);
 	if (h) {
 		struct sockaddr_in sin;
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 7afd39b..82b4b30 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -114,7 +114,7 @@ static bool icmp_new(struct nf_conn *ct, const struct sk_buff *skb,
 
 /* Returns conntrack if it dealt with ICMP, and filled in skb fields */
 static int
-icmp_error_message(struct net *net, struct sk_buff *skb,
+icmp_error_message(struct net *net, u16 zone, struct sk_buff *skb,
 		 enum ip_conntrack_info *ctinfo,
 		 unsigned int hooknum)
 {
@@ -146,7 +146,7 @@ icmp_error_message(struct net *net, struct sk_buff *skb,
 
 	*ctinfo = IP_CT_RELATED;
 
-	h = nf_conntrack_find_get(net, &innertuple);
+	h = nf_conntrack_find_get(net, zone, &innertuple);
 	if (!h) {
 		pr_debug("icmp_error_message: no match\n");
 		return -NF_ACCEPT;
@@ -163,7 +163,8 @@ icmp_error_message(struct net *net, struct sk_buff *skb,
 
 /* Small and modified version of icmp_rcv */
 static int
-icmp_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
+icmp_error(struct net *net, u16 zone,
+	   struct sk_buff *skb, unsigned int dataoff,
 	   enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
 {
 	const struct icmphdr *icmph;
@@ -208,7 +209,7 @@ icmp_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
 	    icmph->type != ICMP_REDIRECT)
 		return NF_ACCEPT;
 
-	return icmp_error_message(net, skb, ctinfo, hooknum);
+	return icmp_error_message(net, zone, skb, ctinfo, hooknum);
 }
 
 #if defined(CONFIG_NF_CT_NETLINK) || defined(CONFIG_NF_CT_NETLINK_MODULE)
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 331ead3..488e889 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -16,6 +16,7 @@
 
 #include <linux/netfilter_bridge.h>
 #include <linux/netfilter_ipv4.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
 
 /* Returns new sk_buff, or NULL */
@@ -35,18 +36,18 @@ static int nf_ct_ipv4_gather_frags(struct sk_buff *skb, u_int32_t user)
 	return err;
 }
 
-static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum,
+static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum, u16 zone,
 					      struct sk_buff *skb)
 {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	if (skb->nf_bridge &&
 	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
-		return IP_DEFRAG_CONNTRACK_BRIDGE_IN;
+		return IP_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING)
-		return IP_DEFRAG_CONNTRACK_IN;
+		return IP_DEFRAG_CONNTRACK_IN + zone;
 	else
-		return IP_DEFRAG_CONNTRACK_OUT;
+		return IP_DEFRAG_CONNTRACK_OUT + zone;
 }
 
 static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
@@ -65,7 +66,9 @@ static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
 #endif
 	/* Gather fragments. */
 	if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)) {
-		enum ip_defrag_users user = nf_ct_defrag_user(hooknum, skb);
+		u16 zone = nf_ct_dev_zone(hooknum == NF_INET_PRE_ROUTING ? in : out);
+		enum ip_defrag_users user = nf_ct_defrag_user(hooknum, zone, skb);
+
 		if (nf_ct_ipv4_gather_frags(skb, user))
 			return NF_STOLEN;
 	}
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index fe1a644..64b9979 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -30,6 +30,7 @@
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_l4proto.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 
 static DEFINE_SPINLOCK(nf_nat_lock);
 
@@ -72,13 +73,13 @@ EXPORT_SYMBOL_GPL(nf_nat_proto_put);
 
 /* We keep an extra hash for each conntrack, for fast searching. */
 static inline unsigned int
-hash_by_src(const struct nf_conntrack_tuple *tuple)
+hash_by_src(const struct nf_conntrack_tuple *tuple, u16 zone)
 {
 	unsigned int hash;
 
 	/* Original src, to ensure we map it consistently if poss. */
 	hash = jhash_3words((__force u32)tuple->src.u3.ip,
-			    (__force u32)tuple->src.u.all,
+			    (__force u32)tuple->src.u.all ^ zone,
 			    tuple->dst.protonum, 0);
 	return ((u64)hash * nf_nat_htable_size) >> 32;
 }
@@ -142,12 +143,12 @@ same_src(const struct nf_conn *ct,
 
 /* Only called for SRC manip */
 static int
-find_appropriate_src(struct net *net,
+find_appropriate_src(struct net *net, u16 zone,
 		     const struct nf_conntrack_tuple *tuple,
 		     struct nf_conntrack_tuple *result,
 		     const struct nf_nat_range *range)
 {
-	unsigned int h = hash_by_src(tuple);
+	unsigned int h = hash_by_src(tuple, zone);
 	const struct nf_conn_nat *nat;
 	const struct nf_conn *ct;
 	const struct hlist_node *n;
@@ -155,7 +156,7 @@ find_appropriate_src(struct net *net,
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(nat, n, &net->ipv4.nat_bysource[h], bysource) {
 		ct = nat->ct;
-		if (same_src(ct, tuple)) {
+		if (same_src(ct, tuple) && nf_ct_zone(ct) == zone) {
 			/* Copy source part from reply tuple. */
 			nf_ct_invert_tuplepr(result,
 				       &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
@@ -178,7 +179,7 @@ find_appropriate_src(struct net *net,
    the ip with the lowest src-ip/dst-ip/proto usage.
 */
 static void
-find_best_ips_proto(struct nf_conntrack_tuple *tuple,
+find_best_ips_proto(u16 zone, struct nf_conntrack_tuple *tuple,
 		    const struct nf_nat_range *range,
 		    const struct nf_conn *ct,
 		    enum nf_nat_manip_type maniptype)
@@ -212,7 +213,7 @@ find_best_ips_proto(struct nf_conntrack_tuple *tuple,
 	maxip = ntohl(range->max_ip);
 	j = jhash_2words((__force u32)tuple->src.u3.ip,
 			 range->flags & IP_NAT_RANGE_PERSISTENT ?
-				0 : (__force u32)tuple->dst.u3.ip, 0);
+				0 : (__force u32)tuple->dst.u3.ip ^ zone, 0);
 	j = ((u64)j * (maxip - minip + 1)) >> 32;
 	*var_ipp = htonl(minip + j);
 }
@@ -232,6 +233,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 {
 	struct net *net = nf_ct_net(ct);
 	const struct nf_nat_protocol *proto;
+	u16 zone = nf_ct_zone(ct);
 
 	/* 1) If this srcip/proto/src-proto-part is currently mapped,
 	   and that same mapping gives a unique tuple within the given
@@ -242,7 +244,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	   manips not an issue.  */
 	if (maniptype == IP_NAT_MANIP_SRC &&
 	    !(range->flags & IP_NAT_RANGE_PROTO_RANDOM)) {
-		if (find_appropriate_src(net, orig_tuple, tuple, range)) {
+		if (find_appropriate_src(net, zone, orig_tuple, tuple, range)) {
 			pr_debug("get_unique_tuple: Found current src map\n");
 			if (!nf_nat_used_tuple(tuple, ct))
 				return;
@@ -252,7 +254,7 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
 	/* 2) Select the least-used IP/proto combination in the given
 	   range. */
 	*tuple = *orig_tuple;
-	find_best_ips_proto(tuple, range, ct, maniptype);
+	find_best_ips_proto(zone, tuple, range, ct, maniptype);
 
 	/* 3) The per-protocol part of the manip is made to map into
 	   the range to make a unique tuple. */
@@ -330,7 +332,8 @@ nf_nat_setup_info(struct nf_conn *ct,
 	if (have_to_hash) {
 		unsigned int srchash;
 
-		srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+		srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+				      nf_ct_zone(ct));
 		spin_lock_bh(&nf_nat_lock);
 		/* nf_conntrack_alter_reply might re-allocate exntension aera */
 		nat = nfct_nat(ct);
diff --git a/net/ipv4/netfilter/nf_nat_pptp.c b/net/ipv4/netfilter/nf_nat_pptp.c
index 9eb1710..4c06003 100644
--- a/net/ipv4/netfilter/nf_nat_pptp.c
+++ b/net/ipv4/netfilter/nf_nat_pptp.c
@@ -25,6 +25,7 @@
 #include <net/netfilter/nf_nat_rule.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_proto_gre.h>
 #include <linux/netfilter/nf_conntrack_pptp.h>
 
@@ -74,7 +75,7 @@ static void pptp_nat_expected(struct nf_conn *ct,
 
 	pr_debug("trying to unexpect other dir: ");
 	nf_ct_dump_tuple_ip(&t);
-	other_exp = nf_ct_expect_find_get(net, &t);
+	other_exp = nf_ct_expect_find_get(net, nf_ct_zone(ct), &t);
 	if (other_exp) {
 		nf_ct_unexpect_related(other_exp);
 		nf_ct_expect_put(other_exp);
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 0956eba..0db0d7f 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -27,6 +27,7 @@
 #include <net/netfilter/nf_conntrack_l4proto.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 #include <net/netfilter/nf_log.h>
 
@@ -188,18 +189,18 @@ out:
 	return nf_conntrack_confirm(skb);
 }
 
-static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
+static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum, u16 zone,
 						struct sk_buff *skb)
 {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	if (skb->nf_bridge &&
 	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
-		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN;
+		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING)
-		return IP6_DEFRAG_CONNTRACK_IN;
+		return IP6_DEFRAG_CONNTRACK_IN + zone;
 	else
-		return IP6_DEFRAG_CONNTRACK_OUT;
+		return IP6_DEFRAG_CONNTRACK_OUT + zone;
 
 }
 
@@ -210,12 +211,14 @@ static unsigned int ipv6_defrag(unsigned int hooknum,
 				int (*okfn)(struct sk_buff *))
 {
 	struct sk_buff *reasm;
+	u16 zone;
 
 	/* Previously seen (loopback)?  */
 	if (skb->nfct)
 		return NF_ACCEPT;
 
-	reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(hooknum, skb));
+	zone = nf_ct_dev_zone(hooknum == NF_INET_PRE_ROUTING ? in : out);
+	reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(hooknum, zone, skb));
 	/* queued */
 	if (reasm == NULL)
 		return NF_STOLEN;
@@ -230,7 +233,7 @@ static unsigned int ipv6_defrag(unsigned int hooknum,
 	return NF_STOLEN;
 }
 
-static unsigned int __ipv6_conntrack_in(struct net *net,
+static unsigned int __ipv6_conntrack_in(struct net *net, u16 zone,
 					unsigned int hooknum,
 					struct sk_buff *skb,
 					int (*okfn)(struct sk_buff *))
@@ -243,7 +246,7 @@ static unsigned int __ipv6_conntrack_in(struct net *net,
 		if (!reasm->nfct) {
 			unsigned int ret;
 
-			ret = nf_conntrack_in(net, PF_INET6, hooknum, reasm);
+			ret = nf_conntrack_in(net, zone, PF_INET6, hooknum, reasm);
 			if (ret != NF_ACCEPT)
 				return ret;
 		}
@@ -253,7 +256,7 @@ static unsigned int __ipv6_conntrack_in(struct net *net,
 		return NF_ACCEPT;
 	}
 
-	return nf_conntrack_in(net, PF_INET6, hooknum, skb);
+	return nf_conntrack_in(net, zone, PF_INET6, hooknum, skb);
 }
 
 static unsigned int ipv6_conntrack_in(unsigned int hooknum,
@@ -262,7 +265,7 @@ static unsigned int ipv6_conntrack_in(unsigned int hooknum,
 				      const struct net_device *out,
 				      int (*okfn)(struct sk_buff *))
 {
-	return __ipv6_conntrack_in(dev_net(in), hooknum, skb, okfn);
+	return __ipv6_conntrack_in(dev_net(in), nf_ct_dev_zone(in), hooknum, skb, okfn);
 }
 
 static unsigned int ipv6_conntrack_local(unsigned int hooknum,
@@ -277,7 +280,7 @@ static unsigned int ipv6_conntrack_local(unsigned int hooknum,
 			printk("ipv6_conntrack_local: packet too short\n");
 		return NF_ACCEPT;
 	}
-	return __ipv6_conntrack_in(dev_net(out), hooknum, skb, okfn);
+	return __ipv6_conntrack_in(dev_net(out), nf_ct_dev_zone(out), hooknum, skb, okfn);
 }
 
 static struct nf_hook_ops ipv6_conntrack_ops[] __read_mostly = {
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index c7b8bd1..c423818 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -128,7 +128,7 @@ static bool icmpv6_new(struct nf_conn *ct, const struct sk_buff *skb,
 }
 
 static int
-icmpv6_error_message(struct net *net,
+icmpv6_error_message(struct net *net, u16 zone,
 		     struct sk_buff *skb,
 		     unsigned int icmp6off,
 		     enum ip_conntrack_info *ctinfo,
@@ -163,7 +163,7 @@ icmpv6_error_message(struct net *net,
 
 	*ctinfo = IP_CT_RELATED;
 
-	h = nf_conntrack_find_get(net, &intuple);
+	h = nf_conntrack_find_get(net, zone, &intuple);
 	if (!h) {
 		pr_debug("icmpv6_error: no match\n");
 		return -NF_ACCEPT;
@@ -179,7 +179,8 @@ icmpv6_error_message(struct net *net,
 }
 
 static int
-icmpv6_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
+icmpv6_error(struct net *net, u16 zone,
+	     struct sk_buff *skb, unsigned int dataoff,
 	     enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
 {
 	const struct icmp6hdr *icmp6h;
@@ -215,7 +216,7 @@ icmpv6_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
 	if (icmp6h->icmp6_type >= 128)
 		return NF_ACCEPT;
 
-	return icmpv6_error_message(net, skb, dataoff, ctinfo, hooknum);
+	return icmpv6_error_message(net, zone, skb, dataoff, ctinfo, hooknum);
 }
 
 #if defined(CONFIG_NF_CT_NETLINK) || defined(CONFIG_NF_CT_NETLINK_MODULE)
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 634d14a..15374ba 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -83,6 +83,15 @@ config NF_CONNTRACK_SECMARK
 
 	  If unsure, say 'N'.
 
+config NF_CONNTRACK_ZONES
+	bool "Connection tracking zones"
+	help
+	  This option enables support for connection tracking zones.
+	  Normally, each connection needs to have a unique identity.
+	  Connection tracking zones allow multiple connections to use
+	  the same identity, as long as they are contained in
+	  different zones.
+
 config NF_CONNTRACK_EVENTS
 	bool "Connection tracking events"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 0e98c32..90909e3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -41,6 +41,7 @@
 #include <net/netfilter/nf_conntrack_extend.h>
 #include <net/netfilter/nf_conntrack_acct.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/nf_nat.h>
 #include <net/netfilter/nf_nat_core.h>
 
@@ -69,7 +70,7 @@ static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
 
 static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
-				  unsigned int size, unsigned int rnd)
+				  u16 zone, unsigned int size, unsigned int rnd)
 {
 	unsigned int n;
 	u_int32_t h;
@@ -80,15 +81,16 @@ static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
 	 */
 	n = (sizeof(tuple->src) + sizeof(tuple->dst.u3)) / sizeof(u32);
 	h = jhash2((u32 *)tuple, n,
-		   rnd ^ (((__force __u16)tuple->dst.u.all << 16) |
+		   zone ^ rnd ^ (((__force __u16)tuple->dst.u.all << 16) |
 			  tuple->dst.protonum));
 
 	return ((u64)h * size) >> 32;
 }
 
-static inline u_int32_t hash_conntrack(const struct nf_conntrack_tuple *tuple)
+static inline u_int32_t hash_conntrack(const struct nf_conntrack_tuple *tuple,
+				       u16 zone)
 {
-	return __hash_conntrack(tuple, nf_conntrack_htable_size,
+	return __hash_conntrack(tuple, zone, nf_conntrack_htable_size,
 				nf_conntrack_hash_rnd);
 }
 
@@ -292,11 +294,12 @@ static void death_by_timeout(unsigned long ul_conntrack)
  * - Caller must lock nf_conntrack_lock before calling this function
  */
 struct nf_conntrack_tuple_hash *
-__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
+__nf_conntrack_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	unsigned int hash = hash_conntrack(tuple);
+	unsigned int hash = hash_conntrack(tuple, zone);
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
@@ -304,7 +307,8 @@ __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
-		if (nf_ct_tuple_equal(tuple, &h->tuple)) {
+		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
+		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -326,21 +330,23 @@ EXPORT_SYMBOL_GPL(__nf_conntrack_find);
 
 /* Find a connection corresponding to a tuple. */
 struct nf_conntrack_tuple_hash *
-nf_conntrack_find_get(struct net *net, const struct nf_conntrack_tuple *tuple)
+nf_conntrack_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 
 	rcu_read_lock();
 begin:
-	h = __nf_conntrack_find(net, tuple);
+	h = __nf_conntrack_find(net, zone, tuple);
 	if (h) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (unlikely(nf_ct_is_dying(ct) ||
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) {
+			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
+				     nf_ct_zone(ct) != zone)) {
 				nf_ct_put(ct);
 				goto begin;
 			}
@@ -367,9 +373,11 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct,
 void nf_conntrack_hash_insert(struct nf_conn *ct)
 {
 	unsigned int hash, repl_hash;
+	u16 zone;
 
-	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	zone = nf_ct_zone(ct);
+	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, zone);
+	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, zone);
 
 	__nf_conntrack_hash_insert(ct, hash, repl_hash);
 }
@@ -385,6 +393,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conn_help *help;
 	struct hlist_nulls_node *n;
 	enum ip_conntrack_info ctinfo;
+	u16 zone;
 	struct net *net;
 
 	ct = nf_ct_get(skb, &ctinfo);
@@ -397,8 +406,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	if (CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL)
 		return NF_ACCEPT;
 
-	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	zone = nf_ct_zone(ct);
+	hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, zone);
+	repl_hash = hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple, zone);
 
 	/* We're not in hash table, and we refuse to set up related
 	   connections for unconfirmed conns.  But packet copies and
@@ -417,11 +427,13 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	   not in the hash.  If there is, we lost race. */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
-				      &h->tuple))
+				      &h->tuple) &&
+		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				      &h->tuple))
+				      &h->tuple) &&
+		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
 
 	/* Remove from unconfirmed list */
@@ -468,15 +480,19 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	struct net *net = nf_ct_net(ignored_conntrack);
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
-	unsigned int hash = hash_conntrack(tuple);
+	struct nf_conn *ct;
+	u16 zone = nf_ct_zone(ignored_conntrack);
+	unsigned int hash = hash_conntrack(tuple, zone);
 
 	/* Disable BHs the entire time since we need to disable them at
 	 * least once for the stats anyway.
 	 */
 	rcu_read_lock_bh();
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
-		if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack &&
-		    nf_ct_tuple_equal(tuple, &h->tuple)) {
+		ct = nf_ct_tuplehash_to_ctrack(h);
+		if (ct != ignored_conntrack &&
+		    nf_ct_tuple_equal(tuple, &h->tuple) &&
+		    nf_ct_zone(ct) == zone) {
 			NF_CT_STAT_INC(net, found);
 			rcu_read_unlock_bh();
 			return 1;
@@ -539,7 +555,7 @@ static noinline int early_drop(struct net *net, unsigned int hash)
 	return dropped;
 }
 
-struct nf_conn *nf_conntrack_alloc(struct net *net,
+struct nf_conn *nf_conntrack_alloc(struct net *net, u16 zone,
 				   const struct nf_conntrack_tuple *orig,
 				   const struct nf_conntrack_tuple *repl,
 				   gfp_t gfp)
@@ -557,7 +573,7 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 
 	if (nf_conntrack_max &&
 	    unlikely(atomic_read(&net->ct.count) > nf_conntrack_max)) {
-		unsigned int hash = hash_conntrack(orig);
+		unsigned int hash = hash_conntrack(orig, zone);
 		if (!early_drop(net, hash)) {
 			atomic_dec(&net->ct.count);
 			if (net_ratelimit())
@@ -578,6 +594,7 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 		atomic_dec(&net->ct.count);
 		return ERR_PTR(-ENOMEM);
 	}
+
 	/*
 	 * Let ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next
 	 * and ct->tuplehash[IP_CT_DIR_REPLY].hnnode.next unchanged.
@@ -594,6 +611,16 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 #ifdef CONFIG_NET_NS
 	ct->ct_net = net;
 #endif
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	if (zone) {
+		struct nf_conntrack_zone *nf_ct_zone;
+
+		nf_ct_zone = nf_ct_ext_add(ct, NF_CT_EXT_ZONE, GFP_ATOMIC);
+		if (!nf_ct_zone)
+			goto out_free;
+		nf_ct_zone->id = zone;
+	}
+#endif
 
 	/*
 	 * changes to lookup keys must be done before setting refcnt to 1
@@ -601,6 +628,12 @@ struct nf_conn *nf_conntrack_alloc(struct net *net,
 	smp_wmb();
 	atomic_set(&ct->ct_general.use, 1);
 	return ct;
+
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+out_free:
+	kmem_cache_free(nf_conntrack_cachep, ct);
+	return ERR_PTR(-ENOMEM);
+#endif
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 
@@ -618,7 +651,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_free);
 /* Allocate a new conntrack: we return -ENOMEM if classification
    failed due to stress.  Otherwise it really is unclassifiable. */
 static struct nf_conntrack_tuple_hash *
-init_conntrack(struct net *net,
+init_conntrack(struct net *net, u16 zone,
 	       const struct nf_conntrack_tuple *tuple,
 	       struct nf_conntrack_l3proto *l3proto,
 	       struct nf_conntrack_l4proto *l4proto,
@@ -635,7 +668,7 @@ init_conntrack(struct net *net,
 		return NULL;
 	}
 
-	ct = nf_conntrack_alloc(net, tuple, &repl_tuple, GFP_ATOMIC);
+	ct = nf_conntrack_alloc(net, zone, tuple, &repl_tuple, GFP_ATOMIC);
 	if (IS_ERR(ct)) {
 		pr_debug("Can't allocate conntrack.\n");
 		return (struct nf_conntrack_tuple_hash *)ct;
@@ -651,7 +684,7 @@ init_conntrack(struct net *net,
 	nf_ct_ecache_ext_add(ct, GFP_ATOMIC);
 
 	spin_lock_bh(&nf_conntrack_lock);
-	exp = nf_ct_find_expectation(net, tuple);
+	exp = nf_ct_find_expectation(net, zone, tuple);
 	if (exp) {
 		pr_debug("conntrack: expectation arrives ct=%p exp=%p\n",
 			 ct, exp);
@@ -694,7 +727,7 @@ init_conntrack(struct net *net,
 
 /* On success, returns conntrack ptr, sets skb->nfct and ctinfo */
 static inline struct nf_conn *
-resolve_normal_ct(struct net *net,
+resolve_normal_ct(struct net *net, u16 zone,
 		  struct sk_buff *skb,
 		  unsigned int dataoff,
 		  u_int16_t l3num,
@@ -716,9 +749,10 @@ resolve_normal_ct(struct net *net,
 	}
 
 	/* look for tuple match */
-	h = nf_conntrack_find_get(net, &tuple);
+	h = nf_conntrack_find_get(net, zone, &tuple);
 	if (!h) {
-		h = init_conntrack(net, &tuple, l3proto, l4proto, skb, dataoff);
+		h = init_conntrack(net, zone, &tuple, l3proto, l4proto,
+				   skb, dataoff);
 		if (!h)
 			return NULL;
 		if (IS_ERR(h))
@@ -752,7 +786,7 @@ resolve_normal_ct(struct net *net,
 }
 
 unsigned int
-nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
+nf_conntrack_in(struct net *net, u16 zone, u_int8_t pf, unsigned int hooknum,
 		struct sk_buff *skb)
 {
 	struct nf_conn *ct;
@@ -787,7 +821,8 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 	 * inverse of the return code tells to the netfilter
 	 * core what to do with the packet. */
 	if (l4proto->error != NULL) {
-		ret = l4proto->error(net, skb, dataoff, &ctinfo, pf, hooknum);
+		ret = l4proto->error(net, zone, skb, dataoff, &ctinfo,
+				     pf, hooknum);
 		if (ret <= 0) {
 			NF_CT_STAT_INC_ATOMIC(net, error);
 			NF_CT_STAT_INC_ATOMIC(net, invalid);
@@ -795,7 +830,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		}
 	}
 
-	ct = resolve_normal_ct(net, skb, dataoff, pf, protonum,
+	ct = resolve_normal_ct(net, zone, skb, dataoff, pf, protonum,
 			       l3proto, l4proto, &set_reply, &ctinfo);
 	if (!ct) {
 		/* Not valid part of a connection */
@@ -938,6 +973,12 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 }
 EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
 
+static struct nf_ct_ext_type nf_ct_zone_extend __read_mostly = {
+	.len	= sizeof(struct nf_conntrack_zone),
+	.align	= __alignof__(struct nf_conntrack_zone),
+	.id	= NF_CT_EXT_ZONE,
+};
+
 #if defined(CONFIG_NF_CT_NETLINK) || defined(CONFIG_NF_CT_NETLINK_MODULE)
 
 #include <linux/netfilter/nfnetlink.h>
@@ -1115,6 +1156,7 @@ static void nf_conntrack_cleanup_init_net(void)
 {
 	nf_conntrack_helper_fini();
 	nf_conntrack_proto_fini();
+	nf_ct_extend_unregister(&nf_ct_zone_extend);
 	kmem_cache_destroy(nf_conntrack_cachep);
 }
 
@@ -1193,6 +1235,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	int rnd;
 	struct hlist_nulls_head *hash, *old_hash;
 	struct nf_conntrack_tuple_hash *h;
+	struct nf_conn *ct;
 
 	/* On boot, we can set this without any fancy locking. */
 	if (!nf_conntrack_htable_size)
@@ -1220,8 +1263,10 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 		while (!hlist_nulls_empty(&init_net.ct.hash[i])) {
 			h = hlist_nulls_entry(init_net.ct.hash[i].first,
 					struct nf_conntrack_tuple_hash, hnnode);
+			ct = nf_ct_tuplehash_to_ctrack(h);
 			hlist_nulls_del_rcu(&h->hnnode);
-			bucket = __hash_conntrack(&h->tuple, hashsize, rnd);
+			bucket = __hash_conntrack(&h->tuple, nf_ct_zone(ct),
+						  hashsize, rnd);
 			hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]);
 		}
 	}
@@ -1288,8 +1333,17 @@ static int nf_conntrack_init_init_net(void)
 	if (ret < 0)
 		goto err_helper;
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	ret = nf_ct_extend_register(&nf_ct_zone_extend);
+	if (ret < 0)
+		goto err_extend;
+#endif
 	return 0;
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+err_extend:
+	nf_conntrack_helper_fini();
+#endif
 err_helper:
 	nf_conntrack_proto_fini();
 err_proto:
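
Note on the nf_conntrack_core.c changes above: the zone takes part in a
lookup twice. It is mixed into the bucket hash, and it is compared again on
every candidate entry, because two zones can still land in the same bucket.
The following is a minimal userspace sketch of that idea only. Every name in
it (demo_tuple, demo_conn, demo_hash, demo_find, demo_insert) is hypothetical,
the hash is a toy FNV-style mix rather than jhash2(), and there is no locking
or RCU, so it should not be read as the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a conntrack tuple. */
struct demo_tuple {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t  proto;
};

/* Hypothetical stand-in for a tracked connection carrying a zone id. */
struct demo_conn {
	struct demo_tuple tuple;
	uint16_t zone;
	struct demo_conn *next;
};

#define DEMO_BUCKETS 16u

static struct demo_conn *demo_table[DEMO_BUCKETS];

/* Toy mixing step (FNV-style); the patch itself uses jhash2(). */
static uint32_t demo_mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 16777619u;
	return h;
}

/* The zone seeds the hash, so equal tuples in different zones tend to
 * spread over different buckets. */
static unsigned int demo_hash(const struct demo_tuple *t, uint16_t zone)
{
	uint32_t h = 2166136261u ^ zone;

	h = demo_mix(h, t->src_ip);
	h = demo_mix(h, t->dst_ip);
	h = demo_mix(h, ((uint32_t)t->src_port << 16) | t->dst_port);
	h = demo_mix(h, t->proto);
	return h % DEMO_BUCKETS;
}

static int demo_tuple_equal(const struct demo_tuple *a,
			    const struct demo_tuple *b)
{
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port &&
	       a->proto == b->proto;
}

/* A match needs both an equal tuple and an equal zone; the hash alone
 * cannot guarantee the zone, since buckets are shared. */
static struct demo_conn *demo_find(const struct demo_tuple *t, uint16_t zone)
{
	struct demo_conn *c;

	for (c = demo_table[demo_hash(t, zone)]; c != NULL; c = c->next)
		if (demo_tuple_equal(&c->tuple, t) && c->zone == zone)
			return c;
	return NULL;
}

/* Returns 0 on success, -1 if the same tuple already exists in the
 * same zone. */
static int demo_insert(struct demo_conn *c)
{
	unsigned int b;

	if (demo_find(&c->tuple, c->zone) != NULL)
		return -1;
	b = demo_hash(&c->tuple, c->zone);
	c->next = demo_table[b];
	demo_table[b] = c;
	return 0;
}
```

In this sketch, two demo_conn entries with identical tuples can both be
inserted as long as their zone fields differ, while a second insert with the
same tuple and the same zone is rejected.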
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index fdf5d2a..5fd0347 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -27,6 +27,7 @@
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_tuple.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 
 unsigned int nf_ct_expect_hsize __read_mostly;
 EXPORT_SYMBOL_GPL(nf_ct_expect_hsize);
@@ -84,7 +85,8 @@ static unsigned int nf_ct_expect_dst_hash(const struct nf_conntrack_tuple *tuple
 }
 
 struct nf_conntrack_expect *
-__nf_ct_expect_find(struct net *net, const struct nf_conntrack_tuple *tuple)
+__nf_ct_expect_find(struct net *net, u16 zone,
+		    const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_expect *i;
 	struct hlist_node *n;
@@ -104,12 +106,13 @@ EXPORT_SYMBOL_GPL(__nf_ct_expect_find);
 
 /* Just find a expectation corresponding to a tuple. */
 struct nf_conntrack_expect *
-nf_ct_expect_find_get(struct net *net, const struct nf_conntrack_tuple *tuple)
+nf_ct_expect_find_get(struct net *net, u16 zone,
+		      const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_expect *i;
 
 	rcu_read_lock();
-	i = __nf_ct_expect_find(net, tuple);
+	i = __nf_ct_expect_find(net, zone, tuple);
 	if (i && !atomic_inc_not_zero(&i->use))
 		i = NULL;
 	rcu_read_unlock();
@@ -121,7 +124,8 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_find_get);
 /* If an expectation for this connection is found, it gets delete from
  * global list then returned. */
 struct nf_conntrack_expect *
-nf_ct_find_expectation(struct net *net, const struct nf_conntrack_tuple *tuple)
+nf_ct_find_expectation(struct net *net, u16 zone,
+		       const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conntrack_expect *i, *exp = NULL;
 	struct hlist_node *n;
@@ -133,7 +137,8 @@ nf_ct_find_expectation(struct net *net, const struct nf_conntrack_tuple *tuple)
 	h = nf_ct_expect_dst_hash(tuple);
 	hlist_for_each_entry(i, n, &net->ct.expect_hash[h], hnode) {
 		if (!(i->flags & NF_CT_EXPECT_INACTIVE) &&
-		    nf_ct_tuple_mask_cmp(tuple, &i->tuple, &i->mask)) {
+		    nf_ct_tuple_mask_cmp(tuple, &i->tuple, &i->mask) &&
+		    nf_ct_zone(i->master) == zone) {
 			exp = i;
 			break;
 		}
@@ -204,7 +209,8 @@ static inline int expect_matches(const struct nf_conntrack_expect *a,
 {
 	return a->master == b->master && a->class == b->class &&
 		nf_ct_tuple_equal(&a->tuple, &b->tuple) &&
-		nf_ct_tuple_mask_equal(&a->mask, &b->mask);
+		nf_ct_tuple_mask_equal(&a->mask, &b->mask) &&
+		nf_ct_zone(a->master) == nf_ct_zone(b->master);
 }
 
 /* Generally a bad idea to call this: could have matched already. */
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 6636949..a1c8dd9 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -29,6 +29,7 @@
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_h323.h>
 
 /* Parameters */
@@ -1216,7 +1217,7 @@ static struct nf_conntrack_expect *find_expect(struct nf_conn *ct,
 	tuple.dst.u.tcp.port = port;
 	tuple.dst.protonum = IPPROTO_TCP;
 
-	exp = __nf_ct_expect_find(net, &tuple);
+	exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
 	if (exp && exp->master == ct)
 		return exp;
 	return NULL;
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 59d8064..2a9c4c3 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -790,7 +790,7 @@ ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	h = nf_conntrack_find_get(&init_net, &tuple);
+	h = nf_conntrack_find_get(&init_net, 0, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -850,7 +850,7 @@ ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	h = nf_conntrack_find_get(&init_net, &tuple);
+	h = nf_conntrack_find_get(&init_net, 0, &tuple);
 	if (!h)
 		return -ENOENT;
 
@@ -1184,7 +1184,7 @@ ctnetlink_create_conntrack(const struct nlattr * const cda[],
 	int err = -EINVAL;
 	struct nf_conntrack_helper *helper;
 
-	ct = nf_conntrack_alloc(&init_net, otuple, rtuple, GFP_ATOMIC);
+	ct = nf_conntrack_alloc(&init_net, 0, otuple, rtuple, GFP_ATOMIC);
 	if (IS_ERR(ct))
 		return ERR_PTR(-ENOMEM);
 
@@ -1285,7 +1285,7 @@ ctnetlink_create_conntrack(const struct nlattr * const cda[],
 		if (err < 0)
 			goto err2;
 
-		master_h = nf_conntrack_find_get(&init_net, &master);
+		master_h = nf_conntrack_find_get(&init_net, 0, &master);
 		if (master_h == NULL) {
 			err = -ENOENT;
 			goto err2;
@@ -1333,9 +1333,9 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb,
 
 	spin_lock_bh(&nf_conntrack_lock);
 	if (cda[CTA_TUPLE_ORIG])
-		h = __nf_conntrack_find(&init_net, &otuple);
+		h = __nf_conntrack_find(&init_net, 0, &otuple);
 	else if (cda[CTA_TUPLE_REPLY])
-		h = __nf_conntrack_find(&init_net, &rtuple);
+		h = __nf_conntrack_find(&init_net, 0, &rtuple);
 
 	if (h == NULL) {
 		err = -ENOENT;
@@ -1660,7 +1660,7 @@ ctnetlink_get_expect(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	exp = nf_ct_expect_find_get(&init_net, &tuple);
+	exp = nf_ct_expect_find_get(&init_net, 0, &tuple);
 	if (!exp)
 		return -ENOENT;
 
@@ -1716,7 +1716,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb,
 			return err;
 
 		/* bump usage count to 2 */
-		exp = nf_ct_expect_find_get(&init_net, &tuple);
+		exp = nf_ct_expect_find_get(&init_net, 0, &tuple);
 		if (!exp)
 			return -ENOENT;
 
@@ -1805,7 +1805,7 @@ ctnetlink_create_expect(const struct nlattr * const cda[], u_int8_t u3,
 		return err;
 
 	/* Look for master conntrack of this expectation */
-	h = nf_conntrack_find_get(&init_net, &master_tuple);
+	h = nf_conntrack_find_get(&init_net, 0, &master_tuple);
 	if (!h)
 		return -ENOENT;
 	ct = nf_ct_tuplehash_to_ctrack(h);
@@ -1861,7 +1861,7 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb,
 		return err;
 
 	spin_lock_bh(&nf_conntrack_lock);
-	exp = __nf_ct_expect_find(&init_net, &tuple);
+	exp = __nf_ct_expect_find(&init_net, 0, &tuple);
 
 	if (!exp) {
 		spin_unlock_bh(&nf_conntrack_lock);
diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c
index 3807ac7..ffe2ae6 100644
--- a/net/netfilter/nf_conntrack_pptp.c
+++ b/net/netfilter/nf_conntrack_pptp.c
@@ -28,6 +28,7 @@
 #include <net/netfilter/nf_conntrack.h>
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_proto_gre.h>
 #include <linux/netfilter/nf_conntrack_pptp.h>
 
@@ -123,7 +124,7 @@ static void pptp_expectfn(struct nf_conn *ct,
 		pr_debug("trying to unexpect other dir: ");
 		nf_ct_dump_tuple(&inv_t);
 
-		exp_other = nf_ct_expect_find_get(net, &inv_t);
+		exp_other = nf_ct_expect_find_get(net, nf_ct_zone(ct), &inv_t);
 		if (exp_other) {
 			/* delete other expectation.  */
 			pr_debug("found\n");
@@ -136,7 +137,7 @@ static void pptp_expectfn(struct nf_conn *ct,
 	rcu_read_unlock();
 }
 
-static int destroy_sibling_or_exp(struct net *net,
+static int destroy_sibling_or_exp(struct net *net, u16 zone,
 				  const struct nf_conntrack_tuple *t)
 {
 	const struct nf_conntrack_tuple_hash *h;
@@ -146,7 +147,7 @@ static int destroy_sibling_or_exp(struct net *net,
 	pr_debug("trying to timeout ct or exp for tuple ");
 	nf_ct_dump_tuple(t);
 
-	h = nf_conntrack_find_get(net, t);
+	h = nf_conntrack_find_get(net, zone, t);
 	if (h)  {
 		sibling = nf_ct_tuplehash_to_ctrack(h);
 		pr_debug("setting timeout of conntrack %p to 0\n", sibling);
@@ -157,7 +158,7 @@ static int destroy_sibling_or_exp(struct net *net,
 		nf_ct_put(sibling);
 		return 1;
 	} else {
-		exp = nf_ct_expect_find_get(net, t);
+		exp = nf_ct_expect_find_get(net, zone, t);
 		if (exp) {
 			pr_debug("unexpect_related of expect %p\n", exp);
 			nf_ct_unexpect_related(exp);
@@ -182,7 +183,7 @@ static void pptp_destroy_siblings(struct nf_conn *ct)
 	t.dst.protonum = IPPROTO_GRE;
 	t.src.u.gre.key = help->help.ct_pptp_info.pns_call_id;
 	t.dst.u.gre.key = help->help.ct_pptp_info.pac_call_id;
-	if (!destroy_sibling_or_exp(net, &t))
+	if (!destroy_sibling_or_exp(net, nf_ct_zone(ct), &t))
 		pr_debug("failed to timeout original pns->pac ct/exp\n");
 
 	/* try reply (pac->pns) tuple */
@@ -190,7 +191,7 @@ static void pptp_destroy_siblings(struct nf_conn *ct)
 	t.dst.protonum = IPPROTO_GRE;
 	t.src.u.gre.key = help->help.ct_pptp_info.pac_call_id;
 	t.dst.u.gre.key = help->help.ct_pptp_info.pns_call_id;
-	if (!destroy_sibling_or_exp(net, &t))
+	if (!destroy_sibling_or_exp(net, nf_ct_zone(ct), &t))
 		pr_debug("failed to timeout reply pac->pns ct/exp\n");
 }
 
diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index dd37550..d1c1848 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -561,7 +561,7 @@ static int dccp_packet(struct nf_conn *ct, const struct sk_buff *skb,
 	return NF_ACCEPT;
 }
 
-static int dccp_error(struct net *net, struct sk_buff *skb,
+static int dccp_error(struct net *net, u16 zone, struct sk_buff *skb,
 		      unsigned int dataoff, enum ip_conntrack_info *ctinfo,
 		      u_int8_t pf, unsigned int hooknum)
 {
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 3c96437..2bfe5bf 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -760,7 +760,7 @@ static const u8 tcp_valid_flags[(TH_FIN|TH_SYN|TH_RST|TH_ACK|TH_URG) + 1] =
 };
 
 /* Protect conntrack agaist broken packets. Code taken from ipt_unclean.c.  */
-static int tcp_error(struct net *net,
+static int tcp_error(struct net *net, u16 zone,
 		     struct sk_buff *skb,
 		     unsigned int dataoff,
 		     enum ip_conntrack_info *ctinfo,
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index 5c5518b..aee7515 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -91,8 +91,8 @@ static bool udp_new(struct nf_conn *ct, const struct sk_buff *skb,
 	return true;
 }
 
-static int udp_error(struct net *net, struct sk_buff *skb, unsigned int dataoff,
-		     enum ip_conntrack_info *ctinfo,
+static int udp_error(struct net *net, u16 zone, struct sk_buff *skb,
+		     unsigned int dataoff, enum ip_conntrack_info *ctinfo,
 		     u_int8_t pf,
 		     unsigned int hooknum)
 {
diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c
index 458655b..cc94a67 100644
--- a/net/netfilter/nf_conntrack_proto_udplite.c
+++ b/net/netfilter/nf_conntrack_proto_udplite.c
@@ -89,7 +89,7 @@ static bool udplite_new(struct nf_conn *ct, const struct sk_buff *skb,
 	return true;
 }
 
-static int udplite_error(struct net *net,
+static int udplite_error(struct net *net, u16 zone,
 			 struct sk_buff *skb,
 			 unsigned int dataoff,
 			 enum ip_conntrack_info *ctinfo,
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 4b57216..3b5efc9 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -22,6 +22,7 @@
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 #include <linux/netfilter/nf_conntrack_sip.h>
 
 MODULE_LICENSE("GPL");
@@ -777,7 +778,7 @@ static int set_expected_rtp_rtcp(struct sk_buff *skb,
 
 	rcu_read_lock();
 	do {
-		exp = __nf_ct_expect_find(net, &tuple);
+		exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
 
 		if (!exp || exp->master == ct ||
 		    nfct_help(exp->master)->helper != nfct_help(ct)->helper ||
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 028aba6..69da6ef 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -26,6 +26,7 @@
 #include <net/netfilter/nf_conntrack_expect.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_acct.h>
+#include <net/netfilter/nf_conntrack_zones.h>
 
 MODULE_LICENSE("GPL");
 
@@ -171,6 +172,11 @@ static int ct_seq_show(struct seq_file *s, void *v)
 		goto release;
 #endif
 
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	if (seq_printf(s, "zone=%u ", nf_ct_zone(ct)))
+		goto release;
+#endif
+
 	if (seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use)))
 		goto release;
 
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 8103bef..a637ee6 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -113,7 +113,7 @@ static int count_them(struct xt_connlimit_data *data,
 
 	/* check the saved connections */
 	list_for_each_entry_safe(conn, tmp, hash, list) {
-		found    = nf_conntrack_find_get(&init_net, &conn->tuple);
+		found    = nf_conntrack_find_get(&init_net, 0, &conn->tuple);
 		found_ct = NULL;
 
 		if (found != NULL)

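The core idea of the hunks above, that a lookup must match on the zone as well as the tuple and that the zone is mixed into the hash, can be sketched outside the kernel. This is only an illustrative toy under stated assumptions, not the patch's implementation: `toy_tuple`, `toy_entry`, `toy_hash`, `toy_insert` and `toy_find` are hypothetical names, the hash is an arbitrary multiplicative mix rather than the jhash-based `__hash_conntrack()`, and a fixed open-addressed array stands in for the hlist_nulls buckets.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a connection tuple (hypothetical, much simpler than
 * struct nf_conntrack_tuple). */
struct toy_tuple {
	uint32_t src, dst;
	uint16_t sport, dport;
	uint8_t  proto;
};

/* One table slot: the tuple plus the zone it belongs to. */
struct toy_entry {
	struct toy_tuple tuple;
	uint16_t zone;
	int in_use;
};

#define TOY_BUCKETS 64

static struct toy_entry toy_table[TOY_BUCKETS];

/* Mix the zone into the hash so that equal tuples from different zones
 * tend to land in different buckets. */
unsigned int toy_hash(const struct toy_tuple *t, uint16_t zone)
{
	uint32_t h = t->src * 2654435761u;

	h ^= t->dst * 2246822519u;
	h ^= ((uint32_t)t->sport << 16) | t->dport;
	h ^= t->proto;
	h ^= (uint32_t)zone * 3266489917u;
	return h % TOY_BUCKETS;
}

int toy_tuple_equal(const struct toy_tuple *a, const struct toy_tuple *b)
{
	return a->src == b->src && a->dst == b->dst &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->proto == b->proto;
}

/* Insert with linear probing. Returns 0 on success, -1 if the same
 * (tuple, zone) pair already exists or the table is full. */
int toy_insert(const struct toy_tuple *t, uint16_t zone)
{
	unsigned int i, b = toy_hash(t, zone);

	for (i = 0; i < TOY_BUCKETS; i++) {
		struct toy_entry *e = &toy_table[(b + i) % TOY_BUCKETS];

		if (!e->in_use) {
			e->tuple = *t;
			e->zone = zone;
			e->in_use = 1;
			return 0;
		}
		/* A clash requires BOTH the tuple and the zone to match. */
		if (e->zone == zone && toy_tuple_equal(&e->tuple, t))
			return -1;
	}
	return -1;
}

/* Look up a tuple within one zone; entries from other zones are skipped
 * even when their tuples are identical. */
const struct toy_entry *toy_find(const struct toy_tuple *t, uint16_t zone)
{
	unsigned int i, b = toy_hash(t, zone);

	for (i = 0; i < TOY_BUCKETS; i++) {
		const struct toy_entry *e = &toy_table[(b + i) % TOY_BUCKETS];

		if (!e->in_use)
			return NULL;	/* entries are never deleted, so stop here */
		if (e->zone == zone && toy_tuple_equal(&e->tuple, t))
			return e;
	}
	return NULL;
}
```

With this sketch, inserting the same tuple once in zone 1 and once in zone 2 yields two independent entries, while a lookup in zone 0 finds neither. That mirrors why the ctnetlink and xt_connlimit callers above, which pass a literal 0, only ever see connections in the default zone.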
end of thread, other threads:[~2010-05-27 15:42 UTC | newest]

Thread overview: 184+ messages
2010-01-14 14:05 RFC: netfilter: nf_conntrack: add support for "conntrack zones" Patrick McHardy
2010-01-14 15:05 ` jamal
2010-01-14 15:37   ` Patrick McHardy
2010-01-14 17:33     ` jamal
2010-01-15 10:15       ` Patrick McHardy
2010-01-15 15:19         ` jamal
2010-02-22 20:46           ` Eric W. Biederman
     [not found]             ` <m13a0tf17t.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-22 21:55               ` jamal
2010-02-22 21:55             ` jamal
2010-02-22 23:17               ` Eric W. Biederman
     [not found]                 ` <m1wry46es9.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-23 13:27                   ` jamal
2010-02-23 14:07                     ` Eric W. Biederman
2010-02-23 14:20                       ` jamal
2010-02-23 20:00                         ` Eric W. Biederman
2010-02-23 23:09                           ` jamal
2010-02-24  1:43                             ` Eric W. Biederman
2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
2010-02-25 21:31                               ` Daniel Lezcano
2010-02-25 21:49                                 ` Eric W. Biederman
     [not found]                                   ` <m1mxyx0yv7.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-25 22:13                                     ` Daniel Lezcano
2010-02-26 20:35                                     ` Eric W. Biederman
2010-02-25 22:13                                   ` Daniel Lezcano
2010-02-25 22:31                                     ` Eric W. Biederman
     [not found]                                     ` <4B86F5EC.60902-GANU6spQydw@public.gmane.org>
2010-02-25 22:31                                       ` Eric W. Biederman
2010-02-26 20:35                                   ` Eric W. Biederman
     [not found]                                 ` <4B86EC45.3060005-GANU6spQydw@public.gmane.org>
2010-02-25 21:49                                   ` Eric W. Biederman
     [not found]                               ` <m1pr3t2fvl.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-25 21:31                                 ` Daniel Lezcano
2010-02-25 21:46                                 ` Matt Helsley
2010-02-26  1:09                                 ` Matt Helsley
2010-02-26  3:15                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
2010-02-26 21:13                                 ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Pavel Emelyanov
     [not found]                                   ` <4B883987.6090408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-26 21:24                                     ` Eric W. Biederman
2010-02-26 21:24                                   ` Eric W. Biederman
2010-02-26 21:34                                     ` Pavel Emelyanov
     [not found]                                       ` <4B883E6F.1060907-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-26 21:42                                         ` Eric W. Biederman
2010-02-26 21:42                                       ` Eric W. Biederman
2010-02-26 21:58                                         ` Oren Laadan
2010-02-26 22:16                                           ` Eric W. Biederman
2010-02-26 22:52                                             ` Oren Laadan
2010-02-26 23:13                                               ` Eric W. Biederman
     [not found]                                               ` <4B885093.4070807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-02-26 23:13                                                 ` Eric W. Biederman
     [not found]                                             ` <m1zl2vtzg4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-26 22:52                                               ` Oren Laadan
     [not found]                                           ` <4B8843FE.4000404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-02-26 22:16                                             ` Eric W. Biederman
2010-02-27  8:30                                         ` Pavel Emelyanov
     [not found]                                           ` <4B88D80A.8010701-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-27  9:04                                             ` Eric W. Biederman
2010-02-27  9:04                                           ` Eric W. Biederman
     [not found]                                             ` <m1mxyvrqvk.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-27  9:21                                               ` Pavel Emelyanov
2010-02-27  9:42                                                 ` Eric W. Biederman
2010-02-27 16:16                                                   ` Pavel Emelyanov
2010-02-27 19:08                                                     ` Eric W. Biederman
     [not found]                                                       ` <m1iq9io5sc.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-27 19:29                                                         ` Pavel Emelyanov
2010-02-27 19:29                                                       ` Pavel Emelyanov
2010-02-27 19:44                                                         ` Eric W. Biederman
2010-02-28 22:05                                                           ` Daniel Lezcano
2010-03-01 19:24                                                             ` Eric W. Biederman
2010-03-01 21:42                                                             ` Eric W. Biederman
2010-03-02 13:10                                                               ` Cedric Le Goater
     [not found]                                                               ` <m1ljebwwgd.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-02 13:10                                                                 ` Cedric Le Goater
     [not found]                                                             ` <4B8AE8C1.1030305-GANU6spQydw@public.gmane.org>
2010-03-01 19:24                                                               ` Eric W. Biederman
2010-03-01 21:42                                                               ` Eric W. Biederman
2010-03-02 15:03                                                               ` Pavel Emelyanov
2010-03-03 20:59                                                               ` Oren Laadan
2010-03-02 15:03                                                             ` Pavel Emelyanov
2010-03-02 15:14                                                               ` Jan Engelhardt
     [not found]                                                                 ` <alpine.LSU.2.01.1003021613570.17303-SHaQjdQMGhDmsUXKMKRlFA@public.gmane.org>
2010-03-02 21:45                                                                   ` Eric W. Biederman
2010-03-02 21:45                                                                 ` Eric W. Biederman
     [not found]                                                               ` <4B8D28CF.8060304-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-03-02 15:14                                                                 ` Jan Engelhardt
2010-03-02 21:19                                                                 ` Sukadev Bhattiprolu
2010-03-02 21:19                                                               ` Sukadev Bhattiprolu
2010-03-02 22:13                                                                 ` Eric W. Biederman
2010-03-03  0:07                                                                   ` Sukadev Bhattiprolu
     [not found]                                                                     ` <20100303000743.GA13744-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-03  0:46                                                                       ` Eric W. Biederman
2010-03-03  0:46                                                                     ` Eric W. Biederman
2010-03-03 15:38                                                                       ` Serge E. Hallyn
2010-03-03 19:47                                                                         ` Eric W. Biederman
2010-03-04 21:45                                                                           ` Eric W. Biederman
2010-03-04 22:55                                                                             ` Jan Engelhardt
     [not found]                                                                             ` <m1pr3j92x8.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-04 22:55                                                                               ` Jan Engelhardt
     [not found]                                                                           ` <m13a0hmblr.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-04 21:45                                                                             ` Eric W. Biederman
     [not found]                                                                         ` <20100303153800.GA937-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-03 19:47                                                                           ` Eric W. Biederman
2010-03-03 16:50                                                                       ` Pavel Emelyanov
     [not found]                                                                         ` <4B8E9370.3050300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-03-03 20:16                                                                           ` Eric W. Biederman
2010-03-03 20:16                                                                         ` Eric W. Biederman
2010-03-05 19:18                                                                           ` Pavel Emelyanov
     [not found]                                                                             ` <4B9158F5.5040205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-03-05 20:26                                                                               ` Eric W. Biederman
2010-03-05 20:26                                                                             ` Eric W. Biederman
2010-03-06 14:47                                                                               ` Daniel Lezcano
     [not found]                                                                                 ` <4B926B1B.5070207-GANU6spQydw@public.gmane.org>
2010-03-06 20:48                                                                                   ` Eric W. Biederman
     [not found]                                                                                     ` <m1aaulyy5c.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-06 21:26                                                                                       ` Daniel Lezcano
     [not found]                                                                                         ` <4B92C886.9020507-GANU6spQydw@public.gmane.org>
2010-03-08  8:32                                                                                           ` Eric W. Biederman
2010-03-08  8:32                                                                                         ` Eric W. Biederman
2010-03-08 16:54                                                                                           ` Daniel Lezcano
     [not found]                                                                                             ` <4B952BBE.6070507-GANU6spQydw@public.gmane.org>
2010-03-08 17:29                                                                                               ` Eric W. Biederman
2010-03-08 17:29                                                                                             ` Eric W. Biederman
2010-03-08 19:57                                                                                               ` Daniel Lezcano
2010-03-08 20:24                                                                                                 ` Eric W. Biederman
2010-03-08 20:42                                                                                                   ` Daniel Lezcano
     [not found]                                                                                                     ` <4B95611C.5060403-GANU6spQydw@public.gmane.org>
2010-03-08 20:47                                                                                                       ` Eric W. Biederman
2010-03-08 20:47                                                                                                     ` Eric W. Biederman
2010-03-08 21:12                                                                                                       ` Daniel Lezcano
2010-03-08 21:25                                                                                                         ` Eric W. Biederman
2010-03-08 21:49                                                                                                           ` Serge E. Hallyn
2010-03-08 22:24                                                                                                             ` Eric W. Biederman
     [not found]                                                                                                             ` <20100308214945.GA26617-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-08 22:24                                                                                                               ` Eric W. Biederman
2010-03-09 10:03                                                                                                           ` Daniel Lezcano
     [not found]                                                                                                             ` <4B961D09.4010802-GANU6spQydw@public.gmane.org>
2010-03-09 10:13                                                                                                               ` Eric W. Biederman
2010-03-09 10:13                                                                                                             ` Eric W. Biederman
2010-03-09 10:26                                                                                                               ` Daniel Lezcano
     [not found]                                                                                                               ` <m1ocixn6q3.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-09 10:26                                                                                                                 ` Daniel Lezcano
     [not found]                                                                                                           ` <m1lje2qzf4.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-08 21:49                                                                                                             ` Serge E. Hallyn
2010-03-09 10:03                                                                                                             ` Daniel Lezcano
2010-03-10 21:16                                                                                                             ` Daniel Lezcano
2010-03-10 21:16                                                                                                           ` Daniel Lezcano
     [not found]                                                                                                         ` <4B956852.7050804-GANU6spQydw@public.gmane.org>
2010-03-08 21:25                                                                                                           ` Eric W. Biederman
     [not found]                                                                                                       ` <m1sk8ar15b.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-08 21:12                                                                                                         ` Daniel Lezcano
     [not found]                                                                                                   ` <m11vfusgsa.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-08 20:42                                                                                                     ` Daniel Lezcano
     [not found]                                                                                                 ` <4B9556A9.60206-GANU6spQydw@public.gmane.org>
2010-03-08 20:24                                                                                                   ` Eric W. Biederman
     [not found]                                                                                               ` <m11vfuvi1t.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-08 19:57                                                                                                 ` Daniel Lezcano
     [not found]                                                                                           ` <m1fx4bxlfy.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-08 16:54                                                                                             ` Daniel Lezcano
2010-03-08 17:07                                                                                             ` Serge E. Hallyn
2010-03-08 17:07                                                                                           ` Serge E. Hallyn
     [not found]                                                                                             ` <20100308170719.GD6399-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-08 17:35                                                                                               ` Eric W. Biederman
2010-03-08 17:35                                                                                             ` Eric W. Biederman
2010-03-08 17:47                                                                                               ` Serge E. Hallyn
     [not found]                                                                                               ` <m1pr3eu36u.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-08 17:47                                                                                                 ` Serge E. Hallyn
     [not found]                                                                           ` <m17hptjh3m.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-05 19:18                                                                             ` Pavel Emelyanov
     [not found]                                                                       ` <m1ocj6qljj.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-03 15:38                                                                         ` Serge E. Hallyn
2010-03-03 16:50                                                                         ` Pavel Emelyanov
     [not found]                                                                   ` <m1y6iaqsmm.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-03  0:07                                                                     ` Sukadev Bhattiprolu
     [not found]                                                                 ` <20100302211942.GA17816-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-02 22:13                                                                   ` Eric W. Biederman
2010-03-03 20:59                                                             ` Oren Laadan
2010-03-03 21:05                                                               ` Eric W. Biederman
     [not found]                                                                 ` <m18wa9glpo.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-05-27 12:06                                                                   ` [Devel] " Enrico Weigelt
     [not found]                                                               ` <4B8ECD99.3040107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-03 21:05                                                                 ` Eric W. Biederman
     [not found]                                                           ` <m1ljeempk6.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-28 22:05                                                             ` Daniel Lezcano
     [not found]                                                         ` <4B89727C.9040602-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-27 19:44                                                           ` Eric W. Biederman
     [not found]                                                     ` <4B894564.7080104-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-27 19:08                                                       ` Eric W. Biederman
     [not found]                                                   ` <m1bpfbqajn.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-27 16:16                                                     ` Pavel Emelyanov
     [not found]                                                 ` <4B88E431.6040609-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-27  9:42                                                   ` Eric W. Biederman
     [not found]                                         ` <m13a0nwu6p.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-26 21:58                                           ` Oren Laadan
2010-02-27  8:30                                           ` Pavel Emelyanov
     [not found]                                     ` <m1bpfbwuze.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-26 21:34                                       ` Pavel Emelyanov
2010-02-26 21:35                                       ` Pavel Emelyanov
2010-02-26 21:49                                         ` Eric W. Biederman
     [not found]                                         ` <4B883EAF.5020607-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-02-26 21:49                                           ` Eric W. Biederman
2010-05-27 12:28                                 ` [Devel] " Enrico Weigelt
     [not found]                                   ` <20100527122800.GC31480-q9I3ByPDOfiE+EvaaNYduQ@public.gmane.org>
2010-05-27 12:44                                     ` Daniel Lezcano
     [not found]                                       ` <4BFE6938.50607-GANU6spQydw@public.gmane.org>
2010-05-27 15:42                                         ` Enrico Weigelt
2010-02-25 21:46                               ` Matt Helsley
2010-02-25 21:54                                 ` Eric W. Biederman
     [not found]                                 ` <20100225214656.GS3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-02-25 21:54                                   ` Eric W. Biederman
2010-02-26  0:53                                   ` Eric W. Biederman
2010-02-26  0:53                                 ` Eric W. Biederman
2010-02-26  1:09                               ` Matt Helsley
     [not found]                                 ` <20100226010915.GA20106-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-02-26  1:26                                   ` Eric W. Biederman
2010-02-26  1:26                                 ` Eric W. Biederman
2010-02-26  3:15                               ` [RFC][PATCH] ns: Syscalls for better namespace sharing control. v2 Eric W. Biederman
     [not found]                                 ` <m18wagy9f3.fsf_-_-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-03-03 20:29                                   ` Jonathan Corbet
2010-03-03 20:29                                     ` Jonathan Corbet
2010-03-03 20:50                                     ` Eric W. Biederman
     [not found]                                     ` <20100303132931.11afb659-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org>
2010-03-03 20:50                                       ` Eric W. Biederman
2010-02-25 20:57                             ` [RFC][PATCH] ns: Syscalls for better namespace sharing control Eric W. Biederman
     [not found]                           ` <m1r5obbu2w.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-23 23:09                             ` RFC: netfilter: nf_conntrack: add support for "conntrack zones" jamal
2010-02-23 23:49                             ` Matt Helsley
2010-02-23 23:49                           ` Matt Helsley
2010-02-24  1:32                             ` Eric W. Biederman
2010-02-24  1:39                               ` Serge E. Hallyn
     [not found]                               ` <m18waj2zc8.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-24  1:39                                 ` Serge E. Hallyn
     [not found]                             ` <20100223234942.GO3604-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-02-24  1:32                               ` Eric W. Biederman
2010-02-23 20:00                         ` Eric W. Biederman
     [not found]                       ` <m1iq9ocafv.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2010-02-23 14:20                         ` jamal
2010-02-23 14:07                     ` Eric W. Biederman
     [not found]         ` <4B50403A.6010507-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
2010-01-15 15:19           ` jamal
     [not found]     ` <4B4F3A50.1050400-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
2010-01-14 17:33       ` jamal
2010-01-14 15:37   ` Patrick McHardy
2010-01-14 18:32   ` Ben Greear
2010-01-15 15:03     ` jamal
     [not found]     ` <4B4F6332.50606-my8/4N5VtI7c+919tysfdA@public.gmane.org>
2010-01-15 15:03       ` jamal
     [not found] ` <4B4F24AC.70105-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
2010-01-14 15:05   ` jamal
  -- strict thread matches above, loose matches on Subject: below --
2010-01-14 14:05 Patrick McHardy
