netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [net-next 0/16] Proposal for VRF-lite - v3
@ 2015-07-27 18:30 David Ahern
  2015-07-27 18:30 ` [PATCH net-next 01/16] net: Refactor rtable allocation and initialization David Ahern
                   ` (16 more replies)
  0 siblings, 17 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

In the context of internet scale routing a requirement that always comes
up is the need to partition the available routing tables into disjoint
routing planes. A specific use case is the multi-tenancy problem where
each tenant has their own unique routing tables and in the very least
need different default gateways.

This patch allows the ability to create virtual router domains (aka VRFs
(VRF-lite to be specific) in the linux packet forwarding stack. The main
observation is that through the use of rules and socket binding to interfaces,
all the facilities that we need are already present in the infrastructure. What
is missing is a handle that identifies a routing domain and can be used to
gather applicable rules/tables and uniqify neighbor selection. The scheme used
needs to preserves the notions of ECMP, and general routing principles.

This driver is a cross between functionality that the IPVLAN driver
and the Team drivers provide where a device is created and packets
into/out of the routing domain are shuttled through this device. The
device is then used as a handle to identify the applicable rules. The
VRF device is thus the layer3 equivalent of a vlan device.

The very important point to note is that this is only a Layer3 concept
so L2 tools (e.g., LLDP) do not need to be run in each VRF, processes can
run in unaware mode or select a VRF to be talking through. Also the
behavioral model is a generalized application of the familiar VRF-Lite
model with some performance paths that need optimization. (Specifically
the output route selector that Roopa, Robert, Thomas and EricB are
currently discussing on the MPLS thread)

High Level points
=================
1. Simple overlay driver (minimal changes to current stack)
   * uses the existing fib tables and fib rules infrastructure
2. Modelled closely after the ipvlan driver
3. Uses current API and infrastructure.
   * Applications can use SO_BINDTODEVICE or cmsg device indentifiers
     to pick VRF (ping, traceroute just work)
   * Standard IP Rules work, and since they are aggregated against the
     device, scale is manageable
4. Completely orthogonal to Namespaces and only provides separation in
   the routing plane (and ARP)

                                                 N2
           N1 (all configs here)          +---------------+
    +--------------+                      |               |
    |swp1 :10.0.1.1+----------------------+swp1 :10.0.1.2 |
    |              |                      |               |
    |swp2 :10.0.2.1+----------------------+swp2 :10.0.2.2 |
    |              |                      +---------------+
    | VRF 1        |
    | table 5      |
    |              |
    +---------------+
    |              |
    | VRF 2        |                             N3
    | table 6      |                      +---------------+
    |              |                      |               |
    |swp3 :10.0.2.1+----------------------+swp1 :10.0.2.2 |
    |              |                      |               |
    |swp4 :10.0.3.1+----------------------+swp2 :10.0.3.2 |
    +--------------+                      +---------------+


Given the topology above, the setup needed to get the basic VRF
functions working would be

Create the VRF devices and associate with a table
    ip link add vrf1 type vrf table 5
    ip link add vrf2 type vrf table 6

Install the lookup rules that map table to VRF domain
    ip rule add pref 200 oif vrf1 lookup 5
    ip rule add pref 200 iif vrf1 lookup 5
    ip rule add pref 200 oif vrf2 lookup 6
    ip rule add pref 200 iif vrf2 lookup 6

    ip link set vrf1 up
    ip link set vrf2 up

Enslave the routing member interfaces
    ip link set swp1 master vrf1
    ip link set swp2 master vrf1
    ip link set swp3 master vrf2
    ip link set swp4 master vrf2

Connected routes are automatically moved from main table to the VRF
table.

ping using VRF0 is simply
    ping -I vrf0 10.0.1.2

Or using the task context and a command such as the example chvrf in
patch 15 unmodified applications are run in a VRF context using:
   chvrf -v 1 ping 10.0.1.2


Design Highlights
=================
If a device is enslaved to a VRF device (ie., associated with a VRF)
then:
1. Rx path
   The master device index is used as the iif for all lookups.

2. Tx path
   Similarly, for Tx the VRF device oif is used in the flow to direct
   lookups to the table associated with the VRF via its rule. From there
   the FLOWI_FLAG_VRFSRC flag is used to indicate that the oif should
   not be used for FIB table lookups.

3. Connected and local routes
   On link up for a device, connected and local routes are added to the
   table associated with the VRF device, rather than the local and main
   tables.

4. Socket lookups
   Socket lookups use the VRF device for comparison with sk_bound_dev_if.
   If a socket is not bound to a device a socket match can happen based
   on destination address, port and protocol in which case a VRF global
   or agnostic process handles the connection (ie., this allows 1 listener
   socket to handle connections across VRFs). The child socket becomes
   bound to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries
   Neighbor entries are not impacted by the VRF device. Entries are
   associated with a particular interface; the VRF association is indirect
   via the interface-to-VRF device enslavement.

Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
           correct name should be (vrf, mrf or something else)

-  packets flow through the VRF device in both directions allowing the
   following:
   - tcpdump -i vrf<n>
   - tc rules on vrf device
   - netfilter rules on vrf device

Ingo/Andy: I added you two as a start point for the proposed task related
           changes. Not sure who should be the reviewer; please let me know
           if someone else is more appropriate. Thanks.

TO-DO
=====
1. IPv6

2. listen filter to restrict VRF connections
   - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g


Bug-Fixes and ideas from Hannes, Roopa Prabhu, Jon Toppins, Jamal

Patches can also be pulled from:
    https://github.com/dsahern/linux.git, vrf-dev-rfc3 branch
    https://github.com/dsahern/iproute2,  vrf-dev-rfc3 branch


David Ahern (15):
  net: Refactor rtable allocation and initialization
  net: export a few FIB functions
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  net: Tx via VRF device
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Use VRF device index for socket lookups
  net: Add ipv4 route helper to set next hop
  net: Introduce VRF device driver - v2
  net: Add sk_bind_dev_if to task_struct
  net: Add chvrf command
  net: FIB debugging tracepoints

 drivers/net/Kconfig          |   7 +
 drivers/net/Makefile         |   1 +
 drivers/net/vrf.c            | 596 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netdevice.h    |  21 +++
 include/linux/sched.h        |   3 +
 include/net/flow.h           |   1 +
 include/net/route.h          |  13 ++
 include/net/vrf.h            |  83 +++++++++++
 include/uapi/linux/if_link.h |   9 ++
 include/uapi/linux/prctl.h   |   4 +
 kernel/fork.c                |   2 +
 kernel/sys.c                 |  35 +++++
 net/ipv4/af_inet.c           |  14 +-
 net/ipv4/arp.c               |  15 +-
 net/ipv4/fib_frontend.c      |  68 +++++++--
 net/ipv4/fib_semantics.c     |  44 ++++--
 net/ipv4/fib_trie.c          |   8 +-
 net/ipv4/icmp.c              |   9 +-
 net/ipv4/route.c             | 149 +++++++++++--------
 net/ipv4/syncookies.c        |   5 +-
 net/ipv4/tcp_input.c         |   6 +-
 net/ipv4/tcp_ipv4.c          |  11 +-
 net/ipv6/af_inet6.c          |   1 +
 net/ipv6/route.c             |   2 +-
 tools/net/Makefile           |   6 +-
 tools/net/chvrf.c            | 225 ++++++++++++++++++++++++++++
 26 files changed, 1230 insertions(+), 108 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h
 create mode 100644 tools/net/chvrf.c

-- 
2.3.2 (Apple Git-55)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH net-next 01/16] net: Refactor rtable allocation and initialization
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
@ 2015-07-27 18:30 ` David Ahern
  2015-07-27 18:30 ` [PATCH net-next 02/16] net: export a few FIB functions David Ahern
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

All callers to rt_dst_alloc have nearly the same initialization following
a successful allocation. Consolidate it into ip_route_new_rtable.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/route.h |   3 ++
 net/ipv4/route.c    | 111 +++++++++++++++++++++++-----------------------------
 2 files changed, 51 insertions(+), 63 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..cec7a2a055c8 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -111,6 +111,9 @@ struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
+struct rtable *ip_route_new_rtable(struct net_device *dev,
+				   unsigned int flags, u16 type,
+				   bool nopolicy, bool noxfrm, bool do_cache);
 struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    struct sock *sk);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 11096396ef4a..ef140919211f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1443,12 +1443,42 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
 			 (noxfrm ? DST_NOXFRM : 0));
 }
 
+struct rtable *ip_route_new_rtable(struct net_device *dev,
+				   unsigned int flags, u16 type,
+				   bool nopolicy, bool noxfrm, bool do_cache)
+{
+	struct rtable *rth;
+
+	rth = rt_dst_alloc(dev, nopolicy, noxfrm, do_cache);
+	if (rth) {
+		rth->rt_genid = rt_genid_ipv4(dev_net(dev));
+		rth->rt_flags = flags;
+		rth->rt_type = type;
+		rth->rt_is_input = 0;
+		rth->rt_iif = 0;
+		rth->rt_pmtu = 0;
+		rth->rt_gateway = 0;
+		rth->rt_uses_gateway = 0;
+		INIT_LIST_HEAD(&rth->rt_uncached);
+		rth->rt_lwtstate = NULL;
+
+		rth->dst.output = ip_output;
+		if (flags & RTCF_LOCAL)
+			rth->dst.input = ip_local_deliver;
+	}
+
+	return rth;
+}
+EXPORT_SYMBOL(ip_route_new_rtable);
+
 /* called in rcu_read_lock() section */
 static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 				u8 tos, struct net_device *dev, int our)
 {
 	struct rtable *rth;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
+	unsigned int flags = RTCF_MULTICAST;
+	u16 type = RTN_MULTICAST;
 	u32 itag = 0;
 	int err;
 
@@ -1474,8 +1504,13 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		if (err < 0)
 			goto e_err;
 	}
-	rth = rt_dst_alloc(dev_net(dev)->loopback_dev,
-			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, false);
+	if (our)
+		flags |= RTCF_LOCAL;
+
+	rth = ip_route_new_rtable(dev_net(dev)->loopback_dev,
+				  flags, type,
+				  IN_DEV_CONF_GET(in_dev, NOPOLICY),
+				  false, false);
 	if (!rth)
 		goto e_nobufs;
 
@@ -1483,22 +1518,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rth->dst.tclassid = itag;
 #endif
 	rth->dst.output = ip_rt_bug;
-
-	rth->rt_genid	= rt_genid_ipv4(dev_net(dev));
-	rth->rt_flags	= RTCF_MULTICAST;
-	rth->rt_type	= RTN_MULTICAST;
 	rth->rt_is_input= 1;
-	rth->rt_iif	= 0;
-	rth->rt_pmtu	= 0;
-	rth->rt_gateway	= 0;
-	rth->rt_uses_gateway = 0;
-	INIT_LIST_HEAD(&rth->rt_uncached);
-	rth->rt_lwtstate = NULL;
-	if (our) {
-		rth->dst.input= ip_local_deliver;
-		rth->rt_flags |= RTCF_LOCAL;
-	}
-
 #ifdef CONFIG_IP_MROUTE
 	if (!ipv4_is_local_multicast(daddr) && IN_DEV_MFORWARD(in_dev))
 		rth->dst.input = ip_mr_input;
@@ -1606,28 +1626,17 @@ static int __mkroute_input(struct sk_buff *skb,
 		}
 	}
 
-	rth = rt_dst_alloc(out_dev->dev,
-			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
-			   IN_DEV_CONF_GET(out_dev, NOXFRM), do_cache);
+	rth = ip_route_new_rtable(out_dev->dev, 0, res->type,
+				  IN_DEV_CONF_GET(in_dev, NOPOLICY),
+				  IN_DEV_CONF_GET(out_dev, NOXFRM), do_cache);
 	if (!rth) {
 		err = -ENOBUFS;
 		goto cleanup;
 	}
 
-	rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev));
-	rth->rt_flags = 0;
-	rth->rt_type = res->type;
 	rth->rt_is_input = 1;
-	rth->rt_iif 	= 0;
-	rth->rt_pmtu	= 0;
-	rth->rt_gateway	= 0;
-	rth->rt_uses_gateway = 0;
-	INIT_LIST_HEAD(&rth->rt_uncached);
-	rth->rt_lwtstate = NULL;
 	RT_CACHE_STAT_INC(in_slow_tot);
-
 	rth->dst.input = ip_forward;
-	rth->dst.output = ip_output;
 
 	rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
 	if (lwtunnel_output_redirect(rth->rt_lwtstate))
@@ -1788,28 +1797,17 @@ out:	return err;
 		}
 	}
 
-	rth = rt_dst_alloc(net->loopback_dev,
-			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
+	rth = ip_route_new_rtable(net->loopback_dev, flags | RTCF_LOCAL,
+				  res.type, IN_DEV_CONF_GET(in_dev, NOPOLICY),
+				  false, do_cache);
 	if (!rth)
 		goto e_nobufs;
 
-	rth->dst.input= ip_local_deliver;
 	rth->dst.output= ip_rt_bug;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	rth->dst.tclassid = itag;
 #endif
-
-	rth->rt_genid = rt_genid_ipv4(net);
-	rth->rt_flags 	= flags|RTCF_LOCAL;
-	rth->rt_type	= res.type;
 	rth->rt_is_input = 1;
-	rth->rt_iif	= 0;
-	rth->rt_pmtu	= 0;
-	rth->rt_gateway	= 0;
-	rth->rt_uses_gateway = 0;
-	INIT_LIST_HEAD(&rth->rt_uncached);
-	rth->rt_lwtstate = NULL;
-
 	RT_CACHE_STAT_INC(in_slow_tot);
 	if (res.type == RTN_UNREACHABLE) {
 		rth->dst.input= ip_error;
@@ -1981,29 +1979,16 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 add:
-	rth = rt_dst_alloc(dev_out,
-			   IN_DEV_CONF_GET(in_dev, NOPOLICY),
-			   IN_DEV_CONF_GET(in_dev, NOXFRM),
-			   do_cache);
+	rth = ip_route_new_rtable(dev_out, flags, type,
+				  IN_DEV_CONF_GET(in_dev, NOPOLICY),
+				  IN_DEV_CONF_GET(in_dev, NOXFRM),
+				  do_cache);
 	if (!rth)
 		return ERR_PTR(-ENOBUFS);
 
-	rth->dst.output = ip_output;
-
-	rth->rt_genid = rt_genid_ipv4(dev_net(dev_out));
-	rth->rt_flags	= flags;
-	rth->rt_type	= type;
-	rth->rt_is_input = 0;
-	rth->rt_iif	= orig_oif ? : 0;
-	rth->rt_pmtu	= 0;
-	rth->rt_gateway = 0;
-	rth->rt_uses_gateway = 0;
-	INIT_LIST_HEAD(&rth->rt_uncached);
-	rth->rt_lwtstate = NULL;
+	rth->rt_iif	= orig_oif;
 	RT_CACHE_STAT_INC(out_slow_tot);
 
-	if (flags & RTCF_LOCAL)
-		rth->dst.input = ip_local_deliver;
 	if (flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
 		if (flags & RTCF_LOCAL &&
 		    !(dev_out->flags & IFF_LOOPBACK)) {
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 02/16] net: export a few FIB functions
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
  2015-07-27 18:30 ` [PATCH net-next 01/16] net: Refactor rtable allocation and initialization David Ahern
@ 2015-07-27 18:30 ` David Ahern
  2015-07-27 18:30 ` [PATCH net-next 03/16] net: Introduce VRF related flags and helpers David Ahern
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Required by the VRF driver.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c | 2 ++
 net/ipv4/fib_trie.c     | 1 +
 2 files changed, 3 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..c565fc182240 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -108,6 +108,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
 	return tb;
 }
+EXPORT_SYMBOL_GPL(fib_new_table);
 
 /* caller must hold either rtnl or rcu read lock */
 struct fib_table *fib_get_table(struct net *net, u32 id)
@@ -127,6 +128,7 @@ struct fib_table *fib_get_table(struct net *net, u32 id)
 	}
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(fib_get_table);
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
 
 static void fib_replace_table(struct net *net, struct fib_table *old,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 15d32612e3c6..ac2d828c6daa 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1887,6 +1887,7 @@ void fib_free_table(struct fib_table *tb)
 {
 	call_rcu(&tb->rcu, __trie_free_rcu);
 }
+EXPORT_SYMBOL_GPL(fib_free_table);
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 			     struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 03/16] net: Introduce VRF related flags and helpers
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
  2015-07-27 18:30 ` [PATCH net-next 01/16] net: Refactor rtable allocation and initialization David Ahern
  2015-07-27 18:30 ` [PATCH net-next 02/16] net: export a few FIB functions David Ahern
@ 2015-07-27 18:30 ` David Ahern
  2015-07-27 18:30 ` [PATCH net-next 04/16] net: Use VRF device index for lookups on RX David Ahern
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Add link attribute for passing VRF_TABLE id.

Add vrf_ptr to netdevice.

Add various macros for determining if a device is a VRF device, the index
of the master VRF device and table associated with VRF device.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/linux/netdevice.h    | 21 +++++++++++
 include/net/vrf.h            | 83 ++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |  9 +++++
 3 files changed, 113 insertions(+)
 create mode 100644 include/net/vrf.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607b5f41f46f..81cddddbaf78 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1289,6 +1289,7 @@ enum netdev_priv_flags {
 	IFF_XMIT_DST_RELEASE_PERM	= 1<<22,
 	IFF_IPVLAN_MASTER		= 1<<23,
 	IFF_IPVLAN_SLAVE		= 1<<24,
+	IFF_VRF_MASTER			= 1<<25,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1316,6 +1317,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER			IFF_VRF_MASTER
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags {
  *	@dn_ptr:	DECnet specific data
  *	@ip6_ptr:	IPv6 specific data
  *	@ax25_ptr:	AX.25 specific data
+ *	@vrf_ptr:	VRF specific data
  *	@ieee80211_ptr:	IEEE 802.11 specific data, assign before registering
  *
  *	@last_rx:	Time of last Rx
@@ -1650,6 +1653,7 @@ struct net_device {
 	struct dn_dev __rcu     *dn_ptr;
 	struct inet6_dev __rcu	*ip6_ptr;
 	void			*ax25_ptr;
+	struct net_vrf_dev	*vrf_ptr;
 	struct wireless_dev	*ieee80211_ptr;
 	struct wpan_dev		*ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3808,6 +3812,23 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
 	return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_index_is_vrf(struct net *net, int ifindex)
+{
+	struct net_device *dev = dev_get_by_index(net, ifindex);
+	bool rc = false;
+
+	if (dev) {
+		rc = netif_is_vrf(dev);
+		dev_put(dev);
+	}
+	return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index 000000000000..743a172ee849
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,83 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+	int                     ifindex; /* ifindex of master dev */
+	u32                     tb_id;   /* table id for VRF */
+};
+
+#if IS_ENABLED(CONFIG_NET_VRF)
+static inline int vrf_master_dev_ifindex(const struct net_device *dev)
+{
+	struct net_vrf_dev *vrf_ptr;
+	int ifindex = 0;
+
+	if (!dev)
+		return 0;
+
+	if (netif_is_vrf(dev))
+		ifindex = dev->ifindex;
+	else {
+		vrf_ptr = rcu_dereference(dev->vrf_ptr);
+		if (vrf_ptr)
+			ifindex = vrf_ptr->ifindex;
+	}
+
+	return ifindex;
+}
+
+static inline int vrf_get_master_dev_ifindex(struct net *net, int ifindex)
+{
+	int rc = 0;
+
+	if (ifindex) {
+		struct net_device *dev = dev_get_by_index(net, ifindex);
+
+		if (dev) {
+			rc = vrf_master_dev_ifindex(dev);
+			dev_put(dev);
+		}
+	}
+	return rc;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+	int tb_id = 0;
+
+	if (dev) {
+		struct net_vrf_dev *vrf_ptr = rcu_dereference(dev->vrf_ptr);
+
+		if (vrf_ptr)
+			tb_id = vrf_ptr->tb_id;
+	}
+	return tb_id;
+}
+#else
+static inline int vrf_master_dev_ifindex(const struct net_device *dev)
+{
+	return 0;
+}
+
+static inline int vrf_get_master_dev_ifindex(struct net *net, int ifindex)
+{
+	return 0;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+	return 0;
+}
+#endif
+
+#endif /* __LINUX_NET_VRF_H */
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9eeb5d9cf8f0..0037b6b8858d 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -340,6 +340,15 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
+
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 04/16] net: Use VRF device index for lookups on RX
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (2 preceding siblings ...)
  2015-07-27 18:30 ` [PATCH net-next 03/16] net: Introduce VRF related flags and helpers David Ahern
@ 2015-07-27 18:30 ` David Ahern
  2015-07-27 18:30 ` [PATCH net-next 05/16] net: Use VRF device index for lookups on TX David Ahern
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

On ingress use index of VRF master device for route lookups if real device
is enslaved. Rules are expected to be installed for the VRF device to
direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c | 8 +++++++-
 net/ipv4/route.c        | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c565fc182240..6e68a003d0fd 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -311,7 +312,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;
 
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	fl4.flowi4_iif = vrf_master_dev_ifindex(dev);
+	if (!fl4.flowi4_iif)
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
@@ -341,6 +344,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 		if (nh->nh_dev == dev) {
 			dev_match = true;
 			break;
+		} else if (vrf_master_dev_ifindex(nh->nh_dev) == dev->ifindex) {
+			dev_match = true;
+			break;
 		}
 	}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ef140919211f..ba74c83c05be 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1735,7 +1736,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = vrf_master_dev_ifindex(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 05/16] net: Use VRF device index for lookups on TX
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (3 preceding siblings ...)
  2015-07-27 18:30 ` [PATCH net-next 04/16] net: Use VRF device index for lookups on RX David Ahern
@ 2015-07-27 18:30 ` David Ahern
  2015-07-27 18:30 ` [PATCH net-next 06/16] net: Tx via VRF device David Ahern
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

As with ingress use the index of VRF master device for route lookups on
egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather interfaces that are part of the VRF so do not consider the oif for
lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
latter part.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +++++--
 net/ipv4/icmp.c     | 4 ++++
 net/ipv4/route.c    | 3 +++
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index cec7a2a055c8..54f97eea0fb2 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -254,6 +254,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (netif_index_is_vrf(sock_net(sk), oif))
+		flow_flags |= FLOWI_FLAG_VRFSRC;
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index ac2d828c6daa..7da901c56e35 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1421,8 +1421,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}
 
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..d2d142b775b8 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  *	Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_oif = vrf_master_dev_ifindex(skb->dev) ? : skb->dev->ifindex;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	fl4->flowi4_oif = vrf_master_dev_ifindex(skb_in->dev) ? : skb_in->dev->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ba74c83c05be..8119896e1159 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2093,6 +2093,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		if (!dev_out)
 			goto out;
 
+		if (netif_is_vrf(dev_out))
+			fl4->flowi4_flags |= FLOWI_FLAG_VRFSRC;
+
 		/* RACE: Check return value of inet_select_addr instead. */
 		if (!(dev_out->flags & IFF_UP) || !__in_dev_get_rcu(dev_out)) {
 			rth = ERR_PTR(-ENETUNREACH);
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 06/16] net: Tx via VRF device
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (4 preceding siblings ...)
  2015-07-27 18:30 ` [PATCH net-next 05/16] net: Use VRF device index for lookups on TX David Ahern
@ 2015-07-27 18:30 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 07/16] net: Add inet_addr lookup by table David Ahern
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:30 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

If out device is enslaved to a VRF device we want packets to go through the
VRF master device first. This allows for example iptables rules and tc rules
to be configured on the VRF as a whole as well as the option for rules on
specific netdevices. This is accomplished by updating the dev in the dst to
point to the VRF device if it is enslaved.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/route.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8119896e1159..050a3c1d89ba 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1903,6 +1903,23 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 }
 EXPORT_SYMBOL(ip_route_input_noref);
 
+/* if out device is enslaved to a VRF device update dst to
+ * send through it
+ */
+static void rt_use_vrf_dev(struct rtable *rth, struct net_device *dev_out)
+{
+#if IS_ENABLED(CONFIG_NET_VRF)
+	int ifindex = vrf_master_dev_ifindex(dev_out);
+	struct net_device *mdev;
+
+	mdev = dev_get_by_index(dev_net(dev_out), ifindex);
+	if (mdev) {
+		dev_put(rth->dst.dev);
+		rth->dst.dev = mdev;
+	}
+#endif
+}
+
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
 				       const struct flowi4 *fl4, int orig_oif,
@@ -2008,6 +2025,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
+	rt_use_vrf_dev(rth, dev_out);
 
 	return rth;
 }
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 07/16] net: Add inet_addr lookup by table
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (5 preceding siblings ...)
  2015-07-27 18:30 ` [PATCH net-next 06/16] net: Tx via VRF device David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 08/16] net: Fix up inet_addr_type checks David Ahern
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/route.h     |  1 +
 net/ipv4/fib_frontend.c | 22 +++++++++++++++-------
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 54f97eea0fb2..3b51c339c269 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6e68a003d0fd..cc413b0170ed 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -214,12 +214,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
 						const struct net_device *dev,
-						__be32 addr)
+						__be32 addr, int tb_id)
 {
 	struct flowi4		fl4 = { .daddr = addr };
 	struct fib_result	res;
 	unsigned int ret = RTN_BROADCAST;
-	struct fib_table *local_table;
+	struct fib_table *table;
 
 	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		return RTN_BROADCAST;
@@ -228,10 +228,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (local_table) {
+	table = fib_get_table(net, tb_id);
+	if (table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
 		}
@@ -241,16 +241,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+	return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 08/16] net: Fix up inet_addr_type checks
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (6 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 07/16] net: Add inet_addr lookup by table David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 09/16] net: Add routes to the table associated with the device David Ahern
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/route.h      |  3 +++
 net/ipv4/af_inet.c       | 13 ++++++++++++-
 net/ipv4/arp.c           | 15 +++++++++------
 net/ipv4/fib_frontend.c  | 28 +++++++++++++++++++++++++---
 net/ipv4/fib_semantics.c |  6 ++++--
 net/ipv4/icmp.c          |  5 +++--
 6 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 3b51c339c269..b14cbec93fbd 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -195,6 +195,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr);
 unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
+unsigned int inet_addr_type_dev_table(struct net *net,
+				      const struct net_device *dev,
+				      __be32 addr);
 void ip_rt_multicast_event(struct in_device *);
 int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
 void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cc4e498a0ccf..09c7c1ee307e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -119,6 +119,7 @@
 #ifdef CONFIG_IP_MROUTE
 #include <linux/mroute.h>
 #endif
+#include <net/vrf.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	struct net *net = sock_net(sk);
 	unsigned short snum;
 	int chk_addr_ret;
+	int tb_id = 0;
 	int err;
 
 	/* If the socket has its own bind function then use it. (RAW) */
@@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			goto out;
 	}
 
-	chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+	if (sk->sk_bound_dev_if) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index(net, sk->sk_bound_dev_if);
+		if (dev) {
+			tb_id = vrf_dev_table(dev);
+			dev_put(dev);
+		}
+	}
+	chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
 	/* Not specified by any standard per-se, however it breaks too
 	 * many applications when removed.  It is unfortunate since
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 1d59e50ce8b7..53eee7cecce8 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh)
 		return -EINVAL;
 	}
 
-	neigh->type = inet_addr_type(dev_net(dev), addr);
+	neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);
 
 	parms = in_dev->arp_parms;
 	__neigh_parms_put(neigh->parms);
@@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
 	default:
 	case 0:		/* By default announce any local IP */
-		if (skb && inet_addr_type(dev_net(dev),
+		if (skb && inet_addr_type_dev_table(dev_net(dev), dev,
 					  ip_hdr(skb)->saddr) == RTN_LOCAL)
 			saddr = ip_hdr(skb)->saddr;
 		break;
@@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 		if (!skb)
 			break;
 		saddr = ip_hdr(skb)->saddr;
-		if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
+		if (inet_addr_type_dev_table(dev_net(dev), dev,
+					     saddr) == RTN_LOCAL) {
 			/* saddr should be known to target */
 			if (inet_addr_onlink(in_dev, target, saddr))
 				break;
@@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb)
 	/* Special case: IPv4 duplicate address detection packet (RFC2131) */
 	if (sip == 0) {
 		if (arp->ar_op == htons(ARPOP_REQUEST) &&
-		    inet_addr_type(net, tip) == RTN_LOCAL &&
+		    inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
 		    !arp_ignore(in_dev, sip, tip))
 			arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip, sha,
 				 dev->dev_addr, sha);
@@ -811,16 +812,18 @@ static int arp_process(struct sock *sk, struct sk_buff *skb)
 	n = __neigh_lookup(&arp_tbl, &sip, dev, 0);
 
 	if (IN_DEV_ARP_ACCEPT(in_dev)) {
+		unsigned int addr_type = inet_addr_type_dev_table(net, dev, sip);
+
 		/* Unsolicited ARP is not accepted by default.
 		   It is possible, that this option should be enabled for some
 		   devices (strip is candidate)
 		 */
 		is_garp = arp->ar_op == htons(ARPOP_REQUEST) && tip == sip &&
-			  inet_addr_type(net, sip) == RTN_UNICAST;
+			  addr_type == RTN_UNICAST;
 
 		if (!n &&
 		    ((arp->ar_op == htons(ARPOP_REPLY)  &&
-		      inet_addr_type(net, sip) == RTN_UNICAST) || is_garp))
+				addr_type == RTN_UNICAST) || is_garp))
 			n = __neigh_lookup(&arp_tbl, &sip, dev, 1);
 	}
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index cc413b0170ed..5ce0d11222ca 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -228,6 +228,9 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
+	if (!tb_id)
+		tb_id = RT_TABLE_LOCAL;
+
 	table = fib_get_table(net, tb_id);
 	if (table) {
 		ret = RTN_UNICAST;
@@ -262,6 +265,19 @@ unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
+/* inet_addr_type with dev == NULL but using the table from a dev
+ * if one is associated
+ */
+unsigned int inet_addr_type_dev_table(struct net *net,
+				      const struct net_device *dev,
+				      __be32 addr)
+{
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, NULL, addr, rt_table);
+}
+EXPORT_SYMBOL(inet_addr_type_dev_table);
+
 __be32 fib_compute_spec_dst(struct sk_buff *skb)
 {
 	struct net_device *dev = skb->dev;
@@ -512,9 +528,12 @@ static int rtentry_to_fib_config(struct net *net, int cmd, struct rtentry *rt,
 
 	addr = sk_extract_addr(&rt->rt_gateway);
 	if (rt->rt_gateway.sa_family == AF_INET && addr) {
+		unsigned int addr_type;
+
 		cfg->fc_gw = addr;
+		addr_type = inet_addr_type_table(net, cfg->fc_table, addr);
 		if (rt->rt_flags & RTF_GATEWAY &&
-		    inet_addr_type(net, addr) == RTN_UNICAST)
+		    addr_type == RTN_UNICAST)
 			cfg->fc_scope = RT_SCOPE_UNIVERSE;
 	}
 
@@ -986,11 +1005,14 @@ void fib_del_ifaddr(struct in_ifaddr *ifa, struct in_ifaddr *iprim)
 			fib_magic(RTM_DELROUTE, RTN_BROADCAST, any, 32, prim);
 	}
 	if (!(ok & LOCAL_OK)) {
+		unsigned int addr_type;
+
 		fib_magic(RTM_DELROUTE, RTN_LOCAL, ifa->ifa_local, 32, prim);
 
 		/* Check, that this local address finally disappeared. */
-		if (gone &&
-		    inet_addr_type(dev_net(dev), ifa->ifa_local) != RTN_LOCAL) {
+		addr_type = inet_addr_type_dev_table(dev_net(dev), dev,
+						     ifa->ifa_local);
+		if (gone && addr_type != RTN_LOCAL) {
 			/* And the last, but not the least thing.
 			 * We must flush stray FIB entries.
 			 *
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 65e00399a9a6..a578eacf9fcd 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -670,16 +670,18 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		struct fib_result res;
 
 		if (nh->nh_flags & RTNH_F_ONLINK) {
+			unsigned int addr_type;
 
 			if (cfg->fc_scope >= RT_SCOPE_LINK)
 				return -EINVAL;
-			if (inet_addr_type(net, nh->nh_gw) != RTN_UNICAST)
-				return -EINVAL;
 			dev = __dev_get_by_index(net, nh->nh_oif);
 			if (!dev)
 				return -ENODEV;
 			if (!(dev->flags & IFF_UP))
 				return -ENETDOWN;
+			addr_type = inet_addr_type_dev_table(net, dev, nh->nh_gw);
+			if (addr_type != RTN_UNICAST)
+				return -EINVAL;
 			if (!netif_carrier_ok(dev))
 				nh->nh_flags |= RTNH_F_LINKDOWN;
 			nh->nh_dev = dev;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index d2d142b775b8..79fd8ff39f65 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -484,7 +484,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	if (err)
 		goto relookup_failed;
 
-	if (inet_addr_type(net, fl4_dec.saddr) == RTN_LOCAL) {
+	if (inet_addr_type_dev_table(net, skb_in->dev,
+				     fl4_dec.saddr) == RTN_LOCAL) {
 		rt2 = __ip_route_output_key(net, &fl4_dec);
 		if (IS_ERR(rt2))
 			err = PTR_ERR(rt2);
@@ -833,7 +834,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 	 */
 
 	if (!net->ipv4.sysctl_icmp_ignore_bogus_error_responses &&
-	    inet_addr_type(net, iph->daddr) == RTN_BROADCAST) {
+	    inet_addr_type_dev_table(net, skb->dev, iph->daddr) == RTN_BROADCAST) {
 		net_warn_ratelimited("%pI4 sent an invalid ICMP type %u, code %u error to a broadcast: %pI4 on %s\n",
 				     &ip_hdr(skb)->saddr,
 				     icmph->type, icmph->code,
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 09/16] net: Add routes to the table associated with the device
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (7 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 08/16] net: Fix up inet_addr_type checks David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 10/16] net: Use passed in table for nexthop lookups David Ahern
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

When a device associated with a VRF is brought up or down routes
should be added to/removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use
the table with the device if there is one.

A part of this is directing prefsrc validations to the correct
table as well.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c  |  8 ++++----
 net/ipv4/fib_semantics.c | 25 +++++++++++++++++++------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 5ce0d11222ca..e35541a64449 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -805,6 +805,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -819,11 +820,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
+	if (!tb_id)
+		tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+	tb = fib_new_table(net, tb_id);
 	if (!tb)
 		return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index a578eacf9fcd..37e1dee7692a 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			fi->fib_flags |= RTNH_F_LINKDOWN;
 	}
 
-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
-	}
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+		goto err_inval;
 
 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 10/16] net: Use passed in table for nexthop lookups
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (8 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 09/16] net: Add routes to the table associated with the device David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 11/16] net: Use VRF device index for socket lookups David Ahern
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

If a user passes in a table for new routes use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table, but only another table and then a subsequent
route is added with a next hop using the connected route. ie.,

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
169.254.0.0/16 dev eth0  scope link  metric 1003
192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2  scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/fib_semantics.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 37e1dee7692a..7d79dfbfa5d2 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		}
 		rcu_read_lock();
 		{
+			struct fib_table *tbl = NULL;
 			struct flowi4 fl4 = {
 				.daddr = nh->nh_gw,
 				.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res,
-					 FIB_LOOKUP_IGNORE_LINKSTATE);
+
+			if (cfg->fc_table)
+				tbl = fib_get_table(net, cfg->fc_table);
+
+			if (tbl)
+				err = fib_table_lookup(tbl, &fl4, &res,
+						   FIB_LOOKUP_IGNORE_LINKSTATE);
+			else
+				err = fib_lookup(net, &fl4, &res,
+						 FIB_LOOKUP_IGNORE_LINKSTATE);
 			if (err) {
 				rcu_read_unlock();
 				return err;
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 11/16] net: Use VRF device index for socket lookups
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (9 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 10/16] net: Use passed in table for nexthop lookups David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 12/16] net: Add ipv4 route helper to set next hop David Ahern
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

The intent of the VRF device is to leverage the existing SO_BINDTODEVICE
as a means of creating L3 domains. Since sockets are expected to be bound
to the VRF device the index of the master device needs to be used for
socket lookups.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/syncookies.c |  5 ++++-
 net/ipv4/tcp_input.c  |  6 +++++-
 net/ipv4/tcp_ipv4.c   | 11 +++++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index d70b1f603692..dab52fba5872 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -18,6 +18,7 @@
 #include <linux/export.h>
 #include <net/tcp.h>
 #include <net/route.h>
+#include <net/vrf.h>
 
 extern int sysctl_tcp_syncookies;
 
@@ -348,7 +349,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	treq->snt_synack	= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0;
 	treq->tfo_listener	= false;
 
-	ireq->ir_iif = sk->sk_bound_dev_if;
+	ireq->ir_iif = vrf_get_master_dev_ifindex(sock_net(sk), skb->skb_iif);
+	if (!ireq->ir_iif)
+		ireq->ir_iif = sk->sk_bound_dev_if;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4d6bcd0ca9..df82fb05c459 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,6 +72,7 @@
 #include <net/dst.h>
 #include <net/tcp.h>
 #include <net/inet_common.h>
+#include <net/vrf.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
@@ -6141,7 +6142,10 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_openreq_init(req, &tmp_opt, skb, sk);
 
 	/* Note: tcp_v6_init_req() might override ir_iif for link locals */
-	inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
+	inet_rsk(req)->ir_iif = vrf_get_master_dev_ifindex(sock_net(sk),
+							   skb->skb_iif);
+	if (!inet_rsk(req)->ir_iif)
+		inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
 
 	af_ops->init_req(req, sk, skb);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 486ba96ae91a..d0c40f4d9058 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
 #include <net/secure_seq.h>
 #include <net/tcp_memcontrol.h>
 #include <net/busy_poll.h>
+#include <net/vrf.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -682,6 +683,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (sk)
 		arg.bound_dev_if = sk->sk_bound_dev_if;
+	if (!arg.bound_dev_if && skb->dev)
+		arg.bound_dev_if = vrf_master_dev_ifindex(skb->dev);
 
 	arg.tos = ip_hdr(skb)->tos;
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
@@ -766,8 +769,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
 				      ip_hdr(skb)->saddr, /* XXX */
 				      arg.iov[0].iov_len, IPPROTO_TCP, 0);
 	arg.csumoffset = offsetof(struct tcphdr, check) / 2;
-	if (oif)
-		arg.bound_dev_if = oif;
+	arg.bound_dev_if = oif ? : vrf_master_dev_ifindex(skb_dst(skb)->dev);
+	if (!arg.bound_dev_if)
+		arg.bound_dev_if = vrf_master_dev_ifindex(skb->dev);
+
 	arg.tos = tos;
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
@@ -1269,6 +1274,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	ireq		      = inet_rsk(req);
 	sk_daddr_set(newsk, ireq->ir_rmt_addr);
 	sk_rcv_saddr_set(newsk, ireq->ir_loc_addr);
+	if (netif_index_is_vrf(sock_net(newsk), ireq->ir_iif))
+		newsk->sk_bound_dev_if = ireq->ir_iif;
 	newinet->inet_saddr	      = ireq->ir_loc_addr;
 	inet_opt	      = ireq->opt;
 	rcu_assign_pointer(newinet->inet_opt, inet_opt);
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 12/16] net: Add ipv4 route helper to set next hop
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (10 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 11/16] net: Use VRF device index for socket lookups David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH net-next 13/16] net: Introduce VRF device driver - v2 David Ahern
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/route.h |  3 +++
 net/ipv4/route.c    | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/net/route.h b/include/net/route.h
index b14cbec93fbd..900d50fbcfc7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -107,6 +107,7 @@ struct rt_cache_stat {
 extern struct ip_rt_acct __percpu *ip_rt_acct;
 
 struct in_device;
+struct fib_result;
 
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
@@ -114,6 +115,8 @@ void rt_flush_dev(struct net_device *dev);
 struct rtable *ip_route_new_rtable(struct net_device *dev,
 				   unsigned int flags, u16 type,
 				   bool nopolicy, bool noxfrm, bool do_cache);
+void ip_route_set_nexthop(struct rtable *rt, __be32 daddr,
+			  const struct fib_result *res);
 struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    struct sock *sk);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 050a3c1d89ba..47dae001a000 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1537,6 +1537,16 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	return err;
 }
 
+void ip_route_set_nexthop(struct rtable *rt, __be32 daddr,
+			  const struct fib_result *res)
+{
+	struct fib_nh_exception *fnhe;
+
+	fnhe = find_exception(&FIB_RES_NH(*res), daddr);
+
+	rt_set_nexthop(rt, daddr, res, fnhe, res->fi, res->type, 0);
+}
+EXPORT_SYMBOL(ip_route_set_nexthop);
 
 static void ip_handle_martian_source(struct net_device *dev,
 				     struct in_device *in_dev,
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 13/16] net: Introduce VRF device driver - v2
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (11 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 12/16] net: Add ipv4 route helper to set next hop David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 20:01   ` Nikolay Aleksandrov
  2015-07-27 18:31 ` [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct David Ahern
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master
device with an associated table and enslaving all routed interfaces that
participate in the domain. As part of the enslavement, all connected
routes for the enslaved devices are moved to the table associated with
the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule
processing is followed. Routed traffic through the box, is forwarded by
using the VRF device as the IIF and following the IIF rule to a table
that is mated with the VRF.

Example:

   Create vrf 1:
     ip link add vrf1 type vrf table 5
     ip rule add iif vrf1 table 5
     ip rule add oif vrf1 table 5
     ip route add table 5 prohibit default
     ip link set vrf1 up

   Add interface to vrf 1:
     ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

v2:
- addressed comments from first RFC
- significant changes to improve simplicity of implementation
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c    | 596 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 604 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+	tristate "Virtual Routing and Forwarding (Lite)"
+	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+	---help---
+	  This option enables the support for mapping interfaces into VRF's. The
+	  support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index 000000000000..8669b0f9d749
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,596 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME	"vrf"
+#define DRV_VERSION	"1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
+
+#define vrf_master_get_rcu(dev) \
+	((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+	u64			tx_pkts;
+	u64			tx_bytes;
+	u64			tx_drps;
+	u64			rx_pkts;
+	u64			rx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+struct slave {
+	struct list_head	list;
+	struct net_device	*dev;
+};
+
+struct slave_queue {
+	spinlock_t		lock; /* lock for slave insert/delete */
+	struct list_head	all_slaves;
+	int			num_slaves;
+};
+
+struct net_vrf {
+	struct slave_queue	queue;
+	struct fib_table	*tb;
+	u32			tb_id;
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6):
+		return true;
+	}
+	return false;
+}
+
+/* note: already called with rcu_read_lock */
+static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+
+	if (is_ip_rx_frame(skb)) {
+		struct net_device *dev = vrf_master_get_rcu(skb->dev);
+		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
+
+		u64_stats_update_begin(&dstats->syncp);
+		dstats->rx_pkts++;
+		dstats->rx_bytes += skb->len;
+		u64_stats_update_end(&dstats->syncp);
+
+		skb->dev = dev;
+
+		return RX_HANDLER_ANOTHER;
+	}
+	return RX_HANDLER_PASS;
+}
+
+static struct rtnl_link_stats64 *vrf_get_stats64(struct net_device *dev,
+						 struct rtnl_link_stats64 *stats)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_dstats *dstats;
+		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
+		unsigned int start;
+
+		dstats = per_cpu_ptr(dev->dstats, i);
+		do {
+			start = u64_stats_fetch_begin_irq(&dstats->syncp);
+			tbytes = dstats->tx_bytes;
+			tpkts = dstats->tx_pkts;
+			tdrops = dstats->tx_drps;
+			rbytes = dstats->rx_bytes;
+			rpkts = dstats->rx_pkts;
+		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
+		stats->tx_bytes += tbytes;
+		stats->tx_packets += tpkts;
+		stats->tx_dropped += tdrops;
+		stats->rx_bytes += rbytes;
+		stats->rx_packets += rpkts;
+	}
+	return stats;
+}
+
+static int vrf_process_v4_outbound(struct sk_buff *skb)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+	struct net_device *dev = skb->dev;
+	struct net *net = dev_net(dev);
+	int tb_id = vrf_dev_table(dev);
+	int err, ret = NET_XMIT_DROP;
+	struct in_device *in_dev;
+	struct flowi4 fl4 = {
+		.flowi4_tos = RT_TOS(iph->tos),
+		.daddr = iph->daddr,
+		.saddr = iph->saddr,
+	};
+	struct rtable *rth;
+
+	struct fib_result res;
+	struct fib_table *tbl;
+
+	rcu_read_lock();
+	in_dev = __in_dev_get_rcu(dev);
+	if (!in_dev)
+		goto out;
+
+	if (tb_id == 0)
+		goto out;
+
+	tbl = fib_get_table(net, tb_id);
+	if (!tbl)
+		goto out;
+
+	res.tclassid    = 0;
+	res.fi          = NULL;
+	res.table       = NULL;
+	if (fib_table_lookup(tbl, &fl4, &res, 0) != 0)
+		goto out;
+
+	dev = FIB_RES_DEV(res);
+
+	rth = ip_route_new_rtable(dev, RTCF_LOCAL, res.type,
+				  IN_DEV_CONF_GET(in_dev, NOPOLICY),
+				  IN_DEV_CONF_GET(in_dev, NOXFRM),
+				  true);
+	if (!rth)
+		goto out;
+
+	ip_route_set_nexthop(rth, fl4.daddr, &res);
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rth->dst);
+	err = ip_local_out(skb);
+	if (!net_xmit_eval(err))
+		ret = NET_XMIT_SUCCESS;
+
+out:
+	rcu_read_unlock();
+
+	if (ret != NET_XMIT_SUCCESS)
+		dev->stats.tx_errors++;
+
+	return ret;
+}
+
+static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *mdev)
+{
+	struct pcpu_dstats *dstats = this_cpu_ptr(mdev->dstats);
+	unsigned int len = skb->len;
+	int ret = NET_XMIT_DROP;
+
+	if (skb_mac_header_was_set(skb)) {
+		skb_pull(skb, sizeof(struct ethhdr));
+		skb->mac_header = (typeof(skb->mac_header))~0U;
+		skb_reset_network_header(skb);
+	}
+
+	if (skb->protocol == htons(ETH_P_IP))
+		ret = vrf_process_v4_outbound(skb);
+
+	if (ret == NET_XMIT_SUCCESS) {
+		u64_stats_update_begin(&dstats->syncp);
+		dstats->tx_pkts++;
+		dstats->tx_bytes += len;
+		u64_stats_update_end(&dstats->syncp);
+	} else {
+		dstats->tx_drps++;
+		kfree_skb(skb);
+	}
+	return ret;
+}
+
+/**************************** device handling ********************/
+
+/* cycle interface to flush neighbor cache and move routes across tables */
+static void cycle_netdev(struct net_device *dev)
+{
+	unsigned int flags = dev->flags;
+	int ret;
+
+	if (!netif_running(dev))
+		return;
+
+	ret = dev_change_flags(dev, flags & ~IFF_UP);
+	if (ret >= 0)
+		ret = dev_change_flags(dev, flags);
+
+	if (ret < 0) {
+		netdev_err(dev,
+			   "Failed to cycle device %s; route tables might be wrong!\n",
+			   dev->name);
+	}
+}
+
+/* queue->lock must be held */
+static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
+					  struct net_device *dev)
+{
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		if (slave->dev == dev)
+			return slave;
+	}
+
+	return NULL;
+}
+
+/* inverse of __vrf_insert_slave; queue->lock must be held */
+static void __vrf_remove_slave(struct slave_queue *queue, struct slave *slave)
+{
+	dev_put(slave->dev);
+	list_del(&slave->list);
+	queue->num_slaves--;
+}
+
+/* queue->lock must be held */
+static void __vrf_insert_slave(struct slave_queue *queue, struct slave *slave)
+{
+	dev_hold(slave->dev);
+	list_add(&slave->list, &queue->all_slaves);
+	queue->num_slaves++;
+}
+
+static int do_vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
+{
+	struct net_vrf_dev *vrf_ptr = kmalloc(sizeof(*vrf_ptr), GFP_KERNEL);
+	struct slave *slave = kzalloc(sizeof(*slave), GFP_KERNEL);
+	struct slave *duplicate_slave;
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	int ret = -ENOMEM;
+
+	if (!slave || !vrf_ptr)
+		goto out_fail;
+
+	slave->dev = port_dev;
+
+	vrf_ptr->ifindex = dev->ifindex;
+	vrf_ptr->tb_id = vrf->tb_id;
+
+	spin_lock_bh(&queue->lock);
+
+	duplicate_slave = __vrf_find_slave_dev(queue, port_dev);
+	if (duplicate_slave) {
+		spin_unlock_bh(&queue->lock);
+		ret = -EBUSY;
+		goto out_fail;
+	}
+
+	__vrf_insert_slave(queue, slave);
+
+	spin_unlock_bh(&queue->lock);
+
+	/* register the packet handler for slave ports */
+	ret = netdev_rx_handler_register(port_dev, vrf_handle_frame, dev);
+	if (ret) {
+		netdev_err(port_dev,
+			   "Device %s failed to register rx_handler\n",
+			   port_dev->name);
+		goto out_remove;
+	}
+
+	ret = netdev_master_upper_dev_link(port_dev, dev);
+	if (ret < 0)
+		goto out_unregister;
+
+	port_dev->flags |= IFF_SLAVE;
+
+	rcu_assign_pointer(port_dev->vrf_ptr, vrf_ptr);
+	cycle_netdev(port_dev);
+
+	return 0;
+
+out_unregister:
+	netdev_rx_handler_unregister(port_dev);
+out_remove:
+	__vrf_remove_slave(queue, slave);
+out_fail:
+	kfree(vrf_ptr);
+	kfree(slave);
+	return ret;
+}
+
+static int vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
+{
+	ASSERT_RTNL();
+
+	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
+		return -ENODEV;
+
+	if (!vrf_is_master(dev) || vrf_is_master(port_dev) ||
+	    vrf_is_slave(port_dev))
+		return -EINVAL;
+
+	return do_vrf_add_slave(dev, port_dev);
+}
+
+/* inverse of do_vrf_add_slave */
+static int do_vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct net_vrf_dev *vrf_ptr = NULL;
+	struct slave *slave;
+
+	vrf_ptr = rcu_dereference(dev->vrf_ptr);
+	RCU_INIT_POINTER(dev->vrf_ptr, NULL);
+	kfree(vrf_ptr);
+
+	netdev_upper_dev_unlink(port_dev, dev);
+	port_dev->flags &= ~IFF_SLAVE;
+
+	netdev_rx_handler_unregister(dev);
+
+	cycle_netdev(port_dev);
+
+	spin_lock_bh(&queue->lock);
+
+	slave = __vrf_find_slave_dev(queue, port_dev);
+	if (slave)
+		__vrf_remove_slave(queue, slave);
+
+	spin_unlock_bh(&queue->lock);
+
+	kfree(slave);
+
+	return 0;
+}
+
+static int vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
+{
+	ASSERT_RTNL();
+
+	if (!dev || !port_dev)
+		return -ENODEV;
+
+	if (!vrf_is_master(dev))
+		return -EINVAL;
+
+	return do_vrf_del_slave(dev, port_dev);
+}
+
+static int vrf_dev_init(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	spin_lock_init(&vrf->queue.lock);
+	INIT_LIST_HEAD(&vrf->queue.all_slaves);
+
+	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
+	if (!dev->dstats)
+		return -ENOMEM;
+
+	dev->flags  =  IFF_MASTER | IFF_NOARP;
+
+	return 0;
+}
+
+static void vrf_dev_uninit(struct net_device *dev)
+{
+	free_percpu(dev->dstats);
+}
+
+static int vrf_dev_close(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		slave->dev->vrf_ptr->ifindex = 0;
+		slave->dev->vrf_ptr->tb_id = 0;
+	}
+
+	if (dev->flags & IFF_MASTER)
+		dev->flags &= ~IFF_UP;
+
+	return 0;
+}
+
+static int vrf_dev_open(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		slave->dev->vrf_ptr->ifindex = dev->ifindex;
+		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
+	}
+
+	if (dev->flags & IFF_MASTER)
+		dev->flags |= IFF_UP;
+
+	if (!vrf->tb)
+		return -EINVAL;
+
+	return 0;
+}
+
+static const struct net_device_ops vrf_netdev_ops = {
+	.ndo_init		= vrf_dev_init,
+	.ndo_uninit		= vrf_dev_uninit,
+	.ndo_open		= vrf_dev_open,
+	.ndo_stop		= vrf_dev_close,
+	.ndo_start_xmit		= vrf_xmit,
+	.ndo_get_stats64	= vrf_get_stats64,
+	.ndo_add_slave		= vrf_add_slave,
+	.ndo_del_slave		= vrf_del_slave,
+};
+
+static void vrf_get_drvinfo(struct net_device *dev,
+			    struct ethtool_drvinfo *info)
+{
+	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
+	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
+}
+
+static const struct ethtool_ops vrf_ethtool_ops = {
+	.get_drvinfo	= vrf_get_drvinfo,
+};
+
+static void vrf_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	/* Initialize the device structure. */
+	dev->netdev_ops = &vrf_netdev_ops;
+	dev->ethtool_ops = &vrf_ethtool_ops;
+	dev->destructor = free_netdev;
+
+	/* Fill in device structure with ethernet-generic values. */
+	eth_hw_addr_random(dev);
+}
+
+static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
+			return -EINVAL;
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
+			return -EADDRNOTAVAIL;
+	}
+	return 0;
+}
+
+static int vrf_newlink(struct net *src_net, struct net_device *dev,
+		       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	int err;
+
+	if (!data || !data[IFLA_VRF_TABLE])
+		return -EINVAL;
+
+	vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
+
+	/* reserve a table for this VRF device */
+	err = -ERANGE;
+	vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
+	if (!vrf->tb)
+		goto out_fail;
+
+	dev->priv_flags |= IFF_VRF_MASTER;
+
+	err = -ENOMEM;
+	dev->vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
+	if (!dev->vrf_ptr)
+		goto out_fail;
+
+	dev->vrf_ptr->ifindex = dev->ifindex;
+	dev->vrf_ptr->tb_id = vrf->tb_id;
+
+	err = register_netdevice(dev);
+	if (err < 0)
+		goto out_fail;
+
+	return 0;
+
+out_fail:
+	kfree(dev->vrf_ptr);
+	free_netdev(dev);
+	return err;
+}
+
+static void vrf_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	kfree(dev->vrf_ptr);
+	fib_free_table(vrf->tb);
+}
+
+static size_t vrf_nl_getsize(const struct net_device *dev)
+{
+	return nla_total_size(sizeof(u32));  /* IFLA_VRF_TABLE */
+}
+
+static int vrf_fillinfo(struct sk_buff *skb,
+			const struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	return nla_put_u32(skb, IFLA_VRF_TABLE, vrf->tb_id);
+}
+
+static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
+	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
+};
+
+static struct rtnl_link_ops vrf_link_ops __read_mostly = {
+	.kind		= DRV_NAME,
+	.priv_size	= sizeof(struct net_vrf),
+
+	.get_size	= vrf_nl_getsize,
+	.policy		= vrf_nl_policy,
+	.validate	= vrf_validate,
+	.fill_info	= vrf_fillinfo,
+
+	.newlink	= vrf_newlink,
+	.dellink	= vrf_dellink,
+	.setup		= vrf_setup,
+	.maxtype	= IFLA_VRF_MAX,
+};
+
+static int __init vrf_init_module(void)
+{
+	return rtnl_link_register(&vrf_link_ops);
+}
+
+static void __exit vrf_cleanup_module(void)
+{
+	rtnl_link_unregister(&vrf_link_ops);
+}
+
+module_init(vrf_init_module);
+module_exit(vrf_cleanup_module);
+MODULE_AUTHOR("Shrijeet Mukherjee, David Ahern");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+MODULE_VERSION(DRV_VERSION);
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (12 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 13/16] net: Introduce VRF device driver - v2 David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 20:33   ` Eric W. Biederman
  2015-07-28 15:25   ` Andy Lutomirski
  2015-07-27 18:31 ` [PATCH net-next 15/16] net: Add chvrf command David Ahern
                   ` (2 subsequent siblings)
  16 siblings, 2 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Allow tasks to have a default device index for binding sockets. If set
the value is passed to all AF_INET/AF_INET6 sockets when they are created.

The task setting is passed parent to child on fork, but can be set or
changed after task creation using prctl (if task has CAP_NET_ADMIN
permissions). The setting for a socket can be retrieved using prctl().
This option allows an administrator to restrict a task to only send/receive
packets through the specified device. In the case of VRF devices this
option restricts tasks to a specific VRF.

Correlation of the device index to a specific VRF, ie.,
   ifindex --> VRF device --> VRF id
is left to userspace.

Example using VRF devices:
1. vrf1 is created and assigned to table 5
2. eth2 is enslaved to vrf1
3. eth2 is given the address 1.1.1.1/24

$ ip route ls table 5
prohibit default
1.1.1.0/24 dev eth2  scope link
local 1.1.1.1 dev eth2  proto kernel  scope host  src 1.1.1.1

With out setting a VRF context ping, tcp and udp attempts fail. e.g,
$ ping 1.1.1.254
connect: Network is unreachable

After binding the task to the vrf device ping succeeds:
$ ./chvrf -v 1 ping -c1 1.1.1.254
PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/linux/sched.h      |  3 +++
 include/uapi/linux/prctl.h |  4 ++++
 kernel/fork.c              |  2 ++
 kernel/sys.c               | 35 +++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         |  1 +
 net/ipv4/route.c           |  4 +++-
 net/ipv6/af_inet6.c        |  1 +
 net/ipv6/route.c           |  2 +-
 8 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..29b336b8a466 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1528,6 +1528,9 @@ struct task_struct {
 	struct files_struct *files;
 /* namespaces */
 	struct nsproxy *nsproxy;
+/* network */
+	/* if set INET/INET6 sockets are bound to given dev index on create */
+	int sk_bind_dev_if;
 /* signal handlers */
 	struct signal_struct *signal;
 	struct sighand_struct *sighand;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1ef45195d146 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,8 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* get/set network interface sockets are bound to by default */
+#define PR_SET_SK_BIND_DEV_IF   47
+#define PR_GET_SK_BIND_DEV_IF   48
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d7b7cc..8b396e77d2bf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -380,6 +380,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->splice_pipe = NULL;
 	tsk->task_frag.page = NULL;
 
+	tsk->sk_bind_dev_if = orig->sk_bind_dev_if;
+
 	account_kernel_stack(ti, 1);
 
 	return tsk;
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..59119ac0a0bd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/netdevice.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -2267,6 +2268,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NET
+	case PR_SET_SK_BIND_DEV_IF:
+	{
+		struct net_device *dev;
+		int idx = (int) arg2;
+
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+
+		if (idx) {
+			dev = dev_get_by_index(me->nsproxy->net_ns, idx);
+			if (!dev)
+				return -EINVAL;
+			dev_put(dev);
+		}
+		me->sk_bind_dev_if = idx;
+		break;
+	}
+	case PR_GET_SK_BIND_DEV_IF:
+	{
+		struct task_struct *tsk;
+		int sk_bind_dev_if = -EINVAL;
+
+		rcu_read_lock();
+		tsk = find_task_by_vpid(arg2);
+		if (tsk)
+			sk_bind_dev_if = tsk->sk_bind_dev_if;
+		rcu_read_unlock();
+		if (tsk != me && !capable(CAP_NET_ADMIN))
+			return -EPERM;
+		error = sk_bind_dev_if;
+		break;
+	}
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09c7c1ee307e..0651efa18d39 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -352,6 +352,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
 	sk->sk_destruct	   = inet_sock_destruct;
 	sk->sk_protocol	   = protocol;
 	sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;
+	sk->sk_bound_dev_if = current->sk_bind_dev_if;
 
 	inet->uc_ttl	= -1;
 	inet->mc_loop	= 1;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 47dae001a000..99e7ff5c56ac 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2480,7 +2480,9 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	fl4.daddr = dst;
 	fl4.saddr = src;
 	fl4.flowi4_tos = rtm->rtm_tos;
-	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
+	fl4.flowi4_oif = current->sk_bind_dev_if;
+	if (tb[RTA_OIF])
+		fl4.flowi4_oif = nla_get_u32(tb[RTA_OIF]);
 	fl4.flowi4_mark = mark;
 
 	if (iif) {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 7bc92ea4ae8f..7728be810982 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -189,6 +189,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 	sk->sk_destruct		= inet_sock_destruct;
 	sk->sk_family		= PF_INET6;
 	sk->sk_protocol		= protocol;
+	sk->sk_bound_dev_if	= current->sk_bind_dev_if;
 
 	sk->sk_backlog_rcv	= answer->prot->backlog_rcv;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 54fccf0d705d..8b7fd9eaceb4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3007,7 +3007,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	struct sk_buff *skb;
 	struct rtmsg *rtm;
 	struct flowi6 fl6;
-	int err, iif = 0, oif = 0;
+	int err, iif = 0, oif = current->sk_bind_dev_if;
 
 	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv6_policy);
 	if (err < 0)
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 15/16] net: Add chvrf command
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (13 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 18:31 ` [PATCH] iproute2: Add support for VRF device David Ahern
  2015-07-27 20:30 ` [net-next 0/16] Proposal for VRF-lite - v3 Eric W. Biederman
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Example of how to use the default bind to interface option for tasks and
correlate with VRF devices.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 tools/net/Makefile |   6 +-
 tools/net/chvrf.c  | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 229 insertions(+), 2 deletions(-)
 create mode 100644 tools/net/chvrf.c

diff --git a/tools/net/Makefile b/tools/net/Makefile
index ee577ea03ba5..c13f11f5637a 100644
--- a/tools/net/Makefile
+++ b/tools/net/Makefile
@@ -10,7 +10,7 @@ YACC = bison
 %.lex.c: %.l
 	$(LEX) -o $@ $<
 
-all : bpf_jit_disasm bpf_dbg bpf_asm
+all : bpf_jit_disasm bpf_dbg bpf_asm chvrf
 
 bpf_jit_disasm : CFLAGS = -Wall -O2 -DPACKAGE='bpf_jit_disasm'
 bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
@@ -25,8 +25,10 @@ bpf_asm : LDLIBS =
 bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o
 bpf_exp.lex.o : bpf_exp.yacc.c
 
+chvrf : CFLAGS = -Wall -O2
+
 clean :
-	rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.*
+	rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* chvrf
 
 install :
 	install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
diff --git a/tools/net/chvrf.c b/tools/net/chvrf.c
new file mode 100644
index 000000000000..71cc925fd101
--- /dev/null
+++ b/tools/net/chvrf.c
@@ -0,0 +1,225 @@
+/*
+ * chvrf.c - Example of how to use the default bind-to-device option for
+ *           tasks and correlate to VRFs via the VRF device.
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <signal.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <netinet/in.h>
+#include <net/if.h> /* for struct ifreq  */
+#include <libgen.h>
+#include <errno.h>
+
+#ifndef PR_SET_SK_BIND_DEV_IF
+#define PR_SET_SK_BIND_DEV_IF   47
+#endif
+#ifndef PR_GET_SK_BIND_DEV_IF
+#define PR_GET_SK_BIND_DEV_IF   48
+#endif
+
+static int vrf_to_device(int vrf)
+{
+	struct ifreq ifdata;
+	int sd, rc;
+
+	memset(&ifdata, 0, sizeof(ifdata));
+	snprintf(ifdata.ifr_name, sizeof(ifdata.ifr_name) - 1, "vrf%d", vrf);
+
+	sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (sd < 0) {
+		perror("socket failed");
+		return -1;
+	}
+
+	/* Get the index for the specified interface */
+	rc = ioctl(sd, SIOCGIFINDEX, (char *)&ifdata);
+	close(sd);
+	if (rc != 0) {
+		perror("ioctl(SIOCGIFINDEX) failed");
+		return -1;
+	}
+
+	return ifdata.ifr_ifindex;
+}
+
+static int device_to_vrf(int idx)
+{
+	struct ifreq ifdata;
+	int sd, vrf, rc;
+
+	memset(&ifdata, 0, sizeof(ifdata));
+	ifdata.ifr_ifindex = idx;
+
+	sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (sd < 0) {
+		perror("socket failed");
+		return -1;
+	}
+
+	/* Get the index for the specified interface */
+	rc = ioctl(sd, SIOCGIFNAME, (char *)&ifdata);
+	close(sd);
+	if (rc != 0) {
+		perror("ioctl(SIOCGIFNAME) failed");
+		return -1;
+	}
+
+	if (sscanf(ifdata.ifr_name, "vrf%d", &vrf) != 1) {
+		fprintf(stderr, "Unexpected device name (%s)\n", ifdata.ifr_name);
+		vrf = -1;
+	}
+
+	return vrf;
+}
+
+static int set_vrf(int vrf)
+{
+	int idx;
+	long err;
+
+	/* convert vrf to device index */
+	idx = vrf_to_device(vrf);
+	if (idx < 0) {
+		fprintf(stderr, "Failed to get device index for vrf %d\n", vrf);
+		return -1;
+	}
+
+	/* set default device bind */
+	err = prctl(PR_SET_SK_BIND_DEV_IF, idx);
+	if (err < 0) {
+		fprintf(stderr, "prctl failed to device index: %d\n", errno);
+		return -1;
+	}
+
+	return 0;
+}
+
+/* get vrf context for given process id */
+static int get_vrf(pid_t pid)
+{
+	int vrf;
+	long err;
+
+	/* lookup device index pid is tied to */
+	err = prctl(PR_GET_SK_BIND_DEV_IF, pid);
+	if (err < 0) {
+		fprintf(stderr, "prctl failed: %d\n", errno);
+		return -1;
+	}
+
+	if (err == 0)
+		return 0;
+
+	/* convert device index to vrf id */
+	vrf = device_to_vrf((int)err);
+	if (vrf < 0) {
+		fprintf(stderr, "Failed to get device index for vrf %d\n", vrf);
+		return -1;
+	}
+
+	return vrf;
+}
+
+static int run_vrf(char **argv, int vrf)
+{
+	char *cmd;
+
+	if (set_vrf(vrf) != 0) {
+		fprintf(stderr, "Failed to set vrf context\n");
+		return 1;
+	}
+
+	cmd = strdup(argv[0]);
+	if (!cmd) {
+		fprintf(stderr, "Failed to set command\n");
+		return 1;
+	}
+	argv[0] = basename(argv[0]);
+	if (execvp(cmd, argv) < 0)
+		perror("Failed to exec command\n");
+
+	return 1;
+}
+
+static int show_vrf(pid_t pid)
+{
+	int vrf = get_vrf(pid);
+
+	switch (vrf) {
+	case -1:
+		fprintf(stderr, "Failed to get vrf context for pid %d\n", pid);
+		if (kill(pid, 0) < 0) {
+			if (errno == ESRCH)
+				fprintf(stderr, "No process with given pid\n");
+		}
+		break;
+	case 0:
+		printf("Process %d is not running in a VRF context\n", pid);
+		break;
+	default:
+		printf("Process %d is running in VRF %d\n", pid, vrf);
+	}
+	return vrf < 0 ? 1 : 0;
+}
+
+static void usage(char *_prog)
+{
+	const char *prog = basename(_prog);
+
+	fprintf(stderr, "usage:\n");
+	fprintf(stderr, "\nShow VRF context for given pid\n");
+	fprintf(stderr, "\t%s -p pid\n", prog);
+	fprintf(stderr, "\nRun command in given VRF context\n");
+	fprintf(stderr, "\t%s -v vrf <command>\n", prog);
+}
+
+int main(int argc, char *argv[])
+{
+	int rc;
+	pid_t pid = 0;
+	int vrf = 0;
+
+	extern char *optarg;
+	extern int optind;
+
+	while ((rc = getopt(argc, argv, "+:p:v:")) != -1) {
+		switch (rc) {
+		case 'p':
+			pid = atoi(optarg);
+			break;
+		case 'v':
+			vrf = atoi(optarg);
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	if ((pid && vrf) || (!pid && !vrf)) {
+		usage(argv[0]);
+		return 1;
+	}
+
+	if (pid)
+		return show_vrf(pid);
+
+	if (optind == argc) {
+		usage(argv[0]);
+		return 1;
+	}
+
+	return run_vrf(&argv[optind], vrf);
+}
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH] iproute2: Add support for VRF device
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (14 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH net-next 15/16] net: Add chvrf command David Ahern
@ 2015-07-27 18:31 ` David Ahern
  2015-07-27 20:30 ` [net-next 0/16] Proposal for VRF-lite - v3 Eric W. Biederman
  16 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-27 18:31 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, svaidya, mingo,
	luto, David Ahern

Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/linux/if_link.h |  8 +++++
 ip/Makefile             |  2 +-
 ip/iplink.c             |  2 +-
 ip/iplink_vrf.c         | 87 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 8df6a8466839..28872fbf6814 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -337,6 +337,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
     iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-    iplink_geneve.o
+    iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index e296e6f611b8..892e8bc8808b 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -93,7 +93,7 @@ void iplink_usage(void)
 		fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
 		fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
 		fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-		fprintf(stderr, "          bond_slave | ipvlan | geneve }\n");
+		fprintf(stderr, "          bond_slave | ipvlan | geneve | vrf }\n");
 	}
 	exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index 000000000000..bfcb3cdeaf35
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,87 @@
+/* iplink_vrf.c	VRF device support
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Authors:     Shrijeet Mukherjee <shm@cumulusnetworks.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+	vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			    struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table = 0;
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table < 0 || table > 32767)
+				return table_arg();
+			/* XXX need a table in-use check here */
+			fprintf(stderr, "adding table %d\n", table);
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE])
+		fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			      FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 13/16] net: Introduce VRF device driver - v2
  2015-07-27 18:31 ` [PATCH net-next 13/16] net: Introduce VRF device driver - v2 David Ahern
@ 2015-07-27 20:01   ` Nikolay Aleksandrov
  2015-07-28 16:22     ` David Ahern
  0 siblings, 1 reply; 32+ messages in thread
From: Nikolay Aleksandrov @ 2015-07-27 20:01 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, ddutt, hannes, nicolas.dichtel,
	stephen, hadi, ebiederm, davem, svaidya, mingo, luto

On 07/27/2015 08:31 PM, David Ahern wrote:
> This driver borrows heavily from IPvlan and teaming drivers.
> 
> Routing domains (VRF-lite) are created by instantiating a VRF master
> device with an associated table and enslaving all routed interfaces that
> participate in the domain. As part of the enslavement, all connected
> routes for the enslaved devices are moved to the table associated with
> the VRF device. Outgoing sockets must bind to the VRF device to function.
> 
> Standard FIB rules bind the VRF device to tables and regular fib rule
> processing is followed. Routed traffic through the box, is forwarded by
> using the VRF device as the IIF and following the IIF rule to a table
> that is mated with the VRF.
> 
> Example:
> 
>    Create vrf 1:
>      ip link add vrf1 type vrf table 5
>      ip rule add iif vrf1 table 5
>      ip rule add oif vrf1 table 5
>      ip route add table 5 prohibit default
>      ip link set vrf1 up
> 
>    Add interface to vrf 1:
>      ip link set eth1 master vrf1
> 
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> 
> v2:
> - addressed comments from first RFC
> - significant changes to improve simplicity of implementation
> ---
>  drivers/net/Kconfig  |   7 +
>  drivers/net/Makefile |   1 +
>  drivers/net/vrf.c    | 596 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 604 insertions(+)
>  create mode 100644 drivers/net/vrf.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index c18f9e62a9fa..e58468b02987 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -297,6 +297,13 @@ config NLMON
>  	  diagnostics, etc. This is mostly intended for developers or support
>  	  to debug netlink issues. If unsure, say N.
>  
> +config NET_VRF
> +	tristate "Virtual Routing and Forwarding (Lite)"
> +	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
> +	---help---
> +	  This option enables the support for mapping interfaces into VRF's. The
> +	  support enables VRF devices.
> +
>  endif # NET_CORE
>  
>  config SUNGEM_PHY
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index c12cb22478a7..ca16dd689b36 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
>  obj-$(CONFIG_VXLAN) += vxlan.o
>  obj-$(CONFIG_GENEVE) += geneve.o
>  obj-$(CONFIG_NLMON) += nlmon.o
> +obj-$(CONFIG_NET_VRF) += vrf.o
>  
>  #
>  # Networking Drivers
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> new file mode 100644
> index 000000000000..8669b0f9d749
> --- /dev/null
> +++ b/drivers/net/vrf.c
> @@ -0,0 +1,596 @@
> +/*
> + * vrf.c: device driver to encapsulate a VRF space
> + *
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * Based on dummy, team and ipvlan drivers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +#include <linux/rtnetlink.h>
> +#include <net/rtnetlink.h>
> +#include <linux/u64_stats_sync.h>
> +#include <linux/hashtable.h>
> +
> +#include <linux/inetdevice.h>
> +#include <net/ip.h>
> +#include <net/ip_fib.h>
> +#include <net/ip6_route.h>
> +#include <net/rtnetlink.h>
> +#include <net/route.h>
> +#include <net/addrconf.h>
> +#include <net/vrf.h>
> +
> +#define DRV_NAME	"vrf"
> +#define DRV_VERSION	"1.0"
> +
> +#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
> +#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
> +
> +#define vrf_master_get_rcu(dev) \
> +	((struct net_device *)rcu_dereference(dev->rx_handler_data))
> +
> +struct pcpu_dstats {
> +	u64			tx_pkts;
> +	u64			tx_bytes;
> +	u64			tx_drps;
> +	u64			rx_pkts;
> +	u64			rx_bytes;
> +	struct u64_stats_sync	syncp;
> +};
> +
> +struct slave {
> +	struct list_head	list;
> +	struct net_device	*dev;
> +};
> +
> +struct slave_queue {
> +	spinlock_t		lock; /* lock for slave insert/delete */
^^^^
I don't think you actually need this lock since all VRF dev operations are done
under RTNL so you already got protection against add/del running concurrently.
It would simplify the code if you can get rid of it.

> +	struct list_head	all_slaves;
> +	int			num_slaves;
> +};
> +
> +struct net_vrf {
> +	struct slave_queue	queue;
> +	struct fib_table	*tb;
> +	u32			tb_id;
> +};
> +
> +static bool is_ip_rx_frame(struct sk_buff *skb)
> +{
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +	case htons(ETH_P_IPV6):
> +		return true;
> +	}
> +	return false;
> +}
> +
> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +
> +	if (is_ip_rx_frame(skb)) {
> +		struct net_device *dev = vrf_master_get_rcu(skb->dev);
> +		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +		u64_stats_update_begin(&dstats->syncp);
> +		dstats->rx_pkts++;
> +		dstats->rx_bytes += skb->len;
> +		u64_stats_update_end(&dstats->syncp);
> +
> +		skb->dev = dev;
> +
> +		return RX_HANDLER_ANOTHER;
> +	}
> +	return RX_HANDLER_PASS;
> +}
> +
> +static struct rtnl_link_stats64 *vrf_get_stats64(struct net_device *dev,
> +						 struct rtnl_link_stats64 *stats)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		const struct pcpu_dstats *dstats;
> +		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
> +		unsigned int start;
> +
> +		dstats = per_cpu_ptr(dev->dstats, i);
> +		do {
> +			start = u64_stats_fetch_begin_irq(&dstats->syncp);
> +			tbytes = dstats->tx_bytes;
> +			tpkts = dstats->tx_pkts;
> +			tdrops = dstats->tx_drps;
> +			rbytes = dstats->rx_bytes;
> +			rpkts = dstats->rx_pkts;
> +		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
> +		stats->tx_bytes += tbytes;
> +		stats->tx_packets += tpkts;
> +		stats->tx_dropped += tdrops;
> +		stats->rx_bytes += rbytes;
> +		stats->rx_packets += rpkts;
> +	}
> +	return stats;
> +}
> +
> +static int vrf_process_v4_outbound(struct sk_buff *skb)
> +{
> +	const struct iphdr *iph = ip_hdr(skb);
> +	struct net_device *dev = skb->dev;
> +	struct net *net = dev_net(dev);
> +	int tb_id = vrf_dev_table(dev);
> +	int err, ret = NET_XMIT_DROP;
> +	struct in_device *in_dev;
> +	struct flowi4 fl4 = {
> +		.flowi4_tos = RT_TOS(iph->tos),
> +		.daddr = iph->daddr,
> +		.saddr = iph->saddr,
> +	};
> +	struct rtable *rth;
> +
> +	struct fib_result res;
> +	struct fib_table *tbl;
> +
> +	rcu_read_lock();
> +	in_dev = __in_dev_get_rcu(dev);
> +	if (!in_dev)
> +		goto out;
> +
> +	if (tb_id == 0)
> +		goto out;
> +
> +	tbl = fib_get_table(net, tb_id);
> +	if (!tbl)
> +		goto out;
> +
> +	res.tclassid    = 0;
> +	res.fi          = NULL;
> +	res.table       = NULL;
> +	if (fib_table_lookup(tbl, &fl4, &res, 0) != 0)
> +		goto out;
> +
> +	dev = FIB_RES_DEV(res);
> +
> +	rth = ip_route_new_rtable(dev, RTCF_LOCAL, res.type,
> +				  IN_DEV_CONF_GET(in_dev, NOPOLICY),
> +				  IN_DEV_CONF_GET(in_dev, NOXFRM),
> +				  true);
> +	if (!rth)
> +		goto out;
> +
> +	ip_route_set_nexthop(rth, fl4.daddr, &res);
> +
> +	skb_dst_drop(skb);
> +	skb_dst_set(skb, &rth->dst);
> +	err = ip_local_out(skb);
> +	if (!net_xmit_eval(err))
> +		ret = NET_XMIT_SUCCESS;
> +
> +out:
> +	rcu_read_unlock();
> +
> +	if (ret != NET_XMIT_SUCCESS)
> +		dev->stats.tx_errors++;
> +
> +	return ret;
> +}
> +
> +static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *mdev)
> +{
> +	struct pcpu_dstats *dstats = this_cpu_ptr(mdev->dstats);
> +	unsigned int len = skb->len;
> +	int ret = NET_XMIT_DROP;
> +
> +	if (skb_mac_header_was_set(skb)) {
> +		skb_pull(skb, sizeof(struct ethhdr));
> +		skb->mac_header = (typeof(skb->mac_header))~0U;
> +		skb_reset_network_header(skb);
> +	}
> +
> +	if (skb->protocol == htons(ETH_P_IP))
> +		ret = vrf_process_v4_outbound(skb);
> +
> +	if (ret == NET_XMIT_SUCCESS) {
> +		u64_stats_update_begin(&dstats->syncp);
> +		dstats->tx_pkts++;
> +		dstats->tx_bytes += len;
> +		u64_stats_update_end(&dstats->syncp);
> +	} else {
> +		dstats->tx_drps++;
> +		kfree_skb(skb);
> +	}
> +	return ret;
> +}
> +
> +/**************************** device handling ********************/
> +
> +/* cycle interface to flush neighbor cache and move routes across tables */
> +static void cycle_netdev(struct net_device *dev)
> +{
> +	unsigned int flags = dev->flags;
> +	int ret;
> +
> +	if (!netif_running(dev))
> +		return;
> +
> +	ret = dev_change_flags(dev, flags & ~IFF_UP);
> +	if (ret >= 0)
> +		ret = dev_change_flags(dev, flags);
> +
> +	if (ret < 0) {
> +		netdev_err(dev,
> +			   "Failed to cycle device %s; route tables might be wrong!\n",
> +			   dev->name);
> +	}
> +}
> +
> +/* queue->lock must be held */
> +static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
> +					  struct net_device *dev)
> +{
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
^^^^
list_for_each_entry()

> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		if (slave->dev == dev)
> +			return slave;
> +	}
> +
> +	return NULL;
> +}
> +
> +/* inverse of __vrf_insert_slave; queue->lock must be held */
> +static void __vrf_remove_slave(struct slave_queue *queue, struct slave *slave)
> +{
> +	dev_put(slave->dev);
> +	list_del(&slave->list);
> +	queue->num_slaves--;
> +}
> +
> +/* queue->lock must be held */
> +static void __vrf_insert_slave(struct slave_queue *queue, struct slave *slave)
> +{
> +	dev_hold(slave->dev);
> +	list_add(&slave->list, &queue->all_slaves);
> +	queue->num_slaves++;
> +}
> +
> +static int do_vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
> +{
> +	struct net_vrf_dev *vrf_ptr = kmalloc(sizeof(*vrf_ptr), GFP_KERNEL);
> +	struct slave *slave = kzalloc(sizeof(*slave), GFP_KERNEL);
> +	struct slave *duplicate_slave;
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	int ret = -ENOMEM;
> +
> +	if (!slave || !vrf_ptr)
> +		goto out_fail;
> +
> +	slave->dev = port_dev;
> +
> +	vrf_ptr->ifindex = dev->ifindex;
> +	vrf_ptr->tb_id = vrf->tb_id;
> +
> +	spin_lock_bh(&queue->lock);
> +
> +	duplicate_slave = __vrf_find_slave_dev(queue, port_dev);
> +	if (duplicate_slave) {
> +		spin_unlock_bh(&queue->lock);
> +		ret = -EBUSY;
> +		goto out_fail;
> +	}
> +
> +	__vrf_insert_slave(queue, slave);
> +
> +	spin_unlock_bh(&queue->lock);
> +
> +	/* register the packet handler for slave ports */
> +	ret = netdev_rx_handler_register(port_dev, vrf_handle_frame, dev);
> +	if (ret) {
> +		netdev_err(port_dev,
> +			   "Device %s failed to register rx_handler\n",
> +			   port_dev->name);
> +		goto out_remove;
> +	}
> +
> +	ret = netdev_master_upper_dev_link(port_dev, dev);
> +	if (ret < 0)
> +		goto out_unregister;
> +
> +	port_dev->flags |= IFF_SLAVE;
> +
> +	rcu_assign_pointer(port_dev->vrf_ptr, vrf_ptr);
> +	cycle_netdev(port_dev);
> +
> +	return 0;
> +
> +out_unregister:
> +	netdev_rx_handler_unregister(port_dev);
> +out_remove:
> +	__vrf_remove_slave(queue, slave);
> +out_fail:
> +	kfree(vrf_ptr);
> +	kfree(slave);
> +	return ret;
> +}
> +
> +static int vrf_add_slave(struct net_device *dev, struct net_device *port_dev)
> +{
> +	ASSERT_RTNL();
These asserts are not needed, the ndo_slave_add/del always run with rtnl and
many drivers rely on that.

> +
> +	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
> +		return -ENODEV;

I don't think the !dev or !port_dev can happen. Also the net can't be different
looking at do_set_master() in rtnetlink.c

> +
> +	if (!vrf_is_master(dev) || vrf_is_master(port_dev) ||

Hmm, this means that bonds won't be able to be VRF slaves.
They have the IFF_MASTER flag set.

> +	    vrf_is_slave(port_dev))
> +		return -EINVAL;
> +
> +	return do_vrf_add_slave(dev, port_dev);
> +}
> +
> +/* inverse of do_vrf_add_slave */
> +static int do_vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct net_vrf_dev *vrf_ptr = NULL;
> +	struct slave *slave;
> +
> +	vrf_ptr = rcu_dereference(dev->vrf_ptr);
> +	RCU_INIT_POINTER(dev->vrf_ptr, NULL);

I think this isn't safe, you should wait for a grace period before freeing the
pointer. Actually you can just move the kfree() below the netdev_rx_handler_unregister()
since it does synchronize_rcu() anyway.

> +	kfree(vrf_ptr);
> +
> +	netdev_upper_dev_unlink(port_dev, dev);
> +	port_dev->flags &= ~IFF_SLAVE;
> +
> +	netdev_rx_handler_unregister(dev);
> +
> +	cycle_netdev(port_dev);
> +
> +	spin_lock_bh(&queue->lock);
> +
> +	slave = __vrf_find_slave_dev(queue, port_dev);
> +	if (slave)
> +		__vrf_remove_slave(queue, slave);
> +
> +	spin_unlock_bh(&queue->lock);
> +
> +	kfree(slave);
> +
> +	return 0;
> +}
> +
> +static int vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
> +{
> +	ASSERT_RTNL();
> +
> +	if (!dev || !port_dev)
> +		return -ENODEV;
^
I really don't think any of this can happen.

> +
> +	if (!vrf_is_master(dev))
> +		return -EINVAL;
> +
> +	return do_vrf_del_slave(dev, port_dev);
> +}
> +
> +static int vrf_dev_init(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	spin_lock_init(&vrf->queue.lock);
> +	INIT_LIST_HEAD(&vrf->queue.all_slaves);
> +
> +	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
> +	if (!dev->dstats)
> +		return -ENOMEM;
> +
> +	dev->flags  =  IFF_MASTER | IFF_NOARP;
> +
> +	return 0;
> +}
> +
> +static void vrf_dev_uninit(struct net_device *dev)
> +{
> +	free_percpu(dev->dstats);
> +}
> +
> +static int vrf_dev_close(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
^^^
list_for_each_entry()

> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = 0;
> +		slave->dev->vrf_ptr->tb_id = 0;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags &= ~IFF_UP;
> +
> +	return 0;
> +}
> +
> +static int vrf_dev_open(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
^^^
list_for_each_entry()

> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = dev->ifindex;
> +		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags |= IFF_UP;
> +
> +	if (!vrf->tb)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static const struct net_device_ops vrf_netdev_ops = {
> +	.ndo_init		= vrf_dev_init,
> +	.ndo_uninit		= vrf_dev_uninit,
> +	.ndo_open		= vrf_dev_open,
> +	.ndo_stop		= vrf_dev_close,
> +	.ndo_start_xmit		= vrf_xmit,
> +	.ndo_get_stats64	= vrf_get_stats64,
> +	.ndo_add_slave		= vrf_add_slave,
> +	.ndo_del_slave		= vrf_del_slave,
> +};
> +
> +static void vrf_get_drvinfo(struct net_device *dev,
> +			    struct ethtool_drvinfo *info)
> +{
> +	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
> +	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
> +}
> +
> +static const struct ethtool_ops vrf_ethtool_ops = {
> +	.get_drvinfo	= vrf_get_drvinfo,
> +};
> +
> +static void vrf_setup(struct net_device *dev)
> +{
> +	ether_setup(dev);
> +
> +	/* Initialize the device structure. */
> +	dev->netdev_ops = &vrf_netdev_ops;
> +	dev->ethtool_ops = &vrf_ethtool_ops;
> +	dev->destructor = free_netdev;
> +
> +	/* Fill in device structure with ethernet-generic values. */
> +	eth_hw_addr_random(dev);
> +}
> +
> +static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +	if (tb[IFLA_ADDRESS]) {
> +		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> +			return -EINVAL;
> +		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> +			return -EADDRNOTAVAIL;
> +	}
> +	return 0;
> +}
> +
> +static int vrf_newlink(struct net *src_net, struct net_device *dev,
> +		       struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	int err;
> +
> +	if (!data || !data[IFLA_VRF_TABLE])
> +		return -EINVAL;
> +
> +	vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
> +
> +	/* reserve a table for this VRF device */
> +	err = -ERANGE;
> +	vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
> +	if (!vrf->tb)
> +		goto out_fail;
> +
> +	dev->priv_flags |= IFF_VRF_MASTER;
> +
> +	err = -ENOMEM;
> +	dev->vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
> +	if (!dev->vrf_ptr)
> +		goto out_fail;
> +
> +	dev->vrf_ptr->ifindex = dev->ifindex;
> +	dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +	err = register_netdevice(dev);
> +	if (err < 0)
> +		goto out_fail;
> +
> +	return 0;
> +
> +out_fail:
> +	kfree(dev->vrf_ptr);
> +	free_netdev(dev);
> +	return err;
> +}
> +
> +static void vrf_dellink(struct net_device *dev, struct list_head *head)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	kfree(dev->vrf_ptr);
> +	fib_free_table(vrf->tb);
> +}
> +
> +static size_t vrf_nl_getsize(const struct net_device *dev)
> +{
> +	return nla_total_size(sizeof(u32));  /* IFLA_VRF_TABLE */
> +}
> +
> +static int vrf_fillinfo(struct sk_buff *skb,
> +			const struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	return nla_put_u32(skb, IFLA_VRF_TABLE, vrf->tb_id);
> +}
> +
> +static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
> +	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
> +};
> +
> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
> +	.kind		= DRV_NAME,
> +	.priv_size	= sizeof(struct net_vrf),
> +
> +	.get_size	= vrf_nl_getsize,
> +	.policy		= vrf_nl_policy,
> +	.validate	= vrf_validate,
> +	.fill_info	= vrf_fillinfo,
> +
> +	.newlink	= vrf_newlink,
> +	.dellink	= vrf_dellink,
> +	.setup		= vrf_setup,
> +	.maxtype	= IFLA_VRF_MAX,
> +};
> +
> +static int __init vrf_init_module(void)
> +{
> +	return rtnl_link_register(&vrf_link_ops);
> +}
> +
> +static void __exit vrf_cleanup_module(void)
> +{
> +	rtnl_link_unregister(&vrf_link_ops);
> +}
> +
> +module_init(vrf_init_module);
> +module_exit(vrf_cleanup_module);
> +MODULE_AUTHOR("Shrijeet Mukherjee, David Ahern");
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> +MODULE_VERSION(DRV_VERSION);
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net-next 0/16] Proposal for VRF-lite - v3
  2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
                   ` (15 preceding siblings ...)
  2015-07-27 18:31 ` [PATCH] iproute2: Add support for VRF device David Ahern
@ 2015-07-27 20:30 ` Eric W. Biederman
  2015-07-28 16:02   ` David Ahern
  16 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-27 20:30 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

David Ahern <dsa@cumulusnetworks.com> writes:

> In the context of internet scale routing a requirement that always comes
> up is the need to partition the available routing tables into disjoint
> routing planes. A specific use case is the multi-tenancy problem where
> each tenant has their own unique routing tables and in the very least
> need different default gateways.
>
> This patch allows the ability to create virtual router domains (aka VRFs
> (VRF-lite to be specific) in the linux packet forwarding stack. The main
> observation is that through the use of rules and socket binding to interfaces,
> all the facilities that we need are already present in the infrastructure. What
> is missing is a handle that identifies a routing domain and can be used to
> gather applicable rules/tables and uniqify neighbor selection. The scheme used
> needs to preserves the notions of ECMP, and general routing
> principles.

This paragraph is false when it comes to sockets, as I have already
pointed out.

- VPN Routing and Forwarding (RFC4364 and it's kin) implies isolation
  strong enough to allow using the the same ip on different machines
  in different VPN instances and not have confusion.

- The routing table is not the only table in the kernel that uses
  an ip address as a key.

  The result is that you can combine packets fragments that come in
  on different interfaces (irrespective of your VPN), confuse tcp
  parameters between interfaces, scramble your ipsec connections and I
  don't know what else.

Binding a socket to a network device is not strong enough to do what you
want to do and it will lead to subtle bugs, that can be triggered by
accident or by hostile actors.

If these kinds of limitations are well documented and it is specified
that these kinds of problems can occur with your socket code there may
be a place for this code somewhere.

However described like it is your code is wrong and fundmentally broken.
>
> Version 3
> - addressed comments from first 2 RFCs with the exception of the name
>   Nicolas: We will do the name conversion once we agree on what the
>            correct name should be (vrf, mrf or something else)

Not so.  I described the deep problems between your goals and your
implementation and they are not even mentioned let alone addressed.

> -  packets flow through the VRF device in both directions allowing the
>    following:
>    - tcpdump -i vrf<n>
>    - tc rules on vrf device
>    - netfilter rules on vrf device
>
> Ingo/Andy: I added you two as a start point for the proposed task related
>            changes. Not sure who should be the reviewer; please let me know
>            if someone else is more appropriate. Thanks.

It looks like you are trying to implement a namespace that isn't a
namespace.  Given that it is broken by design you have my nack.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-27 18:31 ` [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct David Ahern
@ 2015-07-27 20:33   ` Eric W. Biederman
  2015-07-28 12:19     ` Hannes Frederic Sowa
  2015-07-28 15:25   ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-27 20:33 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

David Ahern <dsa@cumulusnetworks.com> writes:

> Allow tasks to have a default device index for binding sockets. If set
> the value is passed to all AF_INET/AF_INET6 sockets when they are
> created.
>
> The task setting is passed parent to child on fork, but can be set or
> changed after task creation using prctl (if task has CAP_NET_ADMIN
> permissions). The setting for a socket can be retrieved using prctl().
> This option allows an administrator to restrict a task to only send/receive
> packets through the specified device. In the case of VRF devices this
> option restricts tasks to a specific VRF.
>
> Correlation of the device index to a specific VRF, ie.,
>    ifindex --> VRF device --> VRF id
> is left to userspace.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Because it is broken by design.  Your routing device is only safe for
programs that know it's limitations it is not appropriate for general
applications.

Since you don't even seen to know it's limitations I think this is a
bad path to walk down.

> Example using VRF devices:
> 1. vrf1 is created and assigned to table 5
> 2. eth2 is enslaved to vrf1
> 3. eth2 is given the address 1.1.1.1/24
>
> $ ip route ls table 5
> prohibit default
> 1.1.1.0/24 dev eth2  scope link
> local 1.1.1.1 dev eth2  proto kernel  scope host  src 1.1.1.1
>
> With out setting a VRF context ping, tcp and udp attempts fail. e.g,
> $ ping 1.1.1.254
> connect: Network is unreachable
>
> After binding the task to the vrf device ping succeeds:
> $ ./chvrf -v 1 ping -c1 1.1.1.254
> PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data.
> 64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
>  include/linux/sched.h      |  3 +++
>  include/uapi/linux/prctl.h |  4 ++++
>  kernel/fork.c              |  2 ++
>  kernel/sys.c               | 35 +++++++++++++++++++++++++++++++++++
>  net/ipv4/af_inet.c         |  1 +
>  net/ipv4/route.c           |  4 +++-
>  net/ipv6/af_inet6.c        |  1 +
>  net/ipv6/route.c           |  2 +-
>  8 files changed, 50 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04b5ada460b4..29b336b8a466 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1528,6 +1528,9 @@ struct task_struct {
>  	struct files_struct *files;
>  /* namespaces */
>  	struct nsproxy *nsproxy;
> +/* network */
> +	/* if set INET/INET6 sockets are bound to given dev index on create */
> +	int sk_bind_dev_if;
>  /* signal handlers */
>  	struct signal_struct *signal;
>  	struct sighand_struct *sighand;
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 31891d9535e2..1ef45195d146 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -190,4 +190,8 @@ struct prctl_mm_map {
>  # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
>  # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
>  
> +/* get/set network interface sockets are bound to by default */
> +#define PR_SET_SK_BIND_DEV_IF   47
> +#define PR_GET_SK_BIND_DEV_IF   48
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dbd9b8d7b7cc..8b396e77d2bf 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -380,6 +380,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>  	tsk->splice_pipe = NULL;
>  	tsk->task_frag.page = NULL;
>  
> +	tsk->sk_bind_dev_if = orig->sk_bind_dev_if;
> +
>  	account_kernel_stack(ti, 1);
>  
>  	return tsk;
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 259fda25eb6b..59119ac0a0bd 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -52,6 +52,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/uidgid.h>
>  #include <linux/cred.h>
> +#include <linux/netdevice.h>
>  
>  #include <linux/kmsg_dump.h>
>  /* Move somewhere else to avoid recompiling? */
> @@ -2267,6 +2268,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_GET_FP_MODE:
>  		error = GET_FP_MODE(me);
>  		break;
> +#ifdef CONFIG_NET
> +	case PR_SET_SK_BIND_DEV_IF:
> +	{
> +		struct net_device *dev;
> +		int idx = (int) arg2;
> +
> +		if (!capable(CAP_NET_ADMIN))
> +			return -EPERM;
> +
> +		if (idx) {
> +			dev = dev_get_by_index(me->nsproxy->net_ns, idx);
> +			if (!dev)
> +				return -EINVAL;
> +			dev_put(dev);
> +		}
> +		me->sk_bind_dev_if = idx;
> +		break;
> +	}
> +	case PR_GET_SK_BIND_DEV_IF:
> +	{
> +		struct task_struct *tsk;
> +		int sk_bind_dev_if = -EINVAL;
> +
> +		rcu_read_lock();
> +		tsk = find_task_by_vpid(arg2);
> +		if (tsk)
> +			sk_bind_dev_if = tsk->sk_bind_dev_if;
> +		rcu_read_unlock();
> +		if (tsk != me && !capable(CAP_NET_ADMIN))
> +			return -EPERM;
> +		error = sk_bind_dev_if;
> +		break;
> +	}
> +#endif
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 09c7c1ee307e..0651efa18d39 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -352,6 +352,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
>  	sk->sk_destruct	   = inet_sock_destruct;
>  	sk->sk_protocol	   = protocol;
>  	sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;
> +	sk->sk_bound_dev_if = current->sk_bind_dev_if;
>  
>  	inet->uc_ttl	= -1;
>  	inet->mc_loop	= 1;
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 47dae001a000..99e7ff5c56ac 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2480,7 +2480,9 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
>  	fl4.daddr = dst;
>  	fl4.saddr = src;
>  	fl4.flowi4_tos = rtm->rtm_tos;
> -	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
> +	fl4.flowi4_oif = current->sk_bind_dev_if;
> +	if (tb[RTA_OIF])
> +		fl4.flowi4_oif = nla_get_u32(tb[RTA_OIF]);
>  	fl4.flowi4_mark = mark;
>  
>  	if (iif) {
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 7bc92ea4ae8f..7728be810982 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -189,6 +189,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
>  	sk->sk_destruct		= inet_sock_destruct;
>  	sk->sk_family		= PF_INET6;
>  	sk->sk_protocol		= protocol;
> +	sk->sk_bound_dev_if	= current->sk_bind_dev_if;
>  
>  	sk->sk_backlog_rcv	= answer->prot->backlog_rcv;
>  
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 54fccf0d705d..8b7fd9eaceb4 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3007,7 +3007,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
>  	struct sk_buff *skb;
>  	struct rtmsg *rtm;
>  	struct flowi6 fl6;
> -	int err, iif = 0, oif = 0;
> +	int err, iif = 0, oif = current->sk_bind_dev_if;
>  
>  	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv6_policy);
>  	if (err < 0)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-27 20:33   ` Eric W. Biederman
@ 2015-07-28 12:19     ` Hannes Frederic Sowa
  2015-07-28 13:54       ` Eric W. Biederman
  2015-07-28 16:01       ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2015-07-28 12:19 UTC (permalink / raw)
  To: Eric W. Biederman, David Ahern
  Cc: netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

Hello Eric,

On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:
> David Ahern <dsa@cumulusnetworks.com> writes:
> 
> > Allow tasks to have a default device index for binding sockets. If 
> > set
> > the value is passed to all AF_INET/AF_INET6 sockets when they are
> > created.
> > 
> > The task setting is passed parent to child on fork, but can be set 
> > or
> > changed after task creation using prctl (if task has CAP_NET_ADMIN
> > permissions). The setting for a socket can be retrieved using 
> > prctl().
> > This option allows an administrator to restrict a task to only 
> > send/receive
> > packets through the specified device. In the case of VRF devices 
> > this
> > option restricts tasks to a specific VRF.
> > 
> > Correlation of the device index to a specific VRF, ie.,
> >    ifindex --> VRF device --> VRF id
> > is left to userspace.
> 
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Because it is broken by design.  Your routing device is only safe for
> programs that know it's limitations it is not appropriate for general
> applications.
> 
> Since you don't even seen to know it's limitations I think this is a
> bad path to walk down.

Can you please elaborate about the broken by design?

Different operating systems are already using this approach with good
success. I read your other mail regarding isolation of different VRFs
and I agree that all code which persists state depending solely on the
IP address is affected by this and this must be dealt with and fixed
(actually, there aren't too many).

But I wouldn't call that broken by design. This stuff will get fixed
like e.g. cross-talk between fragmentation queues, icmp rate limiters
etc, which could already happen in the past.

What is your opinion on the fundamental approach only from a user
perspective? Do you think that is broken, too?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 12:19     ` Hannes Frederic Sowa
@ 2015-07-28 13:54       ` Eric W. Biederman
  2015-07-28 14:20         ` Hannes Frederic Sowa
  2015-07-28 16:01       ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-28 13:54 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Ahern, netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

Hannes Frederic Sowa <hannes@stressinduktion.org> writes:

> Hello Eric,
>
> On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:
>> David Ahern <dsa@cumulusnetworks.com> writes:
>> 
>> > Allow tasks to have a default device index for binding sockets. If 
>> > set
>> > the value is passed to all AF_INET/AF_INET6 sockets when they are
>> > created.
>> > 
>> > The task setting is passed parent to child on fork, but can be set 
>> > or
>> > changed after task creation using prctl (if task has CAP_NET_ADMIN
>> > permissions). The setting for a socket can be retrieved using 
>> > prctl().
>> > This option allows an administrator to restrict a task to only 
>> > send/receive
>> > packets through the specified device. In the case of VRF devices 
>> > this
>> > option restricts tasks to a specific VRF.
>> > 
>> > Correlation of the device index to a specific VRF, ie.,
>> >    ifindex --> VRF device --> VRF id
>> > is left to userspace.
>> 
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> Because it is broken by design.  Your routing device is only safe for
>> programs that know it's limitations it is not appropriate for general
>> applications.
>> 
>> Since you don't even seen to know it's limitations I think this is a
>> bad path to walk down.
>
> Can you please elaborate about the broken by design?
>
> Different operating systems are already using this approach with good
> success. I read your other mail regarding isolation of different VRFs
> and I agree that all code which persists state depending solely on the
> IP address is affected by this and this must be dealt with and fixed
> (actually, there aren't too many).

The size of struct net would tend to disagree with the assertion that
there are not too many.

> But I wouldn't call that broken by design. This stuff will get fixed
> like e.g. cross-talk between fragmentation queues, icmp rate limiters
> etc, which could already happen in the past.
>
> What is your opinion on the fundamental approach only from a user
> perspective? Do you think that is broken, too?

I think promising something to userspace that a design can not deliver
is a fundamental problem.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 13:54       ` Eric W. Biederman
@ 2015-07-28 14:20         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2015-07-28 14:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Ahern, netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

On Tue, 2015-07-28 at 08:54 -0500, Eric W. Biederman wrote:
> Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
> 
> > Hello Eric,
> > 
> > On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:
> > > David Ahern <dsa@cumulusnetworks.com> writes:
> > > 
> > > > Allow tasks to have a default device index for binding sockets. 
> > > > If 
> > > > set
> > > > the value is passed to all AF_INET/AF_INET6 sockets when they 
> > > > are
> > > > created.
> > > > 
> > > > The task setting is passed parent to child on fork, but can be 
> > > > set 
> > > > or
> > > > changed after task creation using prctl (if task has 
> > > > CAP_NET_ADMIN
> > > > permissions). The setting for a socket can be retrieved using 
> > > > prctl().
> > > > This option allows an administrator to restrict a task to only 
> > > > send/receive
> > > > packets through the specified device. In the case of VRF devices 
> > > > this
> > > > option restricts tasks to a specific VRF.
> > > > 
> > > > Correlation of the device index to a specific VRF, ie.,
> > > >    ifindex --> VRF device --> VRF id
> > > > is left to userspace.
> > > 
> > > Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > 
> > > Because it is broken by design.  Your routing device is only safe 
> > > for
> > > programs that know it's limitations it is not appropriate for 
> > > general
> > > applications.
> > > 
> > > Since you don't even seen to know it's limitations I think this is 
> > > a
> > > bad path to walk down.
> > 
> > Can you please elaborate about the broken by design?
> > 
> > Different operating systems are already using this approach with 
> > good
> > success. I read your other mail regarding isolation of different 
> > VRFs
> > and I agree that all code which persists state depending solely on 
> > the
> > IP address is affected by this and this must be dealt with and fixed
> > (actually, there aren't too many).
> 
> The size of struct net would tend to disagree with the assertion that
> there are not too many.

netns_frags and inet_peer comes to my mind at first. 

All those data structures simply need to have an opaque id added to the
hash and comparison functions to deal with this problem. And we will
need this in future anyway, as openvswitch will get connection tracking
support and thus the fragmentation engine and icmp rate limiter will
need to be taught about zones in OVS.

> > But I wouldn't call that broken by design. This stuff will get fixed
> > like e.g. cross-talk between fragmentation queues, icmp rate 
> > limiters
> > etc, which could already happen in the past.
> > 
> > What is your opinion on the fundamental approach only from a user
> > perspective? Do you think that is broken, too?
> 
> I think promising something to userspace that a design can not deliver
> is a fundamental problem.

You are still talking about the isolation aspect, right?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-27 18:31 ` [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct David Ahern
  2015-07-27 20:33   ` Eric W. Biederman
@ 2015-07-28 15:25   ` Andy Lutomirski
  2015-07-28 16:11     ` David Ahern
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-07-28 15:25 UTC (permalink / raw)
  To: David Ahern
  Cc: gospo, Eric W. Biederman, ddutt, Hannes Frederic Sowa,
	Ingo Molnar, Stephen Hemminger, Nicolas Dichtel, jtoppins,
	Network Development, David S. Miller, Shrijeet Mukherjee,
	svaidya, hadi, nikolay, roopa

On Jul 27, 2015 11:33 AM, "David Ahern" <dsa@cumulusnetworks.com> wrote:
>
> Allow tasks to have a default device index for binding sockets. If set
> the value is passed to all AF_INET/AF_INET6 sockets when they are created.
>

This is not intended to be a review of the concept.  I haven't thought
about whether the concept is a good idea, broken by design, or
whatever.  FWIW, if this were added to the kernel and didn't require
excessive privilege, I'd probably use it.  (I still don't really
understand why binding to a device requires privilege in the first
place, but, again, I haven't thought about it very much.)

> +#ifdef CONFIG_NET
> +       case PR_SET_SK_BIND_DEV_IF:
> +       {
> +               struct net_device *dev;
> +               int idx = (int) arg2;
> +
> +               if (!capable(CAP_NET_ADMIN))
> +                       return -EPERM;
> +

Can you either use ns_capable or add a comment as to why not?

Also, please return -EINVAL if unused args are nonzero.

> +               if (idx) {
> +                       dev = dev_get_by_index(me->nsproxy->net_ns, idx);
> +                       if (!dev)
> +                               return -EINVAL;
> +                       dev_put(dev);
> +               }
> +               me->sk_bind_dev_if = idx;
> +               break;
> +       }
> +       case PR_GET_SK_BIND_DEV_IF:
> +       {
> +               struct task_struct *tsk;
> +               int sk_bind_dev_if = -EINVAL;
> +
> +               rcu_read_lock();
> +               tsk = find_task_by_vpid(arg2);
> +               if (tsk)
> +                       sk_bind_dev_if = tsk->sk_bind_dev_if;

Why do you support different tasks here?  Could this use proc instead?

The same -EINVAL issue applies.

Also, I think you need to hook setns and unshare to do something
reasonable when the task is bound to a device.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 12:19     ` Hannes Frederic Sowa
  2015-07-28 13:54       ` Eric W. Biederman
@ 2015-07-28 16:01       ` Eric Dumazet
  2015-07-28 16:07         ` David Ahern
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-07-28 16:01 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric W. Biederman, David Ahern, netdev, shm, roopa, gospo,
	jtoppins, nikolay, ddutt, nicolas.dichtel, stephen, hadi, davem,
	svaidya, mingo, luto

On Tue, 2015-07-28 at 14:19 +0200, Hannes Frederic Sowa wrote:
> Hello Eric,
> 
> On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:
> > David Ahern <dsa@cumulusnetworks.com> writes:
> > 
> > > Allow tasks to have a default device index for binding sockets. If 
> > > set
> > > the value is passed to all AF_INET/AF_INET6 sockets when they are
> > > created.
> > > 
> > > The task setting is passed parent to child on fork, but can be set 
> > > or
> > > changed after task creation using prctl (if task has CAP_NET_ADMIN
> > > permissions). The setting for a socket can be retrieved using 
> > > prctl().
> > > This option allows an administrator to restrict a task to only 
> > > send/receive
> > > packets through the specified device. In the case of VRF devices 
> > > this
> > > option restricts tasks to a specific VRF.
> > > 
> > > Correlation of the device index to a specific VRF, ie.,
> > >    ifindex --> VRF device --> VRF id
> > > is left to userspace.
> > 
> > Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > 
> > Because it is broken by design.  Your routing device is only safe for
> > programs that know it's limitations it is not appropriate for general
> > applications.
> > 
> > Since you don't even seen to know it's limitations I think this is a
> > bad path to walk down.
> 
> Can you please elaborate about the broken by design?
> 
> Different operating systems are already using this approach with good
> success. I read your other mail regarding isolation of different VRFs
> and I agree that all code which persists state depending solely on the
> IP address is affected by this and this must be dealt with and fixed
> (actually, there aren't too many).
> 
> But I wouldn't call that broken by design. This stuff will get fixed
> like e.g. cross-talk between fragmentation queues, icmp rate limiters
> etc, which could already happen in the past.
> 
> What is your opinion on the fundamental approach only from a user
> perspective? Do you think that is broken, too?

I agree with Eric here.

This sk_bind_dev_if on task_struct is quite a hack.

What will be added next ? An array of dev_if ? netfilter support ?
af_packet support ? What about /proc files and netlink dumps ?

We already have network namespaces. Extend this if needed, instead of
bypassing them.

No need to add something else (with lack of proper reporting for various
tools)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net-next 0/16] Proposal for VRF-lite - v3
  2015-07-27 20:30 ` [net-next 0/16] Proposal for VRF-lite - v3 Eric W. Biederman
@ 2015-07-28 16:02   ` David Ahern
  2015-07-28 17:07     ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-28 16:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

On 7/27/15 2:30 PM, Eric W. Biederman wrote:
> This paragraph is false when it comes to sockets, as I have already
> pointed out.
>
> - VPN Routing and Forwarding (RFC4364 and it's kin) implies isolation
>    strong enough to allow using the the same ip on different machines
>    in different VPN instances and not have confusion.
>
> - The routing table is not the only table in the kernel that uses
>    an ip address as a key.
>
>    The result is that you can combine packets fragments that come in
>    on different interfaces (irrespective of your VPN), confuse tcp
>    parameters between interfaces, scramble your ipsec connections and I
>    don't know what else.

The duplicate IP address is a problem with the networking stack today; 
the VRF device does not introduce it. The VRF device does allow 
duplicate IP addresses within a namespace but separate VRFs, though yes 
various places that rely solely on source address like IP fragmentation 
do need to be fixed.

I looked at the IPv4 fragmentation code yesterday and will continue 
today. So help me with the history: is there any reason why the device 
index is not used today? It seems like a straight forward change.

1. simple netdevices with the same IP address
--> no problem using index in the lookup

2. 2 ipsec tunnels -- different netdevices, same IP address
--> no problem using index

3. stacked devices like bonding and team interfaces appear to the stack 
as a single device
--> no problem using index of stacked device

4. If an interface is deleted and a new one is created with the same IP 
address then we want to fail the lookup
--> no problem using index

5. other???

Is there a use case where I can't add ifindex of the incoming device (or 
higher level device if skb->dev is changed) to the hash and lookup for 
fragments?


>> Version 3
>> - addressed comments from first 2 RFCs with the exception of the name
>>    Nicolas: We will do the name conversion once we agree on what the
>>             correct name should be (vrf, mrf or something else)
>
> Not so.  I described the deep problems between your goals and your
> implementation and they are not even mentioned let alone addressed.

I have addressed comments to the extent that I can. As I stated in my 
last followup to you Eric I did not understand your point. I asked for 
clarification, a --verbose if you will. I can't read your mind, so I 
need you to elaborate on your points to be able to respond and address 
your concerns.

>
>> -  packets flow through the VRF device in both directions allowing the
>>     following:
>>     - tcpdump -i vrf<n>
>>     - tc rules on vrf device
>>     - netfilter rules on vrf device
>>
>> Ingo/Andy: I added you two as a start point for the proposed task related
>>             changes. Not sure who should be the reviewer; please let me know
>>             if someone else is more appropriate. Thanks.
>
> It looks like you are trying to implement a namespace that isn't a
> namespace.  Given that it is broken by design you have my nack.

This is an L3 separation within a namespace, not a device level 
separation which is what namespaces provide.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 16:01       ` Eric Dumazet
@ 2015-07-28 16:07         ` David Ahern
  2015-07-28 16:52           ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-28 16:07 UTC (permalink / raw)
  To: Eric Dumazet, Hannes Frederic Sowa
  Cc: Eric W. Biederman, netdev, shm, roopa, gospo, jtoppins, nikolay,
	ddutt, nicolas.dichtel, stephen, hadi, davem, svaidya, mingo,
	luto

On 7/28/15 10:01 AM, Eric Dumazet wrote:
> On Tue, 2015-07-28 at 14:19 +0200, Hannes Frederic Sowa wrote:
>> Hello Eric,
>>
>> On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:
>>> David Ahern <dsa@cumulusnetworks.com> writes:
>>>
>>>> Allow tasks to have a default device index for binding sockets. If
>>>> set
>>>> the value is passed to all AF_INET/AF_INET6 sockets when they are
>>>> created.
>>>>
>>>> The task setting is passed parent to child on fork, but can be set
>>>> or
>>>> changed after task creation using prctl (if task has CAP_NET_ADMIN
>>>> permissions). The setting for a socket can be retrieved using
>>>> prctl().
>>>> This option allows an administrator to restrict a task to only
>>>> send/receive
>>>> packets through the specified device. In the case of VRF devices
>>>> this
>>>> option restricts tasks to a specific VRF.
>>>>
>>>> Correlation of the device index to a specific VRF, ie.,
>>>>     ifindex --> VRF device --> VRF id
>>>> is left to userspace.
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> Because it is broken by design.  Your routing device is only safe for
>>> programs that know it's limitations it is not appropriate for general
>>> applications.
>>>
>>> Since you don't even seen to know it's limitations I think this is a
>>> bad path to walk down.
>>
>> Can you please elaborate about the broken by design?
>>
>> Different operating systems are already using this approach with good
>> success. I read your other mail regarding isolation of different VRFs
>> and I agree that all code which persists state depending solely on the
>> IP address is affected by this and this must be dealt with and fixed
>> (actually, there aren't too many).
>>
>> But I wouldn't call that broken by design. This stuff will get fixed
>> like e.g. cross-talk between fragmentation queues, icmp rate limiters
>> etc, which could already happen in the past.
>>
>> What is your opinion on the fundamental approach only from a user
>> perspective? Do you think that is broken, too?
>
> I agree with Eric here.
>
> This sk_bind_dev_if on task_struct is quite a hack.
>
> What will be added next ? An array of dev_if ? netfilter support ?
> af_packet support ? What about /proc files and netlink dumps ?

It could just as easily be a pointer to a struct (e.g., struct net_ctx) 
such that the intrusion to task_struct is simply 8 bytes -- very similar 
to the nsproxy used for the assorted namespaces. The struct can then 
contain whatever network config is imposed on the task.

>
> We already have network namespaces. Extend this if needed, instead of
> bypassing them.

Problems with using network namespaces for VRFs has been discussed in 
the past. e.g.,
     http://www.spinics.net/lists/netdev/msg298368.html

David

>
> No need to add something else (with lack of proper reporting for various
> tools)
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 15:25   ` Andy Lutomirski
@ 2015-07-28 16:11     ` David Ahern
  2015-07-28 17:12       ` Tom Herbert
  0 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-28 16:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: gospo, Eric W. Biederman, ddutt, Hannes Frederic Sowa,
	Ingo Molnar, Stephen Hemminger, Nicolas Dichtel, jtoppins,
	Network Development, David S. Miller, Shrijeet Mukherjee,
	svaidya, hadi, nikolay, roopa

On 7/28/15 9:25 AM, Andy Lutomirski wrote:
> On Jul 27, 2015 11:33 AM, "David Ahern" <dsa@cumulusnetworks.com> wrote:
>>
>> Allow tasks to have a default device index for binding sockets. If set
>> the value is passed to all AF_INET/AF_INET6 sockets when they are created.
>>
>
> This is not intended to be a review of the concept.  I haven't thought
> about whether the concept is a good idea, broken by design, or
> whatever.  FWIW, if this were added to the kernel and didn't require
> excessive privilege, I'd probably use it.  (I still don't really
> understand why binding to a device requires privilege in the first
> place, but, again, I haven't thought about it very much.)

The intent here is to restrict a task to only sending and receiving 
packets from a single network device. The device can be single ethernet 
interface, a stacked device (e.g, bond) or in our case a VRF device 
which restricts a task to interfaces (and hence network paths) 
associated with the VRF.

>
>> +#ifdef CONFIG_NET
>> +       case PR_SET_SK_BIND_DEV_IF:
>> +       {
>> +               struct net_device *dev;
>> +               int idx = (int) arg2;
>> +
>> +               if (!capable(CAP_NET_ADMIN))
>> +                       return -EPERM;
>> +
>
> Can you either use ns_capable or add a comment as to why not?

will do.

>
> Also, please return -EINVAL if unused args are nonzero.

ok.

>
>> +               if (idx) {
>> +                       dev = dev_get_by_index(me->nsproxy->net_ns, idx);
>> +                       if (!dev)
>> +                               return -EINVAL;
>> +                       dev_put(dev);
>> +               }
>> +               me->sk_bind_dev_if = idx;
>> +               break;
>> +       }
>> +       case PR_GET_SK_BIND_DEV_IF:
>> +       {
>> +               struct task_struct *tsk;
>> +               int sk_bind_dev_if = -EINVAL;
>> +
>> +               rcu_read_lock();
>> +               tsk = find_task_by_vpid(arg2);
>> +               if (tsk)
>> +                       sk_bind_dev_if = tsk->sk_bind_dev_if;
>
> Why do you support different tasks here?  Could this use proc instead?

In this case we want to allow a separate process to determine if a task 
is restricted to a device.

>
> The same -EINVAL issue applies.
>
> Also, I think you need to hook setns and unshare to do something
> reasonable when the task is bound to a device.

ack on both.

Thanks for the review,
David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 13/16] net: Introduce VRF device driver - v2
  2015-07-27 20:01   ` Nikolay Aleksandrov
@ 2015-07-28 16:22     ` David Ahern
  0 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-28 16:22 UTC (permalink / raw)
  To: Nikolay Aleksandrov, netdev
  Cc: shm, roopa, gospo, jtoppins, ddutt, hannes, nicolas.dichtel,
	stephen, hadi, ebiederm, davem, svaidya, mingo, luto

On 7/27/15 2:01 PM, Nikolay Aleksandrov wrote:
>> +
>> +	if (!vrf_is_master(dev) || vrf_is_master(port_dev) ||
>
> Hmm, this means that bonds won't be able to be VRF slaves.
> They have the IFF_MASTER flag set.

Right, will change to the IFF_VRF_MASTER flag.

>
>> +	    vrf_is_slave(port_dev))
>> +		return -EINVAL;
>> +
>> +	return do_vrf_add_slave(dev, port_dev);
>> +}
>> +
>> +/* inverse of do_vrf_add_slave */
>> +static int do_vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
>> +{
>> +	struct net_vrf *vrf = netdev_priv(dev);
>> +	struct slave_queue *queue = &vrf->queue;
>> +	struct net_vrf_dev *vrf_ptr = NULL;
>> +	struct slave *slave;
>> +
>> +	vrf_ptr = rcu_dereference(dev->vrf_ptr);
>> +	RCU_INIT_POINTER(dev->vrf_ptr, NULL);
>
> I think this isn't safe, you should wait for a grace period before freeing the
> pointer. Actually you can just move the kfree() below the netdev_rx_handler_unregister()
> since it does synchronize_rcu() anyway.

ok

And ack on all other comments..

Thanks for the review,
David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 16:07         ` David Ahern
@ 2015-07-28 16:52           ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-07-28 16:52 UTC (permalink / raw)
  To: David Ahern
  Cc: Hannes Frederic Sowa, Eric W. Biederman, netdev, shm, roopa,
	gospo, jtoppins, nikolay, ddutt, nicolas.dichtel, stephen, hadi,
	davem, svaidya, mingo, luto

On Tue, 2015-07-28 at 10:07 -0600, David Ahern wrote:

> Problems with using network namespaces for VRFs has been discussed in 
> the past. e.g.,
>      http://www.spinics.net/lists/netdev/msg298368.html

Great. Are you suggesting to get rid of network namespaces ?

If not, your proposal only increases bloat and maintenance burden.

If namespaces cant be fixed, they are the wrong design and we should
remove them.

If they can be fixed, they must be fixed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [net-next 0/16] Proposal for VRF-lite - v3
  2015-07-28 16:02   ` David Ahern
@ 2015-07-28 17:07     ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-28 17:07 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, davem, svaidya, mingo, luto

David Ahern <dsa@cumulusnetworks.com> writes:

> On 7/27/15 2:30 PM, Eric W. Biederman wrote:
>> This paragraph is false when it comes to sockets, as I have already
>> pointed out.
>>
>> - VPN Routing and Forwarding (RFC4364 and it's kin) implies isolation
>>    strong enough to allow using the the same ip on different machines
>>    in different VPN instances and not have confusion.
>>
>> - The routing table is not the only table in the kernel that uses
>>    an ip address as a key.
>>
>>    The result is that you can combine packets fragments that come in
>>    on different interfaces (irrespective of your VPN), confuse tcp
>>    parameters between interfaces, scramble your ipsec connections and I
>>    don't know what else.
>
> The duplicate IP address is a problem with the networking stack today; the VRF
> device does not introduce it. The VRF device does allow duplicate IP addresses
> within a namespace but separate VRFs, though yes various places that rely solely
> on source address like IP fragmentation do need to be fixed.

No. The same IP address being used by different machines is not a
problem with the IP stack today.  IP addresses are defined to be
globally unique.

At the point you introduce VPNs/VRFs you introduce duplicate IP
addresses and then the code needs to cope.

As such I think there is a deep mismatch between the semantics of
BINDTODEVICE and VRFs because BINDTODEVICE by definition does not worry
about duplicate IP addresses.

Which means that you can't just reuse the BINDTODEVICE infrastructure.
It is fundamentally insufficient to the task.  So as you are discovering
you have to invent something new.

That new thing needs a definition.  Maybe the new thing makes sense and
you can just slice off a chunk of a network namespace.  Maybe you go
through and change all of the code.

> I looked at the IPv4 fragmentation code yesterday and will continue today. So
> help me with the history: is there any reason why the device index is not used
> today? It seems like a straight forward change.

Sigh.  I would have hoped someone dealing with routing issues would have
seen this at a glance.  The reason is multi-path reception of fragments.

Adding the device index into the fragment reassembly logic would break
fragment reassembly when fragments of the same packet come into a
machine on different network devices.  Given that only the first
fragment has port numbers I can easily see network path selection code
hashing fragments onto different paths through the network.

> Is there a use case where I can't add ifindex of the incoming device (or higher
> level device if skb->dev is changed) to the hash and lookup for
> fragments?

As detailed above.  That breaks fragment reassembly on multiple paths.

>>> Version 3
>>> - addressed comments from first 2 RFCs with the exception of the name
>>>    Nicolas: We will do the name conversion once we agree on what the
>>>             correct name should be (vrf, mrf or something else)
>>
>> Not so.  I described the deep problems between your goals and your
>> implementation and they are not even mentioned let alone addressed.
>
> I have addressed comments to the extent that I can. As I stated in my last
> followup to you Eric I did not understand your point. I asked for clarification,
> a --verbose if you will. I can't read your mind, so I need you to elaborate on
> your points to be able to respond and address your concerns.

Hopefully this helps.

Everything we are talking about follows from what I said at the outset.
You are introducing the idea of having the same ip address refer to
different network destinations depending upon context.  Outside of
network namespaces that concept is new and it breaks a lot of
assumptions.

The entire network stack is too large to fit in my head.  I don't know
every place where ip addresses are used as part of the index into a
table.  It is beholden on the implementor of a new feature to figure out
how to introduce such a concept safely.  I don't see that happening with
VRF-lite.

Pretty fundamentally a network device index is insufficient for your needs.

>>> -  packets flow through the VRF device in both directions allowing the
>>>     following:
>>>     - tcpdump -i vrf<n>
>>>     - tc rules on vrf device
>>>     - netfilter rules on vrf device
>>>
>>> Ingo/Andy: I added you two as a start point for the proposed task related
>>>             changes. Not sure who should be the reviewer; please let me know
>>>             if someone else is more appropriate. Thanks.
>>
>> It looks like you are trying to implement a namespace that isn't a
>> namespace.  Given that it is broken by design you have my nack.
>
> This is an L3 separation within a namespace, not a device level separation which
> is what namespaces provide.

Not my meaning.  I was not talking about network namespaces and how your
vrf is almost but not completely the same as a network namespace.

What I was talking about is that you are implementing something that is
used roughly the same way as the other namespaces pid, mount, ipc, net,
uts, etc.  As the poor excuse for an overall maintainer of namespaces
I am one of, if not the person, you need to talk to, to add something to
the task struct.

Until you sort out your semantics and have something that doesn't look
like it will break users with every maintenance patch.  I am saying no.
You don't have something that is currently appropriate to add to task
struct.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
  2015-07-28 16:11     ` David Ahern
@ 2015-07-28 17:12       ` Tom Herbert
  0 siblings, 0 replies; 32+ messages in thread
From: Tom Herbert @ 2015-07-28 17:12 UTC (permalink / raw)
  To: David Ahern
  Cc: Andy Lutomirski, gospo, Eric W. Biederman, ddutt,
	Hannes Frederic Sowa, Ingo Molnar, Stephen Hemminger,
	Nicolas Dichtel, jtoppins, Network Development, David S. Miller,
	Shrijeet Mukherjee, svaidya, hadi, nikolay, Roopa Prabhu,
	Blake Matheny, Alex Gartrell

On Tue, Jul 28, 2015 at 9:11 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 7/28/15 9:25 AM, Andy Lutomirski wrote:
>>
>> On Jul 27, 2015 11:33 AM, "David Ahern" <dsa@cumulusnetworks.com> wrote:
>>>
>>>
>>> Allow tasks to have a default device index for binding sockets. If set
>>> the value is passed to all AF_INET/AF_INET6 sockets when they are
>>> created.
>>>
>>
>> This is not intended to be a review of the concept.  I haven't thought
>> about whether the concept is a good idea, broken by design, or
>> whatever.  FWIW, if this were added to the kernel and didn't require
>> excessive privilege, I'd probably use it.  (I still don't really
>> understand why binding to a device requires privilege in the first
>> place, but, again, I haven't thought about it very much.)
>
>
> The intent here is to restrict a task to only sending and receiving packets
> from a single network device. The device can be single ethernet interface, a
> stacked device (e.g, bond) or in our case a VRF device which restricts a
> task to interfaces (and hence network paths) associated with the VRF.
>
We are also intending to implement similar functionality for ILA to
restrict tasks (probably from cgroup) to binding to it's assigned
addresses. This seems most easily accomplished by adding a binding
interface which is only checked at bind time. After binding, the a
connection should be processed no differently than any others,
additional plumbing in the data path for network name spaces just
seems like overhead.

Tom

>>
>>> +#ifdef CONFIG_NET
>>> +       case PR_SET_SK_BIND_DEV_IF:
>>> +       {
>>> +               struct net_device *dev;
>>> +               int idx = (int) arg2;
>>> +
>>> +               if (!capable(CAP_NET_ADMIN))
>>> +                       return -EPERM;
>>> +
>>
>>
>> Can you either use ns_capable or add a comment as to why not?
>
>
> will do.
>
>>
>> Also, please return -EINVAL if unused args are nonzero.
>
>
> ok.
>
>>
>>> +               if (idx) {
>>> +                       dev = dev_get_by_index(me->nsproxy->net_ns, idx);
>>> +                       if (!dev)
>>> +                               return -EINVAL;
>>> +                       dev_put(dev);
>>> +               }
>>> +               me->sk_bind_dev_if = idx;
>>> +               break;
>>> +       }
>>> +       case PR_GET_SK_BIND_DEV_IF:
>>> +       {
>>> +               struct task_struct *tsk;
>>> +               int sk_bind_dev_if = -EINVAL;
>>> +
>>> +               rcu_read_lock();
>>> +               tsk = find_task_by_vpid(arg2);
>>> +               if (tsk)
>>> +                       sk_bind_dev_if = tsk->sk_bind_dev_if;
>>
>>
>> Why do you support different tasks here?  Could this use proc instead?
>
>
> In this case we want to allow a separate process to determine if a task is
> restricted to a device.
>
>>
>> The same -EINVAL issue applies.
>>
>> Also, I think you need to hook setns and unshare to do something
>> reasonable when the task is bound to a device.
>
>
> ack on both.
>
> Thanks for the review,
> David
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-07-28 17:13 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-27 18:30 [net-next 0/16] Proposal for VRF-lite - v3 David Ahern
2015-07-27 18:30 ` [PATCH net-next 01/16] net: Refactor rtable allocation and initialization David Ahern
2015-07-27 18:30 ` [PATCH net-next 02/16] net: export a few FIB functions David Ahern
2015-07-27 18:30 ` [PATCH net-next 03/16] net: Introduce VRF related flags and helpers David Ahern
2015-07-27 18:30 ` [PATCH net-next 04/16] net: Use VRF device index for lookups on RX David Ahern
2015-07-27 18:30 ` [PATCH net-next 05/16] net: Use VRF device index for lookups on TX David Ahern
2015-07-27 18:30 ` [PATCH net-next 06/16] net: Tx via VRF device David Ahern
2015-07-27 18:31 ` [PATCH net-next 07/16] net: Add inet_addr lookup by table David Ahern
2015-07-27 18:31 ` [PATCH net-next 08/16] net: Fix up inet_addr_type checks David Ahern
2015-07-27 18:31 ` [PATCH net-next 09/16] net: Add routes to the table associated with the device David Ahern
2015-07-27 18:31 ` [PATCH net-next 10/16] net: Use passed in table for nexthop lookups David Ahern
2015-07-27 18:31 ` [PATCH net-next 11/16] net: Use VRF device index for socket lookups David Ahern
2015-07-27 18:31 ` [PATCH net-next 12/16] net: Add ipv4 route helper to set next hop David Ahern
2015-07-27 18:31 ` [PATCH net-next 13/16] net: Introduce VRF device driver - v2 David Ahern
2015-07-27 20:01   ` Nikolay Aleksandrov
2015-07-28 16:22     ` David Ahern
2015-07-27 18:31 ` [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct David Ahern
2015-07-27 20:33   ` Eric W. Biederman
2015-07-28 12:19     ` Hannes Frederic Sowa
2015-07-28 13:54       ` Eric W. Biederman
2015-07-28 14:20         ` Hannes Frederic Sowa
2015-07-28 16:01       ` Eric Dumazet
2015-07-28 16:07         ` David Ahern
2015-07-28 16:52           ` Eric Dumazet
2015-07-28 15:25   ` Andy Lutomirski
2015-07-28 16:11     ` David Ahern
2015-07-28 17:12       ` Tom Herbert
2015-07-27 18:31 ` [PATCH net-next 15/16] net: Add chvrf command David Ahern
2015-07-27 18:31 ` [PATCH] iproute2: Add support for VRF device David Ahern
2015-07-27 20:30 ` [net-next 0/16] Proposal for VRF-lite - v3 Eric W. Biederman
2015-07-28 16:02   ` David Ahern
2015-07-28 17:07     ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).